<h1 align="center">SIG731 2023: Task 4P</h1>

---

**Report Title:** Task 4P: Working with pandas Data Frames (Heterogeneous Data)  

**Author:** Shubham Singh Sehrawat  

**Student Number:** 224207934

**Email Address:** shubham512sehrawat@gmail.com

---


# Introduction 🧧🧧

This Jupyter/IPython notebook analyzes hourly meteorological data for three New York area airports (LGA, JFK, and EWR) for the year 2013. The dataset, named nycflights13_weather.csv.gz, contains weather parameters such as temperature, dew point, humidity, wind speed, precipitation, pressure, and visibility. The objectives are to convert the data into metric units, compute daily and monthly mean wind speeds, identify the ten windiest days at LGA, and visualize the monthly mean wind speeds for all three airports.

# Abstract 📃📃

The dataset nycflights13_weather.csv.gz provides hourly meteorological data for three New York airports for the entire year of 2013. This notebook performs the following tasks:

1. Conversion of all columns to metric units.
2. Computation of daily mean wind speeds for LGA airport.
3. Visualization of daily mean wind speeds at LGA.
4. Identification of the ten windiest days at LGA.
5. Computation of monthly mean wind speeds for all three airports.
6. Visualization of monthly mean wind speeds for LGA, JFK, and EWR.

These analyses provide insights into the wind patterns at the selected airports throughout the year 2013.


## Importing Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go



---
## 1. Data Loading and Preprocessing 🎢📕📖📗📚
- We download the nycflights13_weather.csv.gz excerpt from our unit site.
- Using pd.read_csv, we read the data into a pandas data frame named df.
- We inspect basic information about df before analysing it further.
- The data has the following columns:
    * origin – weather station: LGA, JFK, or EWR,
    * year, month, day, hour – time of recording,
    * temp, dewp – temperature and dew point in degrees Fahrenheit,
    * humid – relative humidity,
    * wind_dir, wind_speed, wind_gust – wind direction (in degrees), speed and gust speed (in mph),
    * precip – precipitation, in inches,
    * pressure – sea level pressure in millibars,
    * visib – visibility in miles,
    * time_hour – date and hour
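
The loading step can be sketched on a tiny in-memory file. This is only an illustration: the toy contents and the assumption that the preamble lines start with `#` are ours, not something the dataset description confirms. It compares skipping a fixed number of lines with skipping by comment marker.

```python
import io
import pandas as pd

# Hypothetical miniature file: two preamble lines, then the header and data.
raw = (
    "# toy excerpt in the style of the weather file\n"
    "# temp in degrees F, wind_speed in mph\n"
    "origin,year,month,day,hour,temp,wind_speed\n"
    "LGA,2013,1,1,0,37.04,10.35702\n"
    "JFK,2013,1,1,0,39.02,12.65858\n"
)

# Option 1: skip a known number of leading lines (the approach used with skiprows).
by_count = pd.read_csv(io.StringIO(raw), skiprows=2)

# Option 2: ignore every line that starts with '#', however many there are.
# Assumption: '#' never appears inside the data itself.
by_marker = pd.read_csv(io.StringIO(raw), comment="#")

print(by_count.equals(by_marker))
```

Skipping by marker avoids counting preamble lines by hand, but it is only safe when the marker character cannot occur in the data.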

In [2]:
df = pd.read_csv(
    r"F:\anonymous\python\analysis\data\deakin\data_wrangling\tasks\datasets\weather.csv",
    skiprows=42,  # skip the first 42 lines; the header row is on line 43
    )
df.sample(5)  # Showing a random sample of five rows

Unnamed: 0,origin,year,month,day,hour,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib,time_hour
23528,LGA,2013,9,13,0,71.6,68.0,88.43,300.0,13.80936,15.891535,0.21,,2.5,2013-09-13 01:00:00
16809,JFK,2013,12,5,14,53.96,53.06,96.76,190.0,6.90468,7.945768,0.0,1019.4,0.06,2013-12-05 15:00:00
12422,JFK,2013,6,4,23,64.04,51.98,64.84,180.0,9.20624,10.594357,0.0,1017.2,10.0,2013-06-05 00:00:00
19420,LGA,2013,3,25,13,39.02,26.06,59.37,60.0,20.71404,23.837303,0.0,1002.4,10.0,2013-03-25 14:00:00
8540,EWR,2013,12,24,0,48.92,44.06,83.23,320.0,23.0156,26.485892,0.0,1018.5,10.0,2013-12-24 01:00:00


In [3]:
df.describe() # Data Stats

Unnamed: 0,year,month,day,hour,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib
count,26130.0,26130.0,26130.0,26130.0,26129.0,26129.0,26129.0,25712.0,26127.0,26127.0,26130.0,23400.0,26130.0
mean,2013.0,6.505741,15.679717,11.518408,55.203515,41.385399,62.347322,198.066661,10.395868,11.963357,0.002726,1017.895175,9.204828
std,0.0,3.440031,8.765022,6.916581,17.782124,19.371649,19.196078,107.841624,8.5212,9.806027,0.019665,7.42279,2.136306
min,2013.0,1.0,1.0,0.0,10.94,-9.94,12.74,0.0,0.0,0.0,0.0,983.8,0.0
25%,2013.0,4.0,8.0,6.0,39.92,26.06,46.99,120.0,6.90468,7.945768,0.0,1012.9,10.0
50%,2013.0,7.0,16.0,12.0,55.04,42.08,61.66,220.0,9.20624,10.594357,0.0,1017.6,10.0
75%,2013.0,9.0,23.0,18.0,69.98,57.92,78.62,290.0,13.80936,15.891535,0.0,1023.0,10.0
max,2013.0,12.0,31.0,23.0,100.04,78.08,100.0,360.0,1048.36058,1206.432388,1.18,1042.1,10.0


In [4]:
df.info() # Data Facts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26130 entries, 0 to 26129
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   origin      26130 non-null  object 
 1   year        26130 non-null  int64  
 2   month       26130 non-null  int64  
 3   day         26130 non-null  int64  
 4   hour        26130 non-null  int64  
 5   temp        26129 non-null  float64
 6   dewp        26129 non-null  float64
 7   humid       26129 non-null  float64
 8   wind_dir    25712 non-null  float64
 9   wind_speed  26127 non-null  float64
 10  wind_gust   26127 non-null  float64
 11  precip      26130 non-null  float64
 12  pressure    23400 non-null  float64
 13  visib       26130 non-null  float64
 14  time_hour   26130 non-null  object 
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB


In [5]:
# Calculating percentage of missing values in each column
missing_values = df.isnull().sum()

total_values = df.shape[0]

percentage_missing = (missing_values / total_values) * 100

percentage_missing



origin         0.000000
year           0.000000
month          0.000000
day            0.000000
hour           0.000000
temp           0.003827
dewp           0.003827
humid          0.003827
wind_dir       1.599694
wind_speed     0.011481
wind_gust      0.011481
precip         0.000000
pressure      10.447761
visib          0.000000
time_hour      0.000000
dtype: float64
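
As a side note, the same percentages can be obtained in one step, because the mean of a Boolean mask is the fraction of True values. A minimal sketch on a made-up frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: one of four temp readings is missing, visib is complete.
toy = pd.DataFrame(
    {
        "temp": [1.0, np.nan, 3.0, 4.0],
        "visib": [10.0, 10.0, 10.0, 10.0],
    }
)

# isna() gives True/False per cell; mean() turns that into a fraction per column.
pct_missing = toy.isna().mean() * 100
print(pct_missing)
```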

In [6]:
# Number of fully duplicated rows
df.duplicated().sum()

0

## 2. Converting columns to metric units.

* temp to Celsius
* dewp to Celsius
* precip to millimetres
* visib to metres
* wind_speed to metres per second
* wind_gust to metres per second

In [7]:
ds = df.copy() #Making a copy of our original dataframe
ds.sample(5)


Unnamed: 0,origin,year,month,day,hour,temp,dewp,humid,wind_dir,wind_speed,wind_gust,precip,pressure,visib,time_hour
15961,JFK,2013,10,30,22,55.04,50.0,83.08,190.0,8.05546,9.270062,0.0,1023.1,10.0,2013-10-30 23:00:00
23523,LGA,2013,9,12,19,82.04,68.0,62.53,180.0,10.35702,11.918651,0.0,1007.0,10.0,2013-09-12 20:00:00
21118,LGA,2013,6,4,8,60.98,41.0,47.7,330.0,10.35702,11.918651,0.0,1013.9,10.0,2013-06-04 09:00:00
18986,LGA,2013,3,7,11,37.04,24.98,61.35,20.0,17.2617,19.864419,0.0,1015.2,10.0,2013-03-07 12:00:00
1470,EWR,2013,3,3,10,30.02,17.06,58.08,320.0,9.20624,10.594357,0.0,1007.0,10.0,2013-03-03 11:00:00


In [8]:
# Fahrenheit -> Celsius
ds["temp"] = round((ds["temp"] - 32) / 1.8, 2)
ds["dewp"] = round((ds["dewp"] - 32) / 1.8, 2)
# inches -> millimetres
ds["precip"] = round(ds["precip"] * 25.4, 2)
# miles -> metres
ds["visib"] = round(ds["visib"] * 1609.34, 2)
# miles per hour -> metres per second
ds["wind_gust"] = round((ds["wind_gust"] * 1609.34) / 3600, 2)
ds["wind_speed"] = round((ds["wind_speed"] * 1609.34) / 3600, 2)

## 3. Making date_time column
- The time_hour column is reported to be (incorrectly) shifted by 1 hour due to a bug in the dataset, so it should not be relied on unless it is corrected manually.
- We therefore remove time_hour completely and build our own date_time column from year, month, day, and hour.
- We also add a full_day column that holds the calendar date of each record.
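
How the timestamp is assembled can be sketched on a toy frame (the rows are made up). As an alternative to a formatted string, this sketch keeps the day key as a datetime by using `dt.normalize()`, so it sorts chronologically and needs no conversion back later.

```python
import pandas as pd

# Hypothetical rows: two readings on 1 January and one on 31 December.
toy = pd.DataFrame(
    {
        "year": [2013, 2013, 2013],
        "month": [1, 1, 12],
        "day": [1, 1, 31],
        "hour": [0, 23, 12],
    }
)

# pd.to_datetime assembles timestamps from columns named year/month/day/hour.
toy["date_time"] = pd.to_datetime(toy[["year", "month", "day", "hour"]])

# normalize() resets the time part to midnight, giving one key per calendar day.
toy["full_day"] = toy["date_time"].dt.normalize()
print(toy)
```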
                                                                                 

In [9]:
ds.drop(columns="time_hour", inplace=True)  # removing the unreliable time_hour column
ds["date_time"] = pd.to_datetime(ds[["year", "month", "day", "hour"]])  # building date_time from its components
ds["full_day"] = ds["date_time"].dt.strftime("%m %d %Y")  # day label in "MM DD YYYY" form

## 4. Computation and Visualization of daily mean wind speeds for LGA airport.
- Here we use plotly instead of matplotlib so that the graphs are interactive for the user 🎢
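
The daily averaging can be sketched on a toy frame before applying it to the full data. The values below are hypothetical, and grouping directly on the normalized timestamp is one possible variant rather than the only way to do it.

```python
import pandas as pd

# Hypothetical readings: three at LGA over two days, one at JFK.
toy = pd.DataFrame(
    {
        "origin": ["LGA", "LGA", "LGA", "JFK"],
        "date_time": pd.to_datetime(
            [
                "2013-01-01 00:00",
                "2013-01-01 12:00",
                "2013-01-02 06:00",
                "2013-01-01 00:00",
            ]
        ),
        "wind_speed": [2.0, 4.0, 5.0, 9.0],
    }
)

# Keep one airport, then average wind_speed within each calendar day.
lga = toy[toy["origin"] == "LGA"]
daily = (
    lga.groupby(lga["date_time"].dt.normalize())["wind_speed"]
    .mean()  # missing readings are skipped by default
    .rename_axis("full_day")
    .reset_index()
)
print(daily)
```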

In [10]:
# Here we first filter the data for the "LGA" airport, then group by full_day and average wind_speed, which gives one mean value per recorded day

lga_mean_wind = ds[ds["origin"] == "LGA"].groupby("full_day")["wind_speed"].mean().to_frame().sort_values(by='full_day').reset_index()

In [11]:
# Converting the "full_day" strings ("MM DD YYYY") back to datetime
lga_mean_wind["full_day"] = pd.to_datetime(lga_mean_wind["full_day"], format="%m %d %Y")

# Defining xticks_labels
xticks_labels = lga_mean_wind["full_day"].dt.strftime("%b %Y")

fig = go.Figure()

# Adding trace for wind speed
fig.add_trace(
    go.Scatter(
        x=lga_mean_wind.index,
        y=lga_mean_wind["wind_speed"],
        mode="lines",
        name="Wind Speed",
        hovertemplate="<b>Date</b>: %{text}<br><b>Wind Speed</b>: %{y:.2f} m/s",
        text=lga_mean_wind["full_day"].dt.strftime("%Y-%m-%d"),
    )
)

# Calculating mean, min, and max values
mean_wind_speed = lga_mean_wind["wind_speed"].mean()
min_wind_speed = lga_mean_wind["wind_speed"].min()
max_wind_speed = lga_mean_wind["wind_speed"].max()

# Adding horizontal lines for mean, min, and max values
fig.add_trace(
    go.Scatter(
        x=lga_mean_wind.index,
        y=[mean_wind_speed] * len(lga_mean_wind),
        mode="lines",
        line=dict(color="red", dash="dash"),
        name=f"Mean ({mean_wind_speed:.2f} m/s)",
        hoverinfo="skip",
    )
)
fig.add_trace(
    go.Scatter(
        x=lga_mean_wind.index,
        y=[min_wind_speed] * len(lga_mean_wind),
        mode="lines",
        line=dict(color="green", dash="dot"),
        name=f"Min ({min_wind_speed:.2f} m/s)",
        hoverinfo="skip",
    )
)
fig.add_trace(
    go.Scatter(
        x=lga_mean_wind.index,
        y=[max_wind_speed] * len(lga_mean_wind),
        mode="lines",
        line=dict(color="blue", dash="dashdot"),
        name=f"Max ({max_wind_speed:.2f} m/s)",
        hoverinfo="skip",
    )
)

# Labelling every 31st day with its month and year (roughly one tick per month)
fig.update_xaxes(
    tickvals=lga_mean_wind.index[::31], ticktext=xticks_labels[::31], tickangle=45
)

# Updating layout
fig.update_layout(
    xaxis_title="Day",
    yaxis_title="Daily Avg. Wind-Speed [m/s] at LGA",
    legend=dict(x=1.1, y=1),  # Positioning legend outside the plot
    showlegend=True,
    yaxis=dict(gridcolor="lightgrey"),  # Adding light grey gridlines
    margin=dict(l=50, r=50, t=50, b=50),  # Adjusting margin for better visibility
)

fig.show()

## 5. Identifying the ten windiest days

In [12]:
top_ten_lga_windiest = lga_mean_wind.sort_values('wind_speed',ascending=False).head(10).reset_index(drop=True)
top_ten_lga_windiest

| | full_day | wind_speed |
|---|---|---|
| 0 | 2013-11-24 | 11.31875 |
| 1 | 2013-01-31 | 10.7175 |
| 2 | 2013-02-17 | 10.01 |
| 3 | 2013-02-21 | 9.192609 |
| 4 | 2013-02-18 | 9.17375 |
| 5 | 2013-03-14 | 9.109167 |
| 6 | 2013-11-28 | 8.938333 |
| 7 | 2013-05-26 | 8.85375 |
| 8 | 2013-05-25 | 8.766667 |
| 9 | 2013-02-20 | 8.660833 |
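
A top-n selection like this can also be written with `nlargest`, which avoids sorting the whole frame. A minimal sketch on made-up daily means (not the LGA results):

```python
import pandas as pd

# Hypothetical daily mean wind speeds in m/s.
toy = pd.DataFrame(
    {
        "full_day": pd.to_datetime(
            ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]
        ),
        "wind_speed": [3.2, 7.5, 1.1, 6.4],
    }
)

# nlargest returns the rows with the n highest values, already in descending order.
top2 = toy.nlargest(2, "wind_speed").reset_index(drop=True)
print(top2)
```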


## 6. Computing the monthly mean wind speeds for all three airports.
- First, we find the outlier in wind_speed and replace it with a missing value (np.nan)
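
One point worth illustrating is the scope of the replacement. A whole-frame `DataFrame.replace` swaps the value wherever it occurs, whereas a `.loc` assignment touches only the chosen column. The sketch below uses a made-up frame and a hypothetical plausibility bound of 100 m/s, which is our assumption rather than a value from the task.

```python
import numpy as np
import pandas as pd

# Toy frame; the same number is deliberately repeated in another column.
toy = pd.DataFrame(
    {
        "wind_speed": [3.09, 468.66, 4.12],
        "wind_gust": [3.55, 539.32, 4.74],
        "pressure": [1012.9, 468.66, 1017.6],
    }
)

# Hypothetical upper bound for a believable sustained wind speed, in m/s.
MAX_PLAUSIBLE_MS = 100.0

# Targeted fix: only wind_speed entries above the bound become missing.
cleaned = toy.copy()
cleaned.loc[cleaned["wind_speed"] > MAX_PLAUSIBLE_MS, "wind_speed"] = np.nan

# For contrast: a whole-frame replace also blanks the equal value in pressure.
replaced_everywhere = toy.replace(468.66, np.nan)

print(cleaned)
print(replaced_everywhere)
```

In this sketch wind_gust still holds its own extreme value, so each column that is analysed needs its own check.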

In [13]:
ds["wind_speed"].describe()  # The maximum, 468.66 m/s, is an obvious outlier

count    26127.000000
mean         4.647448
std          3.809286
min          0.000000
25%          3.090000
50%          4.120000
75%          6.170000
max        468.660000
Name: wind_speed, dtype: float64

In [14]:
ds.loc[ds["wind_speed"].idxmax(), "wind_speed"] = np.nan  # setting only the outlying wind_speed entry to missing
ds["wind_speed"].describe()

count    26126.000000
mean         4.629688
std          2.503889
min          0.000000
25%          3.090000
50%          4.120000
75%          6.170000
max         19.030000
Name: wind_speed, dtype: float64

In [15]:
# Now we will group the data by origin and month
monthly_mean_wind_speed = (
    ds.groupby(["origin", "month"])["wind_speed"].mean().reset_index()
)
monthly_mean_wind_speed

# Converting long to wide format

wide_monthly_mean_wind_speed = monthly_mean_wind_speed.pivot_table(
    "wind_speed", columns="month", index="origin"
)
wide_monthly_mean_wind_speed.reset_index(inplace=True)

wide_monthly_mean_wind_speed.rename(            # For readability
    columns={
        1: "Jan",
        2: "Feb",
        3: "Mar",
        4: "Apr",
        5: "May",
        6: "Jun",
        7: "Jul",
        8: "Aug",
        9: "Sep",
        10: "Oct",
        11: "Nov",
        12: "Dec",
        "origin": "Airport",
    },
    inplace=True,
)
wide_monthly_mean_wind_speed = wide_monthly_mean_wind_speed.set_index(
    "Airport"
).transpose()
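
The same long-to-wide reshaping can be sketched more compactly with `unstack` and the standard library's month abbreviations. This is an alternative shown on made-up values, not a replacement for the cell above.

```python
import calendar
import pandas as pd

# Hypothetical wind speeds for two months at three airports.
toy = pd.DataFrame(
    {
        "origin": ["EWR", "EWR", "JFK", "JFK", "LGA", "LGA"],
        "month": [1, 2, 1, 2, 1, 2],
        "wind_speed": [4.0, 5.0, 6.0, 7.0, 5.0, 6.0],
    }
)

wide = (
    toy.groupby(["month", "origin"])["wind_speed"]
    .mean()
    .unstack("origin")  # airports become columns, months stay as rows
    # month_abbr follows the active locale; English names are assumed here.
    .rename(index=lambda m: calendar.month_abbr[int(m)])
    .rename_axis(index="Month", columns="Airport")
)
print(wide)
```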

In [16]:
fig = go.Figure()

for column in wide_monthly_mean_wind_speed.columns:
    fig.add_trace(
        go.Scatter(
            x=wide_monthly_mean_wind_speed.index,
            y=wide_monthly_mean_wind_speed[column],
            mode="markers+lines",
            name=column,
            hovertemplate="<b>Airport</b>: "
            + column
            + "<br>"
            + "<b>Month</b>: %{x}<br>"
            + "<b>Wind Speed</b>: %{y:.2f} m/s",
        )
    )

# Updating layout
fig.update_layout(
    title="Monthly Average Wind Speed for Different Airports (2013)",
    xaxis_title="Months",
    yaxis_title="Monthly Average Wind Speed [m/s]",
    hovermode="x",  # Showing hover labels for all traces at the hovered x position
    legend=dict(x=1.1, y=1),  # Positioning legend outside the plot
    showlegend=True,
    margin=dict(l=50, r=50, t=50, b=50),  # Adjusting margin for better visibility
)

fig.show()

## 7. Observation 🧐🧐

- March is the windiest month at all three airports.
- August is the least windy month at all three airports.
- On average, the airports rank from windiest to least windy as follows:
    * JFK
    * LGA
    * EWR
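
Observations of this kind can be read off a months-by-airports table programmatically. The sketch below uses made-up numbers, chosen only to demonstrate the calls, so its output says nothing about the real data.

```python
import pandas as pd

# Hypothetical monthly means in m/s (rows: months, columns: airports).
wide = pd.DataFrame(
    {
        "EWR": [4.1, 4.9, 3.2],
        "JFK": [5.3, 6.0, 4.4],
        "LGA": [4.8, 5.6, 3.9],
    },
    index=["Jan", "Mar", "Aug"],
)

windiest_month = wide.idxmax()  # month label of the maximum, per airport
calmest_month = wide.idxmin()  # month label of the minimum, per airport

# Average of the monthly means per airport, highest first.
airport_ranking = wide.mean().sort_values(ascending=False)

print(windiest_month, calmest_month, airport_ranking, sep="\n\n")
```

Note that averaging monthly means weights every month equally, which can differ slightly from the mean over all hourly readings.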

# References 📚

1. Python Official Website: [Python.org](https://www.python.org/)
2. Numpy Official Website: [Numpy.org](https://numpy.org/)
3. Pandas Documentation: [pandas.pydata.org](https://pandas.pydata.org/docs/)
4. Matplotlib Official Website: [Matplotlib.org](https://matplotlib.org/)
5. Plotly Python Graphing Library Documentation: [plotly.com/python](https://plotly.com/python/)
6. ChatGPT Information: [OpenAI](https://openai.com/)
7. Gagolewski, M. (2024). *Minimalist Data Wrangling with Python*. Melbourne. DOI: [10.5281/zenodo.6451068](https://doi.org/10.5281/zenodo.6451068). ISBN: 978-0-6455719-1-2. URL: [Data Wrangling with Python](https://datawranglingpy.gagolewski.com/)
8. Gagolewski, M. [Minimalist Data Wrangling with Python](https://github.com/gagolews). GitHub Repository.
