# Future weather forecasting project

##### This project forecasts future weather. Although it looks like an easy task, it is not: even an approximate prediction of the weather counts as an achievement!

The data is collected by scraping the [Wunderground](https://www.wunderground.com/) website, using the Selenium framework with a Chrome browser.

New York City (LaGuardia Airport Station) was chosen as the data source because the United States is one of the leading countries in meteorology.

In [226]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.api import VAR
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from statsmodels.tools.eval_measures import rmse

In [227]:
df = pd.read_csv('weather.csv')

In [228]:
df.head()

Unnamed: 0.1,Unnamed: 0,Date,Humidity_Avg,Wind_Type,Pressure_Avg,Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length
0,0,2009-01-01 00:00:00,57 %,NW,30.24 in,Fair /,21.79,36.1,0.0,5.7,4.33,29.0,30.26,9h 18m
1,1,2009-01-02 00:00:00,75 %,SW,29.74 in,Cloudy,31.23,35.9,0.0,5.7,18.37,18.0,30.18,9h 19m
2,2,2009-01-03 00:00:00,46 %,WNW,29.95 in,Fair,34.33,35.7,0.0,5.7,17.29,25.0,30.05,9h 20m
3,3,2009-01-04 00:00:00,53 %,WNW,30.01 in,Fair,34.04,35.5,0.0,5.7,13.13,23.0,30.09,9h 21m
4,4,2009-01-05 00:00:00,53 %,NW,29.90 in,Cloudy,41.89,35.4,0.0,5.7,22.33,18.0,29.98,9h 22m


In [229]:
df.tail()

Unnamed: 0.1,Unnamed: 0,Date,Humidity_Avg,Wind_Type,Pressure_Avg,Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length
5447,1047,2023-12-26 00:00:00,89 %,ENE,30.32 in,Cloudy,44.76,37.4,0.0,5.7,41.65,7.0,30.35,9h 15m
5448,1048,2023-12-27 00:00:00,93 %,NE,30.13 in,Light Rain,44.53,37.2,0.0,5.7,42.08,20.0,30.17,9h 15m
5449,1049,2023-12-28 00:00:00,93 %,NE,29.67 in,Light Drizzle,49.8,37.0,1.31,5.7,47.43,29.0,29.77,9h 16m
5450,1050,2023-12-29 00:00:00,93 %,NE,29.65 in,Mostly Cloudy,48.74,36.7,0.09,5.7,43.45,17.0,29.66,9h 16m
5451,1051,2023-12-30 00:00:00,58 %,W,29.59 in,Mostly Cloudy,43.6,36.5,0.0,5.7,30.36,20.0,29.8,9h 16m


<ul>
<li>The feature "Unnamed: 0" should be removed.</li>
<li>Some features contain units of measurement. To work with these features more easily, the units must be removed.</li>
<li>For an unclear reason, the "Condition" value at index 0 contains a trailing slash ("Fair /").</li>
</ul>

In [230]:
df.shape

(5452, 14)

In [231]:
df.columns

Index(['Unnamed: 0', 'Date', 'Humidity_Avg', 'Wind_Type', 'Pressure_Avg',
       'Condition', 'Temperature_Avg', 'Temperature_Historic',
       'Precipitation_Actual', 'Precipitation_Historic', 'Dew_Point',
       'Max_Wind_Speed', 'Sea_Level_Pressure', 'Day_Length'],
      dtype='object')

The feature names do not need to be modified.

In [232]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5452 entries, 0 to 5451
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              5452 non-null   int64  
 1   Date                    5452 non-null   object 
 2   Humidity_Avg            5378 non-null   object 
 3   Wind_Type               5378 non-null   object 
 4   Pressure_Avg            5378 non-null   object 
 5   Condition               5378 non-null   object 
 6   Temperature_Avg         5361 non-null   float64
 7   Temperature_Historic    5391 non-null   float64
 8   Precipitation_Actual    5398 non-null   float64
 9   Precipitation_Historic  5399 non-null   float64
 10  Dew_Point               5399 non-null   float64
 11  Max_Wind_Speed          5399 non-null   float64
 12  Sea_Level_Pressure      5399 non-null   float64
 13  Day_Length              5394 non-null   object 
dtypes: float64(7), int64(1), object(6)
memor

<ul>
  <li>It is noted that there are some missing values.</li>
  <li>Some features need their data types converted: Date, Pressure_Avg, Humidity_Avg, and Day_Length.</li>
</ul>
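As a minimal sketch of the conversion described above, the leading number of each string can be extracted and turned into a float. The `sample` frame and the `strip_unit` helper are hypothetical names used only for illustration; the values mimic the raw columns shown earlier.

```python
import pandas as pd

# Hypothetical sample that mimics the raw string columns (not the real file).
sample = pd.DataFrame({
    'Humidity_Avg': ['57 %', '75 %', None],
    'Pressure_Avg': ['30.24 in', '29.74 in', None],
})

def strip_unit(series):
    """Keep the leading number of each string and convert it to float."""
    number = series.str.extract(r'([-+]?\d*\.?\d+)', expand=False)
    # Missing or unparsable entries become NaN instead of raising.
    return pd.to_numeric(number, errors='coerce')

cleaned = sample.apply(strip_unit)
print(cleaned.dtypes)
```

Unlike a chain of `str.replace(...).astype(float)`, this variant tolerates missing values, so it could run before the rows with missing data are dropped.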


In [233]:
# A description for numeric datatype 
df.describe()

Unnamed: 0.1,Unnamed: 0,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
count,5452.0,5361.0,5391.0,5398.0,5399.0,5399.0,5399.0,5399.0
mean,1116.553375,56.854587,56.910722,0.085773,4.743638,41.861221,18.085201,30.09799
std,944.522066,17.029709,15.801785,0.304303,0.758811,18.306307,5.629324,0.211808
min,0.0,9.17,33.8,0.0,2.9,-12.13,7.0,29.3
25%,340.0,43.05,41.6,0.0,4.5,26.865,14.0,29.95
50%,773.0,57.13,57.1,0.0,4.8,43.14,17.0,30.09
75%,1857.25,72.13,72.45,0.0,5.5,57.88,21.0,30.24
max,3220.0,93.83,79.6,6.86,5.7,75.83,56.0,30.82


## Feature Selection

The "Unnamed: 0" feature is only a leftover row counter, so it adds no useful information.

In [235]:
df.drop('Unnamed: 0',axis=1,inplace=True)

## Analysing missing values

In [236]:
df.isna().sum()

Date                       0
Humidity_Avg              74
Wind_Type                 74
Pressure_Avg              74
Condition                 74
Temperature_Avg           91
Temperature_Historic      61
Precipitation_Actual      54
Precipitation_Historic    53
Dew_Point                 53
Max_Wind_Speed            53
Sea_Level_Pressure        53
Day_Length                58
dtype: int64

In [237]:
df.isnull().sum()/len(df) 

Date                      0.000000
Humidity_Avg              0.013573
Wind_Type                 0.013573
Pressure_Avg              0.013573
Condition                 0.013573
Temperature_Avg           0.016691
Temperature_Historic      0.011189
Precipitation_Actual      0.009905
Precipitation_Historic    0.009721
Dew_Point                 0.009721
Max_Wind_Speed            0.009721
Sea_Level_Pressure        0.009721
Day_Length                0.010638
dtype: float64

Each column is missing less than 2% of its values, which is small for a dataset of this size, so the affected rows are simply dropped.
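The per-column fractions understate what a row-wise drop costs, because a row is removed if any of its columns is missing. A small sketch on a hypothetical `toy` frame shows the difference; the last line uses the row counts reported in this notebook (5452 rows before, 5331 after).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: each column misses one value, but in different rows.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1.0, 2.0, np.nan, 4.0],
})

per_column = toy.isna().mean()               # fraction missing per column
rows_lost = toy.isna().any(axis=1).mean()    # fraction of rows dropna() removes

# Row counts reported in this notebook before and after dropna().
notebook_rows_lost = (5452 - 5331) / 5452
print(per_column.max(), rows_lost, round(notebook_rows_lost, 4))
```

For this dataset the row-wise loss is about 2.2%, still small enough that dropping is a reasonable choice.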

In [238]:
df.dropna(inplace=True)

In [239]:
# check
df.isna().sum()

Date                      0
Humidity_Avg              0
Wind_Type                 0
Pressure_Avg              0
Condition                 0
Temperature_Avg           0
Temperature_Historic      0
Precipitation_Actual      0
Precipitation_Historic    0
Dew_Point                 0
Max_Wind_Speed            0
Sea_Level_Pressure        0
Day_Length                0
dtype: int64

## Sorting the Dataset by date 

In [240]:
df = df.sort_values(by='Date').reset_index(drop=True)
df.head()

Unnamed: 0,Date,Humidity_Avg,Wind_Type,Pressure_Avg,Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length
0,2009-01-01 00:00:00,57 %,NW,30.24 in,Fair /,21.79,36.1,0.0,5.7,4.33,29.0,30.26,9h 18m
1,2009-01-02 00:00:00,75 %,SW,29.74 in,Cloudy,31.23,35.9,0.0,5.7,18.37,18.0,30.18,9h 19m
2,2009-01-03 00:00:00,46 %,WNW,29.95 in,Fair,34.33,35.7,0.0,5.7,17.29,25.0,30.05,9h 20m
3,2009-01-04 00:00:00,53 %,WNW,30.01 in,Fair,34.04,35.5,0.0,5.7,13.13,23.0,30.09,9h 21m
4,2009-01-05 00:00:00,53 %,NW,29.90 in,Cloudy,41.89,35.4,0.0,5.7,22.33,18.0,29.98,9h 22m
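Since the original row counter does not run continuously, the file may have been assembled from several scraping runs. A quick sanity check after sorting, sketched here on hypothetical dates, is to look for duplicated and missing calendar days.

```python
import pandas as pd

# Hypothetical date strings: one duplicate (Jan 2) and one gap (Jan 4).
dates = pd.Series(['2009-01-05 00:00:00', '2009-01-01 00:00:00',
                   '2009-01-02 00:00:00', '2009-01-02 00:00:00',
                   '2009-01-03 00:00:00'])

parsed = pd.to_datetime(dates).sort_values().reset_index(drop=True)

is_sorted = parsed.is_monotonic_increasing
n_duplicates = int(parsed.duplicated().sum())

# Calendar days between the first and last date that never appear.
full_range = pd.date_range(parsed.min(), parsed.max(), freq='D')
n_missing_days = len(full_range.difference(pd.DatetimeIndex(parsed)))
print(is_sorted, n_duplicates, n_missing_days)
```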


## Parsing Date

In [241]:
df['Date'] = pd.to_datetime(df['Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5331 entries, 0 to 5330
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    5331 non-null   datetime64[ns]
 1   Humidity_Avg            5331 non-null   object        
 2   Wind_Type               5331 non-null   object        
 3   Pressure_Avg            5331 non-null   object        
 4   Condition               5331 non-null   object        
 5   Temperature_Avg         5331 non-null   float64       
 6   Temperature_Historic    5331 non-null   float64       
 7   Precipitation_Actual    5331 non-null   float64       
 8   Precipitation_Historic  5331 non-null   float64       
 9   Dew_Point               5331 non-null   float64       
 10  Max_Wind_Speed          5331 non-null   float64       
 11  Sea_Level_Pressure      5331 non-null   float64       
 12  Day_Length              5331 non-null   object  

Extract the year, month, day of the week, and day of the year for later analysis.

In [242]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day_of_weak'] = df['Date'].dt.dayofweek 
df['DayOfYear'] = df['Date'].dt.dayofyear

In [243]:
# check
df.head()

Unnamed: 0,Date,Humidity_Avg,Wind_Type,Pressure_Avg,Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length,Year,Month,Day_of_weak,DayOfYear
0,2009-01-01,57 %,NW,30.24 in,Fair /,21.79,36.1,0.0,5.7,4.33,29.0,30.26,9h 18m,2009,1,3,1
1,2009-01-02,75 %,SW,29.74 in,Cloudy,31.23,35.9,0.0,5.7,18.37,18.0,30.18,9h 19m,2009,1,4,2
2,2009-01-03,46 %,WNW,29.95 in,Fair,34.33,35.7,0.0,5.7,17.29,25.0,30.05,9h 20m,2009,1,5,3
3,2009-01-04,53 %,WNW,30.01 in,Fair,34.04,35.5,0.0,5.7,13.13,23.0,30.09,9h 21m,2009,1,6,4
4,2009-01-05,53 %,NW,29.90 in,Cloudy,41.89,35.4,0.0,5.7,22.33,18.0,29.98,9h 22m,2009,1,0,5
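To read the day-of-week numbers correctly, it helps to recall that pandas counts Monday as 0 and Sunday as 6. The short check below uses the first date of the dataset.

```python
import pandas as pd

first_day = pd.Timestamp('2009-01-01')

# dayofweek: Monday=0 ... Sunday=6; day_name() gives the readable label.
print(first_day.dayofweek, first_day.day_name(), first_day.dayofyear)
```

So the value 3 in the first row stands for Thursday.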


## Cleaning Inconsistent Data

To make the data consistent, I will remove the units of measurement from 'Humidity_Avg' and 'Pressure_Avg' and convert them to numeric values.

In [244]:
# Remove % from 'Humidity_Avg' and convert to numeric
df['Humidity_Avg'] = df['Humidity_Avg'].str.replace('%', '').astype(float)

# Remove 'in' from 'Pressure_Avg' and convert to numeric
df['Pressure_Avg'] = df['Pressure_Avg'].str.replace(' in', '').astype(float)

Remove any special characters from 'Condition'.

In [245]:
# check
df['Condition'].value_counts()

Condition
Mostly Cloudy    1782
Cloudy           1376
Fair             1256
Partly Cloudy     482
Light Rain        238
Light Snow         54
Fog                41
Fair /             37
Cloudy /           32
Light Drizzle      12
Wintry Mix          5
Haze                3
Smoke               3
Snow /              2
Heavy Rain          2
Rain                2
Rain and            1
Snow                1
Rain /              1
Mist                1
Name: count, dtype: int64

There also seems to be an incomplete value (Rain and), but it occurs only once, so keeping it does little harm.

In [246]:
df['Condition'] = df['Condition'].str.replace(r'[^\w\s]','', regex=True).str.strip()
df.head()

Unnamed: 0,Date,Humidity_Avg,Wind_Type,Pressure_Avg,Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length,Year,Month,Day_of_weak,DayOfYear
0,2009-01-01,57.0,NW,30.24,Fair,21.79,36.1,0.0,5.7,4.33,29.0,30.26,9h 18m,2009,1,3,1
1,2009-01-02,75.0,SW,29.74,Cloudy,31.23,35.9,0.0,5.7,18.37,18.0,30.18,9h 19m,2009,1,4,2
2,2009-01-03,46.0,WNW,29.95,Fair,34.33,35.7,0.0,5.7,17.29,25.0,30.05,9h 20m,2009,1,5,3
3,2009-01-04,53.0,WNW,30.01,Fair,34.04,35.5,0.0,5.7,13.13,23.0,30.09,9h 21m,2009,1,6,4
4,2009-01-05,53.0,NW,29.9,Cloudy,41.89,35.4,0.0,5.7,22.33,18.0,29.98,9h 22m,2009,1,0,5
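A small sketch of what the pattern does, run on a few of the labels listed above. The second step, which drops a dangling "and", is only an assumption about how the truncated label could be handled; the notebook itself keeps "Rain and" as it is.

```python
import pandas as pd

raw = pd.Series(['Fair /', 'Cloudy /', 'Rain and', 'Mostly Cloudy', 'Snow /'])

# Drop everything that is neither a word character nor whitespace, then trim.
cleaned = raw.str.replace(r'[^\w\s]', '', regex=True).str.strip()

# Optional (assumption): also drop a trailing "and" left by a truncated label.
without_dangling_and = cleaned.str.replace(r'\s+and$', '', regex=True)
print(cleaned.tolist())
print(without_dangling_and.tolist())
```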


I will do the same for Day_Length, converting it to minutes.

In [247]:
def convert_to_minutes(day_length):
    """ 
    Converts a time duration from hours and minutes to total minutes.

    Args:
        day_length (str): The time duration in the format '<H>h <M>m', e.g. '9h 18m'.

    Returns:
        float: The time duration converted to total minutes.
    """
    hours, minutes = day_length.split()
    total_minutes = float(hours[:-1]) * 60 + float(minutes[:-1])
    return total_minutes


In [248]:
df['Day_Length'] = df['Day_Length'].apply(convert_to_minutes)
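The same conversion can also be done without a row-by-row `apply`. The following is a sketch of a vectorized alternative using a regular expression with named groups; the `day_length` series is a hypothetical sample in the same '9h 18m' format.

```python
import pandas as pd

day_length = pd.Series(['9h 18m', '9h 19m', '15h 2m'])

# Named groups become the column names of the extracted DataFrame.
parts = day_length.str.extract(r'(?P<hours>\d+)h\s*(?P<minutes>\d+)m').astype(float)
total_minutes = parts['hours'] * 60 + parts['minutes']
print(total_minutes.tolist())
```

Strings that do not match the pattern produce NaN rather than an exception, which can make malformed entries easier to spot.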

Add units of measurement to the column names

In [249]:
df.rename({'Humidity_Avg': 'Humidity_Avg(%)','Pressure_Avg':'Pressure_Avg(in)', 'Day_Length': 'Day_Length(Minutes)'}, axis='columns',inplace=True)

In [250]:
# check
df.head()

Unnamed: 0,Date,Humidity_Avg(%),Wind_Type,Pressure_Avg(in),Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),Year,Month,Day_of_weak,DayOfYear
0,2009-01-01,57.0,NW,30.24,Fair,21.79,36.1,0.0,5.7,4.33,29.0,30.26,558.0,2009,1,3,1
1,2009-01-02,75.0,SW,29.74,Cloudy,31.23,35.9,0.0,5.7,18.37,18.0,30.18,559.0,2009,1,4,2
2,2009-01-03,46.0,WNW,29.95,Fair,34.33,35.7,0.0,5.7,17.29,25.0,30.05,560.0,2009,1,5,3
3,2009-01-04,53.0,WNW,30.01,Fair,34.04,35.5,0.0,5.7,13.13,23.0,30.09,561.0,2009,1,6,4
4,2009-01-05,53.0,NW,29.9,Cloudy,41.89,35.4,0.0,5.7,22.33,18.0,29.98,562.0,2009,1,0,5


## EDA

#### Q1 : How does the weather change during the months?

In [251]:
# Get only the numeric features
numeric_columns = df.select_dtypes(include=np.number).columns

In [252]:
group_by_month = df.groupby(df['Month'])[numeric_columns[:9]].mean()
group_by_month

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,62.646532,30.0183,34.422953,34.411409,0.077964,5.605817,20.291969,19.61745,30.158568
2,63.683575,30.006739,36.704082,36.312802,0.066159,5.557729,21.954807,19.934783,30.16471
3,60.462555,29.886608,42.978965,43.032379,0.068414,5.134141,25.935771,20.160793,30.151211
4,61.909931,29.947621,53.113464,53.657506,0.086051,4.549885,35.228568,19.755196,30.067275
5,66.872807,29.991754,63.163772,63.771711,0.09682,4.610526,47.823443,17.52193,30.07386
6,67.851259,29.905103,72.301259,73.422654,0.071098,4.988101,56.970618,17.01373,29.977048
7,66.90989,29.867978,79.117912,79.159121,0.102835,4.957582,63.444571,16.613187,29.988308
8,67.805252,29.965098,77.478271,77.768053,0.088359,4.173523,62.52709,16.210066,30.024902
9,69.569476,30.028519,70.932711,70.792938,0.100615,3.249203,57.398542,16.202733,30.103462
10,68.958425,29.991882,60.214683,59.652516,0.09186,3.625164,46.76709,17.301969,30.093786


In [253]:
fig = px.line(group_by_month)
fig.update_layout(title='Weather Over Months',
                  xaxis_title='Month',
                  yaxis_title='Weather properties')
fig.show()

- Among the months shown in the table above, humidity is highest in September (about 69.6%) and lowest in March (about 60.5%).
- The highest pressure is in November and the lowest in July.
- The highest temperature is in July and the lowest in January.
- The highest precipitation is in July and the lowest in February.
- The highest dew point is in July and the lowest in January.
- The highest wind speed is in March and the lowest in September.
- The highest sea level pressure is in November and the lowest in June.
- The values between "Temperature_Avg" and "Temperature_Historic" are very close.
- The values of "Precipitation_Actual" and "Precipitation_Historic" are far apart.
This may be due to an error in calculating "Precipitation_Historic", or because the weather is very volatile and deviates from the historical values.
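Reading extremes off a line chart is error-prone, so a sketch like the following can locate them directly with `idxmax` and `idxmin`. The values are the monthly means of two columns from the table above, rounded to two decimals and limited to the ten months shown there.

```python
import pandas as pd

# Monthly means copied (rounded) from the table above, months 1 to 10 only.
monthly = pd.DataFrame(
    {
        'Humidity_Avg(%)': [62.65, 63.68, 60.46, 61.91, 66.87,
                            67.85, 66.91, 67.81, 69.57, 68.96],
        'Max_Wind_Speed': [19.62, 19.93, 20.16, 19.76, 17.52,
                           17.01, 16.61, 16.21, 16.20, 17.30],
    },
    index=pd.Index(range(1, 11), name='Month'),
)

# For every column: the month holding the highest and the lowest mean.
extremes = pd.DataFrame({'highest': monthly.idxmax(), 'lowest': monthly.idxmin()})
print(extremes)
```

Applied to the full `group_by_month` frame, the same two calls would cover all nine properties and all twelve months at once.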

#### Q2 : How does the weather change during the years?

In [254]:
group_by_years = df.groupby(df['Year'])[numeric_columns[:9]].mean()
group_by_years

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2009,66.665746,29.990249,54.336575,56.801381,0.0,4.746133,40.08779,18.232044,30.097459
2010,62.632597,29.90105,57.485387,56.870994,0.0,4.743923,41.010525,18.726519,29.996823
2011,67.085635,29.968674,56.215028,56.772376,0.0,4.751105,41.704309,17.483425,30.076326
2012,66.582633,29.976415,58.361653,57.077031,0.0,4.731933,43.362101,17.77591,30.071401
2013,66.169811,30.021981,55.64217,56.840566,0.0,4.749057,40.343428,17.814465,30.118585
2014,63.972299,29.995014,54.362659,57.014681,0.115706,4.741551,39.394044,18.00831,30.104377
2015,64.558333,30.030722,56.286528,56.8675,0.099167,4.746111,41.425528,17.65,30.139333
2016,64.754144,29.913674,58.17326,56.716298,0.102735,4.751105,42.82605,18.464088,30.106492
2017,67.924157,29.984382,57.657865,57.149719,0.111236,4.733427,43.648904,18.227528,30.09764
2018,68.745665,30.029884,57.044509,57.399133,0.177197,4.722254,43.718324,18.312139,30.143092


In [255]:
fig = px.line(group_by_years)
fig.update_layout(title='Weather Over Years',
                  xaxis_title='Year',
                  yaxis_title='Weather properties')
fig.show()

- The lowest average dew point, and one of the lowest average temperatures, was recorded in 2014, which may be related to the severe cold wave that hit North America early that year. There is also a noticeable difference between "Temperature_Avg" and "Temperature_Historic", particularly in 2014.
<br>
- From 2009 to 2013, "Precipitation_Actual" is a constant 0.0, which more likely reflects missing records than five years without precipitation. From 2014 onward it takes non-zero values, but a distinct gap between "Precipitation_Actual" and "Precipitation_Historic" remains throughout these years.
<br>

- It is worth noting that wind speeds were comparatively low from 2011 until 2013, after which they began to increase. This trend may be related to climate factors, although the data here is not enough to confirm that.
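A yearly mean of exactly zero is worth checking, because it usually means that every single daily value is zero. The sketch below, on a hypothetical `toy` frame, flags such years so they can be treated as suspect rather than as genuinely dry.

```python
import pandas as pd

# Hypothetical toy data: 2012 and 2013 contain only zeros, 2014 does not.
toy = pd.DataFrame({
    'Year': [2012, 2012, 2013, 2013, 2014, 2014],
    'Precipitation_Actual': [0.0, 0.0, 0.0, 0.0, 0.0, 0.55],
})

# True for a year only if every daily value in that year equals zero.
all_zero = toy['Precipitation_Actual'].eq(0).groupby(toy['Year']).all()
suspect_years = all_zero[all_zero].index.tolist()
print(suspect_years)
```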

#### Q3: How is the number of daylight hours distributed per year? 

In [256]:
day_length_over_year = df.groupby(df['Month'])['Day_Length(Minutes)'].mean()
day_length_over_year

Month
1     577.507830
2     638.920290
3     716.852423
4     798.355658
5     867.359649
6     902.210526
7     884.962637
8     824.969365
9     747.047836
10    666.280088
11    594.904651
12    557.834071
Name: Day_Length(Minutes), dtype: float64

In [257]:
fig = px.line(day_length_over_year)
fig.update_layout(title='Day length Over Year',
                  xaxis_title='Month',
                  yaxis_title='Day length (minutes)')
fig.show()

The longest days are in June and the shortest are in December.

#### Q4 : How does the weather change during the week?

In [258]:
group_by_weak = df.groupby(df['Day_of_weak'])[numeric_columns[:9]].mean()
group_by_weak

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
Day_of_weak,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,65.547244,29.95315,56.857218,56.914829,0.091811,4.740551,41.790669,18.203412,30.102612
1,65.918182,29.994013,57.057805,56.893636,0.088506,4.748182,42.342039,17.872727,30.093117
2,66.242464,29.985924,57.13751,56.935911,0.071337,4.743119,42.393539,17.950197,30.091678
3,66.270806,29.987926,56.992048,56.815852,0.088877,4.748745,42.297543,18.202114,30.086975
4,65.431937,29.982395,56.834974,56.97788,0.093312,4.740707,41.869935,18.352094,30.095366
5,65.41457,29.998159,56.391192,56.921325,0.087934,4.738013,41.115868,17.968212,30.105881
6,64.228947,29.928434,56.761632,56.996053,0.075553,4.745789,41.323303,18.052632,30.107632


I will replace the numbers with the names of days to make the plot easier to understand

In [259]:
day_dict = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

# Replace the day numbers with day names in the index of the grouped DataFrame
group_by_weak.index = group_by_weak.index.map(day_dict)
group_by_weak

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
Day_of_weak,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Monday,65.547244,29.95315,56.857218,56.914829,0.091811,4.740551,41.790669,18.203412,30.102612
Tuesday,65.918182,29.994013,57.057805,56.893636,0.088506,4.748182,42.342039,17.872727,30.093117
Wednesday,66.242464,29.985924,57.13751,56.935911,0.071337,4.743119,42.393539,17.950197,30.091678
Thursday,66.270806,29.987926,56.992048,56.815852,0.088877,4.748745,42.297543,18.202114,30.086975
Friday,65.431937,29.982395,56.834974,56.97788,0.093312,4.740707,41.869935,18.352094,30.095366
Saturday,65.41457,29.998159,56.391192,56.921325,0.087934,4.738013,41.115868,17.968212,30.105881
Sunday,64.228947,29.928434,56.761632,56.996053,0.075553,4.745789,41.323303,18.052632,30.107632


In [260]:
fig = px.line(group_by_weak)
fig.update_layout(title='Weather Over Week',
                  xaxis_title='Day of week',
                  yaxis_title='Weather properties')
fig.show()

- **Humidity**: The average humidity seems to be fairly consistent throughout the week, ranging from around 64.23% to 66.27%. The highest average humidity is observed on Thursday, while the lowest is on Sunday.

- **Pressure**: The average pressure also remains relatively stable throughout the week, with minor fluctuations. The highest average pressure is observed on Saturday, while the lowest is on Sunday.

- **Temperature**: The average temperature varies slightly throughout the week. The highest average temperature is observed on Wednesday, while the lowest is on Saturday.

- **Precipitation**: The actual average precipitation is fairly consistent throughout the week, with minor variations. The highest average precipitation is observed on Friday, while the lowest is on Wednesday.

- **Dew Point**: The dew point varies slightly throughout the week. The highest average dew point is observed on Wednesday, while the lowest is on Saturday.

- **Wind Speed**: The maximum wind speed fluctuates throughout the week. The highest average maximum wind speed is observed on Friday, while the lowest is on Tuesday.

- **Sea Level Pressure**: The sea level pressure is relatively stable throughout the week, with minor fluctuations. The highest average sea level pressure is observed on Sunday, while the lowest is on Thursday.

-  The weather in New York City does show some variability throughout the week, but the changes are small. Such variation is normal and expected, since weather is inherently variable due to the complex interactions of atmospheric conditions.
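To put "small" into numbers, the spread of the weekday means can be compared with the overall day-to-day variability. This sketch uses the `Temperature_Avg` weekday means from the table above and the standard deviation reported by `df.describe()` earlier (which was computed before the rows with missing values were dropped).

```python
# Weekday means of Temperature_Avg, copied from the table above.
weekday_mean_temp = {
    'Monday': 56.857218, 'Tuesday': 57.057805, 'Wednesday': 57.13751,
    'Thursday': 56.992048, 'Friday': 56.834974, 'Saturday': 56.391192,
    'Sunday': 56.761632,
}

# Standard deviation of Temperature_Avg as reported by df.describe().
overall_std = 17.029709

spread = max(weekday_mean_temp.values()) - min(weekday_mean_temp.values())
ratio = spread / overall_std
print(round(spread, 2), round(ratio, 3))
```

The gap between the warmest and coolest weekday is under 1°F, a few percent of the overall standard deviation, which supports treating the weekday pattern as noise.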


#### Q5: When does the warm season start and end in New York?

Here the warm season is taken to be the days with an average temperature (`Temperature_Avg`) above 76°F.

In [261]:
def get_season_plot(df,seasonal_condition_df):
    """ 
    Plots the average daily temperature over time and overlays the season periods.

    Args:
        df (pd.DataFrame): The DataFrame containing the weather data. It should have columns 'DayOfYear' and 'Temperature_Avg'.
        seasonal_condition_df (pd.DataFrame): The DataFrame containing the seasonal condition data. It should have columns 'DayOfYear', 'Temperature_Avg', and 'Month'.

    Returns:
        None. The function displays a Plotly figure.
    """
    average_daily_temps = df.groupby('DayOfYear')['Temperature_Avg'].mean().reset_index()
    average_daily_condtion_temps = seasonal_condition_df.groupby('DayOfYear')['Temperature_Avg'].mean().reset_index()
    average_daily_condtion_temps['Month'] = seasonal_condition_df.groupby('DayOfYear')['Month'].first().values

    fig = go.Figure()

    # Plot temperature over time grouped by day of the year
    fig.add_trace(go.Scatter(x=average_daily_temps['DayOfYear'],
                            y=average_daily_temps['Temperature_Avg'],
                            mode='lines',
                            name='Temperature'))

    # layer over season periods
    for index in range(len(average_daily_condtion_temps) - 1):
        # Weather is volatile, so only shade the span between two neighbouring entries
        # when both fall in the same month; this skips values that occur in isolation
        if average_daily_condtion_temps.loc[index, 'Month'] == average_daily_condtion_temps.loc[index + 1, 'Month']:
            fig.add_vrect(x0=average_daily_condtion_temps.loc[index, 'DayOfYear'], 
                        x1=average_daily_condtion_temps.loc[index + 1, 'DayOfYear'], 
                        fillcolor="Red", opacity=0.2, line_width=0)

    fig.update_layout(title='Season Over Time',
                    xaxis_title='Day of year', yaxis_title='Temperature (F)')

    fig.show()    

In [262]:
warm_season_data = df[df['Temperature_Avg'] > 76]
get_season_plot(df,warm_season_data)

The warm season typically begins between April and June and ends between September and October. 
This matches the general temperature pattern in many regions: temperatures rise from about April into summer, then begin to decrease again with the onset of fall and winter.
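The start and end can also be read off numerically instead of from the shaded plot. The sketch below is one possible approach, not the method used in the plot: it takes the mean temperature per day of year and returns the first and last day above the threshold. `season_bounds` and the `toy` series are hypothetical names and values.

```python
import pandas as pd

def season_bounds(daily_mean, threshold):
    """Return (first, last) day of year whose mean temperature exceeds threshold."""
    warm = daily_mean[daily_mean > threshold]
    if warm.empty:
        return None
    return int(warm.index.min()), int(warm.index.max())

# Hypothetical daily means indexed by day of year.
toy = pd.Series(
    [70, 75, 78, 80, 79, 77, 74],
    index=pd.Index([150, 160, 170, 180, 190, 200, 210], name='DayOfYear'),
)

bounds = season_bounds(toy, 76)
print(bounds)
```

With the notebook data, `daily_mean` would be `df.groupby('DayOfYear')['Temperature_Avg'].mean()`. Because it averages over all years first, this gives a narrower season than filtering individual days.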

#### Q6: When does the cold season start and end in New York?

Here the cold season is taken to be the days with an average temperature (`Temperature_Avg`) below 48°F.

In [263]:
cold_season_data = df[df['Temperature_Avg'] < 48]
get_season_plot(df,cold_season_data)

Cold days fall between January and April and again between October and December, so the cold season typically begins in late fall (October to December) and ends in early spring (January to April of the following year). 
This pattern aligns with general temperature trends in many regions, where temperatures decrease toward December and stay low into the following year. The temperature then begins to rise again with the onset of spring and summer.

#### Q7: What dates have rare temperatures?
The temperature in NY is rarely below 14°F or above 92°F.

In [264]:
rarely_temp_above_92 = df[df['Temperature_Avg'] > 92]
rarely_temp_above_92

Unnamed: 0,Date,Humidity_Avg(%),Wind_Type,Pressure_Avg(in),Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),Year,Month,Day_of_weak,DayOfYear
547,2010-07-06,37.0,CALM,29.94,Fair,92.17,78.6,0.0,5.1,65.46,12.0,29.99,899.0,2010,7,1,187
926,2011-07-22,65.0,WNW,29.78,Fair,93.83,79.6,0.0,4.9,69.87,14.0,29.82,878.0,2011,7,4,203
1613,2013-07-19,59.0,SW,29.92,Partly Cloudy,92.26,79.6,0.0,4.9,71.48,18.0,29.94,882.0,2013,7,4,200


All days above 92°F were in July, which is consistent with the earlier observation that July is the hottest month.

In [265]:
rarely_temp_below_14 = df[df['Temperature_Avg'] < 14]
rarely_temp_below_14

Unnamed: 0,Date,Humidity_Avg(%),Wind_Type,Pressure_Avg(in),Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),Year,Month,Day_of_weak,DayOfYear
1767,2014-01-07,42.0,W,29.7,Fair,9.17,35.0,0.0,5.7,-11.38,30.0,30.36,564.0,2014,1,1,7
2168,2015-02-16,55.0,NW,30.1,Cloudy,13.13,36.4,0.0,5.6,-7.39,23.0,30.19,642.0,2015,2,0,47
2172,2015-02-20,41.0,WNW,30.02,Fair,11.61,37.2,0.0,5.6,-7.96,25.0,30.47,652.0,2015,2,4,51
2526,2016-02-14,52.0,NNW,30.52,Fair,9.83,36.1,0.0,5.6,-10.65,24.0,30.53,636.0,2016,2,6,45
3200,2018-01-01,61.0,NW,30.34,Fair,13.96,36.1,0.0,5.7,-0.83,22.0,30.37,558.0,2018,1,0,1
3205,2018-01-06,44.0,WNW,30.19,Fair,11.0,35.2,0.0,5.7,-6.13,32.0,30.41,562.0,2018,1,5,6
3565,2019-01-21,59.0,WNW,30.07,Mostly Cloudy,11.43,33.8,0.2,5.6,-2.35,37.0,30.36,584.0,2019,1,0,21
3575,2019-01-31,32.0,W,30.41,Fair,10.3,34.1,0.0,5.5,-12.13,20.0,30.41,604.0,2019,1,3,31
4967,2022-12-24,41.0,W,29.79,Fair,11.65,37.9,0.05,5.7,-7.26,29.0,29.89,554.0,2022,12,5,358


It appears that temperatures below 14°F occurred in New York City between 2014 and 2022, primarily in January and February, with one occurrence in December 2022.

In [266]:
len_natural_temp = len(df)-(len(rarely_temp_above_92)+ len(rarely_temp_below_14))
labels = ['Above 92 F','Below 14 F','Natural']
values = [len(rarely_temp_above_92), len(rarely_temp_below_14),len_natural_temp]

In [267]:
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title='Above 92 F VS Below 14 F VS Natural')
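The two rare slices are almost invisible in a pie chart, so it can help to print the shares as numbers. This sketch uses the counts from the two tables above (3 days above 92°F, 9 days below 14°F) and the 5331 rows left after cleaning.

```python
# Counts taken from the tables above and from df.info() after cleaning.
counts = {'Above 92 F': 3, 'Below 14 F': 9}
total_days = 5331

# Every remaining day falls in the normal range.
counts['Natural'] = total_days - sum(counts.values())

# Share of each category in percent.
shares = {label: round(100 * n / total_days, 2) for label, n in counts.items()}
print(shares)
```

Together, the two extremes make up well under 1% of all days, which is what "rarely" means here.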

#### Q8: On what dates might weather conditions have led to school closures or vacations?

The conditions considered are heavy rain, temperatures below -37°F, strong winds of about 40 mph or more, and poor air quality (smoke, haze).

In [268]:
weather_conditions = ['Heavy Rain', 'Haze', 'Smoke']
weather_related_vacation = df[(df['Temperature_Avg'] < -37) |
                                    (df['Max_Wind_Speed'] >= 40) |
                                    (df['Condition'].isin(weather_conditions))]

weather_related_vacation

Unnamed: 0,Date,Humidity_Avg(%),Wind_Type,Pressure_Avg(in),Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),Year,Month,Day_of_weak,DayOfYear
42,2009-02-12,43.0,W,29.32,Cloudy,50.71,35.7,0.0,5.5,32.38,43.0,29.59,633.0,2009,2,3,43
228,2009-08-18,51.0,WSW,29.95,Fair,82.41,77.7,0.0,4.1,65.66,43.0,30.09,820.0,2009,8,1,230
1385,2012-10-29,88.0,NNE,29.64,Cloudy,59.03,54.9,0.0,4.3,54.07,56.0,29.61,632.0,2012,10,0,303
1470,2013-01-31,94.0,W,29.36,Cloudy,48.0,34.1,0.0,5.5,32.5,40.0,29.75,605.0,2013,1,3,31
1876,2014-04-30,93.0,ENE,30.33,Heavy Rain,45.67,58.9,0.55,4.5,42.25,21.0,30.38,834.0,2014,4,2,120
1898,2014-05-22,80.0,SE,29.83,Haze,61.43,65.5,0.27,4.7,55.7,12.0,29.86,879.0,2014,5,3,142
2491,2016-01-10,93.0,ENE,29.81,Heavy Rain,52.28,34.6,0.63,5.7,47.83,26.0,29.99,566.0,2016,1,6,10
2887,2017-02-13,54.0,NW,29.58,Mostly Cloudy,37.57,35.9,0.41,5.6,22.65,41.0,29.96,635.0,2017,2,0,44
3321,2018-05-15,93.0,S,29.87,Cloudy,68.02,63.4,0.0,4.6,60.66,45.0,29.97,867.0,2018,5,1,135
4114,2020-08-04,82.0,S,29.96,Cloudy,78.08,79.1,0.53,4.6,69.35,53.0,30.03,852.0,2020,8,1,217


In [269]:
print(f"So {len(weather_related_vacation)} days had weather conditions that may have led to a vacation")

So 20 days had weather conditions that may have led to a vacation
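The combined filter does not show which criterion flagged each day. A sketch of one way to keep that information is to build one boolean column per criterion. The first, second, and fourth rows of `toy` reuse values from the table above; the 'Fair' row is made up for contrast.

```python
import pandas as pd

toy = pd.DataFrame({
    'Condition': ['Cloudy', 'Heavy Rain', 'Fair', 'Haze'],
    'Temperature_Avg': [59.03, 45.67, 70.0, 61.43],
    'Max_Wind_Speed': [56.0, 21.0, 10.0, 12.0],
})

# One boolean column per criterion, using the same thresholds as the filter.
criteria = pd.DataFrame({
    'extreme_cold': toy['Temperature_Avg'] < -37,
    'strong_wind': toy['Max_Wind_Speed'] >= 40,
    'bad_condition': toy['Condition'].isin(['Heavy Rain', 'Haze', 'Smoke']),
})

flagged = toy[criteria.any(axis=1)]   # same rows the combined filter selects
reasons = criteria.sum()              # how many days each criterion caught
print(reasons)
```

A criterion whose count is zero, as the temperature threshold is likely to be for New York, contributes nothing to the result.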


#### Q9: How many days per month are suitable for exploiting solar energy?

- **Sunlight**: Solar panels ideally require a minimum of five hours of direct sunlight daily.

- **Temperature**: Solar panels generally work best at a moderate temperature, around 25°C (77°F).

- **Absence of Extreme Weather**: Extreme weather conditions like heavy rain, snow, or high winds can reduce the efficiency of solar panels and potentially cause damage.


In [270]:
weather_conditions = ['Heavy Rain', 'Wintry Mix', 'Snow','Light Drizzle']

In [271]:
# Filter data to find suitable days for solar energy exploitation based on conditions
solar_energy_days = df[(df['Day_Length(Minutes)'] >= 300) &
                        (df['Temperature_Avg'] >= 77) &
                        (~df['Condition'].isin(weather_conditions))]

# Group the filtered data by 'Year' and 'Month', and count the number of suitable days for each month
solar_each_Month = solar_energy_days.groupby(['Year', 'Month']).size().reset_index(name='Number Of Suitable Days')

# Group the result by 'Month' and sum the number of suitable days for each month across all years
solar_each_Month = solar_each_Month.groupby('Month')['Number Of Suitable Days'].sum().reset_index()
solar_each_Month

Unnamed: 0,Month,Number Of Suitable Days
0,4,4
1,5,26
2,6,101
3,7,292
4,8,235
5,9,63
6,10,1


In [272]:
# There are some months that do not contain days that meet solar energy standards
# But I want to add these months to the plot
months = pd.DataFrame({'Month': range(1, 13)})
all_months = pd.merge(months, solar_each_Month, on='Month', how='left').fillna(0)
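As an alternative sketch, the missing months can be filled with `reindex` instead of a merge. The counts are the ones from the table above; a side benefit is that they stay integers, whereas `fillna(0)` after a left merge turns them into floats.

```python
import pandas as pd

# Suitable days per month, copied from the table above.
suitable = pd.Series({4: 4, 5: 26, 6: 101, 7: 292, 8: 235, 9: 63, 10: 1},
                     name='Number Of Suitable Days')

# Add the months without any suitable day, with a count of zero.
all_months_alt = suitable.reindex(range(1, 13), fill_value=0).rename_axis('Month')
print(all_months_alt)
```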

In [273]:
fig = px.bar(all_months, x='Month', y='Number Of Suitable Days', 
             title='Number of Suitable Days for Exploiting Solar Energy by Month')
fig.show()

It turns out that the warm season identified earlier offers the best conditions for exploiting solar energy. This is expected, partly by construction: the filter requires an average temperature of at least 77°F, so only warm days can qualify. July, the hottest month, has the highest number of suitable days (292), followed by August (235). Temperature therefore dominates this estimate of solar potential, and weather patterns are worth taking into account when planning the use of renewable energy.

#### Q10: What is the general weather condition in New York?

In [274]:
# mode() returns every tied value; take the first so each month maps to a single condition
most_common_condition_by_month = df.groupby('Month')['Condition'].agg(lambda x: x.mode().iloc[0])
most_common_condition_by_month

Month
1            Cloudy
2            Cloudy
3            Cloudy
4     Mostly Cloudy
5     Mostly Cloudy
6     Mostly Cloudy
7     Mostly Cloudy
8     Mostly Cloudy
9     Mostly Cloudy
10    Mostly Cloudy
11             Fair
12           Cloudy
Name: Condition, dtype: object

In [275]:
value_counts = most_common_condition_by_month.value_counts()

fig = px.bar(value_counts, x=value_counts.index, y=value_counts.values, labels={'x':'Weather Condition', 'y':'Frequency'}, color=value_counts.values, color_continuous_scale='Blues')
fig.update_layout(title_text='Most Common Weather Condition by Month')
fig.show()




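The most common weather condition in New York is Mostly Cloudy, which is the most frequent condition in seven of the twelve months (April through October). <br>
November appears to be the clearest month: it is the only one where Fair is the most frequent condition.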

#### Q11: What is the month with the most wet days?

A wet day is one with at least 0.04 inches of liquid precipitation.

In [276]:
df_copy = df.copy()
df_copy['Wet_Day'] = df['Precipitation_Actual'] >= 0.04

In [277]:
wet_days = df_copy.groupby('Month')['Wet_Day'].sum()

max_wet_days_month = wet_days.idxmax()
print(f"The month with the most wet days is {max_wet_days_month}.")


The month with the most wet days is 12.


#### Q12: What is the month with the fewest wet days?

In [278]:
min_wet_days_month = wet_days.idxmin()
print(f"The month with the fewest wet days is {min_wet_days_month}.")

The month with the fewest wet days is 11.
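
Months differ in length, so raw counts slightly favor the 31-day months. A minimal sketch on toy data (hypothetical values, not from the dataset) shows how the mean of the boolean `Wet_Day` column gives the share of wet days instead of the count:

```python
import pandas as pd

# Toy daily data (hypothetical): a 30-day and a 31-day month with 9 wet days each
toy = pd.DataFrame({
    'Month': [11] * 30 + [12] * 31,
    'Precipitation_Actual': [0.1] * 9 + [0.0] * 21 + [0.1] * 9 + [0.0] * 22,
})
toy['Wet_Day'] = toy['Precipitation_Actual'] >= 0.04

wet_counts = toy.groupby('Month')['Wet_Day'].sum()   # number of wet days per month
wet_share = toy.groupby('Month')['Wet_Day'].mean()   # fraction of days that are wet

print(wet_counts)
print(wet_share.round(3))
```

Here both months have the same count, but the shorter month has the higher share. Applying `.mean()` to `df_copy.groupby('Month')['Wet_Day']` would give the equivalent view for the real data.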


#### Q13: Does New York experience more rainy or snowy days? 

In [279]:
rainy_days = df[df['Condition'].str.contains('Rain')]
snowy_days = df[df['Condition'].str.contains('Snow')]

In [280]:
if len(rainy_days) > len(snowy_days):
    print('NY has more rainy days than snowy days.')
else:
    print('NY has more snowy days than rainy days.')


NY has more rainy days than snowy days.


In [281]:
labels = ['Rain','Snow']
values = [len(rainy_days), len(snowy_days)]

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title='Rain Days VS Snow Days')

fig.show()


#### Q14: Is there a correlation between the day length and temperature?

In [282]:
fig1 = px.scatter(df, x='Day_Length(Minutes)', y='Temperature_Avg')
fig1.show()

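The scatter plot shows a clear positive relationship: longer days tend to be warmer. The correlation matrix in Q16 puts the coefficient at about 0.74.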
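
To put a number on a relationship like this, the Pearson correlation coefficient can be computed directly. The sketch below uses made-up values purely for illustration; in the notebook the equivalent call would be `df['Day_Length(Minutes)'].corr(df['Temperature_Avg'])`.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from weather.csv)
day_length_min = np.array([560, 620, 700, 780, 860, 900])
temperature_f = np.array([33.0, 38.0, 49.0, 60.0, 70.0, 76.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(day_length_min, temperature_f)[0, 1]
print(f"Pearson r = {r:.3f}")
```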

#### Q15: Are there any specific days of the week where certain weather conditions are more likely?

In [283]:
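df_condition = df.groupby(['Day_of_weak', 'Condition']).size().reset_index(name='Counts')
# Replace the day codes with day names (assumes day_dict maps the values of Day_of_weak to names)
df_condition['Day_of_weak'] = df_condition['Day_of_weak'].map(day_dict)
fig = px.bar(df_condition, x='Day_of_weak', y='Counts', color='Condition', barmode='group')
fig.show()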





- The weather condition that occurs most frequently in New York is **Mostly Cloudy**, especially on **Sunday** with **323 occurrences**.

- **Cloudy** and **Fair** conditions also occur quite frequently throughout the week.

- **Fog**, **Light Drizzle**, **Light Rain**, and **Light Snow** conditions occur less frequently, but still have a noticeable presence, especially on certain days of the week.

- Conditions like **Heavy Rain**, **Rain and**, **Smoke**, and **Wintry Mix** are relatively rare.

- **Rain** is very rare: it appears in the dataset only on **Monday** and **Sunday**, once on each.
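
Raw counts per weekday are easier to compare once they are turned into shares within each day. A minimal sketch with `pd.crosstab` on toy rows (hypothetical data, reusing the notebook's column names):

```python
import pandas as pd

# Toy rows (hypothetical): Monday has twice as many observations as Sunday
toy = pd.DataFrame({
    'Day_of_weak': ['Mon', 'Mon', 'Mon', 'Mon', 'Sun', 'Sun'],
    'Condition':   ['Fair', 'Fair', 'Cloudy', 'Cloudy', 'Fair', 'Cloudy'],
})

# normalize='index' divides each row by its total, so every weekday sums to 1
shares = pd.crosstab(toy['Day_of_weak'], toy['Condition'], normalize='index')
print(shares)
```

The counts differ between the two days, but the shares are identical, which is the comparison that matters when asking whether a condition is more likely on a given day.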


#### Q16: How do weather properties interact with one another?

In [284]:
corr = df[numeric_columns[:10]].corr()
corr

Unnamed: 0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes)
Humidity_Avg(%),1.0,-0.118016,0.143744,0.111938,0.215499,-0.097096,0.446326,-0.007941,-0.216279,0.055055
Pressure_Avg(in),-0.118016,1.0,-0.04464,-0.029677,-0.098341,-0.018088,-0.07173,-0.117755,0.291991,-0.050919
Temperature_Avg,0.143744,-0.04464,1.0,0.911861,0.00852,-0.550913,0.924765,-0.254526,-0.349898,0.744666
Temperature_Historic,0.111938,-0.029677,0.911861,1.0,0.017827,-0.597343,0.842881,-0.232976,-0.288147,0.828998
Precipitation_Actual,0.215499,-0.098341,0.00852,0.017827,1.0,-0.010623,0.07898,0.165337,-0.166463,0.007052
Precipitation_Historic,-0.097096,-0.018088,-0.550913,-0.597343,-0.010623,1.0,-0.522892,0.173906,0.100025,-0.322254
Dew_Point,0.446326,-0.07173,0.924765,0.842881,0.07898,-0.522892,1.0,-0.237273,-0.382446,0.655774
Max_Wind_Speed,-0.007941,-0.117755,-0.254526,-0.232976,0.165337,0.173906,-0.237273,1.0,-0.178149,-0.140198
Sea_Level_Pressure,-0.216279,0.291991,-0.349898,-0.288147,-0.166463,0.100025,-0.382446,-0.178149,1.0,-0.30412
Day_Length(Minutes),0.055055,-0.050919,0.744666,0.828998,0.007052,-0.322254,0.655774,-0.140198,-0.30412,1.0


In [285]:
fig = px.imshow(corr)
fig.show()

1. **Temperature_Avg** and **Dew_Point** have a very strong positive correlation of 0.924765. 

2. **Temperature_Avg** and **Temperature_Historic** also have a strong positive correlation of 0.911861. This indicates that the daily average temperature closely tracks the historical average for the same date.

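3. **Temperature_Avg** and **Precipitation_Historic** have a moderate negative correlation of -0.550913. By contrast, **Temperature_Avg** and **Precipitation_Actual** are essentially uncorrelated (0.00852).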

4. **Sea_Level_Pressure** and **Temperature_Avg** have a negative correlation of -0.349898. 

5. **Day_Length(Minutes)** and **Temperature_Avg** have a strong positive correlation of 0.744666. 

6. **Max_Wind_Speed** does not have a strong correlation with any of the other variables. Its largest correlation in magnitude is a weak negative one with **Temperature_Avg** (-0.254526); its positive correlations with **Precipitation_Historic** (0.173906) and **Precipitation_Actual** (0.165337) are weaker still.

7. **Humidity_Avg(%)** and **Dew_Point** have a moderate positive correlation of 0.446326. This suggests that when the average humidity increases, the dew point also tends to increase.

8. **Precipitation_Actual** and **Humidity_Avg(%)** have a weak positive correlation of 0.215499. 
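
Reading pairs off a large correlation table by eye is error-prone, so it can help to list each pair once and sort by absolute value. The following is a sketch on synthetic columns (`a`, `b`, `c` are hypothetical); in the notebook the same steps would apply to `corr`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b is a noisy copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
corr_toy = toy.corr()

# One entry per unordered pair of columns, then sort by absolute correlation
pairs = pd.Series({(x, y): corr_toy.loc[x, y]
                   for x, y in combinations(corr_toy.columns, 2)})
ranked = pairs.sort_values(key=abs, ascending=False)
print(ranked)
```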



#### Q17: What is the relationship between the type of wind and other properties?

In [286]:
groupby_wind_type = df.groupby(['Wind_Type'])[numeric_columns[:9]].mean().reset_index()
groupby_wind_type

Unnamed: 0,Wind_Type,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
0,CALM,67.197917,30.050469,62.874948,59.528125,0.047292,4.579167,47.50125,13.520833,30.115313
1,E,74.671233,30.068356,57.139315,59.384932,0.052603,4.391781,47.265205,16.986301,30.16411
2,ENE,76.657895,30.030677,53.676353,56.457895,0.115602,4.746241,43.879586,18.345865,30.134474
3,ESE,75.9,30.043,60.441667,61.54,0.012333,4.44,50.811333,17.2,30.131333
4,N,62.918079,30.036017,51.766017,54.691808,0.104181,4.658192,35.657316,17.166667,30.14096
5,NE,75.855263,29.998086,56.734067,58.321292,0.105179,4.708134,46.576986,16.519139,30.123852
6,NNE,66.033898,30.112458,52.860424,57.621186,0.108814,4.281356,39.149407,17.288136,30.204492
7,NNW,58.429719,29.902008,50.789759,53.547791,0.11988,4.737349,32.337992,18.630522,30.146747
8,NW,57.480149,29.956377,52.113859,53.72196,0.084194,4.921836,33.31928,20.615385,30.085211
9,S,70.606349,29.994201,65.720011,62.886243,0.062317,4.633545,52.860921,17.197884,30.079672


In [287]:
fig = go.Figure()

columns = groupby_wind_type.select_dtypes(include=np.number).columns

for col in columns:
    fig.add_trace(go.Scatter(x=groupby_wind_type['Wind_Type'], y=groupby_wind_type[col], mode='lines+markers', name=col))

# Update layout
fig.update_layout(
    title='Weather Parameters by Wind Type',
    xaxis=dict(title='Wind Type'),
    yaxis=dict(title='Value'),
    legend=dict(title='Parameter')
)

fig.show()

- **Temperature**: There is noticeable variation in average temperatures across different wind types. Wind types such as "CALM," "S," and "SE" tend to be associated with higher average temperatures, while "N," "NNW," and "WNW" show lower average temperatures.

- **Humidity**: Wind types like "ENE," "NE," and "E" exhibit higher average humidity levels compared to others, while "N," "NW," "W," and "WNW" tend to have lower humidity levels.

- **Precipitation**: Certain wind types, such as "ENE," "N," and "NE," show higher average precipitation levels compared to others. Conversely, wind types like "W," "WNW," and "WSW" have lower average precipitation.

- **Wind Speed**: There is variation in average maximum wind speeds across different wind types. For instance, "NE," "SE," and "VAR" show lower average maximum wind speeds, while "NW" and "W" exhibit higher speeds.

- **Pressure**: Average pressure values show less consistent patterns across wind types, with some variation but no clear trends.

#### Q18: What is the relationship between the condition and other properties?

In [288]:
groupby_condition = df.groupby(['Condition'])[numeric_columns[:9]].mean().reset_index()
groupby_condition

Unnamed: 0,Condition,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure
0,Cloudy,75.760653,29.933018,52.742344,52.888565,0.077926,4.841122,42.738239,18.080256,30.075774
1,Fair,53.96365,30.083279,54.216976,55.3529,0.052003,4.703712,33.864695,17.936582,30.175367
2,Fog,92.585366,29.946829,57.499512,54.04878,0.140732,4.853659,52.497073,17.219512,30.091951
3,Haze,55.0,29.76,71.236667,71.8,0.093333,4.8,55.303333,14.333333,29.84
4,Heavy Rain,93.0,30.07,48.975,46.75,0.59,5.1,45.04,23.5,30.185
5,Light Drizzle,89.583333,29.964167,46.4025,44.683333,0.499167,5.225,40.991667,19.166667,30.094167
6,Light Rain,86.533613,29.853992,51.749412,51.748319,0.349244,4.80042,45.693824,21.331933,30.05542
7,Light Snow,79.814815,29.347778,31.403333,37.774074,0.174444,5.498148,23.129444,22.944444,30.081481
8,Mist,93.0,30.03,68.41,63.3,0.0,3.1,65.97,10.0,30.1
9,Mostly Cloudy,63.352413,29.968827,61.978401,61.201066,0.079994,4.672671,46.08459,17.621212,30.0733


In [289]:
fig = go.Figure()

columns = groupby_condition.select_dtypes(include=np.number).columns

for col in columns:
    fig.add_trace(go.Scatter(x=groupby_condition['Condition'], y=groupby_condition[col], mode='lines+markers', name=col))

# Update layout
fig.update_layout(
    title='Weather Parameters by Condition',
    xaxis=dict(title='Condition'),
    yaxis=dict(title='Value'),
    legend=dict(title='Parameter')
)

fig.show()


- **Temperature**: Average temperature varies widely across weather conditions. In the rows shown above, "Haze" days are the warmest (71.24) and "Light Snow" days the coldest (31.40), with "Heavy Rain" days in between (48.98). "Fair" weather conditions tend to have moderate temperatures (54.22).

- **Humidity**: Weather conditions such as "Fog," "Mist," "Heavy Rain," and "Light Drizzle" are associated with higher average humidity levels, indicating moist atmospheric conditions. Conversely, "Fair" and "Partly Cloudy" conditions tend to have lower humidity levels.

- **Wind Speed**: There is variability in average maximum wind speeds across different weather conditions. Conditions such as "Heavy Rain" and "Rain" tend to have higher average maximum wind speeds compared to conditions like "Light Drizzle" or "Fair."

- **Sea Level Pressure**: Sea level pressure also varies among different weather conditions. "Rain" and "Rain and" conditions show slightly higher average sea level pressures compared to conditions like "Heavy Rain" or "Mist."

- **Dew Point**: In the rows shown above, "Mist" (65.97), "Haze" (55.30), and "Fog" (52.50) have the highest average dew points, indicating higher moisture content in the air. Conversely, "Light Snow" (23.13) has the lowest, and "Fair" days are also relatively dry (33.86).


#### Q19: Is there a relationship between wind type and weather conditions?

The significance level is 0.05, and the null hypothesis is that there is no relationship between the two features. A chi-square test of independence is applied to their contingency table.

In [290]:
import scipy.stats as stats

contingency_table = pd.crosstab(df['Wind_Type'], df['Condition'])
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

print(f"P-value: {p}")

P-value: 9.623057028373488e-159


The p-value is far smaller than 0.05, so the null hypothesis is rejected at the 0.05 significance level: there is a statistically significant relationship between the two features.
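
With several thousand rows, even a weak association produces a tiny p-value, so an effect size is a useful complement. One common choice is Cramér's V, computed from the chi-square statistic. The sketch below uses a hypothetical 2x2 table and a helper name (`cramers_v`) that is not part of the notebook:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    # correction=False disables the Yates correction that SciPy applies to 2x2 tables
    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical counts for illustration only
toy_table = pd.DataFrame([[30, 10], [10, 30]],
                         index=['NW', 'S'], columns=['Fair', 'Cloudy'])
v = cramers_v(toy_table)
print(f"Cramér's V: {v:.2f}")
```

In the notebook, `cramers_v(contingency_table)` would indicate how strong the wind type and condition relationship is, not just whether it exists.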

### Indexing with Time Series Data

In [291]:
df.set_index('Date',inplace=True)

In [292]:
# check
df.index

DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03', '2009-01-04',
               '2009-01-05', '2009-01-06', '2009-01-07', '2009-01-08',
               '2009-01-09', '2009-01-10',
               ...
               '2023-12-21', '2023-12-22', '2023-12-23', '2023-12-24',
               '2023-12-25', '2023-12-26', '2023-12-27', '2023-12-28',
               '2023-12-29', '2023-12-30'],
              dtype='datetime64[ns]', name='Date', length=5331, freq=None)

### Normalization 

In [293]:
numerical_columns = df.select_dtypes(include='number').columns

# Apply normalization  only to numerical columns
scaler = MinMaxScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

In [294]:
#check
df.head()

Unnamed: 0_level_0,Humidity_Avg(%),Wind_Type,Pressure_Avg(in),Condition,Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),Year,Month,Day_of_weak,DayOfYear
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2009-01-01,0.511364,NW,0.982137,Fair,0.149067,0.050218,0.0,1.0,0.187131,0.44898,0.631579,0.011396,0.0,0.0,0.5,0.0
2009-01-02,0.715909,SW,0.965898,Cloudy,0.260572,0.045852,0.0,1.0,0.346749,0.22449,0.578947,0.014245,0.0,0.0,0.666667,0.00274
2009-01-03,0.386364,WNW,0.972718,Fair,0.297189,0.041485,0.0,1.0,0.33447,0.367347,0.493421,0.017094,0.0,0.0,0.833333,0.005479
2009-01-04,0.465909,WNW,0.974667,Fair,0.293763,0.037118,0.0,1.0,0.287176,0.326531,0.519737,0.019943,0.0,0.0,1.0,0.008219
2009-01-05,0.465909,NW,0.971095,Cloudy,0.386487,0.034934,0.0,1.0,0.391769,0.22449,0.447368,0.022792,0.0,0.0,0.0,0.010959


#### Feature Selection

In [295]:
df.drop(['Year','Month','Day_of_weak','DayOfYear'],axis=1,inplace=True)

### Encoding

In [296]:
# One-hot encode the categorical columns with pandas (the unused OneHotEncoder instance is dropped)
df = pd.get_dummies(df, columns=['Wind_Type','Condition'], dtype=int)
# check
df.head()

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),...,Condition_Light Rain,Condition_Light Snow,Condition_Mist,Condition_Mostly Cloudy,Condition_Partly Cloudy,Condition_Rain,Condition_Rain and,Condition_Smoke,Condition_Snow,Condition_Wintry Mix
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2009-01-01,0.511364,0.982137,0.149067,0.050218,0.0,1.0,0.187131,0.44898,0.631579,0.011396,...,0,0,0,0,0,0,0,0,0,0
2009-01-02,0.715909,0.965898,0.260572,0.045852,0.0,1.0,0.346749,0.22449,0.578947,0.014245,...,0,0,0,0,0,0,0,0,0,0
2009-01-03,0.386364,0.972718,0.297189,0.041485,0.0,1.0,0.33447,0.367347,0.493421,0.017094,...,0,0,0,0,0,0,0,0,0,0
2009-01-04,0.465909,0.974667,0.293763,0.037118,0.0,1.0,0.287176,0.326531,0.519737,0.019943,...,0,0,0,0,0,0,0,0,0,0
2009-01-05,0.465909,0.971095,0.386487,0.034934,0.0,1.0,0.391769,0.22449,0.447368,0.022792,...,0,0,0,0,0,0,0,0,0,0


## Trend And Seasonality

#### Check For Stationarity

In [297]:
def check_stationarity(series):
    """ 
    Checks the stationarity of a given time series.

    Args:
        series (pd.Series): The time series to check for stationarity.

    Returns:
        bool: True if the series is stationary (p-value < 0.05), False otherwise.
    """
    result = adfuller(series)
    p_value = result[1]
    return p_value < 0.05 

In [298]:
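# Difference every column once and keep only the columns whose
# differenced series passes the ADF test (p-value < 0.05)
stationary_data = pd.DataFrame(index=df.index)
for col in df.columns:
    differenced_series = df[col].diff().dropna()  # first difference, without the leading NaN
    if check_stationarity(differenced_series):
        stationary_data[col] = differenced_series

# The first date has no difference, so its row is NaN and is removed here
stationary_data = stationary_data.dropna()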

## Modeling

### Split The Data (80% train, 20% test)

In [299]:
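# random_state only fixes the shuffle seed; train_test_split still shuffles rows by default,
# so this split is random rather than chronological (shuffle=False would keep time order)
train_data, test_data = train_test_split(stationary_data, test_size=0.2, random_state=False)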
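
For time series, the test set is normally the most recent stretch of data, which also keeps the date index monotonic. A minimal sketch of a chronological 80/20 split on a toy frame (the names `toy`, `train_toy`, and `test_toy` are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy daily series (hypothetical values)
idx = pd.date_range('2020-01-01', periods=10, freq='D')
toy = pd.DataFrame({'x': np.arange(10)}, index=idx)

# First 80% of rows for training, last 20% for testing, order preserved
split_at = int(len(toy) * 0.8)
train_toy, test_toy = toy.iloc[:split_at], toy.iloc[split_at:]

print(train_toy.index.max(), '<', test_toy.index.min())
```

Passing `shuffle=False` to `train_test_split` should produce the same kind of split.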


In [300]:
model = VAR(train_data)


A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.


A date index has been provided, but it is not monotonic and so will be ignored when e.g. forecasting.



In [301]:
results = model.fit()

In [302]:
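# forecast(y, steps) starts from the last k_ar rows of y; here y is the test set itself,
# and the result is matched to the test index by position only
forecast = results.forecast(test_data.values, len(test_data))
forecast = pd.DataFrame(forecast, columns=test_data.columns, index=test_data.index)
forecast.head()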

Unnamed: 0_level_0,Humidity_Avg(%),Pressure_Avg(in),Temperature_Avg,Temperature_Historic,Precipitation_Actual,Precipitation_Historic,Dew_Point,Max_Wind_Speed,Sea_Level_Pressure,Day_Length(Minutes),...,Condition_Light Rain,Condition_Light Snow,Condition_Mist,Condition_Mostly Cloudy,Condition_Partly Cloudy,Condition_Rain,Condition_Rain and,Condition_Smoke,Condition_Snow,Condition_Wintry Mix
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2021-04-11,0.00691,0.000598,-0.000775,-0.000212,-0.000349,-0.000167,0.001747,-0.002293,0.001971,-1.2e-05,...,0.001023,-0.005136,0.0002034327,-0.017841,0.010657,-0.000128,0.0002742846,0.000411,-2e-05,-0.001512
2017-06-17,0.001597,-0.000222,-3e-06,-3.5e-05,4.2e-05,-3.8e-05,0.001028,-0.000334,-0.000467,1e-06,...,0.000251,0.000201,0.0001685914,-0.001675,0.002606,0.000272,-6.62683e-06,0.000481,-0.000223,2.5e-05
2023-09-23,0.001476,-0.000267,0.000249,-6.8e-05,8.1e-05,-1.8e-05,0.001124,0.000264,-0.000857,-4.8e-05,...,-0.000305,-0.000776,-6.94781e-06,-0.000671,0.003645,0.00024,3.516363e-06,0.000472,-0.000238,2.4e-05
2019-11-15,0.001619,-0.000255,0.000239,-6.7e-05,7.1e-05,-2.6e-05,0.001143,0.00022,-0.000833,-4.7e-05,...,-0.000234,-0.000705,-1.266179e-08,-0.000537,0.003518,0.000234,4.98285e-07,0.000469,-0.000234,3e-06
2012-04-29,0.001621,-0.000255,0.000236,-6.7e-05,7e-05,-2.6e-05,0.001142,0.000221,-0.000832,-4.7e-05,...,-0.000231,-0.000703,-1.674012e-07,-0.000529,0.003516,0.000235,1.017143e-06,0.000469,-0.000235,3e-06


## Model evaluation

In [303]:
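mse_test = mean_squared_error(test_data, forecast)
# RMSE averaged over columns; computed with NumPy because `squared=False` is deprecated
# in recent scikit-learn releases, and without reassigning the imported `rmse` function
rmse_test = np.sqrt(mean_squared_error(test_data, forecast, multioutput='raw_values')).mean()
print(f'MSE on test data: {mse_test}')
print(f'RMSE on test data: {rmse_test}')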

MSE on test data: 0.06771102207327336
RMSE on test data: 0.19527102518439765
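
Note that the reported RMSE (about 0.1953) is not the square root of the reported MSE (the square root of 0.0677 is about 0.2602). That is consistent with the RMSE being computed per column and then averaged. The toy arrays below (hypothetical numbers) show how the two aggregations differ:

```python
import numpy as np

# Two columns with very different error sizes (hypothetical values)
y_true = np.array([[0.0, 0.0], [0.0, 0.0]])
y_pred = np.array([[0.1, 1.0], [0.1, 1.0]])

per_column_mse = ((y_true - y_pred) ** 2).mean(axis=0)  # one MSE per column
overall_mse = per_column_mse.mean()                      # uniform average of the column MSEs

sqrt_of_mse = np.sqrt(overall_mse)              # square root of the averaged MSE
mean_of_rmse = np.sqrt(per_column_mse).mean()   # average of the per-column RMSEs

print(f"sqrt of overall MSE:      {sqrt_of_mse:.4f}")
print(f"mean of per-column RMSEs: {mean_of_rmse:.4f}")
```

Both numbers here are in the units of the differenced, min-max-scaled columns, and they average over the one-hot columns as well, so a per-column breakdown is easier to interpret.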
