# Evaluating Performance - Weather Model

## By Jean-Philippe Pitteloud

### Requirements

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import statsmodels.api as sm
from scipy.stats import jarque_bera
from scipy.stats import normaltest
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Data Gathering

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged',con = engine)

engine.dispose()


weather_df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


### Modeling and Evaluation

#### _Model #1:_

In our first and simplest model, a linear regression is fitted using ordinary least squares (OLS). The target variable is the difference between the variables 'apparenttemperature' and 'temperature'. The independent variables are 'humidity' and 'windspeed'. The results and statistics of the model are presented below.

In [3]:
Y_1 = weather_df['apparenttemperature'] - weather_df['temperature']

X_1 = weather_df[['humidity', 'windspeed']]

X_1 = sm.add_constant(X_1)

results_1 = sm.OLS(Y_1, X_1).fit()

results_1.summary()

Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Mon, 14 Oct 2019",Prob (F-statistic):,0.0
Time:,10:22:52,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


As can be seen in the table above, the first model explains about 29% of the variance in the target (R-squared of 0.288). All independent variables, as well as the constant (intercept), are statistically significant, judging by their associated p-values (threshold 0.05). The F-statistic is 19,490, and its associated p-value ('Prob (F-statistic)') is reported as 0.0. A p-value this close to zero indicates that the difference in explanatory power between our model and an "empty" (intercept-only) model is statistically significant.
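To make the link between R-squared and the overall F-statistic concrete, here is a minimal sketch that recomputes the statistic from the values in the summary table. The helper name `overall_f_statistic` is illustrative (it is not part of statsmodels), and the inputs are the rounded figures reported above.

```python
def overall_f_statistic(r_squared, n_obs, n_predictors):
    """F-statistic of a fitted model against the intercept-only ("empty") model."""
    # Variance explained per predictor.
    explained = r_squared / n_predictors
    # Variance left unexplained per residual degree of freedom.
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# Values copied from the Model #1 summary table (R-squared is rounded there).
f_model_1 = overall_f_statistic(r_squared=0.288, n_obs=96453, n_predictors=2)
print(round(f_model_1))
```

The result should land close to the reported 19,490. Any small gap comes from using the rounded R-squared of 0.288 rather than the full-precision value.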

For the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), the first model gave a value of about 340,900 for both. These values take into account the Sum of Squared Errors (SSE), through the log-likelihood, along with the number of parameters and, for BIC, the sample size. These two metrics are very useful for comparing the performance of models, so in the next two sections modified models are estimated and their statistics compared to the ones discussed in this section.
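As a rough check on how these two criteria are built, the sketch below applies the standard definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L) to the log-likelihood reported for Model #1. The function names are illustrative, and k = 3 counts the constant plus the two predictors.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: penalizes each parameter by 2."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: penalizes each parameter by ln(n)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Model #1: log-likelihood of -170460 (rounded in the summary table),
# 3 estimated parameters (const, humidity, windspeed), 96453 observations.
aic_1 = aic(-170460.0, 3)
bic_1 = bic(-170460.0, 3, 96453)
print(aic_1, round(bic_1, 1))
```

Both values agree with the 340,900 shown in the table once rounded to four significant digits. With so many observations, the penalty terms are tiny compared with the log-likelihood term, which is why AIC and BIC look identical in the summary.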

#### _Model #2:_

In this second model, the target variable is the same (difference between apparent and actual temperature) while a third independent variable is included. This new feature accounts for the interaction between the two original variables, 'humidity' and 'windspeed'. The model is fitted and evaluated below.

In [4]:
weather_df['humidity_windspeed_inter'] = weather_df.humidity * weather_df.windspeed

Y_2 = weather_df['apparenttemperature'] - weather_df['temperature']

X_2 = weather_df[['humidity','windspeed', 'humidity_windspeed_inter']]

X_2 = sm.add_constant(X_2)

results_2 = sm.OLS(Y_2, X_2).fit()

results_2.summary()

Dep. Variable:,y,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Mon, 14 Oct 2019",Prob (F-statistic):,0.0
Time:,10:22:52,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0839,0.033,2.511,0.012,0.018,0.149
humidity,0.1775,0.043,4.133,0.000,0.093,0.262
windspeed,0.0905,0.002,36.797,0.000,0.086,0.095
humidity_windspeed_inter,-0.2971,0.003,-88.470,0.000,-0.304,-0.291

Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,-0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


As can be seen in the summary table above, the R-squared value obtained was 0.341, suggesting that 34% of the variance in the target variable may be explained by the model. This higher value compared to the original model (29%) indicates that introducing the interaction feature increased the explanatory power of the model. As in the original model, all features in this new model are statistically significant as judged by their associated p-values. The F-test for the new model gave an F-statistic of 16,660. This value on its own confirms the superior explanatory power of this model compared to an "empty" model. However, using it to compare against our original model is not advised: each overall F-statistic is measured against the "empty" model, so although Model #1 is nested in Model #2, the two values do not test one model against the other.
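With an interaction term in the model, the effect of 'windspeed' on the target is no longer a single number: it depends on 'humidity'. The following sketch, using the coefficients reported for Model #2, shows how that slope changes. The helper `windspeed_slope` and the humidity values chosen are illustrative only.

```python
# Coefficients copied from the Model #2 summary table.
COEF_WINDSPEED = 0.0905
COEF_INTERACTION = -0.2971


def windspeed_slope(humidity):
    """Change in the target per unit of windspeed at a given humidity."""
    return COEF_WINDSPEED + COEF_INTERACTION * humidity


# Illustrative humidity levels (the column is recorded as a 0-1 fraction).
for humidity in (0.2, 0.5, 0.9):
    print(f"humidity={humidity:.1f}: slope={windspeed_slope(humidity):+.4f}")

# Humidity at which the windspeed slope changes sign.
break_even = -COEF_WINDSPEED / COEF_INTERACTION
print(f"slope is zero at humidity of about {break_even:.3f}")
```

Under these coefficients, wind lowers the apparent temperature relative to the measured temperature more strongly as humidity rises. This also explains why the individual 'humidity' and 'windspeed' coefficients differ so much from those in Model #1.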

Regarding the AIC and BIC metrics, this new model outperforms the original, with smaller values of 333,400 compared to 340,900. The smaller values suggest a better fit from the new model, even after penalizing the extra parameter.

#### _Model #3:_

In our third and final model of this assignment, the target variable is unchanged while the predictor variables are 'humidity', 'windspeed', and 'visibility'. Recall that the original model included only 'humidity' and 'windspeed' as predictors. The new model is fitted and evaluated, and the summary of the results is presented below.

In [5]:
Y_3 = weather_df['apparenttemperature'] - weather_df['temperature']

X_3 = weather_df[['humidity', 'windspeed', 'visibility']]

X_3 = sm.add_constant(X_3)

results_3 = sm.OLS(Y_3, X_3).fit()

results_3.summary()

Dep. Variable:,y,R-squared:,0.304
Model:,OLS,Adj. R-squared:,0.303
Method:,Least Squares,F-statistic:,14010.0
Date:,"Mon, 14 Oct 2019",Prob (F-statistic):,0.0
Time:,10:22:52,Log-Likelihood:,-169380.0
No. Observations:,96453,AIC:,338800.0
Df Residuals:,96449,BIC:,338800.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.5756,0.028,56.605,0.000,1.521,1.630
humidity,-2.6066,0.025,-102.784,0.000,-2.656,-2.557
windspeed,-0.1199,0.001,-179.014,0.000,-0.121,-0.119
visibility,0.0540,0.001,46.614,0.000,0.052,0.056

Omnibus:,3833.895,Durbin-Watson:,0.282
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4584.022
Skew:,-0.459,Prob(JB):,0.0
Kurtosis:,3.545,Cond. No.,131.0


From the table above, we can conclude that this third model explains about 30% of the variance in the target variable. This R-squared value represents an improvement in explanatory power compared to Model #1 (29%) but is still outperformed by Model #2 (34%). In this case also, all included features were estimated as statistically significant. Concerning the F-statistic, this new model is associated with a value of 14,010 while the original model was associated with a larger value of 19,490. Model #1 is nested in Model #3, which makes a comparison between the two more meaningful. Judging by the F-statistic values alone, Model #1 could be considered more efficient at explaining the variance in the target variable than Model #3; however, other metrics should be considered. Remember that the F-statistic measures the performance of a model in comparison to an "empty" model, and that it is scaled by the number of predictors, so adding a feature can lower it even when R-squared increases.
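Because Model #1 is nested in Model #3, the two can be compared directly with a partial F-test, which asks whether the added predictor explains significantly more variance. The sketch below computes that statistic from the rounded R-squared values in the summary tables. The helper name `partial_f_statistic` is an assumption made for illustration; statsmodels results objects also offer a `compare_f_test` method that could be applied to the fitted models instead.

```python
def partial_f_statistic(r2_restricted, r2_full, n_obs, k_full, n_added):
    """F-statistic for the predictors added to a nested (restricted) model."""
    # Extra variance explained per added predictor.
    gained = (r2_full - r2_restricted) / n_added
    # Variance left unexplained by the full model per residual degree of freedom.
    unexplained = (1.0 - r2_full) / (n_obs - k_full - 1)
    return gained / unexplained


# Model #1 (restricted, R-squared 0.288) vs Model #3 (full, R-squared 0.304,
# 3 predictors); one predictor ('visibility') was added.
f_partial = partial_f_statistic(
    r2_restricted=0.288, r2_full=0.304, n_obs=96453, k_full=3, n_added=1
)
print(round(f_partial))
```

With a single added predictor, this statistic equals the square of that predictor's t-value, so it should be close to 46.614 squared (about 2,173) for 'visibility'. The difference comes from the rounded R-squared inputs. A value this large indicates that 'visibility' adds explanatory power, even though the overall F-statistic of Model #3 is lower.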

In terms of the AIC and BIC metrics, the comparison between all three models reflects the conclusions drawn from comparing the R-squared values. Model #2 (333,400, with the interaction between 'humidity' and 'windspeed') outperformed Model #3 (338,800, with 'visibility') and Model #1 (340,900, with only 'humidity' and 'windspeed'). From these results, it seems that including more features helped improve the explanatory power of the model, resulting in higher R-squared values and lower AIC and BIC values. Also, in this particular case the inclusion of an engineered feature, such as the interaction between two of the existing variables, proved to be beneficial to the performance of the model.
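The comparison above can be summarized in a few lines. This is a small sketch that ranks the three models using only the figures reported in the summary tables. The dictionary layout and variable names are illustrative.

```python
# Metrics copied from the three summary tables (AIC as displayed there).
models = {
    "Model #1": {"r_squared": 0.288, "aic": 340900.0},
    "Model #2": {"r_squared": 0.341, "aic": 333400.0},
    "Model #3": {"r_squared": 0.304, "aic": 338800.0},
}

# Lower AIC is better; higher R-squared is better.
ranking_by_aic = sorted(models, key=lambda name: models[name]["aic"])
ranking_by_r2 = sorted(
    models, key=lambda name: models[name]["r_squared"], reverse=True
)

best_aic = models[ranking_by_aic[0]]["aic"]
for name in ranking_by_aic:
    delta = models[name]["aic"] - best_aic
    print(f"{name}: R2={models[name]['r_squared']:.3f}, AIC gap to best={delta:.0f}")
```

Both rankings place Model #2 first, then Model #3, then Model #1, matching the discussion above. The AIC gaps are in the thousands, so the ordering does not hinge on rounding in the table.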