# 1. Weather model

* Load the weather data from Kaggle
* As in the previous lesson, build a linear regression model where the target variable is the difference between the apparent temperature and the temperature. Use humidity and wind speed as explanatory variables, and estimate the model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action = "ignore")

%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

from sklearn import linear_model
import statsmodels.api as sm

from sklearn.model_selection import train_test_split

In [8]:
weather = pd.read_csv("data/weatherHistory.csv")
weather.head(5)

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472,7.389,0.89,14.12,251.0,15.826,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.356,7.228,0.86,14.265,259.0,15.826,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.378,9.378,0.89,3.928,204.0,14.957,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.289,5.944,0.83,14.104,269.0,15.826,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.756,6.978,0.83,11.045,259.0,15.826,0.0,1016.51,Partly cloudy throughout the day.


In [26]:
X = weather[["Humidity","Wind Speed (km/h)"]]

Y = weather["Apparent Temperature (C)"]-weather["Temperature (C)"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [28]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [29]:
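# Note: the sklearn model above is fit on the training split, while this
# statsmodels OLS is estimated on the test split (19291 rows).
X_test = sm.add_constant(X_test)
results = sm.OLS(y_test,X_test).fit()
results.summary()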

Dep. Variable:,y,R-squared:,0.283
Model:,OLS,Adj. R-squared:,0.283
Method:,Least Squares,F-statistic:,3810.0
Date:,"Thu, 17 Sep 2020",Prob (F-statistic):,0.0
Time:,20:10:56,Log-Likelihood:,-34231.0
No. Observations:,19291,AIC:,68470.0
Df Residuals:,19288,BIC:,68490.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4347,0.048,51.220,0.000,2.342,2.528
Humidity,-3.0331,0.054,-56.255,0.000,-3.139,-2.927
Wind Speed (km/h),-0.1200,0.002,-77.975,0.000,-0.123,-0.117

Omnibus:,852.068,Durbin-Watson:,2.018
Prob(Omnibus):,0.0,Jarque-Bera (JB):,993.141
Skew:,-0.508,Prob(JB):,2.2e-216
Kurtosis:,3.45,Cond. No.,87.8


* *R-squared and adjusted R-squared are both 0.283, which is low. Humidity and wind speed explain only about 28.3% of the variance in the target, so the model is not satisfactory.*
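The two values agree to three decimals because the adjustment depends on the ratio (n - 1)/(n - k - 1), which is almost 1 with 19291 observations and 2 regressors. A minimal sketch of the standard formula, using the numbers from the summary above (the function name `adjusted_r2` is only illustrative):

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and n_regressors slopes."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Values read from the OLS summary above: R-squared 0.283, 19291 rows, 2 regressors.
adj = adjusted_r2(0.283, 19291, 2)
print(round(adj, 3))  # prints 0.283
```

The penalty only becomes visible when the number of regressors is large relative to the number of rows.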

**Next, add the interaction of humidity and wind speed to the model above and estimate it using OLS. What is the R-squared of this model? Does it improve upon the previous one?**

In [31]:
weather["Humidity_wind_speed"] = weather.Humidity*weather["Wind Speed (km/h)"]

In [32]:
X = weather[["Humidity","Wind Speed (km/h)","Humidity_wind_speed"]]

Y = weather["Apparent Temperature (C)"]-weather["Temperature (C)"]

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [34]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [35]:
X_test = sm.add_constant(X_test)
results = sm.OLS(y_test,X_test).fit()
results.summary()

Dep. Variable:,y,R-squared:,0.34
Model:,OLS,Adj. R-squared:,0.34
Method:,Least Squares,F-statistic:,3308.0
Date:,"Thu, 17 Sep 2020",Prob (F-statistic):,0.0
Time:,20:27:10,Log-Likelihood:,-33438.0
No. Observations:,19291,AIC:,66880.0
Df Residuals:,19287,BIC:,66920.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0274,0.076,-0.361,0.718,-0.176,0.121
Humidity,0.3191,0.097,3.277,0.001,0.128,0.510
Wind Speed (km/h),0.0997,0.006,17.782,0.000,0.089,0.111
Humidity_wind_speed,-0.3113,0.008,-40.633,0.000,-0.326,-0.296

Omnibus:,968.332,Durbin-Watson:,2.012
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1655.018
Skew:,-0.41,Prob(JB):,0.0
Kurtosis:,4.177,Cond. No.,194.0


* The interaction term raises R-squared from 0.283 to 0.340, so the model improves a little, but it still explains only about a third of the variance in the target.
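With an interaction term, the effect of wind speed on the target is no longer a single number: it is the wind-speed coefficient plus the interaction coefficient times humidity. A small sketch using the coefficients reported in the summary above (the helper name is illustrative, and the humidity values 0.3 and 0.9 are arbitrary examples):

```python
# Coefficients copied from the interaction model's summary table.
B_WIND = 0.0997
B_INTERACTION = -0.3113

def wind_speed_effect(humidity):
    """Change in the target per extra km/h of wind at a given humidity (0 to 1)."""
    return B_WIND + B_INTERACTION * humidity

for h in (0.3, 0.9):
    print(h, round(wind_speed_effect(h), 4))
```

At a humidity of 0.9 the effect is clearly negative (about -0.18 per km/h), while at 0.3 it is close to zero, which is the pattern the single wind-speed coefficient in the first model could not capture.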

**Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with that contributed by visibility. Which one is more useful?**

In [36]:
X = weather[["Humidity","Wind Speed (km/h)","Humidity_wind_speed","Visibility (km)"]]

Y = weather["Apparent Temperature (C)"]-weather["Temperature (C)"]

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [38]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [39]:
X_test = sm.add_constant(X_test)
results = sm.OLS(y_test,X_test).fit()
results.summary()

Dep. Variable:,y,R-squared:,0.363
Model:,OLS,Adj. R-squared:,0.363
Method:,Least Squares,F-statistic:,2752.0
Date:,"Thu, 17 Sep 2020",Prob (F-statistic):,0.0
Time:,20:35:27,Log-Likelihood:,-33086.0
No. Observations:,19291,AIC:,66180.0
Df Residuals:,19286,BIC:,66220.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.2603,0.088,-14.393,0.000,-1.432,-1.089
Humidity,1.0711,0.100,10.747,0.000,0.876,1.266
Wind Speed (km/h),0.1132,0.006,20.479,0.000,0.102,0.124
Humidity_wind_speed,-0.3317,0.008,-43.862,0.000,-0.346,-0.317
Visibility (km),0.0669,0.002,26.774,0.000,0.062,0.072

Omnibus:,1075.191,Durbin-Watson:,2.011
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2074.751
Skew:,-0.41,Prob(JB):,0.0
Kurtosis:,4.382,Cond. No.,247.0


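* R-squared and adjusted R-squared both rise only slightly, from 0.340 to 0.363. Note that the model estimated here keeps the interaction term, so visibility adds about 0.023 on top of it, whereas the interaction term added about 0.057 to the first model. The interaction term is the more useful of the two.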
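One way to put the two additions on the same scale is the partial F-statistic for adding a single regressor, computed from the R-squared values before and after. The sketch below uses the rounded R-squared values and residual degrees of freedom from the summaries above, so the results are approximate; `partial_f` is an illustrative helper, not a statsmodels function.

```python
def partial_f(r2_restricted, r2_full, df_resid_full, n_added=1):
    """F-statistic for the regressors added between two nested OLS models."""
    gain_per_term = (r2_full - r2_restricted) / n_added
    unexplained_per_df = (1.0 - r2_full) / df_resid_full
    return gain_per_term / unexplained_per_df

# Interaction term: R-squared 0.283 -> 0.340, 19287 residual df in the larger model.
f_interaction = partial_f(0.283, 0.340, 19287)
# Visibility: R-squared 0.340 -> 0.363, 19286 residual df in the larger model.
f_visibility = partial_f(0.340, 0.363, 19286)

print(round(f_interaction), round(f_visibility))
```

Both statistics are large, but the one for the interaction term is more than twice the one for visibility, in line with the comparison of adjusted R-squared gains.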

**Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.**

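* Lower AIC and BIC scores are better. Across the three models, AIC falls from 68470 to 66880 to 66180 and BIC from 68490 to 66920 to 66220, so the last model (with the interaction term and visibility) is the best choice.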
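Both criteria can be recomputed from the log-likelihood, which makes the penalty for extra parameters explicit. The sketch below uses the standard definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L) with the log-likelihoods as printed in the summaries above; because those are rounded, the results match the reported scores only up to rounding. The labels and helper names are illustrative.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

N_OBS = 19291
# (label, log-likelihood, number of estimated coefficients including the constant)
models = [
    ("humidity + wind speed", -34231.0, 3),
    ("+ interaction", -33438.0, 4),
    ("+ interaction + visibility", -33086.0, 5),
]
for label, llf, k in models:
    print(label, round(aic(llf, k)), round(bic(llf, k, N_OBS)))
```

With more than 19000 rows, the gain in log-likelihood from each added term far outweighs the penalty, which is why both criteria keep falling.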

# 2. House prices model

**Load the houseprices data from Kaggle.**

In [42]:
house_prices = pd.read_csv("data/houseprices.csv")
house = house_prices.select_dtypes(exclude='object')
house.isnull().sum()*100/house.shape[0]

MSSubClass       0.000
LotFrontage     17.740
LotArea          0.000
OverallQual      0.000
OverallCond      0.000
YearBuilt        0.000
YearRemodAdd     0.000
MasVnrArea       0.548
BsmtFinSF1       0.000
BsmtFinSF2       0.000
BsmtUnfSF        0.000
TotalBsmtSF      0.000
1stFlrSF         0.000
2ndFlrSF         0.000
LowQualFinSF     0.000
GrLivArea        0.000
BsmtFullBath     0.000
BsmtHalfBath     0.000
FullBath         0.000
HalfBath         0.000
BedroomAbvGr     0.000
KitchenAbvGr     0.000
TotRmsAbvGrd     0.000
Fireplaces       0.000
GarageYrBlt      5.548
GarageCars       0.000
GarageArea       0.000
WoodDeckSF       0.000
OpenPorchSF      0.000
EnclosedPorch    0.000
3SsnPorch        0.000
ScreenPorch      0.000
PoolArea         0.000
MiscVal          0.000
MoSold           0.000
YrSold           0.000
SalePrice        0.000
dtype: float64

**Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.**

In [46]:
X = house[[x for x in house.columns if x not in ["LotFrontage","SalePrice","GarageYrBlt","MasVnrArea"]]]

Y = house["SalePrice"]

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [48]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

print("Coefficients : {}".format(model.coef_))
print("Intercept : {}".format(model.intercept_))

Coefficients : [-1.75548690e+02  4.09174234e-01  1.84955503e+04  3.83573710e+03
  3.37312908e+02  1.80209917e+02  1.00461555e+01 -4.61607099e+00
  3.82786747e-01  5.81287122e+00  1.31240039e+01  1.42134340e+01
  4.50422890e+00  3.18416668e+01  1.10492803e+04 -3.26518493e+02
  3.42606130e+03 -1.66531509e+03 -9.08810085e+03 -1.00971990e+04
  5.10100134e+03  4.37021557e+03  1.16074917e+04  5.58607730e-01
  2.55639049e+01 -6.62562003e+00  7.06908781e+00  3.85262374e+01
  6.55989735e+01 -3.73443286e+01 -7.15692563e-01 -2.09433750e+02
 -5.14524004e+02]
Intercept : -48411.54387291285


In [49]:
X_test = sm.add_constant(X_test)
results = sm.OLS(y_test,X_test).fit()
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.866
Model:,OLS,Adj. R-squared:,0.85
Method:,Least Squares,F-statistic:,54.04
Date:,"Thu, 17 Sep 2020",Prob (F-statistic):,1.83e-95
Time:,20:51:09,Log-Likelihood:,-3444.3
No. Observations:,292,AIC:,6953.0
Df Residuals:,260,BIC:,7070.0
Df Model:,31,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.32e+05,3.16e+06,0.137,0.891,-5.8e+06,6.66e+06
MSSubClass,-90.3671,60.490,-1.494,0.136,-209.479,28.745
LotArea,0.8979,0.403,2.230,0.027,0.105,1.691
OverallQual,1.392e+04,2801.257,4.970,0.000,8406.883,1.94e+04
OverallCond,7562.2941,2367.198,3.195,0.002,2900.973,1.22e+04
YearBuilt,271.1093,147.505,1.838,0.067,-19.348,561.566
YearRemodAdd,0.1949,157.715,0.001,0.999,-310.366,310.756
BsmtFinSF1,22.8507,5.732,3.987,0.000,11.564,34.137
BsmtFinSF2,5.7991,9.443,0.614,0.540,-12.796,24.394

Omnibus:,130.967,Durbin-Watson:,2.02
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1259.031
Skew:,1.558,Prob(JB):,4.03e-274
Kurtosis:,12.684,Cond. No.,1.13e+16


* F-statistic: 54.04, with Prob (F-statistic) of 1.83e-95, so the regressors are jointly significant.
* R-squared is 0.866 and adjusted R-squared is 0.850, both high.
* AIC: 6953. BIC: 7070.
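The overall F-statistic is tied directly to R-squared and the degrees of freedom. A minimal sketch of that relationship, using the rounded R-squared (0.866), Df Model (31) and Df Residuals (260) from the summary above; `overall_f` is an illustrative helper, and the result differs slightly from the reported 54.04 because R-squared is rounded to three decimals.

```python
def overall_f(r2, df_model, df_resid):
    """Overall F-statistic of an OLS fit from its R-squared and degrees of freedom."""
    explained_per_df = r2 / df_model
    unexplained_per_df = (1.0 - r2) / df_resid
    return explained_per_df / unexplained_per_df

print(round(overall_f(0.866, 31, 260), 1))
```

The same formula shows why a model with a similar R-squared but fewer regressors reports a much larger F-statistic: the explained variance is spread over fewer degrees of freedom.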

**Do you think your model is satisfactory? If so, why?**

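I think this is a good model because of its R-squared value: it explains about 86.6% of the variance in SalePrice. One caveat is the very large condition number (1.13e+16), which suggests that some of the regressors are linearly dependent.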
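A high R-squared is more convincing when it holds on rows the model has not seen. The summaries in this notebook are estimated on the test split itself, so a complementary check is to estimate on the training rows and score on the held-out rows. The sketch below shows the idea on synthetic data with plain NumPy; the data, the 80/20 split and the helper `with_constant` are assumptions for illustration, not the house prices data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 500 rows, 2 regressors, known linear signal plus noise.
n_rows = 500
X = rng.normal(size=(n_rows, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n_rows)

# 80/20 split, mirroring test_size=0.20 in the notebook.
order = rng.permutation(n_rows)
train_idx, test_idx = order[:400], order[400:]

def with_constant(features):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(features)), features])

# Estimate the coefficients on the training rows only.
coefs, *_ = np.linalg.lstsq(with_constant(X[train_idx]), y[train_idx], rcond=None)

# Score on the held-out rows: 1 - SS_res / SS_tot.
pred = with_constant(X[test_idx]) @ coefs
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2_test = 1.0 - ss_res / ss_tot
print(f"out-of-sample R-squared: {r2_test:.3f}")
```

If the out-of-sample value is much lower than the in-sample one, the in-sample fit is likely overstating how well the model generalizes.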

**In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.**

In [79]:
house["LotArea_year"] = house["LotArea"] * house["YearBuilt"]
house["LotArea_ov_qual"] = house["LotArea"] * house["OverallQual"]
house["LotArea_ov_cond"] = house["LotArea"] * house["OverallCond"]

In [80]:
X = house[["LotArea","OverallQual","LotArea_ov_qual","OverallCond","LotArea_ov_cond","YearBuilt","LotArea_year","BsmtFinSF1","TotalBsmtSF","1stFlrSF","2ndFlrSF","LowQualFinSF","GrLivArea","BedroomAbvGr","GarageCars","WoodDeckSF"]]
Y = house["SalePrice"]

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [82]:
model = linear_model.LinearRegression()
model.fit(X_train,y_train)

print("Coefficients : {}".format(model.coef_))
print("Intercept : {}".format(model.intercept_))

Coefficients : [ 3.35425844e+00  2.16258713e+04 -1.54901426e-01  5.39138015e+03
  1.25307311e-02  4.04086485e+02 -9.87460614e-04  1.58351610e+01
  8.03618592e+00  2.00907087e+01  1.19322688e+01  7.55342231e+00
  3.95763998e+01 -5.26649113e+03  1.33128067e+04  2.60336400e+01]
Intercept : -897698.9021217629


In [83]:
X_test = sm.add_constant(X_test)
results = sm.OLS(y_test,X_test).fit()
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.874
Model:,OLS,Adj. R-squared:,0.867
Method:,Least Squares,F-statistic:,127.3
Date:,"Fri, 18 Sep 2020",Prob (F-statistic):,2.02e-114
Time:,09:13:40,Log-Likelihood:,-3435.2
No. Observations:,292,AIC:,6902.0
Df Residuals:,276,BIC:,6961.0
Df Model:,15,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.16e+05,4.21e+05,0.513,0.608,-6.13e+05,1.04e+06
LotArea,-94.2967,41.399,-2.278,0.024,-175.794,-12.800
OverallQual,-709.8159,4133.514,-0.172,0.864,-8847.036,7427.405
LotArea_ov_qual,1.5008,0.324,4.635,0.000,0.863,2.138
OverallCond,2369.0293,3539.845,0.669,0.504,-4599.497,9337.556
LotArea_ov_cond,0.4984,0.303,1.646,0.101,-0.098,1.095
YearBuilt,-104.2656,214.509,-0.486,0.627,-526.548,318.016
LotArea_year,0.0422,0.021,1.990,0.048,0.000,0.084
BsmtFinSF1,18.0891,5.436,3.328,0.001,7.388,28.790

Omnibus:,102.807,Durbin-Watson:,2.126
Prob(Omnibus):,0.0,Jarque-Bera (JB):,732.694
Skew:,1.235,Prob(JB):,7.9e-160
Kurtosis:,10.357,Cond. No.,2.05e+20


**For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?**

I think the last model is the best one we obtained. Its adjusted R-squared is higher (0.867 versus 0.850) and its AIC and BIC are lower (6902 and 6961 versus 6953 and 7070), even though it uses fewer regressors.