# Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to these [example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/5.solution_evaluating_goodness_of_fit.ipynb).

1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- First, load the dataset from the weatherinszeged table from Thinkful's database.
- Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
- Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
- Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the differences put on the table by the interaction term and the visibility in terms of the improvement in the adjusted R-squared. Which one is more useful?
- Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.


2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the houseprices data from Thinkful's database.
- Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
- Do you think your model is satisfactory? If so, why?
- In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
- For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

### Import Statements

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from scipy.stats import bartlett
from scipy.stats import levene
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from statsmodels.tsa.stattools import acf
from sklearn import linear_model
from sqlalchemy import create_engine
import statsmodels.api as sm

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings('ignore')

### Loading the Weather Dataframe

In [4]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

engine.dispose()
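The connection cell above hardcodes the credentials. As a minimal sketch of one alternative, the snippet below assembles the same kind of URL from environment variables. The variable names (`POSTGRES_USER`, `POSTGRES_PW`, and so on), the helper `build_postgres_url`, and the example values are all hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment-style settings (hypothetical names)."""
    user = env["POSTGRES_USER"]
    # quote_plus escapes characters such as '@' or '/' that would otherwise break the URL.
    password = quote_plus(env["POSTGRES_PW"])
    host = env["POSTGRES_HOST"]
    port = env.get("POSTGRES_PORT", "5432")
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with made-up values; in practice the mapping would be os.environ.
example_env = {
    "POSTGRES_USER": "student",
    "POSTGRES_PW": "kx@jj5/g",
    "POSTGRES_HOST": "localhost",
    "POSTGRES_DB": "weatherinszeged",
}
print(build_postgres_url(example_env))
```

The resulting string could then be passed to `create_engine` in place of the formatted literal.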

### Weather Model Assignment Recap

- Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
- Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
- Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the differences put on the table by the interaction term and the visibility in terms of the improvement in the adjusted R-squared. Which one is more useful?
- Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

_Build a Linear Regression Model & Estimate Your Model Using Ordinary Least Squares (OLS)_

Tips:

- The target variable is the difference between the apparenttemperature and the temperature.
- The explanatory variables are humidity and windspeed.

In [5]:
# Y is the target variable.
temperature_differences = weather_df['apparenttemperature'] - weather_df['temperature']
Y = temperature_differences

# X is the feature set (or the explanatory variables).
X = weather_df[['humidity', 'windspeed']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        16:44:00   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4381      0.021    115.948      0.0

_What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?_

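The R-squared and adjusted R-squared values are both 0.288. They match to three decimal places because the dataset has 96,453 observations and the model has only two explanatory variables, so the adjustment penalty is negligible.

Those values don't seem satisfactory because they're low: the model explains only about 29% of the variance in the target variable. (Barring overfitting, a high adjusted R-squared means the model explains the target variable well.)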
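To see how the sample size affects the gap between R-squared and adjusted R-squared, here is a small sketch of the standard adjusted R-squared formula for a model with an intercept. The function name `adjusted_r_squared` is introduced here for illustration, the inputs are the rounded values printed in the summary above, and the 30-observation case is a made-up comparison.

```python
def adjusted_r_squared(r_squared, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept and n_features regressors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)


# Values read from the summary above: R-squared 0.288, 96453 observations, 2 features.
weather_adj = adjusted_r_squared(0.288, 96453, 2)
print(f"{weather_adj:.6f}")

# Hypothetical comparison: the same R-squared with only 30 observations
# is penalized much more visibly.
small_sample_adj = adjusted_r_squared(0.288, 30, 2)
print(f"{small_sample_adj:.3f}")
```

With tens of thousands of observations, the ratio `(n_obs - 1) / (n_obs - n_features - 1)` is almost exactly 1, so the two statistics agree after rounding.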

_Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS._

In [6]:
# Y is the target variable.
Y = temperature_differences

# Using feature engineering to capture the interaction between humidity and windspeed.
weather_df['humidity_times_windspeed'] = weather_df['humidity']*weather_df['windspeed']

# X is the feature set (or the explanatory variables).
X = weather_df[['humidity', 'windspeed', 'humidity_times_windspeed']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        16:49:34   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                   

_Now, what is the R-squared of this model? Does this model improve upon the previous one?_

The R-squared value of this model is 0.341, which is better than the previous model's 0.288. The adjusted R-squared also rose from 0.288 to 0.341, so the interaction term improves the model.
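Because the model without the interaction term is nested inside the model with it, one rough way to judge the size of the improvement is a partial F-statistic computed from the two R-squared values. The sketch below is an illustration rather than part of the original analysis: `partial_f_statistic` is a name introduced here, and the inputs are the rounded R-squared values and residual degrees of freedom from the printed summaries, so the result is only approximate.

```python
def partial_f_statistic(r2_restricted, r2_full, n_added_terms, df_resid_full):
    """F-statistic for the terms added to a nested OLS model, from the two R-squared values."""
    gain_per_term = (r2_full - r2_restricted) / n_added_terms
    unexplained_per_df = (1 - r2_full) / df_resid_full
    return gain_per_term / unexplained_per_df


# Rounded values from the summaries: R-squared 0.288 without the interaction,
# 0.341 with it, one added term, 96449 residual degrees of freedom in the larger model.
interaction_f = partial_f_statistic(0.288, 0.341, 1, 96449)
print(f"{interaction_f:.0f}")
```

A value in the thousands for a single added term suggests the improvement is far larger than what noise alone would produce, though statistical significance does not by itself make the overall fit satisfactory.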

_Add visibility as an additional explanatory variable to the first model and estimate it._

In [8]:
# Y is the target variable.
Y = temperature_differences

# Using feature engineering to capture the interaction between humidity and windspeed.
weather_df['humidity_times_windspeed'] = weather_df['humidity']*weather_df['windspeed']

# X is the feature set (or the explanatory variables).
X = weather_df[['humidity', 'windspeed', 'humidity_times_windspeed', 'visibility']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                 1.377e+04
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        16:51:25   Log-Likelihood:            -1.6504e+05
No. Observations:               96453   AIC:                         3.301e+05
Df Residuals:                   96448   BIC:                         3.301e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                   

_Did R-squared increase? What about adjusted R-squared?_

Yes. R-squared increased from 0.341 in the second model to 0.364 in this third model, and the adjusted R-squared increased from 0.341 to 0.363.

_Compare the differences put on the table by the interaction term and the visibility in terms of the improvement in the adjusted R-squared. Which one is more useful?_

In [10]:
# I'm going to test the model again by removing the interaction term and leaving in visibility.

# Y is the target variable.
Y = temperature_differences

# X is the feature set (or the explanatory variables).
X = weather_df[['humidity', 'windspeed', 'visibility']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                 1.401e+04
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        17:45:10   Log-Likelihood:            -1.6938e+05
No. Observations:               96453   AIC:                         3.388e+05
Df Residuals:                   96449   BIC:                         3.388e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5756      0.028     56.605      0.0

The interaction term improved the model more than adding visibility did. Relative to the first model's adjusted R-squared of 0.288, the interaction term raised it to 0.341, while visibility raised it to only 0.303.

_Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor._

AIC and BIC scores for each model:
- First model's scores (humidity and windspeed were the explanatory variables {EVs}):
    - AIC: 3.409e+05
    - BIC: 3.409e+05
    
    
- Second model's scores (humidity, windspeed and the interaction term {humidity times windspeed} were the EVs):
    - AIC: 3.334e+05
    - BIC: 3.334e+05
    
    
- Third model's scores (humidity, windspeed, interaction term and visibility were the EVs):
    - AIC: 3.301e+05
    - BIC: 3.301e+05
    
    
- Fourth model's scores (humidity, windspeed and visibility were the EVs):
    - AIC: 3.388e+05
    - BIC: 3.388e+05
    
    
Based on the AIC and BIC scores alone, the third model performed the best because it has the lowest value for both.
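The same comparison can be made programmatically. This is a small sketch that takes the AIC and BIC values printed in the four summaries above and picks the model with the smallest value of each (lower is better for both criteria). The dictionary labels and the helper `best_by` are introduced here for illustration.

```python
# AIC and BIC values as printed in the four summaries above.
weather_models = {
    "humidity + windspeed": {"aic": 3.409e5, "bic": 3.409e5},
    "humidity + windspeed + interaction": {"aic": 3.334e5, "bic": 3.334e5},
    "humidity + windspeed + interaction + visibility": {"aic": 3.301e5, "bic": 3.301e5},
    "humidity + windspeed + visibility": {"aic": 3.388e5, "bic": 3.388e5},
}


def best_by(models, metric):
    """Return the name of the model with the smallest value of the given metric."""
    return min(models, key=lambda name: models[name][metric])


for metric in ("aic", "bic"):
    print(metric.upper(), "->", best_by(weather_models, metric))
```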

### Loading the House Prices Dataframe

In [24]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
house_prices_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

### House Prices Model Recap

- Load the houseprices data from Thinkful's database.
- Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
- Do you think your model is satisfactory? If so, why?
- In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
- For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

_Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC._

In [25]:
# Y is the target variable.
Y = house_prices_df['saleprice']

# X is the feature set (or the explanatory variables).
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.761
Model:                            OLS   Adj. R-squared:                  0.760
Method:                 Least Squares   F-statistic:                     926.5
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        18:21:49   Log-Likelihood:                -17499.
No. Observations:                1460   AIC:                         3.501e+04
Df Residuals:                    1454   BIC:                         3.504e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -9.907e+04   4638.450    -21.359      

Here are the model's statistics:

- F-test: 926.5
- R-squared: 0.761
- Adjusted R-squared: 0.760
- AIC: 3.501e+04
- BIC: 3.504e+04
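For reference, AIC and BIC can both be recovered from the log-likelihood in the summary. The sketch below assumes the parameter count is the number of estimated coefficients including the constant, which reproduces the values printed above. The function name `aic_bic` is introduced here, and the log-likelihood is the rounded figure from the summary.

```python
import math


def aic_bic(log_likelihood, n_obs, n_params):
    """Return (AIC, BIC) given the log-likelihood, sample size and parameter count."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic


# From the summary above: log-likelihood -17499, 1460 observations,
# 5 explanatory variables plus the constant = 6 estimated parameters.
aic, bic = aic_bic(-17499.0, 1460, 6)
print(f"AIC: {aic:.3e}, BIC: {bic:.3e}")
```

Both criteria reward a higher log-likelihood and penalize extra parameters. BIC's penalty grows with the sample size, which is why it is slightly larger than AIC here.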

_Do you think your model's satisfactory?_

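I'm pleased with the R-squared and adjusted R-squared values: the model explains about 76% of the variance in sale prices. The F-statistic of 926.5 has a p-value of 0.00, which means the explanatory variables are jointly significant compared with a model that has only a constant. The AIC and BIC values aren't meaningful on their own; they're only useful for comparing this model with alternative specifications.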
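The overall F-statistic in the summary tests the null hypothesis that all slope coefficients are zero, and it can be related directly to R-squared. The sketch below uses the rounded R-squared from the summary, so it only approximately reproduces the reported 926.5. The function name `overall_f_statistic` is introduced here for illustration.

```python
def overall_f_statistic(r_squared, n_obs, n_features):
    """Overall F-statistic of an OLS model with an intercept, computed from R-squared."""
    explained_per_feature = r_squared / n_features
    unexplained_per_df = (1 - r_squared) / (n_obs - n_features - 1)
    return explained_per_feature / unexplained_per_df


# From the summary above: R-squared 0.761, 1460 observations, 5 explanatory variables.
f_stat = overall_f_statistic(0.761, 1460, 5)
print(f"{f_stat:.1f}")
```

Note that the F-statistic tends to fall as more variables are added unless they raise R-squared substantially, so a smaller F-statistic in a larger model does not by itself mean a worse fit.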

_In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables._

I'll make two more versions of my model: one where I remove two explanatory variables and another where I add two explanatory variables.

In [27]:
# Second model: the original five explanatory variables plus two more (fullbath and fireplaces).

# Y is the target variable.
Y = house_prices_df['saleprice']

# X is the feature set (or the explanatory variables).
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'fullbath', 'fireplaces']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.764
Method:                 Least Squares   F-statistic:                     675.5
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        18:22:11   Log-Likelihood:                -17487.
No. Observations:                1460   AIC:                         3.499e+04
Df Residuals:                    1452   BIC:                         3.503e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -9.452e+04   4729.676    -19.984      

In [28]:
# Third model: the original model with two explanatory variables removed (garagearea and totalbsmtsf).

# Y is the target variable.
Y = house_prices_df['saleprice']

# X is the feature set (or the explanatory variables).
X = house_prices_df[['overallqual', 'grlivarea', 'garagecars']]

# As a best practice, add a constant to the model.
X = sm.add_constant(X)

# Fit an OLS model using statsmodels.
results = sm.OLS(Y, X).fit()

# Print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.739
Method:                 Least Squares   F-statistic:                     1375.
Date:                Tue, 20 Aug 2019   Prob (F-statistic):               0.00
Time:                        18:22:41   Log-Likelihood:                -17563.
No. Observations:                1460   AIC:                         3.513e+04
Df Residuals:                    1456   BIC:                         3.516e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -9.883e+04   4842.897    -20.408      

_For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?_


- First model's scores ('overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf' were the explanatory variables {EVs}):
    - R-squared: 0.761
    - Adjusted R-squared: 0.760
    - F-test: 926.5
    - AIC: 3.501e+04
    - BIC: 3.504e+04
    
    
- Second model's scores ('overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf', 'fullbath', 'fireplaces' were the EVs):
    - R-squared: 0.765
    - Adjusted R-squared: 0.764
    - F-test: 675.5
    - AIC: 3.499e+04
    - BIC: 3.503e+04
    
    
- Third model's scores ('overallqual', 'grlivarea', 'garagecars' were the EVs):
    - R-squared: 0.739
    - Adjusted R-squared: 0.739
    - F-test: 1375
    - AIC: 3.513e+04
    - BIC: 3.516e+04
    
    
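The second model performed the best because it had the lowest AIC and BIC values, plus the highest R-squared and adjusted R-squared values. Its margin over the first model is small, though: adjusted R-squared rises only from 0.760 to 0.764. The third model performed the worst on all four of these metrics.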

The models' performances surprised me because I was expecting the first model to do the best.