# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for individual parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and standardize the selected continuous features

In [10]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
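
Before log-transforming the continuous features, it can help to check whether they are skewed at all. The sketch below is only an illustration on synthetic data: the `toy` frame and its `area` column are hypothetical stand-ins for a column such as `LotArea`, not the Ames data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature standing in for a column such as LotArea.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"area": rng.lognormal(mean=9.0, sigma=1.0, size=1000)})

raw_skew = toy["area"].skew()          # sample skewness of the raw values
log_skew = np.log(toy["area"]).skew()  # sample skewness after the log transform

print(f"skew before log: {raw_skew:.2f}")
print(f"skew after log:  {log_skew:.2f}")
```

A skewness much closer to zero after the transform suggests the log is doing useful work. If the raw skewness is already small, the transform may not be necessary.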


## Continuous Features

In [107]:
# Log transform and standardize the continuous features
def standard(series):
    # z-score: subtract the mean and divide by the standard deviation
    return (series - np.mean(series)) / np.std(series)

# New column names, e.g. 'log_LotArea_norm'
Cont_log_norm = ["log_" + feature + "_norm" for feature in continuous]

log = np.log(ames.loc[:, continuous])
normalize = log.apply(standard)
normalize.columns = Cont_log_norm
normalize


| | log_LotArea_norm | log_1stFlrSF_norm | log_GrLivArea_norm |
|---|---|---|---|
| 0 | -0.133231 | -0.803570 | 0.529260 |
| 1 | 0.113442 | 0.418585 | -0.381846 |
| 2 | 0.420061 | -0.576560 | 0.659675 |
| 3 | 0.103347 | -0.439287 | 0.541511 |
| 4 | 0.878409 | 0.112267 | 1.282191 |
| ... | ... | ... | ... |
| 1455 | -0.259188 | -0.465607 | 0.416680 |
| 1456 | 0.725419 | 1.981135 | 1.106592 |
| 1457 | -0.002325 | 0.228338 | 1.469942 |
| 1458 | 0.136861 | -0.077573 | -0.854471 |
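
A quick sanity check after this step is that every transformed column has a mean of roughly 0 and a standard deviation of roughly 1. The sketch below shows that check on a small illustrative frame. The `toy` values are placeholders rather than the full dataset, and the helper spells out the population standard deviation (`ddof=0`), which is the default for `np.std`.

```python
import numpy as np
import pandas as pd

# Small illustrative frame of positive-valued features (placeholder values).
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 14260.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0, 2198.0],
})

def standard(series):
    # z-score using the population standard deviation (ddof=0)
    return (series - series.mean()) / series.std(ddof=0)

toy_log_norm = np.log(toy).apply(standard)

# Each column should now have mean ~0 and standard deviation ~1.
col_means = toy_log_norm.mean()
col_stds = toy_log_norm.std(ddof=0)
print(col_means.round(6).tolist(), col_stds.round(6).tolist())
```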


## Categorical Features

In [99]:
# One hot encode categoricals
categorical_new = pd.get_dummies(ames[categoricals], drop_first = True, prefix = "Cat_")
categorical_new

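| | Cat__2fmCon | Cat__Duplex | Cat__Twnhs | Cat__TwnhsE | Cat__Fa | Cat__Gd | Cat__TA | Cat__CWD | Cat__Con | Cat__ConLD | ... | Cat__NoRidge | Cat__NridgHt | Cat__OldTown | Cat__SWISU | Cat__Sawyer | Cat__SawyerW | Cat__Somerst | Cat__StoneBr | Cat__Timber | Cat__Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1456 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1457 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1458 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |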
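
With a single string prefix, every dummy column gets the same `Cat__` prefix, so a name like `Cat__Gd` no longer says which source column it came from. One possible alternative, sketched here on a hypothetical three-row frame, is to omit `prefix` so that pandas uses each source column's name instead. The `dtype=int` argument is an added assumption, used only to get 0/1 values rather than booleans.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values only).
toy = pd.DataFrame({
    "BldgType": ["1Fam", "Duplex", "1Fam"],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Without an explicit prefix, pandas prefixes each dummy with its source column.
# drop_first=True drops the first (sorted) level of each column, which then
# acts as the reference category.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())  # ['BldgType_Duplex', 'KitchenQual_TA']
```

Here `1Fam` and `Gd` are the dropped reference levels, so a row with both of those values is encoded as all zeros.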


## Combine Categorical and Continuous Features

In [126]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.concat([normalize,categorical_new], axis = 1)
preprocessed

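| | log_LotArea_norm | log_1stFlrSF_norm | log_GrLivArea_norm | log_SalePrice_norm | Cat__2fmCon | Cat__Duplex | Cat__Twnhs | Cat__TwnhsE | Cat__Fa | Cat__Gd | ... | Cat__NoRidge | Cat__NridgHt | Cat__OldTown | Cat__SWISU | Cat__Sawyer | Cat__SawyerW | Cat__Somerst | Cat__StoneBr | Cat__Timber | Cat__Veenker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.133231 | -0.803570 | 0.529260 | 0.560068 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.113442 | 0.418585 | -0.381846 | 0.212764 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0.420061 | -0.576560 | 0.659675 | 0.734046 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.103347 | -0.439287 | 0.541511 | -0.437382 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.878409 | 0.112267 | 1.282191 | 1.014651 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | -0.259188 | -0.465607 | 0.416680 | 0.121434 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1456 | 0.725419 | 1.981135 | 1.106592 | 0.578020 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1457 | -0.002325 | 0.228338 | 1.469942 | 1.174708 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1458 | 0.136861 | -0.077573 | -0.854471 | -0.399656 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |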


## Run a linear model with SalePrice as the target variable in statsmodels

In [133]:
import statsmodels.api as sm

# Separate the target (log-transformed, standardized SalePrice) from the predictors
y = preprocessed['log_SalePrice_norm']
X = preprocessed.drop(['log_SalePrice_norm'], axis=1)



In [134]:
# Add an intercept column, then fit ordinary least squares
X_int = sm.add_constant(X)
model = sm.OLS(endog=y, exog=X_int).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | log_SalePrice_norm | R-squared: | 0.839 |
| Model: | OLS | Adj. R-squared: | 0.834 |
| Method: | Least Squares | F-statistic: | 156.5 |
| Date: | Tue, 28 Dec 2021 | Prob (F-statistic): | 0.0 |
| Time: | 21:17:42 | Log-Likelihood: | -738.64 |
| No. Observations: | 1460 | AIC: | 1573.0 |
| Df Residuals: | 1412 | BIC: | 1827.0 |
| Df Model: | 47 | | |
| Covariance Type: | nonrobust | | |
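
Several of the overall-fit statistics in this table can be reproduced by hand from the others, which is a useful way to see what they measure. The sketch below recomputes the adjusted R-squared, AIC, and BIC from the reported values. The variable names are illustrative, and because the inputs are themselves rounded, the results only match to the precision shown.

```python
import math

# Values read from the summary table.
n_obs = 1460              # No. Observations
df_model = 47             # Df Model (predictors, excluding the constant)
r_squared = 0.839         # R-squared (rounded in the summary)
log_likelihood = -738.64  # Log-Likelihood

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

# Information criteria, counting the constant as an estimated parameter.
k = df_model + 1
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

print(round(adj_r_squared, 3), round(aic), round(bic))  # 0.834 1573 1827
```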

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.1317 | 0.263 | -0.500 | 0.617 | -0.648 | 0.385 |
| log_LotArea_norm | 0.1033 | 0.019 | 5.475 | 0.000 | 0.066 | 0.140 |
| log_1stFlrSF_norm | 0.1371 | 0.016 | 8.584 | 0.000 | 0.106 | 0.168 |
| log_GrLivArea_norm | 0.3768 | 0.016 | 24.114 | 0.000 | 0.346 | 0.407 |
| Cat__2fmCon | -0.1715 | 0.079 | -2.173 | 0.030 | -0.326 | -0.017 |
| Cat__Duplex | -0.4205 | 0.062 | -6.813 | 0.000 | -0.542 | -0.299 |
| Cat__Twnhs | -0.1404 | 0.093 | -1.513 | 0.130 | -0.322 | 0.042 |
| Cat__TwnhsE | -0.0512 | 0.060 | -0.858 | 0.391 | -0.168 | 0.066 |
| Cat__Fa | -1.0002 | 0.088 | -11.315 | 0.000 | -1.174 | -0.827 |

| | | | |
|---|---|---|---|
| Omnibus: | 289.988 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1242.992 |
| Skew: | -0.886 | Prob(JB): | 1.22e-270 |
| Kurtosis: | 7.159 | Cond. No. | 109.0 |


## Run the same model in scikit-learn

In [144]:
from sklearn.linear_model import LinearRegression

# scikit-learn fits the intercept itself, so pass X without the constant column
linreg = LinearRegression().fit(X=X, y=y)

# Check that the intercept and coefficients match those from statsmodels
linreg.intercept_
linreg.coef_

array([ 0.10327192,  0.1371289 ,  0.37682133, -0.17152105, -0.42048287,
       -0.14038921, -0.05121949, -1.00020261, -0.38215288, -0.6694784 ,
        0.22855565,  0.58627941,  0.31521364,  0.03310544,  0.01609215,
        0.29995612,  0.1178827 ,  0.17486316,  1.06700108,  0.8771105 ,
        0.99643261,  1.10266268, -0.21318409,  0.0529509 , -0.46287108,
       -0.65004527, -0.21026441, -0.0761186 , -0.08236455, -0.76152767,
       -0.09803299, -0.96216285, -0.6920628 , -0.25540919, -0.4408245 ,
       -0.01595592, -0.26772132,  0.36325607,  0.36272091, -0.93537011,
       -0.70000301, -0.47559431, -0.23317719,  0.09506225,  0.42971796,
        0.00569435,  0.12766986])

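## Predict the house price given the following characteristics (raw values, before any transformation)

Make sure to apply the same transformations used for training: log-transform and standardize the continuous features, and one-hot encode the categorical ones. The model's prediction is on the standardized log scale, so convert it back to get a price in dollars.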

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
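
The continuous part of this transformation can be sketched as a pair of helper functions. This is a minimal illustration under stated assumptions: `to_model_scale`, `from_model_scale`, and the two short training arrays are hypothetical, and in the lab the statistics would come from the full `ames` columns. For the categorical features, the levels dropped by `drop_first` are encoded as all zeros and every other level sets its own dummy column to 1.

```python
import numpy as np

# Hypothetical training columns in raw units (placeholder values).
train_lot_area = np.array([8450.0, 9600.0, 11250.0, 9550.0, 14260.0])
train_sale_price = np.array([208500.0, 181500.0, 223500.0, 140000.0, 250000.0])

def to_model_scale(raw_value, train_raw):
    """Log-transform a raw value, then standardize with the training log statistics."""
    train_log = np.log(train_raw)
    return (np.log(raw_value) - train_log.mean()) / train_log.std()

def from_model_scale(z_value, train_raw):
    """Invert the standardization, then the log, to return to raw units."""
    train_log = np.log(train_raw)
    return np.exp(z_value * train_log.std() + train_log.mean())

# Feature side: the new house's LotArea on the scale the model was fitted on.
lot_area_z = to_model_scale(14977, train_lot_area)

# Target side: a prediction comes out as a z-score of log(SalePrice).
# 0.0 is a placeholder here, not a real model prediction.
placeholder_pred_z = 0.0
price = from_model_scale(placeholder_pred_z, train_sale_price)
print(lot_area_z, price)
```

The key design point is that a new observation is standardized with the mean and standard deviation of the training data's log values, never with statistics computed from the new observation itself.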

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!