# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log transform and scale the selected continuous features

In [1]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [2]:
# Log transform the continuous features (np.log requires strictly positive values)
log_cont = ames[continuous].apply(np.log)

In [3]:
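from sklearn import preprocessing

# Min-max scale each log-transformed feature to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(log_cont)
# Keep the original index so the later concat aligns rows correctly
x_scaled = pd.DataFrame(x_scaled, columns=continuous, index=log_cont.index)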
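
To make the two steps above concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its values are hypothetical and are not taken from the Ames data. It applies the log and then the min-max formula by hand, which is what `MinMaxScaler` computes with its default settings.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive values standing in for two continuous features
toy = pd.DataFrame({'area': [1000.0, 5000.0, 20000.0],
                    'price': [80000.0, 150000.0, 400000.0]})

# Element-wise natural log; the result is still a DataFrame
toy_log = np.log(toy)

# Min-max scaling by hand: (x - min) / (max - min), computed per column
toy_scaled = (toy_log - toy_log.min()) / (toy_log.max() - toy_log.min())

print(toy_scaled.round(3))
```

After scaling, every column has a minimum of 0 and a maximum of 1, so coefficients fitted on these features are expressed on a common scale.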

## Categorical Features

In [4]:
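# One-hot encode the categoricals, dropping the first level of each as the baseline.
# dtype=float keeps the dummies numeric; recent pandas versions return booleans by default,
# which statsmodels may reject when they are mixed with float columns.
encoded_Cat = pd.get_dummies(ames[categoricals], drop_first=True, dtype=float)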
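
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical `toy_cat` frame (not the Ames data) with two of the lab's column names. For each column, the alphabetically first level is dropped and becomes the reference category.

```python
import pandas as pd

# Hypothetical rows; only the column names are borrowed from the lab
toy_cat = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                        'KitchenQual': ['Gd', 'TA', 'Ex']})

# 'Grvl' and 'Ex' sort first within their columns, so they are dropped
dummies = pd.get_dummies(toy_cat, drop_first=True, dtype=float)

print(list(dummies.columns))
```

A row with zeros in every dummy for a given feature belongs to the dropped level, so each fitted coefficient is read as a difference from that baseline.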

## Combine Categorical and Continuous Features

In [5]:
# Combine features into a single dataframe called preprocessed
preprocessed = pd.concat([encoded_Cat, x_scaled], axis=1)

In [6]:
predictors = list(preprocessed.columns)
predictors.remove('SalePrice')

## Run a linear model with SalePrice as the target variable in statsmodels

In [11]:
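import statsmodels.api as sm

# Separate the predictors from the target
X = preprocessed.drop('SalePrice', axis=1)
y = preprocessed['SalePrice']
# OLS does not fit an intercept by default, so add a constant column explicitly
X_int = sm.add_constant(X)
sm_model = sm.OLS(y, X_int).fit()
sm_model.summary()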


| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.839 |
| Model: | OLS | Adj. R-squared: | 0.834 |
| Method: | Least Squares | F-statistic: | 156.5 |
| Date: | Wed, 06 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 15:22:33 | Log-Likelihood: | 2241.3 |
| No. Observations: | 1460 | AIC: | -4387.0 |
| Df Residuals: | 1412 | BIC: | -4133.0 |
| Df Model: | 47 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1603 | 0.037 | 4.380 | 0.000 | 0.089 | 0.232 |
| BldgType_2fmCon | -0.0223 | 0.010 | -2.173 | 0.030 | -0.042 | -0.002 |
| BldgType_Duplex | -0.0546 | 0.008 | -6.813 | 0.000 | -0.070 | -0.039 |
| BldgType_Twnhs | -0.0182 | 0.012 | -1.513 | 0.130 | -0.042 | 0.005 |
| BldgType_TwnhsE | -0.0067 | 0.008 | -0.858 | 0.391 | -0.022 | 0.009 |
| KitchenQual_Fa | -0.1299 | 0.011 | -11.315 | 0.000 | -0.152 | -0.107 |
| KitchenQual_Gd | -0.0496 | 0.007 | -7.613 | 0.000 | -0.062 | -0.037 |
| KitchenQual_TA | -0.0870 | 0.007 | -12.111 | 0.000 | -0.101 | -0.073 |
| SaleType_CWD | 0.0297 | 0.028 | 1.061 | 0.289 | -0.025 | 0.085 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 289.988 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1242.992 |
| Skew: | -0.886 | Prob(JB): | 1.22e-270 |
| Kurtosis: | 7.159 | Cond. No. | 118.0 |


In [12]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

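## Predict the house price given the following raw (untransformed) characteristics

Make sure to transform your variables as needed! The model was fitted on log-transformed, min-max-scaled values, including `SalePrice`, so the continuous inputs need the same treatment and the prediction has to be converted back to the original price scale.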

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
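
For the continuous inputs listed above, the transformation has to reuse the range seen during training rather than a range computed from the new row. A minimal sketch of the forward and inverse steps follows. The helper names and the `lot_min`/`lot_max` values are hypothetical and are not read from the dataset.

```python
import numpy as np

def log_minmax(value, train_min, train_max):
    """Log-transform a raw value, then min-max scale it with the training range."""
    lo, hi = np.log(train_min), np.log(train_max)
    return (np.log(value) - lo) / (hi - lo)

def inverse_log_minmax(scaled, train_min, train_max):
    """Undo the scaling and the log to get back to the original units."""
    lo, hi = np.log(train_min), np.log(train_max)
    return np.exp(scaled * (hi - lo) + lo)

# Hypothetical training range for LotArea (assumed values, not from ames.csv)
lot_min, lot_max = 1_000.0, 200_000.0

lot_scaled = log_minmax(14977, lot_min, lot_max)
lot_back = inverse_log_minmax(lot_scaled, lot_min, lot_max)

print(lot_scaled, lot_back)
```

In the lab itself, the fitted scaler's `data_min_` and `data_max_` attributes hold the log-scale range of each column in `continuous`, and the inverse step applied to the model's output for `SalePrice` gives the predicted price in its original units. The categorical inputs become a row of dummy columns, with zeros for any level that was dropped as the baseline.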

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, min-max scaling, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!