# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log-transform and min-max scale the selected continuous features

In [1]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [2]:
# Log transform the continuous features (min-max scaling follows in the next cell)
log_cont = ames[continuous].apply(np.log)

In [3]:
from sklearn import preprocessing

# Min-max scale the logged features to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(log_cont)
x_scaled = pd.DataFrame(x_scaled, columns=continuous)
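As a quick check of what the two cells above do, the sketch below applies the same log transform and min-max scaling to a few toy values and compares the result with the manual formula `(x - min) / (max - min)`. The `toy` frame and its numbers are hypothetical stand-ins, not rows read from `ames.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values standing in for a right-skewed feature such as LotArea
toy = pd.DataFrame({'LotArea': [1300.0, 8450.0, 9600.0, 11250.0, 215245.0]})

# Log transform first, as in the lab
log_toy = np.log(toy)

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (log_toy - log_toy.min()) / (log_toy.max() - log_toy.min())

# The same scaling done by scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(log_toy), columns=toy.columns)

# Skewness before and after the log transform
print(toy['LotArea'].skew(), log_toy['LotArea'].skew())

# True if the manual formula and MinMaxScaler agree
print(np.allclose(manual, scaled))
```

Comparing the skewness before and after the transform is one simple way to judge whether a log transform is worthwhile for a given feature.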

## Categorical Features

In [4]:
# One hot encode categoricals
# dtype=float keeps the dummies numeric (recent pandas versions default to bool)
encoded_Cat = pd.get_dummies(ames[categoricals], drop_first=True, dtype=float)

## Combine Categorical and Continuous Features

In [5]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.concat([encoded_Cat,x_scaled],axis=1)

In [6]:
predictors = list(preprocessed.columns)
predictors.remove('SalePrice')

## Run a linear model with SalePrice as the target variable in statsmodels

In [7]:
import statsmodels.formula.api as smf

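In [20]:
# Your code here
# Wrap each column name in patsy's Q() so that names that are not valid
# Python identifiers (e.g. '1stFlrSF') do not raise a SyntaxError
y = 'SalePrice'
pred_sum = ' + '.join(f'Q("{p}")' for p in predictors)
formula = y + ' ~ ' + pred_sum
sm_model = smf.ols(formula, data=preprocessed).fit()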

In [7]:
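import statsmodels.api as sm

# Build the design matrix from X only; adding the constant to `preprocessed`
# would leave SalePrice among the predictors and produce a trivial perfect fit
X = preprocessed.drop('SalePrice', axis=1)
y = preprocessed['SalePrice']
X_int = sm.add_constant(X)
sm_model = sm.OLS(y, X_int).fit()
sm_model.summary()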



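Note that R-squared is exactly 1.0 and every listed coefficient is on the order of 1e-16: this output comes from a fit in which `SalePrice` itself was among the predictors (`sm.add_constant(preprocessed)`), so it does not measure how well the other features explain price. Only the first rows of the coefficient table are shown.

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 2.831e+30 |
| Date: | Wed, 06 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 15:21:14 | Log-Likelihood: | 49627.0 |
| No. Observations: | 1460 | AIC: | -99160.0 |
| Df Residuals: | 1411 | BIC: | -98900.0 |
| Df Model: | 48 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -5.829e-16 | 2.96e-16 | -1.969 | 0.049 | -1.16e-15 | -2.19e-18 |
| BldgType_2fmCon | -4.302e-16 | 8.25e-17 | -5.214 | 0.000 | -5.92e-16 | -2.68e-16 |
| BldgType_Duplex | -9.714e-17 | 6.54e-17 | -1.484 | 0.138 | -2.26e-16 | 3.12e-17 |
| BldgType_Twnhs | -6.939e-16 | 9.69e-17 | -7.162 | 0.000 | -8.84e-16 | -5.04e-16 |
| BldgType_TwnhsE | -4.337e-16 | 6.23e-17 | -6.964 | 0.000 | -5.56e-16 | -3.12e-16 |
| KitchenQual_Fa | -2.559e-16 | 9.63e-17 | -2.657 | 0.008 | -4.45e-16 | -6.69e-17 |
| KitchenQual_Gd | -7.286e-17 | 5.34e-17 | -1.363 | 0.173 | -1.78e-16 | 3.2e-17 |
| KitchenQual_TA | -1.475e-16 | 6.06e-17 | -2.433 | 0.015 | -2.66e-16 | -2.86e-17 |
| SaleType_CWD | 4.718e-16 | 2.25e-16 | 2.099 | 0.036 | 3.1e-17 | 9.13e-16 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 161.427 | Durbin-Watson: | 1.913 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 215.438 |
| Skew: | 0.922 | Prob(JB): | 1.65e-47 |
| Kurtosis: | 3.378 | Cond. No. | 121.0 |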


## Run the same model in scikit-learn

In [None]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels

## Predict the house price given the following characteristics (raw values, before any transformation)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
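The transformation steps for a new observation can be sketched as follows. This is a reduced illustration under stated assumptions, not the lab solution: it uses a small `toy` frame with only two continuous features and one categorical feature, computes the min-max bounds by hand, and fits a scikit-learn model. The names `toy`, `new_house`, `lo`, and `hi` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small illustrative stand-in for the Ames data (toy rows, reduced feature set)
toy = pd.DataFrame({
    'LotArea':     [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    'GrLivArea':   [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090],
    'KitchenQual': ['Gd', 'TA', 'Gd', 'Gd', 'Gd', 'TA', 'Gd', 'TA'],
    'SalePrice':   [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

# Same preprocessing as the lab: log, min-max scale, one-hot encode
cont = ['LotArea', 'GrLivArea', 'SalePrice']
log_cont = np.log(toy[cont])
lo, hi = log_cont.min(), log_cont.max()          # training bounds, reused below
scaled = (log_cont - lo) / (hi - lo)
dummies = pd.get_dummies(toy[['KitchenQual']], drop_first=True, dtype=float)
train = pd.concat([dummies, scaled], axis=1)

X = train.drop('SalePrice', axis=1)
model = LinearRegression().fit(X, train['SalePrice'])

# Raw characteristics of the new house
new_house = {'LotArea': 14977, 'GrLivArea': 1976, 'KitchenQual': 'Gd'}

# Start from an all-zero row with the training columns, then fill it in
row = pd.DataFrame(0.0, index=[0], columns=X.columns)
for col in ['LotArea', 'GrLivArea']:
    # Log, then scale with the bounds learned from the training data
    row.loc[0, col] = (np.log(new_house[col]) - lo[col]) / (hi[col] - lo[col])
dummy_col = f"KitchenQual_{new_house['KitchenQual']}"
if dummy_col in row.columns:      # the dropped reference level has no column
    row.loc[0, dummy_col] = 1.0

# The prediction is on the scaled log scale: undo the scaling, then the log
pred_scaled = model.predict(row)[0]
pred_price = np.exp(pred_scaled * (hi['SalePrice'] - lo['SalePrice']) + lo['SalePrice'])
print(f"{pred_price:,.0f}")
```

The key design point is that the new observation is transformed with the minimum and maximum learned from the training data, and that the predicted value is mapped back to dollars by inverting the scaling and the log in reverse order.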

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, min-max scaling, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!