# Basic Linear Regression Model

The 12 features that I expect to have the most influence on house price:

| Model Features | | | | |
| --- | --- | --- | --- | --- |
| **Numeric** | | | | |
| Lot Area | Overall Cond | Overall Qual | Garage Area | Total Bathrooms |
| Year Remod/Add | Total SF | | | |
| **Categorical** | | | | |
| MS SubClass | Neighborhood | Condition 1 | Exter Qual | Kitchen Qual |

In [1]:
#importing libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
#import re

# Data Dictionary - [Link](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) 

In [2]:
#importing complete training data
housing_data = pd.read_csv("../datasets/complete_training_data.csv")
#importing testing data
testing_data = pd.read_csv("../datasets/complete_kaggle_test.csv")

In [3]:
housing_data.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

In [4]:
ms_subclass_dummies = pd.get_dummies(housing_data["MS SubClass"],prefix="SubClass")
ms_subclass_dummies.drop(columns=["SubClass_150","SubClass_40"],inplace=True)

neighborhood_dummies = pd.get_dummies(housing_data["Neighborhood"])
neighborhood_dummies.drop(columns=["Landmrk","GrnHill","Greens","Blueste"],inplace=True)

condition_1_dummies = pd.get_dummies(housing_data["Condition 1"])
condition_1_dummies.drop(columns=["RRNe","RRNn"],inplace=True)

exter_qual_dummies = pd.get_dummies(housing_data["Exter Qual"],prefix="ExQ")
exter_qual_dummies.drop(columns="ExQ_Fa",inplace=True)

kitchen_qual_dummies = pd.get_dummies(housing_data["Kitchen Qual"],prefix="Kit")
kitchen_qual_dummies.drop(columns="Kit_Fa",inplace=True)

In [5]:
xvars = ["Lot Area","Overall Cond","Overall Qual","Total SF","Garage Area","Year Remod/Add","Total Bathrooms"]

In [6]:
X = housing_data[xvars]
y = housing_data["SalePrice"]

In [7]:
#creating feature matrix from xvars and dummy columns
X = pd.concat([X,
ms_subclass_dummies,
neighborhood_dummies,
condition_1_dummies,
exter_qual_dummies,
kitchen_qual_dummies
              ], axis=1)

In [8]:
#splitting the data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=2020)

In [9]:
linreg = LinearRegression()

In [10]:
linreg.fit(X_train,y_train)

LinearRegression()

In [11]:
#looking at 5-fold cross-validation on the training data. The mean R2 is quite good at 0.90
linreg_scores = cross_val_score(linreg, X_train, y_train, cv=5)
linreg_scores.mean()

0.8973593002549061
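The mean hides how much the score varies from fold to fold. The sketch below prints the per-fold R2 values and their spread. It runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the housing data). With an integer `cv` and a regressor, scikit-learn uses unshuffled `KFold`; the explicit shuffled `KFold` here is an assumption made for the example, not what the cell above does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Shuffled folds with a fixed seed so the scores are reproducible
folds = KFold(n_splits=5, shuffle=True, random_state=2020)
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=folds)

print(fold_scores)         # one R2 value per fold
print(fold_scores.mean())  # the summary number reported above
print(fold_scores.std())   # spread across folds
```

A large standard deviation relative to the mean would suggest the score depends heavily on which rows land in which fold.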

In [12]:
display(linreg.score(X_train,y_train))
display(linreg.score(X_test,y_test))

0.9065190790030951

0.8732924121487826
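R2 is unitless, so it can help to also look at the error in dollars. A minimal sketch of root mean squared error follows; the prices are made up for illustration, and `rmse` is a hypothetical helper rather than something defined in this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices, for illustration only
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 310000]

# Every prediction is off by exactly 10,000, so the RMSE is 10000.0
print(rmse(actual, predicted))
```

With the objects in this notebook, the equivalent call would be `rmse(y_test, linreg.predict(X_test))`.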

In [13]:
X_train_sm = X_train
X_train_sm = sm.add_constant(X_train_sm)
y_train_sm = y_train

In [14]:
sm_model = sm.OLS(y_train_sm,X_train_sm).fit()

In [15]:
results_summary = sm_model.summary()

In [16]:
results_summary.tables[0]

0,1,2,3
Dep. Variable:,SalePrice,R-squared:,0.907
Model:,OLS,Adj. R-squared:,0.903
Method:,Least Squares,F-statistic:,260.7
Date:,"Sun, 16 Aug 2020",Prob (F-statistic):,0.0
Time:,18:08:14,Log-Likelihood:,-18640.0
No. Observations:,1618,AIC:,37400.0
Df Residuals:,1559,BIC:,37720.0
Df Model:,58,,
Covariance Type:,nonrobust,,
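As a sanity check, the adjusted R2 in the summary can be reproduced from the other numbers shown: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of model degrees of freedom. The sketch below plugs in the values printed above; the variable names are only illustrative.

```python
# Values taken from the outputs above
r2 = 0.9065190790030951  # training R2 from linreg.score
n_obs = 1618             # No. Observations
n_features = 58          # Df Model (intercept not counted)

# Adjusted R2 penalizes R2 for the number of fitted parameters
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)
print(round(adj_r2, 3))  # 0.903, matching the summary table
```

The denominator `n_obs - n_features - 1` equals 1559, which agrees with the Df Residuals row.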


In [17]:
results_as_html = results_summary.tables[1].as_html()
coef = pd.read_html(results_as_html, header=0, index_col=0)[0]

In [18]:
coef.head(8)

Unnamed: 0,coef,std err,t,P>|t|,[0.025,0.975]
const,-323400.0,97700.0,-3.309,0.001,-515000.0,-132000.0
Lot Area,2.0067,0.248,8.106,0.0,1.521,2.492
Overall Cond,4982.0127,705.249,7.064,0.0,3598.676,6365.349
Overall Qual,9892.2874,904.906,10.932,0.0,8117.327,11700.0
Total SF,33.7703,1.473,22.923,0.0,30.881,36.66
Garage Area,24.3281,4.217,5.769,0.0,16.057,32.599
Year Remod/Add,134.8438,49.457,2.726,0.006,37.834,231.853
Total Bathrooms,11310.0,1176.062,9.62,0.0,9006.48,13600.0
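Because the model is linear, each coefficient reads as the predicted change in SalePrice (in dollars) for a one-unit increase in that feature, holding the others fixed. A small worked example using coefficients copied from the table above; the change applied to the house is hypothetical.

```python
# Coefficients copied from the table above (dollars per unit)
coefs = {"Total SF": 33.7703, "Garage Area": 24.3281, "Overall Qual": 9892.2874}

# Hypothetical change to a house: +100 sq ft of total area, +1 quality point
changes = {"Total SF": 100, "Overall Qual": 1}

# Linear model: total effect is the sum of coefficient * change
delta = sum(coefs[name] * amount for name, amount in changes.items())
print(f"Predicted price change: ${delta:,.2f}")
```

This only describes the fitted model's predictions; it is not a causal estimate of what a renovation would add to a sale price.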


### Conclusions from modeling

The train R2 for this model is 0.907 (test R2 is 0.873). I think the fit could be improved, but it is a good starting point and will serve as the benchmark. The model is also very interpretable, which is a nice feature.

### Calculating sale prices for the Kaggle testing data (need to output .csv w/ header Id,SalePrice)

In [19]:
X_kaggle = testing_data[xvars]

In [20]:
kaggle_ms_subclass_dummies = pd.get_dummies(testing_data["MS SubClass"],prefix="SubClass")
kaggle_ms_subclass_dummies.drop(columns=["SubClass_40"],inplace=True)

kaggle_neighborhood_dummies = pd.get_dummies(testing_data["Neighborhood"])
kaggle_neighborhood_dummies.drop(columns=["Greens","Blueste"],inplace=True)

kaggle_condition_1_dummies = pd.get_dummies(testing_data["Condition 1"])
kaggle_condition_1_dummies.drop(columns=["RRNe","RRNn"],inplace=True)

kaggle_exter_qual_dummies = pd.get_dummies(testing_data["Exter Qual"],prefix="ExQ")
kaggle_exter_qual_dummies.drop(columns="ExQ_Fa",inplace=True)

kaggle_kitchen_qual_dummies = pd.get_dummies(testing_data["Kitchen Qual"],prefix="Kit")
kaggle_kitchen_qual_dummies.drop(columns=["Kit_Fa","Kit_Po"],inplace=True)

In [21]:
#creating feature matrix from xvars and dummy columns
X_kaggle = pd.concat([X_kaggle,
kaggle_ms_subclass_dummies,
kaggle_neighborhood_dummies,
kaggle_condition_1_dummies,
kaggle_exter_qual_dummies,
kaggle_kitchen_qual_dummies
              ], axis=1)

In [22]:
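#finding differences between train and testing data columns. They need to match (same names, same order) for predict to make sense
if len(X_train.columns) != len(X_kaggle.columns):
    print("column counts differ:", len(X_train.columns), len(X_kaggle.columns))
for i, (train_col, kaggle_col) in enumerate(zip(X_train.columns, X_kaggle.columns)):
    if train_col != kaggle_col:
        print(i, train_col, kaggle_col)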
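Dropping mismatched dummy columns by hand works, but it has to be redone whenever the category levels change. One alternative, sketched here on toy frames (the `train` and `test` values are hypothetical), is to reindex the test dummies against the training columns so the two matrices line up by construction. This is a possible approach, not the one used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the training and Kaggle frames (hypothetical values)
train = pd.DataFrame({"Neighborhood": ["NAmes", "Gilbert", "Greens"]})
test = pd.DataFrame({"Neighborhood": ["Gilbert", "NAmes", "Landmrk"]})

train_dummies = pd.get_dummies(train["Neighborhood"])
test_dummies = pd.get_dummies(test["Neighborhood"])

# Keep exactly the training columns, in training order:
# levels unseen in training are dropped, and levels missing
# from the test data become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
```

Applied to this notebook, the same idea would be `X_kaggle.reindex(columns=X_train.columns, fill_value=0)`, assuming the model was fit on `X_train` as above.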

In [23]:
price_X_testing = linreg.predict(X_kaggle)
testing_data["SalePrice"] = price_X_testing
testing_data.head(1)

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Total SF x Garage Area,Overall Qual^2,Year Remod/Add x Total SF,Total SF^2,Overall Qual x Garage Area,Overall Qual x Total Bathrooms,Total SF x Total Bathrooms,Overall Qual x Year Remod/Add,Total Bathrooms x Garage Area,SalePrice
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,1297120,36,5748600,8690704,2640,12.0,5896.0,11700,880.0,158934.991398


In [24]:
ols_basic_fit = testing_data[["Id","SalePrice"]]
ols_basic_fit.to_csv("../datasets/ols_basic_fit.csv",index=False)