## Multiple Linear Regression

Regression is one of the most commonly used methods when an analyst is trying to explain or predict the outcome of an event, such as the sale price of a house, which is the focus of this notebook. We encourage you to create your own Jupyter notebook and follow along. You can also download this notebook together with any affiliated data in the [Notebooks and Data](https://github.com/Master-of-Business-Analytics/Notebooks_and_Data) GitHub repository. Alternatively, if you do not have Python or Jupyter Notebook installed yet, you may experiment with a virtual notebook by launching Binder or Syzygy below (learn more about these two tools in the [Resource](https://analytics-at-sauder.github.io/resource.html) tab).

<a href="https://ubc.syzygy.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FAnalytics-at-Sauder%2FProject_16_MLR&urlpath=tree%2FProject_16_MLR%2Fp16_mlr.ipynb&branch=master" target="_blank" class="button">Launch Syzygy (UBC)</a>

<a href="https://pims.syzygy.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FAnalytics-at-Sauder%2FProject_16_MLR&urlpath=tree%2FProject_16_MLR%2Fp16_mlr.ipynb&branch=master" target="_blank" class="button">Launch Syzygy (Google)</a>

<a href="https://mybinder.org/v2/gh/Analytics-at-Sauder/Project_16_MLR/master?filepath=p16_mlr.ipynb" target="_blank" class="button">Launch Binder</a>

In [14]:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
import matplotlib.pyplot as plt

First, let's look at the data to see what it contains.

In [8]:
raw_df = pd.read_csv("data/train.csv")
raw_df.head()

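| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |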


In [13]:
# Number of feature columns in the dataframe (minus 2 for the Id and SalePrice columns)
num_feature = len(raw_df.columns) - 2

In [20]:
# Check the count, size, and number of unique values of each variable to help select features
info = raw_df.agg(['count', 'size', 'nunique'])
info

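| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460 | 1460 | 1460 | 1201 | 1460 | 1460 | 91 | 1460 | 1460 | 1460 | ... | 1460 | 7 | 281 | 54 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
| size | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
| nunique | 1460 | 15 | 5 | 110 | 1073 | 2 | 2 | 4 | 4 | 2 | ... | 8 | 3 | 4 | 4 | 21 | 12 | 5 | 9 | 6 | 663 |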
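
The gap between `count` and `size` in the summary above is the number of missing values in each column, which is one practical signal for feature selection. The sketch below shows one possible way to turn that into a missing share and flag sparse columns. The helper name `summarize_columns`, the 50% threshold, and the toy frame are all assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Per-column non-null count, missing share, and number of distinct values."""
    summary = pd.DataFrame({
        "count": df.count(),               # non-null entries per column
        "missing_share": df.isna().mean(), # fraction of rows that are missing
        "nunique": df.nunique(),           # distinct non-null values
    })
    # Flag columns whose missing share exceeds the (assumed) threshold
    summary["too_sparse"] = summary["missing_share"] > max_missing
    return summary


# Toy frame standing in for raw_df; the values are illustrative only
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
})
toy_summary = summarize_columns(toy_df)
print(toy_summary)
```

On the real data, `summarize_columns(raw_df)` would be the equivalent call. Columns such as `Alley` or `PoolQC`, which have very few non-null entries, would likely be flagged, and columns with very few distinct values may carry little information for a linear model.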


### Step 1: Include two sample features within the dataframe

In [60]:
# Use the lot area and the year of sale as the x values, and SalePrice as the y value
x = raw_df[['LotArea', 'YrSold']].to_numpy()
y = raw_df['SalePrice'].to_numpy(dtype=int)

In [62]:
model = LinearRegression().fit(x, y)
# Get r squared value
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)

# Get intercept
print('intercept:', model.intercept_)

# Get coefficients
print('coefficients:', model.coef_)


coefficient of determination: 0.07024646007626767
intercept: 3181205.713858099
coefficients: [    2.09711551 -1505.28728351]
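
Written out, the printed intercept and coefficients correspond to the fitted equation below. The values are rounded, and the hat notation is introduced here for illustration rather than taken from the notebook.

$$
\widehat{\text{SalePrice}} \approx 3{,}181{,}205.71 + 2.0971 \cdot \text{LotArea} - 1505.2873 \cdot \text{YrSold}
$$

As a rough check, plugging in the first row of the data (LotArea = 8450, YrSold = 2008) gives

$$
3{,}181{,}205.71 + 2.0971 \times 8450 - 1505.2873 \times 2008 \approx 176{,}309,
$$

compared with an actual sale price of 208,500 for that row.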


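These results suggest that something is wrong with the model. The coefficient of determination is only about 0.07, so LotArea and YrSold together explain roughly 7% of the variation in sale price. The coefficient associated with the year is also negative, implying that prices drop by about 1,505 for each later year of sale, yet in real life we see house prices rising substantially from year to year. Note too that YrSold takes only five distinct values in this dataset, so it offers little variation to learn from. We therefore need to look for more appropriate features to include in the model.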
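
One simple way to start that search is to rank the numeric columns by the strength of their linear correlation with the target. The sketch below is a minimal illustration, not the notebook's own method: the helper name `rank_numeric_features` is hypothetical, and the data is synthetic (a price that depends on area but not on year) so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def rank_numeric_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Absolute Pearson correlation of each numeric column with the target, descending."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic data (assumption for illustration): price is driven by area, not by year
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "area": rng.uniform(50, 300, n),
    "year": rng.integers(2006, 2011, n),
})
demo["price"] = 1000 * demo["area"] + rng.normal(0, 5000, n)

ranking = rank_numeric_features(demo, "price")
print(ranking)
```

With the notebook's data, the equivalent call would be something like `rank_numeric_features(raw_df.drop(columns="Id"), "SalePrice")`. Correlation only captures pairwise linear relationships, so it is a screening step rather than a final selection rule.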