# Multiple Linear Regression : House Prices

This lab demonstrates building a multiple linear regression model to predict house prices.

## Step 1 - Load Data

- The first step is to import the pandas module, which provides dataframes, and the `LinearRegression` class from scikit-learn.
- We also set the number of decimal places to display.
- The dataset used here is downloadable from the URL in the lab notes.
- After reading it in, we display the first five rows to check that the data loaded correctly.

In [4]:
import pandas as pd
from sklearn.linear_model import LinearRegression





In [2]:
dataset = 'datasets/house-sales-full.csv'
pd.options.display.float_format = '{:,.2f}'.format
house_prices = pd.read_csv(dataset)
house_prices.head()

Unnamed: 0,DocumentID,Date,SalePrice,PropertyID,PropertyType,ym,zhvi_px,zhvi_idx,AdjSalePrice,NbrLivingUnits,...,Bathrooms,Bedrooms,BldgGrade,YrBuilt,YrRenovated,TrafficNoise,LandVal,ImpsVal,ZipCode,NewConstruction
0,1,9/16/14,280000,1000102,Multiplex,9/1/14,405100,0.93,300805.0,2,...,3.0,6,7,1991,0,0,70000,229000,98002,False
1,2,6/16/06,1000000,1200013,Single Family,6/1/06,404400,0.93,1076162.0,1,...,3.75,4,10,2005,0,0,203000,590000,98166,True
2,3,1/29/07,745000,1200019,Single Family,1/1/07,425600,0.98,761805.0,1,...,1.75,4,8,1947,0,0,183000,275000,98166,False
3,4,2/25/08,425000,2800016,Single Family,2/1/08,418400,0.96,442065.0,1,...,3.75,5,7,1966,0,0,104000,229000,98168,False
4,5,3/29/13,240000,2800024,Single Family,3/1/13,351600,0.81,297065.0,1,...,1.75,4,7,1948,0,0,104000,205000,98168,False


## Step 2 - Explore Data

Before we do any training, we usually want to get some idea of what the data looks like.
- A number of columns probably won't be useful as input features
- `DocumentID` is an identifier, not a characteristic of the house
- `Date` is probably the date of the document
- We may need an explanation of what some of the columns mean, like `zhvi_idx`

The `describe` function can be used to give us some basic statistics about the data set.

We can explore questions motivated by the problem domain like:

- What is the maximum number of bedrooms?
- What are the min/max of 'SalePrice'?
- Do we have outliers in data?

In [10]:
house_prices.describe()


Unnamed: 0,DocumentID,SalePrice,PropertyID,zhvi_px,zhvi_idx,AdjSalePrice,NbrLivingUnits,SqFtLot,SqFtTotLiving,SqFtFinBasement,Bathrooms,Bedrooms,BldgGrade,YrBuilt,YrRenovated,TrafficNoise,LandVal,ImpsVal,ZipCode
count,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0,27063.0
mean,13532.0,511626.2,4680324882.08,390750.58,0.9,570918.76,1.02,10997.68,2122.96,275.3,2.26,3.38,7.78,1977.09,86.73,0.21,213820.07,317211.67,82223.04
std,7812.56,342821.17,2896350979.15,37024.46,0.09,380236.63,0.15,28110.66,939.84,428.71,0.77,0.9,1.18,30.92,407.32,0.55,177213.41,234038.34,36106.67
min,1.0,3000.0,1000102.0,311600.0,0.72,3368.0,1.0,494.0,370.0,0.0,0.0,0.0,3.0,1900.0,0.0,0.0,0.0,0.0,-1.0
25%,6766.5,329000.0,2213000057.5,357100.0,0.82,366918.5,1.0,4257.5,1440.0,0.0,1.75,3.0,7.0,1954.0,0.0,0.0,105000.0,183000.0,98019.0
50%,13532.0,425000.0,3972900140.0,400600.0,0.92,475664.0,1.0,6636.0,1940.0,0.0,2.5,3.0,8.0,1986.0,0.0,0.0,172000.0,261000.0,98053.0
75%,20297.5,590000.0,7504001385.0,421200.0,0.97,655061.0,1.0,9450.0,2610.0,510.0,2.5,4.0,8.0,2006.0,0.0,0.0,258000.0,380000.0,98115.0
max,27063.0,11000000.0,9906000035.0,435200.0,1.0,11644855.0,5.0,1024068.0,10740.0,3500.0,8.0,33.0,13.0,2016.0,2016.0,3.0,5538000.0,5772000.0,98354.0
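
The questions above can also be answered directly instead of being read off the `describe()` table. The sketch below uses a small hand-made frame (the values are illustrative; only the minimum and maximum mirror the output above) and flags outliers with the 1.5 × IQR rule. That rule is one common convention, assumed here for illustration, not something the lab prescribes.

```python
import pandas as pd

# Illustrative stand-in for house_prices; min/max mirror the describe() output above
sample = pd.DataFrame({
    "Bedrooms": [0, 3, 3, 4, 3, 4, 2, 33],
    "SalePrice": [3000, 329000, 425000, 590000, 410000, 615000, 298000, 11000000],
})

# Direct answers to the min/max questions
max_bedrooms = sample["Bedrooms"].max()
price_range = (sample["SalePrice"].min(), sample["SalePrice"].max())


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Rows whose bedroom count falls outside the IQR fences
bedroom_outliers = sample[iqr_outliers(sample["Bedrooms"])]

print("max bedrooms:", max_bedrooms)
print("price range:", price_range)
print(bedroom_outliers)
```

On the real data, `sample` would be replaced by `house_prices`. A row with 33 bedrooms or a sale price of 3,000 is worth inspecting before training.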


#### Data Type Issues

- Notice that some columns are missing from the output. Columns such as 'Date' hold strings rather than numbers, so statistics cannot be computed for them.
- Some data is categorical, like property ID and zip code, but is represented by numeric labels.
- `describe()` just assumed these were interval data and computed the statistics anyway.
- Obviously the mean of the zip codes is meaningless even though it was computed
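
One possible way to handle this is sketched below on a tiny frame whose values are copied from the first rows shown in Step 1. It lists the non-numeric columns with `select_dtypes` and converts the zip code to a pandas `category`, so that `describe()` reports counts and the most frequent value instead of a mean. Treating the zip code this way is an assumption for illustration, not a step the lab performs.

```python
import pandas as pd

# Tiny illustrative frame; values copied from the first rows shown in Step 1
df = pd.DataFrame({
    "Date": ["9/16/14", "6/16/06", "1/29/07"],
    "SalePrice": [280000, 1000000, 745000],
    "ZipCode": [98002, 98166, 98166],
})

# Columns that describe() skips by default because they are not numeric
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Treat the zip code as a label rather than a quantity
df["ZipCode"] = df["ZipCode"].astype("category")

numeric_summary = df.describe()         # ZipCode no longer appears here
zip_summary = df["ZipCode"].describe()  # count, unique, top, freq

print(non_numeric)
print(numeric_summary.columns.tolist())
print(zip_summary)
```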

#### Correlation exploration

- We could also explore possible correlations between columns by using `house_prices.corr()`
- In pandas 2.0 and later this raises an error on this dataset, because correlations can only be computed between numeric values and some columns hold strings
- Instead, we can compute the correlation between sale price and a few interesting columns and see if anything pops out



In [12]:

house_prices[['SqFtLot', 'SqFtTotLiving', 'Bedrooms']].corrwith(house_prices['SalePrice'])


SqFtLot         0.14
SqFtTotLiving   0.68
Bedrooms        0.32
dtype: float64
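
As an alternative to picking columns by hand, `DataFrame.corr` accepts `numeric_only=True`, which skips string columns. The sketch below shows this on a frame built from the five rows displayed in Steps 1 and 3. Five rows are far too few for meaningful correlations, so the numbers it prints only illustrate the mechanics.

```python
import pandas as pd

# Illustrative frame built from the five rows displayed in Step 1 and Step 3
df = pd.DataFrame({
    "PropertyType": ["Multiplex", "Single Family", "Single Family",
                     "Single Family", "Single Family"],
    "SqFtTotLiving": [2400, 3764, 2060, 3200, 1720],
    "Bedrooms": [6, 4, 4, 5, 4],
    "SalePrice": [280000, 1000000, 745000, 425000, 240000],
})

# numeric_only=True leaves the string column out of the correlation matrix
corr_matrix = df.corr(numeric_only=True)

# The SalePrice column of the matrix gives the same kind of view as corrwith
price_corr = corr_matrix["SalePrice"].drop("SalePrice").sort_values(ascending=False)

print(price_corr)
```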

## Step 3 - Shape Data

- It seems clear that not every column in the data will be useful in the model. The correlation analysis suggests which columns are more strongly correlated with sale price.
- Because this is supervised learning, we have to pull out the target column, `SalePrice`, and use it as the label
- For a first cut at a model **=> Select only "SalePrice", "Bedrooms", "Bathrooms", "SqFtTotLiving", "SqFtLot", "LandVal"**

In [13]:
input_columns = ['Bedrooms', 'Bathrooms', 'SqFtTotLiving', 'SqFtLot', 'LandVal']
label_column = ['SalePrice']

house_prices2 = house_prices[input_columns + label_column]
house_prices2

Unnamed: 0,Bedrooms,Bathrooms,SqFtTotLiving,SqFtLot,LandVal,SalePrice
0,6,3.00,2400,9373,70000,280000
1,4,3.75,3764,20156,203000,1000000
2,4,1.75,2060,26036,183000,745000
3,5,3.75,3200,8618,104000,425000
4,4,1.75,1720,8620,104000,240000
...,...,...,...,...,...,...
27058,2,1.75,1410,1161,147000,374000
27059,2,1.75,1410,1005,147000,374000
27060,4,1.00,1070,11170,92000,165000
27061,3,2.00,1345,6223,103000,315000


### Create the variables
- To use the regression library, we need to have an input x and set of labels y

In [14]:
x = house_prices2[input_columns]
y = house_prices2[label_column]

print ("x.shape = ", x.shape)
print ("y.shape = ", y.shape)

x.shape =  (27063, 5)
y.shape =  (27063, 1)
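
The shape `(27063, 1)` comes from selecting the label with a list, which returns a DataFrame instead of a Series. The sketch below, on five rows copied from the table in Step 3, shows how that choice carries through to the fitted model: a 2-D target gives a 2-D `coef_`, which is why later cells index it as `model.coef_[0]`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Five rows copied from the table in Step 3, for illustration only
df = pd.DataFrame({
    "Bedrooms": [6, 4, 4, 5, 4],
    "SqFtTotLiving": [2400, 3764, 2060, 3200, 1720],
    "SalePrice": [280000, 1000000, 745000, 425000, 240000],
})
features = ["Bedrooms", "SqFtTotLiving"]

y_frame = df[["SalePrice"]]   # list selector  -> DataFrame, shape (5, 1)
y_series = df["SalePrice"]    # single label   -> Series, shape (5,)

model_2d = LinearRegression().fit(df[features], y_frame)
model_1d = LinearRegression().fit(df[features], y_series)

print(y_frame.shape, model_2d.coef_.shape)    # 2-D target -> 2-D coef_
print(y_series.shape, model_1d.coef_.shape)   # 1-D target -> 1-D coef_

# Both fits find the same coefficients; only the array shape differs
print(np.allclose(model_2d.coef_[0], model_1d.coef_))
```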


## Step 4 : Run Regression

- This is the simplest step: we use the scikit-learn library to create and fit a regression model with its default settings

In [15]:
model = LinearRegression().fit(x, y)
model
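
To give a rough idea of what `fit` computes: ordinary least squares chooses the intercept and coefficients that minimize the sum of squared differences between predictions and labels. The sketch below solves the same problem with `numpy.linalg.lstsq` and compares the result with scikit-learn. The data is synthetic and the variable names are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn fit
sk_model = LinearRegression().fit(X, y)

# Same problem solved directly: prepend a column of ones for the intercept,
# then find the weights w that minimize ||A @ w - y||^2
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print("sklearn :", sk_model.intercept_, sk_model.coef_)
print("lstsq   :", w[0], w[1:])
```

Both approaches should agree up to floating-point error, and the recovered weights should sit close to the values used to generate the data.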

## Step 5 : Predict

- With the model, we can see how good our predictor is by creating a set of predictions and comparing those to the actual labels.

In [16]:
predictions = model.predict(x)
predictions

array([[313206.30930952],
       [772741.67412942],
       [420944.98520074],
       ...,
       [140648.78714461],
       [276488.84419071],
       [419174.00295527]], shape=(27063, 1))

In [17]:
pd.options.display.float_format = '{:,.2f}'.format

# Work on a copy so that adding a column does not trigger a SettingWithCopyWarning
a = house_prices2[input_columns + label_column].copy()
a['predicted_price'] = predictions
a

Unnamed: 0,Bedrooms,Bathrooms,SqFtTotLiving,SqFtLot,LandVal,SalePrice,predicted_price
0,6,3.00,2400,9373,70000,280000,313206.31
1,4,3.75,3764,20156,203000,1000000,772741.67
2,4,1.75,2060,26036,183000,745000,420944.99
3,5,3.75,3200,8618,104000,425000,537956.60
4,4,1.75,1720,8620,104000,240000,279323.64
...,...,...,...,...,...,...,...
27058,2,1.75,1410,1161,147000,374000,364174.50
27059,2,1.75,1410,1005,147000,374000,364156.80
27060,4,1.00,1070,11170,92000,165000,140648.79
27061,3,2.00,1345,6223,103000,315000,276488.84
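
Comparing the two columns by eye only goes so far. A minimal sketch of a numeric comparison is shown below, using the first five rows of the table above. The column names `residual` and `abs_pct_error` are made up for this example.

```python
import pandas as pd

# SalePrice and predicted_price for the first five rows of the table above
compare = pd.DataFrame({
    "SalePrice": [280000, 1000000, 745000, 425000, 240000],
    "predicted_price": [313206.31, 772741.67, 420944.99, 537956.60, 279323.64],
})

# Positive residual = the model under-predicted; negative = over-predicted
compare["residual"] = compare["SalePrice"] - compare["predicted_price"]
compare["abs_pct_error"] = compare["residual"].abs() / compare["SalePrice"] * 100

# Average size of the miss, in the same units as the price
mean_abs_error = compare["residual"].abs().mean()

print(compare)
print("mean absolute error on these rows:", round(mean_abs_error, 2))
```

The same two lines applied to the full frame `a` would give the error over all rows.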


## Step 6 : Evaluate

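**Q==> Are any coefficients close to zero?  What does that mean?**

- Each coefficient is the change in predicted price for a one-unit increase in that input, holding the other inputs fixed
- A coefficient near zero means that one unit of that feature has little effect on the prediction
- In this data, the inputs are not scaled, so the coefficients are in different units (per bedroom versus per square foot) and cannot be compared directly
- Normalizing the data would make the coefficients comparable, although for ordinary least squares it does not change the predictions themselves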
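
To make the point about scale concrete, the sketch below fits the same model on raw and on standardized inputs. It uses synthetic data and `StandardScaler`, both of which are assumptions for illustration and not part of the lab. For ordinary least squares, standardizing changes the size of the coefficients, so they become comparable, but leaves R² unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative, not the lab data)
rng = np.random.default_rng(42)
bedrooms = rng.integers(1, 6, size=500).astype(float)
sqft_lot = rng.uniform(1_000, 50_000, size=500)
price = 20_000 * bedrooms + 5 * sqft_lot + rng.normal(scale=10_000, size=500)
X = np.column_stack([bedrooms, sqft_lot])

raw = LinearRegression().fit(X, price)

# Rescale each column to mean 0 and standard deviation 1, then refit
X_std = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_std, price)

print("raw coefficients   :", raw.coef_)      # per bedroom vs per square foot
print("scaled coefficients:", scaled.coef_)   # per one standard deviation
print("R2 raw / scaled    :", raw.score(X, price), scaled.score(X_std, price))
```

In this synthetic setup the lot-size coefficient looks tiny on the raw scale, yet it becomes the larger of the two once both inputs are measured in standard deviations. A small raw coefficient is therefore not proof that a feature is unimportant.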

**Q==> Also inspect the R² value.  Is it decent?**

- The R² value (coefficient of determination) measures how much of the variability in the dependent variable (y) is explained by the independent variables (x).
- R² = 1 → Perfect fit (model explains all variance)
- R² = 0 → No explanatory power (model is no better than always predicting the mean)
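
The definition can be checked by hand: R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the labels from their mean. The sketch below applies this to the five rows displayed in Step 5 and compares the result with `r2_score`. Because it uses only five rows, the value it prints will not match the R² of the full dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Actual and predicted prices for the five rows displayed in Step 5
y_true = np.array([280000, 1000000, 745000, 425000, 240000], dtype=float)
y_pred = np.array([313206.31, 772741.67, 420944.99, 537956.60, 279323.64])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean gives an R2 of 0 (up to rounding)
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print("manual R2 :", r2_manual)
print("sklearn R2:", r2_score(y_true, y_pred))
print("baseline  :", r2_mean_baseline)
```

R² can also be negative, which happens when a model does worse than always predicting the mean.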

In [18]:
import numpy as np
np.set_printoptions(precision=2, suppress=True)

print ("coefficients: " , model.coef_)
print ("intercept : ", model.intercept_)

coefficients:  [[-39286.34  45993.71    139.14      0.11      1.17]]
intercept :  [-5881.37]


In [19]:
## print each feature and its coefficient
coef = pd.DataFrame({"input_column" : input_columns,  
                     "coefficient": model.coef_[0]})
coef

Unnamed: 0,input_column,coefficient
0,Bedrooms,-39286.34
1,Bathrooms,45993.71
2,SqFtTotLiving,139.14
3,SqFtLot,0.11
4,LandVal,1.17


In [21]:
from sklearn.metrics import r2_score


print ("R2 : " , r2_score(y, predictions))


R2 :  0.7618151215674123


## Step 7 : Predict on New Data

- Now generate new data and use the model for predictions

In [24]:
new_data = pd.DataFrame({'Bedrooms' : [5,3,2],
                         'Bathrooms' : [3,2,1.5],
                         'SqFtTotLiving' : [4400, 1800, 1550],
                         'SqFtLot' : [10000, 5000, 4000],
                         'LandVal' : [150000, 80000, 20000]
             })
new_data

Unnamed: 0,Bedrooms,Bathrooms,SqFtTotLiving,SqFtLot,LandVal
0,5,3.0,4400,10000,150000
1,3,2.0,1800,5000,80000
2,2,1.5,1550,4000,20000


In [25]:
new_prediction = model.predict(new_data)
new_prediction

array([[724355.59],
       [312777.54],
       [224038.85]])

In [26]:
pd.options.display.float_format = '{:,.2f}'.format

new_data['predicted_price'] = new_prediction
new_data

Unnamed: 0,Bedrooms,Bathrooms,SqFtTotLiving,SqFtLot,LandVal,predicted_price
0,5,3.0,4400,10000,150000,724355.59
1,3,2.0,1800,5000,80000,312777.54
2,2,1.5,1550,4000,20000,224038.85
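
A linear model's prediction is just the intercept plus each coefficient multiplied by its input. The sketch below redoes the three predictions above by hand using the coefficients and intercept printed in Step 6. Those printed values are rounded to two decimals, so the results should agree with `model.predict` only approximately.

```python
import numpy as np

# Rounded values as printed in Step 6 (order: Bedrooms, Bathrooms,
# SqFtTotLiving, SqFtLot, LandVal)
coefficients = np.array([-39286.34, 45993.71, 139.14, 0.11, 1.17])
intercept = -5881.37

# The three new houses from the cells above
new_houses = np.array([
    [5, 3.0, 4400, 10000, 150000],
    [3, 2.0, 1800, 5000, 80000],
    [2, 1.5, 1550, 4000, 20000],
])

# prediction = intercept + sum(coefficient_i * feature_i)
manual = intercept + new_houses @ coefficients

# Values returned by model.predict in the cell above
reported = np.array([724355.59, 312777.54, 224038.85])
relative_gap = np.abs(manual - reported) / reported

print(manual)
print(relative_gap)
```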


## Step 8 : Improve Model Performance

- We have now done an end-to-end implementation of regression.
- However, our accuracy isn't all that great!

**Q ==> What can we do to improve our model?**

- One option is to choose better or more input columns.  
- Examine the data and suggest other columns
- The data could be scaled so that all parameters are in the range [0,1]
- We could convert strings like dates and booleans like new construction into numerics to see if they have an effect
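
The last two suggestions could look roughly like the sketch below, which uses values copied from the first rows shown in Step 1. The date format (month/day/two-digit year) is inferred from those rows and is an assumption. The new column names `SaleYear`, `SaleMonth` and `SqFtTotLiving_scaled` are made up for this example.

```python
import pandas as pd

# Values copied from the first rows shown in Step 1
df = pd.DataFrame({
    "Date": ["9/16/14", "6/16/06", "1/29/07"],
    "NewConstruction": [False, True, False],
    "SqFtTotLiving": [2400, 3764, 2060],
})

# Parse the date strings (assumed month/day/two-digit year),
# then derive numeric parts that a model can consume
parsed = pd.to_datetime(df["Date"], format="%m/%d/%y")
df["SaleYear"] = parsed.dt.year
df["SaleMonth"] = parsed.dt.month

# Booleans become 0/1
df["NewConstruction"] = df["NewConstruction"].astype(int)

# Min-max scaling to [0, 1]; a constant column would divide by zero
# and is not handled in this sketch
col = df["SqFtTotLiving"]
df["SqFtTotLiving_scaled"] = (col - col.min()) / (col.max() - col.min())

# Drop the original string column before passing the frame to a model
model_ready = df.drop(columns=["Date"])
print(model_ready)
```

Whether any of these derived columns actually improves the fit has to be checked by refitting the model and comparing R².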
