# Multiple Linear Regression : House Prices

## Step 1 - Load Data

In [None]:
import os
import urllib.request

data_location = "house-sales-full.csv"
data_url = 'https://elephantscale-public.s3.amazonaws.com/data/house-prices/house-sales-full.csv'


# Download the file only if a local copy does not already exist
if not os.path.exists(data_location):
    print("Downloading : ", data_url)
    urllib.request.urlretrieve(data_url, data_location)
print('data_location:', data_location)

In [None]:
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format

house_prices = pd.read_csv(data_location)
house_prices

## Step 2 - Explore Data (EDA)
EDA is a great way to get a sense of the data.  

Try to answer the following questions by looking at the output of `describe` below:

- What is the maximum number of bedrooms? :-)
- What are the min and max of 'SalePrice'?
- Do you think we have outliers in the data?

In [None]:
house_prices.describe()
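
The outlier question above can also be checked numerically. The sketch below uses the common 1.5 × IQR rule, which is one convention among several and not something this notebook prescribes. The helper name `iqr_outliers` and the sample values are made up for illustration; in the notebook you would pass a real column such as `house_prices['Bedrooms']`.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up numbers for illustration only; not taken from the dataset.
sample = pd.DataFrame({"Bedrooms": [2, 3, 3, 4, 3, 33]})

bedroom_outliers = iqr_outliers(sample["Bedrooms"])
print(bedroom_outliers)
```

A flagged value is only a candidate outlier; whether to drop or keep it depends on whether it is a data-entry error or a genuine, unusual house.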

In [None]:
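## any correlation?
## numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
house_prices.corr(numeric_only=True)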

## Step 3 - Shape Data
Wow! That's a lot of columns.  Maybe we should focus on just a few of them.

**=> Select only "SalePrice", "Bedrooms", "Bathrooms", "SqFtTotLiving", "SqFtLot"**

In [None]:
input_columns= ['Bedrooms', 'Bathrooms', 'SqFtTotLiving', 'SqFtLot']
label_column = ['SalePrice']

house_prices2 = house_prices[input_columns + label_column]
house_prices2

In [None]:
x = house_prices2[input_columns]
y = house_prices2[label_column]

print ("x.shape = ", x.shape)
print ("y.shape = ", y.shape)

## Step 4 : Run Regression

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression ().fit(x,y)
model
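
For reference, the model fitted above has the standard multiple linear regression form, written here with the four input columns from Step 3. The symbols $\beta_0$ (intercept), $\beta_1 \dots \beta_4$ (coefficients), and $\varepsilon$ (error term) are notation introduced for this note, not names used in the notebook:

$$
\text{SalePrice} = \beta_0 + \beta_1\,\text{Bedrooms} + \beta_2\,\text{Bathrooms} + \beta_3\,\text{SqFtTotLiving} + \beta_4\,\text{SqFtLot} + \varepsilon
$$

After fitting, `model.intercept_` holds the estimate of $\beta_0$ and `model.coef_` holds the estimates of $\beta_1 \dots \beta_4$.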

## Step 5 : Predict

In [None]:
predictions = model.predict(x)
predictions

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

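# Work on a copy so adding a column does not trigger SettingWithCopyWarning
a = house_prices2[input_columns + label_column].copy()
# predictions has shape (n, 1) because y is a DataFrame; flatten it to 1-D
a['predicted_price'] = predictions.ravel()
a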

## Step 6 : Evaluate

**Q==> Are any coefficients close to zero?  What does that mean?**

**Q==> Also inspect R2 value.  Is it decent?**

In [None]:
import numpy as np
np.set_printoptions(precision=2, suppress=True)

print ("coefficients: " , model.coef_)
print ("intercept : ", model.intercept_)

In [None]:
## print each feature and its coefficient
coef = pd.DataFrame({"input_column" : input_columns,  
                     "coefficient": model.coef_[0]})
coef

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

print ("R2 : " , r2_score(y, predictions))
print ("MSE : ", mean_squared_error(y, predictions))
print ("RMSE : ", sqrt(mean_squared_error(y, predictions)) )
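
To make the metrics above less of a black box, here is a minimal sketch that recomputes R2 and RMSE by hand from their standard definitions. The helper name `r2_and_rmse` and the sample numbers are made up for illustration; in the notebook you would pass `y` and `predictions` instead.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute R2 and RMSE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    # Residual sum of squares: error left over after the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: error of always predicting the mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse

# Made-up values for illustration only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]

r2, rmse = r2_and_rmse(actual, predicted)
print("R2 :", r2)
print("RMSE :", rmse)
```

R2 compares the model against a baseline that always predicts the mean price, so a value near 1 means most of the variation is explained. RMSE is in the same units as the label, here dollars, which makes it easier to judge against typical sale prices.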

## Step 7 : Predict on New Data

In [None]:
new_data = pd.DataFrame({'Bedrooms' : [5,3,2],
                         'Bathrooms' : [3,2,1.5],
                         'SqFtTotLiving' : [4400, 1800, 1550],
                         'SqFtLot' : [10000, 5000, 4000]
             })
new_data

In [None]:
new_prediction = model.predict(new_data)
new_prediction

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

new_data['predicted_price'] = new_prediction
new_data

## Step 8 : Improve Model Performance
We have now done an 'end-to-end' implementation of regression.  
However, our model's fit (see the R2 value in Step 6) isn't all that great!  

**Q ==> What can we do to improve our model?**

One option is to choose better input columns.  
In Step 3, add more input columns.  
For example, you can add 'LandVal' to the input columns as follows:

```python
input_columns= ['Bedrooms', 'Bathrooms', 'SqFtTotLiving', 'SqFtLot', 'LandVal']
```
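
One way to see whether an extra column helps is to fit the model twice and compare R2. The sketch below does this on synthetic stand-in data so it runs on its own; the helper name `fit_and_score`, the generated values, and the coefficients used to generate them are all assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def fit_and_score(df, input_columns, label_column="SalePrice"):
    """Fit a linear regression on the given columns and return in-sample R2."""
    x = df[input_columns]
    y = df[label_column]
    model = LinearRegression().fit(x, y)
    return r2_score(y, model.predict(x))

# Synthetic data: the price is generated from both columns plus noise.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "SqFtTotLiving": rng.uniform(800, 4000, n),
    "LandVal": rng.uniform(50_000, 400_000, n),
})
demo["SalePrice"] = (150 * demo["SqFtTotLiving"]
                     + 1.2 * demo["LandVal"]
                     + rng.normal(0, 20_000, n))

r2_base = fit_and_score(demo, ["SqFtTotLiving"])
r2_more = fit_and_score(demo, ["SqFtTotLiving", "LandVal"])
print("R2 without LandVal :", r2_base)
print("R2 with LandVal    :", r2_more)
```

Note that in-sample R2 never decreases when a column is added to a linear regression, so a small gain does not prove the column is useful. Scoring on rows held out from training gives a fairer comparison.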
