# Problem 3 Boston Housing Data
### Created By: Ivor Zalud
***
## The Problem
I want to predict two targets for the Boston housing data using a linear regressor: the nitric oxides concentration (NOX) and the median home value (MEDV).


## The Data
Data is provided by CMU [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The set describes housing data in the Boston area collected by the US Census Service. Note that there are only 506 cases, so results will be weak. The columns are: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, and medv.

## Methods
* Stochastic Gradient Descent (scikit-learn's `SGDRegressor`) as our linear regression, so the coefficients are learned iteratively by gradient descent
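
To make the method concrete, here is a minimal NumPy sketch of stochastic gradient descent for a linear model with squared-error loss and an L2 penalty. It is an illustration only, not scikit-learn's implementation: the synthetic data, the constant `learning_rate`, and the `alpha` value are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, already-standardized features with known weights (illustrative only).
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 0.3
y = X @ true_w + true_b + rng.normal(scale=0.1, size=500)

w = np.zeros(3)
b = 0.0
learning_rate = 0.01  # hypothetical constant step size
alpha = 1e-4          # hypothetical L2 penalty strength

for epoch in range(50):
    # Visit the samples in a fresh random order each epoch.
    for i in rng.permutation(len(X)):
        error = X[i] @ w + b - y[i]  # residual for a single sample
        # Gradient of 0.5 * error**2 plus the L2 term, applied one sample at a time.
        w -= learning_rate * (error * X[i] + alpha * w)
        b -= learning_rate * error

print("Learned weights: " + str(w))
print("Learned intercept: " + str(b))
```

Each update uses a single row, which is what distinguishes SGD from batch gradient descent. The learned weights should land close to `true_w` because the features are on a common scale.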




In [107]:
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

## Read the data in early as we will clean/split our data differently for each prediction
data = pd.read_csv('Data/BostonHousing.csv')
data

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


# Predicting NOX
***


## 1. Clean the data

In [108]:
data_nox = data.copy()
data_nox

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


## 2. Split the data into training and testing sets
We will do an 80/20 split

In [109]:
## Define our independent and dependent variables
data_column_names = [column for column in data_nox.columns if column not in ['nox']]
x = data_nox.loc[:, data_column_names]
y = data_nox.loc[:,'nox']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,rm,age,dis,rad,tax,ptratio,b,lstat,medv
473,4.64689,0.0,18.10,0,6.980,67.6,2.5329,24,666,20.2,374.68,11.66,29.8
475,6.39312,0.0,18.10,0,6.162,97.4,2.2060,24,666,20.2,302.76,24.10,13.3
207,0.25199,0.0,10.59,0,5.783,72.7,4.3549,4,277,18.6,389.43,18.06,22.5
347,0.01870,85.0,4.15,0,6.516,27.7,8.5353,4,351,17.9,392.43,6.36,23.1
15,0.62739,0.0,8.14,0,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...
340,0.06151,0.0,5.19,0,5.968,58.5,4.8122,5,224,20.2,396.90,9.29,18.7
97,0.12083,0.0,2.89,0,8.069,76.0,3.4952,2,276,18.0,396.90,4.21,38.7
382,9.18702,0.0,18.10,0,5.536,100.0,1.5804,24,666,20.2,396.90,23.60,11.3
195,0.01381,80.0,0.46,0,7.875,32.0,5.6484,4,255,14.4,394.23,2.97,50.0
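
Because `train_test_split` shuffles randomly, each run of the notebook produces a different split and a slightly different score. A minimal sketch of pinning the split with `random_state` follows. The seed `42` and the toy arrays are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(np.array_equal(split_a[1], split_b[1]))  # True: same test rows both times
print(len(split_a[0]), len(split_a[1]))        # 8 2: an 80/20 split of 10 rows
```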


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to standardize the data, as SGD is sensitive to feature scaling.

- **Regularization:** L2, the default penalty for `SGDRegressor` in scikit-learn. We'll keep L2 since we have few features and do not want to shrink any coefficients to zero.
- **Loss Function:** Squared error. I want to be sensitive to any outliers in the data.
- **Hyperparameter:** `max_iter`: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine the optimal parameters.
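
The settings listed above can be checked on the estimator itself rather than taken on trust. The sketch below builds the same pipeline and prints the relevant `SGDRegressor` parameters via `get_params()`. The exact string reported for `loss` depends on the installed scikit-learn version, so it is printed rather than assumed.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))

# make_pipeline names each step after its lowercased class name.
sgd_params = reg.named_steps["sgdregressor"].get_params()

# Print the parameters discussed above; anything not set explicitly is a default.
for name in ("loss", "penalty", "alpha", "max_iter", "tol"):
    print(name + " = " + str(sgd_params[name]))
```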

In [110]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [111]:
reg_predictions = reg.predict(x_test)

## Compare actual vs prediction for the first 6 items.
print("-----------------------------------------------------")
for act, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(act) + " - Prediction: " + str(pred))


print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 0.515 - Prediction: 0.41573981933029125
Actual: 0.489 - Prediction: 0.5110442226297296
Actual: 0.4429 - Prediction: 0.48167258538310026
Actual: 0.693 - Prediction: 0.6989764597656621
Actual: 0.693 - Prediction: 0.7005472391370082
Actual: 0.679 - Prediction: 0.6785720213127965
Score: 0.7822271525179844


## Results
The model does reasonably well at predicting the NOX level for a given case, with a score (R², as returned by `reg.score`) of about 0.78 on the test set. Future studies will benefit from a larger data set and more features. If the data contains any NaN values, filling them in would also likely improve the model.
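
The score above comes from a single random 80/20 split, so it will shift from run to run. One way to get a steadier estimate is to average over several random splits with `ShuffleSplit` and `cross_val_score`. This is a sketch under stated assumptions: the data is a synthetic stand-in generated with `make_regression` (506 rows, 13 features, to mirror the shape of the Boston set), and the number of splits and the seeds are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the real Boston data).
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=0
)

reg_cv = make_pipeline(
    StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3, random_state=0)
)

# Ten independent random 80/20 splits instead of one.
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# For a regressor, the default scoring is the estimator's own score, i.e. R^2.
scores = cross_val_score(reg_cv, X_demo, y_demo, cv=splitter)

print("Mean R^2: " + str(scores.mean()))
print("Std of R^2: " + str(scores.std()))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every split, so no information from the held-out rows leaks into the scaling.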


# Predicting Median Home Value
***

## 1. Clean the data

In [112]:
data_home = data.copy()
data_home

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


## 2. Split the data into training and testing sets
We will do an 80/20 split

In [113]:
## Define our independent and dependent variables
data_column_names = [column for column in data_home.columns if column not in ['medv']]
x = data_home.loc[:, data_column_names]
y = data_home.loc[:,'medv']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat
151,1.49632,0.0,19.58,0,0.871,5.404,100.0,1.5916,5,403,14.7,341.60,13.28
123,0.15038,0.0,25.65,0,0.581,5.856,97.0,1.9444,2,188,19.1,370.31,25.41
54,0.01360,75.0,4.00,0,0.410,5.888,47.6,7.3197,3,469,21.1,396.90,14.80
478,10.23300,0.0,18.10,0,0.614,6.185,96.7,2.1705,24,666,20.2,379.70,18.03
474,8.05579,0.0,18.10,0,0.584,5.427,95.4,2.4298,24,666,20.2,352.58,18.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...
440,22.05110,0.0,18.10,0,0.740,5.818,92.4,1.8662,24,666,20.2,391.45,22.11
35,0.06417,0.0,5.96,0,0.499,5.933,68.2,3.3603,5,279,19.2,396.90,9.68
154,1.41385,0.0,19.58,1,0.871,6.129,96.0,1.7494,5,403,14.7,321.02,15.12
481,5.70818,0.0,18.10,0,0.532,6.750,74.9,3.3317,24,666,20.2,393.07,7.74


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to standardize the data, as SGD is sensitive to feature scaling.

- **Regularization:** L2, the default penalty for `SGDRegressor` in scikit-learn. We'll keep L2 since we have few features and do not want to shrink any coefficients to zero.
- **Loss Function:** Squared error. I want to be sensitive to any outliers in the data.
- **Hyperparameter:** `max_iter`: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine the optimal parameters. A grid search would pay off more here if we were tuning more hyperparameters.
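
The grid search mentioned above could look roughly like the following. This is a sketch, not the notebook's method: the data is a synthetic stand-in from `make_regression`, and the candidate values for `alpha` and `max_iter` are hypothetical and would need adjusting for the real data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the real Boston data).
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=0
)

pipe = make_pipeline(StandardScaler(), SGDRegressor(tol=1e-3, random_state=0))

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
    "sgdregressor__max_iter": [500, 1500, 3000],
}

# 5-fold cross-validation over every combination in the grid, scored by R^2.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best mean CV R^2: " + str(search.best_score_))
```

Adding the penalty type or the learning-rate schedule to `param_grid` is how the search would grow once more hyperparameters are in play.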

In [114]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [115]:
reg_predictions = reg.predict(x_test)

## Compare actual vs prediction for the first 6 items.
print("-----------------------------------------------------")
for act, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(act) + " - Prediction: " + str(pred))


print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 15.0 - Prediction: 14.09468878639181
Actual: 20.3 - Prediction: 22.93287303252346
Actual: 37.2 - Prediction: 32.38477221690656
Actual: 37.6 - Prediction: 36.81708507980378
Actual: 18.6 - Prediction: 18.58194485093853
Actual: 14.1 - Prediction: 17.665080967016387
Score: 0.7530299723644374
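
The `Score` printed above is the coefficient of determination, R², which is what `score` returns for scikit-learn regressors. As a worked illustration of the formula, the sketch below computes R² by hand for only the six actual/prediction pairs printed above. Six rows are not the whole test set, so the result will differ from the full score.

```python
import numpy as np

# The six actual/prediction pairs from the output above.
actual = np.array([15.0, 20.3, 37.2, 37.6, 18.6, 14.1])
predicted = np.array([
    14.09468878639181,
    22.93287303252346,
    32.38477221690656,
    36.81708507980378,
    18.58194485093853,
    17.665080967016387,
])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("R^2 on these six rows: " + str(r2))
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the actual values.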


## Results
The model predicts housing prices adequately. The score is similar to the NOX score. I'm inclined to believe this is due to the small number of features and cases given. Future studies will benefit from a larger data set with more features.