# Problem 3 Boston Housing Data
### Created By: Ivor Zalud
***
## The Problem
I want to predict the nitric oxides concentration (NOX) and the median home value (MEDV) for Boston-area housing using a linear regressor.


## The Data
The data comes from the StatLib library maintained at CMU and is described [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The set describes housing data in the Boston area collected by the US Census Service. Note that there are only 506 cases, so any model trained on it will generalize weakly. The columns are: `crim`, `zn`, `indus`, `chas`, `nox`, `rm`, `age`, `dis`, `rad`, `tax`, `ptratio`, `b`, `lstat`, and `medv`.

## Methods
* Stochastic Gradient Descent (scikit-learn's `SGDRegressor`) to fit the linear regression, so the coefficients are learned through iterative gradient updates rather than a closed-form solution
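
To make the method concrete, here is a minimal NumPy sketch of stochastic gradient descent for linear regression with squared-error loss and a small L2 penalty. It is an illustration only, not scikit-learn's implementation: the function name `sgd_linear_regression`, the constant learning rate `lr`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50, alpha=1e-4, seed=0):
    """Fit y ~ X @ w + b with per-sample gradient updates (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch.
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]            # prediction error for one sample
            w -= lr * (error * X[i] + alpha * w)   # squared-error gradient plus L2 term
            b -= lr * error                        # the intercept is not regularized
    return w, b

# Toy data (assumed for illustration): y = 3*x0 - 2*x1 + 1 plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

w, b = sgd_linear_regression(X, y)
print("weights:", w, "intercept:", b)
```

The recovered weights should land close to the coefficients used to generate the toy data, which is the same idea `SGDRegressor` applies to the housing features below.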




In [129]:
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

## Read the data in early as we will clean/split our data differently for each prediction
data = pd.read_csv('Data/BostonHousing.csv')
data

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


# Predicting NOX
***


## 1. Clean the data

In [130]:
data_nox = data.copy()
data_nox

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
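
The cell above only copies the frame, so as a hedged sketch of what a cleaning step could look like: the `Unnamed: 0` column is the name pandas assigns to a header-less first column (typically a saved row index), and it would otherwise be passed to the model as a feature. The snippet below rebuilds the first three rows shown above from an in-memory string, drops that column, and counts missing values. Whether dropping the index column changes the scores reported in this notebook has not been tested here.

```python
import io
import pandas as pd

# First three rows of the table above, written the way DataFrame.to_csv() saves them:
# the leading index column has an empty header.
csv_text = """,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Drop the saved row index so it is not treated as a feature.
cleaned = raw.drop(columns=["Unnamed: 0"])

# Count missing values per column before deciding whether imputation is needed.
missing_per_column = cleaned.isna().sum()
print(cleaned.columns.tolist())
print(missing_per_column.sum())
```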


## 2. Split the data into training and testing sets
We will do an 80/20 split

In [131]:
## Define our independent and dependent variables
data_column_names = [column for column in data_nox.columns if column not in ['nox']]
x = data_nox.loc[:, data_column_names]
y = data_nox.loc[:,'nox']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,rm,age,dis,rad,tax,ptratio,b,lstat,medv
359,4.26131,0.0,18.10,0,6.112,81.3,2.5091,24,666,20.2,390.74,12.67,22.6
335,0.03961,0.0,5.19,0,6.037,34.5,5.9853,5,224,20.2,396.90,8.01,21.1
146,2.15505,0.0,19.58,0,5.628,100.0,1.5166,5,403,14.7,169.27,16.65,15.6
168,2.30040,0.0,19.58,0,6.319,96.1,2.1000,5,403,14.7,297.09,11.10,23.8
244,0.20608,22.0,5.86,0,5.593,76.5,7.9549,7,330,19.1,372.49,12.50,17.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
264,0.55007,20.0,3.97,0,7.206,91.6,1.9301,5,264,13.0,387.89,8.10,36.5
239,0.09252,30.0,4.93,0,6.606,42.2,6.1899,6,300,16.6,383.78,7.37,23.3
260,0.54011,20.0,3.97,0,7.203,81.8,2.1121,5,264,13.0,392.80,9.59,33.8
205,0.13642,0.0,10.59,0,5.891,22.3,3.9454,4,277,18.6,396.90,10.87,22.6
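
Because `train_test_split` shuffles randomly, the rows selected above (and therefore the score further down) change on every run. A small sketch of how the split could be made reproducible with the `random_state` parameter follows; the demo arrays and the seed value `0` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny stand-in data set (assumed): 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_test_rows = np.array_equal(split_a[1], split_b[1])
print(split_a[0].shape, split_a[1].shape, same_test_rows)
```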


## 3. Create the SGD regressor, regularize, and fit the data

### We use scikit-learn's pipeline to scale the data, as SGD is sensitive to feature scaling.
### -  **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Squared error - I want to be sensitive to any outliers in the data.
### - **Hyperparameter:** `max_iter`: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine the optimal parameters.
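
The grid search mentioned above could be sketched roughly as follows. This is not what the notebook runs: the synthetic data from `make_regression`, the candidate values in `param_grid`, and the choice of 5-fold cross-validation are assumptions for illustration. The `sgdregressor__` prefix is the step name that `make_pipeline` generates for `SGDRegressor`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (assumed): 13 numeric columns.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

pipeline = make_pipeline(StandardScaler(), SGDRegressor(tol=1e-3, random_state=0))

# Candidate values are illustrative, not tuned recommendations.
param_grid = {
    "sgdregressor__max_iter": [500, 1500, 3000],
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
}

# Score every combination with 5-fold cross-validated R^2.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```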

In [132]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [133]:
reg_predictions = reg.predict(x_test)

## Compare actual vs. predicted values for the first 6 test rows.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))


print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 0.871 - Prediction: 0.7014282685601834
Actual: 0.605 - Prediction: 0.6867053904144889
Actual: 0.74 - Prediction: 0.6851385397152467
Actual: 0.605 - Prediction: 0.6802777425540361
Actual: 0.7 - Prediction: 0.6905757934207425
Actual: 0.464 - Prediction: 0.5618910405198076
Score: 0.760577126697026


## Results
With an R² score of about 0.76 on the test set, the model does reasonably well at predicting the NOX level for a given area. Future studies will benefit from a larger data set and more features. If the data contains missing values, imputing them would likely improve the model as well.
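
With so few cases, the score from a single random 80/20 split can vary noticeably from run to run. One hedged way to get a steadier estimate is to repeat the split several times with `ShuffleSplit` and average the scores from `cross_val_score`. The sketch below uses synthetic data of the same shape as the housing features (an assumption) and is not a result from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumed): 506 rows and 13 features, like the housing data.
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=20.0, random_state=0)

model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3, random_state=0))

# Ten independent 80/20 splits instead of a single one.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print("mean R^2:", scores.mean(), "std:", scores.std())
```

Reporting the mean together with the standard deviation shows how much of a score difference is just split-to-split noise.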


# Predicting Median Home Value
***

## 1. Clean the data

In [134]:
data_home = data.copy()
data_home

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


## 2. Split the data into training and testing sets
We will do an 80/20 split

In [135]:
## Define our independent and dependent variables
data_column_names = [column for column in data_home.columns if column not in ['medv']]
x = data_home.loc[:, data_column_names]
y = data_home.loc[:,'medv']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat
296,0.05372,0.0,13.92,0,0.437,6.549,51.0,5.9604,4,289,16.0,392.85,7.39
65,0.03584,80.0,3.37,0,0.398,6.290,17.8,6.6115,4,337,16.1,396.90,4.67
390,6.96215,0.0,18.10,0,0.700,5.713,97.0,1.9265,24,666,20.2,394.43,17.11
108,0.12802,0.0,8.56,0,0.520,6.474,97.1,2.4329,5,384,20.9,395.24,12.27
414,45.74610,0.0,18.10,0,0.693,4.519,100.0,1.6582,24,666,20.2,88.27,36.98
...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,0.03932,0.0,3.41,0,0.489,6.405,73.9,3.0921,2,270,17.8,393.55,8.20
345,0.03113,0.0,4.39,0,0.442,6.014,48.5,8.0136,3,352,18.8,385.64,10.53
364,3.47428,0.0,18.10,1,0.718,8.780,82.9,1.9047,24,666,20.2,354.55,5.29
266,0.78570,20.0,3.97,0,0.647,7.014,84.6,2.1329,5,264,13.0,384.07,14.79


## 3. Create the SGD regressor, regularize, and fit the data

### We use scikit-learn's pipeline to standardize the data, as SGD is sensitive to feature scaling.
### -  **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Squared error - I want to be sensitive to any outliers in the data.
### - **Hyperparameter:** `max_iter`: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine the optimal parameters. Grid search would pay off more here if we were tuning additional hyperparameters.
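
For reference, the choices listed above (squared-error loss with an L2 penalty) correspond to the following objective, following the formulation in scikit-learn's SGD documentation. Here $w$ and $b$ are the coefficients and intercept, $n$ is the number of training rows, and $\alpha$ is the regularization strength (`alpha`, which defaults to 0.0001). The learning rate symbol $\eta$ is introduced only for this illustration.

$$
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( y_i - (w^\top x_i + b) \right)^2 + \frac{\alpha}{2} \lVert w \rVert_2^2
$$

For a single training row $(x_i, y_i)$, one stochastic gradient step is then

$$
w \leftarrow w - \eta \left[ \alpha w + \left( w^\top x_i + b - y_i \right) x_i \right], \qquad b \leftarrow b - \eta \left( w^\top x_i + b - y_i \right)
$$

Because the error enters the loss squared, rows with large residuals dominate the gradient, which is the sensitivity to outliers mentioned above.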

In [136]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [137]:
reg_predictions = reg.predict(x_test)

## Compare actual vs. predicted values for the first 6 test rows.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))


print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 29.8 - Prediction: 24.97412111710109
Actual: 23.7 - Prediction: 25.74473003435085
Actual: 25.0 - Prediction: 24.375918544310878
Actual: 19.4 - Prediction: 25.459202836821436
Actual: 21.8 - Prediction: 20.55592486949337
Actual: 23.9 - Prediction: 27.283475259683712
Score: 0.6435893053572705
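
R² has no units, so it can help to also look at the error in the units of the target (`medv` is recorded in thousands of dollars). As a rough sketch, the snippet below computes the mean absolute error and root mean squared error for the six pairs printed above. It covers six rows only, so it is not an estimate for the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# The six actual/prediction pairs printed above (medv, in $1000s).
actual = np.array([29.8, 23.7, 25.0, 19.4, 21.8, 23.9])
predicted = np.array([
    24.97412111710109, 25.74473003435085, 24.375918544310878,
    25.459202836821436, 20.55592486949337, 27.283475259683712,
])

mae = mean_absolute_error(actual, predicted)            # average absolute miss
rmse = np.sqrt(np.mean((actual - predicted) ** 2))      # penalizes large misses more

print("MAE ($1000s):", mae)
print("RMSE ($1000s):", rmse)
```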


## Results
The model predicts housing prices adequately, with an R² score of about 0.64 on the test set, somewhat below the NOX score of about 0.76. I'm inclined to believe this is due to the small number of features and cases. Future studies will benefit from a larger data set with more features.