# Problem 3 Boston Housing Data
### Created By: Ivor Zalud
***
## The Problem
I want to predict NOX (the nitric oxides concentration) and the median home value (medv) for the Boston housing data, using a linear regressor for each target.


## The Data
The data is provided by CMU's StatLib library and documented [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The set describes housing in the Boston area, collected by the US Census Service. Note that there are only 506 cases, so results will be weak. The columns are: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, and medv.

## Methods
* Linear regression fit with Stochastic Gradient Descent (scikit-learn's `SGDRegressor`), which updates the coefficients one sample at a time
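
To make the method concrete, here is a minimal NumPy sketch of stochastic gradient descent for linear regression with squared error loss and an L2 penalty. It is a simplification for illustration, not scikit-learn's implementation: the function name `sgd_linear_regression`, the constant learning rate `lr`, and the synthetic data are all assumptions (`SGDRegressor` uses a decaying learning rate by default).

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, alpha=0.0001, epochs=50, seed=0):
    """Fit y ~ X @ w + b with per-sample gradient steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]
            # Gradient of 0.5 * error**2 + 0.5 * alpha * ||w||^2 for one sample
            w -= lr * (error * X[i] + alpha * w)
            b -= lr * error
    return w, b

# Synthetic data with known coefficients (not the Boston data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.1, size=200)

w, b = sgd_linear_regression(X_demo, y_demo)
print("Estimated weights:", w)
print("Estimated intercept:", b)
```

On this synthetic data the estimates should land close to `true_w` and `true_b`, since the features are already on a common scale.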




In [88]:
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

## Read the data in early as we will clean/split our data differently for each prediction
data = pd.read_csv('Data/BostonHousing.csv')
data

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
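
The `Unnamed: 0` column in the output above suggests that the CSV stores a row index as its first, unnamed column. If that is the case, the index is later passed to the model as a feature. The sketch below shows one way to avoid this with `index_col=0`. The small inline CSV is an assumption that mimics the first rows shown above, not the actual file.

```python
import io

import pandas as pd

# A tiny stand-in for the real file; the first column has no header name
csv_text = """,crim,zn,nox,medv
0,0.00632,18.0,0.538,24.0
1,0.02731,0.0,0.469,21.6
2,0.02729,0.0,0.469,34.7
"""

# Default read: the unnamed column becomes a regular data column
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```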


# Predicting NOX
***


## 1. Clean the data
- Remove the medv column, since it is our other prediction target and not a feature

In [89]:
data_nox = data.copy()
data_nox = data_nox.drop(columns='medv')
data_nox

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48


## 2. Split the data into training and testing sets
We will do an 80/20 split.

In [90]:
## Define our independent and dependent variables
data_column_names = [column for column in data_nox.columns if column not in ['nox']]
x = data_nox.loc[:, data_column_names]
y = data_nox.loc[:,'nox']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,rm,age,dis,rad,tax,ptratio,b,lstat
113,0.22212,0.0,10.01,0,6.092,95.4,2.5480,6,432,17.8,396.90,17.09
347,0.01870,85.0,4.15,0,6.516,27.7,8.5353,4,351,17.9,392.43,6.36
432,6.44405,0.0,18.10,0,6.425,74.8,2.2004,24,666,20.2,97.95,12.03
497,0.26838,0.0,9.69,0,5.794,70.6,2.8927,6,391,19.2,396.90,14.10
346,0.06162,0.0,4.39,0,5.898,52.3,8.0136,3,352,18.8,364.61,12.67
...,...,...,...,...,...,...,...,...,...,...,...,...
381,15.87440,0.0,18.10,0,6.545,99.1,1.5192,24,666,20.2,396.90,21.08
68,0.13554,12.5,6.07,0,5.594,36.8,6.4980,4,345,18.9,396.90,13.09
228,0.29819,0.0,6.20,0,7.686,17.0,3.3751,8,307,17.4,377.51,3.92
44,0.12269,0.0,6.91,0,6.069,40.0,5.7209,3,233,17.9,389.39,9.55
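
Because `train_test_split` shuffles randomly, the rows in each set (and therefore the score) change between runs. A minimal sketch of a reproducible 80/20 split follows. The toy arrays and the seed value 42 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state yields the same split on every call
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Train size:", len(x_train_a), "Test size:", len(x_test_a))
print("Identical test sets:", np.array_equal(y_test_a, y_test_b))
```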


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to scale the data, as SGD is sensitive to feature scaling.
- **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have few features and do not want to shrink any coefficients to zero.
- **Loss Function:** Squared error, since I want the fit to be sensitive to outliers in the data.
- **Hyperparameter:** max_iter: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine optimal parameters.
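
The grid search mentioned above could look roughly like the following sketch. It runs on synthetic data from `make_regression` rather than the Boston data, and the parameter grid values are assumptions chosen only to illustrate the mechanics.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

pipeline = make_pipeline(StandardScaler(), SGDRegressor(tol=1e-3, random_state=0))

# make_pipeline names each step after its lowercased class name,
# so SGDRegressor parameters are addressed as "sgdregressor__<param>"
param_grid = {
    "sgdregressor__max_iter": [500, 1500, 3000],
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3],
    "sgdregressor__penalty": ["l2", "l1", "elasticnet"],
}

# 5-fold cross-validation, scored by R^2
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated R^2:", search.best_score_)
```

Searching over the whole pipeline, rather than a pre-scaled array, keeps the scaler from seeing the validation folds during fitting.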

In [91]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [92]:
reg_predictions = reg.predict(x_test)

## Compare actual vs. predicted values for the first 6 test items.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))

## score() returns the coefficient of determination (R^2) on the test set
print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 0.538 - Prediction: 0.5371947025093341
Actual: 0.426 - Prediction: 0.44566073998216416
Actual: 0.77 - Prediction: 0.6479549305427998
Actual: 0.524 - Prediction: 0.5508669573371088
Actual: 0.624 - Prediction: 0.6277191421485055
Actual: 0.524 - Prediction: 0.5057867881568263
Score: 0.7421824104018022


## Results
The model does sufficiently well at predicting the NOX level for a given case, with a score (R²) of about 0.74 on the test set. Future studies will benefit from a larger data set and more features. Backfilling any NaN data would also likely improve the model.
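
A single 80/20 split gives a noisy estimate of the score on a data set this small. One option, sketched below, is to average the score over several random splits with `ShuffleSplit` and `cross_val_score`. The synthetic data is an assumption that only mimics the shape of the Boston data (506 cases, 13 features), so the printed numbers say nothing about NOX.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston features
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=20.0, random_state=0
)

model = make_pipeline(
    StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3, random_state=0)
)

# Ten independent random 80/20 splits
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# For a regressor, the default scoring is R^2
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print("Mean R^2:", scores.mean())
print("Std of R^2:", scores.std())
```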


# Predicting Median Home Value
***

## 1. Clean the data
- Remove the nox column, since it is our other prediction target and not a feature

In [93]:
data_home = data.copy()
data_home = data_home.drop(columns='nox')
data_home

Unnamed: 0,crim,zn,indus,chas,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


## 2. Split the data into training and testing sets
We will do an 80/20 split.

In [94]:
## Define our independent and dependent variables
data_column_names = [column for column in data_home.columns if column not in ['medv']]
x = data_home.loc[:, data_column_names]
y = data_home.loc[:,'medv']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,crim,zn,indus,chas,rm,age,dis,rad,tax,ptratio,b,lstat
324,0.34109,0.0,7.38,0,6.415,40.1,4.7211,5,287,19.6,396.90,6.12
387,22.59710,0.0,18.10,0,5.000,89.5,1.5184,24,666,20.2,396.90,31.99
159,1.42502,0.0,19.58,0,6.510,100.0,1.7659,5,403,14.7,364.31,7.39
5,0.02985,0.0,2.18,0,6.430,58.7,6.0622,3,222,18.7,394.12,5.21
173,0.09178,0.0,4.05,0,6.416,84.1,2.6463,5,296,16.6,395.50,9.04
...,...,...,...,...,...,...,...,...,...,...,...,...
90,0.04684,0.0,3.41,0,6.417,66.1,3.0923,2,270,17.8,392.18,8.81
394,13.35980,0.0,18.10,0,5.887,94.7,1.7821,24,666,20.2,396.90,16.35
220,0.35809,0.0,6.20,1,6.951,88.5,2.8617,8,307,17.4,391.70,9.71
376,15.28800,0.0,18.10,0,6.649,93.3,1.3449,24,666,20.2,363.02,23.24


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to standardize the data, as SGD is sensitive to feature scaling.
- **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have few features and do not want to shrink any coefficients to zero.
- **Loss Function:** Squared error, since I want the fit to be sensitive to outliers in the data.
- **Hyperparameter:** max_iter: ~1500 typically resulted in the best model. For larger data sets we may want to decrease this. However, the best strategy would be to use a grid search to determine optimal parameters.
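
To illustrate why squared error is sensitive to outliers, the sketch below compares it with the Huber loss, which grows only linearly for large residuals. The helper functions and the threshold `epsilon=1.0` are assumptions for illustration. `SGDRegressor` also offers `loss="huber"` as an alternative, with its own `epsilon` default.

```python
import numpy as np

def squared_error(residual):
    """Half the squared residual: grows quadratically everywhere."""
    return 0.5 * residual ** 2

def huber(residual, epsilon=1.0):
    """Quadratic for small residuals, linear beyond epsilon."""
    abs_r = np.abs(residual)
    return np.where(
        abs_r <= epsilon,
        0.5 * residual ** 2,
        epsilon * abs_r - 0.5 * epsilon ** 2,
    )

# Two ordinary residuals and one outlier
residuals = np.array([0.5, 1.0, 10.0])

squared_losses = squared_error(residuals)
huber_losses = huber(residuals)

print("Squared error:", squared_losses)  # [ 0.125  0.5   50.   ]
print("Huber:        ", huber_losses)    # [0.125 0.5   9.5  ]
```

The two losses agree on the small residuals, but the outlier contributes 50.0 under squared error and only 9.5 under Huber, so a squared error fit is pulled much harder toward outlying cases.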

In [95]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [96]:
reg_predictions = reg.predict(x_test)

## Compare actual vs. predicted values for the first 6 test items.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))

## score() returns the coefficient of determination (R^2) on the test set
print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 29.9 - Prediction: 30.93755973902669
Actual: 20.3 - Prediction: 23.537444036080412
Actual: 8.8 - Prediction: 6.732627876275632
Actual: 23.2 - Prediction: 25.75017399885518
Actual: 20.6 - Prediction: 19.026685719805233
Actual: 13.2 - Prediction: 10.214458014468455
Score: 0.719679625657123


## Results
The model predicts housing prices adequately. The score is similar to the NOX score. I'm inclined to believe this is due to the small number of features and cases available. Future studies will benefit from a larger data set with more features.
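
The score above is unitless, so it can help to also report errors in the units of medv (thousands of dollars). As a small illustration, the sketch below computes the mean absolute error and root mean squared error for the six actual/predicted pairs printed above. Six points are far too few for a real evaluation; on the full test set the same calls would take `y_test` and `reg_predictions`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The six actual/predicted pairs printed in the output above
actual = np.array([29.9, 20.3, 8.8, 23.2, 20.6, 13.2])
predicted = np.array([
    30.93755973902669,
    23.537444036080412,
    6.732627876275632,
    25.75017399885518,
    19.026685719805233,
    10.214458014468455,
])

# Both metrics are in the units of medv (thousands of dollars)
mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))

print("MAE: ", round(mae, 2))
print("RMSE:", round(rmse, 2))
```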