# Problem 3 Boston Housing Data
### Created By: Ivor Zalud
***
## The Problem
I want to predict the nitric oxides concentration (NOX) and the median home value (MEDV) using a linear regressor.


## The Data
The data comes from the StatLib library maintained at CMU and is described [here](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The set describes housing data in the Boston area collected by the US Census Service. Note that there are only 506 cases, so results may be weak.

## Methods
* Linear regression fitted with stochastic gradient descent (scikit-learn's `SGDRegressor`)
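To make the method concrete, here is a minimal from-scratch sketch of linear regression trained with stochastic gradient descent. It is not what `SGDRegressor` does internally (that uses a learning-rate schedule and stopping criteria); the function name `sgd_linear_regression`, the constant learning rate, and the toy data are all assumptions made for illustration.

```python
import random

def sgd_linear_regression(rows, targets, lr=0.01, epochs=200, alpha=0.0001, seed=0):
    """Fit y ~ w.x + b with one gradient step per sample (squared error + L2 penalty)."""
    rng = random.Random(seed)
    w = [0.0] * len(rows[0])
    b = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            pred = sum(wj * xj for wj, xj in zip(w, rows[i])) + b
            err = pred - targets[i]
            # Gradient of 0.5 * err**2 + 0.5 * alpha * ||w||^2 for this single sample
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, rows[i])]
            b -= lr * err
    return w, b

# Toy data generated from y = 2*x1 - 1*x2 + 0.5 (made up, not the Boston data)
rows = [[x1 / 10.0, x2 / 10.0] for x1 in range(-10, 11, 2) for x2 in range(-10, 11, 2)]
targets = [2.0 * x1 - 1.0 * x2 + 0.5 for x1, x2 in rows]

w, b = sgd_linear_regression(rows, targets)
print(w, b)  # should land close to [2.0, -1.0] and 0.5
```

Each update only looks at one row, which is why the order is reshuffled every epoch and why the inputs need to be on comparable scales.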




In [45]:
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('Data/BostonHousing.csv')
data

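| Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |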


# Predicting NOX
***


## 1. Clean the data
- Remove the `medv` column, since it is the other target we want to predict and should not be used as a feature here.

In [46]:
data_nox = data.copy()
data_nox = data_nox.drop(columns='medv')
data_nox

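| Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 |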


## 2. Split the data into training and testing sets
We will do an 80/20 split
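As a rough picture of what an 80/20 shuffle split does to 506 rows, here is a small standard-library sketch. The helper `shuffle_split` and its choice to round the test share up are assumptions of this sketch, not a copy of scikit-learn's implementation. The point is that without a fixed seed every run produces a different split, while a seed makes it repeatable (`train_test_split` exposes this through its `random_state` parameter).

```python
import math
import random

def shuffle_split(n_samples, test_size=0.2, seed=None):
    """Return (train_indices, test_indices) after shuffling the row positions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(506, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 404 102

# Same seed, same split; a different seed gives different test rows.
print(shuffle_split(506, seed=42)[1] == test_idx)  # True
```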

In [47]:
# Define our independent and dependent variables
data_column_names = [column for column in data_nox.columns if column not in ['nox']]
x = data_nox.loc[:, data_column_names]
y = data_nox.loc[:,'nox']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

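| Unnamed: 0 | crim | zn | indus | chas | rm | age | dis | rad | tax | ptratio | b | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 0.09378 | 12.5 | 7.87 | 0 | 5.889 | 39.0 | 5.4509 | 5 | 311 | 15.2 | 390.50 | 15.71 |
| 449 | 7.52601 | 0.0 | 18.10 | 0 | 6.417 | 98.3 | 2.1850 | 24 | 666 | 20.2 | 304.21 | 19.31 |
| 460 | 4.81213 | 0.0 | 18.10 | 0 | 6.701 | 90.0 | 2.5975 | 24 | 666 | 20.2 | 255.23 | 16.42 |
| 382 | 9.18702 | 0.0 | 18.10 | 0 | 5.536 | 100.0 | 1.5804 | 24 | 666 | 20.2 | 396.90 | 23.60 |
| 51 | 0.04337 | 21.0 | 5.64 | 0 | 6.115 | 63.0 | 6.8147 | 4 | 243 | 16.8 | 393.97 | 9.43 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5 | 0.02985 | 0.0 | 2.18 | 0 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 |
| 146 | 2.15505 | 0.0 | 19.58 | 0 | 5.628 | 100.0 | 1.5166 | 5 | 403 | 14.7 | 169.27 | 16.65 |
| 494 | 0.27957 | 0.0 | 9.69 | 0 | 5.926 | 42.6 | 2.3817 | 6 | 391 | 19.2 | 396.90 | 13.59 |
| 215 | 0.19802 | 0.0 | 10.59 | 0 | 6.182 | 42.4 | 3.9454 | 4 | 277 | 18.6 | 393.63 | 9.47 |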


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to scale the data, as SGD is sensitive to feature scaling.
- **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
- **Loss function:** Squared error, also the default. It penalizes large residuals heavily, so the fit stays sensitive to any outliers in the data, which is what I want here.
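To see why scaling matters, the sketch below standardizes the first five `tax` and `crim` values from the table printed earlier. It uses the population standard deviation, which is also what `StandardScaler` uses; the helper name `standardize` is just for illustration. In the pipeline the mean and standard deviation are learned from the training rows only and then reused on the test rows.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    return [(value - mean) / std for value in column]

# First five `tax` and `crim` values from the DataFrame output above
tax = [296, 242, 242, 222, 222]
crim = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905]

# Raw values differ by several orders of magnitude; standardized values do not.
print(standardize(tax))
print(standardize(crim))
```

Without this step, a single SGD learning rate would be far too large for `tax` or far too small for `crim`.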

In [48]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [49]:
reg_predictions = reg.predict(x_test)

# Compare actual vs. predicted values for the first 6 test items.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))

# For a regression pipeline, score() returns R^2 on the given data.
print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 0.448 - Prediction: 0.4136806039099281
Actual: 0.445 - Prediction: 0.5039144265128032
Actual: 0.624 - Prediction: 0.6362222213324717
Actual: 0.493 - Prediction: 0.42416400782624203
Actual: 0.631 - Prediction: 0.6846387357738365
Actual: 0.493 - Prediction: 0.5020277180202058
Score: 0.7636390242119302
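The "Score" printed by `reg.score` is the coefficient of determination, R². As a sanity check on what that number means, here is a small sketch that computes R² by hand on just the six pairs shown above (predictions rounded to four decimals). Six rows are not the full test set, so this will not reproduce the 0.76 score; the helper name `r_squared` is an assumption of the sketch.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# The six actual/predicted NOX pairs printed above (predictions rounded)
actual = [0.448, 0.445, 0.624, 0.493, 0.631, 0.493]
predicted = [0.4137, 0.5039, 0.6362, 0.4242, 0.6846, 0.5020]

print(r_squared(actual, predicted))

# Reference points: perfect predictions give 1.0, always predicting the mean gives 0.0.
mean_only = [sum(actual) / len(actual)] * len(actual)
print(r_squared(actual, actual))     # 1.0
print(r_squared(actual, mean_only))  # 0.0
```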


## Results
With an R² of about 0.76 on the held-out test set, the model does reasonably well at predicting the NOX level for a given area. Future studies will benefit from a larger data set.

### Hyperparameters:
- `max_iter`: I tested values between 500 and 10000 and found that ~1500 typically maximizes the score. However, this can vary depending on how the data is split.
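One way to separate the effect of `max_iter` from the luck of a single split is to score each setting over several random 80/20 splits. The sketch below does that with `ShuffleSplit` and `cross_val_score`. It runs on synthetic data from `make_regression` as a stand-in (506 rows, 12 features, both assumed to mirror the feature matrix), so the numbers it prints say nothing about the Boston data itself.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption of this sketch)
X, y = make_regression(n_samples=506, n_features=12, noise=10.0, random_state=0)

# Ten different 80/20 splits, fixed by random_state so the comparison is repeatable
splits = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

mean_scores = {}
for max_iter in (500, 1500, 10000):
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(max_iter=max_iter, tol=1e-3, random_state=0),
    )
    scores = cross_val_score(model, X, y, cv=splits)  # R^2 per split
    mean_scores[max_iter] = scores.mean()
    print(max_iter, round(scores.mean(), 3), round(scores.std(), 3))
```

If the spread across splits is larger than the difference between settings, the choice of `max_iter` is probably not what drives the score.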


# Predicting Median Home Value
***

## 1. Clean the data
- Remove the `nox` column, since it is the other target we want to predict and should not be used as a feature here.

In [51]:
data_home = data.copy()
data_home = data_home.drop(columns='nox')
data_home

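| Unnamed: 0 | crim | zn | indus | chas | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |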


## 2. Split the data into training and testing sets
We will do an 80/20 split

In [52]:
# Define our independent and dependent variables
data_column_names = [column for column in data_home.columns if column not in ['medv']]
x = data_home.loc[:, data_column_names]
y = data_home.loc[:,'medv']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

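| Unnamed: 0 | crim | zn | indus | chas | rm | age | dis | rad | tax | ptratio | b | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 0.46296 | 0.0 | 6.20 | 0 | 7.412 | 76.9 | 3.6715 | 8 | 307 | 17.4 | 376.14 | 5.25 |
| 384 | 20.08490 | 0.0 | 18.10 | 0 | 4.368 | 91.2 | 1.4395 | 24 | 666 | 20.2 | 285.83 | 30.63 |
| 205 | 0.13642 | 0.0 | 10.59 | 0 | 5.891 | 22.3 | 3.9454 | 4 | 277 | 18.6 | 396.90 | 10.87 |
| 394 | 13.35980 | 0.0 | 18.10 | 0 | 5.887 | 94.7 | 1.7821 | 24 | 666 | 20.2 | 396.90 | 16.35 |
| 363 | 4.22239 | 0.0 | 18.10 | 1 | 5.803 | 89.0 | 1.9047 | 24 | 666 | 20.2 | 353.04 | 14.64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | 0.53700 | 0.0 | 6.20 | 0 | 5.981 | 68.1 | 3.6715 | 8 | 307 | 17.4 | 378.35 | 11.65 |
| 483 | 2.81838 | 0.0 | 18.10 | 0 | 5.762 | 40.3 | 4.0983 | 24 | 666 | 20.2 | 392.92 | 10.42 |
| 240 | 0.11329 | 30.0 | 4.93 | 0 | 6.897 | 54.3 | 6.3361 | 6 | 300 | 16.6 | 391.25 | 11.38 |
| 321 | 0.18159 | 0.0 | 7.38 | 0 | 6.376 | 54.3 | 4.5404 | 5 | 287 | 19.6 | 396.90 | 6.87 |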


## 3. Create the SGD regressor, regularize, and fit the data

We use scikit-learn's pipeline to scale the data, as SGD is sensitive to feature scaling.
- **Regularization:** L2, the default penalty of `SGDRegressor` in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
- **Loss function:** Squared error, also the default. It penalizes large residuals heavily, so the fit stays sensitive to any outliers in the data, which is what I want here.
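The reason L2 keeps every coefficient in play while L1 can drop some can be shown with the penalty steps alone. The sketch below applies 100 penalty-only updates to three made-up weights: the L2 step multiplies each weight by a constant factor, while the L1 step (written here as simple soft-thresholding, which is a simplification and not scikit-learn's exact L1 update) subtracts a fixed amount and clips at zero. All names and numbers are hypothetical.

```python
def l2_step(w, lr, alpha):
    """Gradient step on 0.5 * alpha * w**2: scales each weight toward zero."""
    return [wj * (1.0 - lr * alpha) for wj in w]

def soft_threshold(wj, shrink):
    """Move a weight toward zero by `shrink`, stopping at exactly zero."""
    magnitude = abs(wj) - shrink
    if magnitude <= 0.0:
        return 0.0
    return magnitude if wj > 0 else -magnitude

def l1_step(w, lr, alpha):
    """Soft-thresholding step for alpha * |w|: small weights become exactly zero."""
    return [soft_threshold(wj, lr * alpha) for wj in w]

weights = [2.0, -0.5, 0.03]  # made-up coefficients
w_l2, w_l1 = list(weights), list(weights)
for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, alpha=0.01)
    w_l1 = l1_step(w_l1, lr=0.1, alpha=0.01)

print(w_l2)  # all three weights shrunk by the same factor, none exactly zero
print(w_l1)  # the smallest weight has been clipped to exactly 0.0
```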

In [53]:
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1500, tol=1e-3))
reg.fit(x_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('sgdregressor', SGDRegressor(max_iter=1500))])

## 4. Test the model and print out scores

In [54]:
reg_predictions = reg.predict(x_test)

# Compare actual vs. predicted values for the first 6 test items.
print("-----------------------------------------------------")
for actual, pred in zip(y_test.iloc[:6], reg_predictions[:6]):
    print("Actual: " + str(actual) + " - Prediction: " + str(pred))

# For a regression pipeline, score() returns R^2 on the given data.
print("Score: " + str(reg.score(x_test, y_test)))

-----------------------------------------------------
Actual: 16.1 - Prediction: 19.37922613412853
Actual: 24.2 - Prediction: 24.115525089986885
Actual: 15.2 - Prediction: 10.01565649287111
Actual: 20.9 - Prediction: 20.714021624048492
Actual: 11.8 - Prediction: 11.706869009896142
Actual: 22.5 - Prediction: 17.24632393278913
Score: 0.7816641002498717
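R² has no units, so it can help to also look at the errors in the target's own scale. The dataset description gives MEDV in thousands of dollars, so the sketch below computes the mean absolute error and root mean squared error on the six pairs printed above (predictions rounded to two decimals). Six rows are only a glimpse of the test set, so treat the result as illustrative.

```python
import math

# The six actual/predicted MEDV pairs printed above (predictions rounded)
actual = [16.1, 24.2, 15.2, 20.9, 11.8, 22.5]
predicted = [19.38, 24.12, 10.02, 20.71, 11.71, 17.25]

errors = [p - a for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)              # average miss
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))   # penalizes big misses more

# Both values are in the same unit as MEDV (thousands of dollars).
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

The gap between the two numbers comes from the few large misses in this sample, which squared error weights more heavily.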
