### Your name:

<pre> Ehsanul Haque</pre>

### Collaborators:

<pre> None</pre>


In [1]:
# Author: Ehsanul Haque, UofT ID: qq286343
# Course: 3253 Machine Learning
# Date: Feb 16, 2019

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Open the housing data


In [2]:
# Both 'housing.tgz' and 'housing.csv' are saved in the local directory, so there is no need to download them from GitHub
# All DEBUG code is commented out

import os
import tarfile

HOUSING_PATH = "."

def fetch_housing_data(housing_path=HOUSING_PATH):
    # Create the target directory if needed, then extract the archive into it
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()


def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

# Display first 5 rows in housing data
housing.head()





| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


#### Considerations for building pipeline:

- Make your notebook as compact as possible. 
- Split data into training and testing sets below.
- Convert all categorical data to one-hot vectors below
- Normalize all non-categorical data 
- Perform KNN regression via a grid search over n_neighbors (K) from 1 to 10 and both "uniform" and "distance" weights, where *housing_labels* is the output and all other features are the input (similar to what was shown in lecture two).

In [3]:
np.random.seed(42)

In [4]:
# Create a new feature 'income_cat' = median_income/1.5

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
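# Cap the categories at 5: any value of 5 or more becomes 5.0
# (assign the result back; an in-place update on a selected column is unreliable in pandas)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)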
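
To make the binning rule concrete, here is a minimal sketch on a handful of made-up income values (the `median_income` series below is hypothetical, not taken from the housing file). It applies the same ceil-and-cap logic as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical income values (tens of thousands of dollars), for illustration only
median_income = pd.Series([0.5, 1.5, 2.9, 4.6, 7.4, 15.0])

# Divide by 1.5 and round up, then cap everything at category 5
income_cat = np.ceil(median_income / 1.5)
income_cat = income_cat.where(income_cat < 5, 5.0)

print(income_cat.tolist())  # [1.0, 1.0, 2.0, 4.0, 5.0, 5.0]
```

Very high incomes all land in category 5, which keeps every stratum large enough to split.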

In [5]:
# Data split for training and test sets

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
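
One way to check that a stratified split behaves as intended is to compare category proportions in the test set against the full data. The sketch below does this on synthetic data; the `toy` frame and its lognormal incomes are assumptions made for illustration, not the real housing set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing frame (hypothetical data)
rng = np.random.RandomState(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5).clip(upper=5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["income_cat"]))

# Proportion of each income category overall and in the test set
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = toy.iloc[test_idx]["income_cat"].value_counts(normalize=True).sort_index()

# Largest absolute difference between the two distributions
max_gap = (overall - in_test).abs().max()
print(max_gap)
```

Note that `split.split` yields positional indices, so `.iloc` is used here; `.loc` gives the same rows only when the frame has a default 0..n-1 index.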

In [6]:
# Drop 'income_cat' feature from both test and training sets

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

In [7]:
# Drop labels for training set

housing = strat_train_set.drop("median_house_value", axis=1) 
housing_labels = strat_train_set["median_house_value"].copy()

In [8]:
# Imputer to replace missing values with median values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [9]:
# Drop 'ocean_proximity' as it is not numeric data

housing_num = housing.drop('ocean_proximity', axis=1)

In [None]:
# **** DEBUG CODE ****

#imputer.fit(housing_num)

In [None]:
# **** DEBUG CODE ****

#X = imputer.transform(housing_num)

In [None]:
# **** DEBUG CODE ****

#housing_tr = pd.DataFrame(X, columns=housing_num.columns)
#housing_tr.head()

In [10]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()

# **** DEBUG CODE: This part is to display one hot encoding values ****

"""
housing_cat = housing[['ocean_proximity']]
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot.toarray()
"""

In [11]:
from sklearn.preprocessing import FunctionTransformer

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]
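
As a quick sanity check of the ratio features, the following sketch runs the same arithmetic on a two-row toy array. The column positions and the values in `X_toy` are hypothetical and chosen only so the ratios are easy to verify by hand.

```python
import numpy as np

# Hypothetical column positions for a 4-column toy array
rooms_ix, bedrooms_ix, population_ix, household_ix = 0, 1, 2, 3

def add_ratio_features(X):
    """Append rooms/household, population/household and bedrooms/room columns."""
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
    return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]

X_toy = np.array([[880.0, 129.0, 322.0, 126.0],
                  [1000.0, 200.0, 500.0, 250.0]])
X_out = add_ratio_features(X_toy)

print(X_out.shape)   # (2, 7)
print(X_out[1, 4:])  # ratios for the second row: 4.0, 2.0 and 0.2
```

Three columns are appended, which is why the 8 numeric housing columns become 11 after the numeric pipeline.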

In [12]:
# Numeric Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
        # Add normalization
        #('min_max_scaler', MinMaxScaler())
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

In [13]:
# Display the shape of the data transformed by the numeric pipeline

housing_num_tr.shape


(16512, 11)

In [14]:
# Full pipeline: handles both the numeric columns and the categorical column

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

In [15]:
# Display the shape of the data transformed by the full pipeline

housing_prepared.shape

(16512, 16)

In [16]:
import warnings
warnings.filterwarnings('ignore')

In [17]:
# KNN Regression

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_neighbors': list(range(1, 11)),
      'weights': ['uniform','distance']}
  ]

knn_reg = KNeighborsRegressor()

grid_search = GridSearchCV(knn_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'weights': ['uniform', 'distance']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

In [18]:
# Display the cross-validated RMSE for every parameter combination

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)


76905.39172624913 {'n_neighbors': 1, 'weights': 'uniform'}
76905.39172624913 {'n_neighbors': 1, 'weights': 'distance'}
67378.54417965506 {'n_neighbors': 2, 'weights': 'uniform'}
67305.91676929075 {'n_neighbors': 2, 'weights': 'distance'}
64192.31926224855 {'n_neighbors': 3, 'weights': 'uniform'}
63982.89656472898 {'n_neighbors': 3, 'weights': 'distance'}
63127.87298667067 {'n_neighbors': 4, 'weights': 'uniform'}
62744.06339213252 {'n_neighbors': 4, 'weights': 'distance'}
62530.047471028614 {'n_neighbors': 5, 'weights': 'uniform'}
62084.49118070277 {'n_neighbors': 5, 'weights': 'distance'}
61972.76358941466 {'n_neighbors': 6, 'weights': 'uniform'}
61503.884150916645 {'n_neighbors': 6, 'weights': 'distance'}
61677.2190632811 {'n_neighbors': 7, 'weights': 'uniform'}
61191.195783780175 {'n_neighbors': 7, 'weights': 'distance'}
61475.85273417105 {'n_neighbors': 8, 'weights': 'uniform'}
60977.81068888092 {'n_neighbors': 8, 'weights': 'distance'}
61529.934208171966 {'n_neighbors': 9, 'weights

In [19]:
# Select best estimator

knn_reg_final = grid_search.best_estimator_
knn_reg_final

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=10, p=2,
          weights='distance')
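
The winning parameters and their cross-validated score can also be read directly from the fitted search object instead of scanning the printed list. The sketch below shows this on synthetic data (the `X` and `y` arrays are made up for illustration); `best_params_` and `best_score_` are standard `GridSearchCV` attributes.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (hypothetical data)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

param_grid = {"n_neighbors": list(range(1, 11)),
              "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root
best_cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_cv_rmse)
```

This RMSE is averaged over the held-out folds, so it is a cross-validation estimate rather than a score on the data the model was fitted to.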

In [20]:
from sklearn.metrics import mean_squared_error

In [None]:
# **** DEBUG CODE: This is to display rmse value with training data and labels ****

"""
knn_final_train_housing_labels = knn_reg_final.predict(housing_prepared)
knn_final_train_mse = mean_squared_error(housing_labels, knn_final_train_housing_labels)
knn_final_train_rmse = np.sqrt(knn_final_train_mse)
knn_final_train_rmse
"""


In [21]:
# Prepare test sets and labels

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

In [22]:
# Calculate rmse value with test sets and labels

knn_final_test_housing_labels = knn_reg_final.predict(X_test_prepared)
knn_final_test_mse = mean_squared_error(y_test, knn_final_test_housing_labels)
knn_final_test_rmse = np.sqrt(knn_final_test_mse)
knn_final_test_rmse

58302.326349180155

In [23]:
# Linear regressor

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [None]:
# **** DEBUG CODE: Calculate rmse value with training sets and labels ****

"""
lin_train_housing_labels = lin_reg.predict(housing_prepared)
lin_train_mse = mean_squared_error(housing_labels, lin_train_housing_labels)
lin_train_rmse = np.sqrt(lin_train_mse)
lin_train_rmse
"""

In [24]:
# Calculate rmse with test sets and labels

lin_test_housing_labels = lin_reg.predict(X_test_prepared)
lin_test_mse = mean_squared_error(y_test, lin_test_housing_labels)
lin_test_rmse = np.sqrt(lin_test_mse)
lin_test_rmse

66911.98070857547

### Conclusions
For what values of n_neighbors and weight does KNeighborsRegressor perform the best? Does it perform as well on the housing data as the linear regressor from the lectures? Why do you think this is?

<pre>1. With KNN and StandardScaler() scaling, the grid search selects the following parameters:
         
         np.sqrt(-mean_score) = 60871.45102608848 [cross-validated on the training data] 
         params = {'n_neighbors': 10, 'weights': 'distance'}
         
         The final RMSE of this model on the test data and labels is 58302.326349180155

2. With KNN and MinMaxScaler() scaling, the grid search selects the following parameters:
         
         np.sqrt(-mean_score) = 61515.453256223416 [cross-validated on the training data] 
         params = {'n_neighbors': 8, 'weights': 'distance'}
         
         The final RMSE of this model on the test data and labels is 60327.04153954014

3. Since the first KNN model with StandardScaler() gives a lower RMSE [both in cross-validation and on 
   the test data], I have selected it as my final model. 

4. With the linear regressor, the final RMSE on the test data and labels is 66911.98070857547

5. The final test RMSE with KNN (58302.33) is lower than with the linear regressor (66911.98). So, for 
   this data set, KNN is the better model.</pre>
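
The scaler comparison in items 1 and 2 could also be folded into the grid search itself rather than rerunning the notebook with a different scaler. The following is a minimal sketch of that idea, assuming a `Pipeline` whose `scaler` step is treated as a searchable parameter; the step names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales (hypothetical data)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y = X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("knn", KNeighborsRegressor())])

# The whole "scaler" step is swapped out as one of the grid dimensions
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "knn__n_neighbors": [2, 4, 6, 8, 10],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, so the two scalers are compared under the same cross-validation protocol.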
         

### Read Appendix B

- Reflect on your last data project, read appendix B. Then, write down a few of the checklist items that your last data project could have used. If you have not yet done a data project, then write down a few of the items that you found most interesting.

I have not done a data project before; Assignment 2 of this course is my first. After reading Appendix B, I found a few steps that might have benefited my project:

1. Exploring the data to gain insights: I could have spent more time and effort exploring the data. This would have helped me identify and eliminate features in the data set that are not important.

2. Exploring many different models and shortlisting the best ones: I tried two models (KNN and linear regression) to come up with a solution. Had I tried a few more models, I might have found a better solution.

3. Combining models: I used KNN as my final model. Combining multiple models might have given better results.
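
As an illustration of item 3, one simple way to combine models is to average their predictions. The sketch below assumes scikit-learn 0.21 or later, where `VotingRegressor` is available (the notebook output above suggests an older version), and uses synthetic data rather than the housing set.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic target with a linear part and a nonlinear part (hypothetical data)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# Unweighted average of a KNN regressor and a linear regressor
ensemble = VotingRegressor([
    ("knn", KNeighborsRegressor(n_neighbors=10, weights="distance")),
    ("lin", LinearRegression()),
])
ensemble.fit(X[:240], y[:240])
pred = ensemble.predict(X[240:])

rmse = np.sqrt(np.mean((y[240:] - pred) ** 2))
print(rmse)
```

Whether such an average would beat KNN alone on the housing data is not something this sketch shows; it would need to be checked with cross-validation.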

### Submit your notebook

Submit your solution to Quercus
Make sure you rename your notebook to    
W2_UTORid.ipynb    
Example W2_adfasd01.ipynb
