## Assignment for Module 6

In this assignment you will continue working with the per-district housing price data from the previous module's assignment, this time training SVM models for both regression and classification.

#### Getting the data for the assignment (similar to the notebook from chapter 2 of Hands-On...)

In [1]:
import os
import tarfile
import urllib.request  # standard library; replaces the Python 2 compatibility import from six

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [2]:
fetch_housing_data()

In [3]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [4]:
housing = load_housing_data()

### Fix the categories in the categorical variable

In [5]:
d = {'<1H OCEAN':'LESS_1H_OCEAN', 'INLAND':'INLAND', 'ISLAND':'ISLAND', 'NEAR BAY':'NEAR_BAY', 'NEAR OCEAN':'NEAR_OCEAN'}
housing['ocean_proximity'] = housing['ocean_proximity'].map(lambda s: d[s])

### Add 2 more features

In [6]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["population_per_household"]=housing["population"]/housing["households"]

### Fix missing data

In [7]:
median = housing["total_bedrooms"].median()
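# assign the result back; in-place fillna on a selected column is deprecated in recent pandas
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)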

### Create dummy variables based on the categorical variable

In [8]:
one_hot = pd.get_dummies(housing['ocean_proximity'])
housing = housing.drop('ocean_proximity', axis=1)
housing = housing.join(one_hot)

### Check the data

In [9]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 20640 non-null  float64
 1   latitude                  20640 non-null  float64
 2   housing_median_age        20640 non-null  float64
 3   total_rooms               20640 non-null  float64
 4   total_bedrooms            20640 non-null  float64
 5   population                20640 non-null  float64
 6   households                20640 non-null  float64
 7   median_income             20640 non-null  float64
 8   median_house_value        20640 non-null  float64
 9   rooms_per_household       20640 non-null  float64
 10  population_per_household  20640 non-null  float64
 11  INLAND                    20640 non-null  uint8  
 12  ISLAND                    20640 non-null  uint8  
 13  LESS_1H_OCEAN             20640 non-null  uint8  
 14  NEAR_B

### Partition into train and test

Use train_test_split from sklearn.model_selection to partition the dataset into 70% for training and 30% for testing.

You can use the 70% for training set as both training and validation by using cross-validation.


In [10]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.3, random_state=42)

### Features

In [11]:
target = 'median_house_value'
features = list(train_set.columns)
features = [f for f in features if f!=target]

In [12]:
X_tr = train_set[features]
y_tr = train_set[[target]]

X_te = test_set[features]
y_te = test_set[[target]]

### Scaling features

Similarly, use StandardScaler from sklearn.preprocessing to standardize the training and testing data. Fit the scaler on the training data only, then apply the same transformation to both sets.

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)
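
The cell above fits the scaler once on the whole training set, so during cross-validation each validation fold has already contributed to the scaling statistics. One way to avoid that, sketched here on synthetic data rather than the housing set, is to wrap the scaler and the model in a pipeline so the scaler is re-fitted inside every fold. The names `X_demo`, `y_demo`, `scaled_svr` and the value `C=100.0` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing features (assumption: not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

# The scaler is re-fitted on the training part of each fold, so the
# validation part never influences the scaling statistics.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))

neg_mse = cross_val_score(scaled_svr, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=3)
pipeline_rmse = np.sqrt(-neg_mse)
print("RMSE per fold:", pipeline_rmse)
```

When such a pipeline is passed to GridSearchCV, the SVR hyper-parameters are addressed with the step-name prefix, for example `svr__C` and `svr__gamma`.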

#### Comparing models

In [14]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())

### Linear regression on original features (no transformations) --- benchmark

In [15]:
from sklearn.linear_model import LinearRegression
lin_scores = cross_val_score(LinearRegression(), train_set[features], train_set[target], scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [70142.55721218 67456.39127204 67318.3258893  70866.26065275]
Mean: 68945.88375656861


### 1. Support Vector Machines for Regression

#### (a) In this exercise your goal is to tune SVR with an RBF kernel so that the average RMSE (the square root of mean_squared_error) over 3 folds (cv=3) is below 58000.

You are encouraged to try optimizing any of the hyper-parameters of SVR

See http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html for more details

However, as a hint, you can focus on C and gamma. 

Hint 2: if, when you try different values for a hyper-parameter, the optimal model corresponds to one of the extreme values in your range, you can probably keep improving your solution by considering values beyond the current range.



In [44]:
?SVR

Init signature: SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
Docstring:
Epsilon-Support Vector Regression.

The free parameters in the model are C and epsilon.

The implementation is based on libsvm.

Read more in the :ref:`User Guide <svm_regression>`.

Parameters
----------
C : float, optional (default=1.0)
    Penalty parameter C of the error term.

epsilon : fl

In [73]:
from sklearn.svm import SVR

C_vals = [30000, 40000, 50000, 60000] ## YOUR VALUES FOR C ##
gamma_vals = [0.05, 0.1, 0.5, 1] ## YOUR VALUES FOR gamma ## 

svr_model = SVR(kernel='rbf',
                     max_iter=-1, shrinking=True,
                     verbose=False)

param_grid = [{'C':C_vals, 
               'gamma':gamma_vals,
               'tol': [50, 100, 125, 150, 200, 250]}]

In [41]:
grid_search_rbf = GridSearchCV(svr_model, param_grid, cv=3,scoring='neg_mean_squared_error', n_jobs = -1, verbose=1)
grid_search_rbf.fit(X_tr, np.ravel(y_tr))

Fitting 3 folds for each of 96 candidates, totalling 288 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 288 out of 288 | elapsed:  8.6min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3,
                           epsilon=0.1, gamma='scale', kernel='rbf',
                           max_iter=-1, shrinking=True, tol=0.001,
                           verbose=False),
             iid='deprecated', n_jobs=-1,
             param_grid=[{'C': [30000, 40000, 50000, 60000],
                          'gamma': [0.05, 0.1, 0.5, 1],
                          'tol': [50, 100, 125, 150, 200, 250]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=1)

In [42]:
print(grid_search_rbf.best_params_)
print(np.sqrt(-grid_search_rbf.best_score_))
# {'C': 25000, 'gamma': 0.1, 'tol': 200}
# 59149.948792414776

{'C': 60000, 'gamma': 0.1, 'tol': 200}
58018.248457297384
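
Hint 2 can be checked mechanically. The helper below is a hypothetical sketch (`params_on_grid_edge` is not a scikit-learn function) that reports which tuned parameters landed on the smallest or largest value of their grid. It is applied to the grid and best parameters printed above.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return the parameters whose best value is the min or max of its grid."""
    on_edge = []
    for name, values in param_grid.items():
        if len(values) < 2:
            continue  # a single candidate has no interior to compare against
        if best_params[name] in (min(values), max(values)):
            on_edge.append(name)
    return on_edge


# Grid and result copied from the search above.
searched_grid = {"C": [30000, 40000, 50000, 60000],
                 "gamma": [0.05, 0.1, 0.5, 1],
                 "tol": [50, 100, 125, 150, 200, 250]}
found_best = {"C": 60000, "gamma": 0.1, "tol": 200}

edge_params = params_on_grid_edge(found_best, searched_grid)
print(edge_params)  # ['C']
```

Here only `C` sits on the boundary of its range, which suggests that a follow-up search with larger values of `C` may lower the error further.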


In [36]:
from sklearn.model_selection import cross_val_score

svr_model = SVR(kernel='rbf',C= 25000, gamma= 0.1, tol=200, 
#                 cache_size=200, max_iter=-1, shrinking=True, 
                     verbose=1 )
cval = cross_val_score(svr_model, X_tr, np.ravel(y_tr), scoring="neg_mean_squared_error", cv=3)
cval_score = np.sqrt(-cval)
display_scores(cval_score)

[LibSVM][LibSVM][LibSVM]Scores: [59825.79281889 58743.7226973  60776.48975222]
Mean: 59782.00175613793


### Performance on Test Set

In [57]:
from sklearn.metrics import mean_squared_error

final_model = grid_search_rbf.best_estimator_   ## THIS SHOULD BE THE BEST GRID_SEARCH ##

y_te_estimation = final_model.predict(X_te)

final_mse = mean_squared_error(y_te, y_te_estimation)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

57267.121101569224


### 2. SVM for Classification

Now we transform the continuous target into a binary variable indicating whether or not the price is at or above the median, $179700.


In [26]:
from sklearn.metrics import accuracy_score

In [27]:
np.median(housing[['median_house_value']])

179700.0

#### Binary target variable

In [28]:
y_tr_b = 1*np.ravel(y_tr>=179700.0)
y_te_b = 1*np.ravel(y_te>=179700.0)
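
Thresholding at the median splits the data into two classes of nearly equal size, which is what makes plain accuracy a reasonable metric here. A minimal sketch on made-up prices (the `prices` array is synthetic, an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed price distribution, not the housing data.
prices = rng.lognormal(mean=12.0, sigma=0.5, size=1001)

threshold = np.median(prices)
labels = (prices >= threshold).astype(int)

# Fraction of positives; close to 0.5 by construction of the median.
positive_share = labels.mean()
print(positive_share)
```

With strongly imbalanced classes a trivial classifier could already score high accuracy, so the same threshold logic would then call for a different metric.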

#### Linear SVM for classification

In [29]:
from sklearn.svm import LinearSVC

In [30]:
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_tr, y_tr_b)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
          verbose=0)

In [37]:
y_pred = lin_clf.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.8385935769656699

### (a) Does SVC (with default hyper-parameters) improve the performance of the linear SVM?

In [43]:
from sklearn.svm import SVC

In [44]:
clf = SVC(random_state=42)
clf.fit(X_tr, y_tr_b)
y_pred = clf.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.866140642303433

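Yes. With default hyper-parameters, SVC reaches a training accuracy of 0.866, compared with 0.839 for the linear SVM. Both figures are measured on the training set, so on their own they do not show that SVC generalizes better.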
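
The accuracies above are computed on the same data the models were fitted on. A fairer comparison uses cross-validated accuracy. The sketch below shows the idea on a synthetic binary problem; `X_demo`, `y_demo` and the raised `max_iter` are assumptions for illustration, and in the notebook `X_tr` and `y_tr_b` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Synthetic binary classification problem (assumption: not the housing data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=42)

cv_accuracy = {}
candidates = [("LinearSVC", LinearSVC(random_state=42, max_iter=10000)),
              ("SVC (rbf)", SVC(random_state=42))]
for name, model in candidates:
    # Each fold is scored on data the model did not see during fitting.
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=5)
    cv_accuracy[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```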

### (b) Use randomized search to tune hyper-parameters of SVC and improve its performance

In [45]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

In [46]:
?RandomizedSearchCV

Init signature:
RandomizedSearchCV(
    estimator,
    param_distributions,
    n_iter=10,
    scoring=None,
    n_jobs=None,
    iid='deprecated',
    refit=True,
    cv=None,
    verbose=0,
    pre_dispatch='2*n_jobs',
    random_state=None,
    error_score=nan,

In [83]:
# reusing gamma_vals from the SVR search
clf = SVC(random_state=42)
# C must be strictly positive, so 0 is not a valid candidate
param_grid = [{'C':[0.1, 0.5, 1, 10, 100, 150, 200, 1000], 
               'gamma':gamma_vals,
               'tol': [0.001, 0.1, 1, 10, 50, 100, 125, 150, 200, 250]
              }]
# accuracy is the natural score for a classifier; on 0/1 labels it ranks
# candidates the same way neg_mean_squared_error does
rand_svc = RandomizedSearchCV(clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)

rand_svc.fit(X_tr, y_tr_b)
rand_svc.best_params_
# {'tol': 100, 'gamma': 0.5, 'decision_function_shape': 'ovo', 'C': 60000}

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    8.0s finished


{'tol': 1, 'gamma': 1, 'C': 0.1}
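
The search above samples from fixed lists. Randomized search can also draw from continuous distributions, which is presumably why `reciprocal` and `uniform` were imported. The sketch below uses `reciprocal` (scipy's log-uniform distribution) for `C` and `gamma` on a synthetic problem. The ranges, `n_iter=10` and the data are illustrative assumptions, not recommended settings for the housing task.

```python
from scipy.stats import reciprocal
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic binary classification problem (assumption: not the housing data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=42)

# Log-uniform sampling spreads candidates evenly across orders of magnitude.
param_distributions = {
    "C": reciprocal(1e-1, 1e3),
    "gamma": reciprocal(1e-3, 1e0),
}

search = RandomizedSearchCV(SVC(kernel="rbf", random_state=42),
                            param_distributions, n_iter=10, cv=3,
                            scoring="accuracy", random_state=42)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

Setting `random_state` on the search makes the sampled candidates reproducible between runs.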

In [84]:
# refit with the parameters found by the randomized search above
best_clf = SVC(kernel='rbf', gamma=1, C=0.1, tol=1, random_state=42)
best_clf.fit(X_tr, y_tr_b)
y_pred = best_clf.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.8666943521594684

In [85]:
best_clf.get_params

<bound method BaseEstimator.get_params of SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=42, shrinking=True, tol=1, verbose=False)>

### (c) Train a Logistic Regression (search for the best hyper-parameters) and compare its performance with SVC

In [89]:
from sklearn.linear_model import LogisticRegression

lreg = LogisticRegression(random_state=42)
# C must be strictly positive, so 0 is not a valid candidate
param_grid = [{'C':[0.1, 0.5, 1, 10, 100, 150, 200, 1000], 
               'tol': [0.001, 0.1, 1, 10, 50, 100, 125, 150, 200, 250]}]

# accuracy is the natural score for a classifier; on 0/1 labels it ranks
# candidates the same way neg_mean_squared_error does
rand_lreg = RandomizedSearchCV(lreg, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
rand_lreg.fit(X_tr, y_tr_b)
rand_lreg.best_params_


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    2.0s finished


{'tol': 10, 'C': 10}

In [90]:
# refit with the parameters found by the randomized search above
best_lreg = LogisticRegression(C=10, tol=10, random_state=42)
best_lreg.fit(X_tr, y_tr_b)
y_pred = best_lreg.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.8383859357696567

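Training accuracy is 0.8667 for the tuned SVC vs 0.8384 for LogisticRegression, so the SVC fits the training data slightly better. Both figures are measured on the training set rather than on held-out data.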
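
To see whether such a gap carries over to unseen data, both models can be scored on a held-out split as well. This is a minimal sketch on synthetic data with default hyper-parameters; every name in it (`X_demo`, `demo_scaler`, `results`) is hypothetical. In the notebook the same pattern would apply `best_clf` and `best_lreg` to `X_te` and `y_te_b`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (assumption: not the housing data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then transform both splits.
demo_scaler = StandardScaler().fit(X_train)
X_train_s = demo_scaler.transform(X_train)
X_test_s = demo_scaler.transform(X_test)

results = {}
models = [("SVC", SVC(kernel="rbf", random_state=42)),
          ("LogisticRegression", LogisticRegression(random_state=42))]
for name, model in models:
    model.fit(X_train_s, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train_s))
    test_acc = accuracy_score(y_test, model.predict(X_test_s))
    results[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

A large difference between the two columns for one model would point to overfitting rather than a genuinely better classifier.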