## Assignment for Module 6

In this assignment you will continue working with the per-district housing price data from the previous module's assignment, this time training SVM models for both regression and classification.

#### Getting the data for the assignment (similar to the notebook from chapter 2 of Hands-On...)

In [1]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if needed, then download and unpack the archive.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [2]:
fetch_housing_data()

In [3]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [4]:
housing = load_housing_data()

In [5]:
housing.shape

(20640, 10)

### Fix the categories in the categorical variable

In [6]:
d = {'<1H OCEAN':'LESS_1H_OCEAN', 'INLAND':'INLAND', 'ISLAND':'ISLAND', 'NEAR BAY':'NEAR_BAY', 'NEAR OCEAN':'NEAR_OCEAN'}
housing['ocean_proximity'] = housing['ocean_proximity'].map(lambda s: d[s])

### Add 2 more features

In [7]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["population_per_household"]=housing["population"]/housing["households"]

### Fix missing data

In [8]:
median = housing["total_bedrooms"].median()
# Assign the result back instead of using inplace=True on a column selection.
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

### Create dummy variables based on the categorical variable

In [9]:
one_hot = pd.get_dummies(housing['ocean_proximity'])
housing = housing.drop('ocean_proximity', axis=1)
housing = housing.join(one_hot)

### Check the data

In [10]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 16 columns):
longitude                   20640 non-null float64
latitude                    20640 non-null float64
housing_median_age          20640 non-null float64
total_rooms                 20640 non-null float64
total_bedrooms              20640 non-null float64
population                  20640 non-null float64
households                  20640 non-null float64
median_income               20640 non-null float64
median_house_value          20640 non-null float64
rooms_per_household         20640 non-null float64
population_per_household    20640 non-null float64
INLAND                      20640 non-null uint8
ISLAND                      20640 non-null uint8
LESS_1H_OCEAN               20640 non-null uint8
NEAR_BAY                    20640 non-null uint8
NEAR_OCEAN                  20640 non-null uint8
dtypes: float64(11), uint8(5)
memory usage: 1.8 MB


### Partition into train and test

Use train_test_split from sklearn.model_selection to partition the dataset into 70% for training and 30% for testing.

The 70% training portion can serve for both training and validation if you use cross-validation.


In [11]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.3, random_state=42)
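
As a rough illustration of how cross-validation reuses the training portion for validation, the sketch below splits a small synthetic array 70/30 and then runs 3-fold splitting on the training part only. The toy data and the names X_demo, y_demo and val_counts are assumptions made for this example; they are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data (illustrative only): 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Hold out 30% as the test set; it is not touched during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Count how often each training row lands in a validation fold.
val_counts = np.zeros(len(X_train), dtype=int)
for fit_idx, val_idx in KFold(n_splits=3).split(X_train):
    val_counts[val_idx] += 1

print(len(X_train), len(X_test))   # sizes of the two partitions
print(np.unique(val_counts))       # how many times each training row was validated
```

Every training row is used for validation in exactly one fold and for fitting in the others, so no separate validation set needs to be carved out.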

### Features

In [12]:
target = 'median_house_value'
features = list(train_set.columns)
features = [f for f in features if f!=target]

In [13]:
X_tr = train_set[features]
y_tr = train_set[[target]]

X_te = test_set[features]
y_te = test_set[[target]]

### Scaling features

Use StandardScaler from sklearn.preprocessing to standardize the training and testing data, fitting the scaler on the training data only.

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_tr)
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)
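
To make the "fit on training data only" step concrete, here is a minimal NumPy sketch of what the scaler does, assuming synthetic data. The arrays X_train_demo and X_test_demo are invented for illustration; the point is that the test rows are transformed with the training mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Statistics come from the training rows only.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)   # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_demo - mu) / sigma
X_test_scaled = (X_test_demo - mu) / sigma   # reuse the training mu and sigma

print(X_train_scaled.mean(axis=0))   # numerically zero for the training data
print(X_test_scaled.mean(axis=0))    # close to zero, but not exactly
```

The test set means are only approximately zero, which is expected: the test data must not influence the transformation.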



#### Comparing models

In [15]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import numpy as np

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())

### Linear regression on original features (no transformations) --- benchmark

In [16]:
from sklearn.linear_model import LinearRegression
lin_scores = cross_val_score(LinearRegression(), train_set[features], train_set[target], scoring="neg_mean_squared_error", cv=4)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [70142.55721218 67456.39127204 67318.3258893  70866.26065275]
Mean: 68945.88375656874
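
The conversion from the negated MSE scores to RMSE can be checked with a small sketch. It starts from the four per-fold RMSE values printed above and rebuilds the negative MSE scores that scoring="neg_mean_squared_error" would return (scikit-learn negates errors so that higher is always better). The variable names are assumptions for this example.

```python
import numpy as np

# Per-fold RMSE values as printed above; squared and negated to mimic
# what scoring="neg_mean_squared_error" returns.
fold_rmse = np.array([70142.55721218, 67456.39127204, 67318.3258893, 70866.26065275])
neg_mse_scores = -(fold_rmse ** 2)

mean_of_rmse = np.sqrt(-neg_mse_scores).mean()   # what display_scores reports
rmse_of_mean = np.sqrt(-neg_mse_scores.mean())   # square root of the averaged MSE

print(mean_of_rmse)
print(rmse_of_mean)
```

The two summaries are close but not identical: the square root of the averaged MSE is never smaller than the average of the per-fold RMSE values. Taking np.sqrt of a negated mean score, as is commonly done with a grid search's best_score_, yields the second quantity.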


### 1. Support Vector Machines for Regression

#### (a) In this exercise your goal is to tune SVR with an RBF kernel so that the average RMSE (the square root of the mean_squared_error score) over 3 folds (cv=3) falls below 58000.

You are encouraged to try optimizing any of the hyper-parameters of SVR

See http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html for more details

However, as a hint, you can focus on C and gamma. 

Hint 2: if the best model corresponds to one of the extreme values in the range you tried for a hyper-parameter, you can probably keep improving your solution by trying values beyond that range.
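
To give some intuition for the hint about gamma, the sketch below evaluates the RBF kernel, exp(-gamma * ||x - z||^2), for one pair of points and several gamma values. The helper rbf_kernel_value and the two sample points are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = [0.0, 0.0]
z = [1.0, 1.0]    # squared distance to x is 2
for gamma in [0.01, 0.1, 0.5, 1, 5]:
    print(gamma, rbf_kernel_value(x, z, gamma))
```

Larger gamma makes the similarity between two points decay faster with distance, so each training point influences a smaller neighbourhood and the fitted function can become more irregular. C works in the other dimension: larger C penalizes training errors more heavily, which means weaker regularization.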



In [17]:
from sklearn.svm import SVR

C_vals = [100,1000,10000,100000] ## YOUR VALUES FOR C ##
gamma_vals = [.01,.1,.5,1,5] ## YOUR VALUES FOR gamma ## 

param_grid = [{'C':C_vals, 'gamma':gamma_vals}]
grid_search_rbf = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=3,scoring='neg_mean_squared_error')
grid_search_rbf.fit(X_tr, np.ravel(y_tr))

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [100, 1000, 10000, 100000], 'gamma': [0.01, 0.1, 0.5, 1, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [18]:
print(grid_search_rbf.best_params_)
print(np.sqrt(-grid_search_rbf.best_score_))

{'C': 100000, 'gamma': 0.1}
57255.36016378938


After running the grid search, the best hyper-parameters are C = 100000 and gamma = 0.1.

With these parameters, the cross-validated RMSE is 57255.36.
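
Hint 2 can be applied to this result with a small check of whether a selected value sits at the end of its search range. The helper grid_edge below is hypothetical (it is not part of scikit-learn); the grids and best parameters are copied from the cells above.

```python
def grid_edge(values, best):
    """Return 'lower' or 'upper' if best sits at an end of the sorted grid, else None."""
    ordered = sorted(values)
    if best == ordered[0]:
        return "lower"
    if best == ordered[-1]:
        return "upper"
    return None

C_vals = [100, 1000, 10000, 100000]
gamma_vals = [.01, .1, .5, 1, 5]
best_params = {'C': 100000, 'gamma': 0.1}   # copied from the output above

print(grid_edge(C_vals, best_params['C']))
print(grid_edge(gamma_vals, best_params['gamma']))
```

Here the selected C is the largest value tried while gamma is interior, so extending the C range upward is the natural next experiment, at the cost of longer training times.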

### Performance on Test Set

In [19]:
from sklearn.metrics import mean_squared_error

final_model = grid_search_rbf.best_estimator_   ## THIS SHOULD BE THE BEST GRID_SEARCH ##

y_te_estimation = final_model.predict(X_te)

final_mse = mean_squared_error(y_te, y_te_estimation)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

56466.15041953526


On the test set, this model has an RMSE of 56466.15.

### 2. SVM for Classification

Now we transform the continuous target into a binary variable indicating whether or not the price is at or above the median, $179700.


In [20]:
from sklearn.metrics import accuracy_score

In [21]:
np.median(housing[['median_house_value']])

179700.0

#### Binary target variable

In [22]:
y_tr_b = 1*np.ravel(y_tr>=179700.0)
y_te_b = 1*np.ravel(y_te>=179700.0)
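
The thresholding idiom used above can be tried on a toy column vector. The prices array below is invented for illustration; the second form is an equivalent that some readers may find more explicit.

```python
import numpy as np

prices = np.array([[90000.0], [179700.0], [250000.0], [150000.0]])  # toy column vector
threshold = 179700.0

labels_a = 1 * np.ravel(prices >= threshold)          # idiom used in the cell above
labels_b = (prices.ravel() >= threshold).astype(int)  # equivalent, more explicit

print(labels_a)
print(labels_b)
```

Because the threshold is the median of the target, the two classes are roughly balanced, so plain accuracy is a reasonable metric and a trivial classifier would score close to 0.5.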

#### Linear SVM for classification

In [23]:
from sklearn.svm import LinearSVC

In [24]:
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_tr, y_tr_b)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0)

In [25]:
y_pred = lin_clf.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.8384551495016611

Accuracy on the training set is .8384

In [26]:
y_pred_te = lin_clf.predict(X_te)
accuracy_score(y_te_b, y_pred_te)

0.8394702842377261

Accuracy on the test set is .8394

### (a) Does SVC (with default hyper-parameters) improve the performance of the linear SVM?

In [27]:
from sklearn.svm import SVC

In [28]:
## YOUR CODE HERE ##

clf = SVC(kernel='rbf',gamma='auto')
clf.fit(X_tr, y_tr_b)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [29]:
y_pred = clf.predict(X_tr)
accuracy_score(y_tr_b, y_pred)

0.866140642303433

On the training set, SVC with default hyper-parameters has accuracy .8661

In [30]:
y_test_pred = clf.predict(X_te)
accuracy_score(y_te_b, y_test_pred)

0.8624031007751938

SVC with default hyper-parameters is an improvement over the linear SVM: its test accuracy is .8624, compared with .8394 for LinearSVC.
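
To illustrate why a kernel SVM can outperform a linear one, the sketch below compares the two on a synthetic dataset of concentric circles, where no straight line separates the classes. The dataset and the names lin_demo and rbf_demo are assumptions for this example and say nothing about the housing data itself.

```python
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Two concentric rings: a case where a linear boundary cannot work well.
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

lin_demo = LinearSVC(random_state=42).fit(X_a, y_a)
rbf_demo = SVC(kernel='rbf', gamma='auto').fit(X_a, y_a)

lin_acc = accuracy_score(y_b, lin_demo.predict(X_b))
rbf_acc = accuracy_score(y_b, rbf_demo.predict(X_b))
print(lin_acc, rbf_acc)
```

On the housing data the gap is much smaller (.8394 versus .8624), which suggests the classes there are already close to linearly separable in the scaled feature space.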

### (b) Use randomized search to tune hyper-parameters of SVC and improve its performance

In [31]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

In [36]:
## YOUR CODE HERE ##

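# reciprocal(a, b) is log-uniform on [a, b]; scipy's uniform(loc, scale) covers
# [loc, loc + scale], so C is drawn from [100, 1100].
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(100, 1000)}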
rnd_search_cv = RandomizedSearchCV(clf, param_distributions, n_iter=10, verbose=2)
rnd_search_cv.fit(X_tr, y_tr_b)




Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] C=239.61300573046918, gamma=0.0056810074794576195 ...............


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  C=239.61300573046918, gamma=0.0056810074794576195, total=   4.7s
[CV] C=239.61300573046918, gamma=0.0056810074794576195 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.9s remaining:    0.0s


[CV]  C=239.61300573046918, gamma=0.0056810074794576195, total=   4.5s
[CV] C=239.61300573046918, gamma=0.0056810074794576195 ...............
[CV]  C=239.61300573046918, gamma=0.0056810074794576195, total=   4.0s
[CV] C=1086.479308854202, gamma=0.002450473147799242 .................
[CV] .. C=1086.479308854202, gamma=0.002450473147799242, total=   5.3s
[CV] C=1086.479308854202, gamma=0.002450473147799242 .................
[CV] .. C=1086.479308854202, gamma=0.002450473147799242, total=   5.6s
[CV] C=1086.479308854202, gamma=0.002450473147799242 .................
[CV] .. C=1086.479308854202, gamma=0.002450473147799242, total=   5.1s
[CV] C=1018.8304697721801, gamma=0.028014887342307688 ................
[CV] . C=1018.8304697721801, gamma=0.028014887342307688, total=  11.7s
[CV] C=1018.8304697721801, gamma=0.028014887342307688 ................
[CV] . C=1018.8304697721801, gamma=0.028014887342307688, total=  11.3s
[CV] C=1018.8304697721801, gamma=0.028014887342307688 ................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  3.9min finished


RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000002CEFA285F8>, 'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000002CEFA28B00>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=2)
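
The two sampling distributions used in the search above can be inspected directly. This is a small sketch using the same scipy.stats objects; the sample size and random_state are arbitrary choices for illustration.

```python
from scipy.stats import reciprocal, uniform

gamma_dist = reciprocal(0.001, 0.1)   # log-uniform between 0.001 and 0.1
C_dist = uniform(100, 1000)           # uniform(loc, scale): covers [100, 1100]

gamma_samples = gamma_dist.rvs(size=1000, random_state=0)
C_samples = C_dist.rvs(size=1000, random_state=0)

print(gamma_samples.min(), gamma_samples.max())
print(C_samples.min(), C_samples.max())
print(gamma_dist.ppf(0.5))   # median of the log-uniform distribution
```

A log-uniform distribution spreads samples evenly across orders of magnitude, which suits a scale parameter such as gamma. Note that the second argument of uniform is a width, not an upper bound, so the range for C ends at 1100 rather than 1000.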

In [37]:
rnd_search_cv.best_estimator_

SVC(C=1018.8304697721801, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.028014887342307688,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [38]:
y_pred_rnd_Classifier = rnd_search_cv.predict(X_tr)
accuracy_score(y_tr_b, y_pred_rnd_Classifier)

0.8971483942414175

Accuracy is .8971 on training set

In [39]:
final_clf = rnd_search_cv.best_estimator_   ## THIS IS THE BEST MODEL FROM THE RANDOMIZED SEARCH ##

y_te_pred_final = final_clf.predict(X_te)
accuracy_score(y_te_b, y_te_pred_final)

0.8770994832041343

Accuracy is .8771 on the test set, an improvement over SVC with default parameters (accuracy .8624).

### (c) Train a Logistic Regression (search for the best hyper-parameters) and compare its performance with SVC

In [40]:
## YOUR CODE HERE ##
from sklearn.linear_model import LogisticRegression

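C_value = [.001,.01,1,10,100,1000,100000]
results = []

# Fit one model per value of C and record its accuracy.
# Note: the accuracy is measured on the test set, so C is effectively selected on test data.
for C in C_value:
  log_reg = LogisticRegression(C=C,random_state=42)
  log_reg.fit(X_tr, y_tr_b)
  log_reg_predict = log_reg.predict(X_te)
  result = accuracy_score(y_te_b, log_reg_predict)
  results.append(result)

# Print each accuracy next to the C value that produced it.
for acc, C in zip(results, C_value):
  print(acc, "with C =", C)

max(results)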



0.8063630490956072 with C = 0.001
0.8242894056847545 with C = 0.01
0.8402777777777778 with C = 1
0.8407622739018088 with C = 10
0.8407622739018088 with C = 100
0.8407622739018088 with C = 1000
0.8407622739018088 with C = 100000


0.8407622739018088

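With logistic regression, the best test accuracy is .8407, which is lower than that of either SVC model. Because C was chosen by comparing test-set accuracies, this figure is, if anything, slightly optimistic.

Logistic regression: accuracy .8407

SVC with default hyper-parameters: accuracy .8624

SVC with tuned hyper-parameters: accuracy .8771

Among all these models, SVC with tuned hyper-parameters has the highest test accuracy; it also has the highest training accuracy (.8971) among the SVM models.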
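
The logistic regression comparison above picks C by looking at test accuracy. A more conventional alternative is to select C by cross-validation on the training data and score the test set once at the end. The sketch below shows that pattern on synthetic data; the dataset, the cv=3 setting and max_iter=1000 are assumptions for illustration, and its numbers are unrelated to the housing results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Select C using 3-fold cross-validation on the training portion only.
param_grid = {'C': [.001, .01, 1, 10, 100, 1000, 100000]}
search = GridSearchCV(LogisticRegression(random_state=42, max_iter=1000),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_a, y_a)

# Score the held-out test set once, with the selected model.
test_acc = accuracy_score(y_b, search.best_estimator_.predict(X_b))
print(search.best_params_, search.best_score_)
print(test_acc)
```

With this setup the test accuracy remains an unbiased estimate, since the test rows played no part in choosing C.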