# Exercises

Using this chapter's housing dataset:

1. Try a support vector machine regressor (`sklearn.svm.SVR`) with various hyperparameters, such as `kernel = "linear"` (with various values for the `C` hyperparameter) or `kernel = "rbf"` (with various values for the `C` and `gamma` hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best `SVR` predictor perform?
2. Try replacing `GridSearchCV` with `RandomizedSearchCV`.
3. Try adding a transformer in the preparation pipeline to select only the most important attributes.
4. Try creating a single pipeline that does the full data preparation plus the final prediction.
5. Automatically explore some preparation options using `GridSearchCV`.

---

# 1.

In [48]:
import pandas as pd

housing = pd.read_csv("housing.csv")
housing

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [49]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [50]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size = 0.2, random_state = 31)
X_train = train_set.drop("median_house_value", axis = 1)
y_train = train_set["median_house_value"]
X_test = test_set.drop("median_house_value", axis = 1)
y_test = test_set["median_house_value"]

With the data split, check that the training and test sets have similar distributions so that the split itself doesn't skew model evaluation.

In [51]:
train_set.describe()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,16512.0,16512.0,16512.0,16512.0,16347.0,16512.0,16512.0,16512.0,16512.0
mean,-119.567325,35.628352,28.656734,2636.86422,537.933933,1425.368702,499.57225,3.876215,207153.583394
std,2.000261,2.135195,12.627148,2194.580225,422.98114,1141.720172,383.384752,1.899114,115603.351513
min,-124.35,32.54,1.0,2.0,1.0,6.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1446.0,295.0,788.0,279.0,2.5663,119550.0
50%,-118.5,34.26,29.0,2127.0,433.0,1166.0,408.0,3.5394,179800.0
75%,-118.0,37.72,37.0,3151.25,648.0,1730.0,606.0,4.7601,265300.0
max,-114.47,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [52]:
test_set.describe()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,4128.0,4128.0,4128.0,4128.0,4086.0,4128.0,4128.0,4128.0,4128.0
mean,-119.579222,35.645899,28.570494,2631.358527,537.616985,1425.908915,499.409399,3.848496,205664.750969
std,2.016778,2.13918,12.419076,2129.221411,414.989296,1094.782032,378.126352,1.90272,114567.095074
min,-124.26,32.56,1.0,6.0,2.0,3.0,2.0,0.4999,14999.0
25%,-121.8,33.9375,18.0,1451.0,297.0,786.0,281.0,2.5551,119900.0
50%,-118.48,34.26,29.0,2125.0,443.0,1164.5,415.0,3.51905,179500.0
75%,-118.02,37.71,37.0,3133.75,644.0,1699.0,600.0,4.6935,263300.0
max,-114.31,41.81,52.0,27870.0,5419.0,12153.0,4930.0,15.0001,500001.0
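
The two summaries above can also be compared numerically instead of by eye. The following is a minimal sketch on a small synthetic frame, not the housing data: the `demo_*` names and the generated columns are illustrative assumptions, and the relative difference of means is just one of several possible checks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000),
    "housing_median_age": rng.integers(1, 53, size=2000).astype(float),
})

demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=31)

# Relative difference between the test and train means, per column.
# Values close to 0 suggest the two splits are centred similarly.
rel_diff = (demo_test.mean() - demo_train.mean()).abs() / demo_train.mean().abs()
print(rel_diff)
```

The same expression should work on `train_set` and `test_set` once the non-numeric `ocean_proximity` column is dropped.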


The two splits look similar enough. Next, prepare the numeric and categorical features before training a model.

In [53]:
from sklearn.pipeline import Pipeline

# Numeric
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

rooms, bedrooms, population, households = [list(X_train.columns).index(col) 
                                           for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributes(BaseEstimator, TransformerMixin):
    """Append per-household ratio features to a NumPy array."""
    def __init__(self):
        pass  # no hyperparameters to store
    def fit(self, X, y = None):
        return self  # nothing to learn from the data
    def transform(self, X, y = None):
        rooms_per_household = X[:, rooms]/X[:, households]
        population_per_household = X[:, population]/X[:, households]
        bedrooms_per_household = X[:, bedrooms]/X[:, households]
        # Stack the three new columns to the right of the original ones
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_household]
        
num_pipeline = Pipeline([("imputer", SimpleImputer(strategy = "median")),
                         ("attribs_add", CombinedAttributes()),
                         ("std_scaler", StandardScaler())])
X_train_num = X_train.drop("ocean_proximity", axis = 1)
X_train_num_prepared = num_pipeline.fit_transform(X_train_num)

In [54]:
# Categorical
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([("encoder", OneHotEncoder())])
X_train_cat = X_train[["ocean_proximity"]]
X_train_cat_prepared = cat_pipeline.fit_transform(X_train_cat)

In [55]:
# Combine numerical & categorical pipelines
from sklearn.compose import ColumnTransformer

num_attributes = list(X_train_num)
cat_attributes = list(X_train_cat)

full_pipeline = ColumnTransformer([("numeric", num_pipeline, num_attributes),
                                   ("categorical", cat_pipeline, cat_attributes)])
X_train_prepared = full_pipeline.fit_transform(X_train)
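
To see how `ColumnTransformer` stitches the two branches together, a quick shape check on a toy frame can help. This is a minimal sketch under stated assumptions: the `toy_*` names and the four-row frame are made up for illustration, and the custom attribute step is left out to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (assumption): two numeric columns, one categorical with 3 levels.
toy = pd.DataFrame({
    "rooms": [4.0, 6.0, None, 8.0],
    "income": [2.5, 3.0, 4.5, 1.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_num = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("std_scaler", StandardScaler())])
toy_prep = ColumnTransformer([("numeric", toy_num, ["rooms", "income"]),
                              ("categorical", OneHotEncoder(), ["proximity"])])

toy_prepared = toy_prep.fit_transform(toy)
# 2 scaled numeric columns + 3 one-hot columns, in that order
print(toy_prepared.shape)
```

The numeric columns come first because the transformers are applied and concatenated in the order they are listed.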

With the prepared data, we can now try a support vector machine regressor and search for its best hyperparameters.

In [56]:
# Search for the best combination of hyperparameters.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

supportVectorRegressor = SVR()
param_search_space = [{"kernel":["linear"], "C":[10, 20, 30, 40, 50]},
                      {"kernel":["rbf"], "C":[10, 20, 30, 40, 50], "gamma":["scale", "auto"]}]
grid_search = GridSearchCV(supportVectorRegressor, param_search_space,
                           cv = 3, scoring = "neg_mean_squared_error",
                           return_train_score = True)
grid_search.fit(X_train_prepared, y_train)
grid_search.best_params_

{'C': 1.0, 'kernel': 'linear'}
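
Beyond `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which is useful for seeing how close the runners-up were. A minimal sketch, assuming synthetic data from `make_regression` and a deliberately tiny grid; the `demo_*` names are illustrative and the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression problem (assumption: stands in for the prepared data).
X_demo, y_demo = make_regression(n_samples=150, n_features=4,
                                 noise=5.0, random_state=0)

demo_grid = GridSearchCV(SVR(),
                         [{"kernel": ["linear"], "C": [1, 10]},
                          {"kernel": ["rbf"], "C": [1, 10], "gamma": ["scale"]}],
                         cv=3, scoring="neg_mean_squared_error")
demo_grid.fit(X_demo, y_demo)

# One entry per hyperparameter combination; convert neg. MSE to RMSE.
results = demo_grid.cv_results_
for mean_score, params in zip(results["mean_test_score"], results["params"]):
    print(np.sqrt(-mean_score), params)
```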

We'll evaluate the SVR with its best-performing hyperparameters.

In [60]:
from sklearn.model_selection import cross_val_score

svmRegressor = SVR(kernel = "linear", C = 1.0) # Add the best hyperparameters.
scores = cross_val_score(svmRegressor, X_train_prepared, y_train,
                         scoring = "neg_mean_squared_error", cv = 10)
print("Scores: ", np.sqrt(-scores))
print("Mean: ", np.sqrt(-scores).mean())
print("Standard Deviation: ", np.sqrt(-scores).std())

Scores:  [113531.42847235 111880.54845623 112679.50780293 112114.38983612
 111304.75541308 112991.89600464 111253.26553519 110675.23187123
 113321.68839355 110788.96394943]
Mean:  112054.16757347454
Standard Deviation:  989.8651760185556


Our SVR performs worse than our linear regression, decision tree, and random forest models. Its scores are more consistent across folds (a standard deviation of about 990 on a mean RMSE of about 112,054), although this may simply be the result of using the same hyperparameters for every fold during cross-validation.
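
To put the fold-to-fold spread in context, it can be expressed as a fraction of the mean RMSE. A minimal sketch that reuses the ten scores printed above; the variable names are illustrative.

```python
import numpy as np

# RMSE per fold, copied from the cross-validation output above.
rmse = np.array([113531.42847235, 111880.54845623, 112679.50780293,
                 112114.38983612, 111304.75541308, 112991.89600464,
                 111253.26553519, 110675.23187123, 113321.68839355,
                 110788.96394943])

# Coefficient of variation: standard deviation relative to the mean.
cv_ratio = rmse.std() / rmse.mean()
print(f"{cv_ratio:.2%}")
```

The spread comes out at under 1% of the mean, so the folds agree closely with each other even though the error itself is large.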

# 2.

Try randomised search instead of grid search.

In [23]:
from sklearn.model_selection import RandomizedSearchCV

supportVectorRegressor = SVR()
param_search_space = [{"kernel":["linear"], "C":[10, 20, 30, 40, 50]},
                      {"kernel":["rbf"], "C":[10, 20, 30, 40, 50], "gamma":["scale", "auto"]}]
random_search = RandomizedSearchCV(supportVectorRegressor, param_search_space,
                                   n_iter = 10, cv = 3, scoring = "neg_mean_squared_error",
                                   return_train_score = True)
random_search.fit(X_train_prepared, y_train)
random_search.best_params_

{'kernel': 'linear', 'C': 0.9}

In [24]:
svmRegressor = SVR(kernel = "linear", C = 0.9) # Add hyperparameters from randomised search.
scores = cross_val_score(svmRegressor, X_train_prepared, y_train,
                         scoring = "neg_mean_squared_error", cv = 10)
print("Scores: ", np.sqrt(-scores))
print("Mean: ", np.sqrt(-scores).mean())
print("Standard Deviation: ", np.sqrt(-scores).std())

Scores:  [114189.13937919 112506.33252945 113265.52056637 112713.59799245
 111899.69444759 113595.29478078 111846.30265771 111298.61808343
 113934.71774988 111351.76632487]
Mean:  112660.09845117298
Standard Deviation:  1001.8030931701944
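
Randomised search tends to be most useful when hyperparameters are sampled from continuous distributions rather than from a fixed list, because each iteration can then try a value a grid would never contain. The following is a minimal sketch, assuming SciPy is installed and using synthetic data from `make_regression`; the `demo_*` names and the sampling ranges are illustrative assumptions, not tuned for the housing data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic regression problem (assumption: stands in for the prepared data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Lists are sampled uniformly; distributions are sampled via their rvs method.
param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-3, 1e0),  # ignored when the linear kernel is drawn
}

demo_search = RandomizedSearchCV(SVR(), param_distributions,
                                 n_iter=10, cv=3,
                                 scoring="neg_mean_squared_error",
                                 random_state=0)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

A log-uniform distribution is a common choice for `C` and `gamma` because plausible values span several orders of magnitude.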


# 3.

Try feature selection and put it in a pipeline. Since `SVR` models do not expose native feature importance scores, we'll make our own.