### 11. Train and fine-tune an SVM regressor on the California housing dataset. You can use the original dataset rather than the tweaked version we used in Chapter 2, which you can load using `sklearn.datasets.fetch_california_housing()`. The targets represent hundreds of thousands of dollars. Since there are over 20,000 instances, SVMs can be slow, so for hyperparameter tuning you should use far fewer instances (e.g. 2000) to test many more hyperparameter combinations. What is your best model's RMSE?

In [2]:
# Works around the SSL certificate error when fetching the dataset.
# Note: this turns off HTTPS certificate verification for the session.
import ssl
ssl._create_default_https_context = ssl._create_stdlib_context

In [5]:
from sklearn.datasets import fetch_california_housing

df_housing = fetch_california_housing(as_frame=True)

In [8]:
print(df_housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [6]:
df_housing.data

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [9]:
X = df_housing.data.values
X.shape

(20640, 8)

In [10]:
y = df_housing.target
y.shape

(20640,)

In [11]:
from sklearn.preprocessing import StandardScaler

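# SVMs are sensitive to feature scales, so standardize the inputs.
# Note: fitting on all rows lets test-set statistics leak into training.
std_scaler = StandardScaler()
X_scaled = std_scaler.fit_transform(X)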

In [12]:
X_train_scaled = X_scaled[:16000]
X_test_scaled = X_scaled[16000:]
y_train = y[:16000]
y_test = y[16000:]
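
The cells above scale all rows and then take the first 16,000 as the training set. A minimal sketch of an alternative ordering follows: shuffle and split first, then fit the scaler on the training rows only. The arrays `X_demo` and `y_demo` are hypothetical synthetic stand-ins, not the housing data, and the 80/20 split ratio is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and targets (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))
y_demo = rng.normal(size=1000)

# Shuffle and split first, so the test rows never influence preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test features are transformed with the training mean and standard deviation, so the test set stays unseen until evaluation.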

In [14]:
X_train_scaled.shape

(16000, 8)

In [15]:
from sklearn.svm import SVR

# SVR has no random_state parameter; passing one raises a TypeError
svr_rgr = SVR()

In [30]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

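svr_param_grid = {
    "C" : loguniform(1e-3, 1e3), # search C from 0.001 to 1000
    "kernel" : ["linear", "rbf", "poly", "sigmoid"],
    "degree" : [2, 3, 4, 5], # only used by the "poly" kernel
    "coef0" : loguniform(1e-1, 1e3), # only used by "poly" and "sigmoid"
    "tol" : loguniform(1e-3, 1e3), # stopping tolerance; large values stop the solver early
    "epsilon" : loguniform(1e-4, 1e1) # width of the no-penalty tube around the prediction
}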

random_search_svr = RandomizedSearchCV(
    svr_rgr, svr_param_grid, n_iter=30, cv=5, 
    scoring="neg_root_mean_squared_error", random_state=42, n_jobs=-1
)
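
Since `degree` and `coef0` only affect some kernels, a single dictionary spends part of the search budget on settings that change nothing. The sketch below shows one possible alternative, assuming a pipeline that bundles the scaler with the regressor and a list of per-kernel dictionaries (which `RandomizedSearchCV` accepts). The data, the search ranges, and the names `svr_pipeline` and `param_distributions` are illustrative assumptions, not the values used in this notebook.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical), small enough to search quickly.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# With the scaler inside the pipeline, it is re-fit on each CV training fold.
svr_pipeline = make_pipeline(StandardScaler(), SVR())

# One dict per kernel, so only the relevant hyperparameters are sampled.
# make_pipeline names the SVR step "svr", hence the "svr__" prefix.
param_distributions = [
    {
        "svr__kernel": ["linear"],
        "svr__C": loguniform(1e-2, 1e2),
        "svr__epsilon": loguniform(1e-3, 1e0),
    },
    {
        "svr__kernel": ["rbf"],
        "svr__C": loguniform(1e-2, 1e2),
        "svr__gamma": loguniform(1e-3, 1e0),
        "svr__epsilon": loguniform(1e-3, 1e0),
    },
]

search = RandomizedSearchCV(
    svr_pipeline, param_distributions, n_iter=6, cv=3,
    scoring="neg_root_mean_squared_error", random_state=42,
)
search.fit(X_demo, y_demo)

# The scorer is negated RMSE, so flip the sign to get the RMSE.
best_rmse = -search.best_score_
print(search.best_params_, best_rmse)
```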

In [31]:
random_search_svr.fit(X_train_scaled[:2000], y_train[:2000])

In [32]:
random_search_svr.best_params_

{'C': 0.21481457181982683,
 'coef0': 1.2172958098369964,
 'degree': 2,
 'epsilon': 0.08585306974480478,
 'kernel': 'linear',
 'tol': 0.04848496183873291}

In [33]:
best_svr = random_search_svr.best_estimator_

In [35]:
from sklearn.model_selection import cross_val_score

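# The scorer returns the negated RMSE (higher is better), so the RMSE is the
# absolute value of this result, in units of $100,000.
cross_val_score(
    best_svr, X_train_scaled, y_train,
    cv=3, scoring='neg_root_mean_squared_error').mean()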

-1.2246697497211236
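
The held-out test rows are not used above. As a rough sketch of that final step, the snippet below fits an `SVR` on training rows and computes the RMSE on the remaining rows. It runs on hypothetical synthetic data, and the hyperparameters shown are placeholders rather than the values found by the search.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical): a noisy linear target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ rng.normal(size=8) + 0.1 * rng.normal(size=500)

X_tr, X_te = X_demo[:400], X_demo[400:]
y_tr, y_te = y_demo[:400], y_demo[400:]

# Placeholder hyperparameters; in practice, use the tuned values.
model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X_tr, y_tr)

# RMSE is the square root of the mean squared error on the unseen rows.
y_pred = model.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print(f"test RMSE: {test_rmse:.3f}")
```

On the housing data the targets are in hundreds of thousands of dollars, so multiplying the RMSE by 100,000 gives the typical error in dollars.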