# California Housing Regression with SVM

_Exercise: Train and fine-tune an SVM regressor on the California housing dataset. You can use the original dataset rather than the tweaked version we used in Chapter 2. The original dataset can be fetched using `sklearn.datasets.fetch_california_housing()`. The targets represent hundreds of thousands of dollars. Since there are over 20,000 instances, SVMs can be slow, so for hyperparameter tuning you should use far fewer instances (e.g., 2,000) to test many more hyperparameter combinations. What is your best model's RMSE?_

In [28]:
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import fetch_california_housing
from sklearn.dummy import DummyRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, NuSVR

RANDOM_STATE = 42

In [3]:
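# With as_frame=True this returns a Bunch (not a DataFrame): .data is a DataFrame, .target a Series.
df = fetch_california_housing(as_frame=True)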

In [4]:
print(df.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df.data, df.target, test_size=0.2, random_state=RANDOM_STATE)

In [6]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(16512, 8)
(4128, 8)
(16512,)
(4128,)


In [7]:
X_train.head()

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71    -117.03
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77    -118.16
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66    -120.48
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69    -117.11
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78    -119.80


In [8]:
y_train.head()

14196    1.030
8267     3.821
17445    1.726
14265    0.934
2271     0.965
Name: MedHouseVal, dtype: float64
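
Since the targets are expressed in units of $100,000, it can help to convert a few of them to dollars to get a feel for the scale. The sketch below does this for the five values shown above; the helper name `to_dollars` is hypothetical and not part of the original notebook. The same conversion applies to any error metric reported in the target's units, such as an RMSE.

```python
def to_dollars(value_in_100k: float) -> int:
    """Convert a value expressed in units of $100,000 to whole dollars."""
    return round(value_in_100k * 100_000)


# First five targets, copied from the y_train.head() output above.
sample_targets = [1.030, 3.821, 1.726, 0.934, 0.965]

for target in sample_targets:
    print(f"{target:.3f} -> ${to_dollars(target):,}")
```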

In [29]:
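# DummyRegressor is only a placeholder: the search swaps in SVR or NuSVR for the "clf" step.
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", DummyRegressor())])

params = {
    "clf": [SVR(), NuSVR()],
    "clf__gamma": loguniform(0.001, 0.1),
    # scipy's uniform(loc, scale) samples from [loc, loc + scale], i.e. C in [1, 11].
    "clf__C": uniform(1, 10),
}

# No `scoring` is given, so candidates are ranked by the regressor's default score (R²).
random_cv = RandomizedSearchCV(
    pipeline,
    params,
    n_iter=100,
    cv=3,
    random_state=RANDOM_STATE,
)

# train_test_split already shuffled the rows, so the first 2,000 are a random sample.
random_cv.fit(X_train[:2000], y_train[:2000])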
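
One detail worth checking in the search space above: SciPy's `uniform(loc, scale)` spans `[loc, loc + scale]`, so `uniform(1, 10)` draws `C` from 1 to 11 rather than 1 to 10, whereas `loguniform(a, b)` takes its two endpoints directly. A minimal sketch that inspects both distributions (the variable names are illustrative and not from the original notebook):

```python
from scipy.stats import loguniform, uniform

# Same distributions as in the search space above.
c_dist = uniform(1, 10)               # loc=1, scale=10
gamma_dist = loguniform(0.001, 0.1)   # lower bound, upper bound

# support() returns the (lower, upper) limits of a frozen distribution.
c_low, c_high = c_dist.support()
gamma_low, gamma_high = gamma_dist.support()
print(f"C range:     [{c_low}, {c_high}]")
print(f"gamma range: [{gamma_low}, {gamma_high}]")

# Draw some samples to see the limits in practice.
c_samples = c_dist.rvs(size=1000, random_state=42)
gamma_samples = gamma_dist.rvs(size=1000, random_state=42)
print(c_samples.min(), c_samples.max())
print(gamma_samples.min(), gamma_samples.max())
```

If the intent is to sample `C` between 1 and 10, `uniform(1, 9)` would be the equivalent specification.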

In [30]:
print(random_cv.best_params_)

{'clf': NuSVR(), 'clf__C': 6.898708475605439, 'clf__gamma': 0.09073727166787789}
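
The search above does not set `scoring`, so candidates are ranked by the regressor's default score, R². Since the exercise asks for RMSE, one option is to rank candidates by RMSE directly. The following is a minimal sketch under stated assumptions: it uses small synthetic data from `make_regression` as a stand-in so that it runs quickly and offline, and the names `demo_pipeline`, `demo_params`, and `demo_search` are hypothetical.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])
demo_params = {
    "svr__gamma": loguniform(0.001, 0.1),
    "svr__C": uniform(1, 10),
}

demo_search = RandomizedSearchCV(
    demo_pipeline,
    demo_params,
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",  # rank candidates by RMSE
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# Scores are negated so that "higher is better"; flip the sign to get RMSE.
best_cv_rmse = -demo_search.best_score_
print(demo_search.best_params_)
print(best_cv_rmse)
```

With this setting, `best_score_` is the negated mean cross-validated RMSE of the best candidate, which makes the search result directly comparable to the RMSE values reported later.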


In [31]:
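# cross_val_score clones the pipeline and refits it on each fold of the full
# training set (5 folds by default); the sign flip turns the scores into RMSE.
-cross_val_score(
    random_cv.best_estimator_, X_train, y_train, scoring="neg_root_mean_squared_error"
)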

array([0.58030207, 0.5645094 , 0.57484258, 0.56377296, 0.59099454])
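
The five fold scores are easier to compare with other models when summarized as a mean and a spread. A small sketch using the values printed above (the variable names are illustrative):

```python
import numpy as np

# Fold RMSEs copied from the cross-validation output above.
fold_rmse = np.array([0.58030207, 0.5645094, 0.57484258, 0.56377296, 0.59099454])

mean_rmse = fold_rmse.mean()
std_rmse = fold_rmse.std(ddof=1)  # sample standard deviation across folds

print(f"CV RMSE: {mean_rmse:.4f} +/- {std_rmse:.4f}")
```

The mean works out to roughly 0.575, which in the target's units of $100,000 corresponds to an error of about $57,500.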

In [32]:
# Note: best_estimator_ was refit by the search on the 2,000-instance subset only.
y_pred = random_cv.best_estimator_.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
rmse

0.5841502681756156
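
Because `RandomizedSearchCV` refits `best_estimator_` on the data passed to `fit`, the model evaluated above was trained on the 2,000-instance subset only. A common follow-up is to clone the winning pipeline and fit it on the full training set before the final test evaluation. The sketch below illustrates that step under stated assumptions: it uses synthetic data so it runs without downloading the dataset, the helper `refit_on_full_data` is hypothetical, and the `NuSVR` settings are rounded stand-ins for the best parameters found above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR


def refit_on_full_data(estimator, X, y):
    """Return a fresh, unfitted copy of `estimator` fitted on all of X, y."""
    return clone(estimator).fit(X, y)


# Synthetic stand-in for the housing data (an assumption for this demo).
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Stand-in for random_cv.best_estimator_: a pipeline fitted on a small subset.
subset_model = Pipeline(
    [("scaler", StandardScaler()), ("clf", NuSVR(C=6.9, gamma=0.09))]
)
subset_model.fit(X_tr[:100], y_tr[:100])

# Same hyperparameters, but trained on every training instance.
full_model = refit_on_full_data(subset_model, X_tr, y_tr)

for name, model in [("subset", subset_model), ("full", full_model)]:
    # Square root of the MSE, equivalent to root_mean_squared_error.
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name} model test RMSE: {rmse:.3f}")
```

In the notebook this would correspond to `refit_on_full_data(random_cv.best_estimator_, X_train, y_train)`. Fitting an SVM on all 16,512 training instances can take noticeably longer than on 2,000, and the resulting test RMSE is not reported here because it was not part of the original run.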