# Exercise - 11

Train and fine-tune an SVM regressor on the California housing dataset. You can use the original dataset rather than the tweaked version we used in Chapter 2. The original dataset can be fetched using `sklearn.datasets.fetch_california_housing()`. The targets represent hundreds of thousands of dollars. Since there are over 20,000 instances, SVMs can be slow, so for hyperparameter tuning you should use far fewer instances (e.g., 2,000) to test many more hyperparameter combinations. What is your best model's RMSE?

In [29]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.svm import LinearSVR, SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform, loguniform

## Loading Dataset

In [2]:
housing = fetch_california_housing(as_frame=True)

In [3]:
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [4]:
X, y = housing.data, housing.target

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
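
Because `train_test_split` is called without `test_size`, it falls back to its default of 25% for the test set. A minimal sketch on a placeholder array with the same number of rows (`X_demo` and `y_demo` are illustrative stand-ins, not the housing data) shows the split sizes this produces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the housing dataset (20,640).
X_demo = np.zeros((20640, 8))
y_demo = np.zeros(20640)

# No test_size is given, so scikit-learn uses its default of 0.25.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))  # 15480 5160
```

The 15,480 training rows match the `Length: 15480` reported for `y_train`.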

In [6]:
X_train.head()

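|       | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|-------|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 8158  | 4.2143 | 37.0     | 5.288235 | 0.973529  | 860.0      | 2.529412 | 33.81    | -118.12   |
| 18368 | 5.3468 | 42.0     | 6.364322 | 1.08794   | 957.0      | 2.404523 | 37.16    | -121.98   |
| 19197 | 3.9191 | 36.0     | 6.110063 | 1.059748  | 711.0      | 2.235849 | 38.45    | -122.69   |
| 3746  | 6.3703 | 32.0     | 6.0      | 0.990196  | 1159.0     | 2.272549 | 34.16    | -118.41   |
| 13073 | 2.3684 | 17.0     | 4.795858 | 1.035503  | 706.0      | 2.088757 | 38.57    | -121.33   |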


In [7]:
y_train

8158     2.285
18368    2.799
19197    1.830
3746     4.658
13073    1.500
         ...  
11284    2.292
11964    0.978
5390     2.221
860      2.835
15795    3.250
Name: MedHouseVal, Length: 15480, dtype: float64

## Linear SVR

In [8]:
l_svr = Pipeline([
    ('scaler', StandardScaler()),
    ('l_svr', LinearSVR(max_iter=5000, random_state=42))
])

In [9]:
l_svr.fit(X_train, y_train)

In [10]:
predictions = l_svr.predict(X_train)

In [11]:
# Take the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases.
np.sqrt(mean_squared_error(y_train, predictions))

0.9165468092031511

Since the targets are in units of $100,000, this training RMSE corresponds to a typical error of roughly $92,000, which is pretty terrible.
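
The conversion from RMSE to dollars is just a multiplication by the target unit. A small sketch using the value reported above:

```python
# Targets are expressed in units of $100,000, so scale the RMSE accordingly.
rmse = 0.9165468092031511  # training RMSE of the LinearSVR pipeline, from the cell above
rmse_dollars = rmse * 100_000

print(f"${rmse_dollars:,.0f}")  # $91,655
```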

## SVR with Gaussian RBF Kernel

In [12]:
svr = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR())
])

In [13]:
svr.fit(X_train, y_train)

In [15]:
predictions = svr.predict(X_train)
# Take the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases.
np.sqrt(mean_squared_error(y_train, predictions))

0.583850687488526

That is better, but a training error of roughly $58,000 is still very high. Let's try fine-tuning the model.
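
Both RMSE values so far were measured on the data the models were trained on, which tends to give an optimistic picture. Cross-validation gives a less biased estimate. The following is a minimal sketch, assuming synthetic stand-in data from `make_regression` (chosen only to keep the example fast) and a pipeline named `svr_demo` that mirrors the one above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the housing dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

svr_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR()),
])

# scikit-learn scorers follow a "higher is better" convention,
# so the RMSE scorer returns negated values.
neg_rmse = cross_val_score(svr_demo, X_demo, y_demo, cv=3,
                           scoring="neg_root_mean_squared_error")
cv_rmse = -neg_rmse

print(cv_rmse.mean())
```

On the housing data the same call would take `svr`, `X_train`, and `y_train`, at the cost of fitting the SVR three times.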

## Fine Tuning

In [23]:
param_dist = {
    'svr__gamma': loguniform(0.001, 1),
    'svr__C': uniform(1, 10)
}

rnd_cv = RandomizedSearchCV(svr, param_dist, n_iter=100, cv=3, random_state=42)
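
The two SciPy distributions used in `param_dist` take their arguments differently, which is easy to misread. A short sketch of the ranges they actually cover (the names `c_dist` and `gamma_dist` are illustrative):

```python
from scipy.stats import loguniform, uniform

# uniform takes (loc, scale): the support is [loc, loc + scale].
c_dist = uniform(1, 10)            # C is sampled from [1, 11], not [1, 10]

# loguniform takes the two endpoints (a, b) directly.
gamma_dist = loguniform(0.001, 1)  # gamma is sampled from [0.001, 1] on a log scale

c_samples = c_dist.rvs(size=1000, random_state=42)
gamma_samples = gamma_dist.rvs(size=1000, random_state=42)

print(c_dist.support())
print(c_samples.min(), c_samples.max())
print(gamma_samples.min(), gamma_samples.max())
```

A log-uniform prior suits `gamma` because useful values can span several orders of magnitude.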

In [24]:
rnd_cv.fit(X_train[:2000], y_train[:2000])

In [30]:
np.sqrt(rnd_cv.best_score_)

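0.8636979688025308

Note that `rnd_cv` was created without a `scoring` argument, so `best_score_` is the mean cross-validated R² (the default score for regressors), not an MSE. The value above is therefore the square root of an R² of about 0.746, and should not be read as an RMSE.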
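
A randomized search can also be asked to optimize RMSE explicitly instead of relying on the estimator's default score. The following is a minimal sketch on synthetic stand-in data, with a small `n_iter` so it runs quickly. The names `svr_demo`, `param_dist_demo`, and `search` are illustrative, and no result for the housing data is implied:

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the housing dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

svr_demo = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])
param_dist_demo = {
    "svr__gamma": loguniform(0.001, 1),
    "svr__C": uniform(1, 10),
}

# With an explicit RMSE scorer, best_score_ is the negated mean RMSE across folds.
search = RandomizedSearchCV(
    svr_demo, param_dist_demo, n_iter=5, cv=3,
    scoring="neg_root_mean_squared_error", random_state=42,
)
search.fit(X_demo, y_demo)

best_cv_rmse = -search.best_score_
print(best_cv_rmse)
print(search.best_params_)
```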

In [26]:
best_model = rnd_cv.best_estimator_

## On Test Set

In [28]:
predictions = best_model.predict(X_test)
# Take the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases.
np.sqrt(mean_squared_error(y_test, predictions))

0.5755468841792654

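A test RMSE of about 0.576 (roughly $57,600) is still not that good. Keep in mind that the search, and therefore `best_estimator_`, only saw the first 2,000 training instances. That restriction was needed because kernel SVMs do not scale well to large datasets.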
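
One possible follow-up is to keep the tuned hyperparameters but retrain on more data, since `RandomizedSearchCV` refits `best_estimator_` only on the data passed to `fit`. Below is a minimal sketch on synthetic stand-in data. The pipeline `tuned` plays the role of `rnd_cv.best_estimator_`, its `C` and `gamma` values are hypothetical, and `held_out_rmse` is a helper written for this example. Whether this helps on the housing data is not shown here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the housing dataset).
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Stand-in for rnd_cv.best_estimator_; the hyperparameter values are hypothetical.
tuned = Pipeline([("scaler", StandardScaler()), ("svr", SVR(C=5.0, gamma=0.05))])

def held_out_rmse(model, X_fit, y_fit):
    """Fit a fresh copy of the model and return its RMSE on the held-out split."""
    fitted = clone(model).fit(X_fit, y_fit)
    return float(np.sqrt(mean_squared_error(y_te, fitted.predict(X_te))))

rmse_subset = held_out_rmse(tuned, X_tr[:200], y_tr[:200])  # small tuning subset
rmse_full = held_out_rmse(tuned, X_tr, y_tr)                # full training split

print(rmse_subset, rmse_full)
```

`clone` returns an unfitted copy with the same hyperparameters, so the original estimator is left untouched.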