Exercise: _Try a Support Vector Machine regressor (`sklearn.svm.SVR`) with various hyperparameters, such as `kernel="linear"` (with various values for the `C` hyperparameter) or `kernel="rbf"` (with various values for the `C` and `gamma` hyperparameters). Note that SVMs don't scale well to large datasets, so you should probably train your model on just the first 5,000 instances of the training set and use only 3-fold cross-validation, or else it will take hours. Don't worry about what the hyperparameters mean for now (see the SVM notebook if you're interested). How does the best `SVR` predictor perform?_

In [2]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request 
 
def load_housing_data(): 
    tarball_path = Path("datasets/housing.tgz") 
    if not tarball_path.is_file(): 
        Path("datasets").mkdir(parents=True, exist_ok=True) 
        url = "https://github.com/ageron/data/raw/main/housing.tgz" 
        urllib.request.urlretrieve(url, tarball_path) 
        with tarfile.open(tarball_path) as housing_tarball: 
            housing_tarball.extractall(path="datasets") 
    return pd.read_csv(Path("datasets/housing/housing.csv")) 
 
housing = load_housing_data()


In [3]:
housing.info()
housing.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [4]:
# Separate the target column from the feature columns
housing_labels = housing['housing_median_age']  # target variable
housing_features = housing.drop(columns='housing_median_age')  # features (everything except the target)

housing_features.head()

,longitude,latitude,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import numpy as np

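# SVMs scale poorly with dataset size, so keep only the first 1,000 rows
# (the exercise suggests going up to 5,000 instances)
housing = housing.head(1000)
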
housing_labels = housing['housing_median_age']  # target variable
housing_features = housing.drop(columns='housing_median_age')  # features

# split the data into training and test
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_labels, test_size=0.2, random_state=42)

# define preprocessing steps
numeric_features = housing_features.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = housing_features.select_dtypes(include=[object]).columns.tolist()

# preprocessing transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
            ('scaler', StandardScaler())  # Standardize numeric data
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing categorical values
            ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode; ignore categories not seen during fit (e.g. in a CV fold)
        ]), categorical_features)
    ])

param_grid = [
    {'svr__kernel': ['linear'], 'svr__C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},  # larger C means weaker regularization, which can lead to overfitting
    {'svr__kernel': ['rbf'], 'svr__C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'svr__gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},  # larger gamma means each training point influences a smaller region, which can lead to overfitting
]

svr_pipeline = Pipeline([
    ("preprocessor", preprocessor),  # Preprocessing step
    ("svr", SVR())
])

grid_search = GridSearchCV(svr_pipeline, param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)  # use X_train and y_train for fitting
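
The exercise warns that SVMs train slowly, so it can help to know how many fits a grid search will run before starting it. The following is a minimal sketch that counts the candidates with `itertools.product`. It copies the grid from the cell above so that it runs on its own, and `count_candidates` is a hypothetical helper name, not part of scikit-learn.

```python
from itertools import product

# Copy of the grid used above, repeated here so the snippet is self-contained.
param_grid = [
    {"svr__kernel": ["linear"],
     "svr__C": [10., 30., 100., 300., 1000., 3000., 10000., 30000.]},
    {"svr__kernel": ["rbf"],
     "svr__C": [1., 3., 10., 30., 100., 300., 1000.],
     "svr__gamma": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

def count_candidates(grid):
    """Count hyperparameter combinations across a list of sub-grids (hypothetical helper)."""
    return sum(len(list(product(*sub_grid.values()))) for sub_grid in grid)

n_candidates = count_candidates(param_grid)  # 8 linear + 7 * 6 rbf combinations
n_folds = 3                                  # matches cv=3 in GridSearchCV
n_fits = n_candidates * n_folds              # excludes the final refit on the full training set

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With this grid the search evaluates 50 candidates, which comes to 150 fits under 3-fold cross-validation, plus one final refit of the best model.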


In [11]:
print("Best hyperparameters:", grid_search.best_params_)

print("Best RMSE (negative):", grid_search.best_score_)

# grid_search.score() uses the scorer passed to GridSearchCV (negative RMSE here), not R^2
test_score = grid_search.score(X_test, y_test)
print("Test score (negative RMSE):", test_score)

Best hyperparameters: {'svr__C': 10.0, 'svr__gamma': 0.03, 'svr__kernel': 'rbf'}
Best RMSE (negative): -10.876363476668962
Test score (negative RMSE): -11.806956605214017
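
Because the search was built with `scoring='neg_root_mean_squared_error'`, `grid_search.score` reports negative RMSE on the test set rather than R². The two metrics answer different questions, and the sketch below shows how each one is computed. The numbers are made up for illustration and do not come from the housing data. `rmse` and `r2` are hypothetical helper names.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (assumed for this example).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

rmse_value = rmse(y_true, y_pred)
r2_value = r2(y_true, y_pred)

print("RMSE:", round(rmse_value, 3))
print("Negative RMSE (what the scorer reports):", round(-rmse_value, 3))
print("R^2:", round(r2_value, 3))
```

In the notebook itself, an R² on the test set could be obtained with `sklearn.metrics.r2_score(y_test, grid_search.predict(X_test))`.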


In [17]:
print(numeric_features)
print(categorical_features)
grid_search.fit(X_train, y_train) 

['longitude', 'latitude', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']
['ocean_proximity']
