<blockquote>
    <h1>Exercise 9.7</h1>
    <p>In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the <code>Auto</code> data set.</p>
    <ol>
        <li>Create a binary variable that takes on a $1$ for cars with gas mileage above the median, and a $0$ for cars with gas mileage below the median.</li>
        <li>Fit a support vector classifier to the data with various values of $\mathrm{cost}$, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.</li>
        <li>Now repeat 2, this time using SVMs with radial and polynomial basis kernels, with different values of $\mathrm{gamma}$ and $\mathrm{degree}$ and $\mathrm{cost}$. Comment on your results.</li>
        <li>Make some plots to back up your assertions in 2 and 3.</li>
    </ol>
</blockquote>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# https://stackoverflow.com/questions/34398054/ipython-notebook-cell-multiple-outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer  # https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

In [2]:
df = pd.read_csv("../../DataSets/Auto/Auto.csv")
df = df.set_index('name')
df.head()

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,1
buick skylark 320,15.0,8,350.0,165,3693,11.5,70,1
plymouth satellite,18.0,8,318.0,150,3436,11.0,70,1
amc rebel sst,16.0,8,304.0,150,3433,12.0,70,1
ford torino,17.0,8,302.0,140,3449,10.5,70,1


<p>We know from our <a href="../../DataSets/Auto/Exploration.ipynb">Exploration notebook</a> for the <code>Auto</code> file that the <code>horsepower</code> column contains $5$ missing values, each marked by the string <code>'?'</code>. As explained in the <a href="../../DataSets/Auto/Exploration.ipynb">Exploration notebook</a>, we convert the <code>horsepower</code> column from an <code>object</code> type to a numeric type with the pandas <code>to_numeric()</code> function, passing <code>errors='coerce'</code> so that every <code>'?'</code> string becomes a <code>NaN</code> value. Finally, we use the <code>dropna()</code> method to remove the rows with missing values.</p>

In [3]:
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df.dropna(inplace=True)
df.head()

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,70,1
buick skylark 320,15.0,8,350.0,165.0,3693,11.5,70,1
plymouth satellite,18.0,8,318.0,150.0,3436,11.0,70,1
amc rebel sst,16.0,8,304.0,150.0,3433,12.0,70,1
ford torino,17.0,8,302.0,140.0,3449,10.5,70,1
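
<p>The coercion step can be illustrated on a toy frame. This is only a sketch: the <code>toy</code> data below is made up for illustration and is not taken from the <code>Auto</code> file.</p>

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings, with '?' marking a missing value.
toy = pd.DataFrame({
    "horsepower": ["130", "?", "150"],
    "weight": [3504, 2046, 3436],
})

# errors='coerce' turns every unparseable entry into NaN instead of raising an error.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")
n_missing = int(toy["horsepower"].isna().sum())

# dropna() then removes every row that contains a NaN.
toy_clean = toy.dropna()

print(n_missing, len(toy_clean))
```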


<h3>Exercise 9.7.1</h3>
<blockquote>
    <i>Create a binary variable that takes on a $1$ for cars with gas mileage above the median, and a $0$ for cars with gas mileage below the median.</i>
</blockquote>

In [4]:
median_mpg = df['mpg'].median()
df['mpg_binary'] = np.where(df['mpg'] > median_mpg, 1, 0)
df['origin'] = df['origin'].astype('category')

df_x = df.drop(['mpg', 'mpg_binary'], axis=1)
df_y = df[['mpg_binary']]
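
<p>Because the comparison is strict, cars whose mileage equals the median are labelled $0$, so the two classes need not be exactly the same size. The sketch below shows this on a made-up series (<code>toy_mpg</code> is hypothetical, not the <code>Auto</code> data).</p>

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values; two of them sit exactly on the median.
toy_mpg = pd.Series([15.0, 18.0, 22.0, 22.0, 30.0, 36.0])
toy_median = toy_mpg.median()

# Same rule as above: 1 strictly above the median, 0 otherwise.
toy_binary = np.where(toy_mpg > toy_median, 1, 0)

n_high = int((toy_binary == 1).sum())
n_low = int((toy_binary == 0).sum())
print(toy_median, n_high, n_low)
```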

<h3>Exercise 9.7.2</h3>
<blockquote>
    <i>Fit a support vector classifier to the data with various values of $\mathrm{cost}$, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.</i>
</blockquote>

In [5]:
# support vector classifier
# use cross-validation to find the optimum value for C
columns = df_x.columns

# standardize the quantitative predictors
quantitative_variables = [column for column in columns if column != 'origin']
quantitative_transformer = Pipeline([
    ('scaler', StandardScaler()),
])

# one-hot encode the categorical predictor
categorical_variables = ['origin']
categorical_transformer = Pipeline([
    ('transformer', OneHotEncoder()),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('quant', quantitative_transformer, quantitative_variables),
        ('cat', categorical_transformer, categorical_variables)
    ]
)

linear_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('linear_svc', SVC(kernel='linear'))
])

# C must be strictly positive, so the grid starts at 0.02 rather than 0
param_grid = {
    'linear_svc__C': np.linspace(0.02, 1, 50),
}

grid_search_linear_svm_class = GridSearchCV(linear_svm_class, param_grid=param_grid, n_jobs=-1)
_ = grid_search_linear_svm_class.fit(df_x, df_y['mpg_binary'])

print(grid_search_linear_svm_class.best_estimator_)
print(grid_search_linear_svm_class.best_score_)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('quant',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  ['cylinders', 'displacement',
                                                   'horsepower', 'weight',
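
<p>The exercise asks for the cross-validation error at each value of $\mathrm{cost}$, not only the best score. A minimal sketch of how this could be read from <code>cv_results_</code> follows. It assumes synthetic data from <code>make_classification</code> as a stand-in for the <code>Auto</code> features, and the names <code>demo_search</code> and <code>cv_report</code> are hypothetical. The same pattern should carry over to the fitted search object above.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the predictors and the binary mileage label (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", SVC(kernel="linear")),
])
demo_grid = {"linear_svc__C": [0.01, 0.1, 1.0, 10.0]}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# cv_results_ has one row per candidate. The default score for a classifier
# is accuracy, so the cross-validation error is 1 - mean_test_score.
results = pd.DataFrame(demo_search.cv_results_)
cv_report = results[["param_linear_svc__C", "mean_test_score"]].copy()
cv_report["cv_error"] = 1 - cv_report["mean_test_score"]
print(cv_report)
```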
                                

<h3>Exercise 9.7.3</h3>
<blockquote>
    <i>Now repeat 2, this time using SVMs with radial and polynomial basis kernels, with different values of $\mathrm{gamma}$ and $\mathrm{degree}$ and $\mathrm{cost}$. Comment on your results.</i>
</blockquote>

In [6]:
# support vector machine with a polynomial kernel
# use cross-validation to find the optimum values for C and degree
poly_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('poly_svc', SVC(kernel='poly')),
])

# C must be strictly positive, so the grid starts at 0.01 rather than 0
param_grid = {
    'poly_svc__C': np.linspace(0.01, 0.5, 50),
    'poly_svc__degree': range(1, 5),
}

grid_search_poly_svm_class = GridSearchCV(poly_svm_class, param_grid=param_grid, n_jobs=-1)
_ = grid_search_poly_svm_class.fit(df_x, df_y['mpg_binary'])

print(grid_search_poly_svm_class.best_estimator_)
print(grid_search_poly_svm_class.best_score_)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('quant',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  ['cylinders', 'displacement',
                                                   'horsepower', 'weight',
                                

In [7]:
# support vector machine with a radial kernel
# use cross-validation to find the optimum values for C and gamma
rbf_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('rbf_svc', SVC(kernel='rbf')),
])

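# C must be strictly positive; gamma = 0 would make the RBF kernel constant,
# so both grids start just above 0
param_grid = {
    'rbf_svc__C': np.linspace(0.02, 1, 50),
    'rbf_svc__gamma': np.linspace(0.01, 1, 100),
}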

grid_search_rbf_svm_class = GridSearchCV(rbf_svm_class, param_grid=param_grid, n_jobs=-1)
_ = grid_search_rbf_svm_class.fit(df_x, df_y['mpg_binary'])

print(grid_search_rbf_svm_class.best_estimator_)
print(grid_search_rbf_svm_class.best_score_)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('quant',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  ['cylinders', 'displacement',
                                                   'horsepower', 'weight',
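
<p>For the plots requested in part 4, one option is to pivot the grid-search results into a $\mathrm{cost}$ by $\mathrm{gamma}$ table of cross-validation errors and draw it as a heat map. The sketch below is only an illustration: it uses synthetic data and a small, hypothetical grid (<code>rbf_grid</code>) rather than the search fitted above.</p>

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the Auto predictors and label (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Small hypothetical grid so the example runs quickly.
rbf_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
rbf_search = GridSearchCV(SVC(kernel="rbf"), param_grid=rbf_grid, cv=5)
rbf_search.fit(X_demo, y_demo)

# One row per (C, gamma) pair; error = 1 - mean accuracy across the folds.
results = pd.DataFrame(rbf_search.cv_results_)
results["cv_error"] = 1 - results["mean_test_score"]
error_table = results.pivot(index="param_C", columns="param_gamma", values="cv_error")

# Heat map of the cross-validation error over the grid.
fig, ax = plt.subplots()
image = ax.imshow(error_table.to_numpy(dtype=float), origin="lower")
ax.set_xticks(range(error_table.shape[1]))
ax.set_xticklabels([str(g) for g in error_table.columns])
ax.set_yticks(range(error_table.shape[0]))
ax.set_yticklabels([str(c) for c in error_table.index])
ax.set_xlabel("gamma")
ax.set_ylabel("C")
ax.set_title("Cross-validation error (RBF kernel)")
_ = fig.colorbar(image, ax=ax)
```

<p>For a one-parameter search such as the linear classifier, a line plot of the error against $\mathrm{cost}$ would serve the same purpose.</p>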
                                