<blockquote>
    <h1>Exercise 9.8</h1>
    <p>This problem involves the <code>OJ</code> data set which is part of the <code>ISLR</code> package.</p>
    <ol>
        <li>Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.</li>
        <li>Fit a support vector classifier to the training data using $\mathrm{cost}=0.01$, with $\mathrm{Purchase}$ as the response and the other variables as predictors. Use the <code>summary()</code> function to produce summary statistics, and describe the results obtained.</li>
        <li>What are the training and test error rates?</li>
        <li>Use the <code>tune()</code> function to select an optimal $\mathrm{cost}$. Consider values in the range 0.01 to 10.</li>
        <li>Compute the training and test error rates using this new value for $\mathrm{cost}$.</li>
        <li>Repeat parts 2 through 5 using a support vector machine with a radial kernel. Use the default value for $\mathrm{gamma}$.</li>
        <li>Repeat parts 2 through 5 using a support vector machine with a polynomial kernel. Set $\mathrm{degree}=2$.</li>
        <li>Overall, which approach seems to give the best results on this data?</li>
    </ol>
</blockquote>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# https://stackoverflow.com/questions/34398054/ipython-notebook-cell-multiple-outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer  # https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv("../../DataSets/OJ/OJ.csv")
df.head()

df_y = df[['Purchase']]
df_x = df.drop('Purchase', axis=1)

Unnamed: 0,Purchase,WeekofPurchase,StoreID,PriceCH,PriceMM,DiscCH,DiscMM,SpecialCH,SpecialMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,Store7,PctDiscMM,PctDiscCH,ListPriceDiff,STORE
0,CH,237,1,1.75,1.99,0.0,0.0,0,0,0.5,1.99,1.75,0.24,No,0.0,0.0,0.24,1
1,CH,239,1,1.75,1.99,0.0,0.3,0,1,0.6,1.69,1.75,-0.06,No,0.150754,0.0,0.24,1
2,CH,245,1,1.86,2.09,0.17,0.0,0,0,0.68,2.09,1.69,0.4,No,0.0,0.091398,0.23,1
3,MM,227,1,1.69,1.69,0.0,0.0,0,0,0.4,1.69,1.69,0.0,No,0.0,0.0,0.0,1
4,CH,228,7,1.69,1.69,0.0,0.0,0,0,0.956535,1.69,1.69,0.0,Yes,0.0,0.0,0.0,0


<h3>Exercise 9.8.1</h3>
<blockquote>
    <i>Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.</i>
</blockquote>

In [3]:
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(df_x, df_y, train_size=800, random_state=0)
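
<p>As a quick sanity check on the split, the sketch below reproduces the same call on synthetic data. The toy frame, its 1070 rows and the <code>stratify</code> argument are assumptions made for illustration and are not part of the notebook above; an integer <code>train_size</code> is read as an absolute number of rows.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the OJ data (assumption): 1070 rows, one label, one feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Purchase": rng.choice(["CH", "MM"], size=1070, p=[0.6, 0.4]),
    "LoyalCH": rng.uniform(0, 1, size=1070),
})
toy_x = toy.drop("Purchase", axis=1)
toy_y = toy["Purchase"]

# An integer train_size is an absolute row count; stratify keeps the class
# shares of the two subsets close to each other (optional, not used above).
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, train_size=800, random_state=0, stratify=toy_y
)

train_share = (y_train == "CH").mean()
test_share = (y_test == "CH").mean()
print(len(x_train), len(x_test))
print(f"CH share: train={train_share:.3f}, test={test_share:.3f}")
```

<p>Without <code>stratify</code> the split is still valid, but the class proportions of the two subsets can drift apart by chance.</p>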

<h3>Exercise 9.8.2</h3>
<blockquote>
    <i>Fit a support vector classifier to the training data using $\mathrm{cost}=0.01$, with $\mathrm{Purchase}$ as the response and the other variables as predictors. Use the <code>summary()</code> function to produce summary statistics, and describe the results obtained.</i>
</blockquote>

In [4]:
# support vector classifier with a fixed cost
# NOTE: the exercise specifies cost=0.01; the results recorded below were produced with C=0.1
columns = df_x.columns

quantitative_variables = [
    'WeekofPurchase',
    'PriceCH',
    'PriceMM',
    'DiscCH',
    'DiscMM',
    'LoyalCH',
    'SalePriceMM',
    'SalePriceCH',
    'PriceDiff',
    'PctDiscMM',
    'PctDiscCH',
    'ListPriceDiff'
]
# standardise the quantitative predictors
quantitative_transformer = Pipeline([
    ('scaler', StandardScaler()),
])

# every remaining column is treated as categorical and one-hot encoded
categorical_variables = [column for column in columns if column not in quantitative_variables]
categorical_variables
categorical_transformer = Pipeline([
    ('transformer', OneHotEncoder()),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('quant', quantitative_transformer, quantitative_variables),
        ('cat', categorical_transformer, categorical_variables)
    ]
)

linear_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('linear_svc', SVC(kernel='linear', C=0.1))
])
_ = linear_svm_class.fit(df_x_train, df_y_train['Purchase'])
linear_svm_class.get_params()

['StoreID', 'SpecialCH', 'SpecialMM', 'Store7', 'STORE']

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                     transformer_weights=None,
                     transformers=[('quant',
                                    Pipeline(memory=None,
                                             steps=[('scaler',
                                                     StandardScaler(copy=True,
                                                                    with_mean=True,
                                                                    with_std=True))],
                                             verbose=False),
                                    ['WeekofPurchase', 'PriceCH', 'PriceMM',
                                     'DiscCH', 'DiscMM', 'LoyalCH', 'SalePriceMM',
                                     'SalePriceCH', 'PriceDiff', 'PctDiscMM',
                                     'PctDiscCH', 'ListPriceDiff']),
                                   ('cat',
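
<p>The exercise asks for R's <code>summary()</code>, which reports, among other things, the number of support vectors. A rough scikit-learn analogue is to read the fitted attributes of the <code>SVC</code> step. The sketch below does this on synthetic data; the data set and the <code>summary</code> dictionary are illustrative assumptions, not output from the notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (assumption), used only to have something to fit.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

svc = SVC(kernel="linear", C=0.01).fit(X, y)

# Collect the pieces of information that summary() would print in R.
summary = {
    "kernel": svc.kernel,
    "C": svc.C,
    "classes": svc.classes_.tolist(),
    "support vectors per class": svc.n_support_.tolist(),
    "total support vectors": int(svc.n_support_.sum()),
}
for key, value in summary.items():
    print(f"{key}: {value}")
```

<p>For the pipeline above, the fitted classifier should be reachable as <code>linear_svm_class.named_steps['linear_svc']</code>, so the same attributes can be read from there.</p>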
        

<h3>Exercise 9.8.3</h3>
<blockquote>
    <i>What are the training and test error rates?</i>
</blockquote>

In [5]:
# score() returns accuracy, so 1 - score() is the error rate as a fraction (not a percentage)
print(f'training error = {1 - linear_svm_class.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - linear_svm_class.score(df_x_test, df_y_test):.3f}')

training error = 0.155
test error = 0.178
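
<p>The error rate used here is one minus the accuracy returned by <code>score()</code>. The sketch below, on synthetic data (an assumption, not the OJ data), checks that this agrees with counting misclassified predictions directly and with the off-diagonal mass of the confusion matrix.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, random_state=0
)

clf = SVC(kernel="linear", C=0.1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Three equivalent ways to obtain the misclassification rate.
error_from_score = 1 - clf.score(X_test, y_test)
error_from_predictions = np.mean(y_pred != y_test)
cm = confusion_matrix(y_test, y_pred)
error_from_confusion = 1 - np.trace(cm) / cm.sum()

print(f"{error_from_score:.3f} {error_from_predictions:.3f} {error_from_confusion:.3f}")
```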


<h3>Exercise 9.8.4</h3>
<blockquote>
    <i>Use the <code>tune()</code> function to select an optimal $\mathrm{cost}$. Consider values in the range 0.01 to 10.</i>
</blockquote>

In [6]:
linear_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('linear_svc', SVC(kernel='linear'))
])

param_grid={
    # C must be strictly positive, so the grid runs from 0.1 to 10 in steps of 0.1
    # (the exercise's lower bound of 0.01 is not included)
    'linear_svc__C': np.linspace(0.1, 10, 100),
}

grid_search_linear_svm_class = GridSearchCV(linear_svm_class, param_grid=param_grid, n_jobs=-1)
# cross-validate on the training set only, so the test set stays untouched for the final evaluation
_ = grid_search_linear_svm_class.fit(df_x_train, df_y_train['Purchase'])


# print(grid_search_linear_svm_class.best_estimator_)
print(f'cross-validation error = {1 - grid_search_linear_svm_class.best_score_:.3f}')

cross-validation error = 0.164
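
<p>Because the exercise's range spans three orders of magnitude (0.01 to 10), a log-spaced grid is a common alternative to an evenly spaced one. The following is a minimal sketch on synthetic data; the data set, the step names and the choice of ten grid points are assumptions for illustration, and the printed values are not results for the OJ data.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear")),
])

# Ten log-spaced candidates from 10**-2 = 0.01 to 10**1 = 10.
param_grid = {"svc__C": np.logspace(-2, 1, 10)}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
_ = search.fit(X, y)

best_c = search.best_params_["svc__C"]
cv_error = 1 - search.best_score_
print(f"best C = {best_c:.3g}, cross-validation error = {cv_error:.3f}")
```

<p><code>best_params_</code> reports which value of <code>C</code> was selected, which is the piece of information the R function <code>tune()</code> would show.</p>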


<h3>Exercise 9.8.5</h3>
<blockquote>
    <i>Compute the training and test error rates using this new value for $\mathrm{cost}$.</i>
</blockquote>

In [7]:
# error rates are fractions (1 - accuracy), not percentages
print(f'training error = {1 - grid_search_linear_svm_class.best_estimator_.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - grid_search_linear_svm_class.best_estimator_.score(df_x_test, df_y_test):.3f}')

training error = 0.156
test error = 0.185


<h3>Exercise 9.8.6</h3>
<blockquote>
    <i>Repeat parts 2 through 5 using a support vector machine with a radial kernel. Use the default value for $\mathrm{gamma}$.</i>
</blockquote>

In [8]:
# step 2
# gamma='auto' is 1 / n_features, which mirrors the default gamma of svm() in R's e1071 package
rbf_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('rbf_svc', SVC(kernel='rbf', C=0.1, gamma='auto'))
])
_ = rbf_svm_class.fit(df_x_train, df_y_train['Purchase'])
rbf_svm_class.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                     transformer_weights=None,
                     transformers=[('quant',
                                    Pipeline(memory=None,
                                             steps=[('scaler',
                                                     StandardScaler(copy=True,
                                                                    with_mean=True,
                                                                    with_std=True))],
                                             verbose=False),
                                    ['WeekofPurchase', 'PriceCH', 'PriceMM',
                                     'DiscCH', 'DiscMM', 'LoyalCH', 'SalePriceMM',
                                     'SalePriceCH', 'PriceDiff', 'PctDiscMM',
                                     'PctDiscCH', 'ListPriceDiff']),
                                   ('cat',
        

In [9]:
# step 3
print(f'training error = {1 - rbf_svm_class.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - rbf_svm_class.score(df_x_test, df_y_test):.3f}')

training error = 0.171
test error = 0.178
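
<p>The radial fit uses <code>gamma='auto'</code>. As a sketch of what that setting means, the code below compares the two named options in scikit-learn against their explicit formulas on synthetic data (an assumption; the OJ data is not used): <code>'auto'</code> is $1 / p$ and <code>'scale'</code> is $1 / (p \cdot \mathrm{Var}(X))$, where $p$ is the number of features after preprocessing.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

n_features = X.shape[1]
gamma_auto = 1.0 / n_features                # what gamma='auto' resolves to
gamma_scale = 1.0 / (n_features * X.var())   # what gamma='scale' resolves to

def decision_values(gamma):
    """Fit an RBF SVC with the given gamma and return its decision values."""
    return SVC(kernel="rbf", C=0.1, gamma=gamma).fit(X, y).decision_function(X)

auto_matches = np.allclose(decision_values("auto"), decision_values(gamma_auto))
scale_matches = np.allclose(decision_values("scale"), decision_values(gamma_scale))
print(gamma_auto, round(gamma_scale, 4), auto_matches, scale_matches)
```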


In [10]:
# step 4
rbf_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('rbf_svc', SVC(kernel='rbf', gamma='auto'))
])

param_grid={
    # C must be strictly positive, so the grid runs from 0.1 to 10 in steps of 0.1
    'rbf_svc__C': np.linspace(0.1, 10, 100),
}

grid_search_rbf_svm_class = GridSearchCV(rbf_svm_class, param_grid=param_grid, n_jobs=-1)
# cross-validate on the training set only, so the test set stays untouched for the final evaluation
_ = grid_search_rbf_svm_class.fit(df_x_train, df_y_train['Purchase'])


# print(grid_search_rbf_svm_class.best_estimator_)
print(f'cross-validation error = {1 - grid_search_rbf_svm_class.best_score_:.3f}')

cross-validation error = 0.166


In [11]:
# step 5
print(f'training error = {1 - grid_search_rbf_svm_class.best_estimator_.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - grid_search_rbf_svm_class.best_estimator_.score(df_x_test, df_y_test):.3f}')

training error = 0.145
test error = 0.189


<h3>Exercise 9.8.7</h3>
<blockquote>
    <i>Repeat parts 2 through 5 using a support vector machine with a polynomial kernel. Set $\mathrm{degree}=2$.</i>
</blockquote>

In [12]:
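# step 2
# degree-2 polynomial kernel; gamma is left at scikit-learn's default here,
# unlike the radial fit, which sets gamma='auto'
poly_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('poly_svc', SVC(kernel='poly', C=0.1, degree=2))
])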
_ = poly_svm_class.fit(df_x_train, df_y_train['Purchase'])
poly_svm_class.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                     transformer_weights=None,
                     transformers=[('quant',
                                    Pipeline(memory=None,
                                             steps=[('scaler',
                                                     StandardScaler(copy=True,
                                                                    with_mean=True,
                                                                    with_std=True))],
                                             verbose=False),
                                    ['WeekofPurchase', 'PriceCH', 'PriceMM',
                                     'DiscCH', 'DiscMM', 'LoyalCH', 'SalePriceMM',
                                     'SalePriceCH', 'PriceDiff', 'PctDiscMM',
                                     'PctDiscCH', 'ListPriceDiff']),
                                   ('cat',
        

In [13]:
# step 3
print(f'training error = {1 - poly_svm_class.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - poly_svm_class.score(df_x_test, df_y_test):.3f}')

training error = 0.198
test error = 0.200
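
<p>For reference, the polynomial kernel has the form $K(x, z) = (\gamma \langle x, z \rangle + c_0)^{d}$. The sketch below evaluates it by hand for $d = 2$ and compares the result with <code>sklearn.metrics.pairwise.polynomial_kernel</code>. The random input matrix and the value chosen for $\gamma$ are assumptions for illustration; $c_0 = 0$ matches the default <code>coef0</code> of <code>SVC</code>.</p>

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Small random input matrix (assumption): 5 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

degree = 2
gamma = 1.0 / X.shape[1]   # illustrative choice, equal to gamma='auto'
coef0 = 0.0                # SVC's default coef0

# Kernel matrix computed directly from the formula.
K_manual = (gamma * (X @ X.T) + coef0) ** degree

# Same kernel matrix from scikit-learn's helper.
K_sklearn = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)

print(np.allclose(K_manual, K_sklearn))
```

<p>With <code>coef0=0</code> and <code>degree=2</code> the kernel contains only second-order terms, which may be one reason a purely quadratic boundary does not help much on this data.</p>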


In [14]:
# step 4
poly_svm_class = Pipeline([
    ('preprocessor', preprocessor),
    ('poly_svc', SVC(kernel='poly', degree=2))
])

param_grid={
    # C must be strictly positive, so the grid runs from 0.1 to 10 in steps of 0.1
    'poly_svc__C': np.linspace(0.1, 10, 100),
}

grid_search_poly_svm_class = GridSearchCV(poly_svm_class, param_grid=param_grid, n_jobs=-1)
# cross-validate on the training set only, so the test set stays untouched for the final evaluation
_ = grid_search_poly_svm_class.fit(df_x_train, df_y_train['Purchase'])


# print(grid_search_poly_svm_class.best_estimator_)
print(f'cross-validation error = {1 - grid_search_poly_svm_class.best_score_:.3f}')

cross-validation error = 0.167


In [15]:
# step 5
print(f'training error = {1 - grid_search_poly_svm_class.best_estimator_.score(df_x_train, df_y_train):.3f}')
print(f'test error = {1 - grid_search_poly_svm_class.best_estimator_.score(df_x_test, df_y_test):.3f}')

training error = 0.137
test error = 0.200


<h3>Exercise 9.8.8</h3>
<blockquote>
    <i>Overall, which approach seems to give the best results on this data?</i>
</blockquote>

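<p>Judged by the test error after tuning $\mathrm{cost}$, the linear kernel performs best (0.185), followed by the radial kernel (0.189) and the degree-2 polynomial kernel (0.200). The cross-validation errors rank the kernels in the same order (0.164, 0.166 and 0.167), but the gaps are small, so the linear kernel is only marginally better on this data. The more flexible kernels also reach lower training errors (0.145 and 0.137, against 0.156 for the linear kernel) without improving the test error, which suggests some overfitting.</p>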
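
<p>The tuned results reported in parts 4 through 7 can be gathered into one table for comparison. The numbers below are copied from the outputs above; the column names and the <code>test_minus_train</code> gap are illustrative additions.</p>

```python
import pandas as pd

# Tuned models: cross-validation, training and test error as printed above.
results = pd.DataFrame(
    [
        ("linear", 0.164, 0.156, 0.185),
        ("radial", 0.166, 0.145, 0.189),
        ("polynomial (degree 2)", 0.167, 0.137, 0.200),
    ],
    columns=["kernel", "cv_error", "train_error", "test_error"],
)

# Gap between test and training error, a rough indicator of overfitting.
results["test_minus_train"] = (results["test_error"] - results["train_error"]).round(3)

ranked = results.sort_values("test_error").reset_index(drop=True)
print(ranked)
```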