# Exercise 1

## Build a pipeline for Classification

We'll build a pipeline that includes scaling and hyperparameter tuning to classify wine quality. 

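We'll use the **SVM classifier** (`SVC`). The hyperparameters we will tune are `C` and `gamma`. `C` controls the regularization strength: as with the `C` hyperparameter in **logistic regression**, it is the inverse of the regularization strength, so smaller values mean stronger regularization. `gamma` is the **kernel coefficient** of the RBF kernel, which `SVC` uses by default.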
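
To make the role of `gamma` concrete, here is a minimal sketch that assumes the RBF kernel, `exp(-gamma * ||x - x'||^2)`. The helper `rbf_kernel_value` and the two points are hypothetical and used only for illustration; they are not part of scikit-learn or of the wine data.

```python
import math

def rbf_kernel_value(x, x_prime, gamma):
    """Illustrative helper (not part of scikit-learn): RBF kernel for two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

# Two made-up points at squared distance 2.
point_a = [0.0, 0.0]
point_b = [1.0, 1.0]

# The two gamma values searched later in this exercise.
similarity_large_gamma = rbf_kernel_value(point_a, point_b, gamma=0.1)
similarity_small_gamma = rbf_kernel_value(point_a, point_b, gamma=0.01)

print(similarity_large_gamma)  # exp(-0.2)
print(similarity_small_gamma)  # exp(-0.02)
```

A larger `gamma` makes the similarity between two points fall off faster with distance, so each training point influences a smaller neighbourhood and the decision boundary can become more flexible.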

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [2]:
# load the data
df = pd.read_csv('../data/white-wine.csv')
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [3]:
# binarize the target: True for quality >= 5, False otherwise
df['quality'] = df['quality'] >= 5
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,True
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,True
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,True
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,True
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,True


In [4]:
df.groupby('quality').quality.count()

quality
False     183
True     4715
Name: quality, dtype: int64

In [5]:
# split into the feature matrix X and the target vector y
X = df.drop('quality', axis=1).values
y = df.quality.values

print(type(X), X.shape)
print(type(y), y.shape)

<class 'numpy.ndarray'> (4898, 11)
<class 'numpy.ndarray'> (4898,)


Set up the pipeline with the following steps:

- Scaling, called `scaler` with `StandardScaler()`.
- Classification, called `svm` with `SVC()`.

In [6]:
steps = [('scaler', StandardScaler()), ('svm', SVC())]
pipeline = Pipeline(steps)

Specify the hyperparameter space using the following notation: `step_name__parameter_name`. 

Here, the `step_name` is `svm`, and the parameter names are `C` and `gamma`.
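
One way to check which names are valid is to ask the pipeline itself. The sketch below rebuilds the same two steps under the hypothetical name `demo_pipeline` so that it runs on its own; `get_params`, `set_params` and `named_steps` are standard `Pipeline` members.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same two steps as in the exercise, rebuilt here so the snippet is self-contained.
demo_pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])

# Collect the parameter names that belong to the 'svm' step.
svm_param_names = sorted(
    name for name in demo_pipeline.get_params() if name.startswith('svm__')
)
print(svm_param_names)

# The same double-underscore notation sets a value directly.
demo_pipeline.set_params(svm__C=10)
print(demo_pipeline.named_steps['svm'].C)  # 10
```

Any name in that list can be used as a key in the hyperparameter grid.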

In [7]:
# Specify the hyperparameter space
parameters = {'svm__C':[1, 10, 100],
              'svm__gamma':[0.1, 0.01]}
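
As a rough cost estimate, the grid above can be expanded by hand. This is only a sketch: the variable names are illustrative, and the fold count of 3 is an assumption taken from the cross-validation setting used in this exercise.

```python
from itertools import product

# Grid copied from the cell above.
parameters = {'svm__C': [1, 10, 100],
              'svm__gamma': [0.1, 0.01]}

# Every (C, gamma) pair that an exhaustive grid search would try.
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 3  # assumption: 3-fold cross-validation, as used in this exercise
n_fits = len(combinations) * n_folds

print(len(combinations))  # 6
print(n_fits)  # 18
```

With the default `refit=True`, `GridSearchCV` also trains the best combination once more on the whole training set, so one extra fit is added to that total.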

Create training and test sets, with 20% of the data used for the test set. Use a random state of 21.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
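
The resulting split sizes can be checked by hand. The sketch below assumes that a fractional test size is rounded up and that the remaining rows go to the training set, which agrees with the total support of 980 in the classification report at the end of this exercise.

```python
import math

n_samples = 4898  # rows in the wine data, from the shape printed earlier
test_size = 0.2

# Assumption: the fractional test size is rounded up; the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 3918 980
```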

Instantiate `GridSearchCV` with the pipeline and the hyperparameter space. Use 3-fold cross-validation. This was the default in older scikit-learn releases, but since version 0.22 the default is 5-fold, so `cv=3` is passed explicitly.

In [9]:
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)

Fit the grid search to the training set, predict the labels of the test set, and compute the metrics.

In [10]:
# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))



Accuracy: 0.9693877551020408
              precision    recall  f1-score   support

       False       0.43      0.10      0.17        29
        True       0.97      1.00      0.98       951

   micro avg       0.97      0.97      0.97       980
   macro avg       0.70      0.55      0.58       980
weighted avg       0.96      0.97      0.96       980

Tuned Model Parameters: {'svm__C': 100, 'svm__gamma': 0.01}
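
Because the classes are heavily imbalanced, it helps to compare the accuracy with a trivial baseline. The sketch below uses only the numbers printed above; the always-predict-`True` model is a hypothetical reference point, not something trained in this exercise.

```python
# Support counts taken from the classification report above.
n_false = 29
n_true = 951
n_total = n_false + n_true

reported_accuracy = 0.9693877551020408  # printed by the cell above

# Accuracy of a hypothetical model that always predicts the majority class (True).
majority_baseline = n_true / n_total

# Number of correct test predictions implied by the reported accuracy.
n_correct = round(reported_accuracy * n_total)

print(f"majority baseline: {majority_baseline:.4f}")  # 0.9704
print(f"tuned SVM:         {reported_accuracy:.4f}")  # 0.9694
print(n_correct)  # 950
```

The tuned model is correct on 950 of the 980 test samples, one fewer than the 951 the baseline would get, so accuracy alone says little here. The recall of 0.10 for the `False` class is the more informative number.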
