1. Load the data set into Python using, e.g., load_wine or genfromtxt, as
appropriate. In the case of the USPS dataset, merge the original training
and test sets into one dataset.

## Load Wine data and ZIP code data (merge the ZIP code test and train data)

In [1]:
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# 257 columns; column 1 contains the digit id (0-9); other columns are the 256 grayscale values

# 7291 rows
zc = np.genfromtxt("zip.train", delimiter=' ', dtype=None)

# 2007 rows
zc2 = np.genfromtxt("zip.test", delimiter=' ')

zipcodes = np.concatenate((zc, zc2), axis=0)
print(zipcodes.shape)

(9298, 257)
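
The merge only works if both files parse to 2-D arrays with the same number of columns. The following is a minimal sketch of the same load-and-concatenate pattern. It uses two tiny made-up files (`toy.train` and `toy.test`, hypothetical names and values) in place of `zip.train` and `zip.test`, so it runs without the USPS data.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in data: first column is the label,
# the remaining columns play the role of the grayscale values.
train_rows = np.array([[6.0, -1.0, 0.5, 1.0],
                       [5.0, -1.0, -0.2, 0.3]])
test_rows = np.array([[9.0, 0.1, 0.2, -1.0],
                      [3.0, 0.4, -0.6, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "toy.train")
    test_path = os.path.join(tmp, "toy.test")
    np.savetxt(train_path, train_rows, delimiter=" ")
    np.savetxt(test_path, test_rows, delimiter=" ")

    # Same calls as in the notebook, applied to the toy files.
    part_a = np.genfromtxt(train_path, delimiter=" ")
    part_b = np.genfromtxt(test_path, delimiter=" ")

# Stack the rows of both parts into one dataset.
merged = np.concatenate((part_a, part_b), axis=0)
labels = merged[:, 0].astype(int)   # first column: labels
features = merged[:, 1:]            # remaining columns: features

print(merged.shape, labels, features.shape)
```

Note that a file with a single row would be parsed as a 1-D array, in which case `np.atleast_2d` would be needed before concatenating.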


In [4]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

2. Divide the dataset into a training set and a test set. You may use the
function train_test_split. Use your birthday in the format DDMM as
random_state (omit leading zeros if any).

## Split datasets into test and train set.

In [5]:
from sklearn.model_selection import train_test_split

X_wine = wine.data
y_wine = wine.target

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine,
                                                                        y_wine,
                                                                        random_state=308)

print(X_train_wine.shape, X_test_wine.shape, y_train_wine.shape, y_test_wine.shape)

y_zipcodes = zipcodes[:, 0] # Just target the first column, which contains the labels.
X_zipcodes = np.delete(zipcodes, 0, axis=1) # Get all columns except the first one.

print(X_zipcodes.shape)
print(y_zipcodes.shape)

X_train_zipcodes, X_test_zipcodes, y_train_zipcodes, y_test_zipcodes = train_test_split(X_zipcodes,
                                                                                       y_zipcodes,
                                                                                       random_state=308)

print(X_train_zipcodes.shape, X_test_zipcodes.shape, y_train_zipcodes.shape, y_test_zipcodes.shape)

(133, 13) (45, 13) (133,) (45,)
(9298, 256)
(9298,)
(6973, 256) (2325, 256) (6973,) (2325,)
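
The shapes above follow from the default `test_size` of 0.25 in `train_test_split`, with the test count rounded up: ceil(0.25 * 178) = 45 and ceil(0.25 * 9298) = 2325. The sketch below checks this for the Wine data and also shows the optional `stratify` argument, which this notebook does not use; it is included only as an illustration of how class proportions could be preserved in both parts.

```python
import math

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Default split: 25% of the samples go to the test set (rounded up).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=308)
expected_test = math.ceil(0.25 * len(X))
print(X_tr.shape, X_te.shape, expected_test)

# Optional variant (not used in this notebook): keep class proportions
# roughly equal in the training and test parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=308, stratify=y)
print("Test class counts, plain split:     ", np.bincount(y_te))
print("Test class counts, stratified split:", np.bincount(y_te_s))
```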


3. Using cross-validation and the training set only, estimate the generalization accuracy of the SVM with the default values of the parameters. You may use the function cross_val_score.

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

svm = SVC()

# Default value for cv is changing from 3 to 5, so I explicitly set cv to 3 to remove the warning
cv_score_wine = cross_val_score(svm, X_train_wine, y_train_wine, cv=3)
cv_score_zipcodes = cross_val_score(svm, X_train_zipcodes, y_train_zipcodes, cv=3)

print("Cross Validation Scores:\nWine: {}\nZIP Codes: {}".format(cv_score_wine, cv_score_zipcodes))
print("Avg. Cross Validation Scores:\nWine: {}\nZIP Codes: {}".format(cv_score_wine.mean(), cv_score_zipcodes.mean()))



Cross Validation Scores:
Wine: [0.42222222 0.4        0.37209302]
ZIP Codes: [0.96005155 0.96385542 0.96510125]
Avg. Cross Validation Scores:
Wine: 0.39810508182601206
ZIP Codes: 0.9630027391799795


In [6]:
print("Avg. Error Rate:\nWine: {}\nZIP Codes: {}".format(1 - cv_score_wine.mean(), 1 - cv_score_zipcodes.mean()))

Avg. Error Rate:
Wine: 0.6018949181739879
ZIP Codes: 0.03699726082002053
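
A mean error rate says nothing about how much the folds disagree. The sketch below reports the spread across folds together with the mean for the Wine training set. The comment about the `cv` default suggests this notebook was run on an older scikit-learn release; in current releases `SVC` defaults to `gamma='scale'`, so the numbers printed by this sketch may differ from those shown above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=308)

# Accuracy on each of the 3 folds, using the training set only.
scores = cross_val_score(SVC(), X_train, y_train, cv=3)
error_rates = 1 - scores

print("Fold error rates:", np.round(error_rates, 3))
print("Mean error rate: {:.3f} +/- {:.3f}".format(error_rates.mean(),
                                                  error_rates.std()))
```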


4. Find the test error rate of the SVM with the default values of parameters, compare it with the estimate obtained in the previous task (task 3), and write your observations in a markdown cell of your Jupyter notebook.

In [7]:
svm1 = SVC().fit(X_train_wine, y_train_wine)
svm2 = SVC().fit(X_train_zipcodes, y_train_zipcodes)

# Test accuracy for wine
svm_score_wine = svm1.score(X_test_wine, y_test_wine)
# Test accuracy for ZIP codes
svm_score_zipcodes = svm2.score(X_test_zipcodes, y_test_zipcodes)

print("Test Error Rates:\nWine: {}\nZIP Codes: {}".format(1 - svm_score_wine, 1 - svm_score_zipcodes))



Test Error Rates:
Wine: 0.3555555555555555
ZIP Codes: 0.03741935483870973


In [8]:
print("Accuracy:\nWine: {}\nZIP Codes: {}".format(svm_score_wine, svm_score_zipcodes))

Accuracy:
Wine: 0.6444444444444445
ZIP Codes: 0.9625806451612903


# 3. Estimate of the generalization accuracy of SVM

> Predicted Error Rates (Using Cross-Validation):
* Wine: 0.6018949181739879 = **60.2%**
* ZIP codes: 0.03699726082002053 = **3.7%**

# 4. Test Error Rates:

> Actual Test Error Rates:
* Wine: 0.3555555555555555 = **35.6%**
* ZIP codes: 0.03741935483870973 = **3.7%**

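As we can see, the cross-validation estimate and the actual test error rate for the Wine dataset (using SVM with the default parameters) are quite different: the actual test error rate (35.6%) was a little over half of the estimate (60.2%). With only 133 training samples split into 3 folds, each fold holds about 44 samples and each model is trained on about 89, so the estimate is noisy.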

On the other hand, the predicted and actual test error rates for the ZIP codes dataset were almost identical (both were approximately 3.7%).

This means that the cross-validation estimate of the test error rate was accurate for the ZIP codes dataset, but not for the Wine dataset.
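
One plausible contributor to the poor Wine results with default parameters (an assumption, not something the outputs above prove) is that the Wine features are on very different scales, for example Proline in the hundreds and Hue around 1. The sketch below compares the cross-validation accuracy of a plain `SVC` with that of an `SVC` preceded by `StandardScaler`, using the training set only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, random_state=308)

# SVC on the raw features.
raw_scores = cross_val_score(SVC(), X_train, y_train, cv=3)

# SVC on standardized features; the scaler is refit inside each fold,
# so no information from the held-out fold leaks into the scaling.
scaled_pipe = make_pipeline(StandardScaler(), SVC())
scaled_scores = cross_val_score(scaled_pipe, X_train, y_train, cv=3)

print("Raw features:    {:.3f}".format(raw_scores.mean()))
print("Scaled features: {:.3f}".format(scaled_scores.mean()))
```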

5. Create a pipeline for SVM involving data normalization and SVC, and use grid search and cross-validation to tune parameters C and gamma for the pipeline, avoiding data snooping and data leakage. You may use the scikit-learn class GridSearchCV. Experiment with different ways of doing normalization (such as StandardScaler, MinMaxScaler, RobustScaler, and Normalizer). Which ways are appropriate for either dataset? (The answer, which should be written in your Jupyter notebook, may depend on the results that you obtain for the next task.)

6. Fit the GridSearchCV object of task 5 to the training set and use it to predict the test labels. Write the resulting test accuracy in your Jupyter notebook.

### 5 & 6 Create pipeline for SVM; use grid search to tune parameters; fit GridSearchCV object to training set and predict test set labels.

Create pipeline and GridSearchCV objects for Wine Dataset; fit them and get accuracies.

In [19]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

param_grid = {'svc__C': [0.01, 0.1, 1, 10, 100],
              'svc__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

minmax_pipe = make_pipeline(MinMaxScaler(), SVC())
norm_pipe = make_pipeline(Normalizer(), SVC())
robust_pipe = make_pipeline(RobustScaler(), SVC())
standard_pipe = make_pipeline(StandardScaler(), SVC())

minmax_grid = GridSearchCV(minmax_pipe, param_grid=param_grid, cv=5)
norm_grid = GridSearchCV(norm_pipe, param_grid=param_grid, cv=5)
robust_grid = GridSearchCV(robust_pipe, param_grid=param_grid, cv=5)
standard_grid = GridSearchCV(standard_pipe, param_grid=param_grid, cv=5)

grids = [minmax_grid, norm_grid, robust_grid, standard_grid]
best_grids_wine = []

for grid in grids:
    
    grid.fit(X_train_wine, y_train_wine)
    print("\nGrid:", grid.estimator.steps[0])
    print("Best cross-validation accuracy:", grid.best_score_)
    print("Test set score:", grid.score(X_test_wine, y_test_wine))
    print("Best parameters:", grid.best_params_)
    best_grids_wine.append(grid.best_estimator_)




Grid: ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))
Best cross-validation accuracy: 0.9924812030075187
Test set score: 1.0
Best parameters: {'svc__C': 1, 'svc__gamma': 1}





Grid: ('normalizer', Normalizer(copy=True, norm='l2'))
Best cross-validation accuracy: 0.9172932330827067
Test set score: 0.9555555555555556
Best parameters: {'svc__C': 100, 'svc__gamma': 100}





Grid: ('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True))
Best cross-validation accuracy: 0.9849624060150376
Test set score: 1.0
Best parameters: {'svc__C': 10, 'svc__gamma': 0.01}

Grid: ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))
Best cross-validation accuracy: 0.9849624060150376
Test set score: 1.0
Best parameters: {'svc__C': 1, 'svc__gamma': 0.1}




Create pipelines and GridSearchCV objects for ZIP codes dataset; fit them and get accuracies. **Takes ages to run**

In [20]:
minmax_pipe = make_pipeline(MinMaxScaler(), SVC())
norm_pipe = make_pipeline(Normalizer(), SVC())
robust_pipe = make_pipeline(RobustScaler(), SVC())
standard_pipe = make_pipeline(StandardScaler(), SVC())

minmax_grid = GridSearchCV(minmax_pipe, param_grid=param_grid, cv=5)
norm_grid = GridSearchCV(norm_pipe, param_grid=param_grid, cv=5)
robust_grid = GridSearchCV(robust_pipe, param_grid=param_grid, cv=5)
standard_grid = GridSearchCV(standard_pipe, param_grid=param_grid, cv=5)

grids = [minmax_grid, norm_grid, robust_grid, standard_grid]
best_grids_zipcodes = []

for grid in grids:
    
    grid.fit(X_train_zipcodes, y_train_zipcodes)
    print("\nGrid:", grid.estimator.steps[0])
    print("Best cross-validation accuracy:", grid.best_score_)
    print("Test set score:", grid.score(X_test_zipcodes, y_test_zipcodes))
    print("Best parameters:", grid.best_params_)
    best_grids_zipcodes.append(grid.best_estimator_)


Grid: ('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1)))
Best cross-validation accuracy: 0.9716047612218557
Test set score: 0.9681720430107527
Best parameters: {'svc__C': 10, 'svc__gamma': 0.01}

Grid: ('normalizer', Normalizer(copy=True, norm='l2'))
Best cross-validation accuracy: 0.9747597877527606
Test set score: 0.9703225806451613
Best parameters: {'svc__C': 10, 'svc__gamma': 1}

Grid: ('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True))
Best cross-validation accuracy: 0.8792485300444572
Test set score: 0.8713978494623655
Best parameters: {'svc__C': 100, 'svc__gamma': 0.001}

Grid: ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))
Best cross-validation accuracy: 0.9670156317223577
Test set score: 0.9625806451612903
Best parameters: {'svc__C': 10, 'svc__gamma': 0.001}


## 5 & 6 Results:

### Wine Dataset:

minmaxscaler:
* Best cross-validation accuracy: 0.9924812030075187
* Test set score: 1.0
* Test Error Rate: 0
* Best Parameters for SVM {'svc__C': 1, 'svc__gamma': 1}

normalizer:
* Best cross-validation accuracy: 0.9172932330827067
* Test set score: 0.9555555555555556
* Test Error Rate: 0.04444444444
* Best parameters for SVM {'svc__C': 100, 'svc__gamma': 100}

robustscaler:
* Best cross-validation accuracy: 0.9849624060150376
* Test set score: 1.0
* Test Error Rate: 0
* Best parameters for SVM {'svc__C': 10, 'svc__gamma': 0.01}

standardscaler:
* Best cross-validation accuracy: 0.9849624060150376
* Test set score: 1.0
* Test Error Rate: 0
* Best parameters for SVM {'svc__C': 1, 'svc__gamma': 0.1}

> For the Wine dataset, minmaxscaler had the highest cross-validation accuracy (~ 99.2%); robustscaler and standardscaler tied for second (~ 98.5%), while normalizer had the lowest (~ 91.7%).
> On the test set, minmaxscaler, robustscaler and standardscaler all reached 100% accuracy, while normalizer reached only ~ 95.6%. Since three scalers tie on the test set, the cross-validation accuracy is the deciding factor: minmaxscaler (with SVM parameters C=1 and gamma=1) is the best choice, although robustscaler and standardscaler would be suitable alternatives.
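
Choosing the scaler by looking at test accuracy risks data snooping. A possible alternative, sketched below, is to treat the scaler itself as a hyperparameter of a single pipeline so that the choice is made by cross-validation on the training set alone. This is not how the cells above were run; the step name `"scaler"` and the variable names are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, Normalizer, RobustScaler,
                                   StandardScaler)
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=308)

# One pipeline; the "scaler" step is swapped out by the grid search.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {
    "scaler": [MinMaxScaler(), Normalizer(), RobustScaler(), StandardScaler()],
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

chosen_scaler = type(search.best_params_["scaler"]).__name__
test_accuracy = search.score(X_test, y_test)  # test set used once, at the end

print("Chosen scaler:", chosen_scaler)
print("Best cross-validation accuracy:", search.best_score_)
print("Test set accuracy:", test_accuracy)
```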

### ZIP Code Dataset:

minmaxscaler:
* Best cross-validation accuracy: 0.9716047612218557
* Test set score: 0.9681720430107527
* Test Error Rate: 0.03182795698
* Best Parameters for SVM {'svc__C': 10, 'svc__gamma': 0.01}

normalizer:
* Best cross-validation accuracy: 0.9747597877527606
* Test set score: 0.9703225806451613
* Test Error Rate: 0.02967741935
* Best parameters for SVM {'svc__C': 10, 'svc__gamma': 1}

robustscaler:
* Best cross-validation accuracy: 0.8792485300444572
* Test set score: 0.8713978494623655
* Test Error Rate: 0.12860215053
* Best parameters for SVM {'svc__C': 100, 'svc__gamma': 0.001}

standardscaler:
* Best cross-validation accuracy: 0.9670156317223577
* Test set score: 0.9625806451612903
* Test Error Rate: 0.03741935483
* Best parameters for SVM {'svc__C': 10, 'svc__gamma': 0.001}

> For the ZIP code dataset, normalizer had the highest cross-validation accuracy (~ 97.5%) and the highest test set accuracy (~ 97.0%), making it the best of the four ways of normalising this dataset. minmaxscaler came second with ~ 97.2% and ~ 96.8% respectively, followed by standardscaler with ~ 96.7% and ~ 96.3%. robustscaler had the worst accuracies, with a cross-validation accuracy of ~ 87.9% and a test set accuracy of ~ 87.1%. minmaxscaler or standardscaler could be suitable alternatives to normalizer.
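
A possible explanation for the weak robustscaler result (a hypothesis, not verified on the actual data) is that many pixels take the background value in most images, so their interquartile range is tiny and dividing by it inflates the few non-background values. The sketch below illustrates this on one synthetic "pixel" column; the value range of -1 to 1 and the 74% background share are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Synthetic pixel column (hypothetical): 74% background at -1.0,
# the rest spread evenly between -0.99 and 1.0.
background = np.full(740, -1.0)
ink = np.linspace(-0.99, 1.0, 260)
pixel = np.concatenate((background, ink)).reshape(-1, 1)

# RobustScaler subtracts the median and divides by the interquartile range.
robust = RobustScaler().fit_transform(pixel)
# MinMaxScaler maps the observed minimum and maximum to 0 and 1.
minmax = MinMaxScaler().fit_transform(pixel)

print("RobustScaler range:", robust.min(), robust.max())
print("MinMaxScaler range:", minmax.min(), minmax.max())
```

With features on such different ranges, a single `gamma` for the RBF kernel fits some pixels much worse than others, which may be why the grid search settles on the smallest `gamma` for this scaler.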

### 7. Get conformity scores for Wine dataset
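
For a three-class problem, `decision_function` returns one column per class for every sample. The sketch below shows one possible way to reduce this to a single number per sample by picking the column of the sample's true label. This reduction is an assumption made for illustration and is not necessarily the method used in the rest of this notebook; `true_label_scores` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
pipe = make_pipeline(MinMaxScaler(), SVC(C=1, gamma=1))
kf = KFold(n_splits=5, shuffle=True, random_state=308)

# One score per sample, filled in fold by fold.
true_label_scores = np.empty(len(y))

for rest_index, fold_index in kf.split(X):
    pipe.fit(X[rest_index], y[rest_index])
    # Shape (n_fold_samples, 3): one decision value per class.
    scores = pipe.decision_function(X[fold_index])
    # Labels are 0, 1, 2, so the label doubles as the column index.
    rows = np.arange(len(fold_index))
    true_label_scores[fold_index] = scores[rows, y[fold_index]]

print(true_label_scores.shape)
print(np.round(true_label_scores[:5], 3))
```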

In [11]:
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Best estimator for Wine based on task 5.
minmax_pipe = make_pipeline(MinMaxScaler(), SVC(C=1, gamma=1))

kf = KFold(shuffle=True, random_state=308, n_splits=5)

# decision_function outputs for each fold, used here as conformity scores
conformity_scores = []
# True labels for each fold, in the same order as conformity_scores
actual_labels = []

folds = []

for rest_index, fold_index in kf.split(wine.data):
    print("Current fold:", fold_index)
    print("The rest of the training set:", rest_index)
    
    X_rest, X_fold = wine.data[rest_index], wine.data[fold_index]
    y_rest, y_fold = wine.target[rest_index], wine.target[fold_index]
    
    actual_labels.append(y_fold)
    
    minmax_pipe.fit(X_rest,y_rest)
    
    conformity_scores.append(minmax_pipe.decision_function(X_fold))
    print(minmax_pipe.score(X_fold,y_fold))


print(conformity_scores)

Current fold: [  8  10  11  13  18  22  39  44  56  60  64  65  67  70  71  84  86  98
 103 105 108 115 116 121 123 124 128 132 135 136 137 146 148 153 166 168]
The rest of the training set: [  0   1   2   3   4   5   6   7   9  12  14  15  16  17  19  20  21  23
  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  40  41  42
  43  45  46  47  48  49  50  51  52  53  54  55  57  58  59  61  62  63
  66  68  69  72  73  74  75  76  77  78  79  80  81  82  83  85  87  88
  89  90  91  92  93  94  95  96  97  99 100 101 102 104 106 107 109 110
 111 112 113 114 117 118 119 120 122 125 126 127 129 130 131 133 134 138
 139 140 141 142 143 144 145 147 149 150 151 152 154 155 156 157 158 159
 160 161 162 163 164 165 167 169 170 171 172 173 174 175 176 177]
1.0
Current fold: [  0   4   5   9  15  19  21  27  28  30  36  41  53  66  69  72  75  79
  83  85  87  96  97 100 101 107 112 113 114 125 130 138 139 152 163 167]
The rest of the training set: [  1   2   3   6   7   8  10  11  12 