# Assignment 4

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import mylib as my

## Training a linear soft-margin SVM model
* **(10 points)** Fetch the [Cleveland heart disease](https://archive.ics.uci.edu/ml/datasets/heart+disease) dataset from the UC Irvine Machine Learning Repository. Use the data from the `processed.cleveland.data` file.

In [2]:
df = pd.read_csv(my.download_zip_and_open_a_file('https://archive.ics.uci.edu/static/public/45/heart+disease.zip', 'processed.cleveland.data'), header=None)
df.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


* **(5 points)** Missing values are marked with `?` in the original file, and there are a few options for handling them. One of the simplest strategies is to remove any example that contains one or more missing values. Implement this strategy. Show the dataframe after this change.

In [3]:
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
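
As a side note on the drop-rows strategy, the `?` markers can also be turned into `NaN` while the file is being parsed, which leaves the affected columns numeric rather than `object`. The sketch below is only an illustration: the three-row `sample`, the reduced column list, and the name `demo` are made up for this example and are not taken from `processed.cleveland.data`.

```python
import io
import pandas as pd

# Hypothetical toy rows in the same comma-separated layout (NOT real data from the file).
sample = io.StringIO(
    "50.0,1.0,0.0,3.0,0\n"
    "60.0,1.0,?,3.0,2\n"
    "40.0,0.0,0.0,?,0\n"
)
cols = ['age', 'sex', 'ca', 'thal', 'target']

# na_values='?' makes pandas parse '?' as NaN, so 'ca' and 'thal' become float columns.
demo = pd.read_csv(sample, header=None, names=cols, na_values='?')

n_before = len(demo)
# Drop every row that has at least one missing value, then renumber the index.
demo = demo.dropna().reset_index(drop=True)

print(f'dropped {n_before - len(demo)} of {n_before} rows')
print(demo.dtypes)
```

Resetting the index is optional; it only avoids gaps in the row labels after rows are removed.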


* **(5 points)** Re-code the output column to have only two classes: 0 for the absence of the disease and 1 for its presence. The original output column includes five classes (0 for the absence of heart disease and 1, 2, 3, 4 for different variants of it). Re-code the values 2, 3, and 4 as 1.

In [4]:
df.target = df.target.map({
    0: 0,
    1: 1,
    2: 1,
    3: 1,
    4: 1
})
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
297,57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,1
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,1


* **(5 points)** Split the data 80/20 into two sets: one for training and another for testing.

In [5]:
y = df.iloc[:,-1].values
X = df.iloc[:,:-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'''
X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}
''')


X_train shape: (237, 13)
X_test shape: (60, 13)
y_train shape: (237,)
y_test shape: (60,)



* **(10 points)** Standardize the data, including all input features, to prepare it for the next steps. Ensure that the test data is not exposed or leaked during this standardization process.

In [7]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

* **(15 points)** Train a linear soft-margin SVM (using sklearn) on the standardized data. Use grid search to find the best value for the `C` parameter. Print the best `C` value.

In [None]:
# TODO
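
One possible way to search over `C` is sketched here on synthetic data, so it runs without the heart disease file. The data from `make_classification`, the candidate grid in `param_grid`, the 5-fold setting, and the variable names are all assumptions made for illustration. Wrapping the scaler and the SVM in a pipeline is a design choice, not a requirement of the task: it refits the scaler inside each cross-validation fold so the validation folds do not leak into the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 13 features, 2 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# make_pipeline names the SVC step 'svc', hence the 'svc__C' parameter key.
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear'))
param_grid = {'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}  # illustrative grid only

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(Xd_train, yd_train)

best_C = search.best_params_['svc__C']
print(f'best C: {best_C}, mean CV accuracy: {search.best_score_:.3f}')
```

With the default `refit=True`, `search.best_estimator_` is the pipeline refitted on the whole training set using the selected `C`.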

* **(5 points)** Show the testing accuracy of your model using the best value of C and print/plot its confusion matrix.


In [None]:
# TODO
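
A minimal sketch of the evaluation step, again on synthetic data so it is self-contained. The value `C=1.0` is only a placeholder for whatever the grid search selects, and the names `clf`, `scaler_demo`, `acc`, and `cm` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: 13 features, 2 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler_demo = StandardScaler().fit(Xd_train)
clf = SVC(kernel='linear', C=1.0)  # placeholder C, not a tuned value
clf.fit(scaler_demo.transform(Xd_train), yd_train)

yd_pred = clf.predict(scaler_demo.transform(Xd_test))
acc = accuracy_score(yd_test, yd_pred)
cm = confusion_matrix(yd_test, yd_pred)  # rows: true class, columns: predicted class

print(f'test accuracy: {acc:.3f}')
print(cm)
```

The diagonal of `cm` counts the correct predictions, so the accuracy equals `cm.trace() / cm.sum()`.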

## Multiclass classification with One vs. One

While Support Vector Machines (SVMs) are by design binary classifiers, the One vs. One (OvO) method can be used to adapt them for multiclass classification problems. The algorithm works as follows:

Given a dataset with $C$ classes (the number of classes here, not the SVM regularization parameter `C`), create and train $C(C-1)/2$ binary classifiers, one for each pair of classes, each trained only on the examples of its two classes. For prediction, an unseen example is passed to all of these binary classifiers, and each returns its own prediction. The class that receives the most votes is returned as the final class.
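
The pairing and voting steps can be sketched without training any model. In the example below the pairwise "classifiers" are simulated: each one votes for the true class when that class is one of its two candidates, and otherwise for the first class of its pair. The helper `majority_vote`, the toy `true_classes`, and the tie-breaking rule (lowest class index, a consequence of `argmax`) are assumptions for illustration; the description above does not specify how ties are resolved.

```python
from itertools import combinations
import numpy as np

n_classes = 5
pairs = list(combinations(range(n_classes), 2))  # one pair per binary classifier
print(len(pairs))  # C(C-1)/2 = 10 for C = 5

def majority_vote(pairwise_preds, n_classes):
    """pairwise_preds has shape (n_classifiers, n_samples) and holds class indices."""
    n_samples = pairwise_preds.shape[1]
    votes = np.zeros((n_samples, n_classes), dtype=int)
    rows = np.arange(n_samples)
    for preds in pairwise_preds:
        votes[rows, preds] += 1   # each classifier casts one vote per sample
    return votes.argmax(axis=1)   # ties go to the lowest class index (assumption)

# Simulated pairwise predictions for four toy samples.
true_classes = np.array([0, 3, 4, 1])
pairwise_preds = np.array([
    [t if t in (a, b) else a for t in true_classes]
    for a, b in pairs
])

winners = majority_vote(pairwise_preds, n_classes)
print(winners)
```

In this simulation the true class collects $C-1 = 4$ votes, while any other class collects at most 3, so the vote recovers `true_classes`.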

* **(30 points)** Define a class named `OneVOne` that implements the One vs. One algorithm to generalize linear Support Vector Classification (SVC) models to multiclass classification. This class should follow the structure of a typical scikit-learn estimator by inheriting from `sklearn.base.BaseEstimator` and should include `fit` and `predict` methods.

In [None]:
# TODO
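
For orientation, one possible shape for such an estimator is sketched below. It is not a reference solution: the class name `OneVOneSketch`, the single hyperparameter `C`, the attribute names `classes_` and `estimators_`, and the tie-breaking rule (lowest class index) are all assumptions. The toy blobs at the end, including `cluster_std=0.5`, are arbitrary and only exercise the code.

```python
from itertools import combinations
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

class OneVOneSketch(BaseEstimator):
    """Illustrative One vs. One wrapper around linear SVC models."""

    def __init__(self, C=1.0):
        self.C = C  # regularization strength passed to every pairwise SVC

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.estimators_ = {}
        # Train one binary classifier per unordered pair of classes.
        for i, j in combinations(range(len(self.classes_)), 2):
            mask = (y == self.classes_[i]) | (y == self.classes_[j])
            clf = SVC(kernel='linear', C=self.C)
            clf.fit(X[mask], y[mask])
            self.estimators_[(i, j)] = clf
        return self

    def predict(self, X):
        X = np.asarray(X)
        votes = np.zeros((X.shape[0], len(self.classes_)), dtype=int)
        rows = np.arange(X.shape[0])
        for clf in self.estimators_.values():
            # Map predicted labels back to their positions in classes_.
            idx = np.searchsorted(self.classes_, clf.predict(X))
            votes[rows, idx] += 1
        # Ties are broken toward the lowest class index (assumption).
        return self.classes_[votes.argmax(axis=1)]

# Quick check on arbitrary toy data.
X_toy, y_toy = make_blobs(n_samples=150, centers=3, n_features=2,
                          cluster_std=0.5, random_state=0)
model = OneVOneSketch(C=1.0).fit(X_toy, y_toy)
train_acc = np.mean(model.predict(X_toy) == y_toy)
print(f'pairwise models: {len(model.estimators_)}, training accuracy: {train_acc:.3f}')
```

Storing hyperparameters unchanged in `__init__` and fitted state in attributes ending with an underscore follows the usual scikit-learn estimator convention, which is what lets `get_params` and `set_params` from `BaseEstimator` work.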

* **(15 points)** Test your implementation using a dataset created with scikit-learn's `make_blobs` function, with 2 features, 5 centers, and the random state set to 11 for reproducibility. Plot the decision regions produced by your implementation alongside the regions generated by scikit-learn's implementation of the same algorithm. You can request scikit-learn's One vs. One output with the `decision_function_shape='ovo'` parameter of `SVC`. Note that `SVC` always trains multiclass models One vs. One internally; this parameter only changes the shape of the values returned by `decision_function`.

In [None]:
# TODO
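
A sketch of how decision regions can be drawn for the scikit-learn side of the comparison: predict on a dense grid covering the data, then colour the grid by predicted class. The helper `grid_predictions`, the grid `step` and `pad`, the colormap, and `n_samples=300` are assumptions for illustration; only the 2 features, 5 centers, and random state 11 come from the task statement.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# n_samples is an arbitrary choice; the task fixes only features, centers, and seed.
X_blobs, y_blobs = make_blobs(n_samples=300, n_features=2, centers=5, random_state=11)

def grid_predictions(model, X, step=0.1, pad=1.0):
    """Predict the class at every point of a regular grid covering X (hypothetical helper)."""
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, zz

svc = SVC(kernel='linear', decision_function_shape='ovo').fit(X_blobs, y_blobs)
xx, yy, zz = grid_predictions(svc, X_blobs)

fig, ax = plt.subplots(figsize=(5, 4))
# One filled band per class label 0..4.
ax.contourf(xx, yy, zz, levels=np.arange(-0.5, 5.5, 1), alpha=0.3, cmap='tab10')
ax.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_blobs, cmap='tab10', s=15)
ax.set_title('sklearn SVC (linear kernel)')
plt.show()
```

Because `grid_predictions` only relies on a `predict` method, the same helper can be reused with any other fitted classifier to draw a second panel for a side-by-side comparison.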