# Supervised classification

## Setup
### Imports

In [None]:
import pandas as pd                                     # for dataset manipulation (DataFrames)
import numpy as np                                      # allows some mathematical operations
import matplotlib.pyplot as plt                         # library used to display graphs
import seaborn as sns                                   # more convenient visualisation library for dataframes
import time                                             # for execution time measurement
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit, cross_val_predict
from sklearn.datasets import load_breast_cancer
from sklearn import metrics

### Loading the dataset

Today's dataset is the [Breast cancer Wisconsin dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).

In [None]:
breast_cancer_dataset = load_breast_cancer()
df = pd.DataFrame(np.c_[breast_cancer_dataset['data'], breast_cancer_dataset['target']],
                  columns= np.append(breast_cancer_dataset['feature_names'], ['target']))

## Observing the dataset

Using what you have learned in the previous lessons, examine the dataset and see what you can learn about it. In the cells below, write down any information that seems relevant to you. Focus in particular on understanding what the dataset and its columns contain.

**Comment**

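This dataset contains 569 observations with 30 features and 2 classes. Each observation is a case of suspected breast cancer, and the features are measurements of cell nuclei taken from images of a breast mass sample, used to assess whether the patient is ill (212 cases, target 0) or healthy (357 cases, target 1). The classes are therefore moderately imbalanced (about 37% ill and 63% healthy).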
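
The class balance quoted above can be turned into a quick sanity check. The following is a minimal sketch, using only the two counts from the comment, of the class proportions and of the accuracy that a trivial majority-class predictor would reach; the variable names are illustrative.

```python
# Class counts taken from the comment above (212 ill, 357 healthy).
class_counts = {"ill": 212, "healthy": 357}
n_observations = sum(class_counts.values())

# Share of each class in the dataset.
class_shares = {label: count / n_observations for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(class_shares.values())

for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained later should clearly beat this baseline to be considered useful.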

In [None]:
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0.0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0.0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0.0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0.0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0.0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0.0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0.0


In [None]:
print(df.shape)
print(df["target"].value_counts())

(569, 31)
1.0    357
0.0    212
Name: target, dtype: int64


## Preparing the data

1. Define the input (X) and output (y) of the classification.
2. Using `scikit-learn`'s `train_test_split` function, separate X and y into training and test sets. Make it so that 80% of the data is used for training, and 20% for testing.

In [None]:
#1.

X = df.drop(['target'], axis = 1)
y = df.target

#2.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
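
Because the two classes are not equally frequent, one option (not used in the cell above) is to pass `stratify` to `train_test_split` so that both subsets keep roughly the same class proportions. The sketch below reloads the data so that it runs on its own; the `_s` variable names are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data as plain arrays so the example is self-contained.
X_all, y_all = load_breast_cancer(return_X_y=True)

# Same 80/20 split as above, but stratified on the target.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# The share of class 1 should be nearly identical in both subsets.
print("train share of class 1:", y_train_s.mean())
print("test share of class 1:", y_test_s.mean())
```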

## Performing classification

Consider the following algorithms: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis and SVM. For each of these algorithms, answer the following questions:
1. What kind of decision boundary is this algorithm able to learn?
2. Find and import the sklearn implementation of the algorithm.
3. Fit the model to your data.
4. Create a `y_pred` variable containing the result of the prediction of the model on the test set.

**Answer**

1. Decision boundaries are linear for Logistic Regression and Linear Discriminant Analysis, but quadratic for Quadratic Discriminant Analysis. SVM is trickier: it learns a linear decision boundary, but it can solve non-linear problems by first applying a non-linear function (a kernel) to the data so that the classes become linearly separable in the transformed space.
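
To illustrate the last point, here is a toy sketch on made-up one-dimensional data (not the breast cancer dataset): no single threshold on x separates the classes, while a threshold on the added feature x² does. This is the idea behind non-linear kernels; an actual SVM applies such a mapping implicitly. The helper `threshold_separates` is hypothetical and written only for this example.

```python
# Toy 1-D data: class 1 sits on both sides of class 0.
points = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def threshold_separates(values, targets):
    """Return True if some threshold on `values` splits the two classes perfectly."""
    ordered = [t for _, t in sorted(zip(values, targets))]
    # A single threshold works only if the sorted labels change class at most once.
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1


# In the original space, the labels change class twice along the axis.
print("separable on x:  ", threshold_separates(points, labels))

# After the non-linear map x -> x**2, a single threshold is enough.
squared = [x ** 2 for x in points]
print("separable on x^2:", threshold_separates(squared, labels))
```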

In [None]:
#2.

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import svm

logisticRegr = LogisticRegression(random_state=42)
lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()
support_vec = svm.SVC(kernel='linear',random_state=42)

#3.

logisticRegr.fit(X_train, y_train)
lda.fit(X_train, y_train)
qda.fit(X_train, y_train)
support_vec.fit(X_train, y_train)

#4.

y_pred_log = logisticRegr.predict(X_test)
y_pred_lda = lda.predict(X_test)
y_pred_qda = qda.predict(X_test)
y_pred_svm = support_vec.predict(X_test)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
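
The warning above indicates that the logistic regression solver stopped before converging, and its message suggests either raising `max_iter` or scaling the data. Below is a minimal sketch of the scaling option, assuming a `StandardScaler` placed in front of the classifier in a pipeline. It is an alternative to the cell above rather than what was run there, and `scaled_log_reg` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Standardise each feature (zero mean, unit variance) before fitting.
# The scaler is fitted on the training data only, then reused on the test data.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_log_reg.fit(X_tr, y_tr)

print("test accuracy:", scaled_log_reg.score(X_te, y_te))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.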


## Evaluating the model

Answer the following questions for each of the models created above:

1. Compute the accuracy of the model.
2. Display the confusion matrix.
3. Compute the precision, recall and f1-score.
4. Comment on your results and compare with the other algorithms.


In [None]:
#1.

from sklearn.metrics import accuracy_score

accuracy_log = accuracy_score(y_test, y_pred_log)
print("Accuracy Score log :", accuracy_log)

accuracy_lda = accuracy_score(y_test, y_pred_lda)
print("Accuracy Score lda :", accuracy_lda)

accuracy_qda = accuracy_score(y_test, y_pred_qda)
print("Accuracy Score qda :", accuracy_qda)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("Accuracy Score svm :", accuracy_svm)

Accuracy Score log : 0.9649122807017544
Accuracy Score lda : 0.956140350877193
Accuracy Score qda : 0.956140350877193
Accuracy Score svm : 0.956140350877193


In [None]:
#2.

from sklearn.metrics import confusion_matrix

def conf_mat(target_pred, target_test):
  # Rows are the true classes, columns the predicted classes (0 = ill, 1 = healthy)
  classification_confusion_matrix = confusion_matrix(target_test, target_pred)
  label_names = pd.Series(['ill', 'healthy'])

  print(f"Accuracy:{accuracy_score(target_test, target_pred)}")
  # Divide each row by its total so that every row sums to 1 (fractions of each true class)
  return pd.DataFrame(classification_confusion_matrix,
                columns='predicted ' + label_names,
                index='is ' + label_names)\
      .div(classification_confusion_matrix.sum(axis=1), axis=0)

In [None]:
conf_mat(y_pred_log,y_test)

Accuracy:0.9649122807017544


Unnamed: 0,predicted ill,predicted healthy
is ill,0.930233,0.069767
is healthy,0.014085,0.985915


In [None]:
conf_mat(y_pred_lda,y_test)

Accuracy:0.956140350877193


Unnamed: 0,predicted ill,predicted healthy
is ill,0.906977,0.093023
is healthy,0.014085,0.985915


In [None]:
conf_mat(y_pred_qda,y_test)

Accuracy:0.956140350877193


Unnamed: 0,predicted ill,predicted healthy
is ill,0.953488,0.046512
is healthy,0.042254,0.957746


In [None]:
conf_mat(y_pred_svm,y_test)

Accuracy:0.956140350877193


Unnamed: 0,predicted ill,predicted healthy
is ill,0.906977,0.093023
is healthy,0.014085,0.985915


In [None]:
#3.

print("Precision log:",metrics.precision_score(y_test, y_pred_log))
print("Precision lda:",metrics.precision_score(y_test, y_pred_lda))
print("Precision qda:",metrics.precision_score(y_test, y_pred_qda))
print("Precision svm:",metrics.precision_score(y_test, y_pred_svm))

print("\nRecall log :",metrics.recall_score(y_test, y_pred_log))
print("Recall lda :",metrics.recall_score(y_test, y_pred_lda))
print("Recall qda :",metrics.recall_score(y_test, y_pred_qda))
print("Recall svm :",metrics.recall_score(y_test, y_pred_svm))

print("\nf1 log :",round(metrics.f1_score(y_test, y_pred_log), 4))
print("f1 lda :",round(metrics.f1_score(y_test, y_pred_lda), 4))
print("f1 qda :",round(metrics.f1_score(y_test, y_pred_qda), 4))
print("f1 svm :",round(metrics.f1_score(y_test, y_pred_svm), 4))

Precision log: 0.958904109589041
Precision lda: 0.9459459459459459
Precision qda: 0.9714285714285714
Precision svm: 0.9459459459459459

Recall log : 0.9859154929577465
Recall lda : 0.9859154929577465
Recall qda : 0.9577464788732394
Recall svm : 0.9859154929577465

f1 log : 0.9722
f1 lda : 0.9655
f1 qda : 0.9645
f1 svm : 0.9655
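
As a cross-check, the scores can be recomputed by hand from confusion-matrix counts. The sketch below uses the logistic regression counts implied by the normalised matrix above, assuming the test set holds 43 ill and 71 healthy patients (inferred from the fractions 3/43 ≈ 0.0698 and 1/71 ≈ 0.0141, not printed in the notebook). As in the `metrics` calls, which use class 1 as the positive class by default, "healthy" is treated as the positive class.

```python
# Counts for logistic regression, with "healthy" (class 1) as the positive class.
true_positives = 70   # healthy predicted healthy
false_negatives = 1   # healthy predicted ill
false_positives = 3   # ill predicted healthy
true_negatives = 40   # ill predicted ill

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
# The F1-score is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / (
    true_positives + true_negatives + false_positives + false_negatives
)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The precision, recall and accuracy obtained this way agree with the logistic regression values printed above to four decimals; the F1-score lies between precision and recall, as expected for a harmonic mean.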


**Answer**

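4. The four tested algorithms perform well on this dataset (above 95% accuracy and above 95% F1-score). Logistic regression is marginally ahead in accuracy (96.5% against 95.6% for the three others), a difference of a single test sample. QDA detects the largest share of ill patients (95.3%), at the cost of misclassifying more healthy ones. Since the models with linear decision boundaries perform this well, one can deduce that the classes are almost linearly separable; otherwise those models would have scored noticeably lower.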
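
Since these scores come from a single 80/20 split, and the accuracies differ by a single test sample, a cross-validated comparison is one way to check how stable the ranking is. The sketch below is an illustration under stated assumptions, not part of the original analysis: it standardises the features inside each pipeline and uses 5-fold cross-validation via `cross_val_score`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

models = {
    "log": LogisticRegression(random_state=42),
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
    "svm": SVC(kernel="linear", random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_all, y_all, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gaps between the models, the ranking observed on the single split should not be over-interpreted.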