### PPHA 30545

This notebook is designed to let you practice with LDA, QDA, and KNN classifiers. You will also learn how to use the validation set approach to estimate the test error.

We will be using the [Wisconsin Breast Cancer](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) dataset. The dataset contains a set of features that describe characteristics of the cell nuclei present in the image, along with whether the patient was diagnosed with a malignant (M) or benign (B) tumor.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# We are going to read in the data and print out the number of dimensions and the first few rows
data = pd.read_csv("data.csv")
print(data.shape)
data.head()

(569, 33)


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [3]:
# For this exercise, we are going to limit the set of predictors to the first 10 characteristics.
# The slice below keeps the diagnosis column plus those 10 feature columns (11 columns in total)
data = data.iloc[:,1:12]
print(data.shape)

(569, 11)


In [4]:
# We'll split our dataframe into the predictors (X) and the label (y)
X = data.drop(['diagnosis'], axis = 1)
y = data['diagnosis']

In [5]:
# We'll convert our label to binary values: 1 for malignant (M) and 0 for benign (B)
# The code below first converts the label to the category data type and then uses cat.codes for encoding
y = data['diagnosis'].astype('category').cat.codes
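
To see why `B` ends up as 0 and `M` as 1, it can help to inspect the category order on a small example. The sketch below uses a hypothetical toy Series (`toy`), not the notebook's data. It relies on pandas sorting string categories lexically when converting with `astype('category')`.

```python
import pandas as pd

# Hypothetical toy labels standing in for the diagnosis column
toy = pd.Series(["M", "B", "B", "M"])

# Converting to category sorts the unique labels: ['B', 'M']
as_category = toy.astype("category")

# Map each integer code to the label it stands for
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(mapping)  # {0: 'B', 1: 'M'}
print(codes)    # [1, 0, 0, 1]
```

Running the same `dict(enumerate(...))` check on the real column is a quick way to confirm which class the code 1 refers to.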



In [6]:
# Train test split
from sklearn.model_selection import train_test_split

# We can specify the fraction of the data held out for testing using the test_size parameter
# random_state allows us to specify a seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
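
As a rough check on what `test_size=0.3` means here, the split sizes can be worked out by hand. This sketch assumes that scikit-learn rounds a fractional test size up to a whole number of rows and gives the remainder to the training set. The accuracy values reported later in this notebook are consistent with a test set of 171 rows.

```python
import math

n_samples = 569   # number of rows reported by data.shape
test_size = 0.3   # fraction passed to train_test_split

# Assumed rounding rule: the test set size is rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 398 171
```

Printing `X_train.shape` and `X_test.shape` after the split confirms the actual sizes.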

### Linear Discriminant Analysis

In [7]:
# We'll import the LinearDiscriminantAnalysis class from the scikit-learn package
# and build our classifier using the default parameters.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda_model = LinearDiscriminantAnalysis()

In [8]:
# Let's train our model using the fit method
lda_model.fit(X_train, y_train)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [9]:
# Use the predict method to get the predicted labels for the test set
y_pred = lda_model.predict(X_test)

In [10]:
# Calculate the fraction of correctly classified labels in the test set
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.935672514619883
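
The accuracy reported above is the fraction of test labels that match the predictions. A minimal sketch of the same calculation with NumPy, using hypothetical toy arrays (`y_true_toy`, `y_pred_toy`) rather than the notebook's data:

```python
import numpy as np

# Hypothetical true and predicted labels for five observations
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0])

# Element-wise comparison gives True where the prediction is correct;
# the mean of that boolean array is the accuracy
accuracy = np.mean(y_true_toy == y_pred_toy)

# The misclassification (error) rate is the complement of the accuracy
error_rate = 1 - accuracy

print(accuracy)  # 0.8
```

The error rate used in the validation set approach later in this notebook is this complement, `1 - accuracy`.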

### Quadratic Discriminant Analysis

In [11]:
# Similarly, we'll use the QuadraticDiscriminantAnalysis class to build a QDA model on our data

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)

QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
                              store_covariance=False, tol=0.0001)

In [12]:
y_pred = qda_model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9415204678362573

In [13]:
# We can also find out the posterior probabilities for each predicted label
# The columns represent the probability for label 0 and 1 respectively
y_posterior = qda_model.predict_proba(X_test)
y_posterior[:10]

array([[9.94910774e-001, 5.08922559e-003],
       [1.26863665e-030, 1.00000000e+000],
       [1.79117816e-005, 9.99982088e-001],
       [9.93915798e-001, 6.08420228e-003],
       [9.99085112e-001, 9.14887683e-004],
       [1.70109130e-047, 1.00000000e+000],
       [4.38023167e-104, 1.00000000e+000],
       [3.83891905e-010, 1.00000000e+000],
       [5.31009287e-005, 9.99946899e-001],
       [9.99894409e-001, 1.05591180e-004]])
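
The predicted label for each observation is the class with the larger posterior probability. The sketch below copies the ten rows printed above into a NumPy array (`posterior` is a name chosen here for illustration) and recovers the labels by taking the column index of the maximum in each row.

```python
import numpy as np

# The ten posterior rows shown above: column 0 is P(label 0), column 1 is P(label 1)
posterior = np.array([
    [9.94910774e-01, 5.08922559e-03],
    [1.26863665e-30, 1.00000000e+00],
    [1.79117816e-05, 9.99982088e-01],
    [9.93915798e-01, 6.08420228e-03],
    [9.99085112e-01, 9.14887683e-04],
    [1.70109130e-47, 1.00000000e+00],
    [4.38023167e-104, 1.00000000e+00],
    [3.83891905e-10, 1.00000000e+00],
    [5.31009287e-05, 9.99946899e-01],
    [9.99894409e-01, 1.05591180e-04],
])

# The class with the highest posterior probability is the predicted label
predicted = posterior.argmax(axis=1).tolist()

# Each row is a probability distribution, so it should sum to 1
row_sums = posterior.sum(axis=1)

print(predicted)  # [0, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```

With two classes, this is the same as predicting label 1 whenever the second column exceeds 0.5.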

### K-Nearest Neighbors

In [14]:
# The KNeighborsClassifier class from scikit-learn will allow us to build a KNN model
# We're setting the number of neighbors to 5 below

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9005847953216374
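
To make the idea behind KNN concrete, here is a minimal from-scratch sketch of the prediction step: find the `k` training points closest to a new point and take a majority vote over their labels. The helper `knn_predict` and the toy arrays are hypothetical and only illustrate the idea. This is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three points near (0, 0) with label 0, three near (1, 1) with label 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3)
pred_near_ones = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]), k=3)

print(pred_near_origin, pred_near_ones)  # 0 1
```

Because the vote depends on raw distances, features measured on larger scales contribute more to which neighbors are selected.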

### Validation Set Approach

In [15]:
# We'll simply repeat the steps that we used to build the LDA classifier above with different
# seed values (random_state parameter) to get a different train/validation split
# in each iteration. We'll print out the validation set error (the fraction of
# misclassified labels) in each case

for i, seed in enumerate([16, 78, 244]):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    lda_model = LinearDiscriminantAnalysis()
    lda_model.fit(X_train, y_train)
    y_pred = lda_model.predict(X_val)
    validation_set_error = 1 - accuracy_score(y_val, y_pred)
    
    print("Validation set error with set", i+1 ," is: " , validation_set_error)
    

Validation set error with set 1  is:  0.04678362573099415
Validation set error with set 2  is:  0.06432748538011701
Validation set error with set 3  is:  0.06432748538011701
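
The three estimates differ because each seed produces a different split. One simple way to summarize them is to look at their mean and spread. The sketch below reuses the three printed values. The validation set size `n_val = 171` is an assumption, derived from 569 rows with `test_size=0.3` rounded up, and is used only to convert each error rate back into a count of misclassified observations.

```python
import statistics

# Validation set errors printed above for the three seeds
errors = [0.04678362573099415, 0.06432748538011701, 0.06432748538011701]

# Assumed validation set size (30% of 569 rows, rounded up)
n_val = 171

# Average error across the three splits and its sample standard deviation
mean_error = statistics.mean(errors)
spread = statistics.stdev(errors)

# Convert each error rate back to a count of misclassified observations
misclassified = [round(e * n_val) for e in errors]

print(misclassified)  # [8, 11, 11]
print(round(mean_error, 4), round(spread, 4))
```

A difference of only three misclassified observations moves the estimate from about 4.7% to about 6.4%, which illustrates how much a single validation split can vary.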
