In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Part 2

In [3]:
#read dataframe
df = pd.read_csv('illness.csv')

In [4]:
#inspect head of dataframe
df.head()

Unnamed: 0,plas_gl,bp,result,skin_th,preg,insulin,bmi,ped,age
0,122,64,positive,32,1,156,35.1,0.692,30
1,80,74,negative,11,1,60,30.0,0.527,22
2,100,70,negative,26,0,50,30.8,0.597,21
3,119,64,negative,18,0,92,34.9,0.725,23
4,162,76,positive,56,0,100,53.2,0.759,25


# Logistic Regression

In [5]:
#without scaling
X = df.drop('result',1).values
y = df['result'].values

#split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=40)

#Instantiate logistic regression classifier
logreg = LogisticRegression()

#fit the classifier on the data
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

#print out results
print(classification_report(y_test,y_pred))
print(logreg.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.83      0.91      0.87        80
   positive       0.72      0.55      0.62        33

avg / total       0.80      0.81      0.80       113

0.8053097345132744


Let's firstly talk about the confusion matrix. Precision is defined as the proportion of positive identifications that were actually identified as correct. Recall is define as the proportion of actual positives that are identified correctly. F1 score takes into account both recall and precision and gives us a measure of the preciseness and robustness of the model. Classification reports are often used to analyse a models performance although I will not specifically be mentioning it in this assignment I will still include it in the code.

As we can see from above the accuracy is 80.53%. Let's see if we can improve this accuracy. One method we could try regularisation. Regularisation would be a good technique to use here as the range of the data differs greatly. In the code below we will find an optimal value of C and which normalisation technique we should use. 

In [6]:
#import GridSearchCV
from sklearn.model_selection import GridSearchCV

#Instaniate the logistic regression classifier
logreg = LogisticRegression()

#defining parameter grid 
penalty = ['l1','l2']
C = np.logspace(-4,4,20)
hyperparameters = dict(C=C, penalty=penalty)

clf = GridSearchCV(logreg,hyperparameters,cv=5)

#fit grid search to data
best_model = clf.fit(X,y)

print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

Best Penalty: l1
Best C: 1.623776739188721


As we can see the optimal penalty is 'l1' and the optimal value for C = 1.62 . Let's update the parameters to their optimals and see how this effects the accuracy. Remember this data will need to be scaled to avoid bias.

In [7]:
#Import Standard Scaler
from sklearn.preprocessing import StandardScaler

#Instantiate the logistic regression classifier
logreg = LogisticRegression(penalty = 'l1', C = 1.62)

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=40)

#scale the data
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)

#Fit it to the training data
logreg.fit(X_train, y_train)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(logreg.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.83      0.91      0.87        80
   positive       0.72      0.55      0.62        33

avg / total       0.80      0.81      0.80       113

0.8407079646017699


After both tuning and using the optimal parameters the accuracy has increased from 80.5% to 84.07%. Let's take a look as a precaution to see if the data is overfitted or underfitted.

In [8]:
print(logreg.score(X_test,y_test))
print(logreg.score(X_train,y_train))

0.8407079646017699
0.7756653992395437


As we can see the accuracy on the test data and the training data is similar and therefore the data is neither underfitted nor overfitted. This is no surprise as l1 regularisation in itself is a solution to overfitting. Let's finally calculate a 95% confidence interval for the accuracy in order to get a better idea of the variance of this accuracy.

In [27]:
#calculate mean of cross validation accuracies
p = np.mean(cross_val_score(logreg,X,y,cv=20))

#calculate upper and lower bound for confidence interval
ci_upper = p + np.sqrt((p*(1-p))/20)
ci_lower = p - np.sqrt((p*(1-p))/20)

#append confidence interval to list and print it
confidence_interval = [ci_lower,ci_upper]
confidence_interval

[0.6947295208346654, 0.8780258971219909]

We can say that we are 95% confident that the true acuracy lies between 69.47% and 87.80% . 

# KNN Classifier

Let's firstly test a KNN Classifier on the data without doing any scaling of the data or any optimisation of the classifier.

In [38]:
#Import Classifier
from sklearn.neighbors import KNeighborsClassifier

#Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Instantiate the classifier and fit it on the training data
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(knn.score(X_test,y_test))

[[65 13]
 [17 18]]
             precision    recall  f1-score   support

   negative       0.79      0.83      0.81        78
   positive       0.58      0.51      0.55        35

avg / total       0.73      0.73      0.73       113

0.7345132743362832


As we can see without tuning any parameters or scaling the data we get an accuracy of 72.56%. In the above example n_neighbors = 5 which is set by default. Let's now find an optimal value for n_neighbors and see if this improves the accuracy.

In [19]:
#Instantiate the KNeighborsClassifier classifier
knn = KNeighborsClassifier()

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#defining parameter grid
metrics = ['euclidean','minkowski','manhattan']
neighbors = np.arange(1,16)
leaf_size = np.arange(10,100,10)
param_grid = dict(metric=metrics, n_neighbors=neighbors, leaf_size=leaf_size)


#Completing grid search on parameters given
grid_search = GridSearchCV(knn,param_grid, cv=10, scoring='accuracy')
grid_search.fit(X_train,y_train)

#Print optimal values
print('Best Metric:', grid_search.best_estimator_.get_params()['metric'])
print('Best n_neighbors:', grid_search.best_estimator_.get_params()['n_neighbors'])
print('Best Leaf Size:', grid_search.best_estimator_.get_params()['leaf_size'])

Best Metric: euclidean
Best n_neighbors: 7
Best Leaf Size: 10


As we can see the best metric is 'euclidean' , the best neighbors = 15 and the best leaf size is 10. Let's now use these in our model to see if we can improve accuracy.

In [20]:
#Instantiate the KNeighborsClassifier classifier
knn = KNeighborsClassifier(metric = 'euclidean', n_neighbors = 7,leaf_size=10)

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Fit classifier on training data
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(knn.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.80      0.87      0.83        78
   positive       0.64      0.51      0.57        35

avg / total       0.75      0.76      0.75       113

0.7610619469026548


As we can see we get an accuracy of 76.11% which is an improvement on the original acuracy we got for the knn model. Let's now check if scaling the data will improve accuracy. Lets check which scaler gives us the best score starting with StandardScaler(Standard Scaler removes the mean and scales the data to unit variance).

In [21]:
#Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier(metric='euclidean', n_neighbors = 7, leaf_size=10)

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Scale the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Fit on training data
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(knn.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.79      0.87      0.83        78
   positive       0.63      0.49      0.55        35

avg / total       0.74      0.75      0.74       113

0.7522123893805309


As we can see after scaling and with optimal parameters we got an accuracy score of 75.2% which is a worse accuracy score without scaling so let's now check how RobustScaler affects the accuracy(Robust Scalers centering and scaling statistics are based on percentiles and are therefore not influenced by a few number of very large marginal outliers).

In [22]:
#Import RobustScaler
from sklearn.preprocessing import RobustScaler

#Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier(metric='euclidean', n_neighbors = 7, leaf_size=10)

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Scale the data
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Fit on training data
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(knn.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.78      0.85      0.81        78
   positive       0.57      0.46      0.51        35

avg / total       0.71      0.73      0.72       113

0.7256637168141593


As we can see the accuracy for RobustScaler is 72.57% which is worse than the accuracy from StandardScaler, let's finally check how MinMax Scaler effects accuracy(MinMax Scaler rescales the data set such that all feature values are in the range [0,1]).

In [23]:
#Import MinMax Scaler
from sklearn.preprocessing import MinMaxScaler

#Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier(metric='euclidean', n_neighbors = 7, leaf_size=10)

#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Scale the data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Fit on training data
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

#compute and print accuracy
print(classification_report(y_test,y_pred))
print(knn.score(X_test,y_test))

             precision    recall  f1-score   support

   negative       0.79      0.90      0.84        78
   positive       0.67      0.46      0.54        35

avg / total       0.75      0.76      0.75       113

0.7610619469026548


As we can see when we scale our data using MinMaxScaler we achieve an acurracy of 76.11% which is the highest accuracy we have received so far from the various scaling algorithms we used. This accuracy is also the same as the accuracy we got before we scaled the data. This is because scaling the data does not always improve the accuracy and in some cases it can even decrease the accuracy. Let's explore this more by computing a 95% confidence interval.

In [26]:
p = np.mean(cross_val_score(knn,X,y,cv=20))
ci_upper = p + np.sqrt((p*(1-p))/20)
ci_lower = p - np.sqrt((p*(1-p))/20)
confidence_interval = [ci_lower,ci_upper]
confidence_interval

[0.6169917664178473, 0.8183023512292114]

From the confidence interval, we can say that we are 95% confident that the true accuracy lies between 61.70% and 81.83% .  