#### Jérémy TREMBLAY

# TP2: KNN and classification

In [40]:
# Import the libraries that will be used in this notebook.
import pandas as pd
import numpy as np
import random
import math

# Import the pyplot module from matplotlib with the plt alias.
import matplotlib.pyplot as plt

# Import the sklearn modules.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

## Task 1: Dataset analysis

**Instructions:** Using the dataset's functions, answer the following questions:
* How many classes are present in the dataset?
* How many descriptive features do these classes have, and of which types?
* How many examples are in the dataset? And per class?

In [41]:
# Specify the relative path of the diabetes file.
file_path = 'datasets/diabetes.csv'

# Load the database into a DataFrame.
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame with head.
print(df.head())

   Glucose  BloodPressure  Insulin   BMI  DiabetesPedigreeFunction  Age  \
0      148             72        0  33.6                     0.627   50   
1       85             66        0  26.6                     0.351   31   
2      183             64        0  23.3                     0.672   32   
3       89             66       94  28.1                     0.167   21   
4      137             40      168  43.1                     2.288   33   

   Outcome  
0        1  
1        0  
2        1  
3        0  
4        1  


In [42]:
print(df.isnull().any())

Glucose                     False
BloodPressure               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool


The dataset contains no null values, so we can explore it directly and look for some information.

In [43]:
# Know the dimensions of the dataframe.
df.shape

(767, 7)

There are 767 rows and 7 columns. Let's look at the content in more detail with some statistics.

In [44]:
# Display useful information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Glucose                   767 non-null    int64  
 1   BloodPressure             767 non-null    int64  
 2   Insulin                   767 non-null    int64  
 3   BMI                       767 non-null    float64
 4   DiabetesPedigreeFunction  767 non-null    float64
 5   Age                       767 non-null    int64  
 6   Outcome                   767 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 42.1 KB


In [45]:
df.describe()

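|       | Glucose   | BloodPressure | Insulin    | BMI       | DiabetesPedigreeFunction | Age       | Outcome  |
|-------|-----------|---------------|------------|-----------|--------------------------|-----------|----------|
| count | 767.0     | 767.0         | 767.0      | 767.0     | 767.0                    | 767.0     | 767.0    |
| mean  | 120.9309  | 69.104302     | 79.90352   | 31.994654 | 0.472081                 | 33.254237 | 0.349413 |
| std   | 31.977581 | 19.36841      | 115.283105 | 7.889095  | 0.331496                 | 11.762079 | 0.477096 |
| min   | 0.0       | 0.0           | 0.0        | 0.0       | 0.078                    | 21.0      | 0.0      |
| 25%   | 99.0      | 62.0          | 0.0        | 27.3      | 0.2435                   | 24.0      | 0.0      |
| 50%   | 117.0     | 72.0          | 32.0       | 32.0      | 0.374                    | 29.0      | 0.0      |
| 75%   | 140.5     | 80.0          | 127.5      | 36.6      | 0.6265                   | 41.0      | 1.0      |
| max   | 199.0     | 122.0         | 846.0      | 67.1      | 2.42                     | 81.0      | 1.0      |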
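
The `min` row shows a value of 0 for `Glucose`, `BloodPressure`, `Insulin` and `BMI`. A glucose level or a BMI of 0 is not physiologically plausible, so these zeros may stand for missing measurements that `isnull()` cannot detect. Here is a minimal sketch of how we could count them, using a small hypothetical sample (the values and the `suspect_columns` list are assumptions made for illustration, not part of the notebook):

```python
import pandas as pd

# Hypothetical mini-sample shaped like the dataset (illustrative values only).
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 66],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a value of 0 is assumed to be a placeholder for "not measured".
suspect_columns = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# isnull() reports nothing here, because a 0 is a valid number for pandas.
null_count = int(sample.isnull().sum().sum())

# Count the zeros per column instead.
zero_counts = (sample[suspect_columns] == 0).sum()

print("Null values:", null_count)
print(zero_counts)
```

Running the same count on `df` would tell us how many rows are affected before deciding whether to keep them as they are.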


In [46]:
df.Outcome.value_counts()

0    499
1    268
Name: Outcome, dtype: int64

This dataset contains two classes:
- The diabetics (`Outcome` is set to 1)
- The non-diabetics (`Outcome` is set to 0)

The data shows that there are more non-diabetic people than diabetic people (fortunately).

There are 6 descriptive features in this dataset (plus the outcome):
- The glucose level (`Glucose`), stored as an integer.
- The blood pressure (`BloodPressure`), stored as an integer.
- The insulin level (`Insulin`), stored as an integer.
- The BMI (Body Mass Index) of the person (`BMI`), stored as a float.
- The diabetes pedigree function of the person (`DiabetesPedigreeFunction`), a function which scores the likelihood of diabetes based on family history, stored as a float.
- The age of the person (`Age`), stored as an integer.

There are 767 examples in total: 499 non-diabetic people and 268 diabetic people.
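
The same answers can be obtained programmatically instead of being read from the outputs. This is only a sketch: it uses three columns of the five rows displayed by `head()`, and the variable names (`sample`, `features`, `class_share`) are chosen for this illustration.

```python
import pandas as pd

# Three columns of the five rows displayed by head() earlier in the notebook.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Everything except the label column is a descriptive feature.
features = sample.drop(columns="Outcome")

n_classes = sample["Outcome"].nunique()        # number of distinct classes
n_features = features.shape[1]                 # number of descriptive columns
dtype_counts = features.dtypes.value_counts()  # number of columns per dtype
class_share = sample["Outcome"].value_counts(normalize=True)  # proportion per class

print("Classes:", n_classes, "- features:", n_features)
print(dtype_counts)
print(class_share)
```

Applied to `df`, `value_counts(normalize=True)` would give the class proportions directly (499/767 and 268/767), which is easier to compare than the raw counts.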

## Task 2: Prepare train and test dataset

**Instructions:** Prepare the data so that the `Outcome` classes can be predicted. Then, using the `scikit-learn` library, split the dataset into 2/3 for training and 1/3 for testing.

First, let's get our Y (labels) and X (descriptors) fields:

In [47]:
X = df[df.columns.difference(["Outcome"])]
print(X)
Y = df["Outcome"].values
print(Y)

     Age   BMI  BloodPressure  DiabetesPedigreeFunction  Glucose  Insulin
0     50  33.6             72                     0.627      148        0
1     31  26.6             66                     0.351       85        0
2     32  23.3             64                     0.672      183        0
3     21  28.1             66                     0.167       89       94
4     33  43.1             40                     2.288      137      168
..   ...   ...            ...                       ...      ...      ...
762   33  22.5             62                     0.142       89        0
763   63  32.9             76                     0.171      101      180
764   27  36.8             70                     0.340      122        0
765   30  26.2             72                     0.245      121      112
766   47  30.1             60                     0.349      126        0

[767 rows x 6 columns]
[1 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0
 1 1 1 0 0 0 

## Task 3: Train and see the predictions of our model

Now let's split our data between a train and a test dataset:

In [52]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42) # 1/3 for the test.
print("Train: ", len(X_train), ", ", len(Y_train))
print("Test: ", len(X_test), ", ", len(Y_test))

Train:  513 ,  513
Test:  254 ,  254


We now have roughly a 2/3 and 1/3 split between the train and test data (513 and 254 examples), which is a good separation between the two steps. Let's create and train our model.
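
Since the two classes are unbalanced, one option that this notebook does not use is the `stratify` parameter of `train_test_split`, which keeps the class proportions identical in both subsets. Here is a small sketch on synthetic stand-in data (random features, only the class counts 499 and 268 come from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: random features, same class counts as the dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(767, 6))
y_demo = np.array([0] * 499 + [1] * 268)

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

print("Sizes:", len(X_tr), len(X_te))
print("Share of class 1 in train and test:", y_tr.mean(), y_te.mean())
```

Without `stratify`, the proportions depend on the random draw, so they can differ more between the two subsets.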

In [62]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, Y_train) # Train step.

train_prediction = model.predict(X_train) # Predict on the train data.
score = model.score(X_train, Y_train)
print("Accuracy of the train dataset: ", score)

Accuracy of the train dataset:  1.0


Unsurprisingly, when the model is evaluated on the data it was trained with, it always finds the right answer: with `n_neighbors=1`, the nearest neighbour of each training point is the point itself, so the model simply returns the label it has stored. A perfect training score like this is a sign of overfitting, and the model will be less accurate on data it was not trained with.
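
To illustrate this memorisation effect, here is a minimal sketch on synthetic data where the labels are purely random, so there is nothing to learn (the data and the names `memoriser`, `X_new`, `y_new` are assumptions made for this example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with purely random labels: there is no pattern to learn.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = rng.integers(0, 2, size=200)

# With n_neighbors=1, the nearest neighbour of a training point is itself
# (distance 0), so the model returns the label it has memorised.
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
train_accuracy = memoriser.score(X_demo, y_demo)

# Fresh points drawn the same way show that nothing general was learned.
X_new = rng.normal(size=(200, 6))
y_new = rng.integers(0, 2, size=200)
new_accuracy = memoriser.score(X_new, y_new)

print("Train accuracy:", train_accuracy)
print("Accuracy on new points:", new_accuracy)
```

The training accuracy is perfect even though the labels are noise, which is why the training score of a 1-nearest-neighbour model tells us nothing about its real quality.
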
Let's now try with unseen data: the test data.

In [63]:
test_prediction = model.predict(X_test) # Test step.
score = model.score(X_test, Y_test)

print("Prediction: ", test_prediction)
print("Real value: ", Y_test)
print("Accuracy of the test dataset: ", score)

Prediction:  [1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 1 1 1 0 1
 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0
 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 0 0
 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0
 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0
 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 1 0
 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
Real value:  [1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 0
 1 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0
 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0
 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 1
 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0
 0 1 1 1

Our model correctly predicts about 69% of the test results. We could compare each mistake individually to understand why the model chose the wrong answer, but this is a good start for a model whose parameter has not been tuned yet.
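
To look at the mistakes individually, we can locate the positions where the prediction and the real value differ. This sketch uses only the first ten values of each array printed above, and the names `predicted`, `actual` and `wrong_idx` are chosen for this example:

```python
import numpy as np

# First ten predictions and real values copied from the output above.
predicted = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
actual = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# Positions where the model disagrees with the real value.
wrong_idx = np.flatnonzero(predicted != actual)

print("Misclassified positions:", wrong_idx)
print("Predicted:", predicted[wrong_idx], "- real:", actual[wrong_idx])
```

With the full arrays, `X_test.iloc[wrong_idx]` would then show the feature values of the misclassified people.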

## Task 4: Display confusion matrix

**Instructions:** Analyse the confusion matrix produced by the model's predictions on the test set. What can you observe?

In [64]:
# Create the confusion matrix for the train data.
confusion_matrix(Y_train, train_prediction)

array([[334,   0],
       [  0, 179]])

In [72]:
# Do the same thing for the test data.
confusion_matrix(Y_test, test_prediction)

array([[128,  37],
       [ 42,  47]])

The model predicts all the cases of the train data correctly: it finds the 334 true negatives and the 179 true positives, with no false positives and no false negatives.

For the test data, taking the diabetics (`Outcome` = 1) as the positive class, it has found 128 true negatives, 47 true positives, 37 false positives and 42 false negatives.
Most predictions fall on the diagonal (true negatives and true positives), which is good. The model has also produced 37 false positives, but this is the less harmful mistake, because it means that some people were recognised as diabetics although they are not. The 42 false negatives are much more harmful, because these people were not detected as diabetics: they will not receive treatment and may risk death. Here the model misses 42 of the 89 diabetics in the test set.
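
To put numbers on this analysis, we can derive the usual metrics from the test confusion matrix shown above. For binary labels 0 and 1, scikit-learn puts the real classes in rows and the predicted classes in columns, so the flattened matrix reads TN, FP, FN, TP. The names `cm`, `recall` and `precision` are chosen for this sketch:

```python
import numpy as np

# Test confusion matrix shown above: rows = real class, columns = predicted class.
cm = np.array([[128, 37],
               [42, 47]])

# Flattened order for binary labels 0/1: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
recall = tp / (tp + fn)          # share of the diabetics that were detected
precision = tp / (tp + fp)       # share of "diabetic" predictions that are right

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# prints: accuracy=0.689 recall=0.528 precision=0.560
```

The recall of about 0.53 confirms that roughly half of the diabetics in the test set are missed, even though the overall accuracy looks acceptable.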

## Task 5: Train and test different models with parameters

**Instructions:** Try several values of the `n_neighbors` parameter and select the one that seems to give the best-performing model, for example with `n_neighbors` in [1, 20].

In [73]:
# Loop over the candidate values to find the best parameter for our model.
for i in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, Y_train) # Train step.
    train_score = model.score(X_train, Y_train) # Accuracy on the train data.
    test_score = model.score(X_test, Y_test) # Accuracy on the test data.
    print("For n_neighbors=", i, ", we have these accuracy, for train: ", train_score, ", for test: ", test_score)

For n_neighbors= 1 , we have these accuracy, for train:  1.0 , for test:  0.6889763779527559
For n_neighbors= 2 , we have these accuracy, for train:  0.8460038986354775 , for test:  0.7086614173228346
For n_neighbors= 3 , we have these accuracy, for train:  0.8460038986354775 , for test:  0.7086614173228346
For n_neighbors= 4 , we have these accuracy, for train:  0.8206627680311891 , for test:  0.7362204724409449
For n_neighbors= 5 , we have these accuracy, for train:  0.8167641325536062 , for test:  0.7401574803149606
For n_neighbors= 6 , we have these accuracy, for train:  0.797270955165692 , for test:  0.7480314960629921
For n_neighbors= 7 , we have these accuracy, for train:  0.7914230019493177 , for test:  0.7362204724409449
For n_neighbors= 8 , we have these accuracy, for train:  0.7953216374269005 , for test:  0.7283464566929134
For n_neighbors= 9 , we have these accuracy, for train:  0.7875243664717348 , for test:  0.7362204724409449
For n_neighbors= 10 , we have these accuracy

The best accuracy on the test dataset is around 0.75, and it is reached when `n_neighbors` = 11, so the best parameter here seems to be 11.
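
Rather than reading the best value from the printed lines, the scores can be stored and the maximum picked automatically. The sketch below is a variant of the loop above, not the method used in this notebook: it runs on synthetic data from `make_classification` and scores each value with 5-fold cross-validation (`cross_val_score`), so that the test set is not used to choose the parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

k_values = list(range(1, 21))
mean_scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds, computed on the train data only.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores.append(scores.mean())

# np.argmax returns the first maximum, so ties go to the smallest k.
best_k = k_values[int(np.argmax(mean_scores))]
print("Best n_neighbors:", best_k, "with mean accuracy:", max(mean_scores))
```

The test set can then be kept for one final evaluation of the selected model.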