# KNeighbors classification

In this notebook, we build a [KNeighbors](https://fr.wikipedia.org/wiki/M%C3%A9thode_des_k_plus_proches_voisins) model to recognize pneumonia in chest X-ray images.

This algorithm is supervised, which means we use labels: "NORMAL" and "PNEUMONIA". It computes the distance between data points in a vector space and assigns each new point the majority label among its k nearest neighbors.  
Our dataset is composed of 5856 X-ray images split into "train", "test" and "validation" subsets.
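
To make the distance-based idea concrete, here is a minimal sketch of the prediction step of a nearest-neighbors classifier, written in plain Python. It is only an illustration: `knn_predict` and the toy 2-D points are hypothetical stand-ins, and the notebook itself relies on sklearn's `KNeighborsClassifier` rather than on this code.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Most common label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points standing in for flattened images (made-up values)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["NORMAL", "NORMAL", "NORMAL", "PNEUMONIA", "PNEUMONIA", "PNEUMONIA"]

print(knn_predict(points, labels, (0.15, 0.15)))  # NORMAL
print(knn_predict(points, labels, (0.95, 1.0)))   # PNEUMONIA
```

With real images the points have one coordinate per pixel value instead of two, but the voting logic is the same.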

1.   Import libraries
2.   Mount our google drive
3.   Format our data
4.   Benchmark with our "k" parameter
5.   Build our model
6.   Confusion matrix
7.   Save model


![alt text](https://ykhoa.org/d/images/PI/55943_Pneumonia_anatomy_PI.jpg "Pneumonia")

 

## 1. Import libraries

Here we import libraries to manipulate data, images and plots.  
In particular, we need some [sklearn](https://scikit-learn.org/stable/) functions that implement the KNN algorithm, along with other useful tools such as metrics and a confusion matrix plot.

In [None]:
from google.colab import drive                        # mount google drive data
import numpy as np                                    # vector computation
import cv2 as cv                                      # image manipulation
import os                                             # handle filesystem
import matplotlib.pyplot as plt                       # plot library
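from sklearn.metrics import ConfusionMatrixDisplay    # plot conf matrix
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator  # plot_confusion_matrix was removed in scikit-learn 1.2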
import pandas as pd                                   # handle dataframes
from sklearn.model_selection import train_test_split  # data formatting
from sklearn.neighbors import KNeighborsClassifier    # model algorithm
from sklearn import metrics                           # model metrics
from sklearn.metrics import classification_report     # model analysis



## 2. Mount our google drive to access the dataset

We host our data on Google Drive to avoid re-uploading it to the Colab notebook at each session.  
We set variables to store the paths of the train, test and val parts of the dataset.  
* Train data is used to train our model.  
* Test data is used to test our model.  
* Val data is used to validate our model after training.

In [None]:
drive.mount('/content/drive', force_remount=True)
train_path = '/content/drive/MyDrive/chest_Xray/chest_Xray/train'
test_path = '/content/drive/MyDrive/chest_Xray/chest_Xray/test'
val_path = '/content/drive/MyDrive/chest_Xray/chest_Xray/val'

train_path_norm = f"{train_path}/NORMAL"
train_path_pneu = f"{train_path}/PNEUMONIA"

test_path_norm = f"{test_path}/NORMAL"
test_path_pneu = f"{test_path}/PNEUMONIA"

val_path_norm = f"{val_path}/NORMAL"
val_path_pneu = f"{val_path}/PNEUMONIA"

## 3.   Format our data
Our algorithm can't learn from raw image files: we need to convert them into numpy arrays for fast and efficient computation.

We also know from our data analysis notebook that the sizes of our images need to be standardized. Our `dim` variable sets the standard format to 300x300 pixels.
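
As a rough illustration of what this formatting produces, the sketch below flattens a stand-in image into a single row vector. It assumes a 3-channel image (the default when reading with `cv.imread`) and the `dim = (300, 300)` value set in the code cell; the all-zero array is a hypothetical placeholder for a real resized X-ray.

```python
import numpy as np

# Hypothetical stand-in for a resized colour X-ray: 300 x 300 pixels, 3 channels
dim = (300, 300)
image = np.zeros((dim[1], dim[0], 3), dtype=np.uint8)

# Flatten to a single row so each image becomes one feature vector
row = image.reshape(1, -1)
print(row.shape)  # (1, 270000)

# Stacking the rows of several images gives the 2-D matrix a scikit-learn model expects
batch = np.vstack([row, row, row])
print(batch.shape)  # (3, 270000)
```

Each image therefore contributes 300 * 300 * 3 = 270000 features under these assumptions.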

In [None]:
dir_base = "/content/drive/MyDrive/chest_Xray/chest_Xray"
dim = (300,300)   # target (width, height) in pixels for every image

def get_data(path: str, label: str, dtype: str, process_flipp: bool = False) -> pd.DataFrame:
  '''
  Get data from directory and return it as a pandas DataFrame with 2 columns "image"/"label"
  :param path: path of the directory where we fetch our data
  :param label: label of the data in the directory
  :param dtype: "test", "train" or "val"
  :param process_flipp: if True, also add a horizontally flipped copy of each image
  :return: pandas DataFrame containing image vector and label
  '''
  result = []
  gen = (item for item in os.listdir(path) if item.endswith('.jpeg'))
  for i in gen:
    img_path = f"{dir_base}/{dtype}/{label}/{i}"
    img = cv.imread(img_path)
    img = cv.resize(img, dim, interpolation = cv.INTER_AREA).reshape(1,-1)
    result.append((img, label))
    if process_flipp:
      result.append((get_flipped_image_arr(img_path), label))
  return pd.DataFrame(result, columns=['image', 'label'], index=None)


In [None]:
def get_flipped_image_arr(path: str) -> np.ndarray:
  '''
  Return a horizontally flipped, resized and flattened copy of an image
  :param path: path of the image
  :return: numpy array containing image pixels
  '''
  original = cv.imread(path) 
  img = cv.flip(original, 1)   # flip horizontally
  img = cv.resize(img, dim ,interpolation = cv.INTER_AREA).reshape(1,-1)
  return img

In [None]:
def plot_data(k_values, accuracy):
  '''
  Plot accuracy against the number of neighbors
    x: number of neighbors
    y: accuracy
  :param k_values:  values of k we benchmark
  :param accuracy:  accuracy per k value
  '''
  fig = plt.figure()
  fig.subplots_adjust(top=0.8)
  ax1 = fig.add_subplot()
  ax1.set_ylabel('Accuracy')
  ax1.set_xlabel('K val')
  plt.plot(k_values,accuracy,label='Accuracy for k params')
  plt.scatter(k_values,accuracy,c=k_values,alpha=1)
  plt.legend()
  plt.show()



In [None]:
def confusion_matrix(model, x, y):
  '''
  Plot confusion matrix
  :param model: fitted KNN instance
  :param x: test images as an array of flattened pixel vectors
  :param y: test labels
  '''
  disp = plot_confusion_matrix(model, x, y, cmap=plt.cm.Blues, normalize=None)
  plt.show()

In [None]:
# NORMAL images are also added as flipped copies, which doubles the size of that class
data_normal = get_data(train_path_norm, 'NORMAL',  'train', True)
data_pneu = get_data(train_path_pneu, 'PNEUMONIA', 'train', False)

In [None]:
data_normal.head()

In [None]:
data_pneu.head()

In [None]:
data = pd.concat([data_normal, data_pneu])
images = data.image
y = data.label

X = [image[0] for image in images.values]   # drop the leading axis of each (1, n) row
X = np.array(X)                             # 2-D numeric array: one flattened image per row

In [None]:
print(f"""
  X shape: {X.shape}
  y shape: {y.shape}
""")

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70, random_state=30)

In [None]:
print('X train shape:',X_train.shape)
print('Y train shape:',y_train.shape)

print('X test shape:',X_test.shape)
print('Y test shape:',y_test.shape)

In [None]:
print('Classes: ',np.unique(y_train))


## 4. Benchmark with our "k" parameter

In our run, the best accuracy is obtained with 10 neighbors.

We chose accuracy among the other [metrics available for classification](https://scikit-learn.org/stable/modules/model_evaluation.html).
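
Accuracy is simple to read, but on unbalanced classes it can hide how each class behaves, which is why the benchmark loop also prints a classification report. The sketch below shows the difference on made-up labels; the helper functions `accuracy` and `recall` are hypothetical re-implementations written for illustration, not sklearn's API.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)

# Made-up labels: 8 PNEUMONIA and 2 NORMAL, with a model that always answers PNEUMONIA
y_true = ["PNEUMONIA"] * 8 + ["NORMAL"] * 2
y_pred = ["PNEUMONIA"] * 10

print(accuracy(y_true, y_pred))             # 0.8
print(recall(y_true, y_pred, "PNEUMONIA"))  # 1.0
print(recall(y_true, y_pred, "NORMAL"))     # 0.0
```

Here the accuracy looks decent even though no NORMAL sample is ever recognized, so per-class recall is worth checking alongside it.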

In [None]:
k_values = [1, 3, 5, 10, 20, 50, 100]
accuracy_values = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train,y_train)
    predictions = model.predict(X_test)
    acc = metrics.accuracy_score(y_test, predictions)
    accuracy_values.append(acc)
    print('Accuracy for k={}:'.format(str(k)),acc)
    print('\n')
    print(classification_report(y_test, predictions))
    print('**************************************************')
    print('\n')
plot_data(k_values,accuracy_values)

## 5.   Build our model
We instantiate the classifier from the sklearn library with the best "k" parameter found in the benchmark above.

`knn_model.fit` takes our data in `X` and our labels in `y`.

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X,y)

In [None]:
# test data
test_data_norm = get_data(test_path_norm,'NORMAL','test',False)
test_data_pneu = get_data(test_path_pneu,'PNEUMONIA','test',False)

test_data_total = pd.concat([test_data_norm,test_data_pneu])

y_test_data = test_data_total.label

X_test_data = []
for i in test_data_total.image.values:
    X_test_data.append(i[0])
    
X_test_data = np.array(X_test_data)

## 6. Confusion matrix
We use a module from sklearn to generate a [confusion matrix](https://fr.wikipedia.org/wiki/Matrice_de_confusion).  
It is useful for measuring the performance of our model.
We notice a troubling number of false positives: the model classified a normal image as "pneumonia" 132 times. On the other hand, the model almost never failed to recognize a pneumonia image.
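
For readers unfamiliar with how the cells of a confusion matrix are obtained, here is a small sketch that counts them by hand. The labels are made up for illustration and are not the notebook's results; `confusion_counts` is a hypothetical helper, whereas the notebook uses sklearn's plotting function.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels for illustration only, not the notebook's results
y_true = ["NORMAL", "NORMAL", "NORMAL", "PNEUMONIA", "PNEUMONIA", "PNEUMONIA"]
y_pred = ["NORMAL", "PNEUMONIA", "PNEUMONIA", "PNEUMONIA", "PNEUMONIA", "NORMAL"]

counts = confusion_counts(y_true, y_pred)
false_positives = counts[("NORMAL", "PNEUMONIA")]  # healthy scans flagged as pneumonia
false_negatives = counts[("PNEUMONIA", "NORMAL")]  # pneumonia scans that were missed
print(false_positives, false_negatives)  # 2 1
```

In a screening context, false negatives are usually the costlier error, so a model that leans toward false positives may still be an acceptable trade-off.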

In [None]:
predictions_test = knn_model.predict(X_test_data)
acc_test = metrics.accuracy_score(y_test_data, predictions_test)
print('Accuracy for test',acc_test)
print(classification_report(y_test_data, predictions_test))
confusion_matrix(knn_model,X_test_data,y_test_data)

## 7. Save model

We save our model in [ONNX](https://fr.wikipedia.org/wiki/Open_Neural_Network_Exchange) format so that we can benchmark our KNN model against other models in our custom script.  
It is useful for interoperability and for visualizing our model more clearly.

In [None]:
!pip install skl2onnx

In [None]:
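from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# The input width must match the number of features the model was trained on
# (one value per pixel and channel of a flattened image), not a fixed 4.
initial_type = [('float_input', FloatTensorType([None, X.shape[1]]))]
onx = convert_sklearn(knn_model, initial_types=initial_type)
with open("knn_model_pneumonia.onnx", "wb") as f:
    f.write(onx.SerializeToString())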