## Diagnosis of Acute Lymphoblastic Leukemia with Various Classification Algorithms

This notebook covers the code and accuracy reports for the following algorithms. For further information about the data and the problem statement please view the [README.md](https://github.com/GV-9wj/Acute-Lymphomatic-Leukemia-ALL-IDB-prediction/blob/main/README.md) file.

##### Classification Algorithms used in this notebook are:

1. K Nearest Neighbours Classifier [KNN sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
2. Support Vector Machine Classifier [SVC sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
3. Naive Bayes Classifier [GNB sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)


##### Process being followed in this notebook
The process we will follow to use the images for training and testing is:

1. Importing the necessary libraries


2. Defining functions that will:
    1. Convert the image into an array of pixel intensities
    2. Convert the image into a normalized form of the colour histogram of the image


3. Define the variables we will be using for train and test
    1. The variable **rawImages** will hold the raw image data (pixel data) used for predicting the labels
    2. The **features** list will hold the histogram data used for predicting the labels
    3. The **labels** list will act as the target variable for our problem statement


4. Read the files one by one and, for each file, compute the aforementioned variables and append them to the lists.


5. Split the data into train and test data.
    1. One split for the raw pixel data
    2. Another split for the histogram data.


6. Train and test the models, one for each type of variable
    1. For training we will use the following classifiers
        1. K-Nearest Neighbours
        2. Support Vector Classifier
        3. Naive Bayes Classifier
    2. For testing and evaluating the models we will use the following metrics
        1. Accuracy, computed with *metrics.accuracy_score()*
        2. A classification report that gives the precision, the recall and the F1 score, where:
            1. *Precision* is the fraction of selected items that are relevant, $TP / (TP + FP)$.
            2. *Recall* is the fraction of relevant items that are selected, $TP / (TP + FN)$.
            3. *F1 score* is the harmonic mean of the precision and the recall.
        3. A *confusion matrix*, which gives the true negatives ($C_{0, 0}$), false negatives ($C_{1, 0}$), true positives ($C_{1, 1}$) and false positives ($C_{0, 1}$)
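
As a worked example of these definitions, the sketch below derives precision, recall, F1 score and accuracy for class 1 directly from a confusion matrix. The matrix values are the ones reported for the k-NN colour histogram run later in this notebook; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix in scikit-learn's layout: rows are the true classes,
# columns are the predicted classes.
cm = np.array([[29, 1],
               [4, 31]])

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.97 recall=0.89 f1=0.93 accuracy=0.92
```

These values agree with the class 1 row of the classification report for that run.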

### Step 1. Importing the necessary libraries

We first import the basic functions required for image processing. These functions are found in the modules [`cv2`](https://github.com/skvark/opencv-python) and [`imutils`](https://github.com/jrosebr1/imutils), both of which are used for basic image processing. <br /> 
Then we import the `os` module, which is used to read through files. The function used in this notebook is [`listdir`](https://docs.python.org/3/library/os.html#os.listdir). <br />
Finally we import the classifiers required for the project, namely KNN, SVM and Naive Bayes, along with [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for splitting the data and [`sklearn.metrics`](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) for evaluating the models.

In [1]:
import numpy as np
import imutils
import cv2
import os 

from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB as gnb
from sklearn.model_selection import train_test_split
from sklearn import metrics

### Step 2: Defining the functions required to change the images into machine readable format.
Two functions are used in this notebook, `image_to_feature_vector` and `extract_color_histogram`. Let us look at them in detail.

#### 1. `image_to_feature_vector`
This function takes the image and the target size (resolution) as arguments. The [`cv2.resize`](https://www.tutorialkart.com/opencv/python/opencv-python-resize-image/) function resizes the image to the requested size, 32×32 by default, so the result has 1024 pixels. Each pixel holds three colour values ([What is RGB](https://www.scantips.com/basics1b.html#:~:text=A%20digital%20color%20image%20pixel,Red%2C%20Green%2C%20Blue)); note that OpenCV stores them in blue, green, red order. Finally, NumPy's `.flatten()` method turns the 32×32×3 array into a one-dimensional array of 3072 numbers.

![Fig. 1 Flow Chart for the `image_to_feature_vector` function](data/FlowFunc1.jpg)
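
The flattening step can be checked in isolation. The sketch below uses a random NumPy array as a stand-in for the output of `cv2.resize(image, (32, 32))`, so it runs without an image file; the array contents are made up for illustration.

```python
import numpy as np

# Stand-in for a resized image: 32 x 32 pixels, 3 colour channels, 8-bit values.
rng = np.random.default_rng(0)
resized = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

vector = resized.flatten()
print(resized.shape, "->", vector.shape)   # (32, 32, 3) -> (3072,)

# flatten() walks the array row by row, pixel by pixel, channel by channel,
# so entries 0..2 belong to pixel (0, 0) and entries 3..5 to pixel (0, 1).
first_pixel = vector[:3]
second_pixel = vector[3:6]
```

Because every image is resized to the same shape, every feature vector has the same length, which is what the classifiers require.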

#### 2. `extract_color_histogram`
This function takes the image and the number of bins per channel for the colour histogram as arguments. The first call, `cv2.cvtColor()`, converts the image into the HSV colour space.<br />
The second call, `cv2.calcHist()`, computes a 3-D histogram of the HSV image [(What is a 3-D Histogram)](http://dofideas.com/h3stogram-interactive-3d-color-histogram-en/#:~:text=In%20image%20processing%20and%20photography,set%20of%20all%20possible%20colors.).<br />
The second argument, `[0, 1, 2]`, selects the channels to include, which here are the three HSV channels (hue, saturation and value) rather than RGB. The third argument is an optional mask, which we leave as `None`. With 8 bins per channel, the value range of each channel is divided into 8 intervals, giving 8 × 8 × 8 = 512 bins in total. The last argument gives the value range of each channel, and the code uses [0, 256] for all three. For 8-bit images OpenCV stores hue in the range [0, 180), so the upper hue bins stay empty with this setting.<br />
The `is_cv2()` function from `imutils` checks the OpenCV version, because `cv2.normalize` is called one way in OpenCV 2.4.x and another way in OpenCV 3. [(What is Normalization?)](https://en.wikipedia.org/wiki/Normalization_(image_processing)#:~:text=In%20image%20processing%2C%20normalization%20is,range%20of%20pixel%20intensity%20values.&text=The%20purpose%20of%20dynamic%20range,senses%2C%20hence%20the%20term%20normalization.)<br />
Finally, after normalization, we flatten the histogram into a one-dimensional NumPy array of 512 values.

![Fig. 2 Flow Chart for the `extract_color_histogram` function](data/FlowFunc2.jpg)
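
To make the binning concrete, here is a minimal NumPy-only sketch of a 3-D histogram with 8 bins per channel. It is not the OpenCV implementation: `np.histogramdd` stands in for `cv2.calcHist`, the random array stands in for a real HSV image, and the unit L2 scaling is an assumption about what `cv2.normalize` does with its default arguments.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an HSV image: 64 x 64 pixels, 3 channels of 8-bit values.
hsv = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# One row per pixel, one column per channel.
pixels = hsv.reshape(-1, 3).astype(np.float64)

# Count how many pixels fall into each of the 8 x 8 x 8 cells.
counts, _edges = np.histogramdd(pixels, bins=(8, 8, 8), range=[(0, 256)] * 3)

# Scale the histogram to unit L2 norm, then flatten it into a feature vector.
hist = counts / np.linalg.norm(counts)
feature = hist.flatten()
print(counts.shape, "->", feature.shape)   # (8, 8, 8) -> (512,)
```

The feature vector always has 512 entries regardless of the image size, since it describes the colour distribution rather than individual pixels.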

In [2]:
def image_to_feature_vector(image, size=(32, 32)):
    # resize the image to a fixed size, then flatten the image into
    # a list of raw pixel intensities
    return cv2.resize(image, size).flatten()



def extract_color_histogram(image, bins=(8, 8, 8)):
    # extract a 3D color histogram from the HSV color space using
    # the supplied number of `bins` per channel
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 256, 0, 256, 0, 256])
    # handle normalizing the histogram if we are using OpenCV 2.4.X
    if imutils.is_cv2():
        hist = cv2.normalize(hist)
    # otherwise, perform "in place" normalization in OpenCV 3
    else:
        cv2.normalize(hist, hist)
    # return the flattened histogram as the feature vector
    return hist.flatten()

### Step 3 Defining the variables to be used in model training and testing
In this step we define three variables: **rawImages**, **features** and **labels**. The **rawImages** list will hold the output of the `image_to_feature_vector` function, and the **features** list will hold the output of the `extract_color_histogram` function. The **labels** list will hold the label of each image, 0 for healthy and 1 for leukemia, which can be extracted from the file name. We keep two sets of independent variables so that we can compare the results of using raw pixel data against using colour histogram data, which describes the colour distribution of the image.

In [3]:
rawImages = []
features = []
labels = []

### Step 4 Assigning values to the predefined variables
In this step we loop through the image files and use the functions defined above to fill the three lists. The function `os.listdir` returns a list of all the files in the folder. We then loop through that list, read each file into an image, and use the two functions to generate **rawImages** and **features**. The **labels** variable, which is our target variable, is taken from the character at index 6 of the image file name (`file[6]`), the last character before the file extension.
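
Indexing a fixed character position works only while every file name has the same length. As a hedged alternative, the sketch below parses the digit after the last underscore instead. The helper `label_from_filename` and the example file names are hypothetical, and the `ImXXX_Y` naming pattern is an assumption inferred from the use of `file[6]`.

```python
import os


def label_from_filename(filename):
    """Return the integer after the last underscore, e.g. 'Im001_1.tif' -> 1."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    label = int(stem.rsplit("_", 1)[-1])
    if label not in (0, 1):
        raise ValueError(f"unexpected label {label} in {filename!r}")
    return label


# Hypothetical file names following the assumed pattern.
for name in ["Im001_1.tif", "Im131_0.tif"]:
    print(name, "->", label_from_filename(name))
```

A file that does not follow the pattern raises a `ValueError` instead of silently producing a wrong label.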

In [4]:
Datapath = 'Data/ALL_IDB2/img/'
imagePaths = os.listdir(Datapath)

# loop over the input images
for file in imagePaths:
    # load the image and extract the class label, which is the
    # character at index 6 of the file name
    image = cv2.imread(Datapath + file)
    label = int(file[6])
    # extract raw pixel intensity "features", followed by a color
    # histogram to characterize the color distribution of the pixels
    # in the image
    pixels = image_to_feature_vector(image)
    hist = extract_color_histogram(image)
    # update the raw images, features, and labels lists,
    # respectively
    rawImages.append(pixels)
    features.append(hist)
    labels.append(label)

Here we convert the three lists into NumPy arrays, which is the format scikit-learn expects, and check the size of our variables.

In [5]:
# show some information on the memory consumed by the raw images
# matrix and features matrix
rawImages = np.array(rawImages)
features = np.array(features)
labels = np.array(labels)
print("[INFO] pixels matrix: {:.2f}MB".format(
    rawImages.nbytes / (1024 * 1000.0)))
print("[INFO] features matrix: {:.2f}MB".format(
    features.nbytes / (1024 * 1000.0)))

[INFO] pixels matrix: 0.78MB
[INFO] features matrix: 0.52MB


In [6]:
len(rawImages), len(features), len(labels)

(260, 260, 260)
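
The reported sizes can be reproduced from the array shapes alone. The sketch below assumes 8-bit pixels and 32-bit floating point histograms, which is consistent with the figures printed above; the zero-filled arrays are placeholders for the real data.

```python
import numpy as np

n_images = 260
raw = np.zeros((n_images, 32 * 32 * 3), dtype=np.uint8)    # 3072 bytes per image
hist = np.zeros((n_images, 8 * 8 * 8), dtype=np.float32)   # 512 values, 4 bytes each

for name, arr in [("pixels", raw), ("features", hist)]:
    size = arr.nbytes / (1024 * 1000.0)   # same divisor as the notebook cell
    print(f"{name}: {arr.nbytes} bytes = {size:.2f}MB")
# prints:
# pixels: 798720 bytes = 0.78MB
# features: 532480 bytes = 0.52MB
```

Note that the divisor `1024 * 1000` is neither a megabyte (10^6 bytes) nor a mebibyte (2^20 bytes), so the printed figures are approximate.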

### Step 5 Splitting the data into train and test
Using the function `train_test_split()` we will split the dataset into train data and test data.<br />
Here we conduct two splits, one for the raw pixel data with its labels and another for the colour histogram data with the same labels. This lets us train and evaluate each model on the raw pixel intensities and on the colour histograms.
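
`train_test_split` also accepts several arrays in one call and splits them along the same row indices, which makes it explicit that the pixel and histogram splits share the same test images. The sketch below shows this on synthetic arrays with the same shapes as the notebook's data; the `stratify` argument is an optional addition, not part of the original notebook, that keeps the class proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins with the same shapes as rawImages, features and labels.
raw = rng.integers(0, 256, size=(260, 3072), dtype=np.uint8)
feat = rng.random((260, 512))
y = np.array([0, 1] * 130)

(train_raw, test_raw,
 train_feat, test_feat,
 train_y, test_y) = train_test_split(
    raw, feat, y, test_size=0.25, random_state=123, stratify=y)

print(len(train_y), len(test_y))   # 195 65
print(np.bincount(test_y))         # class counts in the test split
```

With a single call there is only one set of labels, so both feature types are guaranteed to be evaluated on the same test images.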

In [7]:
# partition the data into training and testing splits, using 75%
# of the data for training and the remaining 25% for testing
(trainImage, testImage, trainImageLabel, testImageLabel) = train_test_split(rawImages, 
                                                                            labels, 
                                                                            test_size=0.25,
                                                                            random_state = 123)
(trainFeat, testFeat, trainFeatLabel, testFeatLabel) = train_test_split(features, 
                                                                        labels, 
                                                                        test_size=0.25, 
                                                                        random_state = 123)

### Step 6 Training, Testing and evaluating the classifiers
In this step we create each classifier, train it on the training data defined above and then test it. The same procedure is repeated for every classifier, and each classifier is run twice: once on the raw pixel data and once on the colour histograms.
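
Since the procedure is identical for every classifier, it can be wrapped in a small helper. The sketch below is one possible way to do that, not the notebook's own code: the function `evaluate` is hypothetical, and synthetic two-class data stands in for the image features.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model`, predict on the test split and return the three metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        "report": metrics.classification_report(y_test, y_pred),
        "confusion": metrics.confusion_matrix(y_test, y_pred),
    }


# Synthetic two-class data standing in for the notebook's feature matrices.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(10, 1, (40, 4))])
y = np.array([0] * 40 + [1] * 40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123)

classifiers = {"KNN": KNeighborsClassifier(),
               "SVC": SVC(),
               "Naive Bayes": GaussianNB()}
results = {name: evaluate(clf, X_train, X_test, y_train, y_test)
           for name, clf in classifiers.items()}

for name, res in results.items():
    print(f"{name}: accuracy = {res['accuracy'] * 100:.2f} %")
```

Calling the helper once with the pixel split and once with the histogram split would cover all six runs.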

#### K - Nearest Neighbours classifier
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
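
The voting rule can be written in a few lines of NumPy. This is a minimal sketch for illustration, not scikit-learn's implementation; it assumes Euclidean distance, and its tie handling may differ from `KNeighborsClassifier`, which uses k = 5 by default. The function name and toy data are made up.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=5):
    """Classify one sample `x` by a plurality vote of its k nearest neighbours."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))   # 1
```

There is no training step beyond storing the data; all the work happens at prediction time.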

###### For the Raw-Pixel data

In [8]:
# train and evaluate a k-NN classifier on the raw pixel intensities
print("[INFO] evaluating raw pixel accuracy for KNN...")
knn_model_image = KNN()
knn_model_image.fit(trainImage, trainImageLabel)

y_pred_knn_image = knn_model_image.predict(testImage)

print("\nK-Nearest Neighbours Test Accuracy for image data: {} %".
      format(metrics.accuracy_score(testImageLabel,
                                            y_pred_knn_image)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testImageLabel, 
                                                                                     y_pred_knn_image)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testImageLabel, 
                                                                            y_pred_knn_image)))

[INFO] evaluating raw pixel accuracy for KNN...

K-Nearest Neighbours Test Accuracy for image data: 73.84615384615385 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.66      0.78      0.71        27
           1       0.82      0.71      0.76        38

    accuracy                           0.74        65
   macro avg       0.74      0.74      0.74        65
weighted avg       0.75      0.74      0.74        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[21  6]
 [11 27]]


###### For the Colour histogram data

In [9]:
# train and evaluate a k-NN classifier on the histogram
# representations
print("[INFO] evaluating histogram accuracy for KNN...")
knn_model_hist = KNN()
knn_model_hist.fit(trainFeat, trainFeatLabel)

y_pred_knn_hist = knn_model_hist.predict(testFeat)

print("\nK-Nearest Neighbours Test Accuracy for Colour histogram data: {} %".
      format(metrics.accuracy_score(testFeatLabel,
                                            y_pred_knn_hist)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testFeatLabel, 
                                                                                     y_pred_knn_hist)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testFeatLabel, 
                                                                            y_pred_knn_hist)))

[INFO] evaluating histogram accuracy for KNN...

K-Nearest Neighbours Test Accuracy for Colour histogram data: 92.3076923076923 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.88      0.97      0.92        30
           1       0.97      0.89      0.93        35

    accuracy                           0.92        65
   macro avg       0.92      0.93      0.92        65
weighted avg       0.93      0.92      0.92        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[29  1]
 [ 4 31]]


#### Support Vector Machine Classifier
A support vector machine (SVM) is a supervised machine learning model for two-group classification problems. Given labeled training data for each category, it finds the boundary that separates the two classes with the largest margin and uses that boundary to categorize new samples, which in this notebook are images.
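
A small two-dimensional example shows the idea. The sketch below uses a linear kernel so that the boundary is a straight line; the notebook itself calls `SVC()` with its defaults, which use an RBF kernel. The toy points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated groups of points, one per class.
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
              [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

pred = clf.predict([[0.2, 0.4], [3.2, 3.1]])
print(pred)                    # [0 1]
print(clf.support_vectors_)    # the training points that define the margin
```

Only the support vectors influence the position of the boundary; the remaining training points could be removed without changing the model.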

###### For the Raw-Pixel data

In [10]:
# train and evaluate an SVM classifier on the raw pixel intensities
print("[INFO] evaluating raw pixel accuracy for SVC...")
svc_model_image = SVC()
svc_model_image.fit(trainImage, trainImageLabel)

y_pred_svc_image = svc_model_image.predict(testImage)

print("\nSupport Vector Machine Test Accuracy for image data: {} %".
      format(metrics.accuracy_score(testImageLabel,
                                            y_pred_svc_image)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testImageLabel, 
                                                                                     y_pred_svc_image)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testImageLabel, 
                                                                            y_pred_svc_image)))

[INFO] evaluating raw pixel accuracy for SVC...

Support Vector Machine Test Accuracy for image data: 76.92307692307693 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.73      0.70      0.72        27
           1       0.79      0.82      0.81        38

    accuracy                           0.77        65
   macro avg       0.76      0.76      0.76        65
weighted avg       0.77      0.77      0.77        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[19  8]
 [ 7 31]]


###### For the Colour histogram data

In [11]:
# train and evaluate an SVM classifier on the histogram
# representations
print("[INFO] evaluating histogram accuracy for SVC...")
svc_model_hist = SVC()
svc_model_hist.fit(trainFeat, trainFeatLabel)

y_pred_svc_hist = svc_model_hist.predict(testFeat)

print("\nSupport Vector Machine Test Accuracy for Colour histogram data: {} %".
      format(metrics.accuracy_score(testFeatLabel,
                                            y_pred_svc_hist)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testFeatLabel, 
                                                                                     y_pred_svc_hist)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testFeatLabel, 
                                                                            y_pred_svc_hist)))

[INFO] evaluating histogram accuracy for SVC...

Support Vector Machine Test Accuracy for Colour histogram data: 87.6923076923077 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.79      1.00      0.88        30
           1       1.00      0.77      0.87        35

    accuracy                           0.88        65
   macro avg       0.89      0.89      0.88        65
weighted avg       0.90      0.88      0.88        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[30  0]
 [ 8 27]]


#### Naive Bayes Classifier
In statistics, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models. 
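
The Gaussian variant used in this notebook models each feature of each class as an independent normal distribution. The sketch below is a simplified NumPy version for illustration, not scikit-learn's `GaussianNB`; the function names, the fixed variance floor and the toy data are assumptions.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[int(c)] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,   # small floor avoids division by zero
        }
    return params


def predict_gaussian_nb(params, x):
    """Return the class with the highest log posterior for one sample `x`."""
    scores = {}
    for c, p in params.items():
        # Independence assumption: the log-likelihoods of the features add up.
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_likelihood
    return max(scores, key=scores.get)


X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([1.1, 2.1])))   # 0
print(predict_gaussian_nb(params, np.array([4.1, 4.9])))   # 1
```

The independence assumption is rarely true for neighbouring pixels or histogram bins, which is one reason to compare this model against the other two.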

###### For the Raw-Pixel data

In [12]:
# train and evaluate a Naive Bayes classifier on the raw pixel intensities
print("[INFO] evaluating raw pixel accuracy for Naive Bayes...")
gnb_model_image = gnb()
gnb_model_image.fit(trainImage, trainImageLabel)

y_pred_gnb_image = gnb_model_image.predict(testImage)

print("\nNaive Bayes Classifier Test Accuracy for image data: {} %".
      format(metrics.accuracy_score(testImageLabel,
                                            y_pred_gnb_image)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testImageLabel, 
                                                                                     y_pred_gnb_image)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testImageLabel, 
                                                                            y_pred_gnb_image)))

[INFO] evaluating raw pixel accuracy for Naive Bayes...

Naive Bayes Classifier Test Accuracy for image data: 72.3076923076923 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.76      0.48      0.59        27
           1       0.71      0.89      0.79        38

    accuracy                           0.72        65
   macro avg       0.74      0.69      0.69        65
weighted avg       0.73      0.72      0.71        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[13 14]
 [ 4 34]]


###### For the Colour histogram data

In [13]:
# train and evaluate a Naive Bayes classifier on the histogram
# representations
print("[INFO] evaluating histogram accuracy for Naive Bayes Classifer...")
gnb_model_hist = gnb()
gnb_model_hist.fit(trainFeat, trainFeatLabel)

y_pred_gnb_hist = gnb_model_hist.predict(testFeat)

print("\nNaive Bayes Classifier Test Accuracy for Colour histogram data: {} %".
      format(metrics.accuracy_score(testFeatLabel,
                                            y_pred_gnb_hist)*100))
print("-------------------------------------------------------------------------------------------------")
print("\nClassification Report : \n {}".format(metrics.classification_report(testFeatLabel, 
                                                                                     y_pred_gnb_hist)))
print("-------------------------------------------------------------------------------------------------")
print("\n Confusion Matrix :- \n{}".format(metrics.confusion_matrix(testFeatLabel, 
                                                                            y_pred_gnb_hist)))

[INFO] evaluating histogram accuracy for Naive Bayes Classifer...

Naive Bayes Classifier Test Accuracy for Colour histogram data: 72.3076923076923 %
-------------------------------------------------------------------------------------------------

Classification Report : 
               precision    recall  f1-score   support

           0       0.93      0.43      0.59        30
           1       0.67      0.97      0.79        35

    accuracy                           0.72        65
   macro avg       0.80      0.70      0.69        65
weighted avg       0.79      0.72      0.70        65

-------------------------------------------------------------------------------------------------

 Confusion Matrix :- 
[[13 17]
 [ 1 34]]


## Results for Class 0 (Patient without Leukemia) Image Data

|MODEL                      |Precision | Recall | F1 score| Test accuracy|
|---------------------------|----------|--------|---------|--------------|
| ``K-Nearest Neighbours``  | 0.66     | 0.78   |0.71     | 73.85 %      |
| ``Support Vector Machine``| 0.73     | 0.70   |0.72     | 76.92 %      |
| ``Naive Bayes``           | 0.76     | 0.48   |0.59     | 72.31 %      |


## Results for Class 1 (Patient with Leukemia) Image Data

|MODEL                      |Precision | Recall | F1 score| Test accuracy|
|---------------------------|----------|--------|---------|--------------|
| ``K-Nearest Neighbours``  | 0.82     | 0.71   |0.76     | 73.85 %      |
| ``Support Vector Machine``| 0.79     | 0.82   |0.81     | 76.92 %      |
| ``Naive Bayes``           | 0.71     | 0.89   |0.79     | 72.31 %      |

## Results for Class 0 (Patient without Leukemia) Color Histogram Data

|MODEL                      |Precision | Recall | F1 score| Test accuracy|
|---------------------------|----------|--------|---------|--------------|
| ``K-Nearest Neighbours``  | 0.88     | 0.97   |0.92     | 92.31 %      |
| ``Support Vector Machine``| 0.79     | 1.00   |0.88     | 87.69 %      |
| ``Naive Bayes``           | 0.93     | 0.43   |0.59     | 72.31 %      |

## Results for Class 1 (Patient with Leukemia) Color Histogram Data

|MODEL                      |Precision | Recall | F1 score| Test accuracy|
|---------------------------|----------|--------|---------|--------------|
| ``K-Nearest Neighbours``  | 0.97     | 0.89   |0.93     | 92.31 %      |
| ``Support Vector Machine``| 1.00     | 0.77   |0.87     | 87.69 %      |
| ``Naive Bayes``           | 0.67     | 0.97   |0.79     | 72.31 %      |
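
Copying numbers from printed reports into tables by hand is error-prone. As a hedged alternative, the sketch below builds a table row directly from `classification_report(..., output_dict=True)`. The helper `table_row` is hypothetical, and the toy labels stand in for a real test split and its predictions.

```python
from sklearn import metrics

# Toy labels standing in for the true test labels and a model's predictions.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

report = metrics.classification_report(y_true, y_pred, output_dict=True)
accuracy = metrics.accuracy_score(y_true, y_pred)


def table_row(model_name, class_label, report, accuracy):
    """Format one Markdown table row for the given class of one model."""
    scores = report[str(class_label)]   # per-class entries are keyed by the label as a string
    return (f"| ``{model_name}`` | {scores['precision']:.2f} "
            f"| {scores['recall']:.2f} | {scores['f1-score']:.2f} "
            f"| {accuracy * 100:.2f} % |")


print(table_row("Example model", 0, report, accuracy))
print(table_row("Example model", 1, report, accuracy))
```

Running this once per model and class would regenerate all four tables from the same predictions that produced the reports.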