<a href="https://colab.research.google.com/github/AkritiGhosh/MalariaDetectionML/blob/master/MalariaDetectionML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Malaria Detection using Machine Learning
This is a basic Machine Learning Model which uses Random Forest classifier to detect malaria. 

###WORKING: - 
It uses images of cells of patients, and on the basis of contour area.
Cells infected with malaria are distinguished from the healthy one due to the presence of extra spots in the images. The model checks for different boundaries in the image and stores the areas of each ring in a csv file, called area_dataset.csv

This csv file is then used as a structured dataset to be fed into the Random Forest model.


#  Cell 0: Importing Libraries
1. Glob : (Used in Cell 1) the glob module is used to retrieve files/pathnames matching a specified pattern.

2. OpenCV : (Used in Cell 1) OpenCV is a library of programming functions mainly aimed at real-time computer vision. Reading the image, blurring and contouring of images is done using OpenCV

3. CSV : (Used in Cell 1 and 2) Library used for creation, manipulation and traversal of csv files.

4. Pandas :  (Used in Cell 2 and 4) Pandas is used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

5. Sklearn/ Scikit-Learn :  (Used in Cell 5, 6 and 7) A machine learning library, which features various machine learning models including clustering, regression and classification models. For eg. SVM and Random Forest. 
It also features different metrics for analysis of the created model.


In [None]:
import glob
import cv2
import csv
import pandas as pd
import io
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import joblib

## Cell 1 : Data Acquisition and Preprocessing
The following code is the process of reding images from local computer and preprocessing it to give a structured dataset. It has been commented after creation of csv file.
###Step 1: Reading images 
The dataset is read from a subdirectory to the current directory called 'cell_images'. Inside it are 2 subfolders named - Parasitized and Uninfected. Each directory files are read, sequentially. The dataset is a collection of cell images with 13,779 infected images and 13,779 healthy cell images.
###Step 2: Grayscale and Blurring
Since, color is not an important factor in this classification, the image, after reading are converted into grayscale; and is blurred using Gaussian Blur to remove unnecessary noise.
###Step 3: Finding contours and their areas
Threshold() is used to fix some value which draws a boundary line between two set of data. From this boundary, contours are found and their areas are evaluated.

This area is now stored in separate csv files for Infected and Uninfected images.


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [6]:
!unzip '/content/drive/My Drive/cell_images.zip'

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_118.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_126.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_134.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_141.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_168.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_175.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_183.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_221.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_222.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_87.png  
 extracting: cell_images/Uninfected/C236ThinF_IMG_20151127_102428_cell_91.png  
 extracting: cell_images/Uninfected/C236ThinF_

In [None]:
### Dataset(CSV) creation using Cell Images => Unstructured to Structured

## Reading image path
name = 'Uninfected'
name = 'Parasitized'
for name in ['Uninfected', 'Parasitized']
  img_path_infected = glob.glob('cell_images/'+name+'/*.png')

  ## For every image in this path
  for img_name in img_path_infected:
    print(img_name)
      ## Read images
    img = cv2.imread(img_name)

      ## Gaussian Blur - Noise removal
    img = cv2.GaussianBlur(img,(5,5),0)
      ## Conversion to Grayscale
    im_gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
      
      ## Color Thresholding
    ret,thresh = cv2.threshold(im_gray,127,255,0)
      ## Contour detection
    contours,_ = cv2.findContours(thresh,1,2)
    for contour in contours:
        cv2.drawContours(im_gray, contours, -1, (0,255,0), 3)
        print(contour)
    
      ## Saving the area of contours in csv file
    with open('area.csv', mode ='a') as contour_list:
        fieldnames = ['file_name', 'area1', 'area2', 'area3', 'area4', 'area5']
        area=[]
        for i in range(5):
            try:
                area.append(cv2.contourArea(contours[i]))
            except:
                area.append('0')
    
        writer = csv.DictWriter(contour_list, fieldnames=fieldnames)
        writer.writerow({'file_name':name, 'area1':area[0], 'area2':area[1], 'area3':area[2], 'area4':area[3], 'area5':area[4]})
        

# Cell 2: Merging the CSV files
Two separate csv files were created for each of the classes - Parasitized and Uninfected. These files are merged together to form a single csv file.

The following code reads all csv files in the given location and concatinates it into a single file called area_dataset.csv

In [None]:
# ### Merging different csv (ie infected.csv and uninfected.csv) files into a single file (area_dataset.csv)

# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# #combine all files in the list
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
# #export to csv
# combined_csv.to_csv( "area_dataset.csv", index=False, encoding='utf-8-sig')

# Cell 3 : Uploading file from local computer to google colab
This process is used to select a file from local computer and save it in a variable. This will store the csv file which will be converted to a Pandas Dataframe.

It is used in Google Colab, not necessary in a jupyter notebook or any other IDEs.

In [None]:
### reading dataset/csv  in Google Colab
from google.colab import files
uploaded = files.upload()
print(type(uploaded))

In [None]:
!unzip 

# Cell 4 : Creation of Pandas Dataframe
The csv file is read and stored as a Pandas dataframe. When working in platforms other than Colab, use

`*data = pd.read("location_of_file/filename.csv")*`

In [None]:
data = pd.read_csv(io.StringIO(uploaded['area_dataset.csv'].decode('utf-8')))
print(type(data))

<class 'pandas.core.frame.DataFrame'>


# Cell 5 : Dataset Split
The dataset is split into 2 parts : - 
1. x - the features i.e. the areas of contours
2. y - the labels or classes

The format of the dataset is - 
Label ; area1  ; area2 ;   area3 ;  area4  ; area5

The 1st columns (Label) is **'y'** and the remaining is **'x'**

The dataset is also split into training and testing dataset with the ratio of 75% and 25% respectively.

In [None]:
x = data.drop(["Label"], axis=1)
y = data["Label"]
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.25, random_state = 4)

# Cell 6 : Training the model
The model used for this classification is Random Forest Classifier. Model creation is a lot easier using the sklearn library. Parameter are set accordingly. 
Then the model is trained using the training dataset (x_train, y_train)

In [None]:
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

# Cell 7 : Model testing and Predictions 
The model is tested against the testing dataset and the predicted values are stored in y_predict.

These predictions are then tested against the actual values of the testing dataset, to evaluate the accuracy of the model.
The model is showing an accuracy of 90%

The model can be saved if required, using joblib.dump().

In [None]:
y_predict = model.predict(x_test)
accuracy = metrics.classification_report(y_predict, y_test)
print(accuracy)

              precision    recall  f1-score   support

 Parasitized       0.90      0.89      0.90      3428
  Uninfected       0.89      0.91      0.90      3462

    accuracy                           0.90      6890
   macro avg       0.90      0.90      0.90      6890
weighted avg       0.90      0.90      0.90      6890

