# ICS 485 - Machine Learning Term Project
### Lumpy Skin Disease Classification
### Authors: Amaan Izhar (201781130), Omar Pervez Khan (201746530)
### Section 02, Dr. Irfan Ahmad


## Overview

In this project, we applied various machine learning algorithms to an initially imbalanced dataset and recorded the results both before and after applying oversampling and undersampling. The task is a binary classification problem: predicting whether the disease is present or not. Model performance was evaluated primarily with Recall and F1-score.

## Background

Lumpy Skin Disease (LSD) is a viral infection in cattle that originated in Africa and then spread to the Middle East, Asia, and Eastern Europe.
The notable characteristics of this disease are fever, enlarged superficial lymph nodes, and multiple nodules on the skin. Since this infection is mainly found in cattle, its classification and detection are vital to ensuring their survival. Therefore, we use machine learning algorithms to detect the LSD infections recorded in our dataset.

## System Setup and Architecture

Hardware
- Local Machines (G14 and M1 Air)
  
Software / Packages
- Jupyter
- Sklearn
- Matplotlib
- Seaborn
- Numpy
- Pandas
- Imblearn

Architecture / Pipeline
1. Preprocess the data
2. Perform data analysis
3. Feature Selection using Chi2 method
4. Scale the data using StandardScaler
5. Select 5 core machine learning algorithms along with 3 ensemble algorithms
6. Tune hyperparameters of selected algorithms
7. Train algorithms on imbalanced dataset and record results
8. Train algorithms on oversampled dataset and record results
9. Train algorithms on undersampled dataset and record results
10. Take the best model on the basis of the Recall 
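
As a rough illustration of steps 3 to 6, the sketch below chains Chi2 feature selection, `StandardScaler`, and a classifier in a scikit-learn `Pipeline`, then tunes one hyperparameter with recall as the scoring metric. The synthetic data, the number of selected features (`k=8`), the choice of Logistic Regression, and the grid over `C` are all assumptions made for the example and are not taken from the project notebooks. Note that Chi2 requires non-negative features, which is why selection runs before scaling.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the LSD data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X = X - X.min(axis=0)  # shift each feature so it is non-negative for chi2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=8)),  # step 3: Chi2 selection
    ("scale", StandardScaler()),                    # step 4: scaling
    ("clf", LogisticRegression(max_iter=1000)),     # step 5: one core model
])

# Step 6: tune a hyperparameter, selecting on recall (hypothetical grid).
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]},
                      scoring="recall", cv=5)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_recall, 2))
```

Keeping selection and scaling inside the pipeline means both are refit on each cross-validation fold, so no information from the validation fold leaks into preprocessing.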

## Datasets Used

Name: Lumpy Skin Disease  
Link: https://data.mendeley.com/datasets/7pyhbzb2n9/1  
Citation: Afshari Safavi, Ehsanallah (2021), “Lumpy Skin disease dataset”, Mendeley Data, V1, doi: 10.17632/7pyhbzb2n9.1

## Experiments

In this section, we will explain the methodology of selection of algorithms.

Step 1. We built classifiers on the whole imbalanced dataset. The core classifiers and the reasons for their selection are:
- Logistic Regression 
  - Linear classifier and the problem deals with binary classification
- LinearSVC or Support Vector Machine
  - Linear classifier and the problem deals with binary classification
- Decision Tree
  - Complex, non-linear decision boundary that can cope with an imbalanced dataset
- Random Forest
  - Same reasoning as Decision Tree, with the addition of voting across many trees
- SGD (Stochastic Gradient Descent)
  - An optimization technique that minimizes the loss and builds a decision boundary accordingly

Step 2. We selected the best classifiers from Step 1 on the basis of their Recall score and applied the following ensemble methods:
- Voting Classifier (between best 3 models from Step 1)
  - Combines 3 core models and aggregates their predictions by voting
  - This could improve on the individual models in Step 1 by taking the strengths of each model into account
- Adaboost (on the best classifier from Step 1)
  - Boost the best model to further fine-tune it
- Bagging Classifier (on the best classifier from Step 1)
  - Reduce variance of a black-box estimator such as Decision Tree
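
The three ensemble methods above could be set up roughly as follows. This is a sketch under assumptions: the data is synthetic, and the base models, tree depth, and estimator counts are placeholders rather than the tuned models from the notebooks. The base estimator is passed positionally to `AdaBoostClassifier` and `BaggingClassifier` because the keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

ensembles = {
    # Hard voting: each of the 3 models casts one vote per sample.
    "voting": VotingClassifier(estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ], voting="hard"),
    # Boosting: shallow trees fitted sequentially on reweighted samples.
    "adaboost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=2, random_state=0),
        n_estimators=50, random_state=0),
    # Bagging: full trees fitted on bootstrap samples, then aggregated.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=25, random_state=0),
}

recalls = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"{name}: recall={recalls[name]:.2f}")
```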

Step 3. We repeated Step 1 and Step 2 after applying SMOTE oversampling and Random undersampling to the original dataset and recorded the results.

Step 4. Once we have their performance measures - particularly recall, accuracy, and f1-score - we conduct an analysis in the section below.
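
For reference, the three measures can be derived from the four cells of a binary confusion matrix. The sketch below uses made-up counts (not taken from the notebooks) to show why accuracy alone can look strong on imbalanced data while recall on the disease class is noticeably lower.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F1, and accuracy for the positive (disease) class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 100 diseased and 900 healthy samples.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.97 even though one in five diseased samples is missed (recall 0.80), which is why recall and F1 are the primary measures here.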

## Results

**Imbalanced Dataset**  

Core Classifiers
| Classifier                  	| Recall 	| F1   	| Accuracy 	|
|-----------------------------	|--------	|------	|----------	|
| Logistic Regression         	| 0.79  	| 0.77 	| 0.95     	|
| SVM                         	| 0.80   	| 0.77 	| 0.95     	|
| Decision Trees              	| 0.90   	| 0.91 	| 0.98     	|
| Random Forest               	| 0.87   	| 0.89 	| 0.97     	|
| Stochastic Gradient Descent 	| 0.82   	| 0.78 	| 0.95     	|

Ensemble Methods
| Classifier          	| Recall 	| F1   	| Accuracy 	|
|---------------------	|--------	|------	|----------	|
| Voting Classifier   	| 0.91  	| 0.90 	| 0.98     	|
| Adaboost Classifier 	| 0.90   	| 0.91	| 0.98     	|
| Bagging Classifier  	| 0.92   	| 0.92 	| 0.98     	|

---

**Oversampled Dataset**

Core Classifiers
| Classifier                  	| Recall 	| F1   	| Accuracy 	|
|-----------------------------	|--------	|------	|----------	|
| Logistic Regression         	| 0.54   	| 0.67 	| 0.90     	|
| SVM                         	| 0.52   	| 0.65 	| 0.89     	|
| Decision Tree               	| 0.84   	| 0.88 	| 0.97     	|
| Random Forest               	| 0.85   	| 0.87 	| 0.97     	|
| Stochastic Gradient Descent 	| 0.53   	| 0.67 	| 0.89     	|

Ensemble Methods
| Classifier          	| Recall 	| F1   	| Accuracy 	|
|---------------------	|--------	|------	|----------	|
| Voting Classifier   	| 0.85   	| 0.88 	| 0.97     	|
| Adaboost Classifier 	| 0.85   	| 0.87 	| 0.97     	|
| Bagging Classifier  	| 0.84   	| 0.88 	| 0.97    	|

---

**Undersampled Dataset**

Core Classifiers
| Classifier                  	| Recall 	| F1   	| Accuracy 	|
|-----------------------------	|--------	|------	|----------	|
| Logistic Regression         	| 0.52  	| 0.65 	| 0.87     	|
| SVM                         	| 0.50   	| 0.64 	| 0.87     	|
| Decision Trees              	| 0.69   	| 0.80 	| 0.94     	|
| Random Forest               	| 0.73   	| 0.83 	| 0.95     	|
| Stochastic Gradient Descent 	| 0.53   	| 0.67 	| 0.88     	|

Ensemble Methods
| Classifier          	| Recall 	| F1   	| Accuracy 	|
|---------------------	|--------	|------	|----------	|
| Voting Classifier   	| 0.73   	| 0.83 	| 0.95     	|
| Adaboost Classifier 	| 0.75   	| 0.84 	| 0.95     	|
| Bagging Classifier  	| 0.71   	| 0.81 	| 0.94     	|


> Note: Confusion Matrix and Classification Report are in their respective notebooks.

## Result Analysis

**Imbalanced Dataset**

On average, the F1 error (1 - F1) of the core classifiers on the imbalanced dataset was 17.6%. Decision Tree achieved the best recall among the core classifiers. The ensemble methods reduced the average F1 error to 9%, a significant improvement. The best ensemble method was the Bagging Classifier.

**Undersampled Dataset**

On average, the F1 error (1 - F1) of the core classifiers on the undersampled dataset was 28.2%. Random Forest achieved the best F1 among the core classifiers. The ensemble methods reduced the average F1 error to 17%, a significant improvement. The best ensemble method was the Adaboost Classifier.

**Oversampled Dataset**

On average, the F1 error (1 - F1) of the core classifiers on the oversampled dataset was 25.2%. Random Forest again achieved the best recall among the core classifiers (0.85), while Decision Tree had a marginally higher F1 (0.88 vs. 0.87). The ensemble methods reduced the average F1 error to 12.4%, a significant improvement. The best ensemble method was a tie between the Adaboost Classifier and the Bagging Classifier.
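
The average F1 error figures can be reproduced approximately from the Results tables by averaging 1 - F1 over each group of classifiers. The sketch below assumes that this is how the error was defined. It uses the two-decimal values from the tables, so small deviations from the reported figures are possible.

```python
# F1 scores copied from the Results tables (rounded to two decimals there).
f1_scores = {
    "imbalanced": {"core": [0.77, 0.77, 0.91, 0.89, 0.78],
                   "ensemble": [0.90, 0.91, 0.92]},
    "oversampled": {"core": [0.67, 0.65, 0.88, 0.87, 0.67],
                    "ensemble": [0.88, 0.87, 0.88]},
    "undersampled": {"core": [0.65, 0.64, 0.80, 0.83, 0.67],
                     "ensemble": [0.83, 0.84, 0.81]},
}


def mean_f1_error(scores):
    """Average of (1 - F1) over a group of classifiers, as a percentage."""
    return 100.0 * sum(1.0 - s for s in scores) / len(scores)


errors = {
    dataset: {group: round(mean_f1_error(values), 1)
              for group, values in groups.items()}
    for dataset, groups in f1_scores.items()
}

for dataset, groups in errors.items():
    print(f"{dataset}: core={groups['core']}%, ensemble={groups['ensemble']}%")
```

With the rounded table values this gives about 17.6% and 9.0% for the imbalanced dataset, 25.2% and 12.3% for the oversampled dataset, and 28.2% and 17.3% for the undersampled dataset. The oversampled ensemble value differs from the reported 12.4% by 0.1, which is presumably a rounding effect.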

Although the original imbalanced dataset has the lowest average F1 error of the three, our focus was on the recall and F1 of individual models rather than on average error rates. As the tables show, there is a tie between Adaboost and Bagging for the oversampled data, so we refer to the generated confusion matrices to refine our analysis.

In the confusion matrix of the Adaboost Classifier, the number of correctly predicted disease cases (label 1) was 534, compared to 547 for the Bagging Classifier. This indicates that the Bagging Classifier is better at correctly identifying samples that have the disease.

In conclusion, *Bagging Classifier (with a base estimator of Random Forest) on an oversampled dataset performed the best* and was able to recall more samples that had the disease. A machine learning system built on this model could therefore be deployed to aid agricultural scientists and biologists in detecting the disease and protecting cattle by isolating the infected ones.