
# FYP - Vector Borne Disease Prediction with xAI

## Introduction
Vectors are living organisms that can transmit infectious pathogens between humans, or from animals to humans. Many of these vectors are bloodsucking insects, which ingest disease-producing microorganisms during a blood meal from an infected host (human or animal) and later transmit them to a new host after the pathogen has replicated. Often, once a vector becomes infectious, it can transmit the pathogen for the rest of its life during each subsequent bite or blood meal.

Vector-borne diseases are human illnesses caused by parasites, viruses and bacteria that are transmitted by vectors. Every year there are more than 700,000 deaths from vector-borne diseases such as malaria, dengue, yellow fever, Japanese encephalitis and West Nile fever.

## Libraries

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

from tabulate import tabulate

In [2]:
from google.colab import drive

## Set-Up

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
data_folder = '/content/drive/MyDrive/VB Disease Dataset'

## Dataset Analysis

In [5]:
# Read the dataset from the folder defined in data_folder
train = pd.read_csv(f'{data_folder}/train.csv')
test = pd.read_csv(f'{data_folder}/test.csv')

In [6]:
# Work on a copy so that later edits to `data` never alias `train`
data = train.copy()

In [7]:
print(data.head())

   id  sudden_fever  headache  mouth_bleed  nose_bleed  muscle_pain  \
0   0           1.0       1.0          0.0         1.0          1.0   
1   1           0.0       0.0          0.0         0.0          0.0   
2   2           0.0       1.0          1.0         1.0          0.0   
3   3           0.0       0.0          1.0         1.0          1.0   
4   4           0.0       0.0          0.0         0.0          0.0   

   joint_pain  vomiting  rash  diarrhea  ...  breathing_restriction  \
0         1.0       1.0   0.0       1.0  ...                    0.0   
1         0.0       1.0   0.0       1.0  ...                    0.0   
2         1.0       1.0   1.0       1.0  ...                    1.0   
3         1.0       0.0   1.0       0.0  ...                    0.0   
4         0.0       0.0   0.0       1.0  ...                    0.0   

   toe_inflammation  finger_inflammation  lips_irritation  itchiness  ulcers  \
0               0.0                  0.0              0.0        0

In [8]:
print(data.shape)

(707, 66)


In [9]:
print(data.describe())

               id  sudden_fever    headache  mouth_bleed  nose_bleed  \
count  707.000000    707.000000  707.000000   707.000000  707.000000   
mean   353.000000      0.503536    0.449788     0.459689    0.487977   
std    204.237607      0.500341    0.497825     0.498725    0.500209   
min      0.000000      0.000000    0.000000     0.000000    0.000000   
25%    176.500000      0.000000    0.000000     0.000000    0.000000   
50%    353.000000      1.000000    0.000000     0.000000    0.000000   
75%    529.500000      1.000000    1.000000     1.000000    1.000000   
max    706.000000      1.000000    1.000000     1.000000    1.000000   

       muscle_pain  joint_pain    vomiting        rash    diarrhea  ...  \
count   707.000000  707.000000  707.000000  707.000000  707.000000  ...   
mean      0.517680    0.449788    0.441301    0.487977    0.390382  ...   
std       0.500041    0.497825    0.496894    0.500209    0.488181  ...   
min       0.000000    0.000000    0.000000    0.000

In [10]:
target_types = list(train['prognosis'].unique())
target_types

['Lyme_disease',
 'Tungiasis',
 'Zika',
 'Rift_Valley_fever',
 'West_Nile_fever',
 'Malaria',
 'Chikungunya',
 'Plague',
 'Dengue',
 'Yellow_Fever',
 'Japanese_encephalitis']

## EDA

In [11]:
# Create figure
fig = px.histogram(train['prognosis'], color_discrete_sequence=['#636EFA'])

# Set Title and x/y axis labels
fig.update_layout(
    xaxis_title="Disease",
    yaxis_title="Frequency",
    showlegend=False,
    font=dict(size=14),
    title={
        'text': "Train Prognosis Distribution",
        'y': 0.95,
        'x': 0.5
    }
)

# Display
fig.show()


In [12]:
# Create figure ('prognosis' is a text column, so restrict the correlation to numeric columns)
fig = px.imshow(train.corr(numeric_only=True))

# Set Title and x/y axis labels
fig.update_layout(
    showlegend=False,
    font=dict(size=14),
    title={
        'text': "Train Dataset Correlation",
        'y': 0.95,
        'x': 0.49
    }
)

# Display
fig.show()





## Pre-Processing

### Basic Pre-processing

In [13]:
# Remove any missing values
data = data.dropna()

# Convert 'prognosis' text to lowercase
data['prognosis'] = data['prognosis'].str.lower()

# Remove punctuation from 'prognosis' text (underscores count as word characters, so they are kept)
data['prognosis'] = data['prognosis'].str.replace(r'[^\w\s]', '', regex=True)

# Display the modified DataFrame
print(data.head())


   id  sudden_fever  headache  mouth_bleed  nose_bleed  muscle_pain  \
0   0           1.0       1.0          0.0         1.0          1.0   
1   1           0.0       0.0          0.0         0.0          0.0   
2   2           0.0       1.0          1.0         1.0          0.0   
3   3           0.0       0.0          1.0         1.0          1.0   
4   4           0.0       0.0          0.0         0.0          0.0   

   joint_pain  vomiting  rash  diarrhea  ...  breathing_restriction  \
0         1.0       1.0   0.0       1.0  ...                    0.0   
1         0.0       1.0   0.0       1.0  ...                    0.0   
2         1.0       1.0   1.0       1.0  ...                    1.0   
3         1.0       0.0   1.0       0.0  ...                    0.0   
4         0.0       0.0   0.0       1.0  ...                    0.0   

   toe_inflammation  finger_inflammation  lips_irritation  itchiness  ulcers  \
0               0.0                  0.0              0.0        0

In [14]:
print(f'[INFO] Shapes:'
      f'\n train: {train.shape}'
      f'\n test: {test.shape}\n')

print(f'[INFO] Any missing values:'
      f'\n train: {train.isna().any().any()}'
      f'\n test: {test.isna().any().any()}')

[INFO] Shapes:
 train: (707, 66)
 test: (303, 65)

[INFO] Any missing values:
 train: False
 test: False


### Target Encoding

In [28]:
target_types = list(data['prognosis'].unique())
target_types

['lyme_disease',
 'tungiasis',
 'zika',
 'rift_valley_fever',
 'west_nile_fever',
 'malaria',
 'chikungunya',
 'plague',
 'dengue',
 'yellow_fever',
 'japanese_encephalitis']

In [29]:
# Map each disease name to an integer code, in order of first appearance
out_mapping = {name: index for index, name in enumerate(target_types)}
data['prognosis'] = data['prognosis'].map(out_mapping)
data.head()

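|   | id | sudden_fever | headache | mouth_bleed | nose_bleed | muscle_pain | joint_pain | vomiting | rash | diarrhea | ... | breathing_restriction | toe_inflammation | finger_inflammation | lips_irritation | itchiness | ulcers | toenail_loss | speech_problem | bullseye_rash | prognosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 2 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0 |
| 3 | 3 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 4 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 3 |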
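
The manual mapping above assigns integer codes in order of first appearance. As a possible alternative, scikit-learn's `LabelEncoder` builds a comparable mapping (in sorted label order rather than first-appearance order) and offers `inverse_transform` for decoding. The sketch below runs on a hypothetical toy series, `toy_prognosis`, rather than on the notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for data['prognosis']
toy_prognosis = pd.Series(['zika', 'dengue', 'malaria', 'dengue', 'zika'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_prognosis)

# classes_ holds the labels in sorted order; the position is the integer code
label_to_code = {label: int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform(encoded)

print(label_to_code)
print(list(encoded))
print(list(decoded))
```

Because the codes follow sorted order, they would generally differ from the codes in `out_mapping`, so the two encodings should not be mixed within one pipeline.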


In [30]:
correlation_matrix = data.corr()

# Set up the matplotlib figure
plt.figure(figsize=(35,30))

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Add a title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()

Output hidden; open in https://colab.research.google.com to view.

In [31]:
X = data.drop(['prognosis'],axis=1)
Y = data['prognosis']
X_train , X_test , Y_train , Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
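
Note that `X` as built above still contains the `id` column, which is a row identifier rather than a symptom. One option worth considering is to drop it along with the target, and to stratify the split so that each disease keeps roughly the same share in both partitions. The sketch below illustrates the idea on a small synthetic frame (`toy` and its values are made up for illustration); it does not reproduce the results reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the encoded dataset: 3 classes, 40 rows each
toy = pd.DataFrame({
    'id': np.arange(120),
    'sudden_fever': rng.integers(0, 2, size=120).astype(float),
    'headache': rng.integers(0, 2, size=120).astype(float),
    'prognosis': np.repeat([0, 1, 2], 40),
})

# Drop the identifier together with the target
X_toy = toy.drop(columns=['id', 'prognosis'])
y_toy = toy['prognosis']

# stratify=y_toy keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index().to_dict())
```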

## Model Building

### Random Forests & Naive Bayes

In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Train Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, Y_train)

# Predictions using Naive Bayes classifier
nb_predictions = nb_classifier.predict(X_test)

# Evaluate Naive Bayes classifier
nb_accuracy = accuracy_score(Y_test, nb_predictions)
print("Naive Bayes Classifier Accuracy:", nb_accuracy)
print("Naive Bayes Classifier Report:\n", classification_report(Y_test, nb_predictions, zero_division=0))

# Train Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, Y_train)

# Predictions using Random Forest classifier
rf_predictions = rf_classifier.predict(X_test)

# Evaluate Random Forest classifier
rf_accuracy = accuracy_score(Y_test, rf_predictions)
print("\nRandom Forest Classifier Accuracy:", rf_accuracy)
print("Random Forest Classifier Report:\n", classification_report(Y_test, rf_predictions, zero_division=0))

Naive Bayes Classifier Accuracy: 0.29577464788732394
Naive Bayes Classifier Report:
               precision    recall  f1-score   support

           0       0.29      0.91      0.43        11
           1       0.53      0.67      0.59        12
           2       0.00      0.00      0.00        13
           3       0.25      0.08      0.12        12
           4       0.00      0.00      0.00        18
           5       0.21      0.40      0.28        10
           6       0.64      0.75      0.69        12
           7       0.00      0.00      0.00        16
           8       0.17      0.33      0.22         6
           9       0.31      0.53      0.39        15
          10       0.00      0.00      0.00        17

    accuracy                           0.30       142
   macro avg       0.22      0.33      0.25       142
weighted avg       0.20      0.30      0.22       142


Random Forest Classifier Accuracy: 0.3028169014084507
Random Forest Classifier Report:
              

## Model Evaluation & Comparison

In [35]:
models = {
    'm_regression':LogisticRegression(max_iter=4000),
    'm_forest':RandomForestClassifier(max_depth=5, min_samples_leaf=4, random_state=0),
    'SVC':SVC(kernel='linear', C=1.0),
    'KNN':KNeighborsClassifier(n_neighbors=3),
    'XGBoost': XGBClassifier(learning_rate=0.1, n_estimators=600, max_depth=5)
}
scores = ['Accuracy']
res = np.zeros(shape=(len(models),len(scores)))

In [37]:
for i, key in enumerate(models):
    mod = models[key].fit(X_train, Y_train)
    y_p = mod.predict(X_test)

    # accuracy_score expects (y_true, y_pred)
    res[i][0] = accuracy_score(Y_test, y_p)

In [38]:
table_np = tabulate(res, headers=scores, showindex=list(models.keys()), tablefmt='pretty')
print(table_np)

+--------------+---------------------+
|              |      Accuracy       |
+--------------+---------------------+
| m_regression | 0.28169014084507044 |
|   m_forest   | 0.29577464788732394 |
|     SVC      | 0.3028169014084507  |
|     KNN      | 0.1267605633802817  |
|   XGBoost    | 0.2676056338028169  |
+--------------+---------------------+
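
With only 142 test rows, accuracy from a single split can shift noticeably with the random seed. A possible complement is k-fold cross-validation, which averages the score over several splits. The sketch below uses synthetic data from `make_classification` as a placeholder and only two of the models; the fold count, the data shape and the `demo_models` name are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data (not the disease dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=0
)

demo_models = {
    'm_regression': LogisticRegression(max_iter=4000),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Stratified folds keep class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in demo_models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f'{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}')
```

Reporting the standard deviation next to the mean gives a rough sense of whether the gaps between models are larger than the fold-to-fold noise.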


## Model Analysis
Refitting and Analysis with GridSearchCV

In [41]:
from sklearn.model_selection import GridSearchCV

# Support Vector Classifier
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
}

svm_grid_search = GridSearchCV(SVC(random_state=42), svm_param_grid, refit=True, verbose=3, cv=3)
svm_grid_search.fit(X_train, Y_train)

# Random Forest Classifier
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_grid_search = GridSearchCV(RandomForestClassifier(random_state=42), rf_param_grid, refit=True, verbose=3, cv=3)
rf_grid_search.fit(X_train, Y_train)

# Naive Bayes Classifier (left at default settings; its smoothing parameter alpha is not tuned here)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, Y_train)

# Evaluate models
svm_best_model = svm_grid_search.best_estimator_
rf_best_model = rf_grid_search.best_estimator_

svm_accuracy = accuracy_score(Y_test, svm_best_model.predict(X_test))
rf_accuracy = accuracy_score(Y_test, rf_best_model.predict(X_test))
nb_accuracy = accuracy_score(Y_test, nb_classifier.predict(X_test))

print("SVM Classifier:")
print("Best Parameters Found:", svm_grid_search.best_params_)
print("Best Cross-validation Score:", svm_grid_search.best_score_)
print("Accuracy of the best model:", svm_accuracy * 100, "%")
print(classification_report(Y_test, svm_best_model.predict(X_test)))

print("\nRandom Forest Classifier:")
print("Best Parameters Found:", rf_grid_search.best_params_)
print("Best Cross-validation Score:", rf_grid_search.best_score_)
print("Accuracy of the best model:", rf_accuracy * 100, "%")
print(classification_report(Y_test, rf_best_model.predict(X_test)))

print("\nNaive Bayes Classifier:")
print("Accuracy of the model:", nb_accuracy * 100, "%")
print(classification_report(Y_test, nb_classifier.predict(X_test)))


Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV 1/3] END .C=0.1, gamma=scale, kernel=linear;, score=0.312 total time=  10.1s
[CV 2/3] END .C=0.1, gamma=scale, kernel=linear;, score=0.293 total time=  11.1s
[CV 3/3] END .C=0.1, gamma=scale, kernel=linear;, score=0.271 total time=   8.0s
[CV 1/3] END ....C=0.1, gamma=scale, kernel=rbf;, score=0.122 total time=   0.1s
[CV 2/3] END ....C=0.1, gamma=scale, kernel=rbf;, score=0.117 total time=   0.1s
[CV 3/3] END ....C=0.1, gamma=scale, kernel=rbf;, score=0.117 total time=   0.1s
[CV 1/3] END ...C=0.1, gamma=scale, kernel=poly;, score=0.101 total time=   0.1s
[CV 2/3] END ...C=0.1, gamma=scale, kernel=poly;, score=0.112 total time=   0.0s
[CV 3/3] END ...C=0.1, gamma=scale, kernel=poly;, score=0.117 total time=   0.0s
[CV 1/3] END ..C=0.1, gamma=auto, kernel=linear;, score=0.312 total time=   5.1s
[CV 2/3] END ..C=0.1, gamma=auto, kernel=linear;, score=0.293 total time=   5.5s
[CV 3/3] END ..C=0.1, gamma=auto, kernel=linear;


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





### Model Save

In [42]:
from joblib import dump

# Save the best Random Forest model
dump(rf_best_model, 'random_forest_model.joblib')

['random_forest_model.joblib']

### Testing

In [50]:
# Create an inverse mapping from numerical labels to text labels
inv_mapping = {v: k for k, v in out_mapping.items()}

print(inv_mapping)


{0: 'lyme_disease', 1: 'tungiasis', 2: 'zika', 3: 'rift_valley_fever', 4: 'west_nile_fever', 5: 'malaria', 6: 'chikungunya', 7: 'plague', 8: 'dengue', 9: 'yellow_fever', 10: 'japanese_encephalitis'}


In [43]:
from joblib import load

# Load the saved Random Forest model
loaded_rf_model = load('random_forest_model.joblib')

# Use the loaded model for prediction
predicted_labels = loaded_rf_model.predict(X_test)


In [49]:
# Get predicted labels and class probabilities for the whole test set
predicted_labels = rf_best_model.predict(X_test)
predicted_probabilities = rf_best_model.predict_proba(X_test)

# Convert numerical labels to text labels using the inverse mapping
predicted_diseases = [inv_mapping[label] for label in predicted_labels]

# Print predicted diseases and probabilities for the first few samples
for i in range(5):  # Adjust the range as needed
    print("Sample", i+1, "predicted disease:", predicted_diseases[i])
    print("Sample", i+1, "predicted probability:", max(predicted_probabilities[i]) * 100, "%")


Sample 1 predicted disease: chikungunya
Sample 1 predicted probability: 30.757711691902873 %
Sample 2 predicted disease: dengue
Sample 2 predicted probability: 24.701716420834078 %
Sample 3 predicted disease: dengue
Sample 3 predicted probability: 24.50604991413816 %
Sample 4 predicted disease: yellow_fever
Sample 4 predicted probability: 18.645529633764927 %
Sample 5 predicted disease: dengue
Sample 5 predicted probability: 29.122389572931382 %
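
Since the top-class probabilities above are low (roughly 19% to 31%), it may be more informative to report several candidate diseases per sample instead of one. The sketch below ranks classes with `numpy.argsort`; `toy_proba` and `toy_mapping` are made-up stand-ins for `predicted_probabilities` and `inv_mapping`, and `top_k_diseases` is a hypothetical helper. It assumes the probability columns line up with the integer codes 0..n-1, which holds for `predict_proba` only when every class appears in the training data.

```python
import numpy as np

# Hypothetical stand-ins for predicted_probabilities and inv_mapping
toy_proba = np.array([
    [0.10, 0.30, 0.25, 0.35],
    [0.50, 0.25, 0.15, 0.10],
])
toy_mapping = {0: 'lyme_disease', 1: 'tungiasis', 2: 'zika', 3: 'rift_valley_fever'}

def top_k_diseases(proba_row, mapping, k=3):
    """Return the k most probable (disease, probability) pairs, best first."""
    top_idx = np.argsort(proba_row)[::-1][:k]
    return [(mapping[int(i)], float(proba_row[i])) for i in top_idx]

for n, row in enumerate(toy_proba, start=1):
    print('Sample', n, top_k_diseases(row, toy_mapping))
```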
