In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

# Disease Prediction using Naive Bayes Classifier

## Analyse and Prepare the data
    
We will use the following Kaggle dataset: [Disease Prediction Kaggle Dataset](https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning)

### Load

In [2]:
train_df = pd.read_csv('../../../datasets/DiseasePrediction/Training.csv')
test_df = pd.read_csv('../../../datasets/DiseasePrediction/Testing.csv')

### Inspect the data

In [3]:
print(f'Train df shape: {train_df.shape}')
print(f'Test df shape: {test_df.shape}')

Train df shape: (4920, 134)
Test df shape: (42, 133)


In [4]:
train_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,Unnamed: 133
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Fungal infection,


In [5]:
test_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Allergy
2,0,0,0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,GERD
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Chronic cholestasis
4,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,Drug Reaction


In [6]:
# Check for NaN values
print(train_df.columns[train_df.isna().any()])
print(test_df.columns[test_df.isna().any()])

Index(['Unnamed: 133'], dtype='object')
Index([], dtype='object')


### Clean the dataset

The dataset contains binary symptom columns as features and the target column `prognosis`, which holds the disease name. The training set also has an extra `Unnamed: 133` column. It is the only column with NaN values and carries no information, so we drop it.

In [7]:
train_df.drop(columns='Unnamed: 133', inplace=True)
train_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
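Before dropping a column by name, it can help to confirm that it is completely empty. The sketch below uses a small toy frame (`toy_df` and its values are illustrative, not the Kaggle data) to show one way to find columns where every value is NaN and to drop them with `dropna`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    "itching": [1, 0, 1],
    "skin_rash": [1, 1, 0],
    "prognosis": ["Fungal infection", "Fungal infection", "Allergy"],
    "Unnamed: 133": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing
empty_cols = toy_df.columns[toy_df.isna().all()].tolist()
print(empty_cols)

# Drop only the columns that are entirely NaN; other columns are untouched
cleaned_df = toy_df.dropna(axis=1, how="all")
print(cleaned_df.columns.tolist())
```

Unlike dropping by name, `dropna(axis=1, how="all")` keeps working if the empty column is named differently in another export of the file.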


## Split to train/test sets

The dataset already ships as separate training and testing files, so we only need to split each one into features (X) and target (y).

In [8]:
X_train = train_df.drop(columns=['prognosis'])
y_train = train_df['prognosis']

X_test = test_df.drop(columns=['prognosis'])
y_test = test_df['prognosis']
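Because the two files are loaded separately, it is worth checking that both feature matrices contain the same columns in the same order. Here is a minimal sketch of such a check on toy frames (`toy_train`, `toy_test`, and their columns are hypothetical):

```python
import pandas as pd

# Toy frames whose feature columns match but appear in a different order
toy_train = pd.DataFrame({"itching": [1, 0], "chills": [0, 1], "prognosis": ["A", "B"]})
toy_test = pd.DataFrame({"chills": [1], "itching": [0], "prognosis": ["B"]})

X_toy_train = toy_train.drop(columns=["prognosis"])
X_toy_test = toy_test.drop(columns=["prognosis"])

# Do both frames contain exactly the same feature names?
same_features = set(X_toy_train.columns) == set(X_toy_test.columns)
print(same_features)

# Reorder the test columns so they follow the training order
X_toy_test = X_toy_test[X_toy_train.columns]
print(X_toy_test.columns.tolist())
```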

## Train the model

Since every feature is a binary (0/1) indicator of whether a symptom is present, Bernoulli Naive Bayes, which models each feature as a Bernoulli variable, is an appropriate classifier.

In [9]:
# Initialize the Naive Bayes classifier
bnb = BernoulliNB()

# Train the model
bnb.fit(X_train, y_train)
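To see what the fitted model stores, here is a small sketch on toy data (two hypothetical symptoms and two classes, not the Kaggle data). With the default `alpha=1.0`, `BernoulliNB` estimates P(symptom = 1 | class) as (count + 1) / (class size + 2), and these estimates are exposed in log form through `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data: rows are patients, columns are two hypothetical binary symptoms
X_toy = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

toy_model = BernoulliNB(alpha=1.0)
toy_model.fit(X_toy, y_toy)

# Smoothed estimates of P(symptom = 1 | class); one row per class
feature_probs = np.exp(toy_model.feature_log_prob_)
print(toy_model.classes_)
print(feature_probs)

# Predict the class of a patient who shows only the first symptom
toy_prediction = toy_model.predict([[1, 0]])
print(toy_prediction)
```

In class A the first symptom appears in 3 of 3 patients, so its estimate is (3 + 1) / (3 + 2) = 0.8. The second symptom appears in 1 of 3, giving (1 + 1) / (3 + 2) = 0.4.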


## Evaluate the model

In [10]:
# Make predictions on the test set
y_pred = bnb.predict(X_test)


In [11]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f'accuracy: {accuracy}')
print(classification_rep)

accuracy: 1.0
                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00         1
                                   AIDS       1.00      1.00      1.00         1
                                   Acne       1.00      1.00      1.00         1
                    Alcoholic hepatitis       1.00      1.00      1.00         1
                                Allergy       1.00      1.00      1.00         1
                              Arthritis       1.00      1.00      1.00         1
                       Bronchial Asthma       1.00      1.00      1.00         1
                   Cervical spondylosis       1.00      1.00      1.00         1
                            Chicken pox       1.00      1.00      1.00         1
                    Chronic cholestasis       1.00      1.00      1.00         1
                            Common Cold       1.00      1.00      1.00         1
             

### Report Analysis

This classification report gives an overview of the performance of the model for each class (disease) using precision, recall, and F1-score, along with the overall accuracy of the model. 

*Accuracy*: 1.00. The model correctly predicted every case in the test set.


For each disease:

- *Precision*: The proportion of true positive predictions out of all predictions made for that class (i.e., how many of the predicted cases of a disease were actually correct).
- *Recall*: The proportion of true positive predictions out of all actual occurrences of that disease (i.e., how many of the actual cases of a disease were correctly identified).
- *F1-score*: The harmonic mean of precision and recall. A high F1-score indicates a good balance between precision and recall.
- *Support*: The number of actual instances for that class in the test set.
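The definitions above can be checked on a small example where the scores are not all perfect. The labels below are invented for illustration; `precision_recall_fscore_support` returns one value per class, in the order given by `labels`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented labels: 3 actual "flu" cases and 2 actual "cold" cases
y_true_toy = ["flu", "flu", "flu", "cold", "cold"]
y_pred_toy = ["flu", "flu", "cold", "cold", "flu"]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=["cold", "flu"]
)

for name, p, r, f, s in zip(["cold", "flu"], precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

For "flu", three predictions were made and two were right (precision 2/3), and two of the three actual cases were found (recall 2/3).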


For example, for AIDS, the precision, recall, and F1-score are all 1.00, meaning all predictions for AIDS were correct, and all actual cases of AIDS were identified.

- *Macro avg*: The unweighted average of the precision, recall, and F1-scores across all classes.
- *Weighted avg*: The average of precision, recall, and F1-scores weighted by the number of instances in each class. This metric takes class imbalance into account.
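The difference between the two averages only shows up when classes have different supports. A minimal sketch with invented labels (4 cases of class "a", 2 of class "b"):

```python
from sklearn.metrics import precision_score

# Invented labels with unequal class sizes
y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "b", "b", "b"]

# Per-class precision: "a" = 3/3 = 1.0, "b" = 2/3
macro_precision = precision_score(y_true_toy, y_pred_toy, average="macro")
weighted_precision = precision_score(y_true_toy, y_pred_toy, average="weighted")

print(f"macro:    {macro_precision:.4f}")     # (1.0 + 2/3) / 2
print(f"weighted: {weighted_precision:.4f}")  # (4 * 1.0 + 2 * 2/3) / 6
```

When every class has the same support, the two averages are identical.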

In our case, every class (disease) has perfect precision, recall, and F1-score, so the model made no errors on this 42-row test set.
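The notebook imports `train_test_split` but never uses it. One possible use, sketched here under assumptions, is to hold out part of the training data as a validation set for a second accuracy estimate. The data below is synthetic (three classes, six binary features with 10% random bit flips) and only stands in for the real symptom matrix. All names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in: each class has a characteristic pair of binary "symptoms"
patterns = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
])
n_per_class = 40

X_parts, y_parts = [], []
for label, pattern in enumerate(patterns):
    # Flip each bit with probability 0.1 to add noise
    flips = (rng.random((n_per_class, patterns.shape[1])) < 0.1).astype(int)
    X_parts.append(np.abs(pattern - flips))
    y_parts.append(np.full(n_per_class, label))

X_syn = np.vstack(X_parts)
y_syn = np.concatenate(y_parts)

# Stratified hold-out split: 80% for fitting, 20% for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

val_model = BernoulliNB().fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, val_model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.2f}")
```

Passing `stratify=y_syn` keeps the class proportions the same in both parts of the split.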