# <font color='289C4E'>Classification with a Tabular Vector Borne Disease Dataset
<font><a class='anchor' id='top'></a>

## <font color='289C4E'>`Task`: Develop ML/DL models to Classify the various types of Vector Borne diseases<font><a class='anchor' id='top'></a>

### <font color='289C4E'>Understanding the task:<font><a class='anchor' id='top'></a>

<div class="alert alert-block alert-info"> 

📌 Classify the various types of Vector Borne diseases. 

What does this really mean? 

This is a multi-class classification problem: given which symptoms are present, we need to predict the correct disease.

The binary features in our dataset likely represent the presence or absence of certain symptoms, and each disease is associated with a characteristic combination of symptoms. We therefore need to train a machine learning model that learns the patterns in the symptom data and uses them to predict the disease behind a particular combination of symptoms.

</div>



<center>
<img src="https://media1.giphy.com/media/mDYzVAlAxLUGEg44cJ/giphy.gif?cid=ecf05e47mtrfo96qjga8rx7ewxc72lbzn8sm0z1wgv0n8s3a&rid=giphy.gif&ct=g" width=400>
</center>

# <font color='289C4E'>Table of contents<font><a class='anchor' id='top'></a>
- [Classification with a Tabular Vector Borne Disease Dataset](#1)
    - [Task & Understanding the task:](#2)
- [Data Description & Understanding I](#3)
    - [Data Understanding II](#3.1)
    - [Data Understanding III, Understanding our Features in Depth](#3.2)
    - [Data Description & Understanding IV](#3.3)
    
- [Importing Libraries & Data](#6)
    - [Load Data Modelling Libraries](#7)
- [Making our datasets available in our coding environment](#8)
    - [Reading in our csv files and putting them into a dataframe object](#9)

# <font color='289C4E'>Data Description & Understanding I<font><a class='anchor' id='top'></a>


The dataset for this competition (both train and test) was generated from a deep learning model trained on the Vector Borne Disease Prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance. Note that in the original dataset some prognoses contain spaces, but in the competition dataset spaces have been replaced with underscores to work with the MAP@K metric.

`Files`

`train.csv` - the training dataset; prognosis is the target

`test.csv` - the test dataset; your objective is to predict prognosis

`sample_submission.csv` - a sample submission file in the correct format



# <font color='289C4E'>Data Description II<font><a class='anchor' id='top'></a>


There are two datasets: the original dataset and the generated dataset. We will use both to compare and contrast their features.

Files from the generated dataset:

`train.csv` - the training dataset; `prognosis is the target`

`test.csv` - the test dataset; your objective is to `predict prognosis`

`sample_submission.csv` - a sample submission file in the correct format

Files from the original dataset:

`testt.csv`

`trainn.csv`

# <font color='289C4E'>Data Description & Understanding III<font><a class='anchor' id='top'></a>

The data covers the symptoms and prognoses associated with 11 vector-borne diseases:

- `Chikungunya`
- `Dengue`
- `Zika`
- `Yellow Fever`
- `Rift Valley Fever`
- `West Nile Fever`
- `Malaria`
- `Tungiasis`
- `Japanese Encephalitis`
- `Plague`
- `Lyme Disease`

# <font color='289C4E'>Data Description & Understanding IV<font><a class='anchor' id='top'></a>

Here is a brief description of each symptom. The list is long, but a better understanding of our data can help us identify key correlations and information beforehand:

- `sudden_fever`: A sudden increase in body temperature, often accompanied by chills or sweating.

- `headache`: A pain or discomfort in the head, often felt as pressure or throbbing.

- `mouth_bleed`: Bleeding from the mouth, which can be caused by various factors such as injury, infection, or disease.

- `nose_bleed`: Bleeding from the nose, which can be caused by various factors such as injury, infection, or disease.

- `muscle_pain`: Pain or discomfort in the muscles, often accompanied by stiffness or weakness.

- `joint_pain`: Pain or discomfort in the joints, often accompanied by swelling or stiffness.

- `vomiting`: The forceful expulsion of stomach contents through the mouth, often due to illness or infection.

- `rash`: A change in the appearance or texture of the skin, often characterized by redness, itching, or bumps.

- `diarrhea`: Frequent and watery bowel movements, often accompanied by stomach cramps or pain.

- `hypotension`: A lower than normal blood pressure, which can cause symptoms such as dizziness, fainting, or fatigue.

- `pleural_effusion`: A buildup of fluid between the layers of tissue that line the lungs and chest wall, often caused by infection or inflammation.

- `ascites`: A buildup of fluid in the abdominal cavity, often caused by liver disease or cancer.

- `gastro_bleeding`: Bleeding in the gastrointestinal tract, which can be caused by various factors such as ulcers, inflammation, or cancer.

- `swelling`: An enlargement or puffiness of a body part or area, often due to inflammation or injury.

- `nausea`: A feeling of discomfort or queasiness in the stomach, often accompanied by an urge to vomit.

- `chills`: A sensation of coldness or shivering, often accompanied by fever or infection.

- `myalgia`: Pain or discomfort in the muscles, similar to muscle pain.

- `digestion_trouble`: Difficulty in digesting food, often accompanied by symptoms such as bloating, gas, or heartburn.

- `fatigue`: A feeling of extreme tiredness or exhaustion, often accompanied by weakness or lack of motivation.

- `skin_lesions`: Abnormalities or changes in the appearance or texture of the skin, often characterized by bumps, blisters, or sores.
- `stomach_pain`: Pain or discomfort in the stomach or abdominal area, often accompanied by bloating or gas.
- `orbital_pain`: Pain or discomfort in the eye or orbital area, often accompanied by redness or swelling.
- `neck_pain`: Pain or discomfort in the neck or cervical spine, often accompanied by stiffness or limited range of motion.
- `weakness`: A lack of physical or mental strength, often accompanied by fatigue or lethargy.
- `back_pain`: Pain or discomfort in the back or lumbar spine, often accompanied by stiffness or limited range of motion.
- `weight_loss`: A decrease in body weight, often unintentional and accompanied by loss of muscle or fat tissue.
- `gum_bleed`: Bleeding from the gums, often caused by gum disease or injury.
- `jaundice`: A yellowing of the skin and whites of the eyes, often caused by liver disease or dysfunction.
- `coma`: A state of unconsciousness, in which a person is unresponsive to stimuli and cannot be awakened.
- `dizziness`: A sensation of lightheadedness or imbalance, often accompanied by a spinning sensation or loss of balance.
- `inflammation`: A response of the immune system to injury or infection, often characterized by redness, swelling, heat, or pain.

- `Red eyes`: Redness, itching, and swelling of the eyes.
- `Loss of appetite`: A decrease in the desire to eat.
- `Urination loss`: Difficulty or inability to pass urine.
- `Slow heart rate`: A heart rate that is slower than normal.
- `Abdominal pain`: Pain in the stomach or abdominal area.
- `Light sensitivity`: Sensitivity to light.
- `Yellow skin`: Yellowing of the skin due to liver dysfunction or other medical conditions.
- `Yellow eyes`: Yellowing of the eyes due to liver dysfunction or other medical conditions.
- `Facial distortion`: A change in the appearance or shape of the face.
- `Microcephaly`: A condition where the head is smaller than normal due to abnormal brain development.
- `Rigor`: Shivering or trembling due to a fever or cold.
- `Bitter tongue`: A bitter taste in the mouth.
- `Convulsion`: Involuntary muscle contractions and spasms.
- `Anemia`: A condition where there are not enough red blood cells or hemoglobin in the blood.
- `Coca-cola urine`: Dark urine that resembles the color of coca-cola due to the presence of blood or other medical conditions.
- `Hypoglycemia`: Low blood sugar levels.
- `Prostration`: Extreme physical weakness and fatigue.
- `Hyperpyrexia`: An extremely high fever.
- `Stiff neck`: A stiffness and pain in the neck and shoulder area.
- `Irritability`: A state of being easily annoyed or agitated.
- `Confusion`: A state of disorientation or lack of understanding.
- `Tremor`: Involuntary shaking or trembling of the body.
- `Paralysis`: Loss of muscle function and control.
- `Lymph swells`: Swelling of the lymph nodes.
- `Breathing restriction`: Difficulty breathing or shortness of breath.
- `Toe inflammation`: Swelling and redness of the toes.
- `Finger inflammation`: Swelling and redness of the fingers.
- `Lips irritation`: Swelling and redness of the lips.
- `Itchiness`: A sensation that provokes the desire to scratch the affected area.
- `Ulcers`: Open sores on the skin or mucous membranes.
- `Toenail loss`: Loss of toenails due to trauma, injury or fungal infection.
- `Speech problem`: Difficulty in speaking or expressing thoughts.
- `Bullseye rash`: A circular rash with a central spot and a ring-like appearance around it.
- `Prognosis`: In general, a prediction of the likely outcome or course of a medical condition. In this dataset it is the target column and holds the name of the disease.

# <font color='289C4E'>Importing Libraries & Data<font><a class='anchor' id='top'></a>

In [2]:
#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython.display import HTML, display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

#misc libraries
import random
import time

#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)

Python version: 3.8.3 (v3.8.3:6f8c8320e9, May 13 2020, 16:29:34) 
[Clang 6.0 (clang-600.0.57)]
pandas version: 2.0.0
matplotlib version: 3.7.1
NumPy version: 1.24.2
SciPy version: 1.10.1
IPython version: 8.12.0
scikit-learn version: 1.2.2
-------------------------


In [33]:

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('darkgrid')
pylab.rcParams['figure.figsize'] = 12,8

rc = {
    "axes.facecolor": "#F8F8F8",
    "figure.facecolor": "#F8F8F8",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7" + "30",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}

sns.set(rc=rc)
pal = ['#302c36', '#037d97', '#91013E', '#C09741',
           '#EC5B6D', '#90A6B1', '#6ca957', '#D8E3E2']



# <font color='289C4E'>Making our datasets available in our coding environment<font><a class='anchor' id='top'></a>

In [4]:
PATH = "/kaggle/input/playground-series-s3e13"

TRAIN_FILENAME = "/Users/richeyjay/Desktop/VectorBorneDiseaseClassification/env/Code/train.csv"
TEST_FILENAME = "/Users/richeyjay/Desktop/VectorBorneDiseaseClassification/env/Code/test.csv"


ORIGINAL_TRAIN_FILENAME = "/Users/richeyjay/Desktop/VectorBorneDiseaseClassification/env/Code/OriginalTrain.csv"
ORIGINAL_TEST_FILENAME = "/Users/richeyjay/Desktop/VectorBorneDiseaseClassification/env/Code/OriginalTest.csv"

SUBMISSION_FILENAME = "/kaggle/input/playground-series-s3e13/sample_submission.csv"
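
The cell above mixes a Kaggle input path with absolute local paths. As a rough sketch only, the paths could be derived from a single base directory so the notebook runs in either environment. The `LOCAL_DIR` folder and the helper names `resolve_data_dir` and `build_paths` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Assumed locations: the Kaggle path comes from the cell above,
# LOCAL_DIR is a hypothetical local folder used only for illustration.
KAGGLE_DIR = Path("/kaggle/input/playground-series-s3e13")
LOCAL_DIR = Path("data")


def resolve_data_dir(candidates):
    """Return the first candidate directory that exists, or None."""
    for directory in candidates:
        if directory.is_dir():
            return directory
    return None


def build_paths(base_dir):
    """Build the competition file paths relative to one base directory."""
    return {
        "train": base_dir / "train.csv",
        "test": base_dir / "test.csv",
        "submission": base_dir / "sample_submission.csv",
    }


# Fall back to the local folder when no candidate exists on this machine.
data_dir = resolve_data_dir([KAGGLE_DIR, LOCAL_DIR]) or LOCAL_DIR
paths = build_paths(data_dir)
print(paths["train"])
```

With this layout, `pd.read_csv(paths["train"])` would replace the hard-coded filename.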


# <font color='289C4E'>Reading in our csv files and putting them into a dataframe object<font><a class='anchor' id='top'></a>

In [5]:
original_train_data = pd.read_csv(ORIGINAL_TRAIN_FILENAME)
print("ORIGINAL TRAIN DATA (ROWS,COL):")
print(original_train_data.shape)
print('-'*50)

original_test_data = pd.read_csv(ORIGINAL_TEST_FILENAME)
print("ORIGINAL TEST DATA (ROWS,COL):")
print(original_test_data.shape)
print('-'*50)

train_data = pd.read_csv(TRAIN_FILENAME)
print("TRAIN DATA (ROWS,COL):")
print(train_data.shape)
print('-'*50)

test_data = pd.read_csv(TEST_FILENAME)
print("TEST DATA (ROWS,COL):")
print(test_data.shape)
print('-'*50)

ORIGINAL TRAIN DATA (ROWS,COL):
(252, 65)
--------------------------------------------------
ORIGINAL TEST DATA (ROWS,COL):
(11, 65)
--------------------------------------------------
TRAIN DATA (ROWS,COL):
(707, 66)
--------------------------------------------------
TEST DATA (ROWS,COL):
(303, 65)
--------------------------------------------------


<div class="alert alert-block alert-info"> 

📌

- The original training data has 252 rows and 65 columns. 

- The original test data has 11 rows and 65 columns. 

- The train data has 707 rows and 66 columns. 

- The test data has 303 rows and 65 columns. 

</div>
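
The shapes show that train has one more column than test. A quick way to see which columns differ is to compare the column sets. The sketch below uses small made-up frames in place of `train_data` and `test_data`; the helper name `column_differences` is hypothetical.

```python
import pandas as pd

# Toy frames standing in for train_data and test_data (values are made up).
toy_train = pd.DataFrame(
    {"id": [0, 1], "sudden_fever": [1.0, 0.0], "prognosis": ["Zika", "Dengue"]}
)
toy_test = pd.DataFrame({"id": [2, 3], "sudden_fever": [0.0, 1.0]})


def column_differences(left, right):
    """Return (columns only in left, columns only in right) as sorted lists."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)


only_train, only_test = column_differences(toy_train, toy_test)
print(only_train)  # ['prognosis']
print(only_test)   # []
```

Run on the real frames, the same call would be `column_differences(train_data, test_data)`.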

# <font color='289C4E'>Data.info() to see data types and nunique() for unique vals <font><a class='anchor' id='top'></a>

In [6]:
print(original_train_data.info())
print('-'*50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   sudden_fever           252 non-null    int64 
 1   headache               252 non-null    int64 
 2   mouth_bleed            252 non-null    int64 
 3   nose_bleed             252 non-null    int64 
 4   muscle_pain            252 non-null    int64 
 5   joint_pain             252 non-null    int64 
 6   vomiting               252 non-null    int64 
 7   rash                   252 non-null    int64 
 8   diarrhea               252 non-null    int64 
 9   hypotension            252 non-null    int64 
 10  pleural_effusion       252 non-null    int64 
 11  ascites                252 non-null    int64 
 12  gastro_bleeding        252 non-null    int64 
 13  swelling               252 non-null    int64 
 14  nausea                 252 non-null    int64 
 15  chills                 

In [7]:
print(original_test_data.info())
print('-'*50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   sudden_fever           11 non-null     int64 
 1   headache               11 non-null     int64 
 2   mouth_bleed            11 non-null     int64 
 3   nose_bleed             11 non-null     int64 
 4   muscle_pain            11 non-null     int64 
 5   joint_pain             11 non-null     int64 
 6   vomiting               11 non-null     int64 
 7   rash                   11 non-null     int64 
 8   diarrhea               11 non-null     int64 
 9   hypotension            11 non-null     int64 
 10  pleural_effusion       11 non-null     int64 
 11  ascites                11 non-null     int64 
 12  gastro_bleeding        11 non-null     int64 
 13  swelling               11 non-null     int64 
 14  nausea                 11 non-null     int64 
 15  chills                 11

<div class="alert alert-block alert-info"> 

📌

- The original training data has dtypes: int64(64), object(1). 
(64 columns of integer type and the prognosis is of object type)

- The original test data has dtypes: int64(64), object(1). 
(64 columns of integer type and the prognosis is of object type)

</div>

In [8]:
print(train_data.info())
print('-'*50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 707 entries, 0 to 706
Data columns (total 66 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     707 non-null    int64  
 1   sudden_fever           707 non-null    float64
 2   headache               707 non-null    float64
 3   mouth_bleed            707 non-null    float64
 4   nose_bleed             707 non-null    float64
 5   muscle_pain            707 non-null    float64
 6   joint_pain             707 non-null    float64
 7   vomiting               707 non-null    float64
 8   rash                   707 non-null    float64
 9   diarrhea               707 non-null    float64
 10  hypotension            707 non-null    float64
 11  pleural_effusion       707 non-null    float64
 12  ascites                707 non-null    float64
 13  gastro_bleeding        707 non-null    float64
 14  swelling               707 non-null    float64
 15  nausea

In [9]:
print(test_data.info())
print('-'*50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 65 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     303 non-null    int64  
 1   sudden_fever           303 non-null    float64
 2   headache               303 non-null    float64
 3   mouth_bleed            303 non-null    float64
 4   nose_bleed             303 non-null    float64
 5   muscle_pain            303 non-null    float64
 6   joint_pain             303 non-null    float64
 7   vomiting               303 non-null    float64
 8   rash                   303 non-null    float64
 9   diarrhea               303 non-null    float64
 10  hypotension            303 non-null    float64
 11  pleural_effusion       303 non-null    float64
 12  ascites                303 non-null    float64
 13  gastro_bleeding        303 non-null    float64
 14  swelling               303 non-null    float64
 15  nausea

<div class="alert alert-block alert-info"> 

📌

- The train data has dtypes: float64(64), int64(1), object(1). 
(64 float columns, 1 int column, which is 'id', and the object-type prognosis)

- The test data has dtypes: float64(64), int64(1). 
(64 float columns and the int 'id' column; no prognosis)

</div>

In [10]:
cols = original_train_data.columns.to_list()
print(cols)
original_train_data[cols].nunique()

['sudden_fever', 'headache', 'mouth_bleed', 'nose_bleed', 'muscle_pain', 'joint_pain', 'vomiting', 'rash', 'diarrhea', 'hypotension', 'pleural_effusion', 'ascites', 'gastro_bleeding', 'swelling', 'nausea', 'chills', 'myalgia', 'digestion_trouble', 'fatigue', 'skin_lesions', 'stomach_pain', 'orbital_pain', 'neck_pain', 'weakness', 'back_pain', 'weight_loss', 'gum_bleed', 'jaundice', 'coma', 'diziness', 'inflammation', 'red_eyes', 'loss_of_appetite', 'urination_loss', 'slow_heart_rate', 'abdominal_pain', 'light_sensitivity', 'yellow_skin', 'yellow_eyes', 'facial_distortion', 'microcephaly', 'rigor', 'bitter_tongue', 'convulsion', 'anemia', 'cocacola_urine', 'hypoglycemia', 'prostraction', 'hyperpyrexia', 'stiff_neck', 'irritability', 'confusion', 'tremor', 'paralysis', 'lymph_swells', 'breathing_restriction', 'toe_inflammation', 'finger_inflammation', 'lips_irritation', 'itchiness', 'ulcers', 'toenail_loss', 'speech_problem', 'bullseye_rash', 'prognosis']


sudden_fever       2
headache           2
mouth_bleed        2
nose_bleed         2
muscle_pain        2
                  ..
ulcers             2
toenail_loss       2
speech_problem     2
bullseye_rash      2
prognosis         11
Length: 65, dtype: int64

In [11]:
cols = original_test_data.columns.to_list()
print(cols)
original_test_data[cols].nunique()

['sudden_fever', 'headache', 'mouth_bleed', 'nose_bleed', 'muscle_pain', 'joint_pain', 'vomiting', 'rash', 'diarrhea', 'hypotension', 'pleural_effusion', 'ascites', 'gastro_bleeding', 'swelling', 'nausea', 'chills', 'myalgia', 'digestion_trouble', 'fatigue', 'skin_lesions', 'stomach_pain', 'orbital_pain', 'neck_pain', 'weakness', 'back_pain', 'weight_loss', 'gum_bleed', 'jaundice', 'coma', 'diziness', 'inflammation', 'red_eyes', 'loss_of_appetite', 'urination_loss', 'slow_heart_rate', 'abdominal_pain', 'light_sensitivity', 'yellow_skin', 'yellow_eyes', 'facial_distortion', 'microcephaly', 'rigor', 'bitter_tongue', 'convulsion', 'anemia', 'cocacola_urine', 'hypoglycemia', 'prostraction', 'hyperpyrexia', 'stiff_neck', 'irritability', 'confusion', 'tremor', 'paralysis', 'lymph_swells', 'breathing_restriction', 'toe_inflammation', 'finger_inflammation', 'lips_irritation', 'itchiness', 'ulcers', 'toenail_loss', 'speech_problem', 'bullseye_rash', 'prognosis']


sudden_fever       2
headache           2
mouth_bleed        2
nose_bleed         2
muscle_pain        2
                  ..
ulcers             2
toenail_loss       2
speech_problem     2
bullseye_rash      2
prognosis         11
Length: 65, dtype: int64

<div class="alert alert-block alert-info"> 

📌

In both original_train and original_test, most if not all of the feature columns are binary, and prognosis is a categorical column with 11 distinct classes.

</div>

In [12]:
cols = train_data.columns.to_list()
print(cols)
train_data[cols].nunique()

['id', 'sudden_fever', 'headache', 'mouth_bleed', 'nose_bleed', 'muscle_pain', 'joint_pain', 'vomiting', 'rash', 'diarrhea', 'hypotension', 'pleural_effusion', 'ascites', 'gastro_bleeding', 'swelling', 'nausea', 'chills', 'myalgia', 'digestion_trouble', 'fatigue', 'skin_lesions', 'stomach_pain', 'orbital_pain', 'neck_pain', 'weakness', 'back_pain', 'weight_loss', 'gum_bleed', 'jaundice', 'coma', 'diziness', 'inflammation', 'red_eyes', 'loss_of_appetite', 'urination_loss', 'slow_heart_rate', 'abdominal_pain', 'light_sensitivity', 'yellow_skin', 'yellow_eyes', 'facial_distortion', 'microcephaly', 'rigor', 'bitter_tongue', 'convulsion', 'anemia', 'cocacola_urine', 'hypoglycemia', 'prostraction', 'hyperpyrexia', 'stiff_neck', 'irritability', 'confusion', 'tremor', 'paralysis', 'lymph_swells', 'breathing_restriction', 'toe_inflammation', 'finger_inflammation', 'lips_irritation', 'itchiness', 'ulcers', 'toenail_loss', 'speech_problem', 'bullseye_rash', 'prognosis']


id                707
sudden_fever        2
headache            2
mouth_bleed         2
nose_bleed          2
                 ... 
ulcers              2
toenail_loss        2
speech_problem      2
bullseye_rash       2
prognosis          11
Length: 66, dtype: int64

In [13]:
cols = test_data.columns.to_list()
print(cols)
test_data[cols].nunique()

['id', 'sudden_fever', 'headache', 'mouth_bleed', 'nose_bleed', 'muscle_pain', 'joint_pain', 'vomiting', 'rash', 'diarrhea', 'hypotension', 'pleural_effusion', 'ascites', 'gastro_bleeding', 'swelling', 'nausea', 'chills', 'myalgia', 'digestion_trouble', 'fatigue', 'skin_lesions', 'stomach_pain', 'orbital_pain', 'neck_pain', 'weakness', 'back_pain', 'weight_loss', 'gum_bleed', 'jaundice', 'coma', 'diziness', 'inflammation', 'red_eyes', 'loss_of_appetite', 'urination_loss', 'slow_heart_rate', 'abdominal_pain', 'light_sensitivity', 'yellow_skin', 'yellow_eyes', 'facial_distortion', 'microcephaly', 'rigor', 'bitter_tongue', 'convulsion', 'anemia', 'cocacola_urine', 'hypoglycemia', 'prostraction', 'hyperpyrexia', 'stiff_neck', 'irritability', 'confusion', 'tremor', 'paralysis', 'lymph_swells', 'breathing_restriction', 'toe_inflammation', 'finger_inflammation', 'lips_irritation', 'itchiness', 'ulcers', 'toenail_loss', 'speech_problem', 'bullseye_rash']


id                303
sudden_fever        2
headache            2
mouth_bleed         2
nose_bleed          2
                 ... 
itchiness           2
ulcers              2
toenail_loss        2
speech_problem      2
bullseye_rash       2
Length: 65, dtype: int64

<div class="alert alert-block alert-info"> 

📌

- In both train and test, most if not all of the feature columns are binary, but they are stored as floats; we will probably cast them to integers later.

- The train data has a prognosis column, which is categorical with 11 distinct classes.

- Both datasets contain an 'id' column.

- The test data has no prognosis column.

</div>
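
Before casting the float features to integers, it is worth confirming that every value really is 0 or 1. The following is a minimal sketch on a made-up frame; the helper name `cast_binary_features` and the choice of `int8` are assumptions, not something the notebook prescribes.

```python
import pandas as pd

# Toy frame standing in for train_data (values are made up).
toy = pd.DataFrame(
    {
        "sudden_fever": [1.0, 0.0, 1.0],
        "headache": [0.0, 0.0, 1.0],
        "prognosis": ["Zika", "Dengue", "Zika"],
    }
)


def cast_binary_features(df, target="prognosis"):
    """Check that all non-target columns hold only 0/1, then cast them to int8."""
    feature_cols = [c for c in df.columns if c != target]
    # isin([0, 1]) is True for 0.0 and 1.0 as well, so float columns pass.
    if not df[feature_cols].isin([0, 1]).all().all():
        raise ValueError("Found feature values other than 0 and 1")
    # astype with a dict returns a new frame and leaves the target untouched.
    return df.astype({c: "int8" for c in feature_cols})


casted = cast_binary_features(toy)
print(casted.dtypes)
```

For the test frame, which has no prognosis column, the same function works because the target is simply absent from the column list.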

# <font color='289C4E'>Observing small samples of our datasets with .head() in order to get familiar with how our dataset looks and is organized<font><a class='anchor' id='top'></a>


In [14]:
original_train_data.head()

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,0,1,1,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
1,1,1,1,1,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
2,0,1,0,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
3,0,0,0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
4,1,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya


In [15]:
original_test_data.head()

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,1,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
1,1,0,0,0,1,1,1,1,0,1,...,0,0,0,0,0,0,1,0,0,Dengue
2,1,1,1,1,0,1,0,1,0,1,...,0,1,0,1,0,0,0,0,0,Rift Valley fever
3,1,1,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Yellow Fever
4,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Zika


In [16]:
train_data.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Lyme_disease
1,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tungiasis
2,2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,Lyme_disease
3,3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Zika
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,Rift_Valley_fever


In [17]:
test_data.head()

Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash
0,707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,708,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,709,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,710,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,711,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# <font color='289C4E'>Checking for duplicate values<font><a class='anchor' id='top'></a>


<div class="alert alert-block alert-info"> 

📌 Select duplicate rows based on all columns. 

</div>

In [19]:
duplicate = original_train_data[original_train_data.duplicated()]
 
print("Duplicate Rows in the Original Train Dataframe:")
 
# Print the resultant Dataframe
duplicate

Duplicate Rows in the Original Train Dataframe:


Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis


In [20]:
duplicate = original_test_data[original_test_data.duplicated()]
 
print("Duplicate Rows in the Original Test Dataframe:")
 
# Print the resultant Dataframe
duplicate

Duplicate Rows in the Original Test Dataframe:


Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis


In [21]:
duplicate = train_data[train_data.duplicated()]
 
print("Duplicate Rows in the Train Dataframe:")
 
# Print the resultant Dataframe
duplicate

Duplicate Rows in the Train Dataframe:


Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis


In [22]:
duplicate = test_data[test_data.duplicated()]
 
print("Duplicate Rows in the Test Dataframe:")
 
# Print the resultant Dataframe
duplicate

Duplicate Rows in the Test Dataframe:


Unnamed: 0,id,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,...,lymph_swells,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash


<div class="alert alert-block alert-info"> 

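📌  No duplicates were found in any of the datasets. Note that train and test still contain the unique 'id' column at this point, so a check over all columns cannot flag duplicate rows in those two frames.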

</div>
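
A check over all columns treats the unique `id` as part of each row. A minimal sketch of the same check with `id` excluded is shown below on a made-up frame; the helper name `duplicated_ignoring` is hypothetical.

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the same symptoms and label but differ in id.
toy = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "sudden_fever": [1.0, 1.0, 0.0],
        "headache": [0.0, 0.0, 1.0],
        "prognosis": ["Zika", "Zika", "Dengue"],
    }
)


def duplicated_ignoring(df, ignore=("id",)):
    """Return rows that repeat an earlier row once the ignored columns are left out."""
    subset = [c for c in df.columns if c not in ignore]
    return df[df.duplicated(subset=subset, keep="first")]


print(len(toy[toy.duplicated()]))     # 0: the id makes every row unique
print(len(duplicated_ignoring(toy)))  # 1: row with id 1 repeats row with id 0
```

On the real data this would be `duplicated_ignoring(train_data)` and `duplicated_ignoring(test_data)`, run before the `id` column is dropped.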

# <font color='289C4E'>Inventory of Data Types<font><a class='anchor' id='top'></a>

All of our features are `Numerical`

`Numerical` data is data expressed as numbers rather than in words or descriptive form. Often called quantitative data, it can be used directly in statistical and arithmetic calculations.

There are two subtypes of numerical data:

- `Discrete`

Discrete data represents countable items. It takes distinct, separate values such as 1, 2, 3, 4, 5, and so on, and the set of possible values can be finite or infinite.

- `Continuous`

Continuous data represents measurements. Its values can fall anywhere within an interval on the number line, so it is measured rather than counted.

Our features are discrete, and since each column takes only two values (0 or 1), most if not all of them are binary.

`*NOTE*` The features in train and test are stored as floats; we might have to cast them to integers.


<div class="alert alert-block alert-info"> 

📌 Prognosis is of object type and is categorical (we will have to encode it as numeric labels so the ML model can read it)

</div>
 



# <font color='289C4E'>Data Cleaning for 'id'<font><a class='anchor' id='top'></a>

In [23]:
#delete the ID column
drop_column = ['id']
train_data.drop(drop_column, axis=1, inplace = True)
print(train_data.isnull().sum())
print("-"*10)

test_data.drop(drop_column, axis=1, inplace = True)
print(test_data.isnull().sum())
print("-"*10)



sudden_fever      0
headache          0
mouth_bleed       0
nose_bleed        0
muscle_pain       0
                 ..
ulcers            0
toenail_loss      0
speech_problem    0
bullseye_rash     0
prognosis         0
Length: 65, dtype: int64
----------
sudden_fever      0
headache          0
mouth_bleed       0
nose_bleed        0
muscle_pain       0
                 ..
itchiness         0
ulcers            0
toenail_loss      0
speech_problem    0
bullseye_rash     0
Length: 64, dtype: int64
----------


# <font color='289C4E'>Type casting floats to ints and combining datasets<font><a class='anchor' id='top'></a>

In [74]:
original_train_data


Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,0,1,1,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
1,1,1,1,1,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
2,0,1,0,1,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
3,0,0,0,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
4,1,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,Chikungunya
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,0,0,1,1,0,1,1,1,0,0,...,0,1,0,0,1,0,0,1,1,Lyme disease
248,0,1,1,1,1,0,1,0,0,0,...,1,1,1,0,1,0,1,1,1,Lyme disease
249,0,1,1,1,0,0,0,1,1,1,...,1,1,0,0,1,0,1,1,1,Lyme disease
250,0,0,0,1,0,1,1,0,1,1,...,0,0,1,1,1,1,1,1,1,Lyme disease


In [75]:
train_data

Unnamed: 0,sudden_fever,headache,mouth_bleed,nose_bleed,muscle_pain,joint_pain,vomiting,rash,diarrhea,hypotension,...,breathing_restriction,toe_inflammation,finger_inflammation,lips_irritation,itchiness,ulcers,toenail_loss,speech_problem,bullseye_rash,prognosis
0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
2,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,3
3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
702,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
703,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
704,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10
705,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,5
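
The cells in this section only display the two frames. As a sketch of what the casting and combining step might look like, the example below harmonizes the labels (spaces to underscores, as described for the competition data), concatenates the two frames, and casts the features to integers. It assumes the prognosis column still holds string labels; the toy frames and the helper name `combine_datasets` are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for original_train_data and train_data (values are made up).
toy_original = pd.DataFrame(
    {
        "sudden_fever": [1, 0],
        "headache": [0, 1],
        "prognosis": ["Lyme disease", "Rift Valley fever"],
    }
)
toy_generated = pd.DataFrame(
    {
        "sudden_fever": [1.0, 0.0],
        "headache": [1.0, 1.0],
        "prognosis": ["Lyme_disease", "Zika"],
    }
)


def combine_datasets(original, generated, target="prognosis"):
    """Align label spelling, stack both frames, and cast features to int8."""
    original = original.copy()
    # The original data uses spaces in labels; the generated data uses underscores.
    original[target] = original[target].str.replace(" ", "_", regex=False)
    combined = pd.concat([generated, original], ignore_index=True)
    feature_cols = [c for c in combined.columns if c != target]
    return combined.astype({c: "int8" for c in feature_cols})


combined = combine_datasets(toy_original, toy_generated)
print(combined.shape)
print(sorted(combined["prognosis"].unique()))
```

Whether adding the original rows helps is an empirical question, so it is worth comparing validation scores with and without them.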


# <font color='289C4E'>Exploratory Data Analysis (EDA)<font><a class='anchor' id='top'></a>

In [69]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB

# Encode the output classes
label_encoder = LabelEncoder()
train_data["prognosis"] = label_encoder.fit_transform(train_data["prognosis"])

# Split the train_data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(train_data.drop("prognosis", axis=1), train_data["prognosis"], test_size=0.2)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Test the classifier on the testing set
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

# Decode the predicted output labels
y_pred = clf.predict(X_test)
y_pred = label_encoder.inverse_transform(y_pred)


Accuracy: 0.33
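
Accuracy only rewards the top prediction, while the data description mentions a MAP@K metric, which also gives partial credit when the true label appears lower in a ranked list. With a single true label per row, each row scores 1/rank if the label is within the top k predictions and 0 otherwise. The sketch below is one possible implementation, with k=3 as an assumption and random toy data in place of the real features; `map_at_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def map_at_k(y_true, proba, classes, k=3):
    """Mean average precision at k for problems with one true label per row."""
    # Column indices of the k highest-probability classes, best first.
    top_k = np.argsort(-proba, axis=1)[:, :k]
    scores = []
    for truth, row in zip(y_true, top_k):
        ranked = classes[row]
        hits = np.where(ranked == truth)[0]
        # 1/rank when the true label is in the top k, otherwise 0.
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))


# Random binary toy data, only to show the call pattern.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 8))
y = rng.integers(0, 4, size=60)

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba(X)
score = map_at_k(y, proba, clf.classes_, k=3)
print(f"MAP@3: {score:.3f}")
```

On the notebook's split, the call would be `map_at_k(y_test.to_numpy(), clf.predict_proba(X_test), clf.classes_)`.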
