## Disease Symptoms and Patient Profile Project

In this project we study how patients' symptoms and medical history relate to a set of diseases.
<br>The data is available as a CSV file, which we read right after the essential imports for the project.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import StratifiedKFold, GridSearchCV

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import roc_curve, auc, plot_roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report

from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from colorama import Fore
from colorama import Style

import warnings
warnings.filterwarnings('ignore')

The dataset was available on Kaggle at the following link: https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset/
It contains records for 349 patients, and its columns are:
<br>Disease: The name of the disease or medical condition.
<br>Fever: Indicates whether the patient has a fever (Yes/No).
<br>Cough: Indicates whether the patient has a cough (Yes/No).
<br>Fatigue: Indicates whether the patient experiences fatigue (Yes/No).
<br>Difficulty Breathing: Indicates whether the patient has difficulty breathing (Yes/No).
<br>Age: The age of the patient in years.
<br>Gender: The gender of the patient (Male/Female).
<br>Blood Pressure: The blood pressure level of the patient (Low/Normal/High).
<br>Cholesterol Level: The cholesterol level of the patient (Normal/High).
<br>Outcome Variable: The outcome variable indicating the result of the diagnosis or assessment for the specific disease (Positive/Negative).

In [2]:
df=pd.read_csv('Disease_symptom_and_patient_profile_dataset.csv')

In [3]:
df

| | Disease | Fever | Cough | Fatigue | Difficulty Breathing | Age | Gender | Blood Pressure | Cholesterol Level | Outcome Variable |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Influenza | Yes | No | Yes | Yes | 19 | Female | Low | Normal | Positive |
| 1 | Common Cold | No | Yes | Yes | No | 25 | Female | Normal | Normal | Negative |
| 2 | Eczema | No | Yes | Yes | No | 25 | Female | Normal | Normal | Negative |
| 3 | Asthma | Yes | Yes | No | Yes | 25 | Male | Normal | Normal | Positive |
| 4 | Asthma | Yes | Yes | No | Yes | 25 | Male | Normal | Normal | Positive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | Stroke | Yes | No | Yes | No | 80 | Female | High | High | Positive |
| 345 | Stroke | Yes | No | Yes | No | 85 | Male | High | High | Positive |
| 346 | Stroke | Yes | No | Yes | No | 85 | Male | High | High | Positive |
| 347 | Stroke | Yes | No | Yes | No | 90 | Female | High | High | Positive |
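
Since the first displayed row contains a 'Low' blood pressure value, it can be worth listing the distinct values each categorical column actually takes before encoding them. The following is only a minimal sketch: the small hand-made DataFrame `sample` stands in for `df`, and the helper `category_levels` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
sample = pd.DataFrame({
    "Disease": ["Influenza", "Common Cold", "Asthma", "Stroke"],
    "Fever": ["Yes", "No", "Yes", "Yes"],
    "Age": [19, 25, 25, 80],
    "Blood Pressure": ["Low", "Normal", "Normal", "High"],
    "Outcome Variable": ["Positive", "Negative", "Positive", "Positive"],
})


def category_levels(frame, skip=("Disease",)):
    """Return the sorted distinct values of every non-numeric column not in `skip`."""
    levels = {}
    for column in frame.columns:
        if column in skip or pd.api.types.is_numeric_dtype(frame[column]):
            continue
        levels[column] = sorted(frame[column].unique().tolist())
    return levels


levels = category_levels(sample)
for column, values in levels.items():
    print(f"{column}: {values}")
```

On the real data the same call would be `category_levels(df)`; 'Disease' is skipped by default only because it has many distinct values.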


Next let us check if any of the values are null:

In [4]:
df.isnull().sum()

Disease                 0
Fever                   0
Cough                   0
Fatigue                 0
Difficulty Breathing    0
Age                     0
Gender                  0
Blood Pressure          0
Cholesterol Level       0
Outcome Variable        0
dtype: int64

Good to see that there are no null values in the DataFrame.
Next we are going to check for duplicated rows:

In [13]:
df.loc[df.duplicated()]

| | Disease | Fever | Cough | Fatigue | Difficulty Breathing | Age | Gender | Blood Pressure | Cholesterol Level | Outcome Variable |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Asthma | Yes | Yes | No | Yes | 25 | Male | Normal | Normal | Positive |
| 7 | Influenza | Yes | Yes | Yes | Yes | 25 | Female | Normal | Normal | Positive |
| 9 | Hyperthyroidism | No | Yes | No | No | 28 | Female | Normal | Normal | Negative |
| 23 | Dengue Fever | Yes | No | Yes | No | 30 | Female | Normal | Normal | Negative |
| 35 | Asthma | Yes | Yes | No | Yes | 30 | Female | Normal | Normal | Positive |
| 40 | Bronchitis | Yes | Yes | Yes | Yes | 30 | Male | High | High | Positive |
| 59 | Asthma | No | Yes | Yes | Yes | 35 | Female | High | Normal | Negative |
| 69 | Pneumonia | Yes | Yes | Yes | Yes | 35 | Female | Normal | Normal | Negative |
| 73 | Rubella | Yes | No | Yes | No | 35 | Female | High | Normal | Negative |
| 76 | Asthma | Yes | Yes | No | Yes | 35 | Male | Normal | Normal | Positive |


`df.duplicated()` flags every row that repeats an earlier row in all columns, including 'Age' and 'Disease'; row 4, for example, is identical to row 3. The dataset has no patient identifier, however, so identical rows may well describe different patients who happen to share the same profile. For this reason we do not filter these rows out, and we keep them as separate records.
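
To make the behaviour of `duplicated()` concrete, here is a small sketch on a hand-made DataFrame (`toy` and its values are assumptions for illustration, not rows from the dataset). By default only the later copies are flagged; passing `keep=False` shows every member of a duplicate group, which makes it easier to compare the repeated rows side by side.

```python
import pandas as pd

# Illustrative data: rows 0 and 1 are identical in every column.
toy = pd.DataFrame({
    "Disease": ["Asthma", "Asthma", "Influenza", "Asthma"],
    "Fever": ["Yes", "Yes", "Yes", "Yes"],
    "Age": [25, 25, 25, 30],
    "Gender": ["Male", "Male", "Female", "Male"],
})

# Default keep="first": only the later copy (index 1) is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every row that belongs to a duplicate group (index 0 and 1).
all_copies = toy[toy.duplicated(keep=False)]

# Number of rows that repeat an earlier row.
n_repeats = int(toy.duplicated().sum())

print(later_copies)
print(all_copies)
print(n_repeats)
```

If the repeated rows were ever to be dropped, `toy.drop_duplicates()` would keep the first copy of each group; here they are kept on purpose.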

We will now plot each of the columns to get a sense of how their values are distributed.