# **Multiple Disease Prediction**

---



# **Objective**
The main objective of multiple disease prediction is to develop accurate and efficient models that can analyze various medical data inputs to predict the likelihood of multiple diseases in an individual. This helps in early detection, timely intervention, and personalized healthcare planning.
It aims to enhance preventive measures, reduce healthcare costs, and improve overall public health by leveraging advanced algorithms and data analytics to identify patterns and risk factors associated with different diseases.

# **Data Source**
The dataset used in this project was downloaded from [GitHub](https://github.com/YBIFoundation/Dataset/raw/main/MultipleDiseasePrediction.csv).

This dataset supports **multiple disease prediction** from reported symptoms. It has **4920 rows** and **133 columns**, where each row represents a patient. The first 132 columns are integer symptom indicators (1 if the symptom is present, 0 otherwise), such as **itching, skin_rash, nodal_skin_eruptions, continuous_sneezing, shivering, chills, joint_pain, stomach_pain, acidity, and ulcers_on_tongue**. The last column, **prognosis**, is the label and holds the name of the diagnosed disease. There are 41 distinct diseases, including fungal infection, diabetes, pneumonia, tuberculosis, heart attack, the common cold, and several forms of hepatitis.

# **Import Library**

In [1]:
import pandas as pd

# **Import Data**

In [2]:
df = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/MultipleDiseasePrediction.csv')

In [3]:
df.head()

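| | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | blackheads | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | prognosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |
| 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection |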


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 133 entries, itching to prognosis
dtypes: int64(132), object(1)
memory usage: 5.0+ MB


In [4]:
df.shape

(4920, 133)

In [5]:
df['prognosis'].value_counts()

Fungal infection                           120
Hepatitis C                                120
Hepatitis E                                120
Alcoholic hepatitis                        120
Tuberculosis                               120
Common Cold                                120
Pneumonia                                  120
Dimorphic hemmorhoids(piles)               120
Heart attack                               120
Varicose veins                             120
Hypothyroidism                             120
Hyperthyroidism                            120
Hypoglycemia                               120
Osteoarthristis                            120
Arthritis                                  120
(vertigo) Paroymsal  Positional Vertigo    120
Acne                                       120
Urinary tract infection                    120
Psoriasis                                  120
Hepatitis D                                120
Hepatitis B                                120
Allergy      

In [6]:
df['prognosis'].nunique()

41

# **Describe Data**

In [9]:
df.describe()

| | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | pus_filled_pimples | blackheads | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | ... | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 | 4920.0 |
| mean | 0.137805 | 0.159756 | 0.021951 | 0.045122 | 0.021951 | 0.162195 | 0.139024 | 0.045122 | 0.045122 | 0.021951 | ... | 0.021951 | 0.021951 | 0.021951 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 | 0.023171 |
| std | 0.34473 | 0.366417 | 0.146539 | 0.207593 | 0.146539 | 0.368667 | 0.346007 | 0.207593 | 0.207593 | 0.146539 | ... | 0.146539 | 0.146539 | 0.146539 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 | 0.150461 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [7]:
df.columns

Index(['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
       'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity',
       'ulcers_on_tongue',
       ...
       'blackheads', 'scurring', 'skin_peeling', 'silver_like_dusting',
       'small_dents_in_nails', 'inflammatory_nails', 'blister',
       'red_sore_around_nose', 'yellow_crust_ooze', 'prognosis'],
      dtype='object', length=133)

# **Define Target Variable (y) and Feature Variables (X)**

In [10]:
y = df['prognosis']
X = df.drop(['prognosis'],axis=1)

# **Train Test Split**

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=2529)

In [12]:
df.shape, X_train.shape, X_test.shape

((4920, 133), (3690, 132), (1230, 132))
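
The shapes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out for testing. The sketch below reproduces that arithmetic on a synthetic stand-in, not the real dataset. The 41 hypothetical class names, the 5 random binary columns, and all `*_demo` variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for illustration):
# 41 classes x 120 rows each, with 5 random binary "symptom" columns.
rng = np.random.default_rng(0)
n_classes, rows_per_class, n_features = 41, 120, 5
y_demo = pd.Series(np.repeat([f"disease_{i}" for i in range(n_classes)], rows_per_class))
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(n_classes * rows_per_class, n_features)),
    columns=[f"symptom_{j}" for j in range(n_features)],
)

# No test_size is passed, so scikit-learn holds out 25% of the rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=2529
)

print(X_tr.shape, X_te.shape)       # 3690 training rows, 1230 test rows
print(y_tr.value_counts().unique()) # every class has 90 training rows
print(y_te.value_counts().unique()) # every class has 30 test rows
```

Because each class has exactly 120 rows, stratification yields an exact 90/30 split per class, which matches the `y_train.value_counts()` output in the next cell.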

In [13]:
y_train.value_counts()

Varicose veins                             90
Acne                                       90
Bronchial Asthma                           90
Diabetes                                   90
Peptic ulcer diseae                        90
hepatitis A                                90
Tuberculosis                               90
Psoriasis                                  90
Paralysis (brain hemorrhage)               90
Arthritis                                  90
Hepatitis D                                90
Osteoarthristis                            90
Jaundice                                   90
Hypertension                               90
Dimorphic hemmorhoids(piles)               90
AIDS                                       90
Pneumonia                                  90
Common Cold                                90
Impetigo                                   90
Hepatitis C                                90
Hyperthyroidism                            90
Chronic cholestasis               

# **Modeling**

In [14]:
from sklearn.neighbors import KNeighborsClassifier

In [15]:
model = KNeighborsClassifier()

In [16]:
model.fit(X_train,y_train)
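
To make the voting rule behind `KNeighborsClassifier` concrete, here is a minimal from-scratch sketch of k-nearest-neighbours prediction. It is not scikit-learn's implementation: the `knn_predict` helper and the toy symptom rows are hypothetical, and tie-breaking may differ from the library. It does use Euclidean distance and an unweighted majority vote, which correspond to the classifier's default settings.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a label for x_new by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training row.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances (stable sort keeps ties in row order).
    nearest = np.argsort(distances, kind="stable")[:k]
    # Unweighted majority vote over the neighbours' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data for illustration only: four binary symptom columns, two labels.
X_toy = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
y_toy = ["Fungal infection"] * 3 + ["Common Cold"] * 3

predicted = knn_predict(X_toy, y_toy, [1, 0, 0, 0], k=3)
print(predicted)  # Fungal infection
```

At least two of the three nearest rows belong to the first label, so the vote is decided regardless of how the distance tie for third place is broken.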

# **Prediction**

In [17]:
y_pred = model.predict(X_test)

# **Model Evaluation**

In [18]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [19]:
confusion_matrix(y_test,y_pred)

array([[30,  0,  0, ...,  0,  0,  0],
       [ 0, 30,  0, ...,  0,  0,  0],
       [ 0,  0, 30, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ..., 30,  0,  0],
       [ 0,  0,  0, ...,  0, 30,  0],
       [ 0,  0,  0, ...,  0,  0, 30]])
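
The notebook imports `accuracy_score` but the cells shown never call it. As a small sketch of how it relates to the confusion matrix, accuracy equals the sum of the diagonal divided by the total number of samples. The labels below are hypothetical and chosen only so that one prediction is wrong; on the notebook's data the equivalent call would be `accuracy_score(y_test, y_pred)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (not taken from the dataset).
y_true_demo = ["Acne", "Acne", "Allergy", "Allergy", "Arthritis", "Arthritis"]
y_pred_demo = ["Acne", "Acne", "Allergy", "Acne", "Arthritis", "Arthritis"]

# Rows are true labels, columns are predicted labels (sorted alphabetically).
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Correct predictions sit on the diagonal.
acc_from_cm = np.trace(cm) / cm.sum()
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc_from_cm, acc)  # both equal 5/6
```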

In [20]:
print(classification_report(y_test,y_pred))

                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00        30
                                   AIDS       1.00      1.00      1.00        30
                                   Acne       1.00      1.00      1.00        30
                    Alcoholic hepatitis       1.00      1.00      1.00        30
                                Allergy       1.00      1.00      1.00        30
                              Arthritis       1.00      1.00      1.00        30
                       Bronchial Asthma       1.00      1.00      1.00        30
                   Cervical spondylosis       1.00      1.00      1.00        30
                            Chicken pox       1.00      1.00      1.00        30
                    Chronic cholestasis       1.00      1.00      1.00        30
                            Common Cold       1.00      1.00      1.00        30
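
Every class visible in the report scores 1.00. When results are this clean, one optional sanity check is whether identical symptom rows appear in both the training and test sets, since kNN would then effectively look up a row it has already stored. The helper below is a hypothetical sketch: the function name and the toy frames are assumptions, not part of the notebook, and nothing here establishes whether the real dataset has such overlap.

```python
import pandas as pd


def count_test_rows_seen_in_train(X_train, X_test):
    """Return how many test rows have an identical feature row in the training set."""
    # Deduplicate the training rows so each test row can match at most once.
    train_rows = X_train.drop_duplicates()
    merged = X_test.merge(
        train_rows, how="left", on=list(X_test.columns), indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Toy frames for illustration only.
train_demo = pd.DataFrame({"itching": [1, 1, 0], "skin_rash": [1, 1, 0]})
test_demo = pd.DataFrame({"itching": [1, 0, 0], "skin_rash": [1, 1, 0]})

overlap = count_test_rows_seen_in_train(train_demo, test_demo)
print(overlap)  # 2 of the 3 toy test rows also occur in the toy training set
```

On the notebook's data the call would be `count_test_rows_seen_in_train(X_train, X_test)`. A large overlap would suggest that the test score measures memorization more than generalization.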
                           

# **Explanation**

* I have imported the pandas library, which is a popular tool for data analysis and manipulation in Python.
* I have loaded the dataset from a GitHub repository using the pd.read_csv function. The dataset contains 4920 rows and 133 columns, where each row represents a patient and each column represents a symptom or the prognosis (the final diagnosis).
* I have used the df.head() method to display the first five rows of the dataset, which gives a glimpse of the data structure and values.
* I have used the df.info() method to get some basic information about the dataset, such as the data types, the number of entries, and the memory usage.
* I have used the df.shape attribute to get the dimensions of the dataset, which are (4920, 133).
* I have used the df['prognosis'].value_counts() method to get the frequency distribution of the prognosis column, which shows that there are 41 unique diseases in the dataset and each disease has 120 occurrences.
* I have used the df['prognosis'].nunique() method to get the number of unique diseases in the dataset, which is 41.
* I have used the df.describe() method to get some descriptive statistics of the dataset, such as the mean, standard deviation, minimum, maximum, and quartiles of each numerical column.
* I have used the df.columns attribute to get the names of all the columns in the dataset, which are the 132 symptoms followed by the prognosis.
* I have defined the target variable (y) as the prognosis column and the feature variables (X) as the rest of the columns, using the df.drop function to remove the prognosis column from the feature set.
* I have used the train_test_split function from the sklearn library to split the data into training and testing sets, with a stratified sampling strategy to preserve the class distribution of the target variable and a random state of 2529 to ensure reproducibility. Since no test size is specified, the default of 25% of the rows is held out for testing.
* I have used the shape attribute of df, X_train, and X_test to check the dimensions of the original, training, and testing sets, which are (4920, 133), (3690, 132), and (1230, 132) respectively.
* I have used the y_train.value_counts function to check the frequency distribution of the target variable in the training set, which shows that each disease has 90 occurrences.
* I have applied the k-nearest neighbors (kNN) algorithm to the multiple disease prediction problem. This is a supervised machine learning method that assigns a new data point to the most common class label among its k closest neighbors in the feature space. I have used the scikit-learn library to implement the kNN algorithm in Python.
* I have imported the KNeighborsClassifier class from the sklearn.neighbors module, which provides the functionality for the kNN algorithm.
* I have created an instance of the KNeighborsClassifier class with the default value of k (n_neighbors), which is 5. This can be set to any positive integer no larger than the number of training samples. Smaller values follow local structure in the data more closely, while larger values smooth the decision over more neighbors.
* I have fitted the model to the training data using the fit method. For kNN this mainly stores the input features (X_train) and the corresponding class labels (y_train) so that neighbors can be looked up at prediction time.
* I have used the predict method to make predictions on the testing data (X_test), which returns an array of predicted class labels (y_pred) for each data point in the testing set.
* I have imported the confusion_matrix, classification_report, and accuracy_score functions from the sklearn.metrics module, which provide various ways to evaluate the performance of the model. Note that accuracy_score is imported but not called in the cells shown.
* I have used the confusion_matrix function to generate a 41 × 41 matrix in which rows correspond to the true class labels and columns to the predicted class labels. The diagonal elements count the correct predictions, while the off-diagonal elements count the misclassifications. A perfect classifier would have zeros in all off-diagonal elements. Here every displayed diagonal entry is 30, which matches the 30 test samples per class.
* I have used the print function to display the output of the classification_report function, which provides a summary of the precision, recall, f1-score, and support for each class label. Precision is the ratio of true positives to the total number of positive predictions, recall is the ratio of true positives to the total number of actual positives, f1-score is the harmonic mean of precision and recall, and support is the number of occurrences of each class label in the testing set. The report also shows the overall accuracy and the macro and weighted averages of these metrics across all the class labels, which summarize the model's overall performance.
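
The precision, recall, and f1-score definitions in the last point can be checked by hand for a single class. The sketch below does this on hypothetical labels (not taken from the dataset) and compares the result with scikit-learn's `precision_recall_fscore_support`, which computes the same per-class quantities that `classification_report` prints.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for illustration only.
y_true_demo = ["Acne", "Acne", "Acne", "Allergy", "Allergy", "Allergy"]
y_pred_demo = ["Acne", "Acne", "Acne", "Acne", "Allergy", "Allergy"]

target = "Acne"
pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == target and p == target for t, p in pairs)  # true positives
fp = sum(t != target and p == target for t, p in pairs)  # false positives
fn = sum(t == target and p != target for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # 3 / 4
recall = tp / (tp + fn)     # 3 / 3
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Same quantities from scikit-learn, restricted to the target class.
sk_p, sk_r, sk_f, sk_support = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, labels=[target]
)
print(precision, recall, f1)
print(sk_p[0], sk_r[0], sk_f[0], sk_support[0])
```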
