# **Disease Prediction Based on Symptoms**

This project predicts diseases from input symptoms using machine learning. It analyzes a dataset of symptoms and their associated diseases, encodes each symptom as a numerical severity weight, and trains a neural network classifier (scikit-learn's `MLPClassifier`) on the result. The model reaches 99% accuracy on the held-out test set.

Users enter their symptoms, the system maps them to severity weights, and the trained model predicts a potential disease. The model is trained and evaluated with scikit-learn and can be saved and reloaded for later predictions. The project demonstrates how AI and data processing can assist in early and accurate disease detection based on common symptoms.

# **Data Analysis**



In [None]:
!pip install pandas matplotlib seaborn scikit-learn gdown


In [69]:
!gdown --id 1z3C62GGtnWQ0YqRReyJk4BK1Wsjezg4Z


Downloading...
From: https://drive.google.com/uc?id=1z3C62GGtnWQ0YqRReyJk4BK1Wsjezg4Z
To: /content/dataset.csv
100% 633k/633k [00:00<00:00, 101MB/s]


**Import Dependencies**

In [70]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import metrics
from sklearn.model_selection import train_test_split


**1. Load the Dataset**

In [71]:
dataset = pd.read_csv('/content/dataset.csv')

In [72]:
dataset

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4920,Fungal infection,skin rash,itching,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4921,Fungal infection,itching,skin rash,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4922,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0
4923,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0


In [73]:
# Types of Disease in dataset
dataset['Disease'].value_counts()

Disease,count
Impetigo,123
Fungal infection,122
GERD,120
Hypothyroidism,120
Alcoholic hepatitis,120
Tuberculosis,120
Common Cold,120
Pneumonia,120
Dimorphic hemmorhoids(piles),120
Heart attack,120


**2. Get the Statistical Details**

In [74]:
dataset.describe()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
count,4925,4925,4925,4925,4577,3719,2939,2273,1949,1697,1517,1199,749,509,311,245,197,77
unique,41,37,52,56,53,41,33,27,22,23,22,19,12,9,5,4,4,2
top,Impetigo,vomiting,vomiting,fatigue,high_fever,headache,nausea,abdominal_pain,abdominal_pain,yellowing_of_eyes,yellowing_of_eyes,irritability,malaise,muscle_pain,chest_pain,chest_pain,blood_in_sputum,muscle_pain
freq,123,822,870,726,378,348,390,264,276,228,198,120,126,72,96,144,72,72


**3. Get information about dataset**

In [75]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4925 entries, 0 to 4924
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Disease     4925 non-null   object
 1   Symptom_1   4925 non-null   object
 2   Symptom_2   4925 non-null   object
 3   Symptom_3   4925 non-null   object
 4   Symptom_4   4577 non-null   object
 5   Symptom_5   3719 non-null   object
 6   Symptom_6   2939 non-null   object
 7   Symptom_7   2273 non-null   object
 8   Symptom_8   1949 non-null   object
 9   Symptom_9   1697 non-null   object
 10  Symptom_10  1517 non-null   object
 11  Symptom_11  1199 non-null   object
 12  Symptom_12  749 non-null    object
 13  Symptom_13  509 non-null    object
 14  Symptom_14  311 non-null    object
 15  Symptom_15  245 non-null    object
 16  Symptom_16  197 non-null    object
 17  Symptom_17  77 non-null     object
dtypes: object(18)
memory usage: 692.7+ KB
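The non-null counts above show how sparse the later symptom columns are: only 77 of the 4925 rows have a seventeenth symptom. A minimal sketch of how the same information could be expressed as missing fractions and symptoms per row follows; the small `demo` frame is an illustrative stand-in, not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for the symptom table (not the real data)
demo = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection", "Impetigo", "Impetigo"],
    "Symptom_1": ["itching", "skin_rash", "skin_rash", "skin_rash"],
    "Symptom_4": ["dischromic _patches", None, "red_sore_around_nose", None],
})

# Fraction of missing cells in each column
missing_fraction = demo.isna().mean()

# Number of symptoms actually recorded in each row
symptoms_per_row = demo.drop(columns="Disease").notna().sum(axis=1)

print(missing_fraction)
print(symptoms_per_row.tolist())
```

On the real `dataset`, the same two expressions would apply unchanged.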


In [76]:
dataset


Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4920,Fungal infection,skin rash,itching,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4921,Fungal infection,itching,skin rash,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4922,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0
4923,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0


**4. Cleaning the Categorical Data before Numerical Encoding**

In [77]:
for col in dataset.columns:
    dataset[col] = dataset[col].str.replace('_',' ')

In [78]:
dataset.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,
1,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin rash,dischromic patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin rash,nodal skin eruptions,,,,,,,,,,,,,,


In [79]:
# Removing the white space in each cell of Dataframe
cols = dataset.columns
data = dataset[cols].values.flatten()

s = pd.Series(data)
s = s.str.strip()
s = s.values.reshape(dataset.shape)

dataset = pd.DataFrame(s,columns = cols)
dataset.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,
1,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin rash,dischromic patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin rash,nodal skin eruptions,,,,,,,,,,,,,,
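One detail of the two cleaning steps above: replacing `_` with a space in a value such as `dischromic _patches` leaves two consecutive spaces, and `str.strip()` only trims the ends of a string. A minimal sketch of a normalizer that also collapses internal whitespace is shown below. The helper name `normalize_symptom` and the leading-space example are assumptions made for illustration.

```python
def normalize_symptom(cell):
    """Replace underscores with spaces and collapse repeated whitespace."""
    if not isinstance(cell, str):
        return cell  # leave NaN or numeric cells untouched
    # split() with no arguments splits on any run of whitespace
    return " ".join(cell.replace("_", " ").split())

examples = ["dischromic _patches", " skin_rash", "itching"]
cleaned = [normalize_symptom(s) for s in examples]
print(cleaned)
```

In the notebook this could be applied cell by cell with `dataset.apply(lambda col: col.map(normalize_symptom))`. The symptom names in the severity table would need the same normalization for the two to match exactly.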


Handling the missing values

In [80]:
dataset.fillna(0, inplace = True)

In [81]:
dataset

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin rash,nodal skin eruptions,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,itching,nodal skin eruptions,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,itching,skin rash,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,itching,skin rash,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4920,Fungal infection,skin rash,itching,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4921,Fungal infection,itching,skin rash,nodal skin eruptions,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4922,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0
4923,Impetigo,skin rash,high fever,blister,red sore around nose,yellow crust ooze,0,0,0,0,0,0,0,0,0,0,0,0


# **Symptoms Severity**

**1. Load the Dataset**

In [82]:
!gdown --id 1UNejFIcIwu7ykEAE2ONMbZHlPiM7oGug

Downloading...
From: https://drive.google.com/uc?id=1UNejFIcIwu7ykEAE2ONMbZHlPiM7oGug
To: /content/Symptom-severity.csv
100% 2.33k/2.33k [00:00<00:00, 8.11MB/s]


In [83]:
severity = pd.read_csv('/content/Symptom-severity.csv')

In [84]:
severity.head()

Unnamed: 0,Symptom,weight
0,itching,1
1,skin_rash,3
2,nodal_skin_eruptions,4
3,continuous_sneezing,4
4,shivering,5


In [85]:
# Replace the underscores in the symptom names with spaces
severity['Symptom'] = severity['Symptom'].str.replace('_',' ')

Encoding the dataset symptoms with the given weights

In [86]:
vals = dataset.values
symptoms = severity['Symptom'].unique()

In [87]:
for i in range(len(symptoms)):
    vals[vals == symptoms[i]] = severity[severity['Symptom'] == symptoms[i]]['weight'].values[0]
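The loop above scans the whole array once per symptom. An alternative sketch, assuming the same symptom-to-weight table, builds a dictionary once and maps each symptom column through it. The tiny `severity_demo` and `dataset_demo` frames are illustrative stand-ins. Note one difference: if a symptom name appears twice in the severity table, `dict(zip(...))` keeps the last weight, whereas the loop above takes the first.

```python
import pandas as pd

# Illustrative stand-ins for the severity table and the cleaned dataset
severity_demo = pd.DataFrame({
    "Symptom": ["itching", "skin rash", "nodal skin eruptions"],
    "weight": [1, 3, 4],
})
dataset_demo = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": ["itching", "skin rash"],
    "Symptom_2": ["skin rash", "unknown symptom"],
    "Symptom_3": ["nodal skin eruptions", 0],
})

# Build the lookup once
weights = dict(zip(severity_demo["Symptom"], severity_demo["weight"]))

symptom_cols = [c for c in dataset_demo.columns if c.startswith("Symptom_")]
encoded = dataset_demo.copy()
for col in symptom_cols:
    # Known names become weights, unknown names become 0, numbers pass through
    encoded[col] = dataset_demo[col].map(
        lambda v: weights.get(v, 0) if isinstance(v, str) else v
    )

print(encoded)
```

Mapping unknown names straight to 0 hides mismatches, so it may be worth listing them first before deciding how to treat them.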

Convert the encoded array back into a DataFrame

In [88]:
vals

array([['Fungal infection', 1, 3, ..., 0, 0, 0],
       ['Fungal infection', 3, 4, ..., 0, 0, 0],
       ['Fungal infection', 1, 4, ..., 0, 0, 0],
       ...,
       ['Impetigo', 3, 7, ..., '0', '0', '0'],
       ['Impetigo', 3, 7, ..., '0', '0', '0'],
       ['Impetigo', 3, 2, ..., '0', '0', '0']], dtype=object)

In [89]:
cols = dataset.columns
dataset = pd.DataFrame(vals,columns=cols)
dataset

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,1,3,4,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,3,4,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,1,4,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,1,3,dischromic patches,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4920,Fungal infection,3,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4921,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4922,Impetigo,3,7,4,2,3,0,0,0,0,0,0,0,0,0,0,0,0
4923,Impetigo,3,7,4,2,3,0,0,0,0,0,0,0,0,0,0,0,0


Check for cells that were not encoded

In [90]:
temp = []
for i in range(1,18):
  temp.append(dataset[f'Symptom_{i}'].value_counts())

In [91]:
temp

[Symptom_1
 3    1930
 4    1074
 5     972
 1     679
 6     120
 2     120
 7      30
 Name: count, dtype: int64,
 Symptom_2
 3                      1699
 5                      1500
 4                      1116
 7                       350
 2                       133
 6                       108
 foul smell of urine      18
 1                         1
 Name: count, dtype: int64,
 Symptom_3
 4                      1979
 3                      1014
 5                       852
 7                       378
 2                       360
 6                       222
 foul smell of urine      84
 dischromic  patches      36
 Name: count, dtype: int64,
 Symptom_4
 4                      1110
 5                       900
 3                       846
 2                       704
 7                       553
 0                       348
 6                       348
 dischromic  patches      72
 spotting  urination      42
 0                         2
 Name: count, dtype: int64,
 Symptom_5
 4

In [92]:
dataset = dataset.replace('spotting  urination',0)
dataset = dataset.replace('dischromic  patches',0)
dataset = dataset.replace('foul smell of urine',0)
dataset



Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4920,Fungal infection,3,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4921,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4922,Impetigo,3,7,4,2,3,0,0,0,0,0,0,0,0,0,0,0,0
4923,Impetigo,3,7,4,2,3,0,0,0,0,0,0,0,0,0,0,0,0
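The three replacements above handle the leftover names found by inspection, and the value counts also show both an integer `0` and a string `'0'` in the same column. A more general sketch, assuming every non-numeric leftover should become 0, coerces each symptom column with `pd.to_numeric` and records what was dropped. The `encoded_demo` frame is an illustrative stand-in.

```python
import pandas as pd

# Illustrative stand-in for the partly encoded dataset
encoded_demo = pd.DataFrame({
    "Disease": ["Fungal infection", "Impetigo"],
    "Symptom_1": [1, 3],
    "Symptom_2": ["dischromic  patches", 7],
    "Symptom_3": [0, "0"],
})

symptom_cols = [c for c in encoded_demo.columns if c.startswith("Symptom_")]
leftovers = {}
for col in symptom_cols:
    # Non-numeric cells become NaN; numeric strings such as '0' become numbers
    numeric = pd.to_numeric(encoded_demo[col], errors="coerce")
    bad = encoded_demo.loc[numeric.isna(), col]
    if not bad.empty:
        leftovers[col] = sorted(set(bad))
    encoded_demo[col] = numeric.fillna(0).astype(int)

print(leftovers)
print(encoded_demo.dtypes)
```

After this step the symptom columns are plain integers, so the feature matrix no longer has `dtype=object`.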


Check the ratio of symptoms to diseases

In [93]:
print("Number of symptoms used to identify the disease ",len(severity['Symptom'].unique()))
print("Number of diseases that can be identified ",len(dataset['Disease'].unique()))

Number of symptoms used to identify the disease  132
Number of diseases that can be identified  41


# **Split the Dataset into Training and Testing Sets**

**Encoded Dataset**

In [94]:
dataset.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Fungal infection,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Fungal infection,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Fungal infection,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Fungal infection,1,3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Split the dataset into features (X) and labels (y)**

In [95]:
X = dataset.iloc[:,1:].values
y = dataset['Disease'].values

In [96]:
X

array([[1, 3, 4, ..., 0, 0, 0],
       [3, 4, 0, ..., 0, 0, 0],
       [1, 4, 0, ..., 0, 0, 0],
       ...,
       [3, 7, 4, ..., '0', '0', '0'],
       [3, 7, 4, ..., '0', '0', '0'],
       [3, 2, 4, ..., '0', '0', '0']], dtype=object)

In [97]:
y

array(['Fungal infection', 'Fungal infection', 'Fungal infection', ...,
       'Impetigo', 'Impetigo', 'Impetigo'], dtype=object)

In [98]:
y.dtype

dtype('O')

**Split X and y into training and testing sets**

In [99]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [100]:
print(X_train.dtype)
print(X_test.dtype)
print(y_train.dtype)
print(y_test.dtype)


object
object
object
object


In [101]:
# get the set size
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


(3940, 17)
(3940,)
(985, 17)
(985,)
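The split above is purely random, so the number of test rows per disease can vary from class to class. One hedged variant passes `stratify=y` to `train_test_split` so that each class keeps the same proportion in both sets. The sketch below uses synthetic data (`X_demo`, `y_demo`) rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 rows, 17 symptom weights, 4 equally sized classes
X_demo = rng.integers(0, 8, size=(100, 17))
y_demo = np.repeat(["A", "B", "C", "D"], 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each class contributes the same share to the test set
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```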


## **Model: Neural Network --> MLPClassifier**

In [102]:
from sklearn.neural_network import MLPClassifier

In [103]:
model_MLPC = MLPClassifier()

Train Model

In [104]:
model_MLPC.fit(X_train,y_train)
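`MLPClassifier()` with default arguments starts from random weights, so results can differ slightly between runs, and multi-layer perceptrons are sensitive to feature scaling. A minimal sketch of a reproducible variant follows, assuming a fixed `random_state` and a `StandardScaler` step are acceptable. The synthetic data and the chosen `max_iter` are illustrative, not the notebook's settings.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded symptom matrix: two well-separated classes
X_demo = np.vstack([
    rng.integers(0, 3, size=(40, 17)),
    rng.integers(5, 8, size=(40, 17)),
])
y_demo = np.array(["Disease A"] * 40 + ["Disease B"] * 40)

# Scale the features, then fit an MLP with a fixed seed
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=500, random_state=42),
)
model.fit(X_demo, y_demo)

train_accuracy = model.score(X_demo, y_demo)
print(train_accuracy)
```

Because the scaler is part of the pipeline, saving the pipeline object also saves the scaling parameters needed at prediction time.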



Make the prediction

In [105]:
y_MLPC = model_MLPC.predict(X_test)

**Evaluate the model**

In [106]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_MLPC))

                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00        27
                                   AIDS       0.92      1.00      0.96        24
                                   Acne       1.00      1.00      1.00        21
                    Alcoholic hepatitis       1.00      1.00      1.00        24
                                Allergy       0.92      1.00      0.96        24
                              Arthritis       1.00      1.00      1.00        25
                       Bronchial Asthma       1.00      1.00      1.00        27
                   Cervical spondylosis       1.00      0.90      0.95        21
                            Chicken pox       1.00      1.00      1.00        20
                    Chronic cholestasis       1.00      1.00      1.00        25
                            Common Cold       1.00      1.00      1.00        19
                           

**Using the neural network (MLPClassifier), we got an accuracy of 99%**
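To make the report columns concrete, the sketch below recomputes precision, recall, F1, support, and overall accuracy by hand on a tiny made-up example. The labels in `y_true_demo` and `y_pred_demo` are illustrative and are not the notebook's predictions; the helper name `per_class_metrics` is hypothetical.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return ({label: (precision, recall, f1, support)}, accuracy)."""
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            true_pos[t] += 1

    report = {}
    for label in sorted(true_count):
        # Precision: correct predictions of this label / all predictions of it
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: correct predictions of this label / all true rows with it
        recall = true_pos[label] / true_count[label]
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1, true_count[label])

    accuracy = sum(true_pos.values()) / len(y_true)
    return report, accuracy

# Made-up labels for illustration only
y_true_demo = ["AIDS", "AIDS", "Acne", "Acne", "Allergy"]
y_pred_demo = ["AIDS", "AIDS", "Acne", "AIDS", "Allergy"]

report, accuracy = per_class_metrics(y_true_demo, y_pred_demo)
print(report)
print(accuracy)
```

Here one `Acne` row is predicted as `AIDS`, which lowers the precision of `AIDS` and the recall of `Acne` while the other values stay at 1.0.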

# **Test the Model Manually**

In [107]:
y_man = model_MLPC.predict([[3, 3, 3, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [108]:
print(y_man)

['Psoriasis']


# **Save and Use Model**

In [109]:
import pickle

In [115]:
# Save Model (the context manager closes the file after writing)
with open('model_MLPC.sav', 'wb') as f:
    pickle.dump(model_MLPC, f)


In [116]:
# Load the severity dataset and the trained model
severity = pd.read_csv('/content/Symptom-severity.csv')
with open('model_MLPC.sav', 'rb') as f:
    model_MLPC = pickle.load(f)
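Predictions depend on both the classifier and the symptom-to-weight table, so one option is to pickle the two together and keep them in sync. The sketch below uses a plain dictionary as a stand-in for the trained model. The file name `model_bundle.pkl` and the bundle keys are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Stand-ins: in the notebook these would be model_MLPC and symptom_dict
demo_model = {"kind": "placeholder for the trained classifier"}
demo_symptom_dict = {"itching": 1, "skin_rash": 3}

bundle = {"model": demo_model, "symptom_weights": demo_symptom_dict}

# Write the bundle to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Read it back and check that both parts survived the round trip
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["symptom_weights"])
```

As with any pickle file, it should only be loaded from a trusted source, since unpickling can execute arbitrary code.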


In [117]:
symptom_dict = dict(zip(severity['Symptom'], severity['weight']))

available_symptoms = list(symptom_dict.keys())

In [118]:
def display_symptoms(start_idx):
    print("\nPlease select the symptoms you have from the list below by entering their corresponding numbers (comma-separated):")
    print(f"{'Sl.No.':<5} {'Symptom':<30} {'Sl.No.':<5} {'Symptom':<30}")

    for i in range(10):  # Display 10 rows, 2 symptoms per row
        left_idx = start_idx + i
        right_idx = start_idx + i + 10

        left_symptom = f"{left_idx + 1}. {available_symptoms[left_idx]}" if left_idx < len(available_symptoms) else ""
        right_symptom = f"{right_idx + 1}. {available_symptoms[right_idx]}" if right_idx < len(available_symptoms) else ""

        print(f"{left_symptom:<35} {right_symptom:<35}")


In [119]:
# Function to get the user's symptom input
def get_user_symptoms():
    selected_symptoms = []
    start_idx = 0
    max_symptoms = 17

    while len(selected_symptoms) < max_symptoms and start_idx < len(available_symptoms):
        display_symptoms(start_idx)


        symptom_input = input("Enter symptom numbers (comma-separated) or '0' to stop: ").strip()
        if symptom_input == '0':
            break

        for token in symptom_input.split(','):
            token = token.strip()
            if not token:
                continue
            if not token.isdigit():
                # Report non-numeric entries instead of silently dropping them
                print(f"Invalid input: {token!r}. Please enter valid symptom numbers.")
                continue
            idx = int(token) - 1
            if 0 <= idx < len(available_symptoms):
                # Skip symptoms that were already selected
                if available_symptoms[idx] not in selected_symptoms:
                    selected_symptoms.append(available_symptoms[idx])
                if len(selected_symptoms) >= max_symptoms:
                    break
            else:
                print(f"Invalid symptom number: {idx + 1}. Please try again.")

        start_idx += 20

    return selected_symptoms

In [120]:
# Function to map symptoms to their severity weights
def map_symptoms_to_weights(symptoms):
    symptom_weights = [0] * 17
    for i, symptom in enumerate(symptoms):
        symptom_weights[i] = symptom_dict[symptom]

    return symptom_weights
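As a worked example of the mapping, the symptoms `itching`, `skin_rash`, and `nodal_skin_eruptions` have weights 1, 3, and 4 in the severity table, so they become a 17-element vector that starts with `[1, 3, 4]` and is padded with zeros. The sketch below is a defensive variant that also ignores unknown names and truncates long inputs. The helper name `to_feature_vector` and the small `demo_weights` dictionary are assumptions for illustration.

```python
MAX_SYMPTOMS = 17

# Weights taken from the first rows of the severity table
demo_weights = {"itching": 1, "skin_rash": 3, "nodal_skin_eruptions": 4}

def to_feature_vector(symptoms, weights, size=MAX_SYMPTOMS):
    """Map symptom names to weights, then pad or truncate to a fixed length."""
    known = [weights[s] for s in symptoms if s in weights]  # skip unknown names
    known = known[:size]                                    # never exceed `size`
    return known + [0] * (size - len(known))                # zero-pad the rest

vector = to_feature_vector(
    ["itching", "skin_rash", "nodal_skin_eruptions"], demo_weights
)
print(vector)
```

The resulting list has the same shape as one row of `X`, so it can be passed to the model as `model_MLPC.predict([vector])`.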

In [121]:
# Function to predict disease based on symptoms
def predict_disease(symptom_weights):
    prediction = model_MLPC.predict([symptom_weights])
    return prediction[0]

In [123]:
user_symptoms = get_user_symptoms()
symptom_weights = map_symptoms_to_weights(user_symptoms)
predicted_disease = predict_disease(symptom_weights)

print(f"\nPredicted Disease: {predicted_disease}")



Please select the symptoms you have from the list below by entering their corresponding numbers (comma-separated):
Sl.No. Symptom                        Sl.No. Symptom                       
1. itching                          11. muscle_wasting                 
2. skin_rash                        12. vomiting                       
3. nodal_skin_eruptions             13. burning_micturition            
4. continuous_sneezing              14. spotting_urination             
5. shivering                        15. fatigue                        
6. chills                           16. weight_gain                    
7. joint_pain                       17. anxiety                        
8. stomach_pain                     18. cold_hands_and_feets           
9. acidity                          19. mood_swings                    
10. ulcers_on_tongue                20. weight_loss                    
Enter symptom numbers (comma-separated) or '0' to stop: 1,2,3

Please select the symptom