# Healthcare Symptoms–Disease Classification

This dataset contains 25,000 synthetic healthcare records designed for machine learning models that classify diseases based on patient symptoms. It includes demographic attributes, symptom lists, and confirmed diagnoses across 30 common acute, chronic, infectious, and neurological diseases.

The dataset is well-suited for:

- Multi-class disease classification
- Symptom pattern analysis
- Medical decision support modeling
- NLP feature extraction on symptom text
- Data mining and biomedical research

Each record corresponds to a unique patient with a generated combination of symptoms and diagnosis created from realistic patterns while maintaining anonymity.

This dataset is purely synthetic, meaning no real patient data is used.

## Imports

In [1]:
!pip install catboost
from catboost import CatBoostClassifier
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
import os
from sklearn.preprocessing import LabelEncoder
import kagglehub
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection  import train_test_split
import tensorflow as tf
from tensorflow.keras import layers, models



## Load The Data

In [2]:
# Download latest version
path = kagglehub.dataset_download("kundanbedmutha/healthcare-symptomsdisease-classification-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'healthcare-symptomsdisease-classification-dataset' dataset.
Path to dataset files: /kaggle/input/healthcare-symptomsdisease-classification-dataset


In [3]:
print(os.listdir(path))

['Healthcare.csv']


In [4]:
df = pd.read_csv(os.path.join(path,'Healthcare.csv'))

## Explore The Data

In [5]:
df.head()

Unnamed: 0,Patient_ID,Age,Gender,Symptoms,Symptom_Count,Disease
0,1,29,Male,"fever, back pain, shortness of breath",3,Allergy
1,2,76,Female,"insomnia, back pain, weight loss",3,Thyroid Disorder
2,3,78,Male,"sore throat, vomiting, diarrhea",3,Influenza
3,4,58,Other,"blurred vision, depression, weight loss, muscl...",4,Stroke
4,5,55,Female,"swelling, appetite loss, nausea",3,Heart Disease


In [6]:
df.isna().sum()

Unnamed: 0,0
Patient_ID,0
Age,0
Gender,0
Symptoms,0
Symptom_Count,0
Disease,0


In [7]:
df.duplicated().sum()

np.int64(0)

In [8]:
df['Symptoms'].unique()

array(['fever, back pain, shortness of breath',
       'insomnia, back pain, weight loss',
       'sore throat, vomiting, diarrhea', ..., 'anxiety, nausea, tremors',
       'muscle pain, rash, diarrhea, joint pain',
       'sweating, abdominal pain, fever, insomnia, blurred vision, anxiety, back pain'],
      dtype=object)

In [9]:
df['Disease'].unique()

array(['Allergy', 'Thyroid Disorder', 'Influenza', 'Stroke',
       'Heart Disease', 'Food Poisoning', 'Bronchitis', 'COVID-19',
       'Dermatitis', 'Diabetes', 'Arthritis', 'Sinusitis', 'Dementia',
       "Parkinson's", 'Obesity', 'Asthma', 'Depression', 'Gastritis',
       'Liver Disease', 'Epilepsy', 'IBS', 'Tuberculosis', 'Pneumonia',
       'Anemia', 'Migraine', 'Common Cold', 'Anxiety',
       'Chronic Kidney Disease', 'Ulcer', 'Hypertension'], dtype=object)

In [10]:
df.shape

(25000, 6)

## Preprocessing

In [11]:
df = df.drop('Patient_ID',axis=1)

In [12]:
# Binary flag: 1 for 'Male', 0 for every other value ('Female' and 'Other')
df['Gender'] = df['Gender'].apply(lambda x:1 if x == 'Male' else 0)
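
The lambda above sends both `Female` and `Other` to 0, so the two groups become indistinguishable to the model. If that distinction matters, one option is to one-hot encode the column instead. The sketch below uses a toy Series (an assumption, not the notebook's data) with `pd.get_dummies`; adopting it would change the column counts reported later in the notebook.

```python
import pandas as pd

# Toy stand-in for the Gender column; the three values match those seen in df.head().
gender = pd.Series(["Male", "Female", "Other", "Female"], name="Gender")

# One indicator column per category, so "Female" and "Other" stay distinct.
gender_dummies = pd.get_dummies(gender, prefix="Gender", dtype=int)
print(gender_dummies)
```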

In [13]:
# 1. Split the comma-separated symptom string into a list of stripped symptom names
df["Symptoms_list"] = df["Symptoms"].str.split(",").apply(lambda x: [i.strip() for i in x])

# 2. Explode (each symptom becomes separate row)
df_exploded = df.explode("Symptoms_list")

# 3. One-hot encode the exploded symptoms
dummy = pd.get_dummies(df_exploded["Symptoms_list"])

# 4. Group back by original index (max => 0/1)
dummy_grouped = dummy.groupby(level=0).max()

# 5. Convert to int
dummy_grouped = dummy_grouped.astype(int)

# 6. Final DF = original DF without "Symptoms" & without "Symptoms_list"
df = pd.concat([df.drop(columns=["Symptoms", "Symptoms_list"]), dummy_grouped], axis=1)

df

Unnamed: 0,Age,Gender,Symptom_Count,Disease,abdominal pain,anxiety,appetite loss,back pain,blurred vision,chest pain,...,runny nose,shortness of breath,sneezing,sore throat,sweating,swelling,tremors,vomiting,weight gain,weight loss
0,29,1,3,Allergy,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
1,76,0,3,Thyroid Disorder,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,78,1,3,Influenza,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,58,0,4,Stroke,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
4,55,0,3,Heart Disease,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,42,1,6,Ulcer,1,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
24996,36,1,6,Common Cold,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
24997,70,0,3,Anxiety,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
24998,9,0,4,Obesity,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
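
The explode / one-hot / group-by sequence above builds a multi-hot symptom matrix. As a hedged alternative, pandas can produce the same kind of matrix in one call with `Series.str.get_dummies`. The sketch below runs on two symptom strings taken from the `df.head()` output and assumes the separator is always exactly `", "`; the explicit `strip()` in the original code is more tolerant of irregular spacing.

```python
import pandas as pd

# Two symptom strings copied from the df.head() output above.
symptoms = pd.Series([
    "fever, back pain, shortness of breath",
    "insomnia, back pain, weight loss",
])

# Split on ", " and build one 0/1 column per distinct symptom.
multi_hot = symptoms.str.get_dummies(sep=", ")
print(multi_hot)
```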


In [14]:
df.shape

(25000, 32)

In [15]:
df['Disease'].value_counts()

Unnamed: 0_level_0,count
Disease,Unnamed: 1_level_1
Anxiety,911
Arthritis,896
Food Poisoning,871
Depression,859
Allergy,858
Bronchitis,856
Dermatitis,856
Thyroid Disorder,855
Migraine,854
Diabetes,850


## Refining the Target: From 30 Diseases → Fewer Super-Categories

At first, I tried to build a multi-class classifier that directly predicts **all 30 individual diseases** using the available features (age, gender, symptom count, and one-hot encoded symptoms).  
However, the results were very poor:

- **Overall accuracy was around 0.03 (3%)**
- **Per-class precision, recall, and F1-scores were also very low (≈ 0.02–0.06)**

This is expected because:
- Many diseases share **very similar symptom patterns** (e.g., fever + cough can belong to multiple diseases).
- The model has to separate **30 different classes** based on limited features.
- Some diseases may be **under-represented**, which would add class imbalance.
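
For context, an accuracy near 0.03 is about what uniform random guessing would give. A quick check, assuming the 30 classes are roughly balanced (the `value_counts()` output above suggests this, at least for the most frequent classes):

```python
# Number of distinct diseases listed by df['Disease'].unique() above.
n_classes = 30

# Expected accuracy of guessing uniformly at random among balanced classes.
chance_accuracy = 1 / n_classes
print(f"{chance_accuracy:.3f}")  # 0.033
```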

Because of this, I decided to **simplify the prediction task** by grouping diseases into a smaller number of **meaningful medical super-categories**.  
For example:

- Respiratory / Infectious (COVID-19, Asthma, Bronchitis, Pneumonia, Tuberculosis, Common Cold, Sinusitis, Influenza)
- Cardio-metabolic / Blood (Heart Disease, Hypertension, Diabetes, Thyroid Disorder, Obesity, Anemia)
- Digestive / Liver (Gastritis, IBS, Ulcer, Food Poisoning, Liver Disease)
- Neuro / Stroke (Epilepsy, Migraine, Dementia, Parkinson’s, Stroke)
- Mental Health (Depression, Anxiety)
- Immune / Allergy / Skin (Allergy, Dermatitis)
- Other Chronic (Chronic Kidney Disease, Arthritis)

By **reducing the 30 fine-grained labels to 7 super-categories**, the classification problem should become more realistic and learnable:

- The model now focuses on predicting **broader disease groups** rather than exact diagnoses.
- This better reflects the fact that **symptoms overlap heavily** across specific diseases.
- It is intended to improve the performance, stability, and interpretability of the model.

In the next steps, I train a new model using these **super-categories** as the target instead of all 30 diseases individually.


In [16]:
disease_to_superclass = {
    # 1) Respiratory & Infections
    "COVID-19": "Respiratory/Infectious",
    "Asthma": "Respiratory/Infectious",
    "Bronchitis": "Respiratory/Infectious",
    "Pneumonia": "Respiratory/Infectious",
    "Tuberculosis": "Respiratory/Infectious",
    "Common Cold": "Respiratory/Infectious",
    "Sinusitis": "Respiratory/Infectious",
    "Influenza": "Respiratory/Infectious",

    # 2) Cardio-metabolic & Blood
    "Heart Disease": "Cardio-metabolic/Blood",
    "Hypertension": "Cardio-metabolic/Blood",
    "Diabetes": "Cardio-metabolic/Blood",
    "Thyroid Disorder": "Cardio-metabolic/Blood",
    "Obesity": "Cardio-metabolic/Blood",
    "Anemia": "Cardio-metabolic/Blood",

    # 3) Digestive & Liver
    "Gastritis": "Digestive/Liver",
    "IBS": "Digestive/Liver",
    "Ulcer": "Digestive/Liver",
    "Food Poisoning": "Digestive/Liver",
    "Liver Disease": "Digestive/Liver",

    # 4) Neurological & Stroke
    "Epilepsy": "Neuro/Stroke",
    "Migraine": "Neuro/Stroke",
    "Dementia": "Neuro/Stroke",
    "Parkinson's": "Neuro/Stroke",
    "Stroke": "Neuro/Stroke",

    # 5) Mental Health
    "Depression": "Mental Health",
    "Anxiety": "Mental Health",

    # 6) Immune / Allergy / Skin
    "Allergy": "Immune/Allergy/Skin",
    "Dermatitis": "Immune/Allergy/Skin",

    # 7) Other Chronic (Kidney, Joint)
    "Chronic Kidney Disease": "Other Chronic",
    "Arthritis": "Other Chronic",
}


In [17]:
df['Superclass'] = df['Disease'].map(disease_to_superclass)
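
`Series.map` with a dictionary silently yields `NaN` for any key missing from the dictionary, so it can be worth checking coverage right after mapping. A minimal sketch on toy data follows; the `mapping` and `diseases` values here are illustrative, with `"Gout"` deliberately left unmapped.

```python
import pandas as pd

# Toy mapping and labels (illustrative only); "Gout" has no entry in the mapping.
mapping = {"Asthma": "Respiratory/Infectious", "Ulcer": "Digestive/Liver"}
diseases = pd.Series(["Asthma", "Ulcer", "Gout"])

# Labels missing from the dictionary become NaN after map().
superclass = diseases.map(mapping)
unmapped = sorted(diseases[superclass.isna()].unique())
print(unmapped)  # ['Gout']
```

In the notebook, the same check would be applied to `df['Disease']` and `disease_to_superclass`, where an empty list would indicate that every disease has a super-category.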

In [18]:
df['Disease'].value_counts()

Unnamed: 0_level_0,count
Disease,Unnamed: 1_level_1
Anxiety,911
Arthritis,896
Food Poisoning,871
Depression,859
Allergy,858
Bronchitis,856
Dermatitis,856
Thyroid Disorder,855
Migraine,854
Diabetes,850


In [19]:
df['Superclass'].value_counts()

Unnamed: 0_level_0,count
Superclass,Unnamed: 1_level_1
Respiratory/Infectious,6545
Cardio-metabolic/Blood,4975
Digestive/Liver,4168
Neuro/Stroke,4125
Mental Health,1770
Immune/Allergy/Skin,1714
Other Chronic,1703


In [20]:
le_disease = LabelEncoder()
df["Disease"] = le_disease.fit_transform(df["Disease"])

print(f"Disease classes ({len(le_disease.classes_)} labels):")
print(le_disease.classes_)


le_super = LabelEncoder()
df["Superclass"] = le_super.fit_transform(df["Superclass"])

print(f"\nSuperclass classes ({len(le_super.classes_)} labels):")
print(le_super.classes_)


Disease classes (30 labels):
['Allergy' 'Anemia' 'Anxiety' 'Arthritis' 'Asthma' 'Bronchitis' 'COVID-19'
 'Chronic Kidney Disease' 'Common Cold' 'Dementia' 'Depression'
 'Dermatitis' 'Diabetes' 'Epilepsy' 'Food Poisoning' 'Gastritis'
 'Heart Disease' 'Hypertension' 'IBS' 'Influenza' 'Liver Disease'
 'Migraine' 'Obesity' "Parkinson's" 'Pneumonia' 'Sinusitis' 'Stroke'
 'Thyroid Disorder' 'Tuberculosis' 'Ulcer']

Superclass classes (7 labels):
['Cardio-metabolic/Blood' 'Digestive/Liver' 'Immune/Allergy/Skin'
 'Mental Health' 'Neuro/Stroke' 'Other Chronic' 'Respiratory/Infectious']
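
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `Cardio-metabolic/Blood` is 0 and `Respiratory/Infectious` is 6 in the output above. The sketch below illustrates the round trip on a small toy list (not the notebook's data), using `inverse_transform` to recover the names from the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; three of the seven superclass names from the output above.
names = ["Neuro/Stroke", "Mental Health", "Digestive/Liver", "Mental Health"]

le = LabelEncoder()
codes = le.fit_transform(names)         # codes follow sorted class order
decoded = le.inverse_transform(codes)   # map the integer codes back to names

print(le.classes_)
print(codes)
print(decoded)
```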


In [21]:
df.head()

Unnamed: 0,Age,Gender,Symptom_Count,Disease,abdominal pain,anxiety,appetite loss,back pain,blurred vision,chest pain,...,shortness of breath,sneezing,sore throat,sweating,swelling,tremors,vomiting,weight gain,weight loss,Superclass
0,29,1,3,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,2
1,76,0,3,27,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
2,78,1,3,19,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,6
3,58,0,4,26,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,4
4,55,0,3,16,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 33 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Age                  25000 non-null  int64
 1   Gender               25000 non-null  int64
 2   Symptom_Count        25000 non-null  int64
 3   Disease              25000 non-null  int64
 4   abdominal pain       25000 non-null  int64
 5   anxiety              25000 non-null  int64
 6   appetite loss        25000 non-null  int64
 7   back pain            25000 non-null  int64
 8   blurred vision       25000 non-null  int64
 9   chest pain           25000 non-null  int64
 10  cough                25000 non-null  int64
 11  depression           25000 non-null  int64
 12  diarrhea             25000 non-null  int64
 13  dizziness            25000 non-null  int64
 14  fatigue              25000 non-null  int64
 15  fever                25000 non-null  int64
 16  headache             2

## Prepare Features & Target

In [23]:
!pip install -q sentence-transformers catboost


In [30]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import os # Import os for os.path.join

# Re-load the original data to get the 'Symptoms' text, as it was dropped from 'df'
original_data = pd.read_csv(os.path.join(path, 'Healthcare.csv'))

# Target is already label encoded as integers in df["Superclass"]
y = df["Superclass"]
# y_enc is simply y, as it's already encoded
y_enc = y


X = pd.DataFrame()
X["Symptoms"] = original_data["Symptoms"]
X["Age"] = df["Age"]
X["Gender"] = df["Gender"]
X["Symptom_Count"] = df["Symptom_Count"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc,
    test_size=0.2,
    random_state=42,
    stratify=y_enc
)

# Load sentence-level BERT/MiniLM encoder
bert_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode Symptoms text → dense embeddings
train_symptoms = X_train["Symptoms"].astype(str).tolist()
test_symptoms  = X_test["Symptoms"].astype(str).tolist()

emb_train = bert_model.encode(train_symptoms, batch_size=64, show_progress_bar=True)
emb_test  = bert_model.encode(test_symptoms, batch_size=64, show_progress_bar=True)

print("Embedding shape:", emb_train.shape)

# Numeric features: Age, Gender, Symptom_Count
num_cols = ["Age", "Gender", "Symptom_Count"]

scaler = StandardScaler()
num_train = scaler.fit_transform(X_train[num_cols])
num_test  = scaler.transform(X_test[num_cols])

# Concatenate [BERT embeddings | numeric features]
X_train_bert = np.hstack([emb_train, num_train])
X_test_bert  = np.hstack([emb_test, num_test])

print("Final train shape:", X_train_bert.shape)
print("Final test shape:", X_test_bert.shape)


Batches:   0%|          | 0/313 [00:00<?, ?it/s]

Batches:   0%|          | 0/79 [00:00<?, ?it/s]

Embedding shape: (20000, 384)
Final train shape: (20000, 387)
Final test shape: (5000, 387)


In [31]:
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report

cat = CatBoostClassifier(
    iterations=800,
    depth=8,
    learning_rate=0.03,
    loss_function="MultiClass",
    eval_metric="Accuracy",
    random_seed=42,
    verbose=100
)

cat.fit(X_train_bert, y_train)

# Predict
y_pred = cat.predict(X_test_bert).flatten()



0:	learn: 0.2698000	total: 2.78s	remaining: 37m 2s
100:	learn: 0.2700500	total: 5m 18s	remaining: 36m 41s
200:	learn: 0.3296500	total: 10m 31s	remaining: 31m 22s
300:	learn: 0.4092000	total: 15m 46s	remaining: 26m 8s
400:	learn: 0.4800500	total: 20m 52s	remaining: 20m 46s
500:	learn: 0.5452000	total: 26m 1s	remaining: 15m 32s
600:	learn: 0.6048500	total: 31m 8s	remaining: 10m 18s
700:	learn: 0.6609000	total: 36m 9s	remaining: 5m 6s
799:	learn: 0.7073000	total: 41m 11s	remaining: 0us


In [34]:
# Predict
y_pred = cat.predict(X_test_bert).flatten()

# Re-create a LabelEncoder for the superclass names to get readable target_names
# (in case 'le_super' was lost or re-fit on numeric data).
# LabelEncoder sorts its classes, so these names line up with the integer codes in y_test.
all_superclass_names = list(set(disease_to_superclass.values()))
temp_le_super = LabelEncoder()
temp_le_super.fit(all_superclass_names)

print("=========== BERT + CatBoost Classification Report (7 Super-Categories) ===========\n")
print(classification_report(
    y_test,
    y_pred,
    target_names=temp_le_super.classes_   # Use the classes from the re-created LabelEncoder
))



                        precision    recall  f1-score   support

Cardio-metabolic/Blood       0.21      0.14      0.17       995
       Digestive/Liver       0.17      0.05      0.07       834
   Immune/Allergy/Skin       0.00      0.00      0.00       343
         Mental Health       0.25      0.00      0.01       354
          Neuro/Stroke       0.22      0.07      0.10       825
         Other Chronic       0.00      0.00      0.00       340
Respiratory/Infectious       0.27      0.79      0.40      1309

              accuracy                           0.25      5000
             macro avg       0.16      0.15      0.11      5000
          weighted avg       0.19      0.25      0.17      5000
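
One way to read the report is to compare it against a trivial baseline that always predicts the most frequent class. The sketch below recomputes that baseline from the `support` column of the report above; the dictionary is copied by hand from that output, and the variable names are illustrative.

```python
# Test-set support per super-category, copied from the classification report above.
support = {
    "Cardio-metabolic/Blood": 995,
    "Digestive/Liver": 834,
    "Immune/Allergy/Skin": 343,
    "Mental Health": 354,
    "Neuro/Stroke": 825,
    "Other Chronic": 340,
    "Respiratory/Infectious": 1309,
}

total = sum(support.values())
majority_class, majority_count = max(support.items(), key=lambda kv: kv[1])

# Accuracy obtained by always predicting the largest class.
baseline = majority_count / total
print(majority_class, round(baseline, 3))  # Respiratory/Infectious 0.262
```

The reported accuracy of 0.25 is slightly below this baseline of about 0.26, and the 0.79 recall for `Respiratory/Infectious` suggests the model mostly predicts the largest class.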

