# Healthcare Symptoms → Disease Classification

This notebook implements the second project specification:

> **Multi-Class Text Classification for Healthcare Symptoms → Disease**
>
> Dataset: *Healthcare Symptoms–Disease Classification* (Kaggle)
>
> Goal: Given a short text describing symptoms, build models that predict the corresponding disease class.

We will:

1. Load and inspect the dataset.
2. Analyse three **labeling scenarios**:
   - **Scenario A – Raw Diseases (no noise removal)**
   - **Scenario B – Cleaned Diseases (canonical disease per symptom text)**
   - **Scenario C – Symptom-Based Clusters (K-Means groups)**
3. For the **final modelling step**, compare four models on one chosen scenario:
   - Classic ML model with text embeddings (**TF-IDF + Logistic Regression**)
   - Simple feed-forward neural network on embeddings (**SimpleNN**)
   - **RNN** model
   - **LSTM** model


## 0. Imports

In [1]:
import pandas as pd
import numpy as np

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

from sklearn.cluster import KMeans

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 120


## 1. Load Dataset & Basic Cleaning

In [2]:
# Path assumes Healthcare.csv is in the same folder as this notebook
DATA_PATH = "./Healthcare.csv"

# 1.1 Load raw dataset
df = pd.read_csv(DATA_PATH)
print(f"Raw dataset shape: {df.shape}")
display(df.head())

# 1.2 Basic cleaning
# - Drop duplicate rows
# - Lowercase symptoms and strip whitespace
df = df.drop_duplicates().copy()
df["Symptoms"] = df["Symptoms"].str.lower().str.strip()

print(f"\nAfter cleaning duplicates: {df.shape}")
print("Number of unique diseases:", df["Disease"].nunique())
print(df["Disease"].value_counts().head())

Raw dataset shape: (25000, 6)


|   | Patient_ID | Age | Gender | Symptoms | Symptom_Count | Disease |
|---|---|---|---|---|---|---|
| 0 | 1 | 29 | Male | fever, back pain, shortness of breath | 3 | Allergy |
| 1 | 2 | 76 | Female | insomnia, back pain, weight loss | 3 | Thyroid Disorder |
| 2 | 3 | 78 | Male | sore throat, vomiting, diarrhea | 3 | Influenza |
| 3 | 4 | 58 | Other | blurred vision, depression, weight loss, muscle pain | 4 | Stroke |
| 4 | 5 | 55 | Female | swelling, appetite loss, nausea | 3 | Heart Disease |



After cleaning duplicates: (25000, 6)
Number of unique diseases: 30
Disease
Anxiety           911
Arthritis         896
Food Poisoning    871
Depression        859
Allergy           858
Name: count, dtype: int64


## 2. Utility – Baseline Trainer (TF-IDF + Logistic Regression)

This helper encapsulates the repeated steps for the classic model:

- Train/test split (80/20, stratified)
- TF-IDF vectorisation
- Logistic Regression training
- Accuracy + most-frequent-class baseline + classification report.

In [15]:
def train_logreg_tfidf(texts, labels, description="Scenario"):
    """Train & evaluate TF-IDF + Logistic Regression baseline.

    Returns (accuracy, classifier, tfidf_vectorizer,
             X_train_text, X_test_text, y_train, y_test).
    """
    print("\n" + "=" * 80)
    print(f"{description}: TF-IDF + Logistic Regression")
    print("=" * 80)

    X_train_text, X_test_text, y_train, y_test = train_test_split(
        texts,
        labels,
        test_size=0.2,
        random_state=42,
        stratify=labels,
    )

    # TF-IDF vectorisation
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_train_vec = tfidf.fit_transform(X_train_text)
    X_test_vec = tfidf.transform(X_test_text)

    print("TF-IDF shapes:", X_train_vec.shape, X_test_vec.shape)

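    # Logistic Regression classifier. The default lbfgs solver already fits a
    # multinomial model for multi-class targets; the explicit multi_class
    # argument is deprecated in recent scikit-learn releases, so it is omitted.
    clf = LogisticRegression(
        max_iter=1000,
        random_state=42,
        n_jobs=-1,
    )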
    clf.fit(X_train_vec, y_train)

    # Evaluation
    y_pred = clf.predict(X_test_vec)
    acc = accuracy_score(y_test, y_pred)

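    # Most-frequent-class baseline for comparison: the share of the largest
    # class in the training labels (a proxy for always predicting that class)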
    majority = pd.Series(y_train).value_counts(normalize=True).iloc[0]

    print(f"\nAccuracy: {acc:.4f}")
    print(f"Most-frequent-class baseline: {majority:.4f}")
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, digits=3))

    return acc, clf, tfidf, X_train_text, X_test_text, y_train, y_test
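
The baseline figure printed by the helper is the share of the most frequent class in the training labels. As a cross-check, the accuracy of a majority-class predictor on the test labels can be computed with scikit-learn's `DummyClassifier`. The sketch below uses small made-up label arrays in place of `y_train` and `y_test`; it is an illustration, not part of the helper.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train / y_test (hypothetical values).
y_train = np.array([0, 0, 0, 1, 1, 2])
y_test = np.array([0, 1, 2, 2])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_placeholder = np.zeros((len(y_train), 1))
X_test_placeholder = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_placeholder, y_train)

# Share of the largest class in the training labels (what the helper prints).
train_share = np.bincount(y_train).max() / len(y_train)

# Accuracy of always predicting that class on the held-out labels.
baseline_acc = accuracy_score(y_test, dummy.predict(X_test_placeholder))

print(f"Training share of majority class: {train_share:.4f}")
print(f"Majority-class test accuracy:     {baseline_acc:.4f}")
```

With a stratified split and near-balanced classes the two numbers should be close, so the choice mostly matters for small or skewed test sets.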

## 3. Scenario A – Raw Diseases (No Noise Removal)

In Scenario A we directly predict the original **Disease** label from the symptom text, without any attempt to clean noisy labels.

This reflects the "naive" formulation of the problem and serves as a starting point.

In [4]:
# Encode raw Disease labels to integers
le_raw = LabelEncoder()
df["Disease_id_raw"] = le_raw.fit_transform(df["Disease"])

print("Number of disease classes (raw):", len(le_raw.classes_))

acc_raw, clf_raw, tfidf_raw, X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_logreg_tfidf(
    texts=df["Symptoms"],
    labels=df["Disease_id_raw"],
    description="Scenario A – Raw Diseases",
)

print(f"\nScenario A final accuracy: {acc_raw:.4f}")

Number of disease classes (raw): 30

Scenario A – Raw Diseases: TF-IDF + Logistic Regression
TF-IDF shapes: (20000, 649) (5000, 649)

Accuracy: 0.0336
Most-frequent-class baseline: 0.0365

Classification report:
              precision    recall  f1-score   support

           0      0.036     0.041     0.038       172
           1      0.056     0.055     0.056       163
           2      0.018     0.027     0.022       182
           3      0.034     0.056     0.042       179
           4      0.029     0.019     0.023       156
           5      0.055     0.053     0.054       171
           6      0.025     0.024     0.024       168
           7      0.027     0.025     0.026       161
           8      0.031     0.031     0.031       161
           9      0.034     0.030     0.032       165
          10      0.034     0.035     0.034       172
          11      0.015     0.012     0.013       171
          12      0.006     0.006     0.006       170
          13      0.033  

## 4. Scenario B – Cleaned Diseases (Canonical Label per Symptom)

In Scenario B we clean noisy labels at the **symptom-text** level:

- For each unique `Symptoms` string, count how often each disease appears.
- If one disease accounts for at least 80% of occurrences, we treat it as the **canonical** disease for that symptom pattern.
- Symptom patterns without a dominant disease are considered ambiguous and are removed.

This enforces a consistent mapping: *each symptom text → one disease*.
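
To make the 80% rule concrete, here is a small worked example on hand-made counters. The function mirrors the rule described above; the disease counts are invented for illustration only.

```python
from collections import Counter


def choose_canonical_disease(counter, min_ratio=0.8):
    """Return the dominant disease if its share is at least min_ratio, else None."""
    total = sum(counter.values())
    disease, count = counter.most_common(1)[0]
    return disease if count / total >= min_ratio else None


# 4 of 5 occurrences (80%) share one label: the pattern is kept.
dominant = choose_canonical_disease(Counter({"Influenza": 4, "Allergy": 1}))

# 3 of 5 occurrences (60%): no dominant label, the pattern is dropped.
ambiguous = choose_canonical_disease(Counter({"Influenza": 3, "Allergy": 2}))

# A symptom string seen only once always passes, since its share is 100%.
singleton = choose_canonical_disease(Counter({"Stroke": 1}))

print(dominant, ambiguous, singleton)
```

The last case is worth keeping in mind: the rule can only remove rows whose symptom string occurs more than once.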

In [5]:
from collections import Counter

# 4.1 Build counts of diseases for each Symptoms string
symptom_disease_counts = (
    df.groupby("Symptoms")
      .agg({"Disease": lambda x: Counter(x)})
      .reset_index()
      .rename(columns={"Disease": "disease_counts"})
)

print("\nSample of symptom → disease frequency table:")
display(symptom_disease_counts.head())

# 4.2 Decide canonical disease for a symptom pattern
def choose_canonical_disease(counter: Counter, min_ratio: float = 0.8):
    """Return dominant disease if its share ≥ min_ratio; otherwise None."""
    total = sum(counter.values())
    disease, count = counter.most_common(1)[0]
    if count / total >= min_ratio:
        return disease
    return None

symptom_disease_counts["canonical_disease"] = symptom_disease_counts["disease_counts"].apply(
    lambda c: choose_canonical_disease(c, min_ratio=0.8)
)

print("\nAfter canonical mapping (first 10 rows):")
display(symptom_disease_counts.head(10))

# 4.3 Map back to the full dataframe and drop ambiguous rows
symptom2disease = (
    symptom_disease_counts.dropna(subset=["canonical_disease"])
    .set_index("Symptoms")["canonical_disease"]
    .to_dict()
)

print("\nNon-ambiguous symptom patterns:", len(symptom2disease))

df_clean = df.copy()
df_clean["Canonical_Disease"] = df_clean["Symptoms"].map(symptom2disease)

before = len(df_clean)
df_clean = df_clean.dropna(subset=["Canonical_Disease"])
after = len(df_clean)

print(f"\nCleaned dataset: kept {after} of {before} rows ({after/before:.1%}).")
print("Number of distinct canonical diseases:", df_clean["Canonical_Disease"].nunique())
display(df_clean[["Symptoms", "Disease", "Canonical_Disease"]].head(10))


Sample of symptom → disease frequency table:

|   | Symptoms | disease_counts |
|---|---|---|
| 0 | abdominal pain, anxiety, appetite loss, nausea, blurred vision | {'Food Poisoning': 1} |
| 1 | abdominal pain, anxiety, back pain, rash, headache | {'Depression': 1} |
| 2 | abdominal pain, anxiety, back pain, weight loss | {'Liver Disease': 1} |
| 3 | abdominal pain, anxiety, blurred vision, chest pain | {'Hypertension': 1} |
| 4 | abdominal pain, anxiety, blurred vision, dizziness, weight gain, tremors, sore throat | {'Allergy': 1} |



After canonical mapping (first 10 rows):


|   | Symptoms | disease_counts | canonical_disease |
|---|---|---|---|
| 0 | abdominal pain, anxiety, appetite loss, nausea, blurred vision | {'Food Poisoning': 1} | Food Poisoning |
| 1 | abdominal pain, anxiety, back pain, rash, headache | {'Depression': 1} | Depression |
| 2 | abdominal pain, anxiety, back pain, weight loss | {'Liver Disease': 1} | Liver Disease |
| 3 | abdominal pain, anxiety, blurred vision, chest pain | {'Hypertension': 1} | Hypertension |
| 4 | abdominal pain, anxiety, blurred vision, dizziness, weight gain, tremors, sore throat | {'Allergy': 1} | Allergy |
| 5 | abdominal pain, anxiety, blurred vision, tremors, sweating, appetite loss | {'Epilepsy': 1} | Epilepsy |
| 6 | abdominal pain, anxiety, chest pain, weight gain | {'Diabetes': 1} | Diabetes |
| 7 | abdominal pain, anxiety, depression, sneezing, tremors | {'COVID-19': 1} | COVID-19 |
| 8 | abdominal pain, anxiety, depression, weight gain, back pain, nausea, dizziness | {'Gastritis': 1} | Gastritis |
| 9 | abdominal pain, anxiety, diarrhea | {'Liver Disease': 1} | Liver Disease |



Non-ambiguous symptom patterns: 23789

Cleaned dataset: kept 23806 of 25000 rows (95.2%).
Number of distinct canonical diseases: 30


|   | Symptoms | Disease | Canonical_Disease |
|---|---|---|---|
| 0 | fever, back pain, shortness of breath | Allergy | Allergy |
| 1 | insomnia, back pain, weight loss | Thyroid Disorder | Thyroid Disorder |
| 2 | sore throat, vomiting, diarrhea | Influenza | Influenza |
| 3 | blurred vision, depression, weight loss, muscle pain | Stroke | Stroke |
| 4 | swelling, appetite loss, nausea | Heart Disease | Heart Disease |
| 5 | vomiting, swelling, dizziness, fatigue | Heart Disease | Heart Disease |
| 6 | anxiety, shortness of breath, appetite loss, cough, back pain | Food Poisoning | Food Poisoning |
| 7 | sore throat, weight loss, chest pain, depression, anxiety, rash | Bronchitis | Bronchitis |
| 8 | insomnia, diarrhea, swelling | COVID-19 | COVID-19 |
| 9 | joint pain, shortness of breath, runny nose | Dermatitis | Dermatitis |
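
The counts reported above (23789 non-ambiguous patterns covering 23806 kept rows) suggest that almost every kept symptom string occurs exactly once, in which case the 80% rule passes trivially for it. A quick way to check how many patterns are actually repeated is sketched below on a made-up four-row frame; the same three lines could be run on `df["Symptoms"]`.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (hypothetical rows).
toy = pd.DataFrame(
    {
        "Symptoms": ["fever, cough", "fever, cough", "rash", "nausea, fatigue"],
        "Disease": ["Influenza", "Allergy", "Dermatitis", "Gastritis"],
    }
)

# How often does each exact symptom string occur?
occurrences = toy["Symptoms"].value_counts()

n_patterns = len(occurrences)                                # distinct symptom strings
n_repeated = int((occurrences > 1).sum())                    # strings seen more than once
rows_in_repeated = int(occurrences[occurrences > 1].sum())   # rows those strings cover

print(f"{n_patterns} patterns, {n_repeated} repeated, covering {rows_in_repeated} rows")
```

Only the repeated patterns can carry contradictory labels, so their count bounds how much the cleaning step can change the task.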


In [6]:
# 4.4 Encode canonical diseases
le_clean = LabelEncoder()
df_clean["Disease_id_clean"] = le_clean.fit_transform(df_clean["Canonical_Disease"])

print("Number of disease classes (cleaned):", len(le_clean.classes_))

# 4.5 Train baseline on cleaned labels
acc_clean, clf_clean, tfidf_clean, X_train_clean, X_test_clean, y_train_clean, y_test_clean = train_logreg_tfidf(
    texts=df_clean["Symptoms"],
    labels=df_clean["Disease_id_clean"],
    description="Scenario B – Cleaned Diseases",
)

print(f"\nScenario B final accuracy: {acc_clean:.4f}")

Number of disease classes (cleaned): 30

Scenario B – Cleaned Diseases: TF-IDF + Logistic Regression
TF-IDF shapes: (19044, 649) (4762, 649)

Accuracy: 0.0338
Most-frequent-class baseline: 0.0367

Classification report:
              precision    recall  f1-score   support

           0      0.034     0.037     0.036       161
           1      0.028     0.026     0.027       156
           2      0.049     0.074     0.059       175
           3      0.066     0.095     0.078       169
           4      0.038     0.027     0.031       150
           5      0.028     0.030     0.029       164
           6      0.048     0.037     0.042       160
           7      0.007     0.006     0.007       154
           8      0.019     0.019     0.019       154
           9      0.050     0.052     0.051       155
          10      0.046     0.042     0.044       165
          11      0.015     0.012     0.014       161
          12      0.028     0.031     0.029       162
          13     

## 5. Scenario C – Symptom-Based Clusters (K-Means Groups)

In Scenario C we relax the problem: instead of predicting the exact disease, we let the data define broader **symptom groups** using K-Means clustering.

Steps:

1. Vectorise all symptom strings using `CountVectorizer`.
2. Run `KMeans(n_clusters = 5)` on these vectors.
3. Use the resulting `Cluster_Label` as the target class.
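4. Train TF-IDF + Logistic Regression to predict the cluster from the original symptom text.

This scenario is much easier by construction: the cluster labels are computed from word counts of the same symptom text that the classifier sees, so a high accuracy here shows that the clusters are recoverable from the text, not that diseases are predictable.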

In [7]:
# 5.1 Vectorise all symptom strings with CountVectorizer
cv = CountVectorizer(max_features=5000)
X_symptoms_cv = cv.fit_transform(df["Symptoms"])

print("CountVectorizer shape:", X_symptoms_cv.shape)

# 5.2 Run K-Means to create k clusters
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(X_symptoms_cv)

# Attach cluster labels to dataframe
df_cluster = df.copy()
df_cluster["Cluster_Label"] = cluster_labels

print("\nCluster label distribution:")
print(df_cluster["Cluster_Label"].value_counts(normalize=True))

# 5.3 Train classifier to predict Cluster_Label from Symptoms
acc_cluster, clf_cluster, tfidf_cluster, X_train_cluster, X_test_cluster, y_train_cluster, y_test_cluster = train_logreg_tfidf(
    texts=df_cluster["Symptoms"],
    labels=df_cluster["Cluster_Label"],
    description="Scenario C – Symptom-Based Clusters",
)

print(f"\nScenario C final accuracy: {acc_cluster:.4f}")

CountVectorizer shape: (25000, 35)





Cluster label distribution:
Cluster_Label
1    0.32952
3    0.21652
0    0.18084
4    0.16996
2    0.10316
Name: proportion, dtype: float64

Scenario C – Symptom-Based Clusters: TF-IDF + Logistic Regression
TF-IDF shapes: (20000, 649) (5000, 649)

Accuracy: 0.9976
Most-frequent-class baseline: 0.3295

Classification report:
              precision    recall  f1-score   support

           0      0.999     1.000     0.999       904
           1      1.000     1.000     1.000      1648
           2      1.000     0.981     0.990       516
           3      0.999     0.999     0.999      1082
           4      0.988     0.999     0.994       850

    accuracy                          0.998      5000
   macro avg      0.997     0.996     0.996      5000
weighted avg      0.998     0.998     0.998      5000


Scenario C final accuracy: 0.9976


In [14]:
print(df_cluster["Cluster_Label"].value_counts())

Cluster_Label
1    8238
3    5413
0    4521
4    4249
2    2579
Name: count, dtype: int64


## 6. Scenario-Level Summary

Here we compare the three scenarios using the same classic ML baseline (TF-IDF + Logistic Regression).

In [8]:
results_scenarios = pd.DataFrame(
    {
        "Scenario": [
            "A – Raw Diseases",
            "B – Cleaned Diseases",
            "C – Symptom Clusters",
        ],
        "Accuracy": [acc_raw, acc_clean, acc_cluster],
    }
)

print("\n🏆 FINAL SCENARIO COMPARISON (Classic ML) 🏆")
print(results_scenarios.to_string(index=False))


🏆 FINAL SCENARIO COMPARISON (Classic ML) 🏆
            Scenario  Accuracy
    A – Raw Diseases  0.033600
B – Cleaned Diseases  0.033809
C – Symptom Clusters  0.997600
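
To put the Scenario A and B numbers in context, it helps to compare them with uniform guessing over 30 classes. The sketch below computes the chance level and a binomial standard error for a 5000-row test set, then expresses the Scenario A accuracy as a distance from chance. It assumes roughly balanced classes, which the class counts shown earlier appear to support.

```python
import math

n_classes = 30      # number of disease classes reported above
n_test = 5000       # Scenario A test-set size reported above
observed = 0.0336   # Scenario A accuracy reported above

# Expected accuracy when guessing uniformly at random.
chance = 1 / n_classes

# Standard error of an accuracy estimate at that level (binomial approximation).
std_err = math.sqrt(chance * (1 - chance) / n_test)

# Distance of the observed accuracy from chance, in standard errors.
z = (observed - chance) / std_err

print(f"chance={chance:.4f}  std_err={std_err:.4f}  z={z:.2f}")
```

A value of `z` well below 1 means the observed accuracy is statistically indistinguishable from random guessing on this test set.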


## 7. Final Modelling

Following the project specification, we now compare **four models** on a single, clearly defined task.

We choose **Scenario B – Cleaned Diseases** as the main task, because:

- It still predicts **real diseases**, not artificial clusters.
- Contradictory labels are removed, so the mapping *Symptoms → Disease* is at least self-consistent.

The four models are:

1. TF-IDF + Logistic Regression (classic model)
2. Simple feed-forward neural network on token embeddings (SimpleNN)
3. RNN
4. LSTM

All models use the **same train/test split** and are evaluated using **accuracy** and a **classification report**.

### 7.1 Build Vocabulary & Sequence Data (Scenario B)

In [9]:
# We reuse X_train_clean / X_test_clean / y_train_clean / y_test_clean

MAX_LEN = 20

# Build vocabulary from training text only
all_train_words = " ".join(X_train_clean.values).split()
word_counts = Counter(all_train_words)

vocab = {"<PAD>": 0, "<UNK>": 1}
for word, _ in word_counts.most_common():
    vocab[word] = len(vocab)

vocab_size = len(vocab)
print(f"Vocabulary size (Scenario B, train only): {vocab_size}")


def encode_text(text_list, vocab, max_len=20):
    """Convert list/Series of text strings into padded sequences of word IDs."""
    encoded = []
    for text in text_list:
        words = text.split()
        ids = [vocab.get(w, vocab["<UNK>"]) for w in words]
        if len(ids) < max_len:
            ids = ids + [vocab["<PAD>"]] * (max_len - len(ids))
        else:
            ids = ids[:max_len]
        encoded.append(ids)
    return np.array(encoded)

X_train_seq = encode_text(X_train_clean, vocab, max_len=MAX_LEN)
X_test_seq = encode_text(X_test_clean, vocab, max_len=MAX_LEN)

print("Train seq shape:", X_train_seq.shape)
print("Test seq shape:", X_test_seq.shape)

# Convert to tensors
X_train_tensor = torch.from_numpy(X_train_seq).long()
y_train_tensor = torch.from_numpy(y_train_clean.values).long()
X_test_tensor = torch.from_numpy(X_test_seq).long()
y_test_tensor = torch.from_numpy(y_test_clean.values).long()

BATCH_SIZE = 32

train_data = TensorDataset(X_train_tensor, y_train_tensor)
test_data = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE)

print("DataLoaders ready.")

Vocabulary size (Scenario B, train only): 60
Train seq shape: (19044, 20)
Test seq shape: (4762, 20)
DataLoaders ready.
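
One detail of the whitespace tokenisation used above: because the symptom strings are comma-separated, `text.split()` keeps the commas attached to words, so `pain` and `pain,` become different vocabulary entries and multi-word symptoms are broken apart. This may be why the vocabulary (60 entries) is larger than the 35 `CountVectorizer` features. The sketch below contrasts the two behaviours; the comma-based alternative is a suggestion, not what the notebook does.

```python
text = "fever, back pain, shortness of breath"

# Whitespace split, as in encode_text: punctuation stays attached.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Alternative (assumption): treat each comma-separated symptom as one token.
symptom_tokens = [part.strip() for part in text.split(",") if part.strip()]
print(symptom_tokens)
```

With symptom-level tokens the sequence length equals the number of symptoms, so a much smaller `MAX_LEN` would suffice.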


### 7.2 Define Neural Architectures

In [10]:
class SimpleNN(nn.Module):
    """Feed-forward network on flattened embeddings."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, max_len):
        super(SimpleNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(max_len * embed_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        emb = self.embedding(x)                     # [batch, max_len, embed_dim]
        flat = emb.view(emb.size(0), -1)           # [batch, max_len * embed_dim]
        h = self.relu(self.fc1(flat))              # [batch, hidden_dim]
        logits = self.fc2(h)                       # [batch, output_dim]
        return logits


class RNNModel(nn.Module):
    """Vanilla RNN based classifier."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        emb = self.embedding(x)
        out, hidden = self.rnn(emb)
        hidden = hidden.squeeze(0)                # [batch, hidden_dim]
        logits = self.fc(hidden)
        return logits


class LSTMModel(nn.Module):
    """LSTM based classifier."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        emb = self.embedding(x)
        out, (hidden, cell) = self.lstm(emb)
        hidden = hidden.squeeze(0)                # [batch, hidden_dim]
        logits = self.fc(hidden)
        return logits


print("Neural architectures defined.")

Neural architectures defined.
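
Because `encode_text` pads on the right, the final hidden state used by `RNNModel` and `LSTMModel` is taken after the padding positions have been processed. One common way to read the state at the last real token instead is `pack_padded_sequence`. The class below is a hypothetical variant, not part of the notebook; the name `PackedLSTMClassifier` and the small dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

PAD_ID = 0  # same convention as vocab["<PAD>"] in the notebook


class PackedLSTMClassifier(nn.Module):
    """Hypothetical LSTM classifier that skips padded positions."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_ID)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Count non-PAD tokens per row; clamp so an all-PAD row has length 1.
        lengths = (x != PAD_ID).sum(dim=1).clamp(min=1).cpu()
        emb = self.embedding(x)
        packed = pack_padded_sequence(
            emb, lengths, batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)
        # hidden[-1] is the state after the last real token of each sequence.
        return self.fc(hidden[-1])


model = PackedLSTMClassifier(vocab_size=60, embed_dim=8, hidden_dim=16, output_dim=30)
batch = torch.tensor([[5, 7, 9, 0, 0], [3, 4, 0, 0, 0]])
logits = model(batch)
print(logits.shape)
```

With this variant, a sequence produces the same logits regardless of how much right-padding follows it.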


### 7.3 Training & Evaluation Helper

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

criterion = nn.CrossEntropyLoss()


def train_and_evaluate(model, name, train_loader, test_loader, epochs=5, lr=1e-3):
    """Train a PyTorch model and report test accuracy + classification report."""
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    print("\n" + "-" * 60)
    print(f"Training {name}")
    print("-" * 60)

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * X_batch.size(0)

        epoch_loss = running_loss / len(train_loader.dataset)
        print(f"Epoch {epoch+1}/{epochs} - Loss: {epoch_loss:.4f}")

    # Evaluation
    model.eval()
    correct = 0
    total = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            logits = model(X_batch)
            _, preds = torch.max(logits, dim=1)

            total += y_batch.size(0)
            correct += (preds == y_batch).sum().item()

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(y_batch.cpu().numpy())

    acc = correct / total
    print(f"\n{name} Test Accuracy: {acc:.4f}")
    print("\nClassification report:")
    print(classification_report(all_labels, all_preds, digits=3))

    return acc

Using device: cpu
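
When reading the per-epoch loss printed by this helper, a useful reference point is the cross-entropy of a model that assigns equal probability to every class, which is ln(K) for K classes. The short calculation below evaluates it for the 30-class task.

```python
import math

num_classes = 30  # number of disease classes in Scenarios A and B

# Cross-entropy of a uniform prediction: -log(1 / K) = log(K).
uniform_loss = math.log(num_classes)

print(f"Uniform-prediction loss for {num_classes} classes: {uniform_loss:.4f}")
```

A training loss that stays at about this value indicates the network has not learned anything beyond the class prior.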


### 7.4 Run the Four-Model Comparison (Scenario B)

We now run:

1. **Classic ML** – the Logistic Regression model already trained as part of Scenario B.
2. **SimpleNN**  
3. **RNN**  
4. **LSTM**  

All on the cleaned-disease task.

In [12]:
num_classes_clean = len(le_clean.classes_)

results_models = []

# 1. Classic ML accuracy from Scenario B baseline
results_models.append(("Logistic Regression (TF-IDF)", acc_clean))

# 2. SimpleNN
EMBED_DIM = 64
HIDDEN_DIM = 128

model_nn = SimpleNN(vocab_size, EMBED_DIM, HIDDEN_DIM, num_classes_clean, MAX_LEN)
acc_nn = train_and_evaluate(model_nn, "SimpleNN", train_loader, test_loader, epochs=5)
results_models.append(("SimpleNN", acc_nn))

# 3. RNN
model_rnn = RNNModel(vocab_size, EMBED_DIM, HIDDEN_DIM, num_classes_clean)
acc_rnn = train_and_evaluate(model_rnn, "RNN", train_loader, test_loader, epochs=5)
results_models.append(("RNN", acc_rnn))

# 4. LSTM
model_lstm = LSTMModel(vocab_size, EMBED_DIM, HIDDEN_DIM, num_classes_clean)
acc_lstm = train_and_evaluate(model_lstm, "LSTM", train_loader, test_loader, epochs=5)
results_models.append(("LSTM", acc_lstm))

results_models_df = pd.DataFrame(results_models, columns=["Model", "Test Accuracy"]).sort_values(by="Test Accuracy", ascending=False)

print("\n🏆 FINAL MODEL COMPARISON (Scenario B – Cleaned Diseases) 🏆")
print(results_models_df.to_string(index=False))


------------------------------------------------------------
Training SimpleNN
------------------------------------------------------------
Epoch 1/5 - Loss: 3.4065
Epoch 2/5 - Loss: 3.4022
Epoch 3/5 - Loss: 3.4007
Epoch 4/5 - Loss: 3.3959
Epoch 5/5 - Loss: 3.3831

SimpleNN Test Accuracy: 0.0323

Classification report:




              precision    recall  f1-score   support

           0      0.077     0.006     0.011       161
           1      0.100     0.006     0.012       156
           2      0.031     0.051     0.039       175
           3      0.063     0.036     0.045       169
           4      0.018     0.007     0.010       150
           5      0.500     0.006     0.012       164
           6      0.100     0.006     0.012       160
           7      0.025     0.026     0.025       154
           8      0.074     0.013     0.022       154
           9      1.000     0.006     0.013       155
          10      0.032     0.600     0.060       165
          11      0.000     0.000     0.000       161
          12      0.041     0.031     0.035       162
          13      0.000     0.000     0.000       158
          14      0.050     0.060     0.055       167
          15      0.000     0.000     0.000       155
          16      0.000     0.000     0.000       152
          17      0.000    



Epoch 1/5 - Loss: 3.4042
Epoch 2/5 - Loss: 3.4019
Epoch 3/5 - Loss: 3.4015
Epoch 4/5 - Loss: 3.4013
Epoch 5/5 - Loss: 3.4012

LSTM Test Accuracy: 0.0367

Classification report:
              precision    recall  f1-score   support

           0      0.000     0.000     0.000       161
           1      0.000     0.000     0.000       156
           2      0.037     1.000     0.071       175
           3      0.000     0.000     0.000       169
           4      0.000     0.000     0.000       150
           5      0.000     0.000     0.000       164
           6      0.000     0.000     0.000       160
           7      0.000     0.000     0.000       154
           8      0.000     0.000     0.000       154
           9      0.000     0.000     0.000       155
          10      0.000     0.000     0.000       165
          11      0.000     0.000     0.000       161
          12      0.000     0.000     0.000       162
          13      0.000     0.000     0.000       158
          14

