# Task 3: Disease Prediction System

### Problem Definition
In modern healthcare, early and accurate disease identification is essential for effective treatment and resource optimization. With the exponential growth of medical data and patient records, data-driven approaches have become instrumental in supporting clinical decision-making. Machine learning algorithms enable the extraction of meaningful patterns from large-scale medical datasets, allowing for the development of intelligent systems that assist doctors in diagnosing diseases based on patient symptoms.

The objective of this project is to design and implement a multi-symptom disease prediction system that ranks the most likely diseases given a patient’s reported symptoms.

### Data Exploration

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    classification_report,
    confusion_matrix,
    top_k_accuracy_score,
    make_scorer
)
import time
import warnings
warnings.filterwarnings('ignore')

In [49]:
# Load the dataset
df = pd.read_csv("C:/Users/kendr/Downloads/archive/dataset.csv")

# Dataset Overview
print(f"Dataset Shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Dataset Shape: (4920, 18)
Number of samples: 4920
Number of columns: 18


In [50]:
# Symptom Analysis
symptom_cols = [col for col in df.columns if col.startswith('Symptom_')]

# Count non-null symptoms per sample (row-wise; each row is one record of a disease)
df['Symptom_Count'] = df[symptom_cols].notna().sum(axis=1)

print("Symptoms per disease statistics:")
print(df['Symptom_Count'].describe())

Symptoms per disease statistics:
count    4920.000000
mean        7.448780
std         3.592166
min         3.000000
25%         5.000000
50%         6.000000
75%        10.000000
max        17.000000
Name: Symptom_Count, dtype: float64


In [51]:
# Collect all symptoms
all_symptoms = []
for col in symptom_cols:
    symptoms = df[col].dropna().astype(str).str.strip()
    all_symptoms.extend(symptoms.tolist())

# Filter out "nan" strings
all_symptoms = [symptom for symptom in all_symptoms if symptom.lower() != 'nan']

# Count symptom frequencies
symptom_counter = Counter(all_symptoms)
total_unique_symptoms = len(symptom_counter)

print(f"Total unique symptoms in dataset: {total_unique_symptoms}")
print("\nTop 20 most common symptoms:")
for symptom, count in symptom_counter.most_common(20):
    print(f"  {symptom}: {count}")

Total unique symptoms in dataset: 131

Top 20 most common symptoms:
  fatigue: 1932
  vomiting: 1914
  high_fever: 1362
  loss_of_appetite: 1152
  nausea: 1146
  headache: 1134
  abdominal_pain: 1032
  yellowish_skin: 912
  yellowing_of_eyes: 816
  chills: 798
  skin_rash: 786
  malaise: 702
  chest_pain: 696
  joint_pain: 684
  itching: 678
  sweating: 678
  dark_urine: 570
  cough: 564
  diarrhoea: 564
  irritability: 474


In [52]:
# For each symptom, count how many different diseases it appears in
symptom_disease_map = {}
for symptom in symptom_counter.keys():
    diseases_with_symptom = set()
    for col in symptom_cols:
        diseases = df[df[col].str.strip() == symptom]['Disease'].unique()
        diseases_with_symptom.update(diseases)
    symptom_disease_map[symptom] = len(diseases_with_symptom)

# Find symptoms that appear in multiple diseases
multi_disease_symptoms = {k: v for k, v in symptom_disease_map.items() if v > 1}
print(f"Number of symptoms appearing in multiple diseases: {len(multi_disease_symptoms)}")

# Top symptoms by disease diversity
sorted_symptoms = sorted(symptom_disease_map.items(), key=lambda x: x[1], reverse=True)
print("\nTop 10 symptoms by number of associated diseases:")
for symptom, disease_count in sorted_symptoms[:10]:
    print(f"  {symptom}: appears in {disease_count} different diseases")

Number of symptoms appearing in multiple diseases: 51

Top 10 symptoms by number of associated diseases:
  vomiting: appears in 17 different diseases
  fatigue: appears in 17 different diseases
  high_fever: appears in 12 different diseases
  headache: appears in 10 different diseases
  loss_of_appetite: appears in 10 different diseases
  nausea: appears in 10 different diseases
  abdominal_pain: appears in 9 different diseases
  yellowish_skin: appears in 8 different diseases
  skin_rash: appears in 7 different diseases
  chills: appears in 7 different diseases


This dataset contains data on 41 distinct diseases, each characterized by combinations from 131 unique symptoms, with each disease entry averaging about 7.45 symptoms. Notably, 51 symptoms appear in multiple diseases, highlighting the complexity and overlap often encountered in clinical diagnosis. 
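The overlap count described above can also be derived without nested loops. The following is a minimal sketch on a toy frame that mimics the dataset's wide layout. The toy values and the names `toy`, `long`, and `diseases_per_symptom` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame in the same wide layout as the dataset (values are illustrative only)
toy = pd.DataFrame({
    "Disease": ["A", "A", "B", "C"],
    "Symptom_1": ["fatigue", "fatigue", "fatigue", "itching"],
    "Symptom_2": [" vomiting", None, "cough", "skin_rash"],
})
symptom_cols = [c for c in toy.columns if c.startswith("Symptom_")]

# Reshape to one (Disease, symptom) pair per row and drop empty symptom slots
long = toy.melt(id_vars="Disease", value_vars=symptom_cols, value_name="symptom")
long = long.dropna(subset=["symptom"])
long["symptom"] = long["symptom"].str.strip()

# Number of distinct diseases each symptom appears in
diseases_per_symptom = (
    long.groupby("symptom")["Disease"].nunique().sort_values(ascending=False)
)

# Symptoms shared by more than one disease
shared = diseases_per_symptom[diseases_per_symptom > 1]
print(diseases_per_symptom)
print(f"Symptoms appearing in multiple diseases: {len(shared)}")
```

On the real data the same pattern would be applied to `df` instead of `toy`, and it should produce the same counts as the loop-based version.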

### Data Preprocessing
- Clean data
- Feature engineering
- Create feature matrix (X) and target labels (Y)
- Train-test validation split

In [53]:
# Data Cleaning
df_processed = df.copy()
# Strip whitespace from all symptom entries (handle both string and empty)
for col in symptom_cols:
    df_processed[col] = df_processed[col].astype(str).str.strip()
    # Replace 'nan' strings with empty strings
    df_processed[col] = df_processed[col].replace('nan', '')

# Strip whitespace from disease names
df_processed['Disease'] = df_processed['Disease'].str.strip()

print(f"Sample cleaned data:")
print(df_processed[['Disease'] + symptom_cols[:3]].head())

# Verify no 'nan' strings remain
nan_strings = (df_processed[symptom_cols] == 'nan').sum().sum()
print(f"\n'nan' strings remaining: {nan_strings}")

Sample cleaned data:
            Disease  Symptom_1             Symptom_2             Symptom_3
0  Fungal infection    itching             skin_rash  nodal_skin_eruptions
1  Fungal infection  skin_rash  nodal_skin_eruptions   dischromic _patches
2  Fungal infection    itching  nodal_skin_eruptions   dischromic _patches
3  Fungal infection    itching             skin_rash   dischromic _patches
4  Fungal infection    itching             skin_rash  nodal_skin_eruptions

'nan' strings remaining: 0


In [54]:
# Collect all unique symptoms
all_symptoms = set()
for col in symptom_cols:
    symptoms = df_processed[col][df_processed[col] != ''].unique()
    all_symptoms.update(symptoms)

# Convert to sorted list for consistent indexing
symptom_list = sorted(list(all_symptoms))
symptom_to_index = {symptom: idx for idx, symptom in enumerate(symptom_list)}

print(f"First 10 symptoms in vocabulary: {symptom_list[:10]}")

First 10 symptoms in vocabulary: ['abdominal_pain', 'abnormal_menstruation', 'acidity', 'acute_liver_failure', 'altered_sensorium', 'anxiety', 'back_pain', 'belly_pain', 'blackheads', 'bladder_discomfort']


In [55]:
def create_multi_hot_encoding(row, symptom_to_index, symptom_cols):
    """
    Create a multi-hot encoded vector for a disease entry.
    Each symptom present is marked as 1, absent as 0.
    """
    encoding = np.zeros(len(symptom_to_index))
    
    for col in symptom_cols:
        symptom = row[col]
        if symptom != '' and symptom in symptom_to_index:
            encoding[symptom_to_index[symptom]] = 1
    
    return encoding

In [56]:
# Create feature matrix
X = np.array([create_multi_hot_encoding(row, symptom_to_index, symptom_cols) 
              for _, row in df_processed.iterrows()])

print(f"Feature matrix shape: {X.shape}")

Feature matrix shape: (4920, 131)
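A quick way to sanity-check the multi-hot encoding is to decode a vector back into symptom names and confirm that the round trip preserves the input. The sketch below uses a hypothetical three-symptom vocabulary and helper names (`encode`, `decode`) chosen for illustration. The notebook's own vocabulary has 131 entries.

```python
import numpy as np

# Hypothetical vocabulary; sorted so indices are stable, as in the notebook
symptom_list = ["cough", "fatigue", "itching"]
symptom_to_index = {s: i for i, s in enumerate(symptom_list)}

def encode(symptoms):
    """Multi-hot encode symptom names, ignoring empty or unknown entries."""
    vec = np.zeros(len(symptom_to_index))
    for s in symptoms:
        if s in symptom_to_index:
            vec[symptom_to_index[s]] = 1
    return vec

def decode(vec):
    """Recover the symptom names whose positions are set to 1 (in vocabulary order)."""
    return [symptom_list[i] for i in np.flatnonzero(vec)]

row = ["itching", "", "cough", "not_in_vocabulary"]
vec = encode(row)
print(vec)
print(decode(vec))
```

The order of symptoms in the row is lost by design, since the encoding only records presence or absence.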


In [57]:
# Encode disease labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_processed['Disease'])

print(f"Disease classes (first 10): {list(label_encoder.classes_[:10])}")

# Create disease mapping for reference
disease_mapping = {idx: disease for idx, disease in enumerate(label_encoder.classes_)}
reverse_disease_mapping = {disease: idx for idx, disease in enumerate(label_encoder.classes_)}

print(f"\nDisease mapping:")
for i in range(min(10, len(disease_mapping))):
    print(f"  {i}: {disease_mapping[i]}")

Disease classes (first 10): ['(vertigo) Paroymsal  Positional Vertigo', 'AIDS', 'Acne', 'Alcoholic hepatitis', 'Allergy', 'Arthritis', 'Bronchial Asthma', 'Cervical spondylosis', 'Chicken pox', 'Chronic cholestasis']

Disease mapping:
  0: (vertigo) Paroymsal  Positional Vertigo
  1: AIDS
  2: Acne
  3: Alcoholic hepatitis
  4: Allergy
  5: Arthritis
  6: Bronchial Asthma
  7: Cervical spondylosis
  8: Chicken pox
  9: Chronic cholestasis


In [58]:
# Split data - 80% train, 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set size: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")

Training set size: 3936 samples (80.0%)
Validation set size: 984 samples (20.0%)


### Base Model Construction
- Build a baseline AdaBoost with depth-3 decision trees as weak learners and otherwise default settings (`n_estimators=50`, `learning_rate=1.0`)

#### Why AdaBoost?
In many real-world scenarios, patients exhibit overlapping or ambiguous symptoms, making traditional rule-based diagnostic systems insufficient. To address this, we aim to build an interpretable and data-driven model using the AdaBoost classifier.

Advantages of AdaBoost:
1. It is an ensemble algorithm that builds a strong classifier by combining many weak learners, which makes it effective for complex multi-symptom diagnostic problems. At each round AdaBoost increases the weight of the samples it misclassified, so later learners concentrate on ambiguous cases (for example, where symptoms overlap between diseases), helping the model learn subtle distinctions and reduce misclassification rates.

2. It is fast to train and to query when the weak learners are shallow trees, and it scales to larger datasets, which makes it suitable for near-real-time use in clinical environments.
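To make the adaptive reweighting concrete, here is a minimal sketch of a single boosting round following the multi-class SAMME formulation. It is an illustration under stated assumptions, not the library's internal code: the function name `samme_round` and the toy inputs are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import numpy as np

def samme_round(sample_weights, is_misclassified, n_classes, learning_rate=1.0):
    """One SAMME round: return the learner's vote weight and the updated sample weights."""
    # Weighted error rate of the current weak learner (assumed to be in (0, 1))
    err = np.sum(sample_weights * is_misclassified) / np.sum(sample_weights)
    # Vote weight of the learner; log(K - 1) is the multi-class correction term
    alpha = learning_rate * (np.log((1.0 - err) / err) + np.log(n_classes - 1.0))
    # Up-weight the samples this learner got wrong, then renormalize to sum to 1
    new_weights = sample_weights * np.exp(alpha * is_misclassified)
    return alpha, new_weights / new_weights.sum()

# Five equally weighted samples, the last one misclassified, 41 classes
weights = np.full(5, 0.2)
wrong = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
alpha, updated = samme_round(weights, wrong, n_classes=41)
print(f"alpha = {alpha:.3f}")
print(updated)
```

After the update the misclassified sample carries most of the total weight, so the next weak learner is pushed to classify it correctly.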

In [59]:
# Building Baseline AdaBoost model
base_adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

### Model Training & Evaluation
- Train base model on training set
- Evaluate on validation set using:
    Accuracy, Precision, Recall, F1-score, Top-K accuracy
- Analyze results and error patterns
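Precision, recall, and F1 in this notebook are computed with `average='weighted'`, meaning each class's score is weighted by its number of true samples. One consequence worth knowing when reading the results: weighted recall is algebraically identical to accuracy, so those two rows always match. The sketch below checks this on toy labels (illustrative values, not notebook results).

```python
import numpy as np

# Toy labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1])

classes = np.unique(y_true)
# Support = number of true samples per class
support = np.array([(y_true == c).sum() for c in classes])
# Per-class recall = correctly predicted samples of the class / support of the class
recall = np.array([
    ((y_pred == c) & (y_true == c)).sum() / (y_true == c).sum() for c in classes
])

# Weighted average: the supports cancel, leaving total correct / total samples
weighted_recall = np.sum(recall * support) / support.sum()
accuracy = np.mean(y_true == y_pred)
print(weighted_recall, accuracy)
```

Weighted precision and F1 do not reduce to accuracy in this way, which is why they can differ from it.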

In [60]:
# Train the model and measure time
start_time = time.time()
base_adaboost.fit(X_train, y_train)
training_time = time.time() - start_time

In [61]:
# Predictions on training set
y_train_pred = base_adaboost.predict(X_train)
y_train_proba = base_adaboost.predict_proba(X_train)

# Predictions on validation set
start_time = time.time()
y_val_pred = base_adaboost.predict(X_val)
y_val_proba = base_adaboost.predict_proba(X_val)
inference_time = time.time() - start_time

In [62]:
# Training set metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred, average='weighted', zero_division=0)
train_recall = recall_score(y_train, y_train_pred, average='weighted', zero_division=0)
train_f1 = f1_score(y_train, y_train_pred, average='weighted', zero_division=0)

# Validation set metrics
val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred, average='weighted', zero_division=0)
val_recall = recall_score(y_val, y_val_pred, average='weighted', zero_division=0)
val_f1 = f1_score(y_val, y_val_pred, average='weighted', zero_division=0)

print(f"{'Metric':<20} {'Training Set':<20} {'Validation Set':<20} {'Difference':<20}")
print("-"*80)
print(f"{'Accuracy':<20} {train_accuracy:>8.4f} ({train_accuracy*100:>5.2f}%)   {val_accuracy:>8.4f} ({val_accuracy*100:>5.2f}%)   {abs(train_accuracy-val_accuracy):>8.4f}")
print(f"{'Precision':<20} {train_precision:>8.4f} ({train_precision*100:>5.2f}%)   {val_precision:>8.4f} ({val_precision*100:>5.2f}%)   {abs(train_precision-val_precision):>8.4f}")
print(f"{'Recall':<20} {train_recall:>8.4f} ({train_recall*100:>5.2f}%)   {val_recall:>8.4f} ({val_recall*100:>5.2f}%)   {abs(train_recall-val_recall):>8.4f}")
print(f"{'F1-Score':<20} {train_f1:>8.4f} ({train_f1*100:>5.2f}%)   {val_f1:>8.4f} ({val_f1*100:>5.2f}%)   {abs(train_f1-val_f1):>8.4f}")

Metric               Training Set         Validation Set       Difference          
--------------------------------------------------------------------------------
Accuracy               0.9652 (96.52%)     0.9563 (95.63%)     0.0089
Precision              0.9845 (98.45%)     0.9843 (98.43%)     0.0001
Recall                 0.9652 (96.52%)     0.9563 (95.63%)     0.0089
F1-Score               0.9708 (97.08%)     0.9644 (96.44%)     0.0064


In [63]:
# Check for overfitting
accuracy_gap = train_accuracy - val_accuracy

print(f"Accuracy Gap:        {accuracy_gap:.4f}")

if accuracy_gap < 0.05:
    print("Model generalizes well (gap < 5%)")
elif accuracy_gap < 0.10:
    print("Slight overfitting detected (gap 5-10%)")
else:
    print("Significant overfitting (gap > 10%)")

Accuracy Gap:        0.0089
Model generalizes well (gap < 5%)


**Top-K accuracy analysis** evaluates how often the true disease appears among the model's K highest-ranked predictions, rather than only counting cases where it is the top choice (Top-1 accuracy). This is especially important for real-world, multi-class diagnosis tasks, where several diseases can present with very similar symptoms. It reflects the reality that several plausible diagnoses may fit a patient’s symptoms, and a ranked shortlist helps clinicians make safer, better-informed decisions.
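The definition above can be written in a few lines of NumPy. This is a minimal sketch for illustration with a hypothetical helper name (`top_k_accuracy`) and toy scores. It ignores tie-handling, so it may differ from a library implementation when several classes share the same score.

```python
import numpy as np

def top_k_accuracy(y_true, scores, k):
    """Fraction of samples whose true class index is among the k highest-scoring classes."""
    # Indices of the k largest scores in each row (order within the top k is irrelevant)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return hits.mean()

# Three samples, three classes; rows are per-class scores
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
])
y_true = np.array([0, 2, 1])

print(top_k_accuracy(y_true, scores, k=1))
print(top_k_accuracy(y_true, scores, k=2))
```

Only the first sample is correct at K=1, while all three true classes fall within the top two, which shows how Top-K credits near misses.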

In [64]:
# Calculate Top-K accuracies for K = 1, 2, 3, 5
k_values = [1, 2, 3, 5]
topk_results = []

for k in k_values:
    # Training set
    train_topk = top_k_accuracy_score(y_train, y_train_proba, k=k, labels=np.arange(len(label_encoder.classes_)))
    # Validation set
    val_topk = top_k_accuracy_score(y_val, y_val_proba, k=k, labels=np.arange(len(label_encoder.classes_)))
    topk_results.append({
        'K': k,
        'Train': train_topk,
        'Validation': val_topk
    })

print(f"{'Top-K':<15} {'Training Set':<25} {'Validation Set':<25} {'Improvement':<15}")
print("-"*80)
for result in topk_results:
    k = result['K']
    train_acc = result['Train']
    val_acc = result['Validation']
    improvement = val_acc - topk_results[0]['Validation']  # vs Top-1
    
    print(f"{'Top-'+str(k):<15} {train_acc:>8.4f} ({train_acc*100:>5.2f}%)      {val_acc:>8.4f} ({val_acc*100:>5.2f}%)      +{improvement*100:>5.2f}%")

Top-K           Training Set              Validation Set            Improvement    
--------------------------------------------------------------------------------
Top-1             0.9652 (96.52%)        0.9563 (95.63%)      + 0.00%
Top-2             0.9970 (99.70%)        0.9939 (99.39%)      + 3.76%
Top-3             0.9990 (99.90%)        0.9980 (99.80%)      + 4.17%
Top-5             1.0000 (100.00%)        1.0000 (100.00%)      + 4.37%


In [65]:
# Get detailed classification report
class_report = classification_report(
    y_val, 
    y_val_pred, 
    target_names=label_encoder.classes_,
    output_dict=True,
    zero_division=0
)

# Extract per-class metrics
class_performance = []
for disease, metrics in class_report.items():
    if disease not in ['accuracy', 'macro avg', 'weighted avg']:
        class_performance.append({
            'Disease': disease,
            'Precision': metrics['precision'],
            'Recall': metrics['recall'],
            'F1-Score': metrics['f1-score'],
            'Support': int(metrics['support'])
        })

# Sort by F1-score
class_performance.sort(key=lambda x: x['F1-Score'], reverse=True)

# Display top performers
print("TOP 5 PERFORMING DISEASES (by F1-Score)")
print(f"{'Disease':<45} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Samples':<8}")
print("-"*95)
for i, perf in enumerate(class_performance[:5], 1):
    print(f"{perf['Disease']:<45} {perf['Precision']:>8.4f}     {perf['Recall']:>8.4f}     {perf['F1-Score']:>8.4f}     {perf['Support']:<8}")

# Display bottom performers
print("\nBOTTOM 5 PERFORMING DISEASES")
print(f"{'Disease':<45} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Samples':<8}")
print("-"*95)
for perf in class_performance[-5:]:
    print(f"{perf['Disease']:<45} {perf['Precision']:>8.4f}     {perf['Recall']:>8.4f}     {perf['F1-Score']:>8.4f}     {perf['Support']:<8}")

# Count perfect predictions
perfect_diseases = sum(1 for p in class_performance if p['F1-Score'] == 1.0)
print(f"\n{perfect_diseases}/{len(class_performance)} diseases achieved perfect F1-score (100%)")


TOP 5 PERFORMING DISEASES (by F1-Score)
Disease                                       Precision    Recall       F1-Score     Samples 
-----------------------------------------------------------------------------------------------
AIDS                                            1.0000       1.0000       1.0000     24      
Acne                                            1.0000       1.0000       1.0000     24      
Arthritis                                       1.0000       1.0000       1.0000     24      
Common Cold                                     1.0000       1.0000       1.0000     24      
Dengue                                          1.0000       1.0000       1.0000     24      

BOTTOM 5 PERFORMING DISEASES
Disease                                       Precision    Recall       F1-Score     Samples 
-----------------------------------------------------------------------------------------------
Fungal infection                                1.0000       0.7917       0.8837

In [66]:
# Generate confusion matrix
cm = confusion_matrix(y_val, y_val_pred)

# Calculate key metrics from confusion matrix
correct_predictions = np.trace(cm)
total_predictions = np.sum(cm)
misclassifications = total_predictions - correct_predictions

print(f"Overall Statistics:")
print(f"  • Total Predictions: {total_predictions}")
print(f"  • Correct Predictions: {correct_predictions} ({correct_predictions/total_predictions*100:.2f}%)")
print(f"  • Misclassifications: {misclassifications} ({misclassifications/total_predictions*100:.2f}%)")

# Analyze which diseases are most problematic
print("\nDISEASES WITH MOST MISCLASSIFICATIONS")
disease_errors = []
for i in range(len(cm)):
    # Row sum minus diagonal = total errors for this disease (false negatives)
    false_negatives = np.sum(cm[i, :]) - cm[i, i]
    # Column sum minus diagonal = times other diseases were classified as this (false positives)
    false_positives = np.sum(cm[:, i]) - cm[i, i]
    total_errors = false_negatives + false_positives
    
    if total_errors > 0:
        disease_errors.append({
            'Disease': label_encoder.classes_[i],
            'False_Negatives': false_negatives,
            'False_Positives': false_positives,
            'Total_Errors': total_errors
        })

disease_errors.sort(key=lambda x: x['Total_Errors'], reverse=True)

print(f"{'Disease':<45} {'False Neg':<12} {'False Pos':<12} {'Total':<10}")
print("-"*80)
for error in disease_errors[:10]:
    print(f"{error['Disease']:<45} {error['False_Negatives']:<12} {error['False_Positives']:<12} {error['Total_Errors']:<10}")

Overall Statistics:
  • Total Predictions: 984
  • Correct Predictions: 941 (95.63%)
  • Misclassifications: 43 (4.37%)

DISEASES WITH MOST MISCLASSIFICATIONS
Disease                                       False Neg    False Pos    Total     
--------------------------------------------------------------------------------
Chronic cholestasis                           0            43           43        
Alcoholic hepatitis                           7            0            7         
Typhoid                                       7            0            7         
Fungal infection                              5            0            5         
Gastroenteritis                               5            0            5         
Allergy                                       3            0            3         
Bronchial Asthma                              3            0            3         
Cervical spondylosis                          3            0            3         
Hepatitis C  

In [67]:
# Identify the most problematic disease
most_confused = disease_errors[0] if disease_errors else None

if most_confused:
    problem_disease = most_confused['Disease']
    problem_idx = np.where(label_encoder.classes_ == problem_disease)[0][0]
    
    print(f"Most Problematic Disease: {problem_disease}")
    print(f"  • Total Errors: {most_confused['Total_Errors']}")
    print(f"  • False Negatives: {most_confused['False_Negatives']} (missed diagnoses)")
    print(f"  • False Positives: {most_confused['False_Positives']} (over-diagnosed)")

Most Problematic Disease: Chronic cholestasis
  • Total Errors: 43
  • False Negatives: 0 (missed diagnoses)
  • False Positives: 43 (over-diagnosed)


In [68]:
# Sample predictions
# Select 5 random samples from validation set
np.random.seed(42)
sample_indices = np.random.choice(len(X_val), size=5, replace=False)

for idx, sample_idx in enumerate(sample_indices, 1):
    true_label = y_val[sample_idx]
    true_disease = label_encoder.classes_[true_label]
    probabilities = y_val_proba[sample_idx]
    
    # Get top 3 predictions
    top_k_indices = np.argsort(probabilities)[::-1][:3]
    
    print(f"{'='*80}")
    print(f"Example {idx}: True Disease = {true_disease}")
    print(f"{'='*80}")
    print(f"{'Rank':<6} {'Predicted Disease':<45} {'Probability':<15}")
    print("-"*80)
    
    for rank, pred_idx in enumerate(top_k_indices, 1):
        pred_disease = label_encoder.classes_[pred_idx]
        probability = probabilities[pred_idx]
        
        # Mark if correct
        marker = "✓ CORRECT" if pred_idx == true_label else ""
        print(f"{rank:<6} {pred_disease:<45} {probability*100:>6.2f}%  {marker}")
    print()

Example 1: True Disease = Allergy
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Allergy                                         2.44%  ✓ CORRECT
2      Chronic cholestasis                             2.44%  
3      Hepatitis D                                     2.44%  

Example 2: True Disease = Varicose veins
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Varicose veins                                  2.44%  ✓ CORRECT
2      Hepatitis D                                     2.44%  
3      Chronic cholestasis                             2.44%  

Example 3: True Disease = Psoriasis
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Psoriasis                         

The probabilities assigned to the top three predicted diseases are all very close, around 2.44% each, which is essentially the 1/41 ≈ 2.44% that a uniform distribution over the 41 classes would give. Printing the full probability distribution shows that the predicted “winning” class is only marginally higher than the others, so the difference is invisible at two decimal places. We therefore display more decimal places so that even subtle differences in the model’s scores become visible. Because the values are so close to uniform, they are best read as a ranking of candidate diseases rather than as calibrated confidence.
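One plausible reason for the near-uniform values, stated here as an assumption about the library rather than something verified in this notebook, is that the boosted vote scores are bounded and are divided by (number of classes - 1) before a softmax is applied. The sketch below uses made-up vote scores to show how that scaling compresses every probability toward 1/41 while still preserving the ranking.

```python
import numpy as np

n_classes = 41
uniform = 1.0 / n_classes  # probability of each class under no preference at all

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Hypothetical normalized vote scores: most classes receive no votes,
# two classes receive some (values chosen for illustration only)
scores = np.full(n_classes, -1.0 / (n_classes - 1))
scores[23] = 0.10
scores[9] = 0.05

# Assumed conversion: scale by (K - 1), then softmax
proba = softmax(scores / (n_classes - 1))
ranking = np.argsort(proba)[::-1][:3]

print(f"Uniform baseline: {uniform:.5f}")
print(f"Max probability:  {proba.max():.5f}")
print(f"Top classes:      {ranking[:2]}")
```

Even though class 23 clearly leads in votes, its probability stays within a fraction of a percentage point of the uniform baseline, which is the same pattern seen in the notebook's output.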

In [69]:
# Check the full probability distribution for first few samples
print("Full probability distributions:")
for i in range(min(3, len(y_val_proba))):
    print(f"Sample {i}: Predicted class {y_val_pred[i]}")
    print(f"Probabilities: {y_val_proba[i]}")
    print(f"Max probability: {y_val_proba[i].max():.5f}")
    print(f"Top 3 classes: {np.argsort(y_val_proba[i])[-3:][::-1]}")
    print("-" * 50)

Full probability distributions:
Sample 0: Predicted class 23
Probabilities: [0.02439298 0.02440716 0.02440701 0.02439564 0.02438462 0.02438616
 0.02440088 0.02439907 0.024375   0.02442795 0.024375   0.024375
 0.024375   0.024386   0.024375   0.02440518 0.02440177 0.02441363
 0.02440209 0.024375   0.02440027 0.0244205  0.024375   0.0244414
 0.024375   0.024375   0.024375   0.024375   0.0244011  0.024375
 0.02440144 0.024375   0.02440177 0.02438737 0.024375   0.024375
 0.024375   0.02439877 0.024375   0.024375   0.02438727]
Max probability: 0.02444
Top 3 classes: [23  9 21]
--------------------------------------------------
Sample 1: Predicted class 14
Probabilities: [0.02439298 0.02440716 0.02440701 0.02439564 0.02438462 0.02438616
 0.02440088 0.02438533 0.02438717 0.02439967 0.024375   0.024375
 0.024375   0.024386   0.02445789 0.02440518 0.02438759 0.02441363
 0.02438706 0.02438983 0.0244153  0.02440565 0.024375   0.02438747
 0.024375   0.024375   0.024375   0.024375   0.02440242 0.02

In [70]:
for idx, sample_idx in enumerate(sample_indices, 1):
    true_label = y_val[sample_idx]
    true_disease = label_encoder.classes_[true_label]
    probabilities = y_val_proba[sample_idx]
    
    # Get top 3 predictions
    top_k_indices = np.argsort(probabilities)[::-1][:3]
    
    print(f"{'='*80}")
    print(f"Example {idx}: True Disease = {true_disease}")
    print(f"{'='*80}")
    print(f"{'Rank':<6} {'Predicted Disease':<45} {'Probability':<15}")
    print("-"*80)
    
    for rank, pred_idx in enumerate(top_k_indices, 1):
        pred_disease = label_encoder.classes_[pred_idx]
        probability = probabilities[pred_idx]
        
        # Mark if correct
        marker = "✓ CORRECT" if pred_idx == true_label else ""
        print(f"{rank:<6} {pred_disease:<45} {probability*100:>6.5f}%  {marker}")
    print()

Example 1: True Disease = Allergy
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Allergy                                       2.44499%  ✓ CORRECT
2      Chronic cholestasis                           2.44279%  
3      Hepatitis D                                   2.44205%  

Example 2: True Disease = Varicose veins
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Varicose veins                                2.44227%  ✓ CORRECT
2      Hepatitis D                                   2.44205%  
3      Chronic cholestasis                           2.44163%  

Example 3: True Disease = Psoriasis
Rank   Predicted Disease                             Probability    
--------------------------------------------------------------------------------
1      Psoriasis                   

In [71]:
summary_table = {
    'Metric': [
        'Validation Accuracy',
        'Validation Precision',
        'Validation Recall',
        'Validation F1-Score',
        'Top-3 Accuracy',
        'Top-5 Accuracy',
        'Perfect F1-Score Diseases',
        'Total Misclassifications',
        'Training Time',
        'Inference Time (per sample)'
    ],
    'Value': [
        f"{val_accuracy:.4f} ({val_accuracy*100:.2f}%)",
        f"{val_precision:.4f}",
        f"{val_recall:.4f}",
        f"{val_f1:.4f}",
        f"{topk_results[2]['Validation']:.4f} ({topk_results[2]['Validation']*100:.2f}%)",
        f"{topk_results[3]['Validation']:.4f} ({topk_results[3]['Validation']*100:.2f}%)",
        f"{perfect_diseases}/{len(class_performance)}",
        f"{misclassifications}/{total_predictions}",
        f"{training_time:.2f} seconds",
        f"{(inference_time/len(X_val))*1000:.2f} ms"
    ]
}

for metric, value in zip(summary_table['Metric'], summary_table['Value']):
    print(f"  {metric:<35}: {value}")


  Validation Accuracy                : 0.9563 (95.63%)
  Validation Precision               : 0.9843
  Validation Recall                  : 0.9563
  Validation F1-Score                : 0.9644
  Top-3 Accuracy                     : 0.9980 (99.80%)
  Top-5 Accuracy                     : 1.0000 (100.00%)
  Perfect F1-Score Diseases          : 24/41
  Total Misclassifications           : 43/984
  Training Time                      : 0.94 seconds
  Inference Time (per sample)        : 0.08 ms


The baseline AdaBoost classifier provides a strong foundation for multi-symptom disease prediction, reaching 95.63% accuracy and a weighted F1-score of 96.44% on the validation set. The gap between training and validation accuracy is under one percentage point (0.0089), which suggests the model generalizes well rather than overfitting. Top-K analysis shows that the correct disease was among the top-3 predictions in 99.80% of validation cases and among the top-5 in all of them, which makes the model well suited to producing ranked diagnostic suggestions.

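Per-class performance was strong for most diseases, with 24 of 41 reaching a perfect F1-score. The errors were highly concentrated, however: chronic cholestasis accounts for 43 false positives and no false negatives, which matches the 43 total misclassifications, so every error was a sample of another disease (most often alcoholic hepatitis or typhoid, 7 each) predicted as chronic cholestasis. This is the clearest area for improvement.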
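To see exactly which true diseases are being mistaken for which predictions, the off-diagonal cells of the confusion matrix can be ranked directly. The following is a minimal sketch with a hypothetical helper (`top_confusions`) and a toy 3×3 matrix. In the notebook it would be called with `cm` and `label_encoder.classes_`.

```python
import numpy as np

def top_confusions(cm, class_names, n=3):
    """Return up to n largest off-diagonal cells as (true, predicted, count) tuples."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flattened indices of the largest cells, largest first
    flat_order = np.argsort(off_diag, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [
        (class_names[r], class_names[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]

# Toy confusion matrix: rows are true classes, columns are predicted classes
cm_toy = np.array([
    [8, 0, 2],
    [0, 10, 0],
    [0, 1, 9],
])
names = ["A", "B", "C"]

for true_name, pred_name, count in top_confusions(cm_toy, names):
    print(f"{true_name} predicted as {pred_name}: {count}")
```

Reading the pairs rather than per-class totals shows the direction of each error, which the false-negative and false-positive sums alone do not.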

### Test Set Construction with Noise

In real-world medical diagnosis, symptoms often overlap across multiple diseases, and many patients present with combinations of symptoms. This complexity arises because:

- Patients can have overlapping symptoms from two or more different conditions, which can lead to ambiguity or even multiple diagnoses at the same time.
- Incomplete or missing symptom information is common in clinical settings—either because patients forget certain details, or early-stage diseases may not present with all expected symptoms.
- Some symptoms are highly non-specific and can appear in many diseases, increasing the chances of diagnostic uncertainty and misclassification.

To truly evaluate model robustness for real-world diagnosis, two distinct test set constructions are used:

1. Blended disease samples test whether the model can handle diagnoses when symptoms from multiple conditions overlap, simulating complex cases where patients have co-occurring conditions or display a mix of symptoms that could reasonably belong to more than one disease.
2. Random noisy samples check the model’s ability to tolerate extra or missing symptoms. This reflects everyday situations where patient data is incomplete or patients report mild symptoms unrelated to their main condition.

Evaluating both aspects provides a comprehensive, realistic picture of how the model will perform in practice.
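As an illustration of the second construction, the sketch below perturbs a multi-hot symptom vector by removing some present symptoms and adding some absent ones. The noise scheme, the function name `add_symptom_noise`, and its parameters are assumptions made for this example. They are not necessarily how the notebook's own noisy test set is built.

```python
import numpy as np

def add_symptom_noise(encoding, n_drop=1, n_add=1, rng=None):
    """Return a copy of a multi-hot vector with n_drop symptoms removed and n_add added."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = encoding.copy()
    present = np.flatnonzero(encoding == 1)
    absent = np.flatnonzero(encoding == 0)
    # Simulate forgotten or not-yet-developed symptoms
    drop = rng.choice(present, size=min(n_drop, len(present)), replace=False)
    noisy[drop] = 0
    # Simulate unrelated symptoms reported alongside the main condition
    add = rng.choice(absent, size=min(n_add, len(absent)), replace=False)
    noisy[add] = 1
    return noisy

# Toy vector with 3 of 8 symptoms present
clean = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
noisy = add_symptom_noise(clean, n_drop=1, n_add=2, rng=np.random.default_rng(0))
print(clean)
print(noisy)
```

Passing a seeded generator keeps the perturbations reproducible, which matters when comparing models on the same noisy samples.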

In [72]:
def get_disease_symptoms(disease_name, df_processed, symptom_cols):
    """Return the unique non-empty symptoms recorded for a disease across all symptom columns."""
    disease_data = df_processed[df_processed['Disease'] == disease_name]
    symptoms = set()
    for col in symptom_cols:
        disease_symptoms = disease_data[col][disease_data[col] != ''].unique()
        symptoms.update(disease_symptoms)
    return list(symptoms)

def create_blended_sample(diseases, symptom_counts, df_processed, symptom_cols):
    """
    Build a synthetic sample by drawing, for each disease in order, up to the
    requested number of its symptoms that have not already been used.
    The first disease in the list is treated as the primary (expected) label.
    """
    blended_symptoms, symptom_sources, used_symptoms = [], [], set()
    for disease, count in zip(diseases, symptom_counts):
        available = [s for s in get_disease_symptoms(disease, df_processed, symptom_cols) if s not in used_symptoms]
        if available:
            n_to_select = min(count, len(available))
            selected = np.random.choice(available, size=n_to_select, replace=False)
            blended_symptoms.extend(selected)
            symptom_sources.extend([disease] * n_to_select)
            used_symptoms.update(selected)
    return {
        'symptoms': blended_symptoms,
        'sources': symptom_sources,
        'primary_disease': diseases[0],
        'diseases': diseases,
        'symptom_counts': symptom_counts,
        'actual_counts': [symptom_sources.count(d) for d in diseases]
    }

def encode_symptom_list(symptom_list, symptom_to_index):
    """Multi-hot encode a list of symptom names, ignoring names outside the vocabulary."""
    encoding = np.zeros(len(symptom_to_index))
    for s in symptom_list:
        if s in symptom_to_index:
            encoding[symptom_to_index[s]] = 1
    return encoding

In [73]:
# Controlled Test Set
test_scenarios = [
    {'name': 'Scenario 1: Pneumonia-dominant with Typhoid', 'diseases': ['Pneumonia', 'Typhoid'], 'symptom_counts': [8, 2], 'expected': 'Pneumonia'},
    {'name': 'Scenario 2: Typhoid-dominant with Gastroenteritis', 'diseases': ['Typhoid', 'Gastroenteritis'], 'symptom_counts': [7, 3], 'expected': 'Typhoid'},
    {'name': 'Scenario 3: Three-way blend (Pneumonia, Bronchial Asthma, Typhoid)', 'diseases': ['Pneumonia', 'Bronchial Asthma', 'Typhoid'], 'symptom_counts': [5, 3, 2], 'expected': 'Pneumonia'},
    {'name': 'Scenario 4: Equal blend (Pneumonia, Typhoid)', 'diseases': ['Pneumonia', 'Typhoid'], 'symptom_counts': [5, 5], 'expected': 'Either Pneumonia or Typhoid'},
    {'name': 'Scenario 5: Bronchial Asthma with minor Gastroenteritis', 'diseases': ['Bronchial Asthma', 'Gastroenteritis'], 'symptom_counts': [6, 2], 'expected': 'Bronchial Asthma'}
]

controlled_test_samples, controlled_test_labels, controlled_test_metadata = [], [], []

for scenario in test_scenarios:
    for _ in range(10):  # 10 samples per scenario
        sample_data = create_blended_sample(scenario['diseases'], scenario['symptom_counts'], df_processed, symptom_cols)
        if len(sample_data['symptoms']) >= 3:
            encoded = encode_symptom_list(sample_data['symptoms'], symptom_to_index)
            controlled_test_samples.append(encoded)
            controlled_test_labels.append(reverse_disease_mapping[sample_data['primary_disease']])
            controlled_test_metadata.append({
                'scenario': scenario['name'],
                'diseases': sample_data['diseases'],
                'requested_counts': sample_data['symptom_counts'],
                'actual_counts': sample_data['actual_counts'],
                'symptoms': sample_data['symptoms'],
                'sources': sample_data['sources'],
                'expected': scenario['expected']
            })

X_test_controlled = np.array(controlled_test_samples)
y_test_controlled = np.array(controlled_test_labels)

print(f"Controlled test samples: {len(X_test_controlled)} ({len(test_scenarios)} scenarios × 10)")

Controlled test samples: 50 (5 scenarios × 10)


In [74]:
# Noisy Test Set
n_noisy_samples = 200
noise_sample_indices = np.random.choice(len(X_val), size=n_noisy_samples, replace=False)
X_test_noisy, y_test_noisy, noise_metadata = [], [], []

for idx in noise_sample_indices:
    original_sample, true_label = X_val[idx].copy(), y_val[idx]
    orig_symptoms_idxs = np.where(original_sample == 1)[0]
    orig_symptoms = [symptom_list[i] for i in orig_symptoms_idxs]
    random_disease = np.random.choice(label_encoder.classes_)
    while random_disease == label_encoder.classes_[true_label]:
        random_disease = np.random.choice(label_encoder.classes_)
    noise_symptoms = get_disease_symptoms(random_disease, df_processed, symptom_cols)
    unique_noise_symptoms = [s for s in noise_symptoms if s not in set(orig_symptoms)]
    if len(unique_noise_symptoms) > 0:
        n_noise = np.random.randint(1, 4)
        n_to_select = min(n_noise, len(unique_noise_symptoms))
        selected_noise = np.random.choice(unique_noise_symptoms, size=n_to_select, replace=False)
        noisy_sample = original_sample.copy()
        for s in selected_noise:
            if s in symptom_to_index:
                noisy_sample[symptom_to_index[s]] = 1
        X_test_noisy.append(noisy_sample)
        y_test_noisy.append(true_label)
        noise_metadata.append({
            'original_disease': label_encoder.classes_[true_label],
            'noise_disease': random_disease,
            'n_original_symptoms': len(orig_symptoms),
            'n_noise_symptoms': len(selected_noise),
            'noise_level': len(selected_noise) / (len(orig_symptoms) + len(selected_noise)),
            'original_symptoms': orig_symptoms,
            'noise_symptoms': list(selected_noise)})

X_test_noisy = np.array(X_test_noisy)
y_test_noisy = np.array(y_test_noisy)
print(f"Random noisy test samples: {len(X_test_noisy)}")

Random noisy test samples: 200


In [75]:
# Evaluate models
def report_metrics(X_data, y_true, model, set_name):
    y_pred = model.predict(X_data)
    y_proba = model.predict_proba(X_data)
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return acc, f1, y_pred, y_proba

val_acc, val_f1, _, _ = report_metrics(X_val, y_val, base_adaboost, "Validation")
controlled_acc, controlled_f1, y_controlled_pred, y_controlled_proba = report_metrics(X_test_controlled, y_test_controlled, base_adaboost, "Controlled Blended")
noisy_acc, noisy_f1, y_noisy_pred, y_noisy_proba = report_metrics(X_test_noisy, y_test_noisy, base_adaboost, "Random Noisy")

In [76]:
# Scenario Results
print("1. Controlled Blended Samples")
for scenario in test_scenarios:
    scenario_name = scenario['name']
    scenario_samples = [i for i, meta in enumerate(controlled_test_metadata) if meta['scenario'] == scenario_name]
    print(f"\n{scenario_name}")
    if scenario_samples:
        sample_idx = scenario_samples[0]
        m = controlled_test_metadata[sample_idx]
        true_label = y_test_controlled[sample_idx]
        true_disease = label_encoder.classes_[true_label]
        proba = y_controlled_proba[sample_idx]
        top_k_indices = np.argsort(proba)[::-1][:5]

        symptom_by_source = {}
        for sym, src in zip(m['symptoms'], m['sources']):
            symptom_by_source.setdefault(src, []).append(sym)
        for d in m['diseases']:
            if d in symptom_by_source:
                print(f"  From {d}:")
                for i, s in enumerate(symptom_by_source[d], 1):
                    print(f"    {i}. {s}")

        print(f"\n{'Rank':<6} {'Predicted Disease':<40} {'Probability':<15} {'Status'}")
        print("-"*80)
        for rank, pred_idx in enumerate(top_k_indices, 1):
            pred_dis = label_encoder.classes_[pred_idx]
            prob = proba[pred_idx]
            status = ''
            if pred_dis == true_disease:
                status = "✓ PRIMARY DISEASE"
            elif pred_dis in m['diseases']:
                status = "○ BLENDED DISEASE"
            print(f"{rank:<6} {pred_dis:<40} {prob*100:>8.5f}%      {status}")


1. Controlled Blended Samples

Scenario 1: Pneumonia-dominant with Typhoid
  From Pneumonia:
    1. chest_pain
    2. fast_heart_rate
    3. chills
    4. phlegm
    5. breathlessness
    6. fatigue
    7. rusty_sputum
    8. malaise
  From Typhoid:
    1. abdominal_pain
    2. diarrhoea

Rank   Predicted Disease                        Probability     Status
--------------------------------------------------------------------------------
1      Pneumonia                                 2.44403%      ✓ PRIMARY DISEASE
2      Chronic cholestasis                       2.44279%      
3      Jaundice                                  2.44159%      
4      Paralysis (brain hemorrhage)              2.44155%      
5      Gastroenteritis                           2.44136%      

Scenario 2: Typhoid-dominant with Gastroenteritis
  From Typhoid:
    1. constipation
    2. chills
    3. diarrhoea
    4. headache
    5. vomiting
    6. toxic_look_(typhos)
    7. fatigue
  From Gastroenteritis:
    1

In [77]:
# Noisy Samples
print("2. Random Noisy Samples")
for i in range(min(3, len(noise_metadata))):
    meta = noise_metadata[i]
    true_label = y_test_noisy[i]
    true_disease = label_encoder.classes_[true_label]
    probabilities = y_noisy_proba[i]
    pred_label = y_noisy_pred[i]
    pred_disease = label_encoder.classes_[pred_label]
    print(f"\nExample {i+1}:")
    print(f"  Original Disease: {meta['original_disease']}")
    print(f"  Original Symptoms ({meta['n_original_symptoms']}): {', '.join(meta['original_symptoms'][:5])}...")
    print(f"  Added Noise from {meta['noise_disease']} ({meta['n_noise_symptoms']}): {', '.join(meta['noise_symptoms'])}")
    print(f"  Noise Level: {meta['noise_level']*100:.1f}%")
    print(f"  Predicted: {pred_disease} {'✓' if pred_label == true_label else '✗'}")
    print(f"  Top prediction probability: {probabilities[pred_label]*100:.5f}%")

2. Random Noisy Samples

Example 1:
  Original Disease: Pneumonia
  Original Symptoms (10): breathlessness, chest_pain, cough, fast_heart_rate, fatigue...
  Added Noise from AIDS (1): extra_marital_contacts
  Noise Level: 9.1%
  Predicted: Pneumonia ✓
  Top prediction probability: 2.44403%

Example 2:
  Original Disease: Urinary tract infection
  Original Symptoms (3): bladder_discomfort, burning_micturition, foul_smell_of urine...
  Added Noise from Hypothyroidism (2): brittle_nails, fatigue
  Noise Level: 40.0%
  Predicted: Urinary tract infection ✓
  Top prediction probability: 2.44295%

Example 3:
  Original Disease: Bronchial Asthma
  Original Symptoms (5): breathlessness, family_history, fatigue, high_fever, mucoid_sputum...
  Added Noise from Urinary tract infection (3): foul_smell_of urine, burning_micturition, bladder_discomfort
  Noise Level: 37.5%
  Predicted: Bronchial Asthma ✓
  Top prediction probability: 2.44394%
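
The top probabilities above all sit near 2.44%, which is roughly what a uniform guess over 41 classes would give (1/41 ≈ 2.439%). The class count of 41 is an assumption inferred from that value, not something printed in this output. AdaBoost's `predict_proba` can produce flattened values like these when there are many classes, so the ranking is more informative than the magnitude. A small sketch, using a hypothetical helper `margin_over_uniform`, of expressing a probability relative to the uniform baseline:

```python
def margin_over_uniform(probability: float, n_classes: int) -> float:
    """Return how far a class probability sits above the uniform baseline 1/n_classes."""
    if n_classes <= 0:
        raise ValueError("n_classes must be positive")
    return probability - 1.0 / n_classes

N_CLASSES = 41  # assumption: inferred from 1/41 ≈ 2.439%, not confirmed by the notebook

# Top-1 probabilities copied from the noisy examples above
examples = [
    ("Pneumonia", 0.0244403),
    ("Urinary tract infection", 0.0244295),
    ("Bronchial Asthma", 0.0244394),
]
for disease, prob in examples:
    margin = margin_over_uniform(prob, N_CLASSES)
    print(f"{disease:<25} margin over uniform: {margin * 100:+.5f} percentage points")
```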


In [78]:
# Metrics Summary
print(f"{'Test Set Type':<25} {'Accuracy':<15} {'F1-Score':<15} {'Performance'}")
print("-"*80)
print(f"{'Clean (Validation)':<25} {val_acc:>8.4f} ({val_acc*100:>5.2f}%)  {val_f1:>8.4f}     Baseline")
print(f"{'1. Controlled Blended':<25} {controlled_acc:>8.4f} ({controlled_acc*100:>5.2f}%)  {controlled_f1:>8.4f}     {(controlled_acc-val_acc)*100:>+5.2f}%")
print(f"{'2. Random Noisy':<25} {noisy_acc:>8.4f} ({noisy_acc*100:>5.2f}%)  {noisy_f1:>8.4f}     {(noisy_acc-val_acc)*100:>+5.2f}%")

Test Set Type             Accuracy        F1-Score        Performance
--------------------------------------------------------------------------------
Clean (Validation)          0.9563 (95.63%)    0.9644     Baseline
1. Controlled Blended       0.3400 (34.00%)    0.3812     -61.63%
2. Random Noisy             0.8800 (88.00%)    0.8905     -7.63%


The baseline AdaBoost model performed well on the random noisy cases, reaching 88.00% accuracy (7.63 percentage points below the clean validation baseline), but struggled with the controlled blended samples, where accuracy dropped to 34.00%. These results show the model is robust to a few added symptoms, but symptom blends from multiple diseases pose a significant challenge. In the sample shown above the primary disease still ranked first, yet the near-uniform probabilities and the low blended accuracy highlight the need for more advanced approaches in clinical prediction.
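
The aggregate figure for the blended set hides differences between scenarios. Scenario 4 is also scored against Pneumonia alone, even though its expected outcome is either disease. A hedged sketch of a per-scenario breakdown that accepts a set of diseases per sample is given here; the function name `per_scenario_accuracy` and the toy predictions are illustrative and are not outputs of the model.

```python
from collections import defaultdict

def per_scenario_accuracy(scenarios, predictions, accepted):
    """Accuracy per scenario, counting a prediction as correct
    when it falls in that sample's set of accepted diseases."""
    hits, totals = defaultdict(int), defaultdict(int)
    for scenario, pred, ok_set in zip(scenarios, predictions, accepted):
        totals[scenario] += 1
        hits[scenario] += pred in ok_set  # bool counts as 0 or 1
    return {name: hits[name] / totals[name] for name in totals}

# Toy data for illustration only (not model output)
scenarios = ["S1", "S1", "S4", "S4"]
predictions = ["Pneumonia", "Typhoid", "Typhoid", "Jaundice"]
accepted = [
    {"Pneumonia"},
    {"Pneumonia"},
    {"Pneumonia", "Typhoid"},  # equal blend: either disease is accepted
    {"Pneumonia", "Typhoid"},
]

print(per_scenario_accuracy(scenarios, predictions, accepted))
# {'S1': 0.5, 'S4': 0.5}
```

In the notebook, the scenario names and disease lists would come from `controlled_test_metadata`, and the predicted names from `label_encoder.classes_[y_controlled_pred]`.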

### Hyperparameter Optimization
- Define parameter grid
- GridSearchCV with cross-validation 
- Compare CV scores across parameter combinations

Objective: Optimize AdaBoost parameters to improve upon baseline performance.

Strategy:
1. Define a comprehensive parameter grid
2. Use 5-fold cross-validation for robust evaluation
3. Optimize for weighted F1-score, which averages per-class F1 in proportion to each class's support
4. Compare results with the baseline model
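
Steps 1 and 2 together determine the cost of the search: an exhaustive grid fits one model per parameter combination per fold. A short sketch of that count using only the standard library, with values taken from the grid defined in the following cell:

```python
from itertools import product

# Same values as the grid used for the search
param_grid = {
    "estimator__max_depth": [4, 5, 6, 7],
    "estimator__min_samples_split": [2, 5],
    "estimator__min_samples_leaf": [1, 2, 4],
    "n_estimators": [75, 100, 150, 200],
    "learning_rate": [0.5, 0.8, 1.0],
}
n_folds = 5

# Every combination is one candidate; each candidate is fit once per fold
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * n_folds

print(f"Candidates: {len(combinations)}")  # Candidates: 288
print(f"Model fits: {total_fits}")         # Model fits: 1440
```

Adding one more value to any list multiplies the total, so the grid is kept deliberately small around the baseline settings.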

In [79]:
# Define parameter grid for GridSearchCV
param_grid = {
    # Base estimator parameters (Decision Tree)
    'estimator__max_depth': [4, 5, 6, 7],
    'estimator__min_samples_split': [2, 5],
    'estimator__min_samples_leaf': [1, 2, 4],
    
    # AdaBoost parameters
    'n_estimators': [75, 100, 150, 200],
    'learning_rate': [0.5, 0.8, 1.0]
}

print("Parameter Grid:")
print("-" * 80)
for param, values in param_grid.items():
    print(f"  {param:<35}: {values}")

# Calculate total combinations
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)

print(f"\n{'Total Combinations:':<35} {total_combinations}")
print(f"{'With 5-fold CV:':<35} {total_combinations * 5} model fits")

Parameter Grid:
--------------------------------------------------------------------------------
  estimator__max_depth               : [4, 5, 6, 7]
  estimator__min_samples_split       : [2, 5]
  estimator__min_samples_leaf        : [1, 2, 4]
  n_estimators                       : [75, 100, 150, 200]
  learning_rate                      : [0.5, 0.8, 1.0]

Total Combinations:                 288
With 5-fold CV:                     1440 model fits


In [80]:
# Create base estimator 
base_estimator = DecisionTreeClassifier(random_state=42)

# Create AdaBoost classifier
ada_clf = AdaBoostClassifier(
    estimator=base_estimator,
    random_state=42
)

# Define scoring metrics
scoring = {
    'accuracy': 'accuracy',
    'f1_weighted': make_scorer(f1_score, average='weighted', zero_division=0),
    'f1_macro': make_scorer(f1_score, average='macro', zero_division=0)
}

# Create GridSearchCV
grid_search = GridSearchCV(
    estimator=ada_clf,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring=scoring,
    refit='f1_weighted',  # Optimize for weighted F1-score
    n_jobs=-1,  
    verbose=2,  
    return_train_score=True
)

print("GridSearchCV Configuration:")
print(f"  • Cross-Validation: 5-fold")
print(f"  • Optimization Metric: F1-Score (weighted)")
print(f"  • Additional Metrics: Accuracy, F1-Macro")
print(f"  • Parallel Jobs: All available cores")
print(f"  • Total fits: {total_combinations * 5}")

GridSearchCV Configuration:
  • Cross-Validation: 5-fold
  • Optimization Metric: F1-Score (weighted)
  • Additional Metrics: Accuracy, F1-Macro
  • Parallel Jobs: All available cores
  • Total fits: 1440


In [81]:
# Run grid search
start_time = time.time()
grid_search.fit(X_train, y_train)
optimization_time = time.time() - start_time

print(f"Grid Search Completed in {optimization_time/60:.2f} minutes")

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Grid Search Completed in 17.12 minutes


In [82]:
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Optimal Hyperparameters:")
print("\nBase Estimator (Decision Tree):")
print(f"  • max_depth:         {best_params['estimator__max_depth']}")
print(f"  • min_samples_split: {best_params['estimator__min_samples_split']}")
print(f"  • min_samples_leaf:  {best_params['estimator__min_samples_leaf']}")

print("\nAdaBoost Parameters:")
print(f"  • n_estimators:      {best_params['n_estimators']}")
print(f"  • learning_rate:     {best_params['learning_rate']}")

print("\nCross-Validation Performance:")
print(f"  • Best CV F1-Score:  {best_score:.4f} ({best_score*100:.2f}%)")

Optimal Hyperparameters:

Base Estimator (Decision Tree):
  • max_depth:         6
  • min_samples_split: 5
  • min_samples_leaf:  1

AdaBoost Parameters:
  • n_estimators:      75
  • learning_rate:     1.0

Cross-Validation Performance:
  • Best CV F1-Score:  1.0000 (100.00%)


In [83]:
# Get best model from grid search
optimized_model = grid_search.best_estimator_

# Predictions with optimized model
y_train_opt_pred = optimized_model.predict(X_train)
y_val_opt_pred = optimized_model.predict(X_val)

# Calculate metrics
train_opt_accuracy = accuracy_score(y_train, y_train_opt_pred)
val_opt_accuracy = accuracy_score(y_val, y_val_opt_pred)
train_opt_f1 = f1_score(y_train, y_train_opt_pred, average='weighted', zero_division=0)
val_opt_f1 = f1_score(y_val, y_val_opt_pred, average='weighted', zero_division=0)

# Training metrics
print(f"{'Training Accuracy':<25} {train_accuracy:.4f}        {train_opt_accuracy:.4f}        "
      f"{(train_opt_accuracy - train_accuracy):+.4f}")
print(f"{'Training F1-Score':<25} {train_f1:.4f}        {train_opt_f1:.4f}        "
      f"{(train_opt_f1 - train_f1):+.4f}")

# Validation metrics
print(f"{'Validation Accuracy':<25} {val_accuracy:.4f}        {val_opt_accuracy:.4f}        "
      f"{(val_opt_accuracy - val_accuracy):+.4f}")
print(f"{'Validation F1-Score':<25} {val_f1:.4f}        {val_opt_f1:.4f}        "
      f"{(val_opt_f1 - val_f1):+.4f}")

# Overfitting analysis
baseline_gap = train_accuracy - val_accuracy
optimized_gap = train_opt_accuracy - val_opt_accuracy

print(f"{'Overfitting Gap':<25} {baseline_gap:.4f}        {optimized_gap:.4f}        "
      f"{(optimized_gap - baseline_gap):+.4f}")

print("\n" + "-" * 80)

# Determine improvement
improvement_pct = (val_opt_accuracy - val_accuracy) * 100
if improvement_pct > 0:
    print(f"✓ Optimized model improved validation accuracy by {improvement_pct:.2f}%")
elif improvement_pct == 0:
    print(f"→ Optimized model maintains baseline performance")
else:
    print(f"⚠ Optimized model decreased by {abs(improvement_pct):.2f}% "
          f"(may indicate overfitting to CV folds)")

Training Accuracy         0.9652        1.0000        +0.0348
Training F1-Score         0.9708        1.0000        +0.0292
Validation Accuracy       0.9563        1.0000        +0.0437
Validation F1-Score       0.9644        1.0000        +0.0356
Overfitting Gap           0.0089        0.0000        -0.0089

--------------------------------------------------------------------------------
✓ Optimized model improved validation accuracy by 4.37%


In [84]:
# Get probabilities from optimized model
y_val_opt_proba = optimized_model.predict_proba(X_val)

k_values = [1, 2, 3, 5]
topk_optimized = []

for k in k_values:
    topk_acc = top_k_accuracy_score(y_val, y_val_opt_proba, k=k, 
                                     labels=np.arange(len(label_encoder.classes_)))
    topk_optimized.append(topk_acc)

print("Top-K Accuracy Comparison:")
print("-" * 80)
print(f"{'K':<10} {'Baseline':<15} {'Optimized':<15} {'Improvement':<15}")
print("-" * 80)

for k, baseline_result, opt_acc in zip(k_values, topk_results, topk_optimized):
    baseline_acc = baseline_result['Validation']
    improvement = opt_acc - baseline_acc
    print(f"{'Top-'+str(k):<10} {baseline_acc:.4f}        {opt_acc:.4f}        "
          f"{improvement:+.4f} ({improvement*100:+.2f}%)")

Top-K Accuracy Comparison:
--------------------------------------------------------------------------------
K          Baseline        Optimized       Improvement    
--------------------------------------------------------------------------------
Top-1      0.9563        1.0000        +0.0437 (+4.37%)
Top-2      0.9939        1.0000        +0.0061 (+0.61%)
Top-3      0.9980        1.0000        +0.0020 (+0.20%)
Top-5      1.0000        1.0000        +0.0000 (+0.00%)
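
Top-K accuracy counts a sample as correct when its true class is among the K highest-probability classes. A minimal pure-Python sketch of that definition follows, on made-up probabilities; tie handling may differ from scikit-learn's `top_k_accuracy_score`, and the helper name `top_k_accuracy` is illustrative.

```python
def top_k_accuracy(y_true, y_proba, k):
    """Share of samples whose true class index is among the k largest probabilities."""
    hits = 0
    for true_idx, row in zip(y_true, y_proba):
        # Class indices ordered from most to least probable
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(y_true)

# Made-up example with three classes
y_true = [0, 2, 1]
y_proba = [
    [0.60, 0.30, 0.10],  # true class 0 is ranked first
    [0.50, 0.30, 0.20],  # true class 2 is ranked third
    [0.40, 0.35, 0.25],  # true class 1 is ranked second
]

for k in (1, 2, 3):
    print(f"Top-{k}: {top_k_accuracy(y_true, y_proba, k):.4f}")
```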


In [85]:
optimized_config = {
    'model_name': 'Optimized AdaBoost Classifier',
    'base_estimator': 'Decision Tree',
    'hyperparameters': best_params,
    'cv_f1_score': best_score,
    'validation_accuracy': val_opt_accuracy,
    'validation_f1_score': val_opt_f1,
    'training_time': optimization_time,
    'improvement_over_baseline': val_opt_accuracy - val_accuracy
}

print("Optimized Model Configuration:")
print("-" * 80)
for key, value in optimized_config.items():
    if isinstance(value, dict):
        print(f"{key}:")
        for k, v in value.items():
            print(f"  {k}: {v}")
    elif isinstance(value, float):
        if 'time' in key.lower():
            print(f"{key}: {value:.2f} seconds")
        else:
            print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

Optimized Model Configuration:
--------------------------------------------------------------------------------
model_name: Optimized AdaBoost Classifier
base_estimator: Decision Tree
hyperparameters:
  estimator__max_depth: 6
  estimator__min_samples_leaf: 1
  estimator__min_samples_split: 5
  learning_rate: 1.0
  n_estimators: 75
cv_f1_score: 1.0000
validation_accuracy: 1.0000
validation_f1_score: 1.0000
training_time: 1027.48 seconds
improvement_over_baseline: 0.0437


### Final Model Evaluation
- Use the best estimator, which GridSearchCV has already refit on the full training set with the best hyperparameters
- Compare baseline and optimized model

In [86]:
def compute_metrics(y_true, y_pred, y_proba, label_encoder):
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred, average='weighted', zero_division=0),
        'top3': top_k_accuracy_score(y_true, y_proba, k=3, labels=np.arange(len(label_encoder.classes_))),
        'top5': top_k_accuracy_score(y_true, y_proba, k=5, labels=np.arange(len(label_encoder.classes_))),
        'errors': np.sum(y_true != y_pred)
    }

sets_and_models = [
    ("Clean Validation", y_val, X_val, base_adaboost),
    ("Random Noisy", y_test_noisy, X_test_noisy, base_adaboost),
    ("Blended Controlled", y_test_controlled, X_test_controlled, base_adaboost),
    ("Clean Validation", y_val, X_val, optimized_model),
    ("Random Noisy", y_test_noisy, X_test_noisy, optimized_model),
    ("Blended Controlled", y_test_controlled, X_test_controlled, optimized_model),
]

comparison_data = []

for name, y_true, X, model in sets_and_models:
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)
    metrics = compute_metrics(y_true, y_pred, y_proba, label_encoder)
    model_type = "Optimized" if model is optimized_model else "Baseline"
    comparison_data.append({
        'Test Set': name,
        'Samples': len(y_true),
        'Model': model_type,
        'Accuracy': metrics['accuracy'],
        'F1-Score': metrics['f1'],
        'Top-3 Acc': metrics['top3'],
        'Top-5 Acc': metrics['top5'],
        'Errors': metrics['errors']
    })

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "-"*95)
print(f"{'Test Set':<20} {'Model':<12} {'Samples':<10} {'Accuracy':<12} {'F1-Score':<12} "
      f"{'Top-3':<10} {'Top-5':<10} {'Errors':<8}")
print("-"*95)
for _, row in comparison_df.iterrows():
    print(f"{row['Test Set']:<20} {row['Model']:<12} {row['Samples']:<10} "
          f"{row['Accuracy']:.4f}       {row['F1-Score']:.4f}       "
          f"{row['Top-3 Acc']:.4f}    {row['Top-5 Acc']:.4f}    {row['Errors']:<8}")
print("-"*95)

optimized_val_pred = optimized_model.predict(X_val)
optimized_val_proba = optimized_model.predict_proba(X_val)
optimized_clean = compute_metrics(y_val, optimized_val_pred, optimized_val_proba, label_encoder)
optimized_noisy_pred = optimized_model.predict(X_test_noisy)
optimized_noisy_proba = optimized_model.predict_proba(X_test_noisy)
optimized_noisy = compute_metrics(y_test_noisy, optimized_noisy_pred, optimized_noisy_proba, label_encoder)

accuracy_drop = (optimized_clean['accuracy'] - optimized_noisy['accuracy']) * 100
f1_drop = (optimized_clean['f1'] - optimized_noisy['f1']) * 100
print(f"Accuracy drop from clean to noisy: {accuracy_drop:.2f}%")
print(f"F1-score drop from clean to noisy: {f1_drop:.2f}%")


-----------------------------------------------------------------------------------------------
Test Set             Model        Samples    Accuracy     F1-Score     Top-3      Top-5      Errors  
-----------------------------------------------------------------------------------------------
Clean Validation     Baseline     984        0.9563       0.9644       0.9980    1.0000    43      
Random Noisy         Baseline     200        0.8800       0.8905       0.9800    0.9950    24      
Blended Controlled   Baseline     50         0.3400       0.3812       0.6400    0.6600    33      
Clean Validation     Optimized    984        1.0000       1.0000       1.0000    1.0000    0       
Random Noisy         Optimized    200        0.9600       0.9601       1.0000    1.0000    8       
Blended Controlled   Optimized    50         0.4200       0.4687       0.6800    0.8000    29      
-----------------------------------------------------------------------------------------------
Accuracy 

### Interactive Demo

In [87]:
import re
from difflib import get_close_matches

def normalize_symptom(symptom_input):
    """
    Normalize user input to match dataset symptom format.
    Handles: spaces, underscores, case, extra whitespace, typos
    """
    # Convert to lowercase
    normalized = symptom_input.lower().strip()
    
    # Replace multiple spaces with single space
    normalized = re.sub(r'\s+', ' ', normalized)
    
    # Replace spaces with underscores
    normalized = normalized.replace(' ', '_')
    
    # Remove any leading/trailing underscores
    normalized = normalized.strip('_')
    
    return normalized

def find_best_match(user_symptom, symptom_list, threshold=0.6):
    """
    Find best matching symptom from the vocabulary.
    Uses fuzzy matching to handle typos and variations.
    
    Args:
        user_symptom: User's input symptom
        symptom_list: List of valid symptoms
        threshold: Similarity threshold (0-1)
    
    Returns:
        Tuple of (matched_symptom, confidence_score) or (None, 0)
    """
    # First, try exact match
    normalized = normalize_symptom(user_symptom)
    if normalized in symptom_list:
        return (normalized, 1.0)
    
    # Try fuzzy matching
    matches = get_close_matches(normalized, symptom_list, n=1, cutoff=threshold)
    if matches:
        return (matches[0], 0.8)  # Return with 0.8 confidence for fuzzy match
    
    return (None, 0.0)

def get_symptom_suggestions(partial_input, symptom_list, n=5):
    """Get symptom suggestions based on partial input."""
    normalized = normalize_symptom(partial_input)
    
    # Find symptoms that start with the input
    starts_with = [s for s in symptom_list if s.startswith(normalized)]
    
    # If not enough, use fuzzy matching
    if len(starts_with) < n:
        fuzzy_matches = get_close_matches(normalized, symptom_list, n=n, cutoff=0.4)
        starts_with.extend([s for s in fuzzy_matches if s not in starts_with])
    
    return starts_with[:n]

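# Prediction function
def predict_disease(symptoms_input, model, symptom_to_index, label_encoder, 
                   symptom_list, top_k=5, show_confidence=True):
    """
    Predict the most likely diseases for a list of free-text symptoms.

    Each symptom is matched against symptom_list (exact, then fuzzy),
    encoded as a binary vector, and passed to model.predict_proba.
    Returns a dict with the top_k predictions plus the matched,
    unmatched, and fuzzy-matched symptoms.
    Note: show_confidence is currently unused.
    """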
    # Normalize and match symptoms
    matched_symptoms = []
    unmatched_symptoms = []
    fuzzy_matches = []
    
    for symptom in symptoms_input:
        matched, confidence = find_best_match(symptom, symptom_list)
        
        if matched:
            matched_symptoms.append(matched)
            if confidence < 1.0:
                fuzzy_matches.append({
                    'input': symptom,
                    'matched': matched,
                    'confidence': confidence
                })
        else:
            unmatched_symptoms.append(symptom)
    
    # Create encoding
    encoding = np.zeros(len(symptom_to_index))
    for symptom in matched_symptoms:
        if symptom in symptom_to_index:
            encoding[symptom_to_index[symptom]] = 1
    
    # Make prediction
    encoding_2d = encoding.reshape(1, -1)
    probabilities = model.predict_proba(encoding_2d)[0]
    
    # Get top k predictions
    top_k_indices = np.argsort(probabilities)[::-1][:top_k]
    
    predictions = []
    for rank, idx in enumerate(top_k_indices, 1):
        predictions.append({
            'rank': rank,
            'disease': label_encoder.classes_[idx],
            'probability': probabilities[idx]
        })
    
    return {
        'predictions': predictions,
        'matched_symptoms': matched_symptoms,
        'unmatched_symptoms': unmatched_symptoms,
        'fuzzy_matches': fuzzy_matches,
        'total_symptoms': len(matched_symptoms)
    }


def display_predictions(result):
    """Display prediction results in a formatted way."""
    print("\n" + "="*80)
    print("DIAGNOSIS RESULTS")
    print("="*80)
    
    # Show matched symptoms
    print(f"\n✓ Recognized Symptoms ({result['total_symptoms']}):")
    for i, symptom in enumerate(result['matched_symptoms'], 1):
        print(f"  {i}. {symptom}")
    
    # Show fuzzy matches if any
    if result['fuzzy_matches']:
        print(f"\n⚠ Approximate Matches (please verify):")
        for match in result['fuzzy_matches']:
            print(f"  • '{match['input']}' matched to '{match['matched']}' (confidence: {match['confidence']*100:.0f}%)")
    
    # Show unmatched symptoms if any
    if result['unmatched_symptoms']:
        print(f"\n✗ Unrecognized Symptoms:")
        for symptom in result['unmatched_symptoms']:
            print(f"  • {symptom}")
            # Show suggestions
            suggestions = get_symptom_suggestions(symptom, symptom_list, n=3)
            if suggestions:
                print(f"    Did you mean: {', '.join(suggestions)}?")
    
    # Show predictions
    print(f"\n" + "-"*80)
    print("TOP {0} PREDICTED DISEASES:".format(len(result['predictions'])))
    print("-"*80)
    print(f"{'Rank':<6} {'Disease':<45} {'Probability':<15}")
    print("-"*80)
        
    for pred in result['predictions']:        
        print(f"{pred['rank']:<6} {pred['disease']:<45} {pred['probability']*100:>6.5f}%")
    
    print("-"*80)
    
    # Clinical advice
    print("\n⚠ IMPORTANT MEDICAL DISCLAIMER:")
    print("This is an AI prediction system for educational purposes only.")
    print("Always consult qualified healthcare professionals for medical diagnosis.")
    
    return result
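
`find_best_match` above reports a fixed 0.8 confidence for every fuzzy match, so a near-exact typo and a loose match look identical to the user. One possible refinement, sketched here under the assumption that the input is already normalized, is to report the `difflib.SequenceMatcher` ratio of the chosen match instead. The function name `match_with_score` and the tiny vocabulary are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

def match_with_score(normalized_symptom, vocabulary, cutoff=0.6):
    """Return (best_match, similarity), or (None, 0.0) when nothing clears the cutoff.

    Similarity is the SequenceMatcher ratio in [0, 1], the same measure
    get_close_matches compares against its cutoff.
    """
    if normalized_symptom in vocabulary:
        return normalized_symptom, 1.0
    candidates = get_close_matches(normalized_symptom, vocabulary, n=1, cutoff=cutoff)
    if not candidates:
        return None, 0.0
    best = candidates[0]
    # Same argument order get_close_matches uses internally (candidate first)
    return best, SequenceMatcher(None, best, normalized_symptom).ratio()

vocab = ["high_fever", "headache", "stiff_neck"]  # tiny illustrative vocabulary

print(match_with_score("headache", vocab))  # exact match
print(match_with_score("hedache", vocab))   # typo, fuzzy match
print(match_with_score("zzz", vocab))       # no match
```

The returned score could replace the constant in the `fuzzy_matches` entries, so the "please verify" message reflects how close each match actually was.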

In [88]:
def run_interactive_prediction():
    print("\n" + "="*60)
    print("DISEASE PREDICTION - INTERACTIVE SESSION")
    print("="*60)
    print("Enter symptoms separated by commas. Type 'quit' to exit.")

    while True:
        user_input = input("\nEnter symptoms separated by commas. Type 'quit' to exit.: ").strip()
        if user_input.lower() in ['quit', 'exit', 'q']:
            print("\nThank you for using the Disease Prediction System!")
            break

        symptoms_input = [s.strip() for s in user_input.split(",") if s.strip()]
        if not symptoms_input:
            print("⚠ Please enter at least one symptom.")
            continue

        result = predict_disease(
            symptoms_input=symptoms_input,
            model=optimized_model, 
            symptom_to_index=symptom_to_index,
            label_encoder=label_encoder,
            symptom_list=symptom_list,
            top_k=5
        )

        display_predictions(result)

In [89]:
# To start the interactive session, call:
run_interactive_prediction()


DISEASE PREDICTION - INTERACTIVE SESSION
Enter symptoms separated by commas. Type 'quit' to exit.



Enter symptoms separated by commas. Type 'quit' to exit.:  high fever, patches in throat, extra marital contacts



DIAGNOSIS RESULTS

✓ Recognized Symptoms (3):
  1. high_fever
  2. patches_in_throat
  3. extra_marital_contacts

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      AIDS                                          2.44891%
2      Gastroenteritis                               2.44353%
3      Heart attack                                  2.44326%
4      Hepatitis C                                   2.44167%
5      Acne                                          2.44150%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always consult qualified healthcare professionals for medical diagnosis.


Enter symptoms separated by commas. Type 'quit' to exit.:  vomiting, headache, weakness of one body side 



DIAGNOSIS RESULTS

✓ Recognized Symptoms (3):
  1. vomiting
  2. headache
  3. weakness_of_one_body_side

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      Paralysis (brain hemorrhage)                  2.44862%
2      Gastroenteritis                               2.44353%
3      Heart attack                                  2.44326%
4      Hepatitis C                                   2.44245%
5      Chronic cholestasis                           2.44173%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always consult qualified healthcare professionals for medical diagnosis.


Enter symptoms separated by commas. Type 'quit' to exit.:  excessive hunger, stiff neck, depression, irritability



DIAGNOSIS RESULTS

✓ Recognized Symptoms (4):
  1. excessive_hunger
  2. stiff_neck
  3. depression
  4. irritability

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      Gastroenteritis                               2.44353%
2      Heart attack                                  2.44326%
3      Chronic cholestasis                           2.44249%
4      Hepatitis C                                   2.44245%
5      Acne                                          2.44241%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always consult qualified healthcare professionals for m


Enter symptoms separated by commas. Type 'quit' to exit.:  indigestion, headache, blurred and distorted vision, excessive hunger, stiff neck, depression



DIAGNOSIS RESULTS

✓ Recognized Symptoms (6):
  1. indigestion
  2. headache
  3. blurred_and_distorted_vision
  4. excessive_hunger
  5. stiff_neck
  6. depression

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      Migraine                                      2.44583%
2      Gastroenteritis                               2.44353%
3      Chronic cholestasis                           2.44249%
4      Heart attack                                  2.44249%
5      Hepatitis C                                   2.44245%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always c


Enter symptoms separated by commas. Type 'quit' to exit.:  joint pain, neck pain, knee pain, hip joint pain



DIAGNOSIS RESULTS

✓ Recognized Symptoms (4):
  1. joint_pain
  2. neck_pain
  3. knee_pain
  4. hip_joint_pain

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      Osteoarthristis                               2.44713%
2      Gastroenteritis                               2.44353%
3      Chronic cholestasis                           2.44252%
4      Hepatitis C                                   2.44245%
5      Heart attack                                  2.44145%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always consult qualified healthcare professionals for medical


Enter symptoms separated by commas. Type 'quit' to exit.:   joint pain, neck pain, knee pain, hip joint pain, swelling joints



DIAGNOSIS RESULTS

✓ Recognized Symptoms (5):
  1. joint_pain
  2. neck_pain
  3. knee_pain
  4. hip_joint_pain
  5. swelling_joints

--------------------------------------------------------------------------------
TOP 5 PREDICTED DISEASES:
--------------------------------------------------------------------------------
Rank   Disease                                       Probability    
--------------------------------------------------------------------------------
1      Osteoarthristis                               2.45053%
2      Gastroenteritis                               2.44353%
3      Chronic cholestasis                           2.44252%
4      Hepatitis C                                   2.44082%
5      Heart attack                                  2.44061%
--------------------------------------------------------------------------------

⚠ IMPORTANT MEDICAL DISCLAIMER:
This is an AI prediction system for educational purposes only.
Always consult qualified healthcare prof


Enter symptoms separated by commas. Type 'quit' to exit.:  q



Thank you for using the Disease Prediction System!


### Conclusion

The optimized AdaBoost model performed well on clean and moderately noisy test data, which supports its robustness and suggests potential as a clinical support aid. However, accuracy dropped on blended or highly ambiguous symptom cases, and the interactive system may not produce clinically reliable predictions in complex, real-world scenarios.

Model strengths and limitations:
The model is most effective when enough clear, relevant symptoms are provided, where it achieves high accuracy and strong Top-K results. Its performance decreases for patient cases with overlapping or incomplete symptom profiles. Like many machine learning models in healthcare, its accuracy is also constrained by data quality and by the diversity of symptoms covered in the dataset.
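
To make the Top-K metric concrete, here is a minimal sketch of how it can be computed from a matrix of predicted class probabilities. The function name `top_k_accuracy` and the toy values are assumptions for illustration; they are not taken from the notebook, which imports `top_k_accuracy_score` from scikit-learn for this purpose.

```python
import numpy as np

def top_k_accuracy(proba, y_true, k=5):
    """Fraction of rows whose true class index is among the k most probable classes."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    # Column indices of the k largest probabilities in each row
    # (their order within the k is irrelevant for this metric).
    top_k = np.argpartition(proba, -k, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 samples, 4 classes (illustrative values only).
proba = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.15, 0.35, 0.10],
    [0.05, 0.15, 0.30, 0.50],
])
y_true = np.array([1, 2, 0])

top1 = top_k_accuracy(proba, y_true, k=1)
top2 = top_k_accuracy(proba, y_true, k=2)
print(f"Top-1: {top1:.3f}, Top-2: {top2:.3f}")
```

A prediction counts as correct under Top-K whenever the true disease appears anywhere in the K highest-ranked candidates, which is why Top-K results can stay strong even when the single best guess is wrong.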

Impact of noise:
The optimized system remains robust to moderate noise (extra or missing symptoms), but accuracy and confidence decline as symptom overlap increases or as the input becomes sparse. This mirrors real-world practice, where diagnoses are often uncertain.
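
One way to simulate "extra or missing symptoms" is sketched below. It assumes each patient is encoded as a binary vector with one entry per symptom. Both that encoding and the helper `perturb_symptoms` are assumptions for illustration, and the noise procedure actually used in this notebook may differ.

```python
import numpy as np

def perturb_symptoms(x, n_drop=1, n_add=1, rng=None):
    """Return a copy of binary symptom vector x with up to n_drop present
    symptoms removed and up to n_add absent symptoms switched on."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x).copy()
    present = np.flatnonzero(x == 1)   # indices of reported symptoms
    absent = np.flatnonzero(x == 0)    # indices of symptoms not reported
    drop = rng.choice(present, size=min(n_drop, present.size), replace=False)
    add = rng.choice(absent, size=min(n_add, absent.size), replace=False)
    x[drop] = 0   # simulate a missing symptom
    x[add] = 1    # simulate a spurious extra symptom
    return x

# Hypothetical patient with 3 of 8 symptoms present.
x = np.array([1, 1, 1, 0, 0, 0, 0, 0])
noisy = perturb_symptoms(x, n_drop=1, n_add=2, rng=0)
print("original:", x)
print("noisy:   ", noisy)
```

Evaluating the trained model on such perturbed copies of the test set, with increasing `n_drop` and `n_add`, gives a simple robustness curve.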

Real-world considerations:
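For best results, users should provide as many accurate and relevant symptoms as possible, because both the model's confidence and the chance of a correct prediction increase with richer input. The gain in reported probability is small, however: in the interactive session above, adding swelling_joints to four joint-related symptoms raised the top probability for Osteoarthristis only from 2.44713% to 2.45053%, so the ranking of candidates is more informative than the absolute probability values. In practice, this system is most valuable as a decision support tool that offers likely diagnoses for clinical review, rather than making definitive standalone predictions.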
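
Because the probabilities printed in the interactive session differ only in the third decimal place, it can help to show each candidate relative to a uniform guess over all classes. The sketch below is one possible way to do that. The helper `rank_with_baseline`, the class names, and the probability values are hypothetical and not part of the notebook.

```python
import numpy as np

def rank_with_baseline(proba, class_names, k=5):
    """Return the k most probable classes as (name, probability, ratio),
    where ratio is the probability divided by the uniform baseline 1/n_classes."""
    proba = np.asarray(proba, dtype=float)
    baseline = 1.0 / proba.size          # probability of a uniform random guess
    order = np.argsort(proba)[::-1][:k]  # indices sorted by descending probability
    return [
        (class_names[i], float(proba[i]), float(proba[i] / baseline))
        for i in order
    ]

# Illustrative values for 4 hypothetical classes.
class_names = ["class_a", "class_b", "class_c", "class_d"]
proba = [0.2502, 0.2499, 0.2501, 0.2498]

ranked = rank_with_baseline(proba, class_names, k=2)
for name, p, ratio in ranked:
    print(f"{name:<10} p={p:.4f}  x{ratio:.4f} of uniform")
```

A ratio close to 1 signals that the model barely separates the candidate from chance, which is a useful cue for a clinician reviewing the list.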
