# Task 3: Disease Prediction System

### Problem Definition
- Healthcare application context
- Problem statement: Multi-symptom disease prediction
- Dataset overview and project objectives

### Data Exploration
- Load dataset and display basic info
- Disease distribution analysis
- Symptom analysis and frequency
- Visualisations if needed

In [89]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    classification_report,
    confusion_matrix,
    top_k_accuracy_score
)
import time
import warnings
warnings.filterwarnings('ignore')

In [17]:
# Load the dataset
df = pd.read_csv("C:/Users/kendr/Downloads/archive/dataset.csv")

# Dataset Overview
print(f"Dataset Shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Dataset Shape: (4920, 18)
Number of samples: 4920
Number of columns: 18
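
The outline lists a disease distribution analysis, but no cell shows one. The following is a minimal sketch of how it could look, assuming the label column is named `Disease` as in the cells that follow. The `toy_df` frame and the `disease_distribution` helper are hypothetical stand-ins so the example runs on its own; on the real data the helper would be called with `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the 'Disease' column matters here.
toy_df = pd.DataFrame({
    "Disease": ["Acne", "Acne", "AIDS", " Acne", "AIDS", "Dengue"],
})

def disease_distribution(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-disease sample counts and their share of all rows."""
    # Strip whitespace first so ' Acne' and 'Acne' are counted as one class.
    counts = frame["Disease"].str.strip().value_counts()
    return pd.DataFrame({
        "count": counts,
        "share": counts / counts.sum(),
    })

distribution = disease_distribution(toy_df)
print(distribution)

# A max/min ratio close to 1 indicates balanced classes.
imbalance_ratio = distribution["count"].max() / distribution["count"].min()
print(f"Imbalance ratio (max/min): {imbalance_ratio:.2f}")
```

A ratio well above 1 on the real data would be a reason to prefer macro-averaged metrics over weighted ones later on.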


In [18]:
# Symptom analysis: collect the symptom columns
symptom_cols = [col for col in df.columns if col.startswith('Symptom_')]

# Count non-null symptoms per row (each row is one disease record)
df['Symptom_Count'] = df[symptom_cols].notna().sum(axis=1)

print("Symptoms per disease statistics:")
print(df['Symptom_Count'].describe())

Symptoms per disease statistics:
count    4920.000000
mean        7.448780
std         3.592166
min         3.000000
25%         5.000000
50%         6.000000
75%        10.000000
max        17.000000
Name: Symptom_Count, dtype: float64


In [26]:
# Collect all symptoms (flatten the symptom columns)
all_symptoms = []
for col in symptom_cols:
    symptoms = df[col].dropna().astype(str).str.strip()
    all_symptoms.extend(symptoms.tolist())

# Filter out "nan" strings
all_symptoms = [symptom for symptom in all_symptoms if symptom.lower() != 'nan']

# Count symptom frequencies
symptom_counter = Counter(all_symptoms)
total_unique_symptoms = len(symptom_counter)

print(f"Total unique symptoms in dataset: {total_unique_symptoms}")
print("\nTop 20 most common symptoms:")
for symptom, count in symptom_counter.most_common(20):
    print(f"  {symptom}: {count}")

Total unique symptoms in dataset: 131

Top 20 most common symptoms:
  fatigue: 1932
  vomiting: 1914
  high_fever: 1362
  loss_of_appetite: 1152
  nausea: 1146
  headache: 1134
  abdominal_pain: 1032
  yellowish_skin: 912
  yellowing_of_eyes: 816
  chills: 798
  skin_rash: 786
  malaise: 702
  chest_pain: 696
  joint_pain: 684
  itching: 678
  sweating: 678
  dark_urine: 570
  cough: 564
  diarrhoea: 564
  irritability: 474


In [27]:
# For each symptom, count how many different diseases it appears in
symptom_disease_map = {}
for symptom in symptom_counter.keys():
    diseases_with_symptom = set()
    for col in symptom_cols:
        diseases = df[df[col].str.strip() == symptom]['Disease'].unique()
        diseases_with_symptom.update(diseases)
    symptom_disease_map[symptom] = len(diseases_with_symptom)

# Find symptoms that appear in multiple diseases
multi_disease_symptoms = {k: v for k, v in symptom_disease_map.items() if v > 1}
print(f"\nNumber of symptoms appearing in multiple diseases: {len(multi_disease_symptoms)}")

# Top symptoms by disease diversity
sorted_symptoms = sorted(symptom_disease_map.items(), key=lambda x: x[1], reverse=True)
print("\nTop 15 symptoms by number of associated diseases:")
for symptom, disease_count in sorted_symptoms[:15]:
    print(f"  {symptom}: appears in {disease_count} different diseases")


Number of symptoms appearing in multiple diseases: 51

Top 15 symptoms by number of associated diseases:
  vomiting: appears in 17 different diseases
  fatigue: appears in 17 different diseases
  high_fever: appears in 12 different diseases
  headache: appears in 10 different diseases
  loss_of_appetite: appears in 10 different diseases
  nausea: appears in 10 different diseases
  abdominal_pain: appears in 9 different diseases
  yellowish_skin: appears in 8 different diseases
  skin_rash: appears in 7 different diseases
  chills: appears in 7 different diseases
  yellowing_of_eyes: appears in 7 different diseases
  itching: appears in 6 different diseases
  chest_pain: appears in 6 different diseases
  joint_pain: appears in 6 different diseases
  sweating: appears in 6 different diseases


In [30]:
print("Key Insights:")
print(f"  • Total diseases: {df['Disease'].nunique()}")
print(f"  • Total unique symptoms: {total_unique_symptoms}")
print(f"  • Average symptoms per disease entry: {df['Symptom_Count'].mean():.2f}")
print(f"  • Symptoms appearing in multiple diseases: {len(multi_disease_symptoms)}")

Key Insights:
  • Total diseases: 41
  • Total unique symptoms: 131
  • Average symptoms per disease entry: 7.45
  • Symptoms appearing in multiple diseases: 51


### Data Preprocessing
- Handle missing values
- Feature engineering
- Create feature matrix (X) and target labels (Y)
- Train-test validation split

In [34]:
# Data Cleaning
# Work on a copy so the raw DataFrame stays untouched
df_processed = df.copy()

# Replace missing symptoms with empty strings and strip whitespace
for col in symptom_cols:
    df_processed[col] = df_processed[col].fillna('').astype(str).str.strip()

# Strip whitespace from disease names
df_processed['Disease'] = df_processed['Disease'].str.strip()

print("\nWhitespace removed from all entries.")
print(f"Sample cleaned data:")
print(df_processed[['Disease'] + symptom_cols[:3]].head())

# Verify no 'nan' strings remain
nan_strings = (df_processed[symptom_cols] == 'nan').sum().sum()
print(f"\n'nan' strings remaining: {nan_strings}")


Whitespace removed from all entries.
Sample cleaned data:
            Disease  Symptom_1             Symptom_2             Symptom_3
0  Fungal infection    itching             skin_rash  nodal_skin_eruptions
1  Fungal infection  skin_rash  nodal_skin_eruptions   dischromic _patches
2  Fungal infection    itching  nodal_skin_eruptions   dischromic _patches
3  Fungal infection    itching             skin_rash   dischromic _patches
4  Fungal infection    itching             skin_rash  nodal_skin_eruptions

'nan' strings remaining: 0


In [38]:
# Collect all unique symptoms (excluding empty strings)
all_symptoms = set()
for col in symptom_cols:
    symptoms = df_processed[col][df_processed[col] != ''].unique()
    all_symptoms.update(symptoms)

# Convert to sorted list for consistent indexing
symptom_list = sorted(list(all_symptoms))
symptom_to_index = {symptom: idx for idx, symptom in enumerate(symptom_list)}

print(f"Total unique symptoms: {len(symptom_list)}")
print(f"First 10 symptoms in vocabulary: {symptom_list[:10]}")

Total unique symptoms: 131
First 10 symptoms in vocabulary: ['abdominal_pain', 'abnormal_menstruation', 'acidity', 'acute_liver_failure', 'altered_sensorium', 'anxiety', 'back_pain', 'belly_pain', 'blackheads', 'bladder_discomfort']


In [39]:
def create_multi_hot_encoding(row, symptom_to_index, symptom_cols):
    """
    Create a multi-hot encoded vector for a disease entry.
    Each symptom present is marked as 1, absent as 0.
    """
    encoding = np.zeros(len(symptom_to_index))
    
    for col in symptom_cols:
        symptom = row[col]
        if symptom != '' and symptom in symptom_to_index:
            encoding[symptom_to_index[symptom]] = 1
    
    return encoding

In [49]:
# Create feature matrix
X = np.array([create_multi_hot_encoding(row, symptom_to_index, symptom_cols) 
              for _, row in df_processed.iterrows()])

print(f"Feature matrix shape: {X.shape}")
print(f"Feature matrix dimensions: {X.shape[0]} samples × {X.shape[1]} features")

Feature matrix shape: (4920, 131)
Feature matrix dimensions: 4920 samples × 131 features
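
To make the encoding concrete, here is a small worked example on a toy vocabulary. It is only a sketch: `toy_vocab`, `toy_index`, and `encode` are hypothetical names that mirror the logic of `create_multi_hot_encoding` on a plain list instead of a DataFrame row.

```python
import numpy as np

# Hypothetical four-symptom vocabulary; the real one has 131 entries.
toy_vocab = ["cough", "fatigue", "itching", "vomiting"]
toy_index = {symptom: idx for idx, symptom in enumerate(toy_vocab)}

def encode(symptoms, index):
    """Mark each known symptom with 1; unknown or empty entries are skipped."""
    vector = np.zeros(len(index))
    for symptom in symptoms:
        if symptom in index:
            vector[index[symptom]] = 1
    return vector

# One record with two symptoms and one empty slot.
record = ["vomiting", "cough", ""]
encoded = encode(record, toy_index)
print(encoded)  # [1. 0. 0. 1.]
```

The result does not depend on the order in which symptoms are listed, so the same symptoms in `Symptom_1` or `Symptom_3` produce the same feature vector.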


In [56]:
# Encode disease labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_processed['Disease'])

print(f"Number of unique disease classes: {len(label_encoder.classes_)}")
print(f"\nDisease classes (first 10): {list(label_encoder.classes_[:10])}")

# Create disease mapping for reference
disease_mapping = {idx: disease for idx, disease in enumerate(label_encoder.classes_)}
reverse_disease_mapping = {disease: idx for idx, disease in enumerate(label_encoder.classes_)}

print(f"\nSample disease mapping:")
for i in range(min(10, len(disease_mapping))):
    print(f"  {i}: {disease_mapping[i]}")

Number of unique disease classes: 41

Disease classes (first 10): ['(vertigo) Paroymsal  Positional Vertigo', 'AIDS', 'Acne', 'Alcoholic hepatitis', 'Allergy', 'Arthritis', 'Bronchial Asthma', 'Cervical spondylosis', 'Chicken pox', 'Chronic cholestasis']

Sample disease mapping:
  0: (vertigo) Paroymsal  Positional Vertigo
  1: AIDS
  2: Acne
  3: Alcoholic hepatitis
  4: Allergy
  5: Arthritis
  6: Bronchial Asthma
  7: Cervical spondylosis
  8: Chicken pox
  9: Chronic cholestasis


In [60]:
# Split data - 80% train, 20% validation
# Using stratify to maintain disease distribution
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set size: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"\nTraining set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")

Training set size: 3936 samples (80.0%)
Validation set size: 984 samples (20.0%)

Training set shape: (3936, 131)
Validation set shape: (984, 131)


### Base Model Construction
- Explain AdaBoost algorithm choice
- Build baseline AdaBoost with default parameters
- Establish baseline performance metrics

#### Why AdaBoost? (TO EDIT)

In [83]:
# Building Baseline AdaBoost model
base_adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

### Model Training & Evaluation
- Train base model on training set
- Evaluate on validation set using accuracy, precision, recall, F1-score, confusion matrix, and Top-K accuracy
- Analyze results and error patterns

In [84]:
# Train the model and measure time
start_time = time.time()
base_adaboost.fit(X_train, y_train)
training_time = time.time() - start_time

print(f"Training completed in {training_time:.2f} seconds")

Training completed in 0.92 seconds


In [85]:
# Predictions on training set
y_train_pred = base_adaboost.predict(X_train)
y_train_proba = base_adaboost.predict_proba(X_train)

# Predictions on validation set
start_time = time.time()
y_val_pred = base_adaboost.predict(X_val)
y_val_proba = base_adaboost.predict_proba(X_val)
inference_time = time.time() - start_time

print(f"\n✓ Training predictions generated")
print(f"✓ Validation predictions generated in {inference_time:.4f} seconds")
print(f"  Average time per prediction: {(inference_time/len(X_val))*1000:.2f} ms")



✓ Training predictions generated
✓ Validation predictions generated in 0.0686 seconds
  Average time per prediction: 0.07 ms


In [86]:
# Training set metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred, average='weighted', zero_division=0)
train_recall = recall_score(y_train, y_train_pred, average='weighted', zero_division=0)
train_f1 = f1_score(y_train, y_train_pred, average='weighted', zero_division=0)

# Validation set metrics
val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred, average='weighted', zero_division=0)
val_recall = recall_score(y_val, y_val_pred, average='weighted', zero_division=0)
val_f1 = f1_score(y_val, y_val_pred, average='weighted', zero_division=0)

# Display results in a table
print("\n" + "-"*80)
print(f"{'Metric':<20} {'Training Set':<20} {'Validation Set':<20} {'Difference':<20}")
print("-"*80)
print(f"{'Accuracy':<20} {train_accuracy:>8.4f} ({train_accuracy*100:>5.2f}%)   {val_accuracy:>8.4f} ({val_accuracy*100:>5.2f}%)   {abs(train_accuracy-val_accuracy):>8.4f}")
print(f"{'Precision':<20} {train_precision:>8.4f} ({train_precision*100:>5.2f}%)   {val_precision:>8.4f} ({val_precision*100:>5.2f}%)   {abs(train_precision-val_precision):>8.4f}")
print(f"{'Recall':<20} {train_recall:>8.4f} ({train_recall*100:>5.2f}%)   {val_recall:>8.4f} ({val_recall*100:>5.2f}%)   {abs(train_recall-val_recall):>8.4f}")
print(f"{'F1-Score':<20} {train_f1:>8.4f} ({train_f1*100:>5.2f}%)   {val_f1:>8.4f} ({val_f1*100:>5.2f}%)   {abs(train_f1-val_f1):>8.4f}")
print("-"*80)


--------------------------------------------------------------------------------
Metric               Training Set         Validation Set       Difference          
--------------------------------------------------------------------------------
Accuracy               0.9652 (96.52%)     0.9563 (95.63%)     0.0089
Precision              0.9845 (98.45%)     0.9843 (98.43%)     0.0001
Recall                 0.9652 (96.52%)     0.9563 (95.63%)     0.0089
F1-Score               0.9708 (97.08%)     0.9644 (96.44%)     0.0064
--------------------------------------------------------------------------------


In [87]:
# Check for overfitting
accuracy_gap = train_accuracy - val_accuracy

print(f"Accuracy Gap:        {accuracy_gap:.4f}")

if accuracy_gap < 0.05:
    print("Model generalizes well (gap < 5%)")
elif accuracy_gap < 0.10:
    print("Slight overfitting detected (gap 5-10%)")
else:
    print("Significant overfitting (gap > 10%)")

Accuracy Gap:        0.0089
Model generalizes well (gap < 5%)


In [90]:
# Top-K Accuracy Analysis
# Measures if the correct disease appears in the top K predictions (ranked by confidence)

# Calculate Top-K accuracies for K = 1, 2, 3, 5
k_values = [1, 2, 3, 5]
topk_results = []

for k in k_values:
    # Training set
    train_topk = top_k_accuracy_score(y_train, y_train_proba, k=k, labels=np.arange(len(label_encoder.classes_)))
    # Validation set
    val_topk = top_k_accuracy_score(y_val, y_val_proba, k=k, labels=np.arange(len(label_encoder.classes_)))
    topk_results.append({
        'K': k,
        'Train': train_topk,
        'Validation': val_topk
    })

print("\n" + "-"*80)
print(f"{'Top-K':<15} {'Training Set':<25} {'Validation Set':<25} {'Improvement':<15}")
print("-"*80)
for result in topk_results:
    k = result['K']
    train_acc = result['Train']
    val_acc = result['Validation']
    improvement = val_acc - topk_results[0]['Validation']  # vs Top-1
    
    print(f"{'Top-'+str(k):<15} {train_acc:>8.4f} ({train_acc*100:>5.2f}%)      {val_acc:>8.4f} ({val_acc*100:>5.2f}%)      +{improvement*100:>5.2f}%")
print("-"*80)

print("\nKey Insights:")
print(f"  • Top-1 Accuracy: {topk_results[0]['Validation']*100:.2f}% (exact match)")
print(f"  • Top-3 Accuracy: {topk_results[2]['Validation']*100:.2f}% (correct diagnosis in top 3)")
print(f"  • Top-5 Accuracy: {topk_results[3]['Validation']*100:.2f}% (correct diagnosis in top 5)")
improvement_top3 = (topk_results[2]['Validation'] - topk_results[0]['Validation']) * 100
print(f"  • Considering top 3 predictions improves accuracy by {improvement_top3:.2f}%")


--------------------------------------------------------------------------------
Top-K           Training Set              Validation Set            Improvement    
--------------------------------------------------------------------------------
Top-1             0.9652 (96.52%)        0.9563 (95.63%)      + 0.00%
Top-2             0.9970 (99.70%)        0.9939 (99.39%)      + 3.76%
Top-3             0.9990 (99.90%)        0.9980 (99.80%)      + 4.17%
Top-5             1.0000 (100.00%)        1.0000 (100.00%)      + 4.37%
--------------------------------------------------------------------------------

Key Insights:
  • Top-1 Accuracy: 95.63% (exact match)
  • Top-3 Accuracy: 99.80% (correct diagnosis in top 3)
  • Top-5 Accuracy: 100.00% (correct diagnosis in top 5)
  • Considering top 3 predictions improves accuracy by 4.17%
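
For reference, Top-K accuracy can be computed by hand on a toy example. This is a minimal sketch, not the scikit-learn implementation: `manual_top_k_accuracy` is a hypothetical helper, and when scores are tied its ranking of the tied classes is arbitrary, so results may differ from `top_k_accuracy_score` in that case.

```python
import numpy as np

def manual_top_k_accuracy(y_true, proba, k):
    """Fraction of rows whose true class is among the k highest-scored classes."""
    # argsort is ascending, so the last k columns hold the k largest scores.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))

toy_proba = np.array([
    [0.6, 0.3, 0.1],  # true class 0 is ranked 1st
    [0.2, 0.5, 0.3],  # true class 2 is ranked 2nd
    [0.1, 0.2, 0.7],  # true class 0 is ranked 3rd
])
toy_true = np.array([0, 2, 0])

top1 = manual_top_k_accuracy(toy_true, toy_proba, k=1)
top2 = manual_top_k_accuracy(toy_true, toy_proba, k=2)
top3 = manual_top_k_accuracy(toy_true, toy_proba, k=3)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}, Top-3: {top3:.2f}")
```

Each increase in K can only add hits, which is why the Top-K column in the table above never decreases.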


In [91]:
# Get detailed classification report
class_report = classification_report(
    y_val, 
    y_val_pred, 
    target_names=label_encoder.classes_,
    output_dict=True,
    zero_division=0
)

# Extract per-class metrics
class_performance = []
for disease, metrics in class_report.items():
    if disease not in ['accuracy', 'macro avg', 'weighted avg']:
        class_performance.append({
            'Disease': disease,
            'Precision': metrics['precision'],
            'Recall': metrics['recall'],
            'F1-Score': metrics['f1-score'],
            'Support': int(metrics['support'])
        })

# Sort by F1-score
class_performance.sort(key=lambda x: x['F1-Score'], reverse=True)

# Display top performers
print("\n--- TOP 10 PERFORMING DISEASES (by F1-Score) ---")
print(f"{'Disease':<45} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Samples':<8}")
print("-"*95)
for i, perf in enumerate(class_performance[:10], 1):
    print(f"{perf['Disease']:<45} {perf['Precision']:>8.4f}     {perf['Recall']:>8.4f}     {perf['F1-Score']:>8.4f}     {perf['Support']:<8}")

# Display bottom performers
print("\n--- BOTTOM 10 DISEASES NEEDING IMPROVEMENT ---")
print(f"{'Disease':<45} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Samples':<8}")
print("-"*95)
for perf in class_performance[-10:]:
    print(f"{perf['Disease']:<45} {perf['Precision']:>8.4f}     {perf['Recall']:>8.4f}     {perf['F1-Score']:>8.4f}     {perf['Support']:<8}")

# Count perfect predictions
perfect_diseases = sum(1 for p in class_performance if p['F1-Score'] == 1.0)
print(f"\n✓ {perfect_diseases}/{len(class_performance)} diseases achieved perfect F1-score (100%)")



--- TOP 10 PERFORMING DISEASES (by F1-Score) ---
Disease                                       Precision    Recall       F1-Score     Samples 
-----------------------------------------------------------------------------------------------
AIDS                                            1.0000       1.0000       1.0000     24      
Acne                                            1.0000       1.0000       1.0000     24      
Arthritis                                       1.0000       1.0000       1.0000     24      
Common Cold                                     1.0000       1.0000       1.0000     24      
Dengue                                          1.0000       1.0000       1.0000     24      
Diabetes                                        1.0000       1.0000       1.0000     24      
Dimorphic hemmorhoids(piles)                    1.0000       1.0000       1.0000     24      
Drug Reaction                                   1.0000       1.0000       1.0000     24      
GERD    

In [92]:
# Generate confusion matrix
cm = confusion_matrix(y_val, y_val_pred)

# Calculate key metrics from confusion matrix
correct_predictions = np.trace(cm)
total_predictions = np.sum(cm)
misclassifications = total_predictions - correct_predictions

print(f"\nOverall Statistics:")
print(f"  • Total Predictions: {total_predictions}")
print(f"  • Correct Predictions: {correct_predictions} ({correct_predictions/total_predictions*100:.2f}%)")
print(f"  • Misclassifications: {misclassifications} ({misclassifications/total_predictions*100:.2f}%)")

# Find most confused disease pairs
print("\n--- TOP 10 MOST CONFUSED DISEASE PAIRS ---")
print("(True Disease → Predicted Disease: Count)")
print("-"*80)

confusion_pairs = []
for i in range(len(cm)):
    for j in range(len(cm)):
        if i != j and cm[i][j] > 0:
            confusion_pairs.append({
                'True': label_encoder.classes_[i],
                'Predicted': label_encoder.classes_[j],
                'Count': cm[i][j]
            })

confusion_pairs.sort(key=lambda x: x['Count'], reverse=True)

for idx, pair in enumerate(confusion_pairs[:10], 1):
    print(f"{idx:>2}. {pair['True']:<40} → {pair['Predicted']:<40} ({pair['Count']} times)")

# Analyze which diseases are most problematic
print("\n--- DISEASES WITH MOST MISCLASSIFICATIONS ---")
disease_errors = []
for i in range(len(cm)):
    # Row sum minus diagonal = total errors for this disease (false negatives)
    false_negatives = np.sum(cm[i, :]) - cm[i, i]
    # Column sum minus diagonal = times other diseases were classified as this (false positives)
    false_positives = np.sum(cm[:, i]) - cm[i, i]
    total_errors = false_negatives + false_positives
    
    if total_errors > 0:
        disease_errors.append({
            'Disease': label_encoder.classes_[i],
            'False_Negatives': false_negatives,
            'False_Positives': false_positives,
            'Total_Errors': total_errors
        })

disease_errors.sort(key=lambda x: x['Total_Errors'], reverse=True)

print(f"\n{'Disease':<45} {'False Neg':<12} {'False Pos':<12} {'Total':<10}")
print("-"*80)
for error in disease_errors[:10]:
    print(f"{error['Disease']:<45} {error['False_Negatives']:<12} {error['False_Positives']:<12} {error['Total_Errors']:<10}")
 


Overall Statistics:
  • Total Predictions: 984
  • Correct Predictions: 941 (95.63%)
  • Misclassifications: 43 (4.37%)

--- TOP 10 MOST CONFUSED DISEASE PAIRS ---
(True Disease → Predicted Disease: Count)
--------------------------------------------------------------------------------
 1. Alcoholic hepatitis                      → Chronic cholestasis                      (7 times)
 2. Typhoid                                  → Chronic cholestasis                      (7 times)
 3. Fungal infection                         → Chronic cholestasis                      (5 times)
 4. Gastroenteritis                          → Chronic cholestasis                      (5 times)
 5. Allergy                                  → Chronic cholestasis                      (3 times)
 6. Bronchial Asthma                         → Chronic cholestasis                      (3 times)
 7. Cervical spondylosis                     → Chronic cholestasis                      (3 times)
 8. Hepatitis C           

In [93]:
# Error pattern insights
# Identify the most problematic disease
most_confused = disease_errors[0] if disease_errors else None

if most_confused:
    problem_disease = most_confused['Disease']
    problem_idx = np.where(label_encoder.classes_ == problem_disease)[0][0]
    
    print(f"\nMost Problematic Disease: {problem_disease}")
    print(f"  • Total Errors: {most_confused['Total_Errors']}")
    print(f"  • False Negatives: {most_confused['False_Negatives']} (missed diagnoses)")
    print(f"  • False Positives: {most_confused['False_Positives']} (over-diagnosed)")
    
    # Find what it's confused with
    print(f"\n  This disease is most confused with:")
    related_confusions = [p for p in confusion_pairs 
                         if p['True'] == problem_disease or p['Predicted'] == problem_disease][:5]
    for conf in related_confusions:
        if conf['True'] == problem_disease:
            print(f"    → Often misclassified as: {conf['Predicted']} ({conf['Count']} times)")
        else:
            print(f"    ← Often receives misclassifications from: {conf['True']} ({conf['Count']} times)")



Most Problematic Disease: Chronic cholestasis
  • Total Errors: 43
  • False Negatives: 0 (missed diagnoses)
  • False Positives: 43 (over-diagnosed)

  This disease is most confused with:
    ← Often receives misclassifications from: Alcoholic hepatitis (7 times)
    ← Often receives misclassifications from: Typhoid (7 times)
    ← Often receives misclassifications from: Fungal infection (5 times)
    ← Often receives misclassifications from: Gastroenteritis (5 times)
    ← Often receives misclassifications from: Allergy (3 times)


In [94]:
# Sample predictions
# Select 5 random samples from validation set
np.random.seed(42)
sample_indices = np.random.choice(len(X_val), size=5, replace=False)

for idx, sample_idx in enumerate(sample_indices, 1):
    true_label = y_val[sample_idx]
    true_disease = label_encoder.classes_[true_label]
    probabilities = y_val_proba[sample_idx]
    
    # Get top 3 predictions
    top_k_indices = np.argsort(probabilities)[::-1][:3]
    
    print(f"{'='*80}")
    print(f"Example {idx}: True Disease = {true_disease}")
    print(f"{'='*80}")
    print(f"{'Rank':<6} {'Predicted Disease':<45} {'Confidence':<15}")
    print("-"*80)
    
    for rank, pred_idx in enumerate(top_k_indices, 1):
        pred_disease = label_encoder.classes_[pred_idx]
        confidence = probabilities[pred_idx]
        
        # Mark if correct
        marker = "✓ CORRECT" if pred_idx == true_label else ""
        print(f"{rank:<6} {pred_disease:<45} {confidence*100:>6.2f}%  {marker}")
    print()

Example 1: True Disease = Allergy
Rank   Predicted Disease                             Confidence     
--------------------------------------------------------------------------------
1      Allergy                                         2.44%  ✓ CORRECT
2      Chronic cholestasis                             2.44%  
3      Hepatitis D                                     2.44%  

Example 2: True Disease = Varicose veins
Rank   Predicted Disease                             Confidence     
--------------------------------------------------------------------------------
1      Varicose veins                                  2.44%  ✓ CORRECT
2      Hepatitis D                                     2.44%  
3      Chronic cholestasis                             2.44%  

Example 3: True Disease = Psoriasis
Rank   Predicted Disease                             Confidence     
--------------------------------------------------------------------------------
1      Psoriasis                         

In [95]:
summary_table = {
    'Metric': [
        'Validation Accuracy',
        'Validation Precision',
        'Validation Recall',
        'Validation F1-Score',
        'Top-3 Accuracy',
        'Top-5 Accuracy',
        'Perfect F1-Score Diseases',
        'Total Misclassifications',
        'Training Time',
        'Inference Time (per sample)'
    ],
    'Value': [
        f"{val_accuracy:.4f} ({val_accuracy*100:.2f}%)",
        f"{val_precision:.4f}",
        f"{val_recall:.4f}",
        f"{val_f1:.4f}",
        f"{topk_results[2]['Validation']:.4f} ({topk_results[2]['Validation']*100:.2f}%)",
        f"{topk_results[3]['Validation']:.4f} ({topk_results[3]['Validation']*100:.2f}%)",
        f"{perfect_diseases}/{len(class_performance)}",
        f"{misclassifications}/{total_predictions}",
        f"{training_time:.2f} seconds",
        f"{(inference_time/len(X_val))*1000:.2f} ms"
    ]
}

for metric, value in zip(summary_table['Metric'], summary_table['Value']):
    print(f"  {metric:<35}: {value}")


  Validation Accuracy                : 0.9563 (95.63%)
  Validation Precision               : 0.9843
  Validation Recall                  : 0.9563
  Validation F1-Score                : 0.9644
  Top-3 Accuracy                     : 0.9980 (99.80%)
  Top-5 Accuracy                     : 1.0000 (100.00%)
  Perfect F1-Score Diseases          : 24/41
  Total Misclassifications           : 43/984
  Training Time                      : 0.92 seconds
  Inference Time (per sample)        : 0.07 ms


### Test Set Construction with Noise
- Rationale: real-world symptom reports often overlap across several diseases
- Noise injection strategy: blend symptoms from 2-3 different diseases
- Document noise levels (e.g., 20%, 50% noisy samples)
- Create realistic test scenarios
- Pick 1-3 diseases, blend their symptoms, and check whether the predictions of the different boosted models make sense. For example, if diseases A, B, and C are blended into 8 symptoms (A, A, A, A, A, B, C, C), the model should assign class A a higher probability than classes B and C.
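
The blending strategy above can be sketched as follows. This is one possible implementation under stated assumptions: `disease_symptoms`, `blend_symptoms`, and the `recipe` format are hypothetical, and in the notebook the symptom profiles would be derived from `df_processed` rather than typed by hand.

```python
import random

# Hypothetical symptom profiles; placeholders for the real per-disease symptom sets.
disease_symptoms = {
    "A": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "B": ["b1", "b2", "b3"],
    "C": ["c1", "c2", "c3"],
}

def blend_symptoms(profiles, recipe, rng):
    """Sample symptoms per disease according to `recipe` ({disease: how_many})."""
    blended = []
    for disease, how_many in recipe.items():
        # Sample without replacement so a symptom is never repeated.
        blended.extend(rng.sample(profiles[disease], how_many))
    rng.shuffle(blended)
    return blended

rng = random.Random(42)
recipe = {"A": 5, "B": 1, "C": 2}  # the (A, A, A, A, A, B, C, C) mix from the outline
noisy_sample = blend_symptoms(disease_symptoms, recipe, rng)

# The disease contributing the most symptoms is the expected prediction.
expected_label = max(recipe, key=recipe.get)
print(noisy_sample, "->", expected_label)
```

The blended symptom list can then be passed through the same multi-hot encoding as the training data, and the noise level can be documented as the share of symptoms that do not come from the dominant disease (3 of 8 in this recipe).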

### Hyperparameter Optimization
- Define parameter grid (n_estimators, learning_rate, and base estimator parameters such as max_depth)
- GridSearchCV with cross-validation (5-fold or 10-fold)
- Display best parameters found
- Compare CV scores across parameter combinations

### Final Model Evaluation
- Retrain with best hyperparameters on full training set
- Predict on clean validation set
- Predict on noisy test set
- Compare performance: clean vs. noisy data
- Top-K predictions: Show ranked disease predictions with confidence scores
- Case studies: Show 5-10 example predictions with interpretations

### Interactive Demo
- Input symptoms
- Display top 3-5 disease predictions with probabilities
- Demonstrate practical applicability
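
The core of such a demo could look like the sketch below. The helpers `encode_symptoms` and `rank_predictions` are hypothetical, as are the toy vocabulary, class names, and probabilities. In the notebook, `symptom_to_index`, `label_encoder.classes_`, and the output of `predict_proba` would take their place.

```python
import numpy as np

def encode_symptoms(symptoms, symptom_to_index):
    """Build a 1 x n multi-hot row; also return any unrecognised symptoms."""
    vector = np.zeros((1, len(symptom_to_index)))
    unknown = []
    for symptom in symptoms:
        key = symptom.strip()  # same whitespace cleaning as the training data
        if key in symptom_to_index:
            vector[0, symptom_to_index[key]] = 1
        else:
            unknown.append(symptom)
    return vector, unknown

def rank_predictions(proba_row, class_names, k=3):
    """Return the k most probable (disease, probability) pairs, best first."""
    order = np.argsort(proba_row)[::-1][:k]
    return [(class_names[i], float(proba_row[i])) for i in order]

# Hypothetical vocabulary, classes, and probabilities for illustration only.
demo_index = {"itching": 0, "skin_rash": 1, "vomiting": 2}
demo_classes = ["Acne", "Allergy", "Fungal infection"]

features, unknown = encode_symptoms([" itching ", "skin_rash", "sneezing"], demo_index)
demo_proba = np.array([0.10, 0.25, 0.65])  # stands in for model.predict_proba(features)[0]
ranking = rank_predictions(demo_proba, demo_classes, k=2)

for rank, (disease, probability) in enumerate(ranking, 1):
    print(f"{rank}. {disease}: {probability * 100:.1f}%")
if unknown:
    print("Ignored unknown symptoms:", unknown)
```

Reporting unrecognised symptoms back to the user, rather than silently dropping them, makes it clear when an input had no effect on the prediction.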

### Conclusion
- Model strengths and limitations
- Impact of noise on predictions
- Real world considerations
