## **Model Training**

### Imports and Setup

In [1]:
import os
import json
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime

# Check GPU availability
print("SYSTEM CHECK")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU Available: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.1f} GB")
else:
    print("WARNING: No GPU detected.")

print(f"PyTorch Version: {torch.__version__}")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import Dataset, DatasetDict
from sklearn.metrics import accuracy_score, f1_score, classification_report

SYSTEM CHECK
GPU Available: NVIDIA GeForce RTX 4050 Laptop GPU
GPU Memory: 6.0 GB
PyTorch Version: 2.6.0+cu124
Using device: cuda


### Load Processed Data

In [2]:
print("Loading processed data...")

# Load training data
with open("../data/classifier/train.json", 'r') as f:
    train_data = json.load(f)
print(f"Training records:   {len(train_data):,}")

# Load validation data
with open("../data/classifier/val.json", 'r') as f:
    val_data = json.load(f)
print(f"Validation records: {len(val_data):,}")

# Load test data
with open("../data/classifier/test.json", 'r') as f:
    test_data = json.load(f)
print(f"Test records:       {len(test_data):,}")

# Load label mappings
with open("../data/classifier/label2id.json", 'r') as f:
    label2id = json.load(f)

with open("../data/classifier/id2label.json", 'r') as f:
    id2label = json.load(f)
    id2label = {int(k): v for k, v in id2label.items()}

print(f"\nNumber of classes:  {len(label2id)}")

print("DATA LOADED SUCCESSFULLY!")

Loading processed data...
Training records:   1,025,602
Validation records: 132,448
Test records:       134,529

Number of classes:  49
DATA LOADED SUCCESSFULLY!
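
Before converting to a `Dataset`, it can help to confirm that every record has the fields the next cell relies on (`text` and `condition`) and that the two label maps are inverses of each other. This is a minimal sketch: `validate_records` and the toy records are illustrative and not part of the notebook.

```python
def validate_records(records, label2id, id2label):
    """Return a list of problems found in the records and label maps."""
    problems = []

    # The two mappings should be exact inverses of each other.
    for label, idx in label2id.items():
        if id2label.get(idx) != label:
            problems.append(f"label map mismatch for {label!r} (id {idx})")

    # Every record needs non-empty text and a known condition.
    for i, item in enumerate(records):
        if not item.get("text"):
            problems.append(f"record {i}: missing or empty 'text'")
        if item.get("condition") not in label2id:
            problems.append(f"record {i}: unknown condition {item.get('condition')!r}")
    return problems


# Toy data for illustration only (the real records come from train.json).
toy_label2id = {"URTI": 0, "Bronchitis": 1}
toy_id2label = {0: "URTI", 1: "Bronchitis"}
toy_records = [
    {"text": "Patient presents with: cough", "condition": "URTI"},
    {"text": "", "condition": "Bronchitis"},
    {"text": "Patient presents with: fever", "condition": "Unknown"},
]

for problem in validate_records(toy_records, toy_label2id, toy_id2label):
    print(problem)
```

Running the same function on `train_data`, `val_data` and `test_data` should return an empty list if the preprocessing step was consistent.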


### Prepare Datasets for Training

In [3]:
print("Converting to HuggingFace Dataset format...")

def prepare_dataset(data):
    """Convert data to HuggingFace Dataset format."""
    texts = [item['text'] for item in data]
    labels = [label2id[item['condition']] for item in data]
    return Dataset.from_dict({
        'text': texts,
        'label': labels,
    })

train_dataset = prepare_dataset(train_data)
val_dataset = prepare_dataset(val_data)
test_dataset = prepare_dataset(test_data)

dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset,
})

print("Dataset structure:")
print(dataset)

print("\nSample from training set:")
print(f"   Text: {dataset['train'][0]['text'][:80]}...")
print(f"   Label: {dataset['train'][0]['label']} ({id2label[dataset['train'][0]['label']]})")

Converting to HuggingFace Dataset format...
Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1025602
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 132448
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 134529
    })
})

Sample from training set:
   Text: Patient presents with: Do you live with 4 or more people; had significantly incr...
   Label: 45 (URTI)


### Load Model and Tokenizer

In [4]:
MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

print(f"Loading model: {MODEL_NAME}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print("Tokenizer loaded!")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)
print("Model loaded!")

# Move model to GPU
model = model.to(device)

# Print model info
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("MODEL INFO")
print(f"   Model: {MODEL_NAME}")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Number of classes: {len(label2id)}")

Loading model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract


Tokenizer loaded!


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded!
MODEL INFO
   Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract
   Total parameters: 109,519,921
   Trainable parameters: 109,519,921
   Number of classes: 49


### Tokenize Datasets

In [5]:
MAX_LENGTH = 256

print(f"Tokenizing datasets (max_length={MAX_LENGTH})...")

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=MAX_LENGTH,
    )

# Tokenize all datasets
print("Tokenizing training set...")
tokenized_train = train_dataset.map(tokenize_function, batched=True)

print("Tokenizing validation set...")
tokenized_val = val_dataset.map(tokenize_function, batched=True)

print("Tokenizing test set...")
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print("TOKENIZATION COMPLETE")
print(f"   Training samples:   {len(tokenized_train):,}")
print(f"   Validation samples: {len(tokenized_val):,}")
print(f"   Test samples:       {len(tokenized_test):,}")

Tokenizing datasets (max_length=256)...
Tokenizing training set...


Tokenizing validation set...


Tokenizing test set...


TOKENIZATION COMPLETE
   Training samples:   1,025,602
   Validation samples: 132,448
   Test samples:       134,529
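
The tokenizer call above pads every example to `MAX_LENGTH`, so the `DataCollatorWithPadding` passed to the trainer has nothing left to pad. Padding per batch instead (by dropping `padding='max_length'`) usually means fewer tokens per step. The sketch below only counts tokens for made-up sequence lengths. It does not measure this dataset, and the function names are my own.

```python
def padded_tokens_fixed(lengths, max_length):
    """Tokens processed when every sequence is padded to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(lengths, batch_size, max_length):
    """Tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to max_length first, as the tokenizer would.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight examples (not taken from the dataset).
example_lengths = [40, 55, 62, 48, 120, 300, 75, 90]

fixed = padded_tokens_fixed(example_lengths, max_length=256)
dynamic = padded_tokens_dynamic(example_lengths, batch_size=4, max_length=256)
print(f"fixed padding:   {fixed} tokens")
print(f"dynamic padding: {dynamic} tokens")
```

How much this saves in practice depends on the real length distribution, which the notebook does not report.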


### Define Training Configuration

With only 6 GB of GPU memory, I use a small per-device batch size and rely on gradient accumulation to reach an effective batch size of 64 without running out of memory.

In [6]:
# Training hyperparameters (optimized for 6GB GPU)
BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 4  # Effective batch size = 16 * 4 = 64
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3
WARMUP_RATIO = 0.1
WEIGHT_DECAY = 0.01

OUTPUT_DIR = "../models/condition_classifier"

print("TRAINING CONFIGURATION")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Gradient accumulation steps: {GRADIENT_ACCUMULATION_STEPS}")
print(f"   Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"   Learning rate: {LEARNING_RATE}")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Warmup ratio: {WARMUP_RATIO}")
print(f"   Weight decay: {WEIGHT_DECAY}")
print(f"   Output directory: {OUTPUT_DIR}")

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Define metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    f1_macro = f1_score(labels, predictions, average='macro')
    f1_weighted = f1_score(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted,
    }


TRAINING CONFIGURATION
   Batch size: 16
   Gradient accumulation steps: 4
   Effective batch size: 64
   Learning rate: 2e-05
   Epochs: 3
   Warmup ratio: 0.1
   Weight decay: 0.01
   Output directory: ../models/condition_classifier


### Setup Trainer

In [7]:
print("Setting up Trainer...")

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    greater_is_better=True,
    logging_dir=f"{OUTPUT_DIR}/logs",
    logging_steps=500,
    fp16=True,  # Mixed precision for memory efficiency
    dataloader_num_workers=4,
    report_to="none",
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Estimate training time
steps_per_epoch = len(tokenized_train) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)
total_steps = steps_per_epoch * NUM_EPOCHS

print(f"   Steps per epoch: {steps_per_epoch:,}")
print(f"   Total training steps: {total_steps:,}")

Setting up Trainer...
   Steps per epoch: 16,025
   Total training steps: 48,075
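
The step counts follow directly from the configuration. The sketch below redoes the arithmetic and also derives the warmup length. Rounding the warmup up with `math.ceil` is my assumption about how `warmup_ratio` is converted to steps, so treat that number as approximate.

```python
import math

train_samples = 1_025_602
batch_size = 16
grad_accum_steps = 4
num_epochs = 3
warmup_ratio = 0.1

# One optimizer step consumes batch_size * grad_accum_steps samples.
effective_batch = batch_size * grad_accum_steps
steps_per_epoch = train_samples // effective_batch
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch:      {steps_per_epoch:,}")
print(f"Total steps:          {total_steps:,}")
print(f"Warmup steps:         {warmup_steps:,}")
```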



### Start Training

In [8]:
print("STARTING TRAINING")
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Train the model
start_time = datetime.now()

train_result = trainer.train()

end_time = datetime.now()
training_duration = end_time - start_time


print("TRAINING COMPLETE")
print(f"End time: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total duration: {training_duration}")

print("\nTraining metrics:")
for key, value in train_result.metrics.items():
    print(f"   {key}: {value:.4f}")

STARTING TRAINING
Start time: 2026-01-20 19:19:00


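| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | F1 Weighted |
|:------|:--------------|:----------------|:---------|:---------|:------------|
| 1     | 0.0449        | 0.046539        | 0.979766 | 0.97697  | 0.978382    |
| 2     | 0.0446        | 0.045581        | 0.980211 | 0.977781 | 0.978829    |
| 3     | 0.042         | 0.045106        | 0.980355 | 0.978035 | 0.97898     |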


TRAINING COMPLETE
End time: 2026-01-21 08:40:24
Total duration: 13:21:24.253545

Training metrics:
   train_runtime: 48084.0402
   train_samples_per_second: 63.9880
   train_steps_per_second: 1.0000
   total_flos: 404941647777850368.0000
   train_loss: 0.1060
   epoch: 3.0000
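
The reported throughput can be cross-checked against the dataset size: three passes over 1,025,602 samples in roughly 48,084 seconds. A quick check using only the numbers printed above:

```python
train_samples = 1_025_602
num_epochs = 3
train_runtime_s = 48084.0402  # train_runtime from the metrics above

# Total samples seen during training divided by wall-clock seconds.
samples_per_second = train_samples * num_epochs / train_runtime_s
hours = train_runtime_s / 3600

print(f"samples/s: {samples_per_second:.3f}")
print(f"runtime:   {hours:.2f} h")
```

The result agrees with the logged `train_samples_per_second`, which suggests the run covered all three epochs without skipped batches.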


### Evaluate on Test Set

In [9]:
test_results = trainer.evaluate(tokenized_test)

print("\nTest Set Results:")

for key, value in test_results.items():
    if isinstance(value, float):
        print(f"   {key}: {value:.4f}")
    else:
        print(f"   {key}: {value}")


Test Set Results:
   eval_loss: 0.0420
   eval_accuracy: 0.9822
   eval_f1_macro: 0.9794
   eval_f1_weighted: 0.9810
   eval_runtime: 587.3746
   eval_samples_per_second: 229.0340
   eval_steps_per_second: 14.3160
   epoch: 3.0000


**Final Performance Summary**

| Dataset    | Accuracy | F1 Macro | F1 Weighted |
|:-----------|:---------|:---------|:------------|
| Validation | 98.04%   | 97.80%   | 97.90%      |
| Test       | 98.22%   | 97.94%   | 98.10%      |


| Observation           | Meaning                                                        |
|:----------------------|:---------------------------------------------------------------|
| Test > Validation     | Model generalizes to held-out data from the same distribution  |
| 98.22% Accuracy       | Misclassifies about 2 out of 100 test patients                 |
| 97.94% F1 Macro       | Strong average performance across the 49 conditions            |
| No sign of overfitting | Validation loss fell every epoch; test matches validation     |
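
Macro F1 is an unweighted mean over classes, so a high value does not by itself rule out a weak class. A per-class breakdown would show that. The sketch below computes per-class, macro and weighted F1 in plain Python on toy labels (the notebook itself uses `sklearn.metrics.f1_score`). The labels are invented for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 weights by support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted


# Toy example: class 0 is common and easy, class 1 is rare and often missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

scores = per_class_f1(y_true, y_pred)
macro, weighted = macro_and_weighted(y_true, y_pred)
print("Per-class F1:", {k: round(v, 3) for k, v in scores.items()})
print(f"Macro F1:    {macro:.3f}")
print(f"Weighted F1: {weighted:.3f}")
```

In the toy data the rare class pulls macro F1 well below weighted F1. In this notebook the two are within 0.2 points of each other, which is consistent with no class doing badly, but only a per-class report on the test set would confirm it.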

### Save the Model

In [10]:
# Save model
model_save_path = "../models/condition_classifier/final"
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Save label mappings with the model
import shutil
shutil.copy("../data/classifier/label2id.json", f"{model_save_path}/label2id.json")
shutil.copy("../data/classifier/id2label.json", f"{model_save_path}/id2label.json")

# Save training metrics
metrics = {
    "model_name": MODEL_NAME,
    "num_classes": len(label2id),
    "training_samples": len(train_data),
    "validation_samples": len(val_data),
    "test_samples": len(test_data),
    "epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    # Read the scores from the evaluation cell instead of hard-coding them
    "test_accuracy": round(test_results["eval_accuracy"], 4),
    "test_f1_macro": round(test_results["eval_f1_macro"], 4),
    "test_f1_weighted": round(test_results["eval_f1_weighted"], 4),
}

with open(f"{model_save_path}/training_metrics.json", 'w') as f:
    json.dump(metrics, f, indent=2)

# List saved files
print("\nSaved files:")
print("-" * 60)
for file in os.listdir(model_save_path):
    file_path = os.path.join(model_save_path, file)
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    print(f"   {file:<30} {size_mb:>8.2f} MB")

print("Model saved at:")

print(f"Location: {model_save_path}")


Saved files:
------------------------------------------------------------
   config.json                        0.00 MB
   model.safetensors                417.81 MB
   tokenizer_config.json              0.00 MB
   special_tokens_map.json            0.00 MB
   vocab.txt                          0.21 MB
   tokenizer.json                     0.65 MB
   training_args.bin                  0.01 MB
   label2id.json                      0.00 MB
   id2label.json                      0.00 MB
   training_metrics.json              0.00 MB
Model saved at:
Location: ../models/condition_classifier/final


### Testing with Sample Predictions

In [12]:
from transformers import pipeline

# Create prediction pipeline
classifier = pipeline(
    "text-classification",
    model=model_save_path,
    tokenizer=model_save_path,
    device=0 if torch.cuda.is_available() else -1,  # fall back to CPU when no GPU is present
    top_k=5
)

# Test cases
test_cases = [
    "Patient presents with: severe headache; fever; stiff neck; sensitivity to light",
    "Patient presents with: chest pain; shortness of breath; sweating; pain radiating to left arm",
    "Patient presents with: runny nose; sore throat; cough; mild fever",
    "Patient presents with: wheezing; difficulty breathing; chest tightness; cough",
    "Patient presents with: abdominal pain; nausea; vomiting; diarrhea",
]

print("\nSample Predictions:")

for i, text in enumerate(test_cases, 1):
    print(f"\nTest Case {i}:")
    print(f"Input: {text[:70]}...")
    print("Predictions:")
    
    results = classifier(text)[0]
    for rank, result in enumerate(results, 1):
        condition = result['label']
        confidence = result['score'] * 100
        print(f"   {rank}. {condition}: {confidence:.2f}%")

Device set to use cuda:0



Sample Predictions:

Test Case 1:
Input: Patient presents with: severe headache; fever; stiff neck; sensitivity...
Predictions:
   1. Ebola: 69.52%
   2. Croup: 12.78%
   3. Bronchiolitis: 5.47%
   4. Guillain-Barré syndrome: 3.87%
   5. Allergic sinusitis: 3.56%

Test Case 2:
Input: Patient presents with: chest pain; shortness of breath; sweating; pain...
Predictions:
   1. Spontaneous pneumothorax: 27.13%
   2. Acute pulmonary edema: 18.16%
   3. Unstable angina: 7.62%
   4. Ebola: 7.00%
   5. Tuberculosis: 5.41%

Test Case 3:
Input: Patient presents with: runny nose; sore throat; cough; mild fever...
Predictions:
   1. URTI: 33.40%
   2. Allergic sinusitis: 31.58%
   3. Bronchitis: 9.87%
   4. Ebola: 7.83%
   5. Croup: 5.28%

Test Case 4:
Input: Patient presents with: wheezing; difficulty breathing; chest tightness...
Predictions:
   1. Bronchiolitis: 57.70%
   2. Acute COPD exacerbation / infection: 33.80%
   3. Acute pulmonary edema: 1.56%
   4. Ebola: 1.42%
   5. Spontaneous ri

The model runs, but its predictions on these hand-written inputs are less reliable than the test metrics suggest. The table below summarizes each case.

---

| Test Case | Expected | Top Prediction | Issue |
|:----------|:---------|:---------------|:------|
| 1. Headache, fever, stiff neck | Meningitis-like | Ebola (70%) | Meningitis not in dataset |
| 2. Chest pain, left arm | Heart condition | Spontaneous pneumothorax (27%) | Low confidence; unstable angina ranked third (8%) |
| 3. Runny nose, cough | Cold/URTI | URTI (33%) | Correct, but nearly tied with allergic sinusitis (32%) |
| 4. Wheezing, breathing difficulty | Asthma/COPD | Bronchiolitis (58%) | Plausible; acute COPD exacerbation ranked second (34%) |
| 5. Abdominal pain, vomiting | GI condition | Tuberculosis (33%) | GI conditions limited in dataset |

---

**Why Some Predictions Seem Off**

**Key Insight:** There's a format mismatch between training data and test input.

**Training data looks like:**
```
Patient presents with: Do you live with 4 or more people; had significantly 
increased sweating; pain somewhere, related to your reason for consulting...
```

**Our test input looks like:**
```
Patient presents with: severe headache; fever; stiff neck; sensitivity to light
```

The model learned from questionnaire-style text, not natural symptom descriptions.
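
One way to narrow the gap would be to rewrite free-text symptoms into the questionnaire wording before building the input string. The sketch below is purely illustrative: `PHRASE_MAP` is a hypothetical lookup I built from the two phrases visible in the training sample, and a real mapping would have to come from the dataset's own evidence definitions.

```python
PREFIX = "Patient presents with: "

# Hypothetical mapping from free-text symptoms to questionnaire-style phrases.
PHRASE_MAP = {
    "sweating": "had significantly increased sweating",
    "chest pain": "pain somewhere, related to your reason for consulting",
}


def build_input(symptoms):
    """Join symptoms in the 'Patient presents with: a; b; c' format.

    Returns the formatted text and the symptoms that had no known mapping.
    """
    phrases, unmapped = [], []
    for symptom in symptoms:
        key = symptom.strip().lower()
        if key in PHRASE_MAP:
            phrases.append(PHRASE_MAP[key])
        else:
            # Keep the original wording, but report it as unmapped.
            phrases.append(symptom.strip())
            unmapped.append(symptom.strip())
    return PREFIX + "; ".join(phrases), unmapped


text, unmapped = build_input(["Chest pain", "sweating", "shortness of breath"])
print(text)
print("Unmapped:", unmapped)
```

Returning the unmapped symptoms makes it visible when an input still contains wording the model never saw during training.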

---

**Important Context**

| Metric | Value | Notes |
|--------|-------|-------|
| Test Set Accuracy | 98.22% | On data matching training format |
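| Free-text Input | Not measured (likely lower) | Different text format; only five hand-written cases tried |

**This is expected.** The model performs well on its training distribution, and the 98.22% figure should not be read as accuracy on free-text input.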
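
Inputs outside the training format tend to produce flat or misleading score distributions, so a simple guard is to flag predictions whose top score or top-two margin is small. This is a minimal sketch: the thresholds are arbitrary placeholders rather than tuned values, `is_uncertain` is my own helper, and the scores are copied from the sample predictions above.

```python
def is_uncertain(results, min_top=0.5, min_margin=0.2):
    """Flag a ranked prediction list as uncertain.

    `results` is a list of {'label', 'score'} dicts sorted by descending
    score, like the pipeline output when top_k is set.
    """
    top = results[0]["score"]
    runner_up = results[1]["score"] if len(results) > 1 else 0.0
    return top < min_top or (top - runner_up) < min_margin


# Top-two scores from Test Case 1 and Test Case 3.
case_1 = [
    {"label": "Ebola", "score": 0.6952},
    {"label": "Croup", "score": 0.1278},
]
case_3 = [
    {"label": "URTI", "score": 0.3340},
    {"label": "Allergic sinusitis", "score": 0.3158},
]

print("Case 1 uncertain:", is_uncertain(case_1))
print("Case 3 uncertain:", is_uncertain(case_3))
```

Case 1 shows the limit of this guard: the model is confidently wrong there, so a score threshold alone cannot catch conditions that are missing from the label set.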


### Model Training Summary

**Model Details**

| Attribute | Value |
|-----------|-------|
| Base Model | PubMedBERT (microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) |
| Fine-tuned on | DDXPlus Medical Dataset |
| Task | 49-class Condition Classification |

---

**Dataset**

| Split | Samples |
|-------|---------|
| Training | 1,025,602 |
| Validation | 132,448 |
| Test | 134,529 |
| **Total** | **1,292,579** |

---

**Training Configuration**

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 16 (effective: 64) |
| Learning Rate | 2e-5 |
| Training Time | 13 hours 21 minutes |

---

**Final Metrics (Test Set)**

| Metric | Value |
|--------|-------|
| Accuracy | 98.22% |
| F1 Macro | 97.94% |
| F1 Weighted | 98.10% |

---

**Model Saved**

| Attribute | Value |
|-----------|-------|
| Location | ../models/condition_classifier/final/ |
| Size | ~418 MB |

---

**Limitations Noted**

- Model trained on questionnaire-style text
- Natural language input may need preprocessing
- 49 conditions only (some common conditions may be missing)