# Fine-Tune Sentence Transformer for Medical Symptom Extraction

**Purpose:** Train a sentence transformer model to better understand how patients describe symptoms in natural language.

**Dataset:**
- 200k doctor-patient conversations (`ai-medical-chatbot.csv`), of which the first 100,000 rows are used
- 342 canonical symptoms (`symptoms.csv`)

**Output:** Fine-tuned model saved to `sentence_transformer_best/` in the Google Drive project folder

---

## Table of Contents
1. [Setup & Installation](#setup)
2. [Load Data](#load-data)
3. [Prepare Training Data](#prepare-training)
4. [Fine-Tune Model](#fine-tune)
5. [Evaluate Performance](#evaluate)
6. [Save Model](#save)
7. [Quick Test](#test)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
## 1. Setup & Installation <a id='setup'></a>

In [None]:
# Install required packages
!pip install sentence-transformers torch pandas numpy scikit-learn tqdm



In [None]:
# Google Colab Setup
import os

# Create necessary directories
!mkdir -p data models

print("📁 Directories created!")
print("\n⚠️ IMPORTANT: Upload your datasets now!")
print("   1. Click the folder icon on the left sidebar")
print("   2. Navigate to the 'data' folder")
print("   3. Upload these files:")
print("      - ai-medical-chatbot.csv")
print("      - symptoms.csv")
print("\n💡 Or use the code below to upload via dialog:")
print("\nfrom google.colab import files")
print("uploaded = files.upload()")
print("# Then move files: !mv *.csv data/")

📁 Directories created!

⚠️ IMPORTANT: Upload your datasets now!
   1. Click the folder icon on the left sidebar
   2. Navigate to the 'data' folder
   3. Upload these files:
      - ai-medical-chatbot.csv
      - symptoms.csv

💡 Or use the code below to upload via dialog:

from google.colab import files
uploaded = files.upload()
# Then move files: !mv *.csv data/


In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"✅ GPU Detected: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ No GPU detected. Training will be MUCH slower.")
    print("   Go to Runtime > Change runtime type > Select GPU")

✅ GPU Detected: Tesla T4
   Memory: 15.83 GB


In [None]:
import pandas as pd
import numpy as np
import torch
import re
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import json
import os
from collections import defaultdict

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ Running on CPU - training will be slower but still works!")

PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4


---
## 2. Load Data <a id='load-data'></a>

In [None]:
# Load symptoms list
print("Loading symptoms...")
symptoms_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/symptoms.csv')
SYMPTOMS = symptoms_df['symptoms'].str.lower().str.strip().tolist()
print(f"✓ Loaded {len(SYMPTOMS)} symptoms")
print(f"\nFirst 10 symptoms:")
for i, symptom in enumerate(SYMPTOMS[:10], 1):
    print(f"  {i}. {symptom}")

Loading symptoms...
✓ Loaded 342 symptoms

First 10 symptoms:
  1. abdominal distention
  2. abnormal appearing skin
  3. abnormal appearing tongue
  4. abnormal breathing sounds
  5. abnormal involuntary movements
  6. abnormal movement of eyelid
  7. abnormal size or shape of ear
  8. absence of menstruation
  9. ache all over
  10. acne or pimples


In [None]:
# Load medical conversations
print("\nLoading medical conversations...")
print("⚠️ This may take a minute for 100k rows...")

# Load full CSV, then restrict to 100,000 rows
conversations_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/ai-medical-chatbot.csv')
if len(conversations_df) > 100_000:
    conversations_df = conversations_df.iloc[:100_000].copy()
print(f"✓ Loaded {len(conversations_df):,} conversations")
print(f"\nDataset columns: {list(conversations_df.columns)}")
print(f"\nFirst conversation example:")
print(f"Description: {conversations_df.iloc[0]['Description'][:100]}...")
print(f"Patient: {conversations_df.iloc[0]['Patient'][:150]}...")
print(f"Doctor: {conversations_df.iloc[0]['Doctor'][:150]}...")


Loading medical conversations...
⚠️ This may take a minute for 100k rows...
✓ Loaded 100,000 conversations

Dataset columns: ['Description', 'Patient', 'Doctor']

First conversation example:
Description: Q. What does abutment of the nerve root mean?...
Patient: Hi doctor,I am just wondering what is abutting and abutment of the nerve root means in a back issue. Please explain. What treatment is required for an...
Doctor: Hi. I have gone through your query with diligence and would like you to know that I am here to help you. For further information consult a neurologist...


---
## 3. Prepare Training Data <a id='prepare-training'></a>

We'll create labelled training pairs where:
- **Text 1:** Patient's message
- **Text 2:** Symptom name
- **Label:** 1.0 if the symptom is mentioned in the conversation, 0.0 if it is not

This teaches the model to recognize when patient language matches a symptom.
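As a minimal sketch of the pair format described above, plain tuples can stand in for `InputExample` objects. The helper name `build_pairs` and the toy inputs are hypothetical and only illustrate the labelling scheme:

```python
def build_pairs(patient_text, mentioned, unmentioned):
    """Return (text, symptom, label) tuples: 1.0 for mentioned symptoms, 0.0 otherwise."""
    pairs = [(patient_text, symptom, 1.0) for symptom in mentioned]
    pairs += [(patient_text, symptom, 0.0) for symptom in unmentioned]
    return pairs

# Toy input: one mentioned symptom and one sampled negative (both hypothetical)
pairs = build_pairs(
    "i have a terrible headache and feel nauseous",
    mentioned=["headache"],
    unmentioned=["rib pain"],
)
for text, symptom, label in pairs:
    print(f"{label:.1f}  {symptom!r}  <-  {text!r}")
```

Each patient message is paired with every symptom it mentions and with a few symptoms it does not mention, so the model sees both matches and non-matches.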

In [None]:
def clean_text(text):
    """Clean and normalize text"""
    if pd.isna(text):
        return ""
    text = str(text).lower()
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

def extract_symptom_mentions(patient_text, doctor_text, symptoms_list):
    """
    Find which symptoms are mentioned in the conversation
    Returns list of mentioned symptoms
    """
    patient_text = clean_text(patient_text)
    doctor_text = clean_text(doctor_text)
    combined_text = patient_text + " " + doctor_text

    mentioned_symptoms = []

    for symptom in symptoms_list:
        symptom_clean = symptom.lower().strip()

        # Check exact match
        if symptom_clean in combined_text:
            mentioned_symptoms.append(symptom)
            continue

        # Check word-by-word match (for multi-word symptoms)
        symptom_words = symptom_clean.split()
        if len(symptom_words) > 1:
            # Check whether every word appears somewhere in the text (any order; substring match, so 'pain' also matches 'painful')
            if all(word in combined_text for word in symptom_words):
                mentioned_symptoms.append(symptom)

    return mentioned_symptoms

# Test the function
test_patient = "I have a terrible headache and feel nauseous"
test_doctor = "You may have migraine. The nausea is common with headaches."
test_mentions = extract_symptom_mentions(test_patient, test_doctor, SYMPTOMS)
print(f"Test extraction: {test_mentions}")

Test extraction: ['headache', 'nausea']
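The word-by-word check above relies on substring containment, so a multi-word symptom can match when its words appear in unrelated parts of the text. A stricter variant could require the whole symptom phrase to appear with word boundaries. The sketch below is an assumption about how that might look, not the labelling the notebook actually uses; `find_mentions_strict` and the toy symptom list are hypothetical:

```python
import re

def find_mentions_strict(text, symptoms):
    """Return symptoms that occur in `text` as whole words or phrases (case-insensitive)."""
    text = " ".join(str(text).lower().split())
    found = []
    for symptom in symptoms:
        # \b anchors keep 'pain' from matching inside 'painful'
        pattern = r"\b" + re.escape(symptom.lower().strip()) + r"\b"
        if re.search(pattern, text):
            found.append(symptom)
    return found

symptoms = ["headache", "nausea", "skin pain", "cough"]
text = "I have a terrible headache, painful skin and feel nauseous"
strict_hits = find_mentions_strict(text, symptoms)
print(strict_hits)
```

On this toy sentence the strict matcher keeps only `headache`, whereas the substring-based word check would also accept `skin pain` because both `skin` and `pain` occur somewhere in the text. The trade-off is lower recall: paraphrases such as "nauseous" for `nausea` are missed by both approaches.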


In [None]:
def create_training_examples(conversations_df, symptoms_list, max_examples=50000, sample_negatives=True):
    """
    Create training examples from conversations

    Args:
        conversations_df: DataFrame with Patient, Doctor columns
        symptoms_list: List of canonical symptoms
        max_examples: Maximum number of positive examples to create
        sample_negatives: Whether to include negative examples (no symptom match)

    Returns:
        List of InputExample objects
    """
    training_examples = []
    positive_count = 0
    negative_count = 0

    print(f"\nCreating training examples from {len(conversations_df):,} conversations...")
    print(f"This will take 5-10 minutes...\n")

    for idx, row in tqdm(conversations_df.iterrows(), total=len(conversations_df)):
        if positive_count >= max_examples:
            break

        patient_text = clean_text(row['Patient'])
        doctor_text = clean_text(row['Doctor'])

        # Skip very short messages
        if len(patient_text.split()) < 5:
            continue

        # Find mentioned symptoms
        mentioned_symptoms = extract_symptom_mentions(patient_text, doctor_text, symptoms_list)

        if mentioned_symptoms:
            # Create positive examples
            for symptom in mentioned_symptoms:
                training_examples.append(
                    InputExample(texts=[patient_text, symptom], label=1.0)
                )
                positive_count += 1

            # Create some negative examples (symptoms NOT mentioned)
            if sample_negatives and len(mentioned_symptoms) < len(symptoms_list):
                # Sample up to 2 unmentioned symptoms per conversation (not per positive pair)
                unmentioned = [s for s in symptoms_list if s not in mentioned_symptoms]
                num_negatives = min(2, len(unmentioned))
                negative_samples = np.random.choice(unmentioned, size=num_negatives, replace=False)

                for neg_symptom in negative_samples:
                    training_examples.append(
                        InputExample(texts=[patient_text, neg_symptom], label=0.0)
                    )
                    negative_count += 1

    print(f"\n✓ Created {len(training_examples):,} training examples")
    print(f"  - Positive examples: {positive_count:,}")
    print(f"  - Negative examples: {negative_count:,}")
    if negative_count:  # avoid ZeroDivisionError when sample_negatives=False
        print(f"  - Ratio: {positive_count/negative_count:.2f}:1 (positive:negative)")

    return training_examples
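The negative sampling in `create_training_examples` draws from `np.random.choice` without a fixed seed, so the sampled negatives can differ between runs. One way to make that step repeatable is to pass in a seeded generator. This is a standard-library sketch under that assumption; `pick_negatives` is a hypothetical helper, not part of the notebook:

```python
import random

def pick_negatives(symptoms, mentioned, k=2, rng=None):
    """Sample up to k symptoms that are not in `mentioned`, without replacement."""
    rng = rng or random.Random()
    mentioned_set = set(mentioned)
    unmentioned = [s for s in symptoms if s not in mentioned_set]
    return rng.sample(unmentioned, min(k, len(unmentioned)))

symptoms = ["headache", "nausea", "cough", "fever", "rib pain"]
# Two generators with the same seed produce the same negatives
first = pick_negatives(symptoms, ["headache"], rng=random.Random(42))
second = pick_negatives(symptoms, ["headache"], rng=random.Random(42))
print(first, first == second)
```

Sharing a single seeded `random.Random` instance across all conversations would keep the full training set reproducible while still varying the negatives from one conversation to the next.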

In [None]:
# Create training examples
# Adjust max_examples based on your compute resources:
# - CPU: 10,000-20,000 examples
# - GPU: 50,000+ examples

MAX_EXAMPLES = 60000  # Adjust this based on your hardware

train_examples = create_training_examples(
    conversations_df,
    SYMPTOMS,
    max_examples=MAX_EXAMPLES,
    sample_negatives=True
)


Creating training examples from 100,000 conversations...
This will take 5-10 minutes...



 39%|███▉      | 39141/100000 [00:48<01:16, 799.32it/s]


✓ Created 105,293 training examples
  - Positive examples: 60,003
  - Negative examples: 45,290
  - Ratio: 1.32:1 (positive:negative)





In [None]:
# Split into train and validation
train_data, val_data = train_test_split(train_examples, test_size=0.1, random_state=42)

print(f"\nDataset split:")
print(f"  Training: {len(train_data):,} examples")
print(f"  Validation: {len(val_data):,} examples")

# Show some examples
print(f"\n📝 Sample training examples:")
for i in range(min(3, len(train_data))):
    example = train_data[i]
    print(f"\nExample {i+1}:")
    print(f"  Patient text: {example.texts[0][:100]}...")
    print(f"  Symptom: {example.texts[1]}")
    print(f"  Label: {'MATCH ✓' if example.label == 1.0 else 'NO MATCH ✗'}")


Dataset split:
  Training: 94,763 examples
  Validation: 10,530 examples

📝 Sample training examples:

Example 1:
  Patient text: hello doctor, i am a 20 year old male. whenever i think of sex, my penis starts secreting a transpar...
  Symptom: hoarse voice
  Label: NO MATCH ✗

Example 2:
  Patient text: hello doctor, i am on no medications. i drink a ton of water, eat the right food. i do not eat pork....
  Symptom: rough skin
  Label: MATCH ✓

Example 3:
  Patient text: hello doctor, i have on and off fever since last friday and been having chest and stomach pain since...
  Symptom: skin pain
  Label: MATCH ✓


---
## 4. Fine-Tune Model <a id='fine-tune'></a>

We'll fine-tune the `all-MiniLM-L6-v2` model using cosine similarity loss.
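With its default settings, `CosineSimilarityLoss` pushes the cosine similarity of the two text embeddings toward the label by taking the mean squared error between them. The sketch below reproduces that arithmetic in plain Python on hypothetical 3-dimensional vectors. It illustrates the objective only and is not the library's implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_mse_loss(pairs):
    """Mean squared error between cosine similarity and label over (u, v, label) triples."""
    errors = [(cosine_similarity(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)

batch = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], 1.0),  # same direction, labelled match
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.0),  # orthogonal, labelled no match
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 1.0),  # orthogonal but labelled match
]
loss = cosine_mse_loss(batch)
print(f"{loss:.4f}")
```

Only the third pair contributes error here, because its embeddings are orthogonal while the label says they should align. Training moves such pairs closer together and pushes label-0.0 pairs toward zero similarity.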

In [None]:
from sentence_transformers import SentenceTransformer, models
from transformers import AutoConfig, AutoModel

print("Loading base model...")

base_model_name = 'sentence-transformers/all-MiniLM-L6-v2'

# 1️⃣ Load and modify config
config = AutoConfig.from_pretrained(base_model_name)
config.attention_probs_dropout_prob = 0.1
config.hidden_dropout_prob = 0.1

# 2️⃣ Load transformer backbone manually
hf_model = AutoModel.from_pretrained(base_model_name, config=config)

# 3️⃣ Wrap it with SentenceTransformers Transformer module
transformer = models.Transformer(base_model_name)
transformer.auto_model = hf_model  # Inject our custom model with dropout changes

# 4️⃣ Add pooling layer
pooling = models.Pooling(transformer.get_word_embedding_dimension())

# 5️⃣ Combine into SentenceTransformer
model = SentenceTransformer(modules=[transformer, pooling])

print(f"✓ Loaded {base_model_name} with custom dropout")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")


Loading base model...


✓ Loaded sentence-transformers/all-MiniLM-L6-v2 with custom dropout
Model embedding dimension: 384


In [None]:
# Verify Dropout
print(model[0].auto_model.config.attention_probs_dropout_prob)
print(model[0].auto_model.config.hidden_dropout_prob)

0.1
0.1


In [None]:
# Create dataloaders
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
val_dataloader = DataLoader(val_data, shuffle=False, batch_size=16)

# Define loss function
train_loss = losses.CosineSimilarityLoss(model)

print(f"Training setup:")
print(f"  Batch size: 16")
print(f"  Training batches: {len(train_dataloader)}")
print(f"  Validation batches: {len(val_dataloader)}")
print(f"  Loss function: CosineSimilarityLoss")

Training setup:
  Batch size: 16
  Training batches: 5923
  Validation batches: 659
  Loss function: CosineSimilarityLoss


In [None]:
from sentence_transformers import SentenceTransformer, losses
from torch.utils.data import DataLoader
import numpy as np
import torch
from sklearn.metrics import precision_recall_fscore_support

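NUM_EPOCHS = 5
PATIENCE = 2
THRESHOLD = 0.05 # Minimum drop in validation loss that counts as an improvement
# fit() is called once per epoch below, so this warmup length applies to every call
WARMUP_STEPS = len(train_dataloader) * NUM_EPOCHS * 0.1

output_path = '/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/'
base_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
# Reloads the stock checkpoint; this replaces the custom-dropout model built earlier
model = SentenceTransformer(base_model_name)
train_loss = losses.CosineSimilarityLoss(model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
# Initial symptom embeddings, computed with the untrained weights
symptom_embeddings = model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)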

best_val_loss = float('inf')
best_cosine_acc = 0
patience_counter = 0

def eval_val(model, val_data, symptom_embeddings, symptoms_list):
    """Score each validation pair by cosine similarity; return loss, accuracy, macro F1, micro F1."""
    y_true, y_pred = [], []
    pair_losses = []  # named to avoid shadowing the imported `losses` module
    for ex in val_data:
        text, target_symptom, label = ex.texts[0], ex.texts[1], ex.label
        text_emb = model.encode(text, convert_to_tensor=True)
        idx = symptoms_list.index(target_symptom)
        sym_emb = symptom_embeddings[idx]
        sim = float(torch.nn.functional.cosine_similarity(text_emb, sym_emb, dim=0))
        pred = 1.0 if sim > 0.5 else 0.0
        cosine_loss = 1 - sim if label == 1 else sim # loss for pos/neg
        pair_losses.append(cosine_loss)
        y_true.append(label)
        y_pred.append(pred)
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
    val_loss = np.mean(pair_losses) # mean of (1 - sim) for positives and sim for negatives
    cosine_acc = np.mean([int((yp==yt)) for yp, yt in zip(y_pred, y_true)]) # proportion correct
    return val_loss, cosine_acc, macro_f1, micro_f1

for epoch in range(NUM_EPOCHS):
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        warmup_steps=int(WARMUP_STEPS),
        show_progress_bar=True
    )

    # Re-encode the symptom list with the current weights; embeddings computed
    # before training go stale after every fit() call
    symptom_embeddings = model.encode(SYMPTOMS, convert_to_tensor=True)
    val_loss, cosine_acc, macro_f1, micro_f1 = eval_val(model, val_data, symptom_embeddings, SYMPTOMS)
    print(f"Epoch {epoch+1}: Val Loss {val_loss:.4f}, Cosine Acc {cosine_acc:.4f}, Macro F1 {macro_f1:.4f}, Micro F1 {micro_f1:.4f}")

    # Early stopping on validation loss (switch to cosine_acc if that suits your case better)
    if val_loss < best_val_loss - THRESHOLD:
        best_val_loss = val_loss
        patience_counter = 0
        model.save(output_path + "sentence_transformer_best")
        print("✓ New best model saved!")
    else:
        patience_counter += 1
        if patience_counter >= PATIENCE:
            print("Early stopping triggered!")
            break

# Optional: Apply MacroMicroF1Evaluator post-training for more analysis/reporting
class MacroMicroF1Evaluator:
    def __init__(self, val_examples, symptom_embeddings, symptoms_list):
        self.val_examples = val_examples
        self.symptom_embeddings = symptom_embeddings
        self.symptoms_list = symptoms_list

    def __call__(self, model):
        y_true, y_pred = [], []
        for ex in self.val_examples:
            text, target_symptom, label = ex.texts[0], ex.texts[1], ex.label
            text_emb = model.encode(text, convert_to_tensor=True)
            idx = self.symptoms_list.index(target_symptom)
            sym_emb = self.symptom_embeddings[idx]
            sim = float(torch.nn.functional.cosine_similarity(text_emb, sym_emb, dim=0))
            pred = 1.0 if sim > 0.5 else 0.0
            y_true.append(label)
            y_pred.append(pred)
        macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
        micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
        print(f"\nMacro F1: {macro_f1:.4f}, Micro F1: {micro_f1:.4f}")
        return macro_f1

# Final macro/micro F1 evaluation
evaluator = MacroMicroF1Evaluator(val_data, symptom_embeddings, SYMPTOMS)
evaluator(model)


Step,Training Loss
500,0.2137
1000,0.1174
1500,0.0971
2000,0.0891
2500,0.0873
3000,0.0839
3500,0.0805
4000,0.0784
4500,0.0787
5000,0.0768


Epoch 1: Val Loss 0.3896, Cosine Acc 0.6668, Macro F1 0.6589, Micro F1 0.6668
✓ New best model saved!


Step,Training Loss
500,0.0725
1000,0.0731
1500,0.0717
2000,0.0739
2500,0.0736
3000,0.0745
3500,0.0734
4000,0.0702
4500,0.0719
5000,0.0711


Epoch 2: Val Loss 0.3885, Cosine Acc 0.6675, Macro F1 0.6598, Micro F1 0.6675


Step,Training Loss
500,0.0691
1000,0.0673
1500,0.0644
2000,0.0639
2500,0.0612
3000,0.0596
3500,0.0601
4000,0.0595
4500,0.0635
5000,0.0653


Epoch 3: Val Loss 0.3936, Cosine Acc 0.6472, Macro F1 0.6364, Micro F1 0.6472
Early stopping triggered!

Macro F1: 0.6364, Micro F1: 0.6472


0.6363714056772443
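The early-stopping rule in the loop can be replayed on the logged validation losses to see why only the first epoch was saved. The sketch below copies the three loss values from the log above and applies the same comparison (`val_loss < best - THRESHOLD`). The function name `replay_early_stopping` is hypothetical:

```python
def replay_early_stopping(val_losses, threshold, patience):
    """Return (best_epoch, stop_epoch), 1-based; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - threshold:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Validation losses reported in the training log above
logged = [0.3896, 0.3885, 0.3936]
best_epoch, stop_epoch = replay_early_stopping(logged, threshold=0.05, patience=2)
print(best_epoch, stop_epoch)

# Same losses with no minimum-improvement margin
print(replay_early_stopping(logged, threshold=0.0, patience=2))
```

With a margin of 0.05, the small drop from 0.3896 to 0.3885 does not count as an improvement, so epoch 1 stays the saved checkpoint and training stops after epoch 3. With a margin of 0.0, epoch 2 would have been saved instead, which suggests the margin is large relative to the loss changes seen here.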

---
## 5. Evaluate Performance <a id='evaluate'></a>

Let's test the fine-tuned model against the base model.

In [None]:
from sklearn.metrics import precision_recall_fscore_support
# If you need entity-level metrics for span extraction, use seqeval or custom logic

def enhanced_evaluate_model(model, symptom_embeddings, val_examples, threshold=0.5):
    """
    Compute macro/micro precision, recall, F1 and mean squared-error loss for a model.
    Token- and entity-level metrics are placeholders and are not computed.
    """
    y_true = []
    y_pred = []
    losses = []

    for example in tqdm(val_examples, desc="Evaluating"):
        text = example.texts[0]
        target_symptom = example.texts[1]
        true_label = example.label

        # Encode text and symptom
        text_embedding = model.encode(text, convert_to_tensor=True)
        symptom_idx = SYMPTOMS.index(target_symptom)
        symptom_embedding = symptom_embeddings[symptom_idx]
        similarity = float(torch.nn.functional.cosine_similarity(text_embedding, symptom_embedding, dim=0))
        predicted_label = 1.0 if similarity > threshold else 0.0

        # Loss (squared error for binary classification)
        single_loss = (similarity - true_label) ** 2
        losses.append(single_loss)

        y_true.append(true_label)
        y_pred.append(predicted_label)

    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='micro')
    avg_loss = np.mean(losses)

    # Token-level metrics (if available)
    # Placeholder, depends on your token span annotation scheme
    # entity-level metrics (if available)
    # Placeholder, depends on your span/entity extraction logic

    metrics = {
        'macro_precision': macro_p,
        'macro_recall': macro_r,
        'macro_f1': macro_f1,
        'micro_precision': micro_p,
        'micro_recall': micro_r,
        'micro_f1': micro_f1,
        'avg_loss': avg_loss,
        # 'token_metrics': token_metrics,
        # 'entity_metrics': entity_metrics
    }
    return metrics

# Load both models for comparison
print("Loading models for comparison...")
base_model = SentenceTransformer(base_model_name)
finetuned_model = SentenceTransformer(output_path + "sentence_transformer_best")  # Use path to best model checkpoint!
print("✓ Models loaded")

# Pre-compute symptom embeddings
print("\nComputing symptom embeddings...")
base_symptom_embeddings = base_model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)
finetuned_symptom_embeddings = finetuned_model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)
print("✓ Embeddings computed")

# Enhanced evaluation
print("\nEvaluating BASE model...")
base_metrics = enhanced_evaluate_model(base_model, base_symptom_embeddings, val_data)
print("Base model:", base_metrics)

print("\nEvaluating FINE-TUNED model...")
ft_metrics = enhanced_evaluate_model(finetuned_model, finetuned_symptom_embeddings, val_data)
print("Fine-tuned model:", ft_metrics)

Loading models for comparison...
✓ Models loaded

Computing symptom embeddings...



✓ Embeddings computed

Evaluating BASE model...


Evaluating: 100%|██████████| 10530/10530 [01:33<00:00, 112.19it/s]


Base model: {'macro_precision': 0.7128960201889302, 'macro_recall': 0.5463114749081792, 'macro_f1': 0.39223594460089445, 'micro_precision': 0.47150997150997154, 'micro_recall': 0.47150997150997154, 'micro_f1': 0.47150997150997154, 'avg_loss': np.float64(0.33072767629016653)}

Evaluating FINE-TUNED model...


Evaluating: 100%|██████████| 10530/10530 [01:20<00:00, 130.51it/s]

Fine-tuned model: {'macro_precision': 0.9145876817764652, 'macro_recall': 0.9085757404874902, 'macro_f1': 0.9112158854692496, 'micro_precision': 0.9142450142450143, 'micro_recall': 0.9142450142450143, 'micro_f1': 0.9142450142450143, 'avg_loss': np.float64(0.07155196600663055)}
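In both result dictionaries the three micro-averaged values are identical. For single-label predictions like these, micro-averaged precision, recall and F1 all reduce to plain accuracy, while macro F1 averages the per-class F1 scores and so penalises a model that neglects one class. The sketch below works through that difference on hypothetical toy labels, without scikit-learn:

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    macro = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
    # Single-label case: micro F1 equals the fraction of correct predictions
    micro = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return macro, micro

# Toy data: the classifier predicts "no match" (0.0) for almost everything
y_true = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
y_pred = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

In the toy data the classifier rarely predicts a match, so the F1 for the match class is low and drags macro F1 below micro F1. The base model's scores show the same ordering (macro F1 0.39 against micro F1 0.47).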





In [None]:
# Write comparison report
os.makedirs(output_path, exist_ok=True)

with open(os.path.join(output_path, 'SenTran_comparison_report.txt'), 'w') as f:
    f.write("[BASE MODEL METRICS]\n")
    for k, v in base_metrics.items():
        f.write(f"{k}: {v:.4f}\n")
    f.write("\n[FINE-TUNED MODEL METRICS]\n")
    for k, v in ft_metrics.items():
        f.write(f"{k}: {v:.4f}\n")

In [None]:
from sentence_transformers import util

def extract_symptoms_comparison(text, model, symptom_embeddings, symptoms_list, threshold=0.5, top_k=5):
    """
    Extract symptoms using a given model
    """
    text_embedding = model.encode(text, convert_to_tensor=True)
    similarities = util.cos_sim(text_embedding, symptom_embeddings)[0]

    matches = []
    for idx, score in enumerate(similarities):
        if score > threshold:
            matches.append({
                'symptom': symptoms_list[idx],
                'score': float(score)
            })
    matches = sorted(matches, key=lambda x: x['score'], reverse=True)
    top_matches = matches[:top_k]
    return top_matches, matches  # top_k matches, plus every match above the threshold

# Test cases
test_cases = [
    "I've been sneezing all day",
    "My head hurts really bad",
    "I am coughing a lot",
    "My stomach area has been itching like crazy",
    "I can't stop sneezing and my nose is blocked",
    "Having terrible pounding in my temples and feel nauseous",
    "My back is really painful",
    "I feel dizzy and want to throw up"
]

print("\n" + "="*80)
print("COMPARISON: BASE MODEL vs FINE-TUNED MODEL")
print("="*80)

threshold = 0.45
top_k = 5

for test_text in test_cases:
    print(f"\n📝 Input: \"{test_text}\"")
    print("-" * 80)

    # Base model results
    base_top, base_full = extract_symptoms_comparison(
        test_text, base_model, base_symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )
    print("❌ BASE MODEL:")
    if base_top:
        for r in base_top:
            print(f"   • {r['symptom']} (confidence: {r['score']:.3f})")
    else:
        print("   • No symptoms detected")

    # Fine-tuned model results
    ft_top, ft_full = extract_symptoms_comparison(
        test_text, finetuned_model, finetuned_symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )
    print("\n✅ FINE-TUNED MODEL:")
    if ft_top:
        for r in ft_top:
            print(f"   • {r['symptom']} (confidence: {r['score']:.3f})")
    else:
        print("   • No symptoms detected")

    # Compare number and confidence of symptoms detected
    detected_improvement = len(ft_top) - len(base_top)
    if detected_improvement > 0:
        print(f"\n💡 Improvement: +{detected_improvement} more symptoms detected")
    elif detected_improvement < 0:
        print(f"\n⚠️ Note: {abs(detected_improvement)} fewer symptoms (may be more precise)")
    else:
        print("\n⏺️ No change in number of detected symptoms")

    # Compare score distribution for deeper insight if needed
    avg_base_conf = np.mean([x['score'] for x in base_top]) if base_top else 0
    avg_ft_conf = np.mean([x['score'] for x in ft_top]) if ft_top else 0
    print(f"↔️ Avg confidence: Base={avg_base_conf:.3f}, Fine-tuned={avg_ft_conf:.3f}")

    print()



COMPARISON: BASE MODEL vs FINE-TUNED MODEL

📝 Input: "I've been sneezing all day"
--------------------------------------------------------------------------------
❌ BASE MODEL:
   • sneezing (confidence: 0.763)

✅ FINE-TUNED MODEL:
   • sneezing (confidence: 0.975)
   • cough (confidence: 0.828)
   • sore in nose (confidence: 0.820)
   • feeling cold (confidence: 0.764)
   • wheezing (confidence: 0.763)

💡 Improvement: +4 more symptoms detected
↔️ Avg confidence: Base=0.763, Fine-tuned=0.830


📝 Input: "My head hurts really bad"
--------------------------------------------------------------------------------
❌ BASE MODEL:
   • headache (confidence: 0.569)
   • malformed scalp (confidence: 0.504)
   • hurts to breath (confidence: 0.483)
   • ache all over (confidence: 0.472)
   • ear pain (confidence: 0.465)

✅ FINE-TUNED MODEL:
   • ache all over (confidence: 0.861)
   • redness in ear (confidence: 0.860)
   • arm pain (confidence: 0.84

---
## 6. Save Model & Metadata <a id='save'></a>

In [None]:
import json
import shutil
import zipfile

# Define output/model directory
output_path = '/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC'

# Save fine-tuned model if not already saved
finetuned_model.save(os.path.join(output_path, "sentence_transformer_best"))

# Write metadata JSON
metadata = {
    'base_model': base_model_name,
    'training_examples': len(train_data),
    'validation_examples': len(val_data),
    'num_symptoms': len(SYMPTOMS),
    'epochs': NUM_EPOCHS,  # configured maximum; early stopping may end training sooner
    'macro_f1_base': float(base_metrics['macro_f1']),
    'micro_f1_base': float(base_metrics['micro_f1']),
    'macro_f1_finetuned': float(ft_metrics['macro_f1']),
    'micro_f1_finetuned': float(ft_metrics['micro_f1']),
    'macro_improvement_pct': float((ft_metrics['macro_f1'] - base_metrics['macro_f1']) / base_metrics['macro_f1'] * 100),
    'micro_improvement_pct': float((ft_metrics['micro_f1'] - base_metrics['micro_f1']) / base_metrics['micro_f1'] * 100),
    # Averages from the last test case of the comparison loop only
    'avg_conf_base': float(avg_base_conf),
    'avg_conf_finetuned': float(avg_ft_conf),
    'output_path': output_path
}
with open(os.path.join(output_path, 'training_metadata.json'), 'w') as f:
    json.dump(metadata, f, indent=2)

# Path to comparison report and model dir
model_dir = os.path.join(output_path, "sentence_transformer_best")
comparison_report = os.path.join(output_path, "SenTran_comparison_report.txt")
metadata_file = os.path.join(output_path, "training_metadata.json")

# Name for the final zip file (in current Colab working dir)
zip_filename = "sentencetransformer_package.zip"
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add model directory (all files and subfolders)
    for foldername, subfolders, filenames in os.walk(model_dir):
        for filename in filenames:
            file_path = os.path.join(foldername, filename)
            arcname = os.path.relpath(file_path, output_path)
            zipf.write(file_path, arcname)
    # Add report and metadata
    zipf.write(comparison_report, os.path.basename(comparison_report))
    zipf.write(metadata_file, os.path.basename(metadata_file))

print(f"✓ All files zipped as {zip_filename}.")


✓ All files zipped as sentencetransformer_package.zip.


---
## 7. Quick Test Interface <a id='test'></a>

Interactive testing of your fine-tuned model.

In [None]:
def test_symptom_extraction(
    text, threshold=0.45, top_k=5,
    model_choice='finetuned'
):
    """
    Test the chosen model ('base' or 'finetuned') on custom input
    """
    print(f"\n{'='*60}")
    print(f"Input: \"{text}\"")
    print(f"Model: {'BASE' if model_choice=='base' else 'FINE-TUNED'}")
    print(f"Threshold: {threshold:.2f}  |  Top-K: {top_k}")
    print(f"{'='*60}")

    # Choose model and embeddings
    if model_choice == 'base':
        model = base_model
        symptom_embeddings = base_symptom_embeddings
    else:
        model = finetuned_model
        symptom_embeddings = finetuned_symptom_embeddings

    top_results, all_results = extract_symptoms_comparison(
        text, model, symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )

    if top_results:
        print(f"\n✅ Detected {len(top_results)} symptom(s):\n")
        for i, r in enumerate(top_results, 1):
            confidence_bar = '█' * int(r['score'] * 20)
            print(f"  {i}. {r['symptom']}")
            print(f"     Confidence: {r['score']:.3f} {confidence_bar}")
    else:
        print("\n❌ No symptoms detected above threshold")
        print(f"   Try lowering threshold (currently {threshold})")

    # Show full sorted scores for transparency
    print("\nAll Candidate Scores (top 10):")
    for r in all_results[:10]:
        print(f"  • {r['symptom']}: {r['score']:.3f}")

    return top_results

# Example usage / tests:
test_symptom_extraction("I've been feeling dizzy and have a terrible headache", model_choice='finetuned')
test_symptom_extraction("My throat is sore and I can't stop coughing", model_choice='base')
test_symptom_extraction("I have chest pain and shortness of breath", model_choice='finetuned')



Input: "I've been feeling dizzy and have a terrible headache"
Model: FINE-TUNED
Threshold: 0.45  |  Top-K: 5

✅ Detected 5 symptom(s):

  1. headache
     Confidence: 0.953 ███████████████████
  2. dizziness
     Confidence: 0.934 ██████████████████
  3. fatigue
     Confidence: 0.913 ██████████████████
  4. weight gain
     Confidence: 0.890 █████████████████
  5. nausea
     Confidence: 0.879 █████████████████

All Candidate Scores (top 10):
  • headache: 0.953
  • dizziness: 0.934
  • fatigue: 0.913
  • weight gain: 0.890
  • nausea: 0.879
  • ache all over: 0.877
  • fever: 0.859
  • ear pain: 0.843
  • mass on ear: 0.839
  • rib pain: 0.821

Input: "My throat is sore and I can't stop coughing"
Model: BASE
Threshold: 0.45  |  Top-K: 5

✅ Detected 5 symptom(s):

  1. sore throat
    

[{'symptom': 'shortness of breath', 'score': 0.9278728365898132},
 {'symptom': 'difficulty in breathing', 'score': 0.8845827579498291},
 {'symptom': 'burning chest pain', 'score': 0.87139892578125},
 {'symptom': 'sharp chest pain', 'score': 0.8534422516822815},
 {'symptom': 'breathing fast', 'score': 0.8158175945281982}]

In [None]:
# Interactive testing (optional - uncomment to use)
# while True:
#     user_input = input("\nDescribe your symptoms (or 'quit' to exit): ")
#     if user_input.lower() in ['quit', 'exit', 'q']:
#         break
#     test_symptom_extraction(user_input)

---
## ✅ Training Complete!

### Next Steps:

1. **Your fine-tuned model is saved at:** `/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/sentence_transformer_best`

2. **To use it in your chatbot notebook:**
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/sentence_transformer_best')
```

3. **Integration code is ready** - just copy the `extract_symptoms_comparison()` function to your main chatbot!

### Performance Summary:
- ✅ Base model accuracy (micro F1): 47.2%
- ✅ Fine-tuned accuracy (micro F1): 91.4%
- ✅ Improvement: +44.3 percentage points (+93.9% relative)
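The summary figures can be recomputed from the micro F1 values in the evaluation output of Section 5 (for these single-label predictions, micro F1 equals accuracy). A minimal sketch of the arithmetic follows; the variable names are illustrative, and the relative figure uses the same formula as `micro_improvement_pct` in the metadata cell:

```python
# Micro F1 values copied from the evaluation output, rounded to 4 decimals
base_micro_f1 = 0.4715
finetuned_micro_f1 = 0.9142

absolute_gain = finetuned_micro_f1 - base_micro_f1
relative_gain_pct = absolute_gain / base_micro_f1 * 100

print(f"Absolute gain: {absolute_gain * 100:+.1f} percentage points")
print(f"Relative gain: {relative_gain_pct:+.1f}%")
```

Reporting both figures avoids ambiguity: the absolute gain is measured in percentage points, while the relative gain is expressed against the base model's score.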

### Tips:
- Adjust `threshold` parameter to control sensitivity (lower = more symptoms detected)
- If you need better performance, train with more examples or more epochs
- The model works best with complete sentences (not just single words)
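To illustrate the threshold tip, the sketch below applies the same filter-sort-truncate logic as `extract_symptoms_comparison` to a few scores copied from the fine-tuned output in Section 7. The helper `select_matches` is a hypothetical stand-in that works on a plain dictionary, so it runs without the model:

```python
def select_matches(scores, threshold=0.45, top_k=5):
    """Keep symptoms scoring above threshold, highest first, truncated to top_k."""
    kept = [(name, score) for name, score in scores.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

# Scores copied from the fine-tuned output for
# "I've been feeling dizzy and have a terrible headache"
scores = {
    "headache": 0.953, "dizziness": 0.934, "fatigue": 0.913,
    "weight gain": 0.890, "nausea": 0.879, "ache all over": 0.877,
}

for threshold in (0.45, 0.90, 0.95):
    names = [name for name, _ in select_matches(scores, threshold)]
    print(threshold, names)
```

Because the fine-tuned model assigns high scores across the board, the default threshold of 0.45 lets through symptoms the input never mentions, such as `weight gain`. Raising the threshold toward 0.90 trims the list to the closer matches, at the cost of possibly dropping true but weaker ones.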

In [None]:
import shutil
import os

folder_to_zip = '/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/sentence_transformer_best'
output_directory = '/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC'
zip_filename = os.path.join(output_directory, 'sentence_transformer_best.zip')

# Create the zip archive (without including top-level directory)
shutil.make_archive(zip_filename.replace('.zip', ''), 'zip', folder_to_zip)
print(f"Created archive: {zip_filename}")

# No need to download if file is saved to Drive.
print(f"The zip file is saved to your Google Drive at: {zip_filename}")

Created archive: /content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/sentence_transformer_best.zip
The zip file is saved to your Google Drive at: /content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/sentence_transformer_best.zip
