# Fine-Tune Sentence Transformer for Medical Symptom Extraction

**Purpose:** Train a sentence transformer model to better understand how patients describe symptoms in natural language.

**Dataset:**
- 200k doctor-patient conversations (ai-medical-chatbot.csv)
- 131+ canonical symptoms (symptoms.csv)

**Output:** Fine-tuned model saved to `models/medical_symptom_matcher/`

---

## Table of Contents
1. [Setup & Installation](#setup)
2. [Load Data](#load-data)
3. [Prepare Training Data](#prepare-training)
4. [Fine-Tune Model](#fine-tune)
5. [Evaluate Performance](#evaluate)
6. [Save Model](#save)
7. [Quick Test](#test)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


---
## 1. Setup & Installation <a id='setup'></a>

In [2]:
# Install required packages
!pip install sentence-transformers torch pandas numpy scikit-learn tqdm



In [3]:
# Google Colab Setup
import os

# Create necessary directories
!mkdir -p data models

print("üìÅ Directories created!")
print("\n‚ö†Ô∏è IMPORTANT: Upload your datasets now!")
print("   1. Click the folder icon on the left sidebar")
print("   2. Navigate to the 'data' folder")
print("   3. Upload these files:")
print("      - ai-medical-chatbot.csv")
print("      - symptoms.csv")
print("\nüí° Or use the code below to upload via dialog:")
print("\nfrom google.colab import files")
print("uploaded = files.upload()")
print("# Then move files: !mv *.csv data/")

üìÅ Directories created!

‚ö†Ô∏è IMPORTANT: Upload your datasets now!
   1. Click the folder icon on the left sidebar
   2. Navigate to the 'data' folder
   3. Upload these files:
      - ai-medical-chatbot.csv
      - symptoms.csv

üí° Or use the code below to upload via dialog:

from google.colab import files
uploaded = files.upload()
# Then move files: !mv *.csv data/


In [4]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"‚úÖ GPU Detected: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("‚ö†Ô∏è No GPU detected. Training will be MUCH slower.")
    print("   Go to Runtime > Change runtime type > Select GPU")

‚úÖ GPU Detected: Tesla T4
   Memory: 15.83 GB


In [5]:
import pandas as pd
import numpy as np
import torch
import re
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import json
import os
from collections import defaultdict

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
else:
    print("‚ö†Ô∏è Running on CPU - training will be slower but still works!")

PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4


---
## 2. Load Data <a id='load-data'></a>

In [7]:
# Load symptoms list
print("Loading symptoms...")
symptoms_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/symptoms.csv')
SYMPTOMS = symptoms_df['symptoms'].str.lower().str.strip().tolist()
print(f"‚úì Loaded {len(SYMPTOMS)} symptoms")
print(f"\nFirst 10 symptoms:")
for i, symptom in enumerate(SYMPTOMS[:10], 1):
    print(f"  {i}. {symptom}")

Loading symptoms...
‚úì Loaded 342 symptoms

First 10 symptoms:
  1. abdominal distention
  2. abnormal appearing skin
  3. abnormal appearing tongue
  4. abnormal breathing sounds
  5. abnormal involuntary movements
  6. abnormal movement of eyelid
  7. abnormal size or shape of ear
  8. absence of menstruation
  9. ache all over
  10. acne or pimples


In [10]:
# Load medical conversations
print("\nLoading medical conversations...")
print("‚ö†Ô∏è This may take a minute for 100k rows...")

# Load full CSV, then restrict to 100,000 rows
conversations_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/ai-medical-chatbot.csv')
if len(conversations_df) > 100_000:
    conversations_df = conversations_df.iloc[:100_000].copy()
print(f"‚úì Loaded {len(conversations_df):,} conversations")
print(f"\nDataset columns: {list(conversations_df.columns)}")
print(f"\nFirst conversation example:")
print(f"Description: {conversations_df.iloc[0]['Description'][:100]}...")
print(f"Patient: {conversations_df.iloc[0]['Patient'][:150]}...")
print(f"Doctor: {conversations_df.iloc[0]['Doctor'][:150]}...")


Loading medical conversations...
‚ö†Ô∏è This may take a minute for 100k rows...
‚úì Loaded 100,000 conversations

Dataset columns: ['Description', 'Patient', 'Doctor']

First conversation example:
Description: Q. What does abutment of the nerve root mean?...
Patient: Hi doctor,I am just wondering what is abutting and abutment of the nerve root means in a back issue. Please explain. What treatment is required for¬†an...
Doctor: Hi. I have gone through your query with diligence and would like you to know that I am here to help you. For further information consult a neurologist...


---
## 3. Prepare Training Data <a id='prepare-training'></a>

We'll create positive training pairs where:
- **Text 1:** Patient's message
- **Text 2:** Symptom name
- **Label:** 1.0 (if symptom mentioned), 0.0 (if not mentioned)

This teaches the model to recognize when patient language matches a symptom.

In [11]:
def clean_text(text):
    """Clean and normalize text"""
    if pd.isna(text):
        return ""
    text = str(text).lower()
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

def extract_symptom_mentions(patient_text, doctor_text, symptoms_list):
    """
    Find which symptoms are mentioned in the conversation
    Returns list of mentioned symptoms
    """
    patient_text = clean_text(patient_text)
    doctor_text = clean_text(doctor_text)
    combined_text = patient_text + " " + doctor_text

    mentioned_symptoms = []

    for symptom in symptoms_list:
        symptom_clean = symptom.lower().strip()

        # Check exact match
        if symptom_clean in combined_text:
            mentioned_symptoms.append(symptom)
            continue

        # Check word-by-word match (for multi-word symptoms)
        symptom_words = symptom_clean.split()
        if len(symptom_words) > 1:
            # Check if all words appear in text (allows for different word order)
            if all(word in combined_text for word in symptom_words):
                mentioned_symptoms.append(symptom)

    return mentioned_symptoms

# Test the function
test_patient = "I have a terrible headache and feel nauseous"
test_doctor = "You may have migraine. The nausea is common with headaches."
test_mentions = extract_symptom_mentions(test_patient, test_doctor, SYMPTOMS)
print(f"Test extraction: {test_mentions}")

Test extraction: ['headache', 'nausea']


In [12]:
def create_training_examples(conversations_df, symptoms_list, max_examples=50000, sample_negatives=True):
    """
    Create training examples from conversations

    Args:
        conversations_df: DataFrame with Patient, Doctor columns
        symptoms_list: List of canonical symptoms
        max_examples: Maximum number of positive examples to create
        sample_negatives: Whether to include negative examples (no symptom match)

    Returns:
        List of InputExample objects
    """
    training_examples = []
    positive_count = 0
    negative_count = 0

    print(f"\nCreating training examples from {len(conversations_df):,} conversations...")
    print(f"This will take 5-10 minutes...\n")

    for idx, row in tqdm(conversations_df.iterrows(), total=len(conversations_df)):
        if positive_count >= max_examples:
            break

        patient_text = clean_text(row['Patient'])
        doctor_text = clean_text(row['Doctor'])

        # Skip very short messages
        if len(patient_text.split()) < 5:
            continue

        # Find mentioned symptoms
        mentioned_symptoms = extract_symptom_mentions(patient_text, doctor_text, symptoms_list)

        if mentioned_symptoms:
            # Create positive examples
            for symptom in mentioned_symptoms:
                training_examples.append(
                    InputExample(texts=[patient_text, symptom], label=1.0)
                )
                positive_count += 1

            # Create some negative examples (symptoms NOT mentioned)
            if sample_negatives and len(mentioned_symptoms) < len(symptoms_list):
                # Sample 1-2 negative symptoms per positive
                unmentioned = [s for s in symptoms_list if s not in mentioned_symptoms]
                num_negatives = min(2, len(unmentioned))
                negative_samples = np.random.choice(unmentioned, size=num_negatives, replace=False)

                for neg_symptom in negative_samples:
                    training_examples.append(
                        InputExample(texts=[patient_text, neg_symptom], label=0.0)
                    )
                    negative_count += 1

    print(f"\n‚úì Created {len(training_examples):,} training examples")
    print(f"  - Positive examples: {positive_count:,}")
    print(f"  - Negative examples: {negative_count:,}")
    print(f"  - Ratio: {positive_count/negative_count:.2f}:1 (positive:negative)")

    return training_examples

In [15]:
# Create training examples
# Adjust max_examples based on your compute resources:
# - CPU: 10,000-20,000 examples
# - GPU: 50,000+ examples

MAX_EXAMPLES = 50000  # Adjust this based on your hardware

train_examples = create_training_examples(
    conversations_df,
    SYMPTOMS,
    max_examples=MAX_EXAMPLES,
    sample_negatives=True
)


Creating training examples from 100,000 conversations...
This will take 5-10 minutes...



 31%|‚ñà‚ñà‚ñà       | 30844/100000 [00:39<01:29, 776.45it/s]


‚úì Created 86,358 training examples
  - Positive examples: 50,000
  - Negative examples: 36,358
  - Ratio: 1.38:1 (positive:negative)





In [16]:
# Split into train and validation
train_data, val_data = train_test_split(train_examples, test_size=0.1, random_state=42)

print(f"\nDataset split:")
print(f"  Training: {len(train_data):,} examples")
print(f"  Validation: {len(val_data):,} examples")

# Show some examples
print(f"\nüìù Sample training examples:")
for i in range(min(3, len(train_data))):
    example = train_data[i]
    print(f"\nExample {i+1}:")
    print(f"  Patient text: {example.texts[0][:100]}...")
    print(f"  Symptom: {example.texts[1]}")
    print(f"  Label: {'MATCH ‚úì' if example.label == 1.0 else 'NO MATCH ‚úó'}")


Dataset split:
  Training: 77,722 examples
  Validation: 8,636 examples

üìù Sample training examples:

Example 1:
  Patient text: hi doctor, i am a 24 year old female. i got married a year ago. i am an audiologist. i am having pai...
  Symptom: knee pain
  Label: NO MATCH ‚úó

Example 2:
  Patient text: hi doctor, my mother is 58 years old. she has high bp problem since last six years. she is taking am...
  Symptom: white discharge from eye
  Label: NO MATCH ‚úó

Example 3:
  Patient text: hello doctor, i am a 25-year-old male and i have never been able to fully retract my penis. i have s...
  Symptom: penis pain
  Label: MATCH ‚úì


---
## 4. Fine-Tune Model <a id='fine-tune'></a>

We'll fine-tune the `all-MiniLM-L6-v2` model using cosine similarity loss.

In [20]:
from sentence_transformers import SentenceTransformer, models
from transformers import AutoConfig, AutoModel

print("Loading base model...")

base_model_name = 'sentence-transformers/all-MiniLM-L6-v2'

# 1Ô∏è‚É£ Load and modify config
config = AutoConfig.from_pretrained(base_model_name)
config.attention_probs_dropout_prob = 0.1
config.hidden_dropout_prob = 0.1

# 2Ô∏è‚É£ Load transformer backbone manually
hf_model = AutoModel.from_pretrained(base_model_name, config=config)

# 3Ô∏è‚É£ Wrap it with SentenceTransformers Transformer module
transformer = models.Transformer(base_model_name)
transformer.auto_model = hf_model  # Inject our custom model with dropout changes

# 4Ô∏è‚É£ Add pooling layer
pooling = models.Pooling(transformer.get_word_embedding_dimension())

# 5Ô∏è‚É£ Combine into SentenceTransformer
model = SentenceTransformer(modules=[transformer, pooling])

print(f"‚úì Loaded {base_model_name} with custom dropout")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")


Loading base model...


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

‚úì Loaded sentence-transformers/all-MiniLM-L6-v2 with custom dropout
Model embedding dimension: 384


In [22]:
# Verify Dropout
print(model[0].auto_model.config.attention_probs_dropout_prob)
print(model[0].auto_model.config.hidden_dropout_prob)

0.1
0.1


In [23]:
# Create dataloaders
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
val_dataloader = DataLoader(val_data, shuffle=False, batch_size=16)

# Define loss function
train_loss = losses.CosineSimilarityLoss(model)

print(f"Training setup:")
print(f"  Batch size: 16")
print(f"  Training batches: {len(train_dataloader)}")
print(f"  Validation batches: {len(val_dataloader)}")
print(f"  Loss function: CosineSimilarityLoss")

Training setup:
  Batch size: 16
  Training batches: 4858
  Validation batches: 540
  Loss function: CosineSimilarityLoss


In [None]:
from sentence_transformers import SentenceTransformer, losses
from torch.utils.data import DataLoader
import numpy as np
import torch
from sklearn.metrics import precision_recall_fscore_support

NUM_EPOCHS = 5
PATIENCE = 3
WARMUP_STEPS = len(train_dataloader) * NUM_EPOCHS * 0.1
best_macro_f1 = 0
patience_counter = 0

output_path = '/content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC'
model = SentenceTransformer(base_model_name)
train_loss = losses.CosineSimilarityLoss(model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
symptom_embeddings = model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)

for epoch in range(NUM_EPOCHS):
    # Train one epoch
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=NUM_EPOCHS,
        warmup_steps=WARMUP_STEPS,
        show_progress_bar=True
    )
    # Eval on val set for macro/micro F1
    y_true, y_pred = [], []
    for ex in val_data:
        text, target_symptom, label = ex.texts[0], ex.texts[1], ex.label
        text_emb = model.encode(text, convert_to_tensor=True)
        idx = SYMPTOMS.index(target_symptom)
        sym_emb = symptom_embeddings[idx]
        sim = float(torch.nn.functional.cosine_similarity(text_emb, sym_emb, dim=0))
        pred = 1.0 if sim > 0.5 else 0.0
        y_true.append(label)
        y_pred.append(pred)
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")

    print(f"Epoch {epoch+1} ‚Äî Macro F1: {macro_f1:.4f}, Micro F1: {micro_f1:.4f}")

    # Early stopping & checkpoint
    if macro_f1 > best_macro_f1:
        best_macro_f1 = macro_f1
        patience_counter = 0
        model.save(output_path + "sentence_transformer_best")
        print("‚úì New best model saved!")
    else:
        patience_counter += 1
        if patience_counter >= PATIENCE:
            print("Early stopping triggered!")
            break

# Optional: Apply MacroMicroF1Evaluator post-training for more analysis/reporting
class MacroMicroF1Evaluator:
    def __init__(self, val_examples, symptom_embeddings, symptoms_list):
        self.val_examples = val_examples
        self.symptom_embeddings = symptom_embeddings
        self.symptoms_list = symptoms_list

    def __call__(self, model):
        y_true, y_pred = [], []
        for ex in self.val_examples:
            text, target_symptom, label = ex.texts[0], ex.texts[1], ex.label
            text_emb = model.encode(text, convert_to_tensor=True)
            idx = self.symptoms_list.index(target_symptom)
            sym_emb = self.symptom_embeddings[idx]
            sim = float(torch.nn.functional.cosine_similarity(text_emb, sym_emb, dim=0))
            pred = 1.0 if sim > 0.5 else 0.0
            y_true.append(label)
            y_pred.append(pred)
        macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
        micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
        print(f"\nMacro F1: {macro_f1:.4f}, Micro F1: {micro_f1:.4f}")
        return macro_f1

evaluator = MacroMicroF1Evaluator(val_data, symptom_embeddings, SYMPTOMS)
evaluator(model)

Batches:   0%|          | 0/11 [00:00<?, ?it/s]

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Step,Training Loss
500,0.2081
1000,0.1121
1500,0.0928
2000,0.0885
2500,0.0811
3000,0.0797
3500,0.079
4000,0.0762
4500,0.0756
5000,0.0751


---
## 5. Evaluate Performance <a id='evaluate'></a>

Let's test the fine-tuned model against the base model.

In [None]:
from sklearn.metrics import precision_recall_fscore_support
# If you need entity-level metrics for span extraction, use seqeval or custom logic

def enhanced_evaluate_model(model, symptom_embeddings, val_examples, threshold=0.5):
    """
    Evaluate macro, micro, token, entity metrics for a model
    """
    y_true = []
    y_pred = []
    losses = []
    criterion = torch.nn.CosineSimilarity(dim=0)

    for example in tqdm(val_examples, desc="Evaluating"):
        text = example.texts[0]
        target_symptom = example.texts[1]
        true_label = example.label

        # Encode text and symptom
        text_embedding = model.encode(text, convert_to_tensor=True)
        symptom_idx = SYMPTOMS.index(target_symptom)
        symptom_embedding = symptom_embeddings[symptom_idx]
        similarity = float(torch.nn.functional.cosine_similarity(text_embedding, symptom_embedding, dim=0))
        predicted_label = 1.0 if similarity > threshold else 0.0

        # Loss (squared error for binary classification)
        single_loss = (similarity - true_label) ** 2
        losses.append(single_loss.item())

        y_true.append(true_label)
        y_pred.append(predicted_label)

    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='micro')
    avg_loss = np.mean(losses)

    # Token-level metrics (if available)
    # Placeholder, depends on your token span annotation scheme
    # entity-level metrics (if available)
    # Placeholder, depends on your span/entity extraction logic

    metrics = {
        'macro_precision': macro_p,
        'macro_recall': macro_r,
        'macro_f1': macro_f1,
        'micro_precision': micro_p,
        'micro_recall': micro_r,
        'micro_f1': micro_f1,
        'avg_loss': avg_loss,
        # 'token_metrics': token_metrics,
        # 'entity_metrics': entity_metrics
    }
    return metrics

# Load both models for comparison
print("Loading models for comparison...")
base_model = SentenceTransformer(base_model_name)
finetuned_model = SentenceTransformer(output_path)  # Use path to best model checkpoint!
print("‚úì Models loaded")

# Pre-compute symptom embeddings
print("\nComputing symptom embeddings...")
base_symptom_embeddings = base_model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)
finetuned_symptom_embeddings = finetuned_model.encode(SYMPTOMS, convert_to_tensor=True, show_progress_bar=True)
print("‚úì Embeddings computed")

# Enhanced evaluation
print("\nEvaluating BASE model...")
base_metrics = enhanced_evaluate_model(base_model, base_symptom_embeddings, val_data)
print("Base model:", base_metrics)

print("\nEvaluating FINE-TUNED model...")
ft_metrics = enhanced_evaluate_model(finetuned_model, finetuned_symptom_embeddings, val_data)
print("Fine-tuned model:", ft_metrics)

# Write comparison report
with open('content/drive/MyDrive/Colab Notebooks/CS425 GenAI for NLC/models/comparison_report.txt', 'w') as f:
    f.write("[BASE MODEL METRICS]\n")
    for k, v in base_metrics.items():
        f.write(f"{k}: {v:.4f}\n")
    f.write("\n[FINE-TUNED MODEL METRICS]\n")
    for k, v in ft_metrics.items():
        f.write(f"{k}: {v:.4f}\n")

Loading models for comparison...
‚úì Models loaded

Computing symptom embeddings...


Batches:   0%|          | 0/12 [00:00<?, ?it/s]

Batches:   0%|          | 0/12 [00:00<?, ?it/s]

‚úì Embeddings computed


In [None]:
from sentence_transformers import util

def extract_symptoms_comparison(text, model, symptom_embeddings, symptoms_list, threshold=0.5, top_k=5):
    """
    Extract symptoms using a given model
    """
    text_embedding = model.encode(text, convert_to_tensor=True)
    similarities = util.cos_sim(text_embedding, symptom_embeddings)[0]

    matches = []
    for idx, score in enumerate(similarities):
        if score > threshold:
            matches.append({
                'symptom': symptoms_list[idx],
                'score': float(score)
            })
    matches = sorted(matches, key=lambda x: x['score'], reverse=True)
    top_matches = matches[:top_k]
    return top_matches, matches  # Return both filtered and full match list for improved comparison

# Test cases
test_cases = [
    "I've been sneezing all day",
    "My head hurts really bad",
    "I am coughing a lot",
    "My stomach area has been itching like crazy",
    "I can't stop sneezing and my nose is blocked",
    "Having terrible pounding in my temples and feel nauseous",
    "My back is really painful",
    "I feel dizzy and want to throw up"
]

print("\n" + "="*80)
print("COMPARISON: BASE MODEL vs FINE-TUNED MODEL")
print("="*80)

threshold = 0.45
top_k = 5

for test_text in test_cases:
    print(f"\nüìù Input: \"{test_text}\"")
    print("-" * 80)

    # Base model results
    base_top, base_full = extract_symptoms_comparison(
        test_text, base_model, base_symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )
    print("‚ùå BASE MODEL:")
    if base_top:
        for r in base_top:
            print(f"   ‚Ä¢ {r['symptom']} (confidence: {r['score']:.3f})")
    else:
        print("   ‚Ä¢ No symptoms detected")

    # Fine-tuned model results
    ft_top, ft_full = extract_symptoms_comparison(
        test_text, finetuned_model, finetuned_symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )
    print("\n‚úÖ FINE-TUNED MODEL:")
    if ft_top:
        for r in ft_top:
            print(f"   ‚Ä¢ {r['symptom']} (confidence: {r['score']:.3f})")
    else:
        print("   ‚Ä¢ No symptoms detected")

    # Compare number and confidence of symptoms detected
    detected_improvement = len(ft_top) - len(base_top)
    if detected_improvement > 0:
        print(f"\nüí° Improvement: +{detected_improvement} more symptoms detected")
    elif detected_improvement < 0:
        print(f"\n‚ö†Ô∏è Note: {abs(detected_improvement)} fewer symptoms (may be more precise)")
    else:
        print(f"\n‚è∫Ô∏è No change in number of detected symptoms")

    # Compare score distribution for deeper insight if needed
    avg_base_conf = np.mean([x['score'] for x in base_top]) if base_top else 0
    avg_ft_conf = np.mean([x['score'] for x in ft_top]) if ft_top else 0
    print(f"‚ÜîÔ∏è Avg confidence: Base={avg_base_conf:.3f}, Fine-tuned={avg_ft_conf:.3f}")

    print()



COMPARISON: BASE MODEL vs FINE-TUNED MODEL

üìù Input: "I've been sneezing all day"
--------------------------------------------------------------------------------
‚ùå BASE MODEL:
   ‚Ä¢ sneezing (confidence: 0.763)

‚úÖ FINE-TUNED MODEL:
   ‚Ä¢ sneezing (confidence: 0.981)
   ‚Ä¢ sore in nose (confidence: 0.841)
   ‚Ä¢ nasal congestion (confidence: 0.812)
   ‚Ä¢ sore throat (confidence: 0.776)
   ‚Ä¢ cough (confidence: 0.765)

üí° Improvement: +4 more symptoms detected


üìù Input: "My head hurts really bad"
--------------------------------------------------------------------------------
‚ùå BASE MODEL:
   ‚Ä¢ headache (confidence: 0.569)
   ‚Ä¢ hurts to breath (confidence: 0.483)
   ‚Ä¢ ache all over (confidence: 0.472)
   ‚Ä¢ frontal headache (confidence: 0.468)
   ‚Ä¢ ear pain (confidence: 0.465)

‚úÖ FINE-TUNED MODEL:
   ‚Ä¢ symptoms of the face (confidence: 0.890)
   ‚Ä¢ pain in eye (confidence: 0.882)
   ‚Ä¢ redness in ear (confidence: 0.863)
   ‚Ä¢ neck pain (confidence: 0

---
## 6. Save Model & Metadata <a id='save'></a>

In [None]:
# Save fine-tuned model (if not already saved during training)
finetuned_model.save(output_path)

# Prepare enhanced metadata
metadata = {
    'base_model': base_model_name,
    'training_examples': len(train_data),
    'validation_examples': len(val_data),
    'num_symptoms': len(SYMPTOMS),
    'epochs': NUM_EPOCHS,
    'macro_f1_base': float(base_metrics['macro_f1']),
    'micro_f1_base': float(base_metrics['micro_f1']),
    'macro_f1_finetuned': float(ft_metrics['macro_f1']),
    'micro_f1_finetuned': float(ft_metrics['micro_f1']),
    'macro_improvement_pct': float((ft_metrics['macro_f1'] - base_metrics['macro_f1']) / base_metrics['macro_f1'] * 100),
    'micro_improvement_pct': float((ft_metrics['micro_f1'] - base_metrics['micro_f1']) / base_metrics['micro_f1'] * 100),
    'avg_conf_base': float(avg_base_conf),
    'avg_conf_finetuned': float(avg_ft_conf),
    'output_path': output_path
}

# Save metadata
import json, os
with open(f'{output_path}/training_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"‚úì Metadata saved to {output_path}/training_metadata.json")
print(f"\nüì¶ Model package contents:")
print(f"   {output_path}/")
for item in os.listdir(output_path):
    print(f"   ‚îú‚îÄ‚îÄ {item}")


‚úì Metadata saved to models/medical_symptom_matcher/training_metadata.json

üì¶ Model package contents:
   models/medical_symptom_matcher/
   ‚îú‚îÄ‚îÄ 1_Pooling
   ‚îú‚îÄ‚îÄ training_metadata.json
   ‚îú‚îÄ‚îÄ special_tokens_map.json
   ‚îú‚îÄ‚îÄ model.safetensors
   ‚îú‚îÄ‚îÄ config.json
   ‚îú‚îÄ‚îÄ README.md
   ‚îú‚îÄ‚îÄ 2_Normalize
   ‚îú‚îÄ‚îÄ modules.json
   ‚îú‚îÄ‚îÄ sentence_bert_config.json
   ‚îú‚îÄ‚îÄ config_sentence_transformers.json
   ‚îú‚îÄ‚îÄ tokenizer.json
   ‚îú‚îÄ‚îÄ vocab.txt
   ‚îú‚îÄ‚îÄ tokenizer_config.json


---
## 7. Quick Test Interface <a id='test'></a>

Interactive testing of your fine-tuned model.

In [None]:
def test_symptom_extraction(
    text, threshold=0.45, top_k=5,
    model_choice='finetuned'
):
    """
    Test the chosen model ('base' or 'finetuned') on custom input
    """
    print(f"\n{'='*60}")
    print(f"Input: \"{text}\"")
    print(f"Model: {'BASE' if model_choice=='base' else 'FINE-TUNED'}")
    print(f"Threshold: {threshold:.2f}  |  Top-K: {top_k}")
    print(f"{'='*60}")

    # Choose model and embeddings
    if model_choice == 'base':
        model = base_model
        symptom_embeddings = base_symptom_embeddings
    else:
        model = finetuned_model
        symptom_embeddings = finetuned_symptom_embeddings

    top_results, all_results = extract_symptoms_comparison(
        text, model, symptom_embeddings, SYMPTOMS, threshold=threshold, top_k=top_k
    )

    if top_results:
        print(f"\n‚úÖ Detected {len(top_results)} symptom(s):\n")
        for i, r in enumerate(top_results, 1):
            confidence_bar = '‚ñà' * int(r['score'] * 20)
            print(f"  {i}. {r['symptom']}")
            print(f"     Confidence: {r['score']:.3f} {confidence_bar}")
    else:
        print("\n‚ùå No symptoms detected above threshold")
        print(f"   Try lowering threshold (currently {threshold})")

    # Show full sorted scores for transparency
    print("\nAll Candidate Scores (top 10):")
    for r in all_results[:10]:
        print(f"  ‚Ä¢ {r['symptom']}: {r['score']:.3f}")

    return top_results

# Example usage / tests:
test_symptom_extraction("I've been feeling dizzy and have a terrible headache", model_choice='finetuned')
test_symptom_extraction("My throat is sore and I can't stop coughing", model_choice='base')
test_symptom_extraction("I have chest pain and shortness of breath", model_choice='finetuned')



Input: "I've been feeling dizzy and have a terrible headache"

‚úÖ Detected 5 symptom(s):

  1. headache
     Confidence: 0.960 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  2. dizziness
     Confidence: 0.952 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  3. feeling ill
     Confidence: 0.947 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  4. fatigue
     Confidence: 0.933 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  5. ache all over
     Confidence: 0.922 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà

Input: "My throat is sore and I can't stop coughing"

‚úÖ Detected 5 symptom(s):

  1. sore throat
     Confidence: 0.959 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  2. throat irritation
     Confidence: 0.915 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  3. cough
     Confidence: 0.874 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà
  4. lump in throat
     Confidence: 0.865 ‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñ

[{'symptom': 'shortness of breath', 'score': 0.9336739778518677},
 {'symptom': 'burning chest pain', 'score': 0.9136399030685425},
 {'symptom': 'difficulty breathing', 'score': 0.8817726969718933},
 {'symptom': 'chest tightness', 'score': 0.8593177795410156},
 {'symptom': 'sharp chest pain', 'score': 0.8490051031112671}]

In [None]:
# Interactive testing (optional - uncomment to use)
# while True:
#     user_input = input("\nDescribe your symptoms (or 'quit' to exit): ")
#     if user_input.lower() in ['quit', 'exit', 'q']:
#         break
#     test_symptom_extraction(user_input)

---
## ‚úÖ Training Complete!

### Next Steps:

1. **Your fine-tuned model is saved at:** `models/medical_symptom_matcher/`

2. **To use it in your chatbot notebook:**
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('models/medical_symptom_matcher')
```

3. **Integration code is ready** - just copy the `extract_symptoms_comparison()` function to your main chatbot!

### Performance Summary:
- ‚úÖ Base model accuracy: {base_accuracy:.1%}
- ‚úÖ Fine-tuned accuracy: {finetuned_accuracy:.1%}
- ‚úÖ Improvement: {improvement_pct:+.1f}%

### Tips:
- Adjust `threshold` parameter to control sensitivity (lower = more symptoms detected)
- If you need better performance, train with more examples or more epochs
- The model works best with complete sentences (not just single words)

In [None]:
import shutil
import os

folder_to_download = 'models'
output_archive_name = folder_to_download + '.zip'

# Create the zip archive from folder
shutil.make_archive(folder_to_download, 'zip', folder_to_download)
print(f"Created archive: {output_archive_name}")

# For Google Colab: trigger download
try:
    from google.colab import files
    files.download(output_archive_name)
except ImportError:
    print("Download not supported, you can manually download the zip file.")

'/content/models.zip'