<a href="https://colab.research.google.com/github/Ashiq11/Fine-Tuning-a-Large-Language-Model-for-Medical-Chat-Summarization/blob/master/Fine_Tuning_FLAN_T5_for_Medical_Chat_(SOAP)_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1.   Environment Setup

In [1]:
# Install required libraries
!pip install -q transformers datasets evaluate rouge-score sacrebleu accelerate


  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [3]:
# Import libraries
import logging
import torch
import pandas as pd
import numpy as np
import gc
logging.getLogger("torchao").setLevel(logging.ERROR)

from datasets import Dataset, DatasetDict
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq
)

import evaluate



In [4]:
import transformers
print(transformers.__version__)


4.57.3


## Hardware Detection

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))


Using device: cpu


# 2.  Load Dataset from Google Drive

**Mount Drive**

In [6]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Set Dataset Path**

In [7]:
#DATA_PATH = "https://drive.google.com/drive/folders/1PVqVdxhE25kv2fkxS4Uy_8FWxf1aGR2P"
output_path = '/content/dataset/SOAP_Assessment_Data'
!unzip "/content/drive/MyDrive/NewDataset/Medical_Chat_Summarization/SOAP_Assessment_Data.zip" -d "/content/dataset/"
TRAIN_FILE = output_path + "/medical_dialogue_train.csv"
VAL_FILE   = output_path + "/medical_dialogue_validation.xlsx"
TEST_FILE  = output_path + "/medical_dialogue_test.xlsx"


Archive:  /content/drive/MyDrive/NewDataset/Medical_Chat_Summarization/SOAP_Assessment_Data.zip
  inflating: /content/dataset/SOAP_Assessment_Data/train_llama_with_embeddings.pkl  
  inflating: /content/dataset/SOAP_Assessment_Data/medical_dialogue_train.csv  
  inflating: /content/dataset/SOAP_Assessment_Data/medical_dialogue_validation.xlsx  
  inflating: /content/dataset/SOAP_Assessment_Data/medical_dialogue_test.xlsx  


**Load CSV Files**

In [8]:
train_df = pd.read_csv(TRAIN_FILE)
val_df   = pd.read_excel(VAL_FILE)
test_df  = pd.read_excel(TEST_FILE)

train_df.head()


Unnamed: 0,dialogue,soap
0,"Doctor: Hello, how can I help you today?\nPati...",S: The patient's mother reports that her 13-ye...
1,"Doctor: Hello, what brings you in today?\nPati...","S: The patient, a 21-month-old male, presented..."
2,"Doctor: Hello, how can I help you today?\nPati...","S: Patient reports experiencing fatigue, night..."
3,"Doctor: Hello, Patient D. How are you feeling ...","S: Patient D, a 60-year-old African American m..."
4,"Doctor: Hello, I see that you have a history o...","S: The patient, a married woman with a 7-year ..."


# 3. Dataset Cleaning & Formatting

In [9]:
# Drop missing values
train_df = train_df.dropna()
val_df   = val_df.dropna()
test_df  = test_df.dropna()

print("Train size:", len(train_df))
print("Validation size:", len(val_df))
print("Test size:", len(test_df))


Train size: 9250
Validation size: 500
Test size: 250


**Convert to HuggingFace Dataset**

In [10]:
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df)
})

dataset


DatasetDict({
    train: Dataset({
        features: ['dialogue', 'soap'],
        num_rows: 9250
    })
    validation: Dataset({
        features: ['dialogue', 'soap'],
        num_rows: 500
    })
    test: Dataset({
        features: ['dialogue', 'soap'],
        num_rows: 250
    })
})

# 4. Model & Tokenizer Initialization

In [11]:
MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

# 5. Preprocessing & Tokenization

**Task Instruction**

In [12]:
PREFIX = "Summarize the following medical dialogue into a SOAP note:\n\n"


**Tokenization Function**

In [13]:
def preprocess_function(examples):
    inputs = [PREFIX + d for d in examples["dialogue"]]
    targets = examples["soap"]

    model_inputs = tokenizer(
        inputs,
        max_length=384,
        truncation=True,
        padding="max_length"
    )

    labels = tokenizer(
        targets,
        max_length=192,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [14]:
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)


Map:   0%|          | 0/9250 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/250 [00:00<?, ? examples/s]

## Data Collator (Fixes padding + Trainer warnings)

In [15]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model
)

# 6. Baseline (Zero-Shot) Evaluation

In [16]:
def generate_summary(text):
    input_text = PREFIX + text
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=192,
            num_beams=4
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [17]:
# Baseline example
sample_dialogue = test_df.iloc[0]["dialogue"]
print("Dialogue:\n", sample_dialogue)
print("\nBaseline Summary:\n", generate_summary(sample_dialogue))


Dialogue:
 Doctor: Hello, can you please tell me about your past medical history?
Patient: Hi, I don't have any past medical history.
Doctor: Okay. What brings you in today?
Patient: I've been experiencing painless blurry vision in my right eye for a week now. I've also had intermittent fevers, headache, body aches, and a nonpruritic maculopapular rash on my lower legs for the past 6 months.
Doctor: Thank you for sharing that. Have you had any other symptoms such as neck stiffness, nausea, vomiting, Raynaud's phenomenon, oral ulcerations, chest pain, shortness of breath, abdominal pain, or photosensitivity?
Patient: No, only an isolated episode of left knee swelling and testicular swelling in the past.
Doctor: Do you work with any toxic substances or have any habits like smoking, drinking, or illicit drug use?
Patient: No, I work as a flooring installer and I don't have any toxic habits.
Doctor: Alright. We checked your vital signs and they were normal. During the physical exam, we fou

# 7. Fine-Tuning Configuration

In [18]:
# Use only a small subset (e.g., 5–10%)
train_small = tokenized_datasets["train"].shuffle(seed=42).select(range(500))
val_small   = tokenized_datasets["validation"].shuffle(seed=42).select(range(100))


In [19]:
training_args = TrainingArguments(
    output_dir="./soap_model",
    per_device_train_batch_size=1,     #  smallest possible
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,     # keeps effective batch
    num_train_epochs=1,                #  at least 1 epoch
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=False,
    eval_strategy="no",
    logging_steps=20,
    save_strategy="no",
    report_to="none"
)


In [None]:
training_args = TrainingArguments(
    output_dir="./soap_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=torch.cuda.is_available(),
    save_total_limit=1,
    logging_steps=50,
    report_to="none"
)


# 8. Trainer Setup & Training

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_small,
    eval_dataset=val_small,
    processing_class=data_collator
)

trainer.train()

Step,Training Loss
20,1.7939
40,1.6418


Step,Training Loss
20,1.7939
40,1.6418
60,1.5462


TrainOutput(global_step=63, training_loss=1.6663453692481631, metrics={'train_runtime': 4451.1428, 'train_samples_per_second': 0.112, 'train_steps_per_second': 0.014, 'total_flos': 256784007168000.0, 'train_loss': 1.6663453692481631, 'epoch': 1.0})

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=data_collator
)



**Clear Memory Before Training (CRITICAL)**

In [None]:
gc.collect()
torch.cuda.empty_cache()


In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

# 9. Model Evaluation on Test Set

**Generate Predictions**

In [23]:
def generate_predictions(dataset):
    preds = []
    refs = []

    for example in dataset:
        summary = generate_summary(example["dialogue"])
        preds.append(summary)
        refs.append(example["soap"])

    return preds, refs


In [24]:
predictions, references = generate_predictions(test_df.to_dict("records"))


# 10. Automatic Metrics (ROUGE & BLEU)

In [25]:
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

In [26]:
rouge_results = rouge.compute(
    predictions=predictions,
    references=references
)

bleu_results = bleu.compute(
    predictions=predictions,
    references=[[r] for r in references]
)

rouge_results, bleu_results


({'rouge1': np.float64(0.38095084988545064),
  'rouge2': np.float64(0.20421386424320678),
  'rougeL': np.float64(0.2726184989562083),
  'rougeLsum': np.float64(0.33372743609609434)},
 {'bleu': 0.07913422787659156,
  'precisions': [0.6499495229100857,
   0.35538628229963554,
   0.23256198894296023,
   0.16506993357145242],
  'brevity_penalty': 0.2578799750207293,
  'length_ratio': 0.4245813918117335,
  'translation_length': 30707,
  'reference_length': 72323})

# 11. Qualitative Analysis

In [27]:
for i in range(3):
    print(f"\nExample {i+1}")
    print("Dialogue:\n", test_df.iloc[i]["dialogue"])
    print("\nReference SOAP:\n", references[i])
    print("\nGenerated SOAP:\n", predictions[i])



Example 1
Dialogue:
 Doctor: Hello, can you please tell me about your past medical history?
Patient: Hi, I don't have any past medical history.
Doctor: Okay. What brings you in today?
Patient: I've been experiencing painless blurry vision in my right eye for a week now. I've also had intermittent fevers, headache, body aches, and a nonpruritic maculopapular rash on my lower legs for the past 6 months.
Doctor: Thank you for sharing that. Have you had any other symptoms such as neck stiffness, nausea, vomiting, Raynaud's phenomenon, oral ulcerations, chest pain, shortness of breath, abdominal pain, or photosensitivity?
Patient: No, only an isolated episode of left knee swelling and testicular swelling in the past.
Doctor: Do you work with any toxic substances or have any habits like smoking, drinking, or illicit drug use?
Patient: No, I work as a flooring installer and I don't have any toxic habits.
Doctor: Alright. We checked your vital signs and they were normal. During the physical e

# 12. Save Fine-Tuned Model

In [28]:
model.save_pretrained("/content/drive/MyDrive/soap_finetuned_model")
tokenizer.save_pretrained("/content/drive/MyDrive/soap_finetuned_model")


('/content/drive/MyDrive/soap_finetuned_model/tokenizer_config.json',
 '/content/drive/MyDrive/soap_finetuned_model/special_tokens_map.json',
 '/content/drive/MyDrive/soap_finetuned_model/spiece.model',
 '/content/drive/MyDrive/soap_finetuned_model/added_tokens.json')

# 13. Key Takeaways (For Report)

* Fine-tuning significantly improves **SOAP structure adherence**

* ROUGE & BLEU show strong lexical overlap

* Qualitative analysis reveals occasional **medical hallucination**

* Instruction tuning is crucial for clinical summarization

