# 🚀 Building a Large Language Model (LLM) for Clinical Analytics

## 📌 Introduction
In the medical field, **clinical text analytics** plays a crucial role in **automating documentation, summarizing patient reports, and assisting in medical decision-making**. Traditional models struggle with domain-specific terminology, but **Large Language Models (LLMs)** can help **generate, summarize, and analyze clinical text efficiently**.

In this project, we will **fine-tune a pre-trained LLM (GPT-2)** on **medical datasets** for real-world clinical applications.

---

## 🎯 Project Goal
We aim to develop an **LLM trained on medical text** to:
- ✅ **Summarize clinical reports** for quick insights.  
- ✅ **Generate accurate medical documentation**.  
- ✅ **Answer medical questions** based on clinical text.  

---

## 🛠️ Approach: How We'll Build It
We will break the process into **key stages**:

### 1️⃣ Data Acquisition
- Use publicly available **medical text datasets** (e.g., **PubMed, MIMIC-III, MedQA**).
- Preprocess text to remove noise, special characters, and ensure consistency.

### 2️⃣ Tokenization & Preprocessing
- Convert text into smaller subword units (**Byte-Pair Encoding (BPE)**).
- Prepare data for **efficient training**.

### 3️⃣ Model Architecture
- Use a **Transformer-based model** similar to **GPT-2**.
- Fine-tune a **pre-trained LLM** (e.g., **GPT-2, T5, LLaMA**) for **medical text generation**.

### 4️⃣ Training the Model
- Train the model using **clinical text**.
- Use **mini-batch optimization, Cross-Entropy Loss, and Adam optimizer**.

### 5️⃣ Model Evaluation
- Assess performance using:
  - **Perplexity (PPL)** → Measures fluency.
  - **BLEU Score** → Measures text similarity with real clinical reports.

### 6️⃣ Deploying & Testing
- Generate **medical text** from input prompts.
- Deploy the model as an **API for real-world usage**.

---

## 🚀 Expected Outcomes
- ✅ **A fine-tuned LLM that understands clinical text**.  
- ✅ **Can generate, summarize, and analyze patient reports**.  
- ✅ **Ready for real-world applications in clinical documentation & analytics**.  




# 🚀 Step 1: Data Acquisition  

## 📌 Data Selection  
To build a **medical text-focused LLM**, we need a **high-quality dataset** that contains **real-world clinical notes, medical research abstracts, or question-answer pairs**. Some publicly available options include:

1. **PubMedQA** 🏥 – A dataset with **medical question-answer pairs** derived from PubMed abstracts.  
2. **MIMIC-III / MIMIC-IV** 📑 – Contains **de-identified ICU clinical notes** (requires access approval).  
3. **MedQA (USMLE Questions)** 🩺 – A dataset with **clinical exam questions and explanations**.  

For this project, we will use **PubMedQA**, which is openly available and contains **high-quality medical text**.

---

## 📌 Steps
1. **Download the PubMedQA dataset** from Hugging Face.  
2. **Convert it into a structured Pandas DataFrame** for easy processing.  
3. **Explore the dataset** to understand its contents.  


In [None]:
# Install necessary libraries
# !pip install datasets transformers torch rouge_score nltk

In [None]:
# Import required libraries
from datasets import load_dataset
import pandas as pd
import torch
import math
import re
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from transformers import pipeline
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer


# Load PubMedQA dataset from Hugging Face
dataset = load_dataset("pubmed_qa", "pqa_labeled")

# Convert dataset to Pandas DataFrame
df = pd.DataFrame(dataset["train"])

# Display first few rows to explore structure
df.head()




| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 21645374 | Do mitochondria play a role in remodelling lac... | {'contexts': ['Programmed cell death (PCD) is ... | Results depicted mitochondrial dynamics in viv... | yes |
| 1 | 16418930 | Landolt C and snellen e acuity: differences in... | {'contexts': ['Assessment of visual acuity dep... | Using the charts described, there was only a s... | no |
| 2 | 9488747 | Syncope during bathing in infants, a pediatric... | {'contexts': ['Apparent life-threatening event... | "Aquagenic maladies" could be a pediatric form... | yes |
| 3 | 17208539 | Are the long-term results of the transanal pul... | {'contexts': ['The transanal endorectal pull-t... | Our long-term study showed significantly bette... | no |
| 4 | 10808977 | Can tailored interventions increase mammograph... | {'contexts': ['Telephone counseling and tailor... | The effects of the intervention were most pron... | yes |


In [None]:
# Display column names
print("Dataset Columns:", df.columns)

# Print a sample of the dataset
print("\nSample Data:\n")
print(df.sample(3))  # Show 3 random examples


Dataset Columns: Index(['pubid', 'question', 'context', 'long_answer', 'final_decision'], dtype='object')

Sample Data:

        pubid                                           question  \
482  15477551  Chronic progressive cervical myelopathy with H...   
46   19504993          It's Fournier's gangrene still dangerous?   
255  24434052  Are we seeing the effects of public awareness ...   

                                               context  \
482  {'contexts': ['To investigate the role of huma...   
46   {'contexts': ['Fournier's gangrene is known to...   
255  {'contexts': ['The last 20 years has seen a ma...   

                                           long_answer final_decision  
482  These four cases may belong to a variant form ...            yes  
46   The interval from the onset of clinical sympto...            yes  
255  The proportion of thin 0-1 mm melanomas presen...          maybe  
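
The sample above shows that `context` is a dictionary holding a list of passages under `contexts`, not a plain string. A minimal sketch of flattening such a record into one string is shown below; `flatten_context` is a hypothetical helper name and the example passages are made up for illustration.

```python
def flatten_context(context):
    """Join the passages of a PubMedQA-style context record into one string."""
    if isinstance(context, dict):
        # "contexts" is the key visible in the sample rows above
        passages = context.get("contexts", [])
        return " ".join(str(p) for p in passages)
    # Already a string: return it unchanged; anything else becomes empty text
    return context if isinstance(context, str) else ""


# Made-up record with the same shape as a PubMedQA row
example = {"contexts": ["First passage.", "Second passage."]}
print(flatten_context(example))
```

It could be applied with `df["context"].apply(flatten_context)` before any string cleaning.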


# 🚀 Step 2: Text Preprocessing & Tokenization  


Before training, we need to:  
✅ **Clean the text** → Remove unwanted characters, whitespace, and formatting issues.  
✅ **Tokenize the text** → Convert raw text into smaller subword units using a **Byte-Pair Encoding (BPE) tokenizer**.  
✅ **Prepare data for efficient training** → Ensure tokens are in the right format for the model.
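
To make the idea behind Byte-Pair Encoding concrete, here is a toy sketch of the merge loop on a three-word corpus. It works on characters, whereas GPT-2's tokenizer works on bytes and ships with a fixed, pre-learned merge table, so this is an illustration of the principle and not the tokenizer used in this notebook. The corpus, the word frequencies, and the function names are assumptions.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("cardiac"): 5, tuple("cardio"): 3, tuple("renal"): 2}

merges = []
for _ in range(4):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)
print(list(corpus))
```

After a few merges the shared stem of the two cardiology terms becomes a single subword, which is how frequent domain fragments end up with their own tokens.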


#### Clean the Clinical Text


- Converts text to lowercase.
- Removes special characters, keeping only letters, digits, whitespace, and basic punctuation (`.`, `,`, `-`).
- Collapses and strips unnecessary whitespace.

In [None]:
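import re

def clean_text(text):
    """
    Clean clinical text: lowercase it, collapse whitespace, and remove
    characters other than letters, digits, whitespace, '.', ',' and '-'.
    """
    if isinstance(text, dict):
        # PubMedQA stores its passages under "contexts" (a list of strings);
        # fall back to a "text" key for other dict-shaped records
        passages = text.get("contexts", text.get("text", ""))
        text = " ".join(passages) if isinstance(passages, (list, tuple)) else passages

    if not isinstance(text, str):  # Ensure text is a string
        return ""

    text = text.lower().strip()  # Convert to lowercase and strip whitespace
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    text = re.sub(r'[^\w\s.,-]', '', text)  # Keep word characters, whitespace, '.', ',' and '-'
    return text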

# Apply cleaning to dataset
df["context"] = df["context"].apply(clean_text)
df["long_answer"] = df["long_answer"].apply(clean_text)

print("✅ Clinical text cleaned!")

# Display cleaned text
df.head()


✅ Clinical text cleaned!


| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 21645374 | Do mitochondria play a role in remodelling lac... | | results depicted mitochondrial dynamics in viv... | yes |
| 1 | 16418930 | Landolt C and snellen e acuity: differences in... | | using the charts described, there was only a s... | no |
| 2 | 9488747 | Syncope during bathing in infants, a pediatric... | | aquagenic maladies could be a pediatric form o... | yes |
| 3 | 17208539 | Are the long-term results of the transanal pul... | | our long-term study showed significantly bette... | no |
| 4 | 10808977 | Can tailored interventions increase mammograph... | | the effects of the intervention were most pron... | yes |


#### Tokenizing Medical Text

- Uses the GPT-2 tokenizer (byte-level BPE), matching the GPT-2 model we fine-tune.
- Converts text into token IDs.
- Adds padding & truncation for uniform sequence length.

In [None]:
# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Ensure GPT-2 has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Set EOS token as padding


def tokenize_function(example):
    """
    Tokenize context and long_answer while ensuring padding consistency.
    """
    input_encoding = tokenizer(
        example["context"],
        truncation=True,
        padding="max_length",
        max_length=512,  # Ensure consistent input length
        return_tensors="pt",
    )

    label_encoding = tokenizer(
        example["long_answer"],
        truncation=True,
        padding="max_length",
        max_length=512,  # Match input length
        return_tensors="pt",
    )

    labels = label_encoding["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # Ignore padding tokens for loss computation

    return {
        "input_ids": input_encoding["input_ids"].squeeze(0),
        "labels": labels.squeeze(0),
    }

# Apply tokenization and correctly store it in a new DataFrame
tokenized_data = df.apply(tokenize_function, axis=1, result_type="expand")

# Merge tokenized data back into the original DataFrame
df = pd.concat([df, tokenized_data], axis=1)

print("✅ Tokenization completed successfully!")
print(df.head())  # Check if input_ids column is present


✅ Tokenization completed successfully!
      pubid                                           question context  \
0  21645374  Do mitochondria play a role in remodelling lac...           
1  16418930  Landolt C and snellen e acuity: differences in...           
2   9488747  Syncope during bathing in infants, a pediatric...           
3  17208539  Are the long-term results of the transanal pul...           
4  10808977  Can tailored interventions increase mammograph...           

                                         long_answer final_decision  \
0  results depicted mitochondrial dynamics in viv...            yes   
1  using the charts described, there was only a s...             no   
2  aquagenic maladies could be a pediatric form o...            yes   
3  our long-term study showed significantly bette...             no   
4  the effects of the intervention were most pron...            yes   

                                           input_ids  \
0  [tensor(50256), tensor(50256),
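
For a causal language model such as GPT-2, `labels` are compared position by position with `input_ids` (the model shifts them internally), so the two sequences usually hold the same tokens. One common arrangement, sketched here under that assumption, is to concatenate prompt and answer into a single sequence and mask the prompt and padding positions with `-100` so that only answer tokens contribute to the loss. `build_causal_example` and the token IDs are hypothetical and use plain lists to stay framework-free.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy


def build_causal_example(prompt_ids, answer_ids, max_length, pad_id):
    """Concatenate prompt and answer; compute loss on answer tokens only."""
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad all three sequences to the same fixed length
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


# Made-up token IDs: three prompt tokens followed by two answer tokens
example = build_causal_example([11, 12, 13], [21, 22], max_length=8, pad_id=0)
for key, value in example.items():
    print(key, value)
```

The attention mask matters when the padding token equals the EOS token, as it does in this notebook, because the model cannot otherwise tell padding from a real end-of-text marker.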

# 🚀 Step 3: Defining the LLM Model Architecture

## 📌 Steps
✅ **Choose a Transformer-based model** (GPT-2 or T5) for medical text generation.  
✅ **Load a pre-trained model** to fine-tune on clinical datasets.  
✅ **Prepare the model for training** (moving it to GPU if available).  


#### Load Pre-Trained GPT-2 Model


- Loads GPT-2, a Transformer-based LLM.
- Uses Hugging Face's AutoModelForCausalLM for text generation.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose a pre-trained model (GPT-2 for generative tasks)
model_name = "gpt2"  # Other causal LMs such as "facebook/opt-1.3b" also work; "t5-small" is seq2seq and needs AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(f"✅ Loaded {model_name} successfully!")


✅ Loaded gpt2 successfully!


#### Understanding the Model Architecture

- Displays the GPT-2 model architecture.
- Shows number of trainable parameters.

In [None]:
# Print model summary
print(model)
print(f"\nTotal Trainable Parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Total Trainable Parameters: 124439808
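
The reported total can be reproduced by hand from the layer shapes printed above. The sketch below adds up embeddings, the 12 transformer blocks, and the final layer norm; it assumes the output head shares its weights with the token embedding (`wte`), which is why `lm_head` adds nothing to the count. The function name is hypothetical.

```python
def gpt2_parameter_count(vocab=50257, n_ctx=1024, d_model=768, n_layer=12):
    """Count GPT-2 parameters from the layer shapes in the printed summary."""
    d_ff = 4 * d_model  # 3072 in the summary above

    # Token embeddings (wte) plus position embeddings (wpe)
    embeddings = vocab * d_model + n_ctx * d_model

    # LayerNorm has a weight and a bias vector
    layer_norm = 2 * d_model

    # c_attn projects to 3 * d_model (query, key, value); c_proj maps back
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Feed-forward: c_fc expands to d_ff, c_proj contracts to d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * block + layer_norm  # final term is ln_f


print(gpt2_parameter_count())
```

Roughly a third of the parameters sit in the embedding tables, which is typical for a small model with a large vocabulary.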


# 🚀 Step 4: Fine-Tuning GPT-2 on Medical Text

## 📌 Steps
✅ **Convert tokenized text into a format suitable for training**.  
✅ **Ensure correct PyTorch tensor structure for `input_ids` and `labels`**.  
✅ **Fine-tune GPT-2 on medical text using PyTorch & Hugging Face's `Trainer`**.  


#### Ensure Correct Tensor Format
- Reloads the GPT-2 tokenizer and sets a padding token, since GPT-2 does not define one by default.
- Keeps `input_ids` and `labels` as PyTorch tensors; `.clone().detach()` avoids tensor-conversion warnings when copying them.

In [None]:
# Load GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Ensure GPT-2 has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Use EOS token as padding
    tokenizer.pad_token_id = tokenizer.eos_token_id  # Set padding token ID

print(f"✅ Padding token set to: {tokenizer.pad_token}")

✅ Padding token set to: <|endoftext|>


#### Convert Data into PyTorch Dataset

- Formats the dataset into a PyTorch `Dataset` class.
- Uses `pad_sequence()` to ensure uniform tensor size for batch processing.

In [None]:
class MedicalTextDataset(Dataset):
    def __init__(self, df):
        self.input_ids = pad_sequence(
            df["input_ids"].tolist(),
            batch_first=True,
            padding_value=tokenizer.pad_token_id  # ✅ Ensure padding value is properly set
        )

        self.labels = pad_sequence(
            df["labels"].tolist(),
            batch_first=True,
            padding_value=-100  # ✅ Standard practice for ignoring padded tokens
        )

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "labels": self.labels[idx]
        }


In [None]:
'''
# Fix tokenization by converting lists of tensors into proper PyTorch tensors
df["input_ids"] = df["input_ids"].apply(lambda x: x[0].clone().detach())  # Avoids unnecessary tensor creation
df["labels"] = df["labels"].apply(lambda x: x[0].clone().detach())  # Ensures correct tensor format

# Verify the new format
print(df["input_ids"].head())
print(df["labels"].head())
'''


In [None]:
# Split dataset into train and evaluation sets
split_ratio = 0.1  # 90% training, 10% validation
split_index = int(len(df) * (1 - split_ratio))

train_df = df.iloc[:split_index]
eval_df = df.iloc[split_index:]

# Create dataset objects
train_dataset = MedicalTextDataset(train_df)
eval_dataset = MedicalTextDataset(eval_df)

print("✅ Training and Evaluation datasets created successfully!")

✅ Training and Evaluation datasets created successfully!
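
The split above takes the first 90% of rows for training and the last 10% for validation. If the rows happen to be ordered (by date or topic, for example), a shuffled split is a common alternative. The sketch below is one way to do that with a fixed seed; `train_eval_indices` is a hypothetical helper, and the row count of 1,000 assumes the `pqa_labeled` subset.

```python
import random


def train_eval_indices(n_rows, eval_fraction=0.1, seed=42):
    """Return shuffled, non-overlapping (train, eval) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_eval = int(n_rows * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


train_idx, eval_idx = train_eval_indices(1000)
print(len(train_idx), len(eval_idx))
```

The resulting lists could be used as `df.iloc[train_idx]` and `df.iloc[eval_idx]`.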



In [None]:
'''
# Reload dataset with corrected format
train_dataset = MedicalTextDataset(df)

print("✅ Dataset is ready for training!")
'''

#### Define Training Parameters

- Defines the learning rate, batch size, and number of epochs.
- Uses cross-entropy loss, which the model computes internally when `labels` are provided.
- Uses the AdamW optimizer (the `Trainer` default) for updating model weights.
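
To show what the loss does with the `-100` labels set during tokenization, here is a pure-Python sketch of token-level cross-entropy that skips ignored positions. It mirrors the default `ignore_index=-100` behaviour of PyTorch's cross-entropy in spirit; the function name and the toy logits are assumptions, and the real computation runs on tensors inside the model.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded or masked position: contributes nothing
        # Numerically stable log-sum-exp for the softmax normaliser
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else float("nan")


# Three positions over a two-token vocabulary; the last one is masked out
toy_logits = [[2.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
toy_labels = [0, 1, -100]
print(masked_cross_entropy(toy_logits, toy_labels))
```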

### Reduce Batch Size & Enable Gradient Checkpointing
📌 Modify TrainingArguments to lower memory usage

In [None]:
# Disable the generation cache, which is not compatible with gradient checkpointing
model.config.use_cache = False

# Define improved training arguments
training_args = TrainingArguments(
    output_dir="./gpt2_medical_improved",
    per_device_train_batch_size=4,  # Lower batch size to prevent RAM crashes
    gradient_accumulation_steps=2,  # Effective batch size of 8 per device
    num_train_epochs=2,
    save_strategy="epoch",
    evaluation_strategy="epoch",  # Named `eval_strategy` in newer transformers releases
    logging_dir="./logs",
    learning_rate=3e-5,  # Reduced learning rate for better stability
    lr_scheduler_type="cosine",  # Use cosine learning rate scheduling
    weight_decay=0.01,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),  # Mixed precision requires a CUDA GPU
    gradient_checkpointing=True,
)

# Restart training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 7.115085 |


TrainOutput(global_step=224, training_loss=7.0616030011858255, metrics={'train_runtime': 22177.4582, 'train_samples_per_second': 0.081, 'train_steps_per_second': 0.01, 'total_flos': 467190153216000.0, 'train_loss': 7.0616030011858255, 'epoch': 1.9866666666666668})
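
The step count in the training output can be checked with a little arithmetic. The sketch below assumes a single device and 900 training rows (90% of the 1,000 `pqa_labeled` examples), and counts one optimizer update per full group of accumulated batches, which is how the numbers here appear to line up. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(n_samples, batch_size, accumulation, epochs):
    """Return (batches per epoch, updates per epoch, total optimizer updates)."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    # One optimizer update per full group of accumulated batches
    updates_per_epoch = max(batches_per_epoch // accumulation, 1)
    total_updates = math.ceil(epochs * updates_per_epoch)
    return batches_per_epoch, updates_per_epoch, total_updates


batches, updates, total = optimizer_steps(900, batch_size=4, accumulation=2, epochs=2)
print("effective batch size:", 4 * 2)
print("batches per epoch:", batches)
print("updates per epoch:", updates)
print("total updates:", total)
```

The total is consistent with the reported `global_step=224`. With the default logging interval being larger than that, no training loss is logged during the run, which would explain the `No log` entry.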

In [None]:
# ✅ Save trained model before evaluation
save_directory = "./gpt2_medical_finetuned"

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

print(f"✅ Model and tokenizer saved to {save_directory}")

# Reload model for evaluation (optional, to ensure we're using the saved model)
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)

print("✅ Model reloaded for evaluation")

✅ Model and tokenizer saved to ./gpt2_medical_finetuned
✅ Model reloaded for evaluation


## Ensure Dataset is in PyTorch Format
- Converting dataset into torch format prevents unnecessary conversions, saving RAM.

# 🚀 Step 5: Generating and Evaluating Medical Text

## 📌 What We Will Do
✅ **Generate medical text using the trained model**.  
✅ **Evaluate performance using standard NLP metrics**.  

## 📊 Evaluation Metrics
🔹 **Perplexity (PPL)** → Measures fluency of generated text.  
🔹 **BLEU Score** → Measures similarity with real clinical reports.  
🔹 **ROUGE Score** → Measures how well the generated text summarizes information.  


#### Generate Medical Text

- Uses GPT-2 to generate clinical text from an input prompt.
- Sets temperature (randomness) and top-k sampling (controls diversity).
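
As a rough illustration of what these two settings do, the sketch below samples one token index from a list of logits: it keeps the `k` highest-scoring candidates, divides their logits by the temperature, and draws from the resulting softmax. This is a simplified stand-in for the library's sampling code, and `top_k_sample` and the toy logits are assumptions.

```python
import math
import random


def top_k_sample(logits, k=50, temperature=0.7, rng=None):
    """Sample a token index from the k most likely tokens after temperature scaling."""
    rng = rng or random.Random()
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 3.0, 2.0, -1.0]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(10)]
print(samples)
```

With `k=1` the procedure reduces to greedy decoding, since only the single most likely token can be drawn.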

In [None]:
# ✅ Proceed with Evaluation
from transformers import pipeline
import math
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer


# Load text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define a sample medical input prompt
prompt = "Patient diagnosed with Type 2 Diabetes, symptoms include"

# Generate text
generated_text = text_generator(prompt, max_length=100, num_return_sequences=1, do_sample=True, temperature=0.7, top_k=50)

# Print generated medical text
print("Generated Clinical Text:\n", generated_text[0]["generated_text"])


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated Clinical Text:
 Patient diagnosed with Type 2 Diabetes, symptoms include a normalization of the insulin-dependent glucose-induced insulin resistance and, in a minority, a high-grade insulin-dependent glucose-induced glucose-induced diabetes, or hypertension.

hoc.c. of the study, in patients with the same-type diabetes, significant patients with this disease had significantly higher levels of gastric or systemic insulin-dependent glucose. In patients with a low-grade or a low-grade diabetes,


In [None]:
# ✅ Generate Medical Text for Custom Input
user_prompt = input("Enter a medical prompt: ")

# Generate text based on the user input
generated_response = text_generator(user_prompt, max_length=100, num_return_sequences=1, do_sample=True, temperature=0.7, top_k=50)

# Print the generated text
print("\n🩺 Generated Clinical Text:\n", generated_response[0]["generated_text"])


Enter a medical prompt: Symptoms of Fever includes,

🩺 Generated Clinical Text:
 Symptoms of Fever includes, in many cases, a lack of air-ventricular tachycardia. however, other risk factors may be associated with this. and the incidence may be increased with a higher degree of severity. patients with a low fever and, thus, are more likely to be hospitalized. to determine the association of the low-grade and- to the high-grade patients of the two cases. patients with a high fever and a high-grade with a low-grade.


#### Evaluate Model Using Perplexity (PPL)

- Uses the Cross-Entropy loss to compute Perplexity (PPL).
- Lower PPL means better fluency.
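
Perplexity is the exponential of the average negative log-likelihood per token. The short sketch below computes it from a list of per-token log-probabilities and also converts a mean loss value directly; the helper name and the uniform example are assumptions.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to each token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among four options
uniform_example = [math.log(0.25)] * 4
print(perplexity(uniform_example))

# A mean cross-entropy loss converts the same way, here using the
# validation loss reported during training
print(math.exp(7.115085))
```

Because perplexity is averaged over whatever text is scored, a value for a single sentence and a value for a whole validation set are not directly comparable.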

In [None]:
import math
import torch.nn.functional as F

def calculate_perplexity(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss.item()
    perplexity = math.exp(loss)
    return perplexity

# Sample medical sentence
sample_text = "Patient was admitted with severe chest pain and diagnosed with myocardial infarction."

# Compute Perplexity
ppl_score = calculate_perplexity(model, tokenizer, sample_text)
print(f"✅ Perplexity Score: {ppl_score:.2f}")


✅ Perplexity Score: 16.06


#### Evaluate Model Using BLEU Score

- Compares generated text vs. real clinical reports using BLEU Score.
- Higher BLEU means better text similarity.
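
The core of BLEU is a clipped n-gram precision: the share of candidate n-grams that also occur in the reference, with each n-gram counted at most as often as the reference contains it. The sketch below shows only that building block; full BLEU also combines several n-gram orders and applies a brevity penalty. The function name and the example sentences are made up.

```python
from collections import Counter


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against a single reference."""
    ref_counts = Counter(zip(*[reference[i:] for i in range(n)]))
    cand_counts = Counter(zip(*[candidate[i:] for i in range(n)]))
    # Each candidate n-gram is credited at most as often as it appears in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


reference = "patient was diagnosed with severe chest pain".split()
candidate = "patient was admitted with chest pain".split()

p1 = modified_precision(reference, candidate, 1)
p2 = modified_precision(reference, candidate, 2)
print(p1, p2)
```

Short outputs often have no matching 3-grams or 4-grams at all, which drives an unsmoothed sentence-level BLEU toward zero, so scores on single sentences should be read with caution.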

In [None]:
from nltk.translate.bleu_score import sentence_bleu

# Reference real medical text
reference_text = ["Patient was diagnosed with severe chest pain and found to have a heart attack."]

# Generate a prediction using the model
generated_text = text_generator("Patient was diagnosed with severe chest pain", max_length=30)[0]["generated_text"]

# Tokenize both texts
reference_tokens = [reference_text[0].split()]
generated_tokens = generated_text.split()

# Compute BLEU score
bleu_score = sentence_bleu(reference_tokens, generated_tokens)

print(f"✅ BLEU Score: {bleu_score:.2f}")


✅ BLEU Score: 0.24


#### Evaluate Model Using ROUGE Score

- Measures how well the generated text summarizes medical information.
- Higher ROUGE scores mean better summarization.
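
ROUGE-1 compares unigram overlap in both directions: precision is the share of generated words found in the reference, and recall is the share of reference words recovered. The sketch below is a simplified version with whitespace tokenization and no stemming, so its numbers will differ from the `rouge_score` package; the function name and sentences are assumptions.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # multiset intersection
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge1(
    "patient admitted with chest pain",
    "patient was admitted with severe chest pain today",
)
print(precision, recall, f1)
```

High recall with low precision, as in the scores reported for this notebook, indicates that the generated text covers the reference but adds many extra words.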

In [None]:
from rouge_score import rouge_scorer

# Define a reference summary and model-generated summary
reference_summary = "Patient admitted with chest pain, diagnosed with myocardial infarction."
generated_summary = generated_text

# Compute ROUGE score
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)

print(f"✅ ROUGE Scores: {scores}")


✅ ROUGE Scores: {'rouge1': Score(precision=0.19230769230769232, recall=0.5555555555555556, fmeasure=0.28571428571428575), 'rouge2': Score(precision=0.08, recall=0.25, fmeasure=0.12121212121212122), 'rougeL': Score(precision=0.15384615384615385, recall=0.4444444444444444, fmeasure=0.2285714285714286)}


# 🚀 Step 6: Fine-Tuning Improvements & Deployment

## 📌 What We Will Do
✅ **Further improve the fine-tuned model**.  
✅ **Deploy the model as an API for real-world usage**.  

## 🔄 Fine-Tuning Improvements
- Increase training epochs for better performance.
- Use a **larger dataset** or more diverse examples.
- Apply **data augmentation techniques**.
- Implement **hyperparameter tuning**.

## 🌍 Deploying the Model
- Convert model to a deployable format (ONNX, TorchScript, or HF API).
- Build an API using FastAPI.
- Deploy it on cloud platforms like AWS, Hugging Face Spaces, or Google Cloud.
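
Whichever web framework is used, the endpoint needs to validate the incoming prompt before calling the model. The sketch below shows that request-handling logic in a framework-agnostic form using only the standard library; it could be wrapped in a FastAPI route. `handle_generate`, the JSON field names, the length limit, and the `echo_generator` stand-in are all assumptions, and in practice `generate_fn` would call the `text_generator` pipeline.

```python
import json


def handle_generate(body, generate_fn, max_prompt_chars=2000):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, {"error": "body must be valid JSON"}

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, {"error": "'prompt' must be a non-empty string"}
    if len(prompt) > max_prompt_chars:
        return 413, {"error": "prompt too long"}

    return 200, {"prompt": prompt, "generated_text": generate_fn(prompt)}


def echo_generator(prompt):
    """Stand-in for the real model call, used only for illustration."""
    return prompt + " ..."


status, response = handle_generate('{"prompt": "Patient presents with"}', echo_generator)
print(status, response)
```

Keeping validation separate from the model call makes the handler easy to test without loading the model.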
