
# Fine-Tuning LLMs with Hugging Face and PyTorch

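This notebook demonstrates supervised fine-tuning (SFT) of a pretrained causal language model with Hugging Face `transformers`, `datasets`, and `peft`, using the `SFTTrainer` from `trl`. The running example adapts `EleutherAI/gpt-neo-125M` to the `lavita/MedQuAD` medical Q&A dataset with LoRA.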

## Overview
- Install dependencies
- Load pretrained model and tokenizer
- Load, clean, and tokenize dataset
- Configure LoRA for parameter-efficient tuning
- Train model using `SFTTrainer`
- Generate text and evaluate


In [1]:
# @title Install Required Libraries

!pip install -q transformers datasets peft accelerate bitsandbytes
!pip install trl

In [2]:
# @title Import Required Libraries

import torch
from datasets import Dataset, DatasetDict, load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from transformers import BitsAndBytesConfig, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
import pandas as pd
import numpy as np

In [3]:
# @title Select the device

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

Using device: cuda


In [4]:
# @title Load Pretrained Model and Tokenizer

model_name = "EleutherAI/gpt-neo-125M"

model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



In [5]:
# @title Test the model before fine-tuning to illustrate baseline performance
prompt = "What is Beriberi ?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)

print("🔍 Pretrained Model Output:")
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


🔍 Pretrained Model Output:
What is Beriberi?

Beriberi is a small, medium-sized, and very popular coffee shop in the city of Beriberi, in the city of Cagliari, in the province of Cagliari, in the province of Cag


In [6]:
# @title Load Dataset

# Load the MedQuAD medical question-answering dataset from the Hugging Face Hub.
print("Attempting to load lavita/MedQuAD...")
try:
    dataset = load_dataset("lavita/MedQuAD", split="train")
    print("Successfully loaded!")
except Exception as e:
    print(f"Error loading dataset: {e}")
    raise  # Stop here: the cells below need `dataset` to exist

# Print the structure of the dataset and show one example from the train split
print("Dataset structure:")
print(dataset)
print("\nFirst example from the training dataset:")
print(dataset[0])

# You can also inspect the features (columns) of the dataset
print("\nFeatures of the dataset:")
print(dataset.features)


Attempting to load lavita/MedQuAD...
Successfully loaded!
Dataset structure:
Dataset({
    features: ['document_id', 'document_source', 'document_url', 'category', 'umls_cui', 'umls_semantic_types', 'umls_semantic_group', 'synonyms', 'question_id', 'question_focus', 'question_type', 'question', 'answer'],
    num_rows: 47441
})

First example from the training dataset:
{'document_id': '0000559', 'document_source': 'GHR', 'document_url': 'https://ghr.nlm.nih.gov/condition/keratoderma-with-woolly-hair', 'category': None, 'umls_cui': 'C0343073', 'umls_semantic_types': 'T047', 'umls_semantic_group': 'Disorders', 'synonyms': 'KWWH', 'question_id': '0000559-1', 'question_focus': 'keratoderma with woolly hair', 'question_type': 'information', 'question': 'What is (are) keratoderma with woolly hair ?', 'answer': 'Keratoderma with woolly hair is a group of related conditions that affect the skin and hair and in many cases increase the risk of potentially life-threatening heart problems. People with these conditions

In [7]:
# @title Dataset Cleaning

print("=== STARTING DATASET CLEANING ===")
print(f"Original dataset size: {len(dataset)}")

# Step 1: Filter out examples with missing answers
print("\n=== FILTERING OUT MISSING ANSWERS ===")
filtered_examples = []
missing_count = 0

for example in dataset:
    if example['answer'] is not None and example['answer'].strip() != '':
        filtered_examples.append(example)
    else:
        missing_count += 1

print(f"Examples with valid answers: {len(filtered_examples)}")
print(f"Examples with missing answers: {missing_count}")
print(f"Kept {len(filtered_examples)/len(dataset)*100:.1f}% of original data")

# Step 2: Convert to pandas DataFrame and then back to Dataset
print("\n=== CREATING CLEANED DATASET ===")
df = pd.DataFrame(filtered_examples)
cleaned_dataset = Dataset.from_pandas(df)

print("✅ Cleaned dataset created successfully!")
print(f"Size: {len(cleaned_dataset)} examples")

# Step 3: Show sample from cleaned dataset
print("\n=== SAMPLE FROM CLEANED DATASET ===")
sample = cleaned_dataset[0]
print(f"Question: {sample['question']}")
print(f"Answer: {sample['answer'][:300]}...")
print(f"Category: {sample['category']}")
print(f"Question Type: {sample['question_type']}")
print(f"Source: {sample['document_source']}")

# Step 4: Create train/test split
print("\n=== CREATING TRAIN/TEST SPLIT ===")
split_datasets = cleaned_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_datasets["train"]
eval_dataset = split_datasets["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

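# Step 5: Create simplified version with essential fields only
# Note: map() merges the returned dict into each example, so without
# `remove_columns` the other original columns are kept as well (see the
# printed features). They are dropped later, during tokenization.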
print("\n=== CREATING SIMPLIFIED VERSION ===")
def simplify_example(example):
    return {
        'question': example['question'],
        'answer': example['answer']
    }

simplified_train = train_dataset.map(simplify_example)
simplified_eval = eval_dataset.map(simplify_example)

print("Simplified datasets created!")
print(f"Training features: {list(simplified_train.features.keys())}")

# Step 6: Show statistics about the cleaned data
print("\n=== CLEANED DATASET STATISTICS ===")

# Count question types
question_types = {}
categories = {}

for example in train_dataset:
    qtype = example['question_type']
    question_types[qtype] = question_types.get(qtype, 0) + 1

    cat = example['category']
    categories[cat] = categories.get(cat, 0) + 1

print("Top question types:")
for qtype, count in sorted(question_types.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {qtype}: {count}")

print("\nCategory distribution:")
for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):
    print(f"  {cat}: {count}")

# Step 7: Create final dataset dictionary
print("\n=== CREATING FINAL DATASET DICTIONARY ===")
final_datasets = DatasetDict({
    'train': simplified_train,
    'test': simplified_eval
})

print("Final datasets created:")
print(f"- Training: {len(final_datasets['train'])} examples")
print(f"- Testing: {len(final_datasets['test'])} examples")

# Step 8: Show sample questions from different types
print("\n=== SAMPLE QUESTIONS BY TYPE ===")
seen_types = set()
for example in train_dataset:
    qtype = example['question_type']
    if qtype not in seen_types and len(seen_types) < 5:
        print(f"\n{qtype.upper()}:")
        print(f"Q: {example['question']}")
        print(f"A: {example['answer'][:200]}...")
        seen_types.add(qtype)

# Step 9: Export information for training
print("\n=== READY FOR TRAINING ===")
print("SUCCESS! Your medical Q&A dataset is ready!")
print("Dataset Summary:")
print(f"   - Original size: {len(dataset):,} examples")
print(f"   - Cleaned size: {len(cleaned_dataset):,} examples")
print(f"   - Training set: {len(final_datasets['train']):,} examples")
print(f"   - Test set: {len(final_datasets['test']):,} examples")
print(f"   - Data retention: {len(cleaned_dataset)/len(dataset)*100:.1f}%")

print("\nUsage:")
print("   - Use final_datasets['train'] for training")
print("   - Use final_datasets['test'] for evaluation")
print("   - Each example keeps all original columns; 'question' and 'answer' are the ones used for training")

print("\nDataset focuses on:")
print("   - Medical conditions and genetic disorders")
print("   - Treatment information")
print("   - Symptoms and diagnosis")
print("   - Inheritance patterns")
print("   - Genetic changes and causes")

# Variables now available for use:
# - dataset (original)
# - cleaned_dataset (filtered)
# - final_datasets (ready for training)
# - train_dataset, eval_dataset (full versions)
# - simplified_train, simplified_eval (simplified versions)

=== STARTING DATASET CLEANING ===
Original dataset size: 47441

=== FILTERING OUT MISSING ANSWERS ===
Examples with valid answers: 16407
Examples with missing answers: 31034
Kept 34.6% of original data

=== CREATING CLEANED DATASET ===
✅ Cleaned dataset created successfully!
Size: 16407 examples

=== SAMPLE FROM CLEANED DATASET ===
Question: What is (are) keratoderma with woolly hair ?
Answer: Keratoderma with woolly hair is a group of related conditions that affect the skin and hair and in many cases increase the risk of potentially life-threatening heart problems. People with these conditions have hair that is unusually coarse, dry, fine, and tightly curled. In some cases, the hair is a...
Category: None
Question Type: information
Source: GHR

=== CREATING TRAIN/TEST SPLIT ===
Training examples: 14766
Evaluation examples: 1641

=== CREATING SIMPLIFIED VERSION ===



Simplified datasets created!
Training features: ['document_id', 'document_source', 'document_url', 'category', 'umls_cui', 'umls_semantic_types', 'umls_semantic_group', 'synonyms', 'question_id', 'question_focus', 'question_type', 'question', 'answer']

=== CLEANED DATASET STATISTICS ===
Top question types:
  information: 4082
  symptoms: 2507
  treatment: 2180
  inheritance: 1304
  frequency: 1004
  genetic changes: 981
  causes: 651
  exams and tests: 595
  research: 351
  outlook: 314

Category distribution:
  None: 13874
  Disease: 623
  Other: 269

=== CREATING FINAL DATASET DICTIONARY ===
Final datasets created:
- Training: 14766 examples
- Testing: 1641 examples

=== SAMPLE QUESTIONS BY TYPE ===

SYMPTOMS:
Q: What are the symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1 ?
A: What are the signs and symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1? The Human Phenotype Ontology provides the following list of signs and symptoms for Mental reta
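
The missing-answer filter in the cleaning cell can be expressed as a single predicate. The sketch below is a minimal illustration on a few made-up rows, not the notebook's actual data; the helper name `has_valid_answer` is hypothetical.

```python
def has_valid_answer(example):
    """Return True when the example has a non-empty answer string."""
    answer = example.get("answer")
    return answer is not None and answer.strip() != ""

# Tiny in-memory stand-in for the real dataset rows (illustrative data only).
rows = [
    {"question": "What is A ?", "answer": "A is ..."},
    {"question": "What is B ?", "answer": None},
    {"question": "What is C ?", "answer": "   "},
]

kept = [row for row in rows if has_valid_answer(row)]
print(f"Kept {len(kept)} of {len(rows)} rows")

# With the Hugging Face `datasets` library, the same predicate could be passed
# to `dataset.filter(has_valid_answer)`, which avoids the pandas round trip.
```

Keeping the rule in one function makes it easy to reuse the same definition of "valid" for both the count and the filtering step.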

## Prepare the Data

In [8]:
# @title Create formatted text for causal language modeling
def format_medical_qa(example):
    """
    Format the question-answer pair for causal language modeling.
    We'll use a clear format that the model can learn to follow.
    """
    question = example['question'].strip()
    answer = example['answer'].strip()

    # Create a formatted text with clear delimiters
    formatted_text = f"Question: {question}\nAnswer: {answer}{tokenizer.eos_token}"

    return {"text": formatted_text}

print("Formatting Q&A pairs...")
formatted_train = final_datasets['train'].map(format_medical_qa)
formatted_test = final_datasets['test'].map(format_medical_qa)

print("Sample formatted text:")
print(formatted_train[0]['text'])
print("\n" + "="*50 + "\n")

Formatting Q&A pairs...



Sample formatted text:
Question: What are the symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1 ?
Answer: What are the signs and symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1? The Human Phenotype Ontology provides the following list of signs and symptoms for Mental retardation-hypotonic facies syndrome X-linked, 1. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Abnormality of the palate 90% Anteverted nares 90% Cognitive impairment 90% Depressed nasal bridge 90% Microcephaly 90% Narrow forehead 90% Short stature 90% Tented upper lip vermilion 90% Behavioral abnormality 50% Genu valgum 50% Neurological speech impairment 50% Obesity 50% Seizures 35% Abnormality of the hip bone 7.5% Camptodactyly of finger 7.5% Cr
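
Because the model is trained on text in the `Question: ...\nAnswer: ...` layout, prompts at inference time may work better when they end at `Answer:` in the same layout. The sketch below shows one way to share a template between training and inference; `PROMPT_TEMPLATE`, `build_prompt`, and `build_training_text` are hypothetical names, and the example question and answer are made up.

```python
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

def build_prompt(question):
    """Build an inference prompt that mirrors the training format."""
    return PROMPT_TEMPLATE.format(question=question.strip())

def build_training_text(question, answer, eos_token):
    """Full training string: prompt, a space, the answer, then EOS."""
    return f"{build_prompt(question)} {answer.strip()}{eos_token}"

# "<|endoftext|>" is the EOS token string used by GPT-2 style tokenizers.
text = build_training_text("What is X ?", "X is a condition.", "<|endoftext|>")
print(text)
```

With this split, the string passed to `generate` and the prefix of every training example come from the same function, so the two cannot drift apart.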

In [9]:
# @title Tokenization

def tokenize_function(examples):
    """
    Tokenize the formatted text for causal language modeling.
    For GPT-style models, input_ids and labels are the same (shifted internally).
    """
    # Tokenize the text
    tokenized = tokenizer(
        examples['text'],
        truncation=True,        # Ensure long sequences are cut
        padding=False,          # We'll pad later in batches with DataCollator
        max_length=512,         # Set your desired maximum sequence length here
        return_tensors=None     # Return lists, not tensors
    )

    # For causal LM, labels are the same as input_ids
    # The model will internally shift them for next-token prediction
    tokenized['labels'] = tokenized['input_ids'].copy()

    return tokenized

print("Tokenizing datasets...")
# Apply the tokenization function to your formatted datasets
tokenized_train = formatted_train.map(
    tokenize_function,
    batched=True,
    # Remove the original 'text' column, as it's no longer needed after tokenization
    remove_columns=formatted_train.column_names
)
tokenized_test = formatted_test.map(
    tokenize_function,
    batched=True,
    remove_columns=formatted_test.column_names
)

print(f"Tokenized training examples: {len(tokenized_train)}")
print(f"Tokenized test examples: {len(tokenized_test)}")

print("\nSample tokenized example:")
sample = tokenized_train[0]
print(f"Input IDs shape: {len(sample['input_ids'])}")
print(f"Labels shape: {len(sample['labels'])}")
print(f"Attention mask shape: {len(sample['attention_mask'])}")
# Decode a portion to verify
sample_text = tokenizer.decode(sample['input_ids'][:100], skip_special_tokens=False)
print(f"Sample decoded text (first 100 tokens): {sample_text}")
print(f"Sample tokenized IDs (first 20): {sample['input_ids'][:20]}") # Added this for inspection

Tokenizing datasets...



Tokenized training examples: 14766
Tokenized test examples: 1641

Sample tokenized example:
Input IDs shape: 512
Labels shape: 512
Attention mask shape: 512
Sample decoded text (first 100 tokens): Question: What are the symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1?
Answer: What are the signs and symptoms of Mental retardation-hypotonic facies syndrome X-linked, 1? The Human Phenotype Ontology provides the following list of signs and symptoms for Mental retardation-hypotonic facies syndrome X-linked, 1. If the information is available, the table below includes how often the symptom is seen in people with this
Sample tokenized IDs (first 20): [24361, 25, 1867, 389, 262, 7460, 286, 21235, 42964, 341, 12, 36362, 313, 9229, 1777, 444, 14027, 1395, 12, 25614]
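
The tokenization cell copies `input_ids` into `labels`, so the loss covers the question tokens as well as the answer. A common variant, shown here only as an optional sketch, masks the prompt positions with `-100` so that only answer tokens contribute to the loss. The helper `mask_prompt_labels` and the token IDs are illustrative, and whether a given trainer or data collator keeps precomputed labels depends on the library version.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, ignoring the first `prompt_length` tokens."""
    n = min(prompt_length, len(input_ids))
    return [IGNORE_INDEX] * n + list(input_ids[n:])

# Illustrative token IDs, not real tokenizer output.
input_ids = [11, 12, 13, 14, 15, 16]
labels = mask_prompt_labels(input_ids, prompt_length=3)
print(labels)  # [-100, -100, -100, 14, 15, 16]
```

In practice `prompt_length` would be the number of tokens produced by tokenizing the `Question: ...\nAnswer:` prefix on its own.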


In [10]:
# @title A function to show model predictions (for testing)

def test_model_response(model, tokenizer, question, max_length=100):
    """Test the model's response to a question with proper attention mask handling"""
    model.eval()

    # Tokenize the input
    inputs = tokenizer(question, return_tensors="pt", padding=True, truncation=True)

    # Move to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        # Generate response with attention mask
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],  # Include attention mask
            max_new_tokens=max_length,  # cap on newly generated tokens (prompt excluded)
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

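    # Remove the original question from the response. This only matches when the
    # decoded text reproduces the question verbatim; if decoding normalizes
    # spacing (e.g. " ?" becomes "?"), the question stays in the response.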
    if question in response:
        response = response.replace(question, "").strip()

    return response
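
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the prompt can also be dropped by position instead of by string matching. The sketch below shows the slicing idea on plain lists; `split_generated` is a hypothetical helper and the IDs are made up.

```python
def split_generated(sequence_ids, prompt_length):
    """Split a generated sequence into (prompt_ids, new_ids)."""
    return sequence_ids[:prompt_length], sequence_ids[prompt_length:]

# Illustrative IDs: 4 prompt tokens followed by 3 newly generated tokens.
sequence_ids = [101, 102, 103, 104, 7, 8, 9]
prompt_ids, new_ids = split_generated(sequence_ids, prompt_length=4)
print(new_ids)  # [7, 8, 9]

# Inside test_model_response, the analogous step would be roughly:
#   prompt_length = inputs["input_ids"].shape[1]
#   response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

Slicing by length does not depend on how the tokenizer renders whitespace when decoding, which makes it more robust than replacing the question text.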

In [42]:
# @title Test the model before training
print("\nTesting model before training...")
test_question = "What is hereditary xanthinuria ?"
pre_training_response = test_model_response(model, tokenizer, test_question)
print(f"Pre-training response to '{test_question}':")
print(f"Answer: {pre_training_response}")



Testing model before training...
Pre-training response to 'What is hereditary xanthinuria ?':
Answer: What is hereditary xanthinuria?
Hexanthinuria is a disorder that affects the kidneys and has several different causes. These include protein-rich foods, infections, malnutrition, and other conditions that affect the nervous system. When this condition is present in the first person with the disorder, it is called hereditary xanthinuria. The condition is caused by mutations in the SLC26A1 gene that is located on the X chromosome. This gene provides instructions for making an enzyme called aryl hydrocarbon receptor type 1


## Fine-Tuning

In [34]:
# @title Configure LoRA
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM"
)
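
To get a feel for how small the LoRA update is, the number of trainable parameters can be estimated by hand: each adapted weight of shape `d_out x d_in` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes GPT-Neo-125M uses a hidden size of 768 and 12 layers; those dimensions are assumptions taken from the public model configuration, not from this notebook, and the function name is hypothetical.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

# Assumed GPT-Neo-125M dimensions: hidden size 768, 12 transformer layers.
hidden_size = 768
num_layers = 12
adapted_per_layer = 2      # q_proj and v_proj
r, lora_alpha = 32, 64

trainable = lora_params_per_matrix(hidden_size, hidden_size, r) * adapted_per_layer * num_layers
scaling = lora_alpha / r   # factor applied to the low-rank update

print(f"Estimated LoRA trainable parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 1.2 million trainable parameters, about 1% of the base model, and the low-rank update is scaled by 2.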


In [35]:
# @title Define the Training Arguments

training_args = TrainingArguments(
    output_dir="./sft-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=500,
    #max_steps=100,
    num_train_epochs=10,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    report_to="none",
    label_names=["labels"]
)


In [36]:
# @title Define the Supervised Fine-Tuning Trainer

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train, # Pass the dataset that has ALREADY been tokenized
    eval_dataset=tokenized_test,   # Used by trainer.evaluate(); with the default eval strategy, no evaluation runs during training
    peft_config=lora_config,
    args=training_args
)




In [37]:
# @title Train the model

trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 1.9529 |
| 1000 | 1.7522 |
| 1500 | 1.7054 |
| 2000 | 1.6651 |
| 2500 | 1.6556 |
| 3000 | 1.645  |
| 3500 | 1.6246 |
| 4000 | 1.6058 |
| 4500 | 1.5869 |
| 5000 | 1.5845 |


TrainOutput(global_step=18460, training_loss=1.5458635588508516, metrics={'train_runtime': 6919.1469, 'train_samples_per_second': 21.341, 'train_steps_per_second': 2.668, 'total_flos': 2.5605556367616e+16, 'train_loss': 1.5458635588508516})
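
The reported `global_step=18460` can be sanity-checked from the training arguments. The sketch below assumes a single device and that a final partial gradient-accumulation group counts as one optimizer step; exact rounding may differ between library versions, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Estimate optimizer steps on one device, counting a final partial accumulation."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * num_epochs

# 14,766 training examples, batch size 2, 4 accumulation steps, 10 epochs.
total = optimizer_steps(14766, 2, 4, 10)
print(total)
```

The effective batch size here is `2 * 4 = 8` examples per optimizer step, which gives 1,846 steps per epoch and 18,460 steps over 10 epochs.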

In [38]:
# @title Save the Model

trainer.save_model("./fine-tuned-model")


In [41]:
# @title Test the model after training
test_question_after_training = "What is hereditary xanthinuria ?" # You can change this question
post_training_response = test_model_response(model, tokenizer, test_question_after_training)
print(f"Post-training response to '{test_question_after_training}':")
print(f"Answer: {post_training_response}")

Post-training response to 'What is hereditary xanthinuria ?':
Answer: What is hereditary xanthinuria?
Hereditary xanthinuria (HXA) is a disorder characterized by a deficiency of xanthine, the active form of the vitamin that is found in the body. These deficiencies are caused by the xanthine-rich body protein called xanthine inositol. Xanthine is a metabolite that is stored in the liver and binds to proteins to form the enzyme xanthine. The enzyme xanthine is important for normal blood flow and blood clotting.


In [40]:
# @title Evaluate the model after training

print("\nEvaluating model on test dataset...")

# Optional: evaluate on a smaller slice to save time, e.g. the first 50 examples
# subset_test_dataset = tokenized_test.select(range(50))

evaluation_results = trainer.evaluate(eval_dataset=tokenized_test)

print("Evaluation Results:")
for key, value in evaluation_results.items():
    print(f"- {key}: {value:.4f}")

print("\n" + "="*50 + "\n")


Evaluating model on test dataset...


Evaluation Results:
- eval_loss: 1.5418
- eval_runtime: 40.3260
- eval_samples_per_second: 40.6930
- eval_steps_per_second: 5.1080
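
If `eval_loss` is read as the mean per-token cross-entropy in nats, which is the usual convention for causal language modeling, it can be converted to perplexity by exponentiation. A minimal sketch using the value reported above:

```python
import math

eval_loss = 1.5418  # value reported in the evaluation results
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

This gives a perplexity of about 4.67 on the held-out split. The number is only comparable across runs that use the same tokenizer and the same evaluation data.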


