<a href="https://colab.research.google.com/github/L-Semakale/domain-specific-assistant/blob/main/Untitled11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Healthcare Chatbot - LLM Fine-tuning with LoRA

### Project Overview

This notebook builds a **domain-specific healthcare assistant** by fine-tuning a large language model with **LoRA (Low-Rank Adaptation)**. The goal is a chatbot that answers medical questions accurately and attaches safety disclaimers to its responses.

### Key Features:
- Medical domain specialization
- Parameter-efficient fine-tuning (LoRA)
- Comprehensive evaluation (BLEU, ROUGE)
- Interactive Gradio interface
- Medical safety disclaimers

### Pipeline:
1. **Setup** - Install dependencies and configure environment
2. **Data Preprocessing** - Load and format medical Q&A dataset
3. **Model Loading** - Load base model with 4-bit quantization
4. **Fine-tuning** - Train with LoRA on medical data
5. **Evaluation** - Calculate metrics and compare models
6. **Deployment** - Create interactive web interface

# 1. Environment Setup

Install the required packages and verify GPU availability.

In [3]:
%%capture
# Install required packages (output suppressed for a cleaner notebook).
# Each requirement is quoted so the shell does not treat ">" as an output redirect.
!pip install -q "transformers>=4.35.0"
!pip install -q "datasets>=2.14.0"
!pip install -q "accelerate>=0.24.0"
!pip install -q "peft>=0.6.0"
!pip install -q "bitsandbytes>=0.41.0"
!pip install -q "trl>=0.7.0"
!pip install -q "gradio>=4.0.0"
!pip install -q "evaluate>=0.4.0"
!pip install -q "rouge-score>=0.1.2"
!pip install -q "nltk>=3.8.0"
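
Because the install cell suppresses its output, a failed or outdated install can go unnoticed. The following is a minimal sketch of a post-install check, assuming the minimum versions listed in the install cell; `MINIMUMS`, `version_tuple`, and `check_versions` are hypothetical helper names, and the version parsing is deliberately simple (it only compares the first three numeric components).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the install cell (assumed to be the intended floor)
MINIMUMS = {
    "transformers": "4.35.0",
    "datasets": "2.14.0",
    "accelerate": "0.24.0",
    "peft": "0.6.0",
    "bitsandbytes": "0.41.0",
    "trl": "0.7.0",
    "gradio": "4.0.0",
    "evaluate": "0.4.0",
    "rouge-score": "0.1.2",
    "nltk": "3.8.0",
}


def version_tuple(text):
    """Convert '4.35.0' to (4, 35, 0); suffixes such as 'rc1' are ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    numbers += [0] * (3 - len(numbers))  # pad '2.14' to (2, 14, 0)
    return tuple(numbers)


def check_versions(minimums):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        is_ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if is_ok else "too old"
    return report


for name, status in check_versions(MINIMUMS).items():
    print(f"{name:15s} {status}")
```

Any package reported as `missing` or `too old` suggests the corresponding `pip install` line did not take effect.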

In [4]:
# Import libraries
import os
import torch
import pandas as pd
import numpy as np
from datetime import datetime
import json

# Hugging Face
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments
)

# PEFT and LoRA
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer

# Evaluation
from evaluate import load as load_metric
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk
nltk.download('punkt', quiet=True)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Gradio for deployment
import gradio as gr

# Set style
sns.set_style('whitegrid')

print("Libraries imported successfully!")

Libraries imported successfully!


In [5]:
# Check GPU availability and specs
print("="*80)
print("GPU INFORMATION")
print("="*80)

if torch.cuda.is_available():
    print(f"GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"   CUDA Version: {torch.version.cuda}")
else:
    print("No GPU available. Training will be slow on CPU.")
    print("   Please enable GPU in Runtime > Change runtime type > T4 GPU")

print("="*80)

GPU INFORMATION
No GPU available. Training will be slow on CPU.
   Please enable GPU in Runtime > Change runtime type > T4 GPU
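
The GPU memory figure matters because the pipeline loads the base model with 4-bit quantization. As a rough illustration of why, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is an assumed round figure for a "2B"-class model, not a value read from the checkpoint, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# ASSUMPTION: illustrative parameter count, not taken from the model card
ASSUMED_PARAMS = 2.5e9

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(ASSUMED_PARAMS, bits):.2f} GB")
```

Under this assumption, 4-bit weights take a quarter of the fp16 footprint, which leaves more of a small GPU's memory for activations and the LoRA adapter.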


# 2. Configuration

Set all hyperparameters and configurations in one place.

In [1]:
# CONFIGURATION PARAMETERS

# Dataset Configuration
DATASET_NAME = "medalpaca/medical_meadow_medical_flashcards"
TRAIN_SIZE = 3000
VAL_SIZE = 500
TEST_SIZE = 500
RANDOM_SEED = 42

# Model Configuration
MODEL_NAME = "google/gemma-2b"
USE_4BIT_QUANTIZATION = True

# LoRA Configuration
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q_proj", "v_proj"]

# Training Configuration
NUM_EPOCHS = 2
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4
LEARNING_RATE = 2e-4
MAX_SEQ_LENGTH = 512
WARMUP_STEPS = 50

# Evaluation Configuration
EVAL_SAMPLES = 100
CALCULATE_PERPLEXITY = False

# Output Configuration
OUTPUT_DIR = "./healthcare-chatbot-lora"
SAVE_STEPS = 100
LOGGING_STEPS = 10
EVAL_STEPS = 50

# Display configuration
print("="*80)
print("EXPERIMENT CONFIGURATION")
print("="*80)
print(f"Dataset: {DATASET_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Training samples: {TRAIN_SIZE}")
print(f"LoRA rank: {LORA_R}")
print(f"Learning rate: {LEARNING_RATE}")
print(f"Epochs: {NUM_EPOCHS}")
print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print("="*80)

EXPERIMENT CONFIGURATION
Dataset: medalpaca/medical_meadow_medical_flashcards
Model: google/gemma-2b
Training samples: 3000
LoRA rank: 16
Learning rate: 0.0002
Epochs: 2
Effective batch size: 16
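
To make the configuration values more concrete, the sketch below works through two pieces of arithmetic they imply. For LoRA, an adapter on a projection with `d_in` inputs and `d_out` outputs adds two matrices of shape `r x d_in` and `d_out x r`, so `r * (d_in + d_out)` trainable parameters, and the update is scaled by `LORA_ALPHA / LORA_R`. The projection shapes and layer count below are assumptions chosen for illustration and should be checked against the loaded model's config. The step count is plain arithmetic; the number a trainer reports can differ slightly depending on how it handles the last partial batch.

```python
import math

# Values from the configuration cell
LORA_R, LORA_ALPHA = 16, 32
TRAIN_SIZE, NUM_EPOCHS = 3000, 2
BATCH_SIZE, GRADIENT_ACCUMULATION = 4, 4


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


def optimizer_steps(n_samples, batch_size, grad_accum, epochs):
    """Optimizer updates, assuming one update per batch_size * grad_accum samples."""
    steps_per_epoch = math.ceil(n_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


# ASSUMPTION: illustrative shapes for the two targeted projections and the layer count
ASSUMED_LAYERS = 18
ASSUMED_SHAPES = {"q_proj": (2048, 2048), "v_proj": (2048, 256)}

per_layer = sum(
    lora_param_count(d_in, d_out, LORA_R) for d_in, d_out in ASSUMED_SHAPES.values()
)
lora_scaling = LORA_ALPHA / LORA_R

print(f"LoRA scaling (alpha / r): {lora_scaling}")
print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {per_layer * ASSUMED_LAYERS:,}")
print(f"Optimizer steps: {optimizer_steps(TRAIN_SIZE, BATCH_SIZE, GRADIENT_ACCUMULATION, NUM_EPOCHS)}")
```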


# 3. Data Preprocessing

Load the medical Q&A dataset and split it into training, validation, and test sets so the model is evaluated on data it has not seen during training.

In [2]:
from datasets import load_dataset

# Load the medical flashcards dataset named in the configuration cell
print(f"Loading dataset: {DATASET_NAME}")

dataset = load_dataset(DATASET_NAME)
print(f"Original dataset size: {len(dataset['train'])}")

# Use a 4,000-sample subset so training finishes within a Colab session
dataset = dataset["train"].shuffle(seed=RANDOM_SEED).select(range(4000))
print(f"Selected subset: {len(dataset)} samples")

# Hold out 10% of the subset for validation so generalization can be measured
dataset = dataset.train_test_split(test_size=0.1, seed=RANDOM_SEED)

train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print("\n" + "="*60)
print("DATASET SPLIT (Train/Validation)")
print("="*60)
print(f"Training samples:   {len(train_dataset):,}")
print(f"Validation samples: {len(eval_dataset):,}")
print("Split ratio:        90/10")
print("="*60)

# Show sample structure
print("\nSample data structure:")
sample = train_dataset[0]
for key in sample.keys():
    print(f"  {key}: {str(sample[key])[:100]}...")

Loading dataset: medalpaca/medical_meadow_medical_flashcards


Original dataset size: 33955
Selected subset: 4000 samples

DATASET SPLIT (Train/Validation)
Training samples:   3,600
Validation samples: 400
Split ratio:        90/10

Sample data structure:
  input: ...
  output: ...
  instruction: Answer this question truthfully...


In [3]:
# Explore sample data
print("="*80)
print("SAMPLE DATA")
print("="*80)

for i in range(3):
    sample = dataset['train'][i]
    print(f"\n--- Sample {i+1} ---")

    # Extract fields
    question = sample.get('instruction', sample.get('input', sample.get('question', '')))
    answer = sample.get('output', sample.get('answer', sample.get('response', '')))

    print(f"Question: {question[:150]}...")
    print(f"Answer: {answer[:150]}...")

print("\n" + "="*80)

SAMPLE DATA

--- Sample 1 ---
Question: Answer this question truthfully...
Answer: ...

--- Sample 2 ---
Question: Answer this question truthfully...
Answer: In prerenal azotemia, the urine osmolality is typically greater than 500 mOsm/kg, which is within the normal range....

--- Sample 3 ---
Question: Answer this question truthfully...
Answer: The interlobular arteries of the kidney divide into the afferent arterioles....
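
The recorded output shows the same generic `instruction` ("Answer this question truthfully") for every sample, and the first sample has an empty `input` and `output`. This suggests the question text lives in `input` and that some records are empty. The sketch below is one way to handle both cases, not the notebook's own method: `extract_qa` and `is_usable` are hypothetical helpers, and the toy records are placeholders rather than real dataset rows.

```python
def extract_qa(sample):
    """Return (question, answer), preferring `input` over the generic `instruction`."""
    if "input" in sample:
        question = sample["input"]
    else:
        # Fallback for datasets that store the question under `instruction`
        question = sample.get("instruction", "")
    answer = sample.get("output", "")
    return (question or "").strip(), (answer or "").strip()


def is_usable(sample):
    """Keep only records that have both a question and an answer."""
    question, answer = extract_qa(sample)
    return bool(question) and bool(answer)


# Toy records (placeholders, not real dataset rows)
records = [
    {"instruction": "Answer this question truthfully", "input": "", "output": ""},
    {
        "instruction": "Answer this question truthfully",
        "input": "example question text",
        "output": "example answer text",
    },
]

usable = [record for record in records if is_usable(record)]
print(f"Usable records: {len(usable)} of {len(records)}")
for record in usable:
    print(extract_qa(record))
```

With a Hugging Face `Dataset`, the same predicate could be passed to `Dataset.filter`, for example `train_dataset.filter(is_usable)`.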



In [4]:
# Define formatting function
def format_instruction(sample):
    """
    Format data into instruction-following template.
    """
    # Extract fields
    instruction = sample.get('instruction', sample.get('input', sample.get('question', '')))
    response = sample.get('output', sample.get('answer', sample.get('response', '')))

    # Create prompt
    prompt = f"""Below is a medical question. Provide an accurate, helpful, and professional response.

### Question:
{instruction}

### Response:
{response}"""

    return prompt

# Test formatting
print("Formatted Example:")
print("="*80)
print(format_instruction(dataset['train'][0]))
print("="*80)

Formatted Example:
Below is a medical question. Provide an accurate, helpful, and professional response.

### Question:
Answer this question truthfully

### Response:
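
At inference time the model should see the same template, stopping right after the `### Response:` header so that it generates the answer itself. The sketch below keeps both cases in one helper. `build_prompt` and its `eos_token` parameter are hypothetical; appending an end-of-sequence token to training examples is a common practice rather than something this notebook does in this section, and whether the trainer already adds one depends on the library version.

```python
PROMPT_HEADER = (
    "Below is a medical question. "
    "Provide an accurate, helpful, and professional response."
)


def build_prompt(question, response=None, eos_token=""):
    """Build a training example (with response) or an inference prompt (without)."""
    prompt = f"{PROMPT_HEADER}\n\n### Question:\n{question}\n\n### Response:\n"
    if response is None:
        return prompt  # inference: the model continues from here
    return prompt + response + eos_token  # training: full example


# Placeholder strings, not real dataset content
inference_prompt = build_prompt("example question text")
training_text = build_prompt("example question text", "example answer text", eos_token="<eos>")

print(inference_prompt)
print("-" * 40)
print(training_text)
```

Building both strings from one function keeps the training and inference templates from drifting apart.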



In [5]:
# Create train/validation/test splits from the training portion of the earlier 90/10 split
print("Creating dataset splits...")

# Reshuffle that training portion with the configured seed
dataset_shuffled = dataset['train'].shuffle(seed=RANDOM_SEED)

# Get the total number of available samples after shuffling
total_available_samples = len(dataset_shuffled)

# Ensure TRAIN_SIZE does not exceed total available samples
actual_train_size = min(TRAIN_SIZE, total_available_samples)

# Samples left over for validation and test
remaining_samples = total_available_samples - actual_train_size

# If the configured VAL_SIZE + TEST_SIZE does not fit in the remainder,
# split the remainder evenly between validation and test
if VAL_SIZE + TEST_SIZE > remaining_samples:
    print(f"\nWarning: Configured VAL_SIZE ({VAL_SIZE}) + TEST_SIZE ({TEST_SIZE}) exceeds remaining samples ({remaining_samples}).")
    print("Adjusting validation and test sizes dynamically.")
    actual_val_size = remaining_samples // 2
    actual_test_size = remaining_samples - actual_val_size
else:
    actual_val_size = VAL_SIZE
    actual_test_size = TEST_SIZE

# Create splits
train_dataset = dataset_shuffled.select(range(actual_train_size))
val_dataset = dataset_shuffled.select(range(actual_train_size, actual_train_size + actual_val_size))
test_dataset = dataset_shuffled.select(range(actual_train_size + actual_val_size,
                                             actual_train_size + actual_val_size + actual_test_size))

# Add formatted text field
def add_text_field(example):
    example['text'] = format_instruction(example)
    return example

train_dataset = train_dataset.map(add_text_field)
val_dataset = val_dataset.map(add_text_field)
test_dataset = test_dataset.map(add_text_field)

print("\nSplits created:")
print(f"   Training: {len(train_dataset)} samples")
print(f"   Validation: {len(val_dataset)} samples")
print(f"   Test: {len(test_dataset)} samples")

Creating dataset splits...

Adjusting validation and test sizes dynamically.



Splits created:
   Training: 3000 samples
   Validation: 300 samples
   Test: 300 samples
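
The reported sizes follow from the earlier steps: the 4,000-sample subset was split 90/10, leaving 3,600 samples, and taking 3,000 for training leaves 600, which is less than the configured 500 + 500. The sketch below reproduces that arithmetic as standalone functions and checks that the resulting index ranges do not overlap. `resolve_split_sizes` and `split_ranges` are hypothetical names, not part of the notebook.

```python
def resolve_split_sizes(total, train_size, val_size, test_size):
    """Shrink validation/test evenly when the configured sizes do not fit."""
    train = min(train_size, total)
    remaining = total - train
    if val_size + test_size > remaining:
        val = remaining // 2
        test = remaining - val
    else:
        val, test = val_size, test_size
    return train, val, test


def split_ranges(total, train_size, val_size, test_size):
    """Return three consecutive, non-overlapping index ranges."""
    train, val, test = resolve_split_sizes(total, train_size, val_size, test_size)
    return (
        range(0, train),
        range(train, train + val),
        range(train + val, train + val + test),
    )


# 3,600 = training portion of the 90/10 split; 3000/500/500 = configured sizes
train_idx, val_idx, test_idx = split_ranges(3600, 3000, 500, 500)
print(len(train_idx), len(val_idx), len(test_idx))

# No index appears in more than one split
assert not (set(train_idx) & set(val_idx))
assert not (set(val_idx) & set(test_idx))
```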
