# 🏥 Medical LLM Fine-Tuning with LoRA/QLoRA

This notebook demonstrates how to fine-tune a large language model (LLM) on medical data using Parameter-Efficient Fine-Tuning (PEFT) techniques, specifically LoRA and QLoRA.

## Overview

We will:
1. Install and import required libraries
2. Load a base LLM with 4-bit quantization
3. Prepare a medical dataset for instruction tuning
4. Configure LoRA/QLoRA parameters
5. Set up the training arguments
6. Fine-tune the model
7. Save the model and export it to GGUF format for Ollama
8. Test the fine-tuned model

**Based on:** [LLM-Medical-Finetuning](https://github.com/Shekswess/LLM-Medical-Finetuning)

⚠️ **Disclaimer:** This model is for educational purposes only. Always consult healthcare professionals for medical advice.

## 1. Install and Import Required Libraries

First, let's install all necessary packages. Choose the appropriate installation based on your GPU.
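
As a rough guide for choosing between the commented-out install commands in the next cell, the sketch below maps a GPU's CUDA compute capability to an install profile. It assumes FlashAttention needs compute capability 8.0 or higher (Ampere and newer), which matches the GPU groups named in the install comments. The function name `pick_install_profile` and the profile labels are hypothetical.

```python
def pick_install_profile(major: int, minor: int) -> str:
    """Map a CUDA compute capability to one of the install options.

    Assumption: FlashAttention needs compute capability >= 8.0 (Ampere or newer).
    The profile labels are made up for this sketch.
    """
    if (major, minor) >= (8, 0):
        return "ampere-or-newer"  # RTX 30xx/40xx, A100, H100, L40
    return "older-gpu"            # V100, T4, RTX 20xx: skip flash-attn


# Compute capabilities: T4 is 7.5, A100 is 8.0, RTX 4090 is 8.9
for name, capability in [("T4", (7, 5)), ("A100", (8, 0)), ("RTX 4090", (8, 9))]:
    print(f"{name}: {pick_install_profile(*capability)}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the `(major, minor)` pair to pass in.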

In [None]:
# Install packages for RTX 30xx, 40xx, A100, H100, L40 GPUs
# Uncomment and run if needed
# !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# For older GPUs (V100, T4, RTX 20xx) - no flash attention
# !pip install --no-deps xformers trl peft accelerate bitsandbytes
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Standard installation without unsloth (works on most systems)
# !pip install torch transformers datasets peft trl accelerate bitsandbytes huggingface_hub

In [None]:
# Import necessary libraries
import json
import os
import torch
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import TrainingArguments
from trl import SFTTrainer

# Try to import unsloth (for faster training)
try:
    from unsloth import FastLanguageModel
    UNSLOTH_AVAILABLE = True
    print("✅ Unsloth is available - using optimized training!")
except ImportError:
    UNSLOTH_AVAILABLE = False
    print("⚠️ Unsloth not available - using standard transformers")
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Check GPU availability
print(f"\n🖥️ CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f"🔢 BF16 Support: {torch.cuda.is_bf16_supported()}")

In [None]:
# Login to Hugging Face Hub (required for some models and pushing to hub)
# You'll need a HuggingFace account and access token
notebook_login()

## 2. Load Base Model with 4-bit Quantization

We'll use a 4-bit quantized model to reduce memory requirements. Available models:
- `unsloth/llama-3-8b-Instruct-bnb-4bit` - Llama 3 (8B)
- `unsloth/llama-2-7b-chat-bnb-4bit` - Llama 2 (7B)
- `unsloth/mistral-7b-instruct-v0.2-bnb-4bit` - Mistral (7B)
- `unsloth/gemma-1.1-7b-it-bnb-4bit` - Gemma (7B)
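
To make the memory saving concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under simplifying assumptions: a round 8 billion parameters, every weight stored at the stated bit width, and no allowance for quantization constants, layers kept in higher precision, activations, LoRA adapters, or optimizer state. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 8 billion parameters for the 8B model
params = 8e9
fp16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)
print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which is what lets a model of this size fit on a single consumer GPU.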

In [None]:
# Configuration for the fine-tuning
config = {
    "hugging_face_username": "your_username",  # Replace with your HF username
    
    "model_config": {
        "base_model": "unsloth/llama-3-8b-Instruct-bnb-4bit",  # Base model to fine-tune
        "finetuned_model": "llama-3-8b-medical",  # Name for your fine-tuned model
        "max_seq_length": 2048,  # Maximum sequence length
        "dtype": None,  # None lets Unsloth auto-detect: bfloat16 on GPUs that support it, float16 otherwise
        "load_in_4bit": True,  # Use 4-bit quantization
    },
    
    "lora_config": {
        "r": 16,  # LoRA rank (8, 16, 32, 64)
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
        "lora_alpha": 16,
        "lora_dropout": 0,
        "bias": "none",
        "use_gradient_checkpointing": True,
        "use_rslora": False,
        "use_dora": False,
        "loftq_config": None
    },
    
    "training_dataset": {
        "name": "Shekswess/medical_llama3_instruct_dataset_short",  # Short dataset (2000 samples)
        # "name": "Shekswess/medical_llama3_instruct_dataset",  # Full dataset
        "split": "train",
        "input_field": "prompt",
    },
    
    "training_config": {
        "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 4,
        "warmup_steps": 5,
        "max_steps": 0,  # 0 to use num_train_epochs instead
        "num_train_epochs": 1,
        "learning_rate": 2e-4,
        "fp16": torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),  # fp16 only on GPUs without bf16
        "bf16": torch.cuda.is_bf16_supported() if torch.cuda.is_available() else False,
        "logging_steps": 1,
        "optim": "adamw_8bit",
        "weight_decay": 0.01,
        "lr_scheduler_type": "linear",
        "seed": 42,
        "output_dir": "outputs",
    }
}

print("📋 Configuration loaded!")
print(f"   Base model: {config['model_config']['base_model']}")
print(f"   LoRA rank: {config['lora_config']['r']}")
print(f"   Dataset: {config['training_dataset']['name']}")

In [None]:
# Load the model and tokenizer
if UNSLOTH_AVAILABLE:
    # Using Unsloth for optimized loading
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=config["model_config"]["base_model"],
        max_seq_length=config["model_config"]["max_seq_length"],
        dtype=config["model_config"]["dtype"],
        load_in_4bit=config["model_config"]["load_in_4bit"],
    )
    print("✅ Model loaded with Unsloth!")
else:
    # Standard transformers loading with 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        # Match the compute dtype to the precision used for training
        bnb_4bit_compute_dtype=torch.bfloat16 if config["training_config"]["bf16"] else torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    
    # Load by the full repo ID: stripping the "unsloth/" prefix leaves an invalid Hub ID
    model = AutoModelForCausalLM.from_pretrained(
        config["model_config"]["base_model"],
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(config["model_config"]["base_model"])
    tokenizer.pad_token = tokenizer.eos_token
    print("✅ Model loaded with standard transformers!")

## 3. Prepare Medical Dataset

Load the pre-processed medical instruction dataset from HuggingFace. The dataset contains medical Q&A pairs formatted for instruction tuning.
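
The dataset's exact template is not shown here, so the following is only a sketch of how one Q&A pair could be rendered in the Llama 3 chat layout that the test cell later in this notebook uses. The function name `format_llama3_example` and the placeholder answer are hypothetical. The special tokens are the ones that appear in the notebook's own test prompt and Modelfile template.

```python
def format_llama3_example(system: str, question: str, answer: str) -> str:
    """Render one training example in the Llama 3 chat layout (illustrative only)."""
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{answer}<|eot_id|>"
    )


example = format_llama3_example(
    system="Answer the question truthfully, you are a medical professional.",
    question="What are the symptoms of diabetes?",
    answer="(reference answer from the dataset would go here)",
)
print(example)
```

Comparing this output with the sample preview printed by the next cell is a quick way to check whether the dataset really follows this layout.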

In [None]:
# Load the medical dataset
dataset_train = load_dataset(
    config["training_dataset"]["name"], 
    split=config["training_dataset"]["split"]
)

print("📊 Dataset Statistics:")
print(f"   Number of samples: {len(dataset_train)}")
print(f"   Columns: {dataset_train.column_names}")
print("\n📝 Sample prompt preview:")
print("-" * 50)
# Read the text column named in the config instead of hard-coding "prompt"
print(dataset_train[0][config["training_dataset"]["input_field"]][:500] + "...")

## 4. Configure LoRA/QLoRA Parameters

Set up the PEFT (Parameter Efficient Fine-Tuning) configuration with LoRA adapters. This allows us to fine-tune only a small number of parameters while keeping the base model frozen.
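
To see why so few parameters are trained, the sketch below counts the weights that LoRA adds. For each frozen weight matrix of shape `(out, in)`, LoRA trains two small matrices of shapes `(r, in)` and `(out, r)`. The layer sizes are assumptions based on the public Llama 3 8B configuration; a different base model needs its own values. The helper name `lora_params` is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) next to a frozen (out x in) weight."""
    return r * (in_features + out_features)


# Assumed Llama 3 8B shapes: hidden size 4096, MLP size 14336, 32 layers,
# and 1024-wide key/value projections (grouped-query attention).
hidden, mlp, kv, layers = 4096, 14336, 1024, 32
shapes = {  # module name -> (in_features, out_features)
    "q_proj": (hidden, hidden), "k_proj": (hidden, kv),
    "v_proj": (hidden, kv), "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp), "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

r, lora_alpha = 16, 16
per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total : {total:,}")
print(f"LoRA scaling (alpha / r) : {lora_alpha / r}")
```

With these assumptions the total is about 42 million trainable parameters, well under 1% of an 8B base model. The percentage printed by the notebook's own counter may look larger, because 4-bit weights are stored packed and so report fewer elements.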

In [None]:
# Setup LoRA/QLoRA adapters
if UNSLOTH_AVAILABLE:
    model = FastLanguageModel.get_peft_model(
        model,
        r=config["lora_config"]["r"],
        target_modules=config["lora_config"]["target_modules"],
        lora_alpha=config["lora_config"]["lora_alpha"],
        lora_dropout=config["lora_config"]["lora_dropout"],
        bias=config["lora_config"]["bias"],
        use_gradient_checkpointing=config["lora_config"]["use_gradient_checkpointing"],
        random_state=42,
        use_rslora=config["lora_config"]["use_rslora"],
        use_dora=config["lora_config"]["use_dora"],
        loftq_config=config["lora_config"]["loftq_config"],
    )
else:
    # Standard PEFT setup
    model = prepare_model_for_kbit_training(model)
    
    peft_config = LoraConfig(
        r=config["lora_config"]["r"],
        lora_alpha=config["lora_config"]["lora_alpha"],
        lora_dropout=config["lora_config"]["lora_dropout"],
        bias=config["lora_config"]["bias"],
        task_type="CAUSAL_LM",
        target_modules=config["lora_config"]["target_modules"],
    )
    model = get_peft_model(model, peft_config)

# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"🔧 Trainable parameters: {trainable_params:,} / {all_param:,} ({100 * trainable_params / all_param:.2f}%)")

print_trainable_parameters(model)

## 5. Set Up Training Arguments

Configure the training hyperparameters including learning rate, batch size, and optimization settings.
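
The batch size and gradient accumulation settings together determine how many optimizer updates one epoch contains. The arithmetic below is a sketch for a single GPU, assuming the 2,000-sample short dataset noted in the config. The exact rounding inside the Trainer may differ between versions, and the helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Approximate number of optimizer updates on a single GPU."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Assumption: 2,000 samples, as noted for the short dataset in the config
effective_batch = 2 * 4  # per_device_train_batch_size * gradient_accumulation_steps
steps = optimizer_steps(num_samples=2000, per_device_batch=2, grad_accum=4, epochs=1)
print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps     : {steps}")
```

Because `logging_steps` is 1, the training log should contain roughly one loss value per optimizer step counted here.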

In [None]:
# Create output directory
os.makedirs(config["training_config"]["output_dir"], exist_ok=True)

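# Set up the trainer.
# Note: passing tokenizer, dataset_text_field, max_seq_length, and packing directly to
# SFTTrainer matches older TRL releases. Newer releases expect these settings in
# trl.SFTConfig (and the tokenizer as processing_class), so pin TRL or adapt this call.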
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset_train,
    dataset_text_field=config["training_dataset"]["input_field"],
    max_seq_length=config["model_config"]["max_seq_length"],
    dataset_num_proc=2,
    packing=False,  # Can set to True for short sequences
    args=TrainingArguments(
        per_device_train_batch_size=config["training_config"]["per_device_train_batch_size"],
        gradient_accumulation_steps=config["training_config"]["gradient_accumulation_steps"],
        warmup_steps=config["training_config"]["warmup_steps"],
        max_steps=config["training_config"]["max_steps"] if config["training_config"]["max_steps"] > 0 else -1,
        num_train_epochs=config["training_config"]["num_train_epochs"],
        learning_rate=config["training_config"]["learning_rate"],
        fp16=config["training_config"]["fp16"],
        bf16=config["training_config"]["bf16"],
        logging_steps=config["training_config"]["logging_steps"],
        optim=config["training_config"]["optim"],
        weight_decay=config["training_config"]["weight_decay"],
        lr_scheduler_type=config["training_config"]["lr_scheduler_type"],
        seed=config["training_config"]["seed"],
        output_dir=config["training_config"]["output_dir"],
        report_to="none",  # Set to "wandb" for Weights & Biases logging
    ),
)

print("✅ Trainer configured!")
print(f"   Batch size: {config['training_config']['per_device_train_batch_size']}")
print(f"   Gradient accumulation: {config['training_config']['gradient_accumulation_steps']}")
print(f"   Effective batch size: {config['training_config']['per_device_train_batch_size'] * config['training_config']['gradient_accumulation_steps']}")
print(f"   Learning rate: {config['training_config']['learning_rate']}")
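
The `lr_scheduler_type="linear"` setting combined with `warmup_steps=5` is expected to ramp the learning rate up linearly and then decay it linearly to zero. The function below is a sketch of that shape, not the Trainer's own code. It assumes 250 total optimizer steps (2,000 samples at an effective batch size of 8), and the name `linear_lr` is hypothetical.

```python
def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: 250 total steps; the peak rate is reached when warmup ends at step 5
for step in (0, 1, 5, 128, 250):
    print(f"step {step:>3}: lr = {linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=250):.2e}")
```

With only 5 warmup steps the model trains at close to the peak rate of `2e-4` almost immediately, so a noisy first few loss values are not unusual.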

## 6. Fine-Tune the Model

Now we'll run the training loop. This may take a while depending on your GPU and dataset size.

In [None]:
# Memory statistics before training
if torch.cuda.is_available():
    gpu_stats = torch.cuda.get_device_properties(0)
    reserved_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
    max_memory = round(gpu_stats.total_memory / 1024**3, 2)
    print("💾 Memory before training:")
    print(f"   Reserved: {reserved_memory} GB")
    print(f"   Total: {max_memory} GB")

In [None]:
# 🚀 Start training!
print("🚀 Starting training...")
trainer_stats = trainer.train()
print("✅ Training complete!")

In [None]:
# Memory statistics after training
if torch.cuda.is_available():
    # Use peak reserved memory here too, so it is comparable with the pre-training reading
    used_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
    used_memory_lora = round(used_memory - reserved_memory, 2)
    used_memory_pct = round((used_memory / max_memory) * 100, 2)
    
    print("\n💾 Memory after training:")
    print(f"   Peak reserved: {used_memory} GB ({used_memory_pct}%)")
    print(f"   Additional memory for LoRA training: {used_memory_lora} GB")
    
# Save training stats
with open(os.path.join(config["training_config"]["output_dir"], "trainer_stats.json"), "w") as f:
    json.dump({
        "training_loss": trainer_stats.training_loss,
        "global_step": trainer_stats.global_step,
    }, f, indent=4)
print("📊 Training stats saved!")

## 7. Save and Export Fine-Tuned Model

Save the model locally and optionally export to GGUF format for use with Ollama.

In [None]:
# Save the fine-tuned model locally
model_save_path = os.path.join(config["training_config"]["output_dir"], config["model_config"]["finetuned_model"])
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
print(f"✅ Model saved to {model_save_path}")

In [None]:
# Export to GGUF format for Ollama (requires unsloth)
# Choose quantization: "q4_k_m" (balanced), "q8_0" (higher quality), "f16" (full precision)

if UNSLOTH_AVAILABLE:
    gguf_output_path = os.path.join(config["training_config"]["output_dir"], "gguf")
    
    # Export with q4_k_m quantization (good balance of size and quality)
    model.save_pretrained_gguf(
        gguf_output_path,
        tokenizer,
        quantization_method="q4_k_m"  # Options: "q4_k_m", "q8_0", "f16"
    )
    print(f"✅ GGUF model exported to {gguf_output_path}")
else:
    print("⚠️ GGUF export requires unsloth. Install it and re-run this cell.")
    print("   You can also use llama.cpp to convert the model manually.")
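
The name of the exported `.gguf` file can differ between Unsloth versions, so it is worth checking what was actually written before pointing a Modelfile at it. The helper below is a small sketch using only the standard library. It assumes the export went to `outputs/gguf`, and the function name `find_gguf_files` is hypothetical.

```python
from pathlib import Path


def find_gguf_files(directory: str) -> list[Path]:
    """Return all .gguf files below `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []  # nothing exported yet, or a different output directory was used
    return sorted(root.rglob("*.gguf"))


# Assumption: the export cell wrote into outputs/gguf
for path in find_gguf_files("outputs/gguf"):
    print(f"{path}  ({path.stat().st_size / 1024**3:.2f} GiB)")
```

If the loop prints nothing, the export either did not run or wrote to another location. If it prints a different filename than the one in the Modelfile's `FROM` line, update that line to match.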

In [None]:
# Create Ollama Modelfile
modelfile_content = '''# Modelfile for Medical LLM
# Created from fine-tuned Llama 3 on medical data

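# Relative FROM paths are resolved against this Modelfile's directory (outputs/).
# The exported filename can vary across Unsloth versions; check outputs/gguf and adjust.
FROM ./gguf/model-q4_k_m.gguf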

# System prompt for medical assistant
SYSTEM """
You are a helpful medical assistant trained to answer medical questions accurately and professionally. 

Important: Always recommend consulting a qualified healthcare professional for actual medical advice. 
The information provided is for educational purposes only and should not be used for self-diagnosis or treatment.
"""

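# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 2048
# Stop at the end-of-turn token so generation ends after the assistant's reply
PARAMETER stop "<|eot_id|>"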

# Template for Llama 3 format
TEMPLATE """<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
'''

modelfile_path = os.path.join(config["training_config"]["output_dir"], "Modelfile")
with open(modelfile_path, 'w') as f:
    f.write(modelfile_content)

print(f"✅ Modelfile created at {modelfile_path}")
print("\n📋 To create an Ollama model, run:")
print(f"   ollama create medical-llama3 -f {modelfile_path}")
print("\n📋 To run the model:")
print("   ollama run medical-llama3")

In [None]:
# Optional: Push model to HuggingFace Hub
# Uncomment to push your model to HuggingFace

# hf_repo_name = f"{config['hugging_face_username']}/{config['model_config']['finetuned_model']}"
# model.push_to_hub(hf_repo_name)      # uploads the LoRA adapter weights
# tokenizer.push_to_hub(hf_repo_name)  # the tokenizer has to be pushed separately
# print(f"✅ Model pushed to HuggingFace: https://huggingface.co/{hf_repo_name}")

## 8. Test the Fine-Tuned Model

Let's test the fine-tuned model with some medical questions to see how well it performs.

In [None]:
# Enable inference mode
if UNSLOTH_AVAILABLE:
    FastLanguageModel.for_inference(model)

# Test questions
test_questions = [
    "What are the symptoms of diabetes?",
    "How is hypertension treated?",
    "What causes pneumonia?",
]

# Generate responses
for question in test_questions:
    # Format the prompt for Llama 3
    prompt = f"""<|start_header_id|>system<|end_header_id|>

Answer the question truthfully, you are a medical professional.<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    
    # Tokenize and generate
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=256, 
            use_cache=True,
            temperature=0.7,
            do_sample=True,
        )
    
    # Decode only the newly generated tokens. Splitting the decoded text on the assistant
    # header would fail, because skip_special_tokens=True strips those header tokens.
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    
    print(f"\n{'='*60}")
    print(f"❓ Question: {question}")
    print(f"{'='*60}")
    print(f"🤖 Answer: {answer[:500]}...")
    print()

## 🎉 Congratulations!

You have successfully fine-tuned an LLM on medical data! Here's what you can do next:

### Using with Ollama

1. Make sure Ollama is installed: https://ollama.ai
2. Create the model:
   ```bash
   ollama create medical-llama3 -f outputs/Modelfile
   ```
3. Run the model:
   ```bash
   ollama run medical-llama3
   ```
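
Besides the interactive `ollama run` command, a running Ollama server can be queried from Python over its local REST API. The sketch below uses only the standard library and assumes the server listens on the default `http://localhost:11434` and that the model was created under the name `medical-llama3`. The helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, model: str = "medical-llama3") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "medical-llama3") -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as reply:
        return json.loads(reply.read())["response"]


# Example (needs the Ollama server running and the model created):
# print(ask("What are the symptoms of diabetes?"))
```

Setting `"stream": False` returns the whole answer as one JSON object, which keeps the client simple at the cost of waiting for generation to finish.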

### Next Steps

- 📊 Train on larger datasets for better performance
- 🔧 Experiment with different LoRA ranks (8, 32, 64)
- 📈 Increase training epochs
- 🧪 Evaluate on medical benchmark datasets
- 🔬 Try different base models (Mistral, Gemma)

### ⚠️ Important Disclaimer

This model is for **educational purposes only**. It should NOT be used for:
- Medical diagnosis
- Treatment recommendations
- Clinical decision making

Always consult qualified healthcare professionals for medical advice.