# Medical Terminology Education Assistant - Fine-Tuning
## Fine-tuning Mistral 7B for Medical Education

**Objective**: Fine-tune Mistral 7B using LoRA/PEFT on synthetic medical terminology dataset

**Hardware Requirements**: GPU with at least 16GB VRAM (T4, V100, A100)

**Key Steps**:
1. Install dependencies
2. Load base model (Mistral 7B)
3. Prepare medical terminology dataset
4. Configure LoRA parameters
5. Train model
6. Evaluate results
7. Export for Ollama deployment

## 1. Install Dependencies

In [1]:
# Install required packages
!pip install -q unsloth
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q trl peft accelerate bitsandbytes datasets

[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m66.6/66.6 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m381.1/381.1 kB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m59.1/59.1 MB[0m [31m37.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m506.8/506.8 kB[0m [31m47.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m423.1/423.1 kB[0m [31m41.9 MB/s[0m eta [36m0:00:00[0

## 2. GPU Check and Setup

In [2]:
import torch
import json

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB" if torch.cuda.is_available() else "")

if not torch.cuda.is_available():
    print("\n‚ö†Ô∏è WARNING: No GPU detected. Please enable GPU in Runtime > Change runtime type > GPU")

CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 85.17 GB


## 3. Load Data Files

Upload your `medical_qa_train.jsonl` and `medical_qa_test.jsonl` files to Colab

In [3]:
from google.colab import files

print("Please upload medical_qa_train.jsonl")
uploaded = files.upload()

print("\nPlease upload medical_qa_test.jsonl")
uploaded_test = files.upload()

print("\n‚úÖ Files uploaded successfully!")

Please upload medical_qa_train.jsonl


Saving medical_qa_train.jsonl to medical_qa_train.jsonl

Please upload medical_qa_test.jsonl


Saving medical_qa_test.jsonl to medical_qa_test.jsonl

‚úÖ Files uploaded successfully!


## 4. Load and Inspect Dataset

In [4]:
import json

# Load training data
train_data = []
with open('medical_qa_train.jsonl', 'r') as f:
    for line in f:
        train_data.append(json.loads(line))

# Load test data
test_data = []
with open('medical_qa_test.jsonl', 'r') as f:
    for line in f:
        test_data.append(json.loads(line))

print(f"Training examples: {len(train_data)}")
print(f"Test examples: {len(test_data)}")

# Show first example
print("\nüìù Sample training example:")
print(json.dumps(train_data[0], indent=2))

# Category distribution
from collections import Counter
categories = Counter([item['category'] for item in train_data])
print("\nüìä Category distribution:")
for cat, count in categories.items():
    print(f"  {cat}: {count}")

Training examples: 65
Test examples: 44

üìù Sample training example:
{
  "instruction": "Explain the medical term: Hypertension",
  "input": "",
  "output": "**Definition**: Hypertension is persistently elevated blood pressure in the arteries.\n\n**Explanation**: Blood pressure is measured as systolic over diastolic pressure (e.g., 120/80 mmHg). Hypertension is typically defined as blood pressure consistently at or above 130/80 mmHg. The heart must work harder to pump blood through vessels with increased resistance, which can damage arteries and organs over time.\n\n**Clinical Relevance**: Hypertension is a major risk factor for heart disease, stroke, and kidney disease. It's often called the \"silent killer\" because it typically has no symptoms until serious complications occur.\n\n**Note**: This is educational information only. Always consult healthcare professionals for medical decisions.",
  "category": "medical_terminology",
  "safety_note": "Educational content only - not for 

## 5. Load Base Model (Mistral 7B)

Using Unsloth for efficient 4-bit quantization and faster training

In [5]:
from unsloth import FastLanguageModel

# Model configuration
model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
max_seq_length = 2048  # Maximum sequence length for medical responses
dtype = None  # Auto-detect optimal dtype
load_in_4bit = True  # Use 4-bit quantization to save memory

print(f"Loading model: {model_name}")
print("This may take a few minutes...\n")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("‚úÖ Model loaded successfully!")

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!
Loading model: unsloth/mistral-7b-instruct-v0.3-bnb-4bit
This may take a few minutes...

==((====))==  Unsloth 2026.1.2: Fast Mistral patching. Transformers: 4.57.3.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.14G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/157 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

‚úÖ Model loaded successfully!


## 6. Add LoRA Adapters

Configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Increased from 16
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,  # Increased from 16
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=False,
    loftq_config=None,
)

print("‚úÖ LoRA adapters added successfully!")

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
print(f"\nTrainable parameters: {trainable_params:,} ({100 * trainable_params / all_params:.2f}% of total)")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2026.1.2 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


‚úÖ LoRA adapters added successfully!

Trainable parameters: 83,886,080 (2.18% of total)


## 7. Format Dataset for Training

Convert JSONL data into instruction-following format for Mistral

In [7]:
from datasets import Dataset

# Mistral instruction template
def format_prompt(example):
    """Format examples into Mistral instruction format"""
    instruction = example['instruction']
    output = example['output']

    # Mistral chat format
    prompt = f"""<s>[INST] {instruction} [/INST]
{output}</s>"""

    return {"text": prompt}

# Format training data
formatted_train = [format_prompt(item) for item in train_data]
train_dataset = Dataset.from_list(formatted_train)

# Format test data
formatted_test = [format_prompt(item) for item in test_data]
test_dataset = Dataset.from_list(formatted_test)

print(f"‚úÖ Formatted {len(train_dataset)} training examples")
print(f"‚úÖ Formatted {len(test_dataset)} test examples")

# Show formatted example
print("\nüìù Sample formatted prompt:")
print(train_dataset[0]['text'][:500] + "...")

‚úÖ Formatted 65 training examples
‚úÖ Formatted 44 test examples

üìù Sample formatted prompt:
<s>[INST] Explain the medical term: Hypertension [/INST]
**Definition**: Hypertension is persistently elevated blood pressure in the arteries.

**Explanation**: Blood pressure is measured as systolic over diastolic pressure (e.g., 120/80 mmHg). Hypertension is typically defined as blood pressure consistently at or above 130/80 mmHg. The heart must work harder to pump blood through vessels with increased resistance, which can damage arteries and organs over time.

**Clinical Relevance**: Hyperten...


## 8. Configure Training Parameters

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Training configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Don't pack sequences for better instruction following
    args=TrainingArguments(
        # Core training parameters
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=20,  # Increased from 10
        num_train_epochs=5,  # Increased from 3
        learning_rate=3e-4,  # Increased from 2e-4

        # Optimization
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",  # Changed from linear

        # Logging and evaluation
        logging_steps=5,
        eval_strategy="steps",
        eval_steps=15,  # More frequent evaluation
        save_strategy="steps",
        save_steps=30,
        save_total_limit=2,

        # Output
        output_dir="outputs",
        report_to="none",

        # Memory optimization
        gradient_checkpointing=True,
        seed=42,
    ),
)

print("‚úÖ Training configuration ready!")

# Extract training arguments for display
args = trainer.args
batch_size = args.per_device_train_batch_size
grad_accum = args.gradient_accumulation_steps
effective_batch = batch_size * grad_accum
num_epochs = int(args.num_train_epochs)
total_steps = (len(train_dataset) * num_epochs) // effective_batch

print(f"\nüìä Training Configuration:")
print(f"  Dataset size: {len(train_dataset)} examples")
print(f"  Epochs: {num_epochs}")
print(f"  Batch size per device: {batch_size}")
print(f"  Gradient accumulation steps: {grad_accum}")
print(f"  Effective batch size: {effective_batch}")
print(f"  Total training steps: ~{total_steps}")
print(f"  Learning rate: {args.learning_rate}")
print(f"  LR scheduler: {args.lr_scheduler_type}")
print(f"  Warmup steps: {args.warmup_steps}")
print(f"  Evaluation every: {args.eval_steps} steps")

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/65 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/44 [00:00<?, ? examples/s]

‚úÖ Training configuration ready!

üìä Training Configuration:
  Dataset size: 65 examples
  Epochs: 5
  Batch size per device: 2
  Gradient accumulation steps: 4
  Effective batch size: 8
  Total training steps: ~40
  Learning rate: 0.0003
  LR scheduler: SchedulerType.COSINE
  Warmup steps: 20
  Evaluation every: 15 steps


## 9. Train the Model

This will take 15-30 minutes depending on your GPU

In [9]:
print("üöÄ Starting training...\n")
print("This will take 15-30 minutes. Monitor the loss curve below.\n")

# Train
trainer_stats = trainer.train()

print("\n‚úÖ Training complete!")
print(f"\nüìä Final training loss: {trainer_stats.training_loss:.4f}")

The model is already on multiple devices. Skipping the move to device specified in `args`.


üöÄ Starting training...

This will take 15-30 minutes. Monitor the loss curve below.



==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 65 | Num Epochs = 5 | Total steps = 45
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 83,886,080 of 7,331,909,632 (1.14% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
15,0.783,0.96582
30,0.3273,1.085843
45,0.1228,1.001463


Unsloth: Not an error, but MistralForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



‚úÖ Training complete!

üìä Final training loss: 0.6179


## 10. Test the Fine-Tuned Model

In [10]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "Explain the medical term: Hypertension",
    "Explain the medical abbreviation: CBC",
    "Explain the disease mechanism: Type 2 Diabetes",
]

print("üß™ Testing fine-tuned model:\n")

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n{'='*80}")
    print(f"Test {i}: {prompt}")
    print('='*80)

    # Format input
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    # Generate response
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

    # Decode and print
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the model's response (after [/INST])
    if "[/INST]" in response:
        response = response.split("[/INST]")[-1].strip()

    print(f"\n{response}\n")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


üß™ Testing fine-tuned model:


Test 1: Explain the medical term: Hypertension

Explain the medical term: Hypertension
**Definition**: Hypertension is persistently elevated blood pressure in the arteries.

**Explanation**: Blood pressure is measured as systolic over diastolic pressure (e.g., 120/80 mmHg). Hypertension is typically defined as blood pressure consistently at or above 130/80 mmHg. The heart must work harder to pump blood through vessels with increased resistance. Over time, this damages arteries and strains the heart, increasing risks of serious complications.

**Clinical Relevance**: Hypertension is a major risk factor for heart disease, stroke, and kidney disease. It's often called the "silent killer" because it typically has no symptoms until serious complications occur.

**Note**: This is educational information only. Always consult healthcare professionals for medical decisions.


Test 2: Explain the medical abbreviation: CBC

Explain the medical abbreviation: CBC
**

## 11. Save Model for Ollama

Export the fine-tuned model in GGUF format for Ollama deployment

In [11]:
print("üíæ Saving model in GGUF format for Ollama...\n")

# Save as GGUF with q4_k_m quantization (good balance of quality and size)
model.save_pretrained_gguf(
    "medical_mistral_gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

print("\n‚úÖ Model saved successfully!")
print("\nSaved files:")
import os
for file in os.listdir("medical_mistral_gguf"):
    file_path = os.path.join("medical_mistral_gguf", file)
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    print(f"  - {file} ({size_mb:.1f} MB)")

üíæ Saving model in GGUF format for Ollama...

Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/724 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00003.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  33%|‚ñà‚ñà‚ñà‚ñé      | 1/3 [00:09<00:19,  9.77s/it]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  67%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã   | 2/3 [00:22<00:11, 11.60s/it]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:35<00:00, 11.75s/it]
Unsloth: Merging weights into 16bit: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3/3 [00:56<00:00, 18.86s/it]


Unsloth: Merge process complete. Saved to `/content/medical_mistral_gguf`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['mistral-7b-instruct-v0.3.BF

## 12. Download Model Files

In [12]:
import os
from google.colab import files

print("üîç Searching for GGUF file...\n")

# Search in multiple possible locations
search_paths = [
    "medical_mistral_gguf",
    ".",  # Current directory
    "/content",
]

gguf_file_path = None

for path in search_paths:
    if os.path.exists(path):
        for root, dirs, file_list in os.walk(path):
            for file in file_list:
                if file.endswith(".gguf"):
                    gguf_file_path = os.path.join(root, file)
                    size_mb = os.path.getsize(gguf_file_path) / (1024 * 1024)
                    print(f"‚úÖ Found: {file}")
                    print(f"   Location: {gguf_file_path}")
                    print(f"   Size: {size_mb:.1f} MB\n")
                    break
            if gguf_file_path:
                break
    if gguf_file_path:
        break

if gguf_file_path:
    print("üì• Downloading GGUF file...")
    print("This will take 3-5 minutes depending on your connection.\n")
    files.download(gguf_file_path)
    print("\nüéâ Download complete!")
else:
    print("‚ùå GGUF file not found anywhere!")
    print("\nLet's check what files were actually created:")
    print("\nFiles in medical_mistral_gguf/:")
    if os.path.exists("medical_mistral_gguf"):
        for f in os.listdir("medical_mistral_gguf"):
            size = os.path.getsize(os.path.join("medical_mistral_gguf", f)) / (1024 * 1024)
            print(f"  - {f} ({size:.1f} MB)")

üîç Searching for GGUF file...

‚úÖ Found: mistral-7b-instruct-v0.3.Q4_K_M.gguf
   Location: ./mistral-7b-instruct-v0.3.Q4_K_M.gguf
   Size: 4170.2 MB

üì• Downloading GGUF file...
This will take 3-5 minutes depending on your connection.



<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>


üéâ Download complete!
