<a href="https://colab.research.google.com/github/AnshPunia26/Mistral_7B_medical_agent/blob/main/llama_medical_finetuning_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mistral 7B Medical Q&A Fine-tuning on Google Colab

This notebook fine-tunes a Mistral 7B model on medical question-answering data using LoRA (Low-Rank Adaptation) and automatically pushes it to Hugging Face Hub. Optimized for A100 GPU.

## Setup Instructions:
1. **Get Hugging Face Token**: Visit https://huggingface.co/settings/tokens and create a token with "write" permissions
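2. **Enable GPU**: Runtime → Change runtime type → GPU. An A100 is recommended; the batch size, sequence length, and bf16 settings below assume one, so a smaller GPU such as a T4 will likely need reduced values and fp16 instead of bf16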
3. Upload your `mistral_fine_tune_format.jsonl` file to Colab (or use the provided upload cell)
4. Run all cells in order
5. The fine-tuned model will be automatically pushed to your Hugging Face account

## What You'll Need:
- Hugging Face account (free at https://huggingface.co)
- Hugging Face token with write permissions
- Your `mistral_fine_tune_format.jsonl` data file (already in Mistral chat format)


## Step 1: Install Dependencies and Setup Hugging Face


In [None]:
%pip install -q transformers datasets accelerate peft bitsandbytes sentencepiece tensorboard huggingface_hub

### Login to Hugging Face

You need a Hugging Face token to push the model. Get one from: https://huggingface.co/settings/tokens


In [None]:
from huggingface_hub import login, HfApi
import getpass

# Login to Hugging Face
print("Please enter your Hugging Face token:")
print("Get your token from: https://huggingface.co/settings/tokens")
hf_token = getpass.getpass("Hugging Face Token: ")

# Login
login(token=hf_token)
print("✓ Successfully logged in to Hugging Face!")

# Set your Hugging Face username (for model repository name)
HF_USERNAME = input("Enter your Hugging Face username: ").strip()
MODEL_REPO_NAME = f"{HF_USERNAME}/mistral-7b-medical-qa-finetuned"
print(f"Model will be pushed to: {MODEL_REPO_NAME}")


Please enter your Hugging Face token:
Get your token from: https://huggingface.co/settings/tokens
Hugging Face Token: ··········
✓ Successfully logged in to Hugging Face!
Enter your Hugging Face username: anshpunia8597
Model will be pushed to: anshpunia8597/mistral-7b-medical-qa-finetuned


## Step 2: Upload Data File

Upload your `mistral_fine_tune_format.jsonl` file. If you already have it in your Google Drive, you can mount Drive instead.


In [None]:
from google.colab import files
import os

# Create data directory
os.makedirs('data', exist_ok=True)

# Upload file
print("Please upload your mistral_fine_tune_format.jsonl file:")
uploaded = files.upload()

# Move to data directory if needed
for filename in uploaded.keys():
    if filename.endswith('.jsonl'):
        os.rename(filename, f'data/{filename}')
        print(f"✓ File saved to: data/{filename}")


Please upload your mistral_fine_tune_format.jsonl file:


Saving mistral_fine_tune_format.jsonl to mistral_fine_tune_format.jsonl
✓ File saved to: data/mistral_fine_tune_format.jsonl
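
Step 2 mentions mounting Google Drive as an alternative to uploading. A minimal sketch of that route follows, assuming the file sits at the top level of My Drive. The `DRIVE_FILE` path and the `stage_dataset` helper are illustrative and not part of the original notebook; the mount only runs when `google.colab` is importable.

```python
import shutil
from pathlib import Path


def stage_dataset(source_path, data_dir="data"):
    """Copy a JSONL file into data_dir and return the new path."""
    source = Path(source_path)
    if source.suffix != ".jsonl":
        raise ValueError(f"Expected a .jsonl file, got: {source.name}")
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy(source, target)
    return target


try:
    from google.colab import drive  # only importable inside Colab
except ImportError:
    drive = None

# Hypothetical location: adjust to wherever the file lives in your Drive.
DRIVE_FILE = "/content/drive/MyDrive/mistral_fine_tune_format.jsonl"

if drive is not None:
    drive.mount("/content/drive")
    print(f"✓ File copied to: {stage_dataset(DRIVE_FILE)}")
```

Copying into `data/` keeps the file where the configuration cell in Step 3 looks for `*.jsonl` files first.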


## Step 3: Fine-tuning Configuration

**Note:** The dataset should already be in Mistral format (messages with user/assistant roles). If you need to convert data, use the conversion script first.
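
Before training, it can help to confirm that each line really has the expected shape. The sketch below assumes one JSON object per line with a `messages` list whose entries alternate `user` and `assistant` roles, which is what the preprocessing code in Step 6 reads. The `validate_record` helper and the sample strings are made up for illustration, and the strict alternation rule is an assumption you may want to relax.

```python
import json


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        # Assumed convention: even positions are user turns, odd are assistant
        expected = "user" if i % 2 == 0 else "assistant"
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") != expected:
            problems.append(
                f"message {i}: expected role '{expected}', got {msg.get('role')!r}"
            )
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or missing content")
    if len(messages) % 2 != 0:
        problems.append("conversation does not end with an assistant message")
    return problems


# Made-up sample lines, not taken from the real dataset
good = '{"messages": [{"role": "user", "content": "example question"}, {"role": "assistant", "content": "example answer"}]}'
bad = '{"messages": [{"role": "assistant", "content": "example answer"}]}'
print(validate_record(good))
print(validate_record(bad))
```

Running this over every line of the file and printing only the non-empty results gives a quick list of records to fix or drop.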


In [None]:
# Configuration optimized for A100 GPU
import torch

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # Mistral 7B Instruct model

# Find the uploaded file (should be mistral_fine_tune_format.jsonl)
from pathlib import Path
data_files = list(Path('data').glob('*.jsonl'))
if not data_files:
    data_files = list(Path('.').glob('*.jsonl'))

if data_files:
    DATA_PATH = str(data_files[0])
    print(f"✓ Found dataset: {DATA_PATH}")
else:
    DATA_PATH = "data/mistral_fine_tune_format.jsonl"
    print(f"⚠ Dataset not found, will use: {DATA_PATH}")

OUTPUT_DIR = "./mistral_medical_finetuned"
MAX_SEQ_LENGTH = 2048  # Larger context for A100 GPU
BATCH_SIZE = 8  # Larger batch size for A100 (can go up to 16-32)
GRADIENT_ACCUMULATION_STEPS = 4  # Effective batch size = 8 * 4 = 32
NUM_EPOCHS = 3
LEARNING_RATE = 5e-5  # Slightly lower for larger model

# LoRA configuration (Medium settings)
LORA_R = 64  # Medium rank (the high setting used 128)
LORA_ALPHA = 128  # Typically 2x the rank (the high setting used 256)
LORA_DROPOUT = 0.05

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


✓ Found dataset: data/mistral_fine_tune_format.jsonl
PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
GPU Memory: 42.47 GB


## Step 4: Load Model and Tokenizer

**Note:** Make sure you've logged in to Hugging Face in Step 1 before proceeding.


In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from datasets import load_dataset

print(f"Loading model: {MODEL_NAME}")
print("Using 4-bit quantization for memory efficiency...")

# Configure tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right",
)

# Set pad token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use bfloat16 on A100 for better stability
    bnb_4bit_use_double_quant=True,
)

# Load model with quantization
print("Loading model with 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically place on GPU
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # Use bfloat16 on A100
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

print("✓ Model loaded successfully!")


Loading model: mistralai/Mistral-7B-Instruct-v0.2
Using 4-bit quantization for memory efficiency...


Loading model with 4-bit quantization...


`torch_dtype` is deprecated! Use `dtype` instead!


✓ Model loaded successfully!


## Step 5: Setup LoRA


In [None]:
print("Setting up LoRA for efficient fine-tuning...")

# Get target modules for Mistral model
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=target_modules,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("✓ LoRA setup complete!")


Setting up LoRA for efficient fine-tuning...
trainable params: 167,772,160 || all params: 7,409,504,256 || trainable%: 2.2643
✓ LoRA setup complete!
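
The trainable-parameter count printed above can be checked by hand. For a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices of shape `r x in_features` and `out_features x r`. The sketch below assumes the dimensions from the published Mistral 7B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); these values are not stated in the notebook itself.

```python
# Assumed Mistral 7B dimensions (from the published model config)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
RANK = 64  # same value as LORA_R in the notebook

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"Per layer: {per_layer:,}")
print(f"Total trainable: {total:,}")  # 167,772,160
```

The total matches the `trainable params` figure in the output, and dividing it by the reported 7,409,504,256 parameters gives the same 2.2643%.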


## Step 6: Define Helper Functions for Dataset


In [None]:
def preprocess_function(examples, tokenizer):
    """Preprocess the dataset for training using Mistral chat template."""
    # Format prompts from messages using tokenizer's chat template
    prompts = []
    for messages_list in examples["messages"]:
        # Use the tokenizer's chat template to format messages
        if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template:
            formatted = tokenizer.apply_chat_template(
                messages_list,
                tokenize=False,
                add_generation_prompt=False
            )
        else:
            # Fallback: Manual formatting for Mistral
            # Mistral format: [INST] instruction [/INST] response
            if len(messages_list) >= 2:
                user_msg = messages_list[0].get('content', '')
                assistant_msg = messages_list[1].get('content', '')
                formatted = f"[INST] {user_msg} [/INST] {assistant_msg}"
            else:
                formatted = ""
        prompts.append(formatted)

    # Tokenize with padding=False (we'll pad in the collator)
    # DataCollatorForLanguageModeling will automatically create labels from input_ids
    model_inputs = tokenizer(
        prompts,
        max_length=MAX_SEQ_LENGTH,
        truncation=True,
        padding=False,
        return_tensors=None,  # Return Python lists, not tensors
    )

    return model_inputs

def load_and_prepare_dataset(data_path: str, tokenizer):
    """Load and prepare the dataset."""
    print(f"Loading dataset from {data_path}...")

    # Load JSONL file
    dataset = load_dataset("json", data_files=data_path, split="train")

    print(f"Dataset loaded: {len(dataset)} examples")

    # Preprocess
    print("Preprocessing dataset...")
    dataset = dataset.map(
        lambda examples: preprocess_function(examples, tokenizer),
        batched=True,
        remove_columns=dataset.column_names,
    )

    # Split into train/validation (90/10)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)

    print(f"Train examples: {len(dataset['train'])}")
    print(f"Validation examples: {len(dataset['test'])}")

    return dataset
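
The preprocessing function returns only token ids and relies on the collator to build labels. As a rough illustration of what that means with `mlm=False`, here is a simplified pure-Python imitation, not the library's actual code. The `collate_causal_lm` name and the small token ids are made up; the pad id of 2 matches the `eos_token_id` reported in the generation logs.

```python
def collate_causal_lm(batch, pad_token_id, pad_to_multiple_of=8):
    """Right-pad sequences to a shared length and derive labels from input_ids."""
    longest = max(len(ids) for ids in batch)
    # Round the target length up to the next multiple of pad_to_multiple_of
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = target - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Any position holding the pad id gets -100, which the loss ignores
        labels.append([-100 if tok == pad_token_id else tok for tok in padded])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Two toy sequences that both end in token 2 (used here as EOS and as pad)
out = collate_causal_lm([[1, 10, 11, 2], [1, 10, 2]], pad_token_id=2)
print(out["labels"][0])
```

One consequence worth checking in your own run: because this notebook sets `pad_token` to `eos_token`, genuine end-of-sequence tokens share the pad id and are masked out of the loss in this sketch as well. If generations tend to run until `max_new_tokens`, a dedicated pad token may be worth trying.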


## Step 7: Load and Prepare Dataset


In [None]:
dataset = load_and_prepare_dataset(DATA_PATH, tokenizer)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]


Loading dataset from data/mistral_fine_tune_format.jsonl...


Dataset loaded: 16407 examples
Preprocessing dataset...


Train examples: 14766
Validation examples: 1641


## Step 8: Setup Training


In [None]:
# Training arguments optimized for A100 GPU
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    bf16=True,  # Use bfloat16 on A100 for better performance and stability
    logging_steps=10,
    eval_steps=100,
    save_steps=500,
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    report_to="tensorboard",
    remove_unused_columns=False,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    save_total_limit=3,  # Keep last 3 checkpoints
    dataloader_pin_memory=True,  # Faster data loading on GPU
    gradient_checkpointing=True,  # Save memory on A100
    optim="paged_adamw_8bit",  # Use 8-bit optimizer for memory efficiency
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
    pad_to_multiple_of=8,  # Pad to multiple of 8 for efficiency
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

print("✓ Training setup complete!")
# Rough estimate: the exact count depends on how Trainer rounds partial batches
print(f"Total training steps: {len(train_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS) * NUM_EPOCHS}")


✓ Training setup complete!
Total training steps: 1383
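
The figure above comes from integer division of the dataset size by the effective batch size. A slightly more careful estimate counts batches first and then groups them by accumulation steps. Whether the final, partial accumulation group counts as an optimizer step is an assumption that may differ between library versions, so the hypothetical `estimate_steps` helper below returns both bounds.

```python
import math


def estimate_steps(num_examples, batch_size, grad_accum, epochs):
    """Return (low, high) estimates of total optimizer steps."""
    # The last batch of an epoch may be smaller, but it is still a batch
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # low: a trailing partial accumulation group is dropped
    low = (batches_per_epoch // grad_accum) * epochs
    # high: a trailing partial accumulation group counts as one step
    high = math.ceil(batches_per_epoch / grad_accum) * epochs
    return low, high


print(estimate_steps(14766, 8, 4, 3))  # (1383, 1386)
```

With 14,766 training examples, a batch size of 8, 4 accumulation steps, and 3 epochs, the lower bound reproduces the 1383 printed by the notebook.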


## Step 9: Start Training


In [None]:
print("=" * 80)
print("Starting training...")
print("=" * 80)

# Train
trainer.train()

# Save final model
print(f"\nSaving final model to {OUTPUT_DIR}...")
trainer.save_model()
tokenizer.save_pretrained(OUTPUT_DIR)

print("\n" + "=" * 80)
print("Training complete!")
print("=" * 80)


Starting training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
100,0.856,0.834063
200,0.7947,0.78983
300,0.7812,0.763817
400,0.7533,0.747624
500,0.6162,0.749088
600,0.5828,0.738834
700,0.582,0.727691
800,0.5931,0.724152
900,0.533,0.713815
1000,0.4093,0.75662



Saving final model to ./mistral_medical_finetuned...

Training complete!


## Step 10: Push Model to Hugging Face Hub


In [None]:
from huggingface_hub import HfApi, create_repo
import os

# Create repository on Hugging Face (if it doesn't exist)
api = HfApi()
try:
    create_repo(
        repo_id=MODEL_REPO_NAME,
        token=hf_token,
        private=False,  # Set to True if you want a private repo
        repo_type="model",
        exist_ok=True,
    )
    print(f"✓ Repository created/verified: {MODEL_REPO_NAME}")
except Exception as e:
    print(f"Note: {e}")

# Push model to Hugging Face Hub
print(f"\nPushing model to Hugging Face Hub: {MODEL_REPO_NAME}")
print("This may take a few minutes...")

# Push the model
api.upload_folder(
    folder_path=OUTPUT_DIR,
    repo_id=MODEL_REPO_NAME,
    token=hf_token,
    repo_type="model",
)

print(f"\n✓ Model successfully pushed to Hugging Face!")
print(f"View your model at: https://huggingface.co/{MODEL_REPO_NAME}")


## Step 11: Download Fine-tuned Model (Optional - Local Backup)
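
One possible way to take a local backup is to zip the output directory and download the archive through the browser. This is a sketch under assumptions: the `archive_model` helper and the archive name are illustrative, and the download only runs inside Colab when the output directory from Step 9 exists.

```python
import shutil
from pathlib import Path


def archive_model(output_dir, archive_name="mistral_medical_finetuned"):
    """Zip the contents of output_dir and return the archive path."""
    if not Path(output_dir).is_dir():
        raise FileNotFoundError(f"Model directory not found: {output_dir}")
    # make_archive appends the .zip extension and returns the full path
    return shutil.make_archive(str(archive_name), "zip", root_dir=output_dir)


try:
    from google.colab import files  # only importable inside Colab
except ImportError:
    files = None

MODEL_DIR = "./mistral_medical_finetuned"  # same value as OUTPUT_DIR

if files is not None and Path(MODEL_DIR).is_dir():
    zip_path = archive_model(MODEL_DIR)
    files.download(zip_path)
```

Since `save_total_limit=3` keeps intermediate checkpoints inside the same directory, the archive can be large; deleting the `checkpoint-*` subfolders first is an option if only the final adapter is needed.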


## Step 12: Test the Fine-tuned Model (Optional)


In [None]:
from peft import PeftModel

# Option 1: Load from local directory
# base_model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     quantization_config=bnb_config,
#     device_map="auto",
#     torch_dtype=torch.bfloat16,
# )
# model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
# tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

# Option 2: Load directly from Hugging Face Hub (recommended)
print(f"Loading model from Hugging Face: {MODEL_REPO_NAME}")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, MODEL_REPO_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO_NAME)

# Test with a medical question
test_question = "What is diabetes?"
messages = [
    {"role": "user", "content": test_question}
]

# Use tokenizer's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Question:", test_question)
print("\nResponse:")
print(response)


In [None]:
test_question = "i have smoking addication, how can i detox my lungs"
messages = [
    {"role": "user", "content": test_question}
]

# Use tokenizer's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Question:", test_question)
print("\nResponse:")
print(response)



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: i have smoking addication, how can i detox my lungs

Response:
[INST] i have smoking addication, how can i detox my lungs [/INST] It is not possible to detox your lungs. Smoking does not cause toxic substances to build up in the lungs. Within 20 minutes after you quit smoking, your heart rate and blood pressure decrease. After 8 hours, the carbon monoxide level in your blood drops to normal. After 12 hours, the level of carbon monoxide in your blood drops to normal. After 24 hours, your lungs start to clear out mucus and other smoke components. After 48 hours, breathing becomes easier and there is less coughing because your lung's cilia (cells that line the airways) start to regain their ability to sweep mucus out of the airways. After 72 hours, your body is better able to fight off infections because your immune system is working better. After 3 months, your circulation begins to improve. After 9 months, you may notice less shortness of breath. After a


In [None]:
import markdown
test_question = "I have severe headache, i also put in long hours at work?"
system_prompt = """You are a medical diagnostics agent. Your job is to be empathetic to the patient and respond according to the question asked. Answer in Markdown, using at most 5 bullet points."""

messages = [
    {"role":"system", "content":system_prompt},
    {"role": "user", "content": test_question}
]

# Use tokenizer's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Question:", test_question)
print("\n\nResponse:")
print(response)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: I have skin rash after i ate something new, what are the possible causes?

Response:
[INST]  You are a medical diagnostics agent, your job is to be empethetic to the paithent and respond according to the question asked

I have skin rash after i ate something new, what are the possible causes? [/INST] I'm sorry to hear that you're having a reaction to a food you recently ate. Many foods can cause skin rashes, including nuts, seafood, dairy products, and wheat. Sometimes, fruits and vegetables, such as tomatoes and potatoes, can also cause reactions. Food additives, such as preservatives and artificial colors, may also cause problems. In some cases, the cause of a food-related skin rash is not known. 

If you have a food allergy, eating the food that you are allergic to can cause an immune response. This can lead to hives, eczema, or other skin problems. Food allergies are most common in children, but they can occur at any age. 

Food intolerances, which are more common than fo

In [None]:
import markdown

test_question = "I have severe headache, i also put in long hours at work?"

system_prompt = """
You are a medical diagnostics agent.
- Be empathetic
- Answer strictly in Markdown
- Use bullet points only
- Maximum 5 bullet points
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": test_question}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.1,
    do_sample=True
)

# ✅ Extract ONLY assistant output
generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("Markdown Response:\n")
print(response)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Markdown Response:

- You may be experiencing stress. 
- You may be dehydrated. 
- You may have a migraine. 
- You may have a tension headache. 
- You should consult a doctor.
