# Fine-tuning LLM

This notebook adapts an LLM to the medical domain through supervised fine-tuning (SFT) on a medical dataset of question-CoT-answer triples. It produces the final SFT model for the project.

*Base model*: [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) - the smallest of the distilled DeepSeek-R1 reasoning models.

*Dataset*: [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) - a dataset of medical question-CoT-answer triples designed for SFT and instruction tuning.

*Produced model*: [MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged](https://huggingface.co/MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged)

### Setup

Install the dependencies, fix the random seed, log in to Hugging Face and Weights & Biases with API keys, and select the training device (CUDA when available).

In [1]:
from IPython.display import clear_output

!pip install unsloth transformers datasets trl torch huggingface-hub wandb scikit-learn bitsandbytes accelerate
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
clear_output(wait=False)

In [2]:
import random
import numpy as np
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

SEED = 4242
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

In [3]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

load_dotenv()
hf_token = os.getenv('HF_TOKEN')

login(hf_token)

  from .autonotebook import tqdm as notebook_tqdm
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [5]:
import wandb

wandb_api = os.getenv('WANDB_API')
wandb.login(key=wandb_api)

run = wandb.init(
    project='Deepseek-R1-Qwen-1.5b sft on medical dataset full 1 epoch v.1',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\milya\_netrc
[34m[1mwandb[0m: Currently logged in as: [33mmiliusha2801[0m ([33mmiliusha2801-innopolis-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [6]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
device

'cuda'

### Model loading and QLoRA setup

Load the base model quantized to 4 bits and set up a LoRA adapter for training.

In [7]:
import unsloth
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch


model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
max_seq_length = 4096
dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-10 15:26:50 [__init__.py:256] Automatically detected platform cuda.


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.0. vLLM: 0.8.0.
   \\   /|    NVIDIA GeForce RTX 4060 Laptop GPU. Num GPUs = 1. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
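
As a rough illustration of why 4-bit loading matters on the 8 GB GPU reported in the banner, the sketch below estimates the size of the weights alone at two precisions. The parameter count is an assumption taken from the "1.5B" in the model name, and the estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Rough size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal size implied by the "1.5B" in the model name
NUM_PARAMS = 1.5e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4 bits per weight

print(f"bf16 weights:  ~{bf16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
```

Under these assumptions the quantized weights take about a quarter of the bf16 footprint, which leaves more room for activations and the LoRA optimizer state.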


In [8]:
print(next(model.parameters()).dtype)

torch.bfloat16


In [9]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=SEED,
    use_rslora=True,
    loftq_config=None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.3.19 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
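
The LoRA settings above can be checked with a little arithmetic. The sketch below computes the adapter scaling factor with and without `use_rslora`, and counts the trainable parameters added by rank-16 adapters on the seven target projections. The layer dimensions (`HIDDEN`, `KV_DIM`, `MLP_DIM`, `NUM_LAYERS`) are assumptions about the Qwen2 1.5B architecture, not values printed in this notebook.

```python
import math

r, lora_alpha = 16, 32  # values from the get_peft_model call

standard_scale = lora_alpha / r           # classic LoRA scaling: alpha / r
rslora_scale = lora_alpha / math.sqrt(r)  # rank-stabilized scaling: alpha / sqrt(r)

# Assumed (in_features, out_features) shapes for the Qwen2 1.5B architecture
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 1536, 256, 8960, 28
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

# Each adapted matrix adds A (r x in) and B (out x r), i.e. r * (in + out) parameters
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = per_layer * NUM_LAYERS

print(f"standard scale = {standard_scale}, rsLoRA scale = {rslora_scale}")
print(f"trainable LoRA parameters = {trainable:,}")
```

With these assumed shapes the count comes to 18,464,768, which agrees with the "Trainable parameters" figure in the trainer banner printed when training starts.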


### Dataset loading and preparation

Load *medical-o1-reasoning-SFT*, a dataset of medical question-CoT-answer triples designed for SFT and instruction tuning, and format it into training prompts.

In [10]:
from datasets import load_dataset

ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[:20000]", trust_remote_code=True)

In [11]:
ds

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 20000
})

In [12]:
ds[0]

{'Question': 'A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?',
 'Complex_CoT': "Okay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdominal pressure like coughing or sneezing. This sounds a lot like stress urinary incontinence to me. Now, it's interesting that she doesn't have any issues at night; she isn't experiencing leakage while sleeping. This likely means her bladder's ability to hold urine is fine when she isn't under physical stress. Hmm, that's a clue that we're dealing with something related to pressure rather than a bladder muscle problem. \n\nThe fact that she underwent a Q-tip test is intriguing too. This 

In [13]:
train_prompt_style = """
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}
"""

In [14]:
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    questions = examples["Question"]
    thoughts = examples["Complex_CoT"]
    responses = examples["Response"]
    texts = []
    for question, thought, response in zip(questions, thoughts, responses):
        text = train_prompt_style.format(question, thought, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

In [15]:
ds_formatted = ds.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=["Question", "Complex_CoT", "Response"]
)
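
To make the batched mapping contract explicit, here is a minimal, dependency-free sketch of the same formatting step. With `batched=True` the function receives a dict of column lists and returns a dict of new column lists. The shortened template, the `<eos>` placeholder, and the toy rows are illustrative assumptions; the notebook itself uses `train_prompt_style` and `tokenizer.eos_token`.

```python
# Shortened template for illustration; the notebook's template also has an instruction block
PROMPT_TEMPLATE = """
### Question:
{}

### Response:
<think>
{}
</think>
{}
"""

EOS = "<eos>"  # hypothetical placeholder for tokenizer.eos_token


def format_batch(batch):
    """Take a dict of column lists and return a dict with one 'text' list."""
    texts = [
        PROMPT_TEMPLATE.format(question, thought, response) + EOS
        for question, thought, response in zip(
            batch["Question"], batch["Complex_CoT"], batch["Response"]
        )
    ]
    return {"text": texts}


# Toy rows, not taken from the dataset
toy_batch = {
    "Question": ["Q1", "Q2"],
    "Complex_CoT": ["reasoning 1", "reasoning 2"],
    "Response": ["A1", "A2"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Appending the end-of-sequence token to every example is what lets the model learn where a response ends.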

In [16]:
ds_formatted[0]["text"]

"\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdominal pressure like coughing or sneezing. This sounds a lot like stress urinary incontinence to me. Now, it's interesting that she doesn't have any issues at night; she isn't experiencing leakage while sleeping. This likely means her bladder's ability to hold urine is fine when she isn't under physic

In [17]:
# Hold out 5% of the formatted examples for evaluation; the seed keeps the split reproducible
ds_splitted = ds_formatted.train_test_split(test_size=0.05, seed=SEED)

In [18]:
ds_splitted

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 19000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})

### Fine-tuning LLM (SFT)

Set up the training hyperparameters for LLM fine-tuning and run the training.

In [19]:
from unsloth import UnslothTrainer, UnslothTrainingArguments, is_bfloat16_supported

finetune_name = "DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged"

training_args = UnslothTrainingArguments(
    output_dir=finetune_name,
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    save_steps=200,
    save_total_limit=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    warmup_steps=300,
    learning_rate=1e-4,
    num_train_epochs=1,
    weight_decay = 0.01,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    tf32=False,
    seed=SEED,
    report_to="wandb",
    hub_model_id=finetune_name,
    gradient_checkpointing=True,
)
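
The batch size, warmup, and scheduler settings above determine the number of optimizer steps and the learning-rate curve. The sketch below works through that arithmetic with the standard library only. `lr_at` is a hypothetical helper that follows the usual shape of a cosine schedule with linear warmup; it is not the trainer's own implementation, and it assumes a single GPU and the 19,000 training examples produced by the split.

```python
import math

NUM_EXAMPLES = 19_000   # size of the training split
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
WARMUP_STEPS = 300
PEAK_LR = 1e-4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM          # single GPU assumed
total_steps = math.ceil(NUM_EXAMPLES / effective_batch)  # optimizer steps in one epoch


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay towards zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size = {effective_batch}, total steps = {total_steps}")
for step in (0, 150, 300, 1000, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

With these settings the warmup covers roughly the first 13% of the run (300 of 2,375 steps).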

In [20]:
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds_splitted["train"],
    eval_dataset=ds_splitted["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=1,
    args=training_args,
)

Unsloth: Tokenizing ["text"]: 100%|██████████| 19000/19000 [00:11<00:00, 1671.44 examples/s]
Unsloth: Tokenizing ["text"]: 100%|██████████| 1000/1000 [00:00<00:00, 1515.36 examples/s]


In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 19,000 | Num Epochs = 1 | Total steps = 2,375
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768/5,000,000,000 (0.37% trained)


Unsloth: Will smartly offload gradients to save VRAM!


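| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 2.0196 | 1.799228 |
| 200  | 1.8044 | 1.773752 |
| 300  | 1.8081 | 1.764513 |
| 400  | 1.7562 | 1.749474 |
| 500  | 1.7685 | 1.738221 |
| 600  | 1.7404 | 1.724447 |
| 700  | 1.7202 | 1.722521 |
| 800  | 1.7295 | 1.701173 |
| 900  | 1.695  | 1.710757 |
| 1000 | 1.7407 | 1.704478 |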


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient
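
The validation losses logged above are easier to interpret as perplexities. The sketch below converts the first and last logged values, assuming the reported loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`.

```python
import math

# (step -> validation loss) pairs copied from the training log
val_loss = {100: 1.799228, 1000: 1.704478}

# Perplexity is the exponential of the mean per-token cross-entropy
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: validation perplexity ~ {ppl:.2f}")
```

Under that assumption, validation perplexity falls from about 6.0 to about 5.5 over the logged steps.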


### Saving the model to the HF Hub

Load the model from the last training checkpoint, merge the LoRA adapter into the base weights, and push the result to the Hugging Face Hub so that it can be loaded directly for inference.

In [None]:
import unsloth
from unsloth import FastLanguageModel
import torch

checkpoint_path = "./DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged/checkpoint-2375"
output_hub_model_name = "MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged"
max_seq_length = 4096
dtype = None
load_in_4bit = False


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=checkpoint_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)


In [2]:
model = model.merge_and_unload()

model.push_to_hub(output_hub_model_name)
tokenizer.push_to_hub(output_hub_model_name)

100%|██████████| 1/1 [05:03<00:00, 303.38s/it]


Saved model to https://huggingface.co/MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-msft-merged-unsloth


tokenizer.json: 100%|██████████| 11.4M/11.4M [00:01<00:00, 6.86MB/s]
100%|██████████| 1/1 [00:01<00:00,  1.91s/it]
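
Conceptually, merging a LoRA adapter folds the low-rank update into the dense weight, so the adapter is no longer needed at inference time. The sketch below illustrates the idea on a toy 2x2 matrix with plain Python lists. The helper names, the matrices, and the scale are made up for illustration; this is not the library's implementation of `merge_and_unload`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, scale):
    """Return W + scale * (B @ A): the dense weight with the adapter folded in."""
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy base weight with a rank-1 adapter (all numbers are illustrative)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]    # shape (r, in_features) = (1, 2)
B = [[2.0], [4.0]]   # shape (out_features, r) = (2, 1)
scale = 2.0

x = [[1.0], [2.0]]   # input as a column vector

# Path 1: base output plus scaled adapter output, computed separately
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

# Path 2: a single matmul with the merged weight
W_merged = merge_lora(W, A, B, scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths give the same output, which is why the merged model can be served without the adapter.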


### Check fine-tuned LLM inference

Verify that the upload to the HF Hub succeeded by loading the model from the Hub and running inference.

In [1]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

In [None]:
from unsloth import FastLanguageModel

model_name = "MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=4096,
    load_in_4bit=True,
)


In [3]:
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)
model = FastLanguageModel.for_inference(model)

In [4]:
prompt = """
You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer.
3. Output the final answer in the format: Answer: [Option Letter]

Here is a question: Which vitamin is supplied from only animal source?
A. Vitamin C
B. Vitamin B7
C. Vitamin B12
D. Vitamin D

Reasoning:
"""
inputs = tokenizer([prompt], return_tensors="pt", padding=True, truncation=True).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=4096)

First, I need to figure out which vitamins are animal sources and which are plant sources.

Vitamin C is an animal source. It's produced by plants and can be found in animal foods like meat and eggs. So, this one's not an animal source.

Vitamin B7 is a plant source. It's primarily found in vegetables and fruits. I know it's not an animal source.

Vitamin B12 is an animal source. It's mainly produced in the liver and can be found in animal products like meat. So, it's an animal source.

Vitamin D is also an animal source. It's produced by the liver and can be found in animal products. So, it's an animal source too.

So, based on this, the only plant source among the options is Vitamin B12. Therefore, the correct answer should be C.

Let me double-check. Yes, Vitamin C comes from plants, Vitamin B7 from plants, Vitamin D from animals, and Vitamin B12 from animals. So, yeah, Vitamin B12 is definitely the only plant source. That checks out.
</think>
The correct answer is C. Vitamin B12.



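The correct answer is C (Vitamin B12).

The fine-tuned LLM selected the correct option, but its reasoning is contradictory and factually wrong. It calls Vitamin C an animal source while also saying it is produced by plants, and it describes Vitamin B12 as "the only plant source" even though it concludes that B12 is the vitamin supplied only by animal sources.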
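
The prompt asks for `Answer: [Option Letter]`, but the model replied with "The correct answer is C." instead, so an automated check would need to accept both phrasings. Below is a minimal sketch of such an extractor. `extract_option` and its patterns are hypothetical helpers, not part of this notebook's pipeline, and they cover only the two phrasings seen here.

```python
import re
from typing import Optional

# Patterns for the requested format and for the phrasing the model actually produced
PATTERNS = [
    r"(?i:answer)\s*:\s*\[?([A-D])\b",       # e.g. "Answer: [C]" or "Answer: C"
    r"(?i:answer\s+is)\s*:?\s*\[?([A-D])\b",  # e.g. "The correct answer is C."
]


def extract_option(generated: str) -> Optional[str]:
    """Return the chosen option letter, or None if no pattern matches."""
    # Look only at the text after the reasoning block, if one is present
    answer_part = generated.split("</think>")[-1]
    for pattern in PATTERNS:
        match = re.search(pattern, answer_part)
        if match:
            return match.group(1)
    return None


sample = "...reasoning...\n</think>\nThe correct answer is C. Vitamin B12."
print(extract_option(sample))
```

Comparing the extracted letter with the reference option checks only the final answer; it says nothing about the quality of the reasoning.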