# Instruction-tuning LLM (or SFT, part of Multi-step fine-tuning)

TO_DO: rewrite this

(This notebook adapts the previously continually pretrained LLM (DeepSeek-R1-Distill-Qwen-1.5B-medical-continual-pretrain-merged-f32) to the medical domain through instruction tuning (SFT with Unsloth) on a medical dataset of question-CoT-answer triples. It produces the final multi-step fine-tuned model for the project.)

### Setup

In [1]:
from IPython.display import clear_output

!pip install unsloth transformers datasets trl torch huggingface-hub wandb scikit-learn bitsandbytes accelerate
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
clear_output(wait=False)

In [2]:
import random
import numpy as np
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

SEED = 4242
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

In [3]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

load_dotenv()
hf_token = os.getenv('HF_TOKEN')

login(hf_token)

  from .autonotebook import tqdm as notebook_tqdm
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [None]:
import wandb

wandb_api = os.getenv('WANDB_API')
wandb.login(key=wandb_api)

run = wandb.init(
    project='Deepseek-R1-Qwen-1.5b continual pretrain on medical dataset, full 1 epoch v.0',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: C:\Users\milya\_netrc
[34m[1mwandb[0m: Currently logged in as: [33mmiliusha2801[0m ([33mmiliusha2801-innopolis-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [4]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
device

'cuda'

### Model loading and QLoRA setup

In [None]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch


model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
max_seq_length = 4096
dtype = torch.bfloat16 if is_bfloat16_supported() else torch.float16
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    NVIDIA GeForce RTX 4060 Laptop GPU. Num GPUs = 1. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We add `embed_tokens` and `lm_head` to the target modules so that the model can also update its input embeddings and output head, which helps it learn out-of-distribution data.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=SEED,
    use_rslora=True,
    loftq_config=None,
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
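
As a rough sanity check on the adapter size, the sketch below counts trainable parameters under two assumptions that the notebook does not state: each LoRA-wrapped linear layer adds `r * (in_features + out_features)` weights, and `embed_tokens` and `lm_head` are trained in full rather than through low-rank factors. The layer shapes are copied from the `Qwen2ForCausalLM` printout in this notebook; the helper name `lora_params` is illustrative.

```python
# (in_features, out_features) of the linear layers in one decoder block,
# copied from the Qwen2ForCausalLM printout in this notebook.
LAYER_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}
NUM_LAYERS = 28
VOCAB_SIZE, HIDDEN_SIZE = 151936, 1536
RANK = 8  # r passed to get_peft_model


def lora_params(in_features, out_features, r):
    """Weights in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


adapter_params = NUM_LAYERS * sum(
    lora_params(i, o, RANK) for i, o in LAYER_SHAPES.values()
)
# Assumption: embed_tokens and lm_head are full trainable matrices.
embedding_params = 2 * VOCAB_SIZE * HIDDEN_SIZE
total_trainable = adapter_params + embedding_params

print(f"LoRA adapters:          {adapter_params:,}")
print(f"embed_tokens + lm_head: {embedding_params:,}")
print(f"total trainable:        {total_trainable:,}")
```

Under these assumptions the total comes to 475,979,776, the same figure that the training banner reports, and nearly all of it belongs to the two embedding matrices rather than to the rank-8 adapters.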


### Datasets loading and preparation

In [7]:
from datasets import load_dataset

ds_textbooks = load_dataset("MedRAG/textbooks")
ds_statpearls = load_dataset("MilyaShams/MedRAG_statpearls")

In [8]:
ds_textbooks

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'content', 'contents'],
        num_rows: 125847
    })
})

In [9]:
ds_textbooks['train'][0]

{'id': 'Anatomy_Gray_0',
 'title': 'Anatomy_Gray',
 'content': 'What is anatomy? Anatomy includes those structures that can be seen grossly (without the aid of magnification) and microscopically (with the aid of magnification). Typically, when used by itself, the term anatomy tends to mean gross or macroscopic anatomy—that is, the study of structures that can be seen without using a microscopic. Microscopic anatomy, also called histology, is the study of cells and tissues using a microscope. Anatomy forms the basis for the practice of medicine. Anatomy leads the physician toward an understanding of a patient’s disease, whether he or she is carrying out a physical examination or using the most advanced imaging techniques. Anatomy is also important for dentists, chiropractors, physical therapists, and all others involved in any aspect of patient treatment that begins with an analysis of clinical signs. The ability to interpret a clinical observation correctly is therefore the endpoint of

In [10]:
ds_statpearls

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'content', 'contents'],
        num_rows: 334231
    })
})

In [11]:
ds_statpearls['train'][0]

{'id': 'statpearls_NBK430685\\article-100024_0',
 'title': 'Chronic Total Occlusion of the Coronary Artery -- Continuing Education Activity',
 'content': "Chronic total occlusion (CTO) lesions are diagnosed in patients who are undergoing coronary angiography as part of the evaluation of ischemic heart disease, cardiomyopathy, or valvular heart disease. CTO revascularization has not shown benefit in rates of all-cause mortality, myocardial infarction, stroke, and repeat revascularization and is commonly performed to improve a patient's quality of life by reducing their angina symptoms. This activity reviews the evaluation and treatment of chronic total occlusion of the coronary artery and highlights the role of the interprofessional team in evaluating and treating this condition.",
 'contents': "Chronic Total Occlusion of the Coronary Artery -- Continuing Education Activity. Chronic total occlusion (CTO) lesions are diagnosed in patients who are undergoing coronary angiography as part o

In [12]:
from datasets import concatenate_datasets

ds = concatenate_datasets([ds_textbooks['train'], ds_statpearls['train']]).shuffle(seed=SEED)
ds

Dataset({
    features: ['id', 'title', 'content', 'contents'],
    num_rows: 460078
})

In [13]:
ds[0]

{'id': 'statpearls_NBK430685\\article-31230_28',
 'title': 'Vitamin K Deficiency -- Treatment / Management',
 'content': 'Treatment of neonatal VKDB: The treatment typically involves administering 1 to 2 mg of vitamin K1 via slow IV or subcutaneous infusion. In cases of severe bleeding, fresh frozen plasma may be required at a dosage of 10 to 15 mL/kg. [14]',
 'contents': 'Vitamin K Deficiency -- Treatment / Management. Treatment of neonatal VKDB: The treatment typically involves administering 1 to 2 mg of vitamin K1 via slow IV or subcutaneous infusion. In cases of severe bleeding, fresh frozen plasma may be required at a dosage of 10 to 15 mL/kg. [14]'}

In [14]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    contents = examples["contents"]
    outputs = []
    for content in contents:
        text = content + EOS_TOKEN
        outputs.append(text)
    return {"text" : outputs}
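
To see what `formatting_prompts_func` produces, here is a compact equivalent run on a two-row batch. The `<EOS>` string is a placeholder standing in for `tokenizer.eos_token`, and the sample texts are made up for illustration.

```python
EOS_TOKEN = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token


def formatting_prompts_func(examples):
    # With batched=True, `examples` maps each column name to a list of values.
    return {"text": [content + EOS_TOKEN for content in examples["contents"]]}


# Hypothetical mini-batch with the same "contents" column as the real dataset.
batch = {"contents": ["Title A. Body A.", "Title B. Body B."]}
formatted = formatting_prompts_func(batch)
print(formatted["text"])
```

Appending the end-of-sequence token to every passage lets the model learn where one document stops, so generation can terminate instead of running on.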

In [15]:
ds = ds.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=["id", "title", "content", "contents"]
)

In [16]:
ds

Dataset({
    features: ['text'],
    num_rows: 460078
})

In [17]:
# train_test_split is a Dataset method, so no extra import is needed
ds = ds.train_test_split(test_size=0.05, seed=SEED)
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 437074
    })
    test: Dataset({
        features: ['text'],
        num_rows: 23004
    })
})
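
The row counts printed so far can be reproduced with a few lines of arithmetic. This is a sketch that assumes the test split is rounded up and the train split receives the remainder; the variable names are illustrative.

```python
import math

n_textbooks = 125847   # rows in MedRAG/textbooks
n_statpearls = 334231  # rows in MilyaShams/MedRAG_statpearls
n_total = n_textbooks + n_statpearls

test_size = 0.05
# Assumption: the test share is rounded up, the rest goes to train.
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)
```

The result matches the `DatasetDict` above: 460078 rows in total, 437074 for training and 23004 for evaluation.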

### Continual pretraining

Set `embedding_learning_rate` 2x to 10x smaller than `learning_rate` to make continual pretraining work! Here it is 5e-6, half of the 1e-5 `learning_rate`.

In [18]:
from unsloth import UnslothTrainer, UnslothTrainingArguments, is_bfloat16_supported


training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_ratio=0.05,
    num_train_epochs=1,
    learning_rate=1e-5,
    embedding_learning_rate=5e-6,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps=1000,
    save_steps=1000,
    save_total_limit=5,
    eval_strategy="steps",
    eval_steps=1000,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=SEED,
    report_to="wandb",
)
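
To get a feel for what `warmup_ratio=0.05` and `lr_scheduler_type="cosine"` imply, the sketch below evaluates a linear-warmup, cosine-decay curve by hand. It assumes the usual shape of that schedule, assumes both learning rates follow the same curve, and takes the total of 27,317 steps from the training banner in this notebook. The function name `cosine_lr` is illustrative, not a library call.

```python
import math

TOTAL_STEPS = 27317   # from the training banner in this notebook
WARMUP_RATIO = 0.05
LEARNING_RATE = 1e-5
EMBEDDING_LEARNING_RATE = 5e-6


def cosine_lr(step, base_lr, total_steps=TOTAL_STEPS, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)
for step in (0, warmup_steps, 10000, TOTAL_STEPS):
    main = cosine_lr(step, LEARNING_RATE)
    emb = cosine_lr(step, EMBEDDING_LEARNING_RATE)
    print(f"step {step:>6}: lr={main:.3e}  embedding lr={emb:.3e}")
```

Under these assumptions the warmup lasts 1,366 steps, and the embedding learning rate stays at exactly half of the main one throughout the run.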

In [19]:
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=1,
    args=training_args,
)

Unsloth: Tokenizing ["text"]: 100%|██████████| 23004/23004 [00:03<00:00, 7020.84 examples/s]


In [None]:
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4060 Laptop GPU. Max memory = 7.996 GB.
1.773 GB of memory reserved.


In [21]:
import gc

gc.collect()
torch.cuda.empty_cache()

In [22]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 437,074 | Num Epochs = 1 | Total steps = 27,317
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 475,979,776/5,000,000,000 (9.52% trained)


| Step | Training Loss | Validation Loss |
|---|---|---|
| 1000 | 3.459 | 3.015484 |
| 2000 | 2.8868 | 2.815437 |
| 3000 | 2.7883 | 2.760508 |
| 4000 | 2.752 | 2.728818 |
| 5000 | 2.7221 | 2.706187 |
| 6000 | 2.7078 | 2.689549 |
| 7000 | 2.6933 | 2.676444 |
| 8000 | 2.6786 | 2.663862 |
| 9000 | 2.6635 | 2.654351 |
| 10000 | 2.6571 | 2.646235 |
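
The validation losses are easier to interpret as perplexities. The sketch below assumes that the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`; the dictionary holds the logged validation values.

```python
import math

# Validation loss by step, copied from the training log above.
val_loss = {
    1000: 3.015484, 2000: 2.815437, 3000: 2.760508, 4000: 2.728818,
    5000: 2.706187, 6000: 2.689549, 7000: 2.676444, 8000: 2.663862,
    9000: 2.654351, 10000: 2.646235,
}

# Assumption: loss is mean token-level cross-entropy in nats.
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>5}: perplexity = {ppl:.2f}")
```

Read this way, validation perplexity falls from roughly 20 at step 1,000 to roughly 14 at step 10,000, with most of the gain in the first few thousand steps.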


Unsloth: Will smartly offload gradients to save VRAM!


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


### Inference
Let's run the model! The prompt is passed as plain text, without a chat template, so the model simply continues it.

In [23]:
FastLanguageModel.for_inference(model)
prompt = "What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp?"
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
tokenizer.batch_decode(outputs)

['<｜begin▁of▁sentence｜>What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp? -- Introduction. There are several types of cement that are used in the dental field. The cement can be either cemented or cementous. The cement can be cemented cement, cemented cementous, or cementous cement. There are also different types of cement, such as calcium carbide, calcium carbonate, calcium hydroxide, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium carbonate, calcium hydroxide, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydroxychloride, calcium hydroxide hydro

You can also use a `TextStreamer` for streaming inference, so you can see the generation token by token instead of waiting for the whole output.

In [24]:
FastLanguageModel.for_inference(model)
prompt = "What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp?"
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=1024)

<｜begin▁of▁sentence｜>What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp? -- Other Issues. There are two types of cement: cement-based and non-cement-based.<｜end▁of▁sentence｜>


### Saving model in HF Hub

In [None]:
import unsloth
from unsloth import FastLanguageModel
import torch

checkpoint_path = "./trainer_output/checkpoint-27317"
output_hub_model_name = "MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-continual-pretrain-merged-f32"
max_seq_length = 4096
dtype = None
load_in_4bit = False


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=checkpoint_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-08 19:15:01 [__init__.py:256] Automatically detected platform cuda.


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.0. vLLM: 0.8.0.
   \\   /|    NVIDIA GeForce RTX 4060 Laptop GPU. Num GPUs = 1. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.19 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [None]:
# Merge the LoRA adapters into the base weights before uploading
model = model.merge_and_unload()

model.push_to_hub(output_hub_model_name, dtype=torch.float32)
# A tokenizer has no dtype, so none is passed here
tokenizer.push_to_hub(output_hub_model_name)

Pushing merged model to MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-continual-pretrain-merged-f32...


model.safetensors: 100%|██████████| 3.55G/3.55G [10:09<00:00, 5.83MB/s]   


Saved model to https://huggingface.co/MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-continual-pretrain-merged-f32


tokenizer.json: 100%|██████████| 11.4M/11.4M [00:01<00:00, 8.22MB/s]


Model and tokenizer pushed successfully.


### Check continually pretrained LLM inference

In [1]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

In [2]:
from unsloth import FastLanguageModel

model_name = "MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-continual-pretrain-merged-f32"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=8192,
    load_in_4bit=True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-09 16:21:38 [__init__.py:256] Automatically detected platform cuda.


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.0. vLLM: 0.8.0.
   \\   /|    NVIDIA GeForce RTX 4060 Laptop GPU. Num GPUs = 1. Max memory: 7.996 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [3]:
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)
FastLanguageModel.for_inference(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536, padding_idx=151654)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear4bit(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear4bit(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear4bit(in_features=1536, out_features=1536, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear4bit(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear4bit(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((153

In [4]:
question = "What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp?"

messages = [{"from": "human", "value": question}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=4096, use_cache=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<｜begin▁of▁sentence｜><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

What type of cement bonds to tooth structure, provides an anticariogenic effect, has a degree of translucency, and is non-irritating to the pulp?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|eot_id|><|start_header_id|>carved<|end_header_id|>carved, or carved, are the most common types of cement used for tooth restoration. Carved cement is made from a mixture of cement, silica, and calcium hydroxide, which are mixed with a water mixture. This cement is used to fill the dentin of the tooth, and it is the most commonly used cement for dental restoration. It is a very strong cement, and it is also very dense, so it is used for the restoration of teeth that are already highly worn. Carved cement is also used to fill the dentin of teeth that are undergoing dental decay. This cement is used for the re

KeyboardInterrupt: 