# Parameter-Efficient Supervised Fine-Tuning of LLaM

## Enviroment pakages 

In [1]:
%%capture
!pip install unsloth vllm
!pip install --force-reinstall --no-cache=dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
!pip install rouge_score



### import libraries

In [3]:
import unsloth
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-06-26 03:06:07.987705: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750907168.010954     416 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750907168.018058     416 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-26 03:06:14 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-26 03:06:14 [__init__.py:239] Automatically detected platform cuda.


In [4]:


from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset, DatasetDict
from huggingface_hub import login
import wandb
import numpy as np
from rouge_score import rouge_scorer


## Logging into Hugging Face and Weights & Biases 

In [5]:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from huggingface_hub import login 
from transformers import TrainingArguments



In [6]:
# Initialize Hugging Face & WnB tokens
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient() # from kaggle_secrets import UserSecretsClient
hugging_face_token = user_secrets.get_secret("hf_token")
wnb_token = user_secrets.get_secret("wnbs")

# Login to Hugging Face

login(hugging_face_token)

# Login to WnB

wandb.login(key=wnb_token)  #import wandb

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mhamayoonali38[0m ([33mhamayoonali38-datacamp[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Loading Model and Tokenizer

In [7]:
max_seq_length = 2048
dtype =None
load_in_4bit =True

model , tokenizer =FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2025.6.5: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Dataset preparation

In [8]:
dataset=load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en",split="train[0:500]",trust_remote_code =True)
dataset


Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 500
})

In [9]:
from datasets import DatasetDict

validation_size = 100
train_size = len(dataset) - validation_size
indices = np.random.permutation(len(dataset))
train_indices, val_indices = indices[:train_size], indices[train_size:]

train_dataset = dataset.select(train_indices)
val_dataset = dataset.select(val_indices)

dataset_dict = DatasetDict({"train": train_dataset, "validation": val_dataset})


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []

    for input_text, cot, output in zip(inputs, cots, outputs):
        convo = [
            {"role": "user", "content": input_text},
            {"role": "assistant", "content": f"</think>{cot}</think>\n<response>{output}</response>"}
        ]
        text = tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return {"text": texts}


dataset_dict = dataset_dict.map(
    formatting_prompts_func,
    batched=True,
    num_proc=2
)


Map (num_proc=2):   0%|          | 0/400 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

## Fine-Tunning Setup with LoRA

In [10]:
model_lora = FastLanguageModel.get_peft_model(

    model,
    r=16,
    target_modules=["q_proj", "k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,

)

Unsloth 2025.6.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


## Compute ROUGE -L Scores Before Fine Tuning


In [11]:
from rouge_score import rouge_scorer
import numpy as np

def compute_rouge_l(dataset, model, tokenizer, num_samples=10):
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = []
    model.eval()
    
    for example in dataset.select(range(min(num_samples, len(dataset)))):
        question = example["Question"]
        ground_truth = example["Response"]
        
        convo = [{"role": "user", "content": question}]
        inputs = tokenizer(
            [tokenizer.apply_chat_template(convo, tokenize=False)],
            return_tensors="pt"
        ).to("cuda")
        
        outputs = model.generate(**inputs, max_new_tokens=1200, use_cache=True)
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        score = scorer.score(ground_truth, prediction)['rougeL'].fmeasure
        scores.append(score)
    
    return np.mean(scores)

pre_finetune_rouge= compute_rouge_l(dataset_dict["validation"],model_lora,tokenizer)
run =wandb.init(
    project="Parameter-Efficient Supervised Fine-Tuning of LLaM",
    job_type="training",
    anonymous="allow",
    
)
wandb.log({"pre_finetune_rouge_l":pre_finetune_rouge})


## intialize Fine-Tuning Trainer

In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model_lora,
    tokenizer=tokenizer,
    train_dataset=dataset_dict['train'],
    eval_dataset=dataset_dict['validation'],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,

    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,  
        warmup_steps=100,
        max_steps=250,
        learning_rate=2e-4,  
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",  
        seed=3407,
        output_dir="outputs",

        report_to="wandb",  
        run_name="llama3-3B-lr2e-4-steps1000",
        save_strategy='steps',
        save_steps=50,
        save_total_limit=1,
        eval_strategy="steps",  # ✅ changed from evaluation_strategy
        eval_steps=50,

        hub_model_id="Hamayyoon/LLaMA3.2B-Medcot",
        logging_first_step=True,  
    ),
)


Unsloth: Tokenizing ["text"]:   0%|          | 0/400 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"]:   0%|          | 0/100 [00:00<?, ? examples/s]

## model Training

In [13]:
trainer_state = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 400 | Num Epochs = 21 | Total steps = 250
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
50,1.5543,1.541682
100,1.3431,1.453986
150,1.0089,1.633494
200,0.6689,2.065748
250,0.4611,2.326826


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


In [15]:
post_finetune_rouge = compute_rouge_l(dataset_dict["validation"], model_lora, tokenizer)
wandb.log({"post_finetune_rouge_l": post_finetune_rouge})

In [16]:
model_lora.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")

('lora_adapters/tokenizer_config.json',
 'lora_adapters/special_tokens_map.json',
 'lora_adapters/tokenizer.json')

In [19]:
# Load custom token
user_secrets = UserSecretsClient()
upload_token = user_secrets.get_secret("hf_token")

# Log in with that token
login(token=upload_token)

model_lora.push_to_hub("Hamayyoon/medcot-llama3.2-3b-model", token=upload_token)
tokenizer.push_to_hub("Hamayyoon/medcot-llama3.2-3b-tokenizer", token=upload_token)

Uploading...:   0%|          | 0.00/97.3M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved model to https://huggingface.co/Hamayyoon/medcot-llama3.2-3b-model


README.md: 0.00B [00:00, ?B/s]

Uploading...:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


In [20]:
def generate_response(question, model, tokenizer):
    
    FastLanguageModel.for_inference(model)
    convo = [{"role": "user", "content": question}]
    inputs = tokenizer([tokenizer.apply_chat_template(convo, tokenize=False)], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=1200, use_cache=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("<response>")[1].split("</response>")[0] if "<response>" in response else response

question = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no 
leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, 
what would cystometry most likely reveal about her residual volume and detrusor contractions?"""

response = generate_response(question, model_lora, tokenizer)
print(f"Response: {response}")

Response: In this scenario, the symptoms and the Q-tip test suggest stress urinary incontinence, which is involuntary urine leakage accompanied by physical stress or exertion. 

On cystometry, you would typically see a normal residual volume (RV) because there's no indication of neurological bladder dysfunction that would affect emptying. 

Regarding detrusor contractions, under stress or exertional pressure, such as during activities like coughing or sneezing, you would likely observe detrusor contractions. These contractions are a normal response of the bladder muscle to increased pressure, leading to the leakage of urine. However, it's important to note that the frequency and amplitude of these contractions might not be constant and could vary with different levels of exertion. 

Overall, the findings on cystometry would align with stress urinary incontinence, confirming the suspicion based on her symptoms and the Q-tip test.


In [21]:
wandb.finish()

0,1
eval/loss,▂▁▂▆█
eval/runtime,█▂▂▁▂
eval/samples_per_second,▁▇▇█▇
eval/steps_per_second,▁▇▇█▇
post_finetune_rouge_l,█▁
pre_finetune_rouge_l,▁
train/epoch,▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇████
train/global_step,▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇▇██████
train/grad_norm,▄▄▃▂▂▁▁▁▁▂▂▃▃▅▄▅▆▆▇▇█▇▇▇▇▇
train/learning_rate,▁▂▂▃▄▄▅▆▇▇██▇▇▆▆▅▅▄▄▃▃▂▂▂▁

0,1
eval/loss,2.32683
eval/runtime,47.18
eval/samples_per_second,2.12
eval/steps_per_second,0.276
post_finetune_rouge_l,0.12072
pre_finetune_rouge_l,0.20637
total_flos,1.1233494505301606e+17
train/epoch,19.24
train/global_step,250.0
train/grad_norm,0.73752


##   