## Credits

Learnt this from Abid Ali Awan's tutorial on fine-tuning DeepSeek R1, posted on DataCamp.

## Packages & Tools used
- `unsloth`: fine-tuning library for LLMs
    - `FastLanguageModel` for optimized fine-tuning & inference
- `peft` for LoRA (Low-Rank Adaptation)
    - LoRA trains only a small set of added parameters, which makes fine-tuning feasible on weaker GPUs
- Hugging Face libraries:
    - `transformers` provides the model, the tokenizer and the training arguments
    - `trl` (Transformer Reinforcement Learning) provides supervised fine-tuning through the `SFTTrainer` wrapper
    - `datasets` to access the dataset I'm using from Hugging Face
- `torch`: deep learning framework
- `wandb`: Weights & Biases, tracks the fine-tuning run


In [29]:
# %%capture

!pip install unsloth
# !pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [30]:
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git # Also get the latest version Unsloth!
print('install done')


Collecting git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-req-build-grlhkvdq
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-req-build-grlhkvdq
  Resolved https://github.com/unslothai/unsloth.git to commit 37e577a91386cb5b4a7b818a1418b66beef17296
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2025.6.2-py3-none-any.whl size=278541 sha256=f147931fbe941a7572908163ec191229fcf4b86ab37fb30e9cdaa22eb5bee072
  Stored in directory: /tmp/pip-ephem-wheel-cache-wv6tfvil/wheels/d1/17/05/850ab10c33284a4763b0595cd8ea9d01fce6e221cac24b3c01
Successfully built unsloth
Installing collected packages: unsloth
 

In [31]:
# Modules for fine-tuning
from unsloth import FastLanguageModel
import torch # Import PyTorch
from trl import SFTTrainer # Trainer for supervised fine-tuning (SFT)
from unsloth import is_bfloat16_supported # Checks if the hardware supports bfloat16 precision
# Hugging Face modules
from huggingface_hub import login # Lets you login to API
from transformers import TrainingArguments # Defines training hyperparameters
from datasets import load_dataset # Lets you load fine-tuning datasets
# Import weights and biases
import wandb
# Import kaggle secrets
from kaggle_secrets import UserSecretsClient
print('setup complete')

setup complete


In [32]:
# Initialize Hugging Face & WnB tokens
user_secrets = UserSecretsClient() # from kaggle_secrets import UserSecretsClient
hugging_face_token = user_secrets.get_secret("Hugging_Face_Token")
wnb_token = user_secrets.get_secret("wnb")

# Login to Hugging Face
login(hugging_face_token) # from huggingface_hub import login

# Login to WnB
wandb.login(key=wnb_token) # import wandb
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# Loading DeepSeek & Tokenizer
`FastLanguageModel.from_pretrained()` loads DeepSeek and its tokenizer. I'm using a distilled 8B version of R1 for faster computation. The parameters are explained in the code comments.

**4-bit quantization**
Enabling 4-bit quantization saves memory, which matters on limited GPUs. It reduces the precision of the model weights whilst keeping most of the accuracy, which makes the model smaller and faster.

4-bit quantization stores each weight in 4 bits (instead of 32 or 16 bits), which is key to running models on normal GPUs.
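To make the saving concrete, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone at different precisions. It assumes exactly 8 billion parameters and ignores the small overhead that real 4-bit formats add (scales, zero points), so treat the numbers as estimates rather than measurements.

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Assumptions: exactly 8e9 parameters, no quantization overhead,
# and 1 GB = 1e9 bytes.
NUM_PARAMS = 8_000_000_000


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Return the memory needed for the weights alone, in GB."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# 32-bit weights: 32 GB
# 16-bit weights: 16 GB
#  4-bit weights: 4 GB
```

On a 16 GB card, 16-bit weights alone would leave no room for activations or optimizer state, whereas 4-bit weights leave most of the memory free.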

In [33]:
max_seq_length = 2048 # max sequence length of the model (i.e. max no. of tokens processed per example)
dtype = None # None lets Unsloth auto-detect: BF16 if the GPU supports it, otherwise FP16
load_in_4bit = True # Enables 4-bit quantization, a memory-saving optimization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Load the distilled DeepSeek R1 model (8B parameter version)
    max_seq_length=max_seq_length, # reuse the 2048-token limit defined above
    dtype=dtype, # auto-detected (FP16 or BF16)
    load_in_4bit=load_in_4bit,
    token=hugging_face_token,
)



==((====))==  Unsloth 2025.6.2: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla P100-PCIE-16GB. Num GPUs = 1. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 6.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


# Example of response before Fine-Tuning

1) Define a system prompt  
2) Run inference on the model  


In [34]:
# Define a system prompt under prompt_style 
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""


# Creating a test medical question for inference
question = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or 
              sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, 
              what would cystometry most likely reveal about her residual volume and detrusor contractions?"""

# Enable optimized inference mode for Unsloth models (improves speed and efficiency)
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!

# Format the question using the structured prompt (`prompt_style`) and tokenize it
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")  # Convert input to PyTorch tensor & move to GPU

# Generate a response using the model
outputs = model.generate(
    input_ids=inputs.input_ids, # Tokenized input question
    attention_mask=inputs.attention_mask, # Attention mask to handle padding
    max_new_tokens=1200, # Limit response length to 1200 tokens (to prevent excessive output)
    use_cache=True, # Enable caching for faster inference
)

# Decode the generated output tokens into human-readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the relevant response part (after "### Response:")
print(response[0].split("### Response:")[1])  


<think>
Okay, so I'm trying to figure out what cystometry would show for this 61-year-old woman. Let me break this down.

First, the patient has a history of involuntary urine loss when she coughs or sneezes, but she doesn't leak at night. That makes me think she might have some kind of bladder control issue, maybe related to the lower urinary tract. Since it happens during activities that put pressure on the bladder, like coughing, it's probably due to an overactive bladder or maybe some nerve issue affecting the bladder's control.

She underwent a gynecological exam and a Q-tip test. I'm not too familiar with the Q-tip test, but I think it's used to check for urethral obstruction. If the Q-tip is placed in the urethra and stays there, it suggests that the urethral opening is narrow or blocked, leading to difficulty in voiding. So if the Q-tip test came back positive, that would mean there's obstruction at the urethral level.

Now, considering the findings, what would cystometry reve

# Issues w/ response

The reasoning process is shown between the `<think></think>` tags.

This response has several issues. The reasoning, while detailed, is long-winded rather than concise. Furthermore, we want the final answer to follow a consistent style.
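When comparing outputs, it can help to separate the reasoning from the final answer. The helper below is a minimal sketch: `split_reasoning` is a hypothetical name (not part of Unsloth or `transformers`), and the sample string is a toy placeholder rather than real model output.

```python
def split_reasoning(generated: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) around the </think> tag."""
    reasoning, closing_tag, answer = generated.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    # If the closing tag is missing, generation was probably cut off
    # mid-reasoning, so there is no final answer to return.
    return reasoning, (answer.strip() if closing_tag else "")


# Toy placeholder text, not real model output
sample = "<think>\nstep 1, step 2\n</think>\nfinal answer"
reasoning, answer = split_reasoning(sample)
print(reasoning)  # step 1, step 2
print(answer)     # final answer
```

An empty answer is a quick signal that the model spent its whole token budget inside the `<think>` block.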

# Fine-Tuning the model 

Step 1: Update the prompt template so the reasoning is wrapped in `<think></think>` and followed by the final answer

In [35]:
# Updated training prompt style to add </think> tag 
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}"""


Step 2: Use the Medical O1 Reasoning SFT dataset ([HuggingFace Link](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT)) to fine-tune our model.

Originally, this dataset was used to fine-tune HuatuoGPT-o1, a medical LLM. 


In [36]:
# Download the dataset from Hugging Face (load_dataset is imported from datasets)
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True) # load only the first 500 rows, used for training
dataset


Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 500
})

In [37]:
dataset[499]

{'Question': 'What type of contrast agent is recommended for use in a patient with renal dysfunction to prevent contrast nephropathy?',
 'Complex_CoT': "Okay, so let's figure this out. Contrast nephropathy is something to be cautious about, especially when we're dealing with patients with kidney issues. This condition happens after using contrast agents during imaging procedures, and it can lead to acute kidney injury. It sounds pretty serious, so we definitely want to choose the safest option for these patients.\n\nThere are several types of contrast media available, and they come in different osmolarities. We've got high-osmolar contrast media (HOCM), low-osmolar contrast media (LOCM), and then there's iso-osmolar contrast media (IOCM). From what I remember, high-osmolar agents aren't really great for people with kidney problems because they tend to have a higher risk of causing nephropathy.\n\nOh, for patients with renal dysfunction, we need to be extra careful. We should opt for co

Each question has a "chain of thought" and a response.  
This dataset format ensures every training example follows a consistent pattern.  
We're also going to stop the model from rambling on by appending an EOS (end-of-sequence) token to every training example.

In [38]:
EOS_TOKEN = tokenizer.eos_token  # Define EOS_TOKEN, which teaches the model when to stop generating text
EOS_TOKEN

'<｜end▁of▁sentence｜>'

In [39]:
# Define formatting prompt function
def formatting_prompts_func(examples):  # Takes a batch of dataset examples as input
    questions = examples["Question"]    # Extracts the medical questions from the batch
    cots = examples["Complex_CoT"]      # Extracts the chain-of-thought reasoning (logical step-by-step explanation)
    responses = examples["Response"]    # Extracts the reference answers from the dataset
    
    texts = []  # Initializes an empty list to store the formatted prompts
    
    # Iterate over the batch, formatting each question, reasoning chain, and response
    for question, cot, response in zip(questions, cots, responses):  # avoids shadowing the built-in `input`
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN  # Insert values into prompt template & append EOS token
        texts.append(text)  # Add the formatted text to the list

    return {
        "text": texts,  # Return the newly formatted dataset with a "text" column containing structured prompts
    }


In [40]:
# Update dataset formatting
dataset_finetune = dataset.map(formatting_prompts_func, batched = True)
dataset_finetune["text"][0]

"Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nGiven the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?\n\n### Response:\n<think>\nOkay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nBut wait, there's more. The right lower leg i

Step 3: Using LoRA  
LoRA (**Low-Rank Adaptation**) lets me fine-tune this model efficiently. A standard LLM has millions or billions of parameters that determine how it processes and generates text. Full fine-tuning updates all of those weights, which requires too many resources.

LoRA adds small, trainable adapters to specific layers, which capture task-specific knowledge whilst leaving the original weights frozen. This reduces the number of trainable parameters **by more than 90%**, making fine-tuning **faster and more memory-efficient**.  

Tutorial by [Sebastian Raschka](https://www.youtube.com/watch?v=rgmJep4Sb4&t).

The `get_peft_model()` function (PEFT stands for Parameter-Efficient Fine-Tuning) wraps the base model (`model`) with LoRA adapters, ensuring that only specific parameters are trained.  
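As a sanity check on how small the adapters are, the sketch below counts LoRA parameters for rank 16 applied to the seven projection layers targeted in this notebook. The layer dimensions are an assumption (Llama-3.1-8B-style sizes, which the notebook does not state); each adapter adds two matrices, `A` (rank x in) and `B` (out x rank).

```python
# Assumed Llama-3.1-8B-style dimensions (not stated in this notebook)
HIDDEN = 4096         # model width
KV_DIM = 1024         # key/value projection width (grouped-query attention)
INTERMEDIATE = 14336  # feed-forward inner width
NUM_LAYERS = 32
RANK = 16             # matches r=16 in get_peft_model

# (in_features, out_features) of each adapted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumed dimensions the total matches the "Trainable parameters = 41,943,040" line that Unsloth prints when training starts, which is roughly 0.5% of an 8B-parameter model.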

In [41]:
# Apply LoRA (Low-Rank Adaptation) fine-tuning to the model 
model_lora = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: Determines the size of the trainable adapters (higher = more parameters, lower = more efficiency)
    target_modules=[  # List of transformer layers where LoRA adapters will be applied
        "q_proj",   # Query projection in the self-attention mechanism
        "k_proj",   # Key projection in the self-attention mechanism
        "v_proj",   # Value projection in the self-attention mechanism
        "o_proj",   # Output projection from the attention layer
        "gate_proj",  # Used in feed-forward layers (MLP)
        "up_proj",    # Part of the transformer’s feed-forward network (FFN)
        "down_proj",  # Another part of the transformer’s FFN
    ],
    lora_alpha=16,  # Scaling factor for LoRA updates (higher values allow more influence from LoRA layers)
    lora_dropout=0,  # Dropout rate for LoRA layers (0 means no dropout, full retention of information)
    bias="none",  # Specifies whether LoRA layers should learn bias terms (setting to "none" saves memory)
    use_gradient_checkpointing="unsloth",  # Saves memory by recomputing activations instead of storing them (recommended for long-context fine-tuning)
    random_state=3407,  # Sets a seed for reproducibility, ensuring the same fine-tuning behavior across runs
    use_rslora=False,  # Rank-Stabilized LoRA is disabled, so the standard lora_alpha / r scaling is used
    loftq_config=None,  # LoftQ (LoRA-fine-tuning-aware quantization) initialization is disabled in this configuration
)

In [42]:
# Now, we initialize SFTTrainer, a supervised fine-tuning trainer 
# from `trl` (Transformer Reinforcement Learning), 
# to fine-tune our model efficiently on a dataset.
# Initialize the fine-tuning trainer — Imported using from trl import SFTTrainer
trainer = SFTTrainer(
    model=model_lora,  # The model to be fine-tuned
    tokenizer=tokenizer,  # Tokenizer to process text inputs
    train_dataset=dataset_finetune,  # Dataset used for training
    dataset_text_field="text",  # Specifies which field in the dataset contains training text
    max_seq_length=max_seq_length,  # Defines the maximum sequence length for inputs
    dataset_num_proc=2,  # Uses 2 worker processes to speed up data preprocessing

    # Define training arguments
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Number of examples processed per device (GPU) at a time
        gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps before updating weights
        num_train_epochs=1, # One pass over the data (max_steps below takes precedence)
        warmup_steps=5,  # Gradually increases learning rate for the first 5 steps
        max_steps=60,  # Limits training to 60 steps and overrides num_train_epochs (useful for debugging; increase for full fine-tuning)
        learning_rate=2e-4,  # Learning rate for weight updates (tuned for LoRA fine-tuning)
        fp16=not is_bfloat16_supported(),  # Use FP16 (if BF16 is not supported) to speed up training
        bf16=is_bfloat16_supported(),  # Use BF16 if supported (better numerical stability on newer GPUs)
        logging_steps=10,  # Logs training progress every 10 steps
        optim="adamw_8bit",  # Uses memory-efficient AdamW optimizer in 8-bit mode
        weight_decay=0.01,  # Regularization to prevent overfitting
        lr_scheduler_type="linear",  # Uses a linear learning rate schedule
        seed=3407,  # Sets a fixed seed for reproducibility
        output_dir="outputs",  # Directory where fine-tuned model checkpoints will be saved
    ),
)
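The combination of `warmup_steps=5`, `max_steps=60` and `lr_scheduler_type="linear"` means the learning rate ramps up for 5 steps and then decays linearly to zero. The function below is a standalone sketch of that shape; `linear_schedule_lr` is a hypothetical helper written for illustration, not the scheduler object the trainer actually builds.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, total_steps: int = 60) -> float:
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Print the learning rate at a few points in the 60-step run
for step in (0, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

The rate peaks at step 5 and reaches zero at step 60, consistent with the final `train/learning_rate` of 0.0 in the run summary.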


Step 4: Model Training  


In [43]:
# fine-tuning process
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Step,Training Loss
10,1.9381
20,1.4178
30,1.3975
40,1.3547
50,1.3836
60,1.3633
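
The numbers in the training banner can be reproduced with a little arithmetic. This sketch only restates the settings from the `TrainingArguments` above and shows why 60 steps falls just short of one full epoch over the 500 examples.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 500

# Examples contributing to each optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Examples seen over the whole run
examples_seen = effective_batch_size * max_steps
# Fraction of the dataset covered
epochs_completed = examples_seen / num_examples

print(effective_batch_size, examples_seen, epochs_completed)  # 8 480 0.96
```

So the run sees 480 of the 500 examples once, which is the `train/epoch` value of 0.96 reported by Weights & Biases.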


In [44]:
# Close the Weights & Biases run (note: this does not save the model)
wandb.finish()
print('finished!')

0,1
train/epoch,▁▂▄▅▇██
train/global_step,▁▂▄▅▇██
train/grad_norm,█▂▁▁▂▁
train/learning_rate,█▇▅▄▂▁
train/loss,█▂▂▁▁▁

0,1
total_flos,1.6770438117310464e+16
train/epoch,0.96
train/global_step,60.0
train/grad_norm,0.25618
train/learning_rate,0.0
train/loss,1.3633
train_loss,1.47585
train_runtime,2481.8979
train_samples_per_second,0.193
train_steps_per_second,0.024


finished!


Step 5: New model response after fine-tuning

In [45]:
question = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing 
              but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, 
              what would cystometry most likely reveal about her residual volume and detrusor contractions?"""

# Switch the LoRA model to Unsloth's optimized inference mode
FastLanguageModel.for_inference(model_lora)  # Unsloth has 2x faster inference!

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model_lora.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the model's response part after "### Response:"
print(response[0].split("### Response:")[1])


<think>
Okay, so let's break this down. We have a 61-year-old woman who's been dealing with involuntary urine loss during things like coughing or sneezing, but she's not having any problems at night. That's interesting because it usually points towards some kind of issue with her bladder or urethra, but not necessarily something that affects her sleep.

When they did the gynecological exam and the Q-tip test, they found something. Now, I'm not entirely sure what they found, but I'm guessing it's not something that's super serious. The Q-tip test is usually done to see if there's a urethral obstruction or something like that. So if it's positive, it suggests there's an obstruction. If it's negative, then maybe it's more of a bladder issue.

Now, the big question is what would cystometry show. Cystometry is like a test where they fill up the bladder and see how it behaves. If there's a residual volume that's more than usual, it could mean that her bladder isn't emptying completely. That

In [46]:
question = """A 59-year-old man presents with a fever, chills, night sweats, and generalized fatigue, 
              and is found to have a 12 mm vegetation on the aortic valve. Blood cultures indicate gram-positive, catalase-negative, 
              gamma-hemolytic cocci in chains that do not grow in a 6.5% NaCl medium. 
              What is the most likely predisposing factor for this patient's condition?"""

# Tokenize the input question with a specific prompt format and move it to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

# Generate a response using LoRA fine-tuned model with specific parameters
outputs = model_lora.generate(
    input_ids=inputs.input_ids,          # Tokenized input IDs
    attention_mask=inputs.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200,                  # Maximum length for generated response
    use_cache=True,                        # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the model's response part after "### Response:"
print(response[0].split("### Response:")[1])


<think>
Alright, let's think about this. We have a 59-year-old man with some classic symptoms of an infection—fever, chills, night sweats, and fatigue. And there's this vegetation on his aortic valve, which really catches my attention. That makes me think about endocarditis, specifically valvular endocarditis. 

Okay, so now let's look at the microbiology details. Blood cultures show gram-positive, catalase-negative, gamma-hemolytic cocci in chains that don't grow in a 6.5% NaCl medium. Hmm, that's interesting. These characteristics remind me of something I've heard about before. I think it's called viridans group streptococcus or maybe something similar.

Oh wait, viridans group streptococcus! That's right. It's known for causing endocarditis, especially on the aortic valve. Now, what's the big risk factor for this kind of infection? I remember that it's often related to some kind of underlying heart condition or maybe even a previous heart valve issue. 

Oh, and I just thought of so