In [1]:
%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
import wandb
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-05-26 23:43:06.699994: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748302987.135390      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748302987.257296      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
# Initialize the Hugging Face (HF) and Weights & Biases (W&B) tokens
secrets = UserSecretsClient() # Kaggle secrets client used to read the stored API tokens
hf_token = secrets.get_secret("hf_token")
wb_token = secrets.get_secret("wb_token")

# Log in to Hugging Face
login(hf_token)

# Log in to Weights & Biases and start a run
wandb.login(key=wb_token)
run = wandb.init(
    project='DeepSeek-R1-Distill-Qwen-1.5B for medical use', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mhanksouyang5155[0m ([33mhanksouyang5155-northeastern-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [4]:
# Set parameters
max_seq_length = 2048 # Maximum sequence length the model handles (i.e. how many tokens can be processed at once)
dtype = None # Auto-detect the dtype by leaving it at the default
load_in_4bit = True # Enable 4-bit quantization to reduce memory usage

# Load the Deepseek model and tokenizer using unsloth - imported using: from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, 
)

==((====))==  Unsloth 2025.5.7: Fast Qwen2 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.81G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/6.78k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]
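
As a rough sanity check on why `load_in_4bit = True` matters on a 14.7 GB T4, weight memory can be approximated as parameter count times bits per parameter. The sketch below is illustrative arithmetic only: the 1.5e9 parameter count is a round-number assumption taken from the "1.5B" in the model name, and the estimate ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


approx_params = 1.5e9  # assumption: round figure taken from the model name

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(approx_params, bits):.2f} GB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights, which leaves more room for activations and optimizer state during fine-tuning.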

In [11]:
# Define the inference prompt template; the two {} placeholders take the question and the start of the model's reasoning
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""

In [12]:
# Create a test medical question for inference
question = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing
                but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would 
                cystometry most likely reveal about her residual volume and detrusor contractions?"""

# Enable optimized inference mode for Unsloth models (improves speed and efficiency)
FastLanguageModel.for_inference(model)

# Format the question with the structured prompt ("prompt_style"), tokenize it, and move the tensors to the GPU
inputs = tokenizer([prompt_style.format(question, "")], return_tensors = "pt").to("cuda")

# Generate a response using the model
outputs = model.generate(
    input_ids = inputs.input_ids,
    attention_mask = inputs.attention_mask,
    max_new_tokens = 1200,
    use_cache = True
)

# Decode the generated output tokens into readable text
response = tokenizer.batch_decode(outputs)

# Extract and print only the relevant part of the response (after "### Response:")
print(response[0].split("### Response:")[1])



<think>
Okay, so I'm trying to figure out what the medical question is about and how to answer it. Let me start by breaking down the question.

The patient is a 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night. She's undergoing a gynecological exam and a Q-tip test. The question is about what cystometry would reveal regarding her residual volume and detrusor contractions.

First, I need to understand the context. The patient has involuntary urine loss during activities but not at night. That suggests she might have a functional urethral sphincter (FUS) because the sphincter usually controls the flow of urine. Normally, the FUS opens during activity, allowing urine to pass, and closes at rest. The absence of leakage at night might indicate that the sphincter doesn't open as much at night, or perhaps the bladder is more efficiently emptied at rest.

Next, the exam findings are a gynecological exam and a Q

In [10]:
# Add the third placeholder for the complex chain of thought column.
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>

{}"""

In [35]:
# Download the first 1,000 English examples of the dataset from Hugging Face
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:1000]", trust_remote_code=True)
dataset

Dataset({
    features: ['Question', 'Complex_CoT', 'Response'],
    num_rows: 1000
})

In [36]:
# Use the tokenizer's end-of-sequence (EOS) token to mark where each training example ends; without it, the model may not learn when to stop generating
EOS_token = tokenizer.eos_token
if EOS_token is None:
    EOS_token = tokenizer.pad_token
if EOS_token is None:
    EOS_token = "</s>"

print(f"Using EOS token: {repr(EOS_token)}")

Using EOS token: '<｜end▁of▁sentence｜>'


In [37]:
# Define formatting prompt function
def formatting_prompts_func(examples):
    texts = []
    
    # Handle both single examples and batches
    if isinstance(examples["Question"], list):
        # Batch processing
        inputs = examples["Question"]
        cots = examples["Complex_CoT"]
        outputs = examples["Response"]
    else:
        # Single example processing
        inputs = [examples["Question"]]
        cots = [examples["Complex_CoT"]]
        outputs = [examples["Response"]]
    
    for input_text, cot, output_text in zip(inputs, cots, outputs):
        # Skip any None or empty entries
        if input_text and cot and output_text:
            text = train_prompt_style.format(input_text, cot, output_text) + EOS_token
            texts.append(text)
    
    return {"text": texts}
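
To see what the formatting step produces without loading the model or the dataset, the same logic can be run on a toy batch. This is a minimal sketch: `toy_template`, `toy_eos`, `format_batch`, and `toy_batch` are hypothetical stand-ins for `train_prompt_style`, `EOS_token`, `formatting_prompts_func`, and a real dataset batch.

```python
# Hypothetical stand-ins so the sketch runs without the model or dataset.
toy_template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n\n{}"
toy_eos = "</s>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples):
    """Build one training string per complete row, skipping rows with gaps."""
    texts = []
    for question, cot, answer in zip(
        examples["Question"], examples["Complex_CoT"], examples["Response"]
    ):
        if question and cot and answer:
            texts.append(toy_template.format(question, cot, answer) + toy_eos)
    return {"text": texts}


toy_batch = {
    "Question": ["What is 2 + 2?", "Incomplete row"],
    "Complex_CoT": ["Add the two numbers.", None],
    "Response": ["4", "n/a"],
}

formatted = format_batch(toy_batch)
print(len(formatted["text"]))  # 1: the row with a missing chain of thought is dropped
print(formatted["text"][0])
```

Because incomplete rows are skipped, the returned batch can be shorter than the input batch. A batched `map` accepts this as long as the original columns are removed, since every remaining column must have the same length.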

In [38]:
# Apply the formatting function to the dataset in batches, keeping only the new "text" column
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=dataset.column_names)
dataset["text"][0]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

"Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nGiven the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?\n\n### Response:\n<think>\nOkay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nBut wait, there's more. The right lower leg i

In [39]:
# Keep only examples whose formatted text is longer than 1,000 characters (this also drops empty texts)
dataset = dataset.filter(lambda x: len(x["text"]) > 1000)

print(f"Dataset size after processing: {len(dataset)}")
print("Sample text length:", len(dataset["text"][0]))

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset size after processing: 1000
Sample text length: 3055


In [40]:
# Apply LoRA (Low-rank Adaptation) fine-tuning to the model
model_lora = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Higher r allows the LoRA adapters to capture more complex changes, but it also increases the number of trainable parameters and memory usage.
    target_modules = [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        #"up_proj",
        #"down_proj",
        #"embed_tokens", 
        #"lm_head"
    ],
    lora_alpha= 16, # Scaling factor for the LoRA update; the adapter output is scaled by lora_alpha / r
    lora_dropout=0, # No dropout is applied to the LoRA layers
    bias= "none", # No bias terms will be trained
    use_gradient_checkpointing="unsloth",  # True or "unsloth". Recomputes activations during the backward pass instead of storing them all during the forward pass, saving significant GPU memory at the cost of some extra computation.
    random_state=3407,
    use_rslora=False,  # 'rslora' stands for Rank-Stabilized LoRA that is a variant of LoRA
    loftq_config=None,
)
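
The effect of `r` and `lora_alpha` can be made concrete with a little arithmetic. For each adapted linear layer, LoRA adds two low-rank matrices, so the number of extra trainable parameters is `r * (in_features + out_features)`. The sketch below uses hypothetical layer shapes chosen only for illustration; they are not the real dimensions of this model, so the total will not match the trainable-parameter count Unsloth reports.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


r, lora_alpha = 8, 16  # values used in the cell above

# Hypothetical (in_features, out_features) shapes for one transformer block.
toy_shapes = {
    "q_proj": (1024, 1024),
    "k_proj": (1024, 256),
    "v_proj": (1024, 256),
    "o_proj": (1024, 1024),
    "gate_proj": (1024, 4096),
}

per_block = sum(lora_param_count(i, o, r) for i, o in toy_shapes.values())
print("adapter parameters per block:", per_block)
print("LoRA scaling (lora_alpha / r):", lora_alpha / r)
```

Doubling `r` doubles the adapter size, while `lora_alpha / r` (with `use_rslora=False`) controls how strongly the adapter update is weighted relative to the frozen weights.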

In [41]:
import os

os.environ["UNSLOTH_DISABLE_FAST_CROSS_ENTROPY"] = "1"
os.environ["UNSLOTH_DISABLE_FAST_CROSS_ENTROPY_LOSS"] = "1" 
os.environ["DISABLE_FUSED_CROSS_ENTROPY"] = "1"

In [16]:
from transformers import DataCollatorForLanguageModeling

In [24]:
# Initialize the fine-tuning trainer
"""trainer = SFTTrainer(
    model = model_lora,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text", # Contains the actual formatted text that should be tokenized and fed to the model
    max_seq_length=1024, # Truncate training sequences to at most 1024 tokens
    dataset_num_proc=1, # Number of parallel processes (CPU cores) used to prepare the dataset
    packing=False,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),

    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4, # Accumulate gradients for 4 steps before each weight update
        # For full training runs, use num_train_epochs=1 and warmup_ratio instead of max_steps and warmup_steps
        warmup_steps=2, # Number of steps over which the learning rate increases linearly from 0 to learning_rate
        max_steps=30, # Total number of optimization steps to perform during training
        learning_rate=1e-4, # Step size the optimizer uses when updating weights from gradients
        fp16=True,
        bf16=False,
        logging_steps=5,
        optim="adamw_torch", # Standard PyTorch AdamW optimizer; "adamw_8bit" (bitsandbytes) would reduce optimizer-state memory
        weight_decay=0.01, # A form of regularization (L2 regularization) that penalizes large weights, helping to prevent overfitting.
        lr_scheduler_type="linear", # Linearly decay from its peak (after warm-up) down to 0
        seed=3407,
        output_dir="outputs", 
        dataloader_drop_last=True,      
        remove_unused_columns=True,  # Changed to True to clean up data
        dataloader_pin_memory=False,  # Disable pin memory to avoid issues
    ),
)"""

Unsloth: Tokenizing ["text"]:   0%|          | 0/500 [00:00<?, ? examples/s]

In [42]:
os.environ['UNSLOTH_RETURN_LOGITS'] = '1'

In [18]:
# Use standard Transformers Trainer instead of SFTTrainer
from transformers import Trainer
import torch.nn as nn

In [43]:
# Custom Trainer that computes the causal-LM loss with plain PyTorch cross-entropy
class StandardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # Check if labels exist, if not create them from input_ids
        if "labels" not in inputs or inputs["labels"] is None:
            inputs["labels"] = inputs["input_ids"].clone()
        
        labels = inputs["labels"]
        outputs = model(**inputs)
        logits = outputs.get("logits")
        
        # Use standard PyTorch CrossEntropyLoss
        loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        
        return (loss, outputs) if return_outputs else loss

In [44]:
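# Tokenize the formatted text and build labels for causal-LM training
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    # Copy input_ids into labels, masking padded positions with -100 so the loss ignores them
    tokenized["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized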

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
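
When sequences are padded to a fixed length, padded positions are usually excluded from the loss by giving them the label -100, which is the value `nn.CrossEntropyLoss(ignore_index=-100)` skips. The sketch below shows that masking, plus the one-position shift used for next-token prediction, on plain Python lists. The token ids and helper names are hypothetical and chosen only for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by nn.CrossEntropyLoss(ignore_index=-100)


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


def shift_for_next_token(labels):
    """Position t predicts token t + 1, so targets are the labels shifted left by one."""
    return labels[1:]


# Hypothetical token ids: three real tokens followed by two pad tokens (id 0).
toy_ids = [11, 12, 13, 0, 0]
toy_mask = [1, 1, 1, 0, 0]

labels = mask_padding(toy_ids, toy_mask)
targets = shift_for_next_token(labels)
print(labels)   # [11, 12, 13, -100, -100]
print(targets)  # [12, 13, -100, -100]
```

Only the first two targets contribute to the loss here; the padded positions are ignored rather than teaching the model to predict pad tokens.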

In [45]:

# Create standard trainer
trainer = StandardTrainer(
    model=model_lora,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=5e-4,
        fp16=True,
        bf16=False,
        logging_steps=5,
        optim="adamw_torch",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        dataloader_drop_last=True,
        remove_unused_columns=False,
    ),
)
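
Two of these settings are easier to reason about with numbers: the number of sequences per optimizer step and the shape of the linear learning-rate schedule. The sketch below is a plain-Python approximation of a linear warmup followed by linear decay, using the values from the cell above. `linear_schedule_lr` is a hypothetical helper written for illustration, not the scheduler object the `Trainer` builds internally.

```python
def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


per_device_batch, grad_accum = 2, 4
print("sequences per optimizer step per device:", per_device_batch * grad_accum)

for step in (0, 5, 30, 60):
    lr = linear_schedule_lr(step, 5e-4, warmup_steps=5, total_steps=60)
    print(f"step {step:>2}: lr = {lr:.2e}")
```

With only 60 steps, the learning rate spends most of the run decaying, so the peak value and the warmup length have a large influence on how much the adapters actually move.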

In [46]:
# Train with standard trainer
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 13,934,592/5,000,000,000 (0.28% trained)


Step,Training Loss
5,6.8911
10,6.6679
15,6.8663
20,6.7497
25,6.9911
30,6.8122
35,6.7947
40,6.772
45,6.9389
50,6.7693


In [47]:
# Finish the W&B run and print its history and summary
wandb.finish()

Run history:
train/epoch,▁▂▃▃▄▄▁▁▂▂▂▃▃▁▁▂▂▂▃▃▃▃▄▄▄▄▁▂▃▃▄▄▅▆▆▇▇██
train/global_step,▂▃▄▅▇▇▁▂▂▃▄▄▄▁▂▂▃▄▄▅▅▆▇▇██▁▂▂▃▄▄▅▅▆▇▇██
train/grad_norm,█▅▅▅▄▆▅▅▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate,▂▂▂▁▁▇▇▆▄▃▁▇█▇▇▆▅▅▄▃▂▂▁▇█▇▇▆▅▅▄▃▂▂▁
train/loss,█▆▅▄▄▃▄▄▄▄▄▁▁▂▃▃▄▄▄▄▄▄▄▃▃▃▃▄▃▃▃▄▃▃▃

Run summary:
total_flos,5281940546519040.0
train/epoch,0.96
train/global_step,60.0
train/grad_norm,0.66169
train/learning_rate,1e-05
train/loss,6.5796
train_loss,6.78009
train_runtime,318.1776
train_samples_per_second,3.017
train_steps_per_second,0.189


In [48]:
# Model inference after fine-tuning
question_1 = """A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing 
            but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most 
            likely reveal about her residual volume and detrusor contractions?"""

# Switch the model to Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)  

# Format the question with the prompt template, tokenize it, and move the tensors to the GPU
inputs_1 = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

# Generate a response from the fine-tuned model
outputs_1 = model.generate(
    input_ids=inputs_1.input_ids, # Tokenized input IDs
    attention_mask=inputs_1.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200, # Maximum number of newly generated tokens
    use_cache=True, # Enable the KV cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response_1 = tokenizer.batch_decode(outputs_1)

# Extract and print the part after '### Response'
print(response_1[0].split("### Response:")[1])



<think>
Alright, so we have a 61-year-old woman here. She's got a long history of losing urine when she coughs or sneezes, but she doesn't leak at night. That's interesting. I know that could mean something's off with her kidneys or maybe her bladder. Hmm, let's think about what could cause that.

Now, she's got a gynecological exam and a Q-tip test. The Q-tip test is pretty standard for checking bladder function. It's like checking if the bladder can hold enough fluid when you push it out. But, she's got a long history of urine loss, so that's a big clue. If her bladder is not holding enough, it could explain the leaky night too.

Okay, so if the bladder isn't holding enough fluid, that usually means there's something wrong with the detrusor muscles. They're responsible for holding onto the fluid in the bladder, and if they're not working right, the bladder could leak.

Now, thinking about cystometry, that's a test we use to look at how much fluid is in the bladder. It's usually used

In [33]:
# Test a second example
question_2 = "A 59-year-old man presents with a fever, chills, night sweats, and generalized fatigue, and is found to have a 12 mm vegetation on the aortic valve. Blood cultures indicate gram-positive, catalase-negative, gamma-hemolytic cocci in chains that do not grow in a 6.5% NaCl medium. What is the most likely predisposing factor for this patient's condition?"

inputs_2 = tokenizer([prompt_style.format(question_2, "")], return_tensors="pt").to("cuda")

outputs_2 = model.generate(
    input_ids=inputs_2.input_ids, # Tokenized input IDs
    attention_mask=inputs_2.attention_mask, # Attention mask for padding handling
    max_new_tokens=1200, # Maximum length for generated response
    use_cache=True, # Enable cache for efficient generation
)

# Decode the generated response from tokenized format to readable text
response_2 = tokenizer.batch_decode(outputs_2)

# Extract and print the part after '### Response'
print(response_2[0].split("### Response:")[1])




<think>
Alright, let's break this down. We've got a 59-year-old guy who's showing signs of a severe illness: fever, chills, night sweats, and fatigue. That's a lot to take in. And then there's this 12 mm vegetation on his aortic valve. Hmm, that's interesting because it's not a common thing, especially in someone who's already got these symptoms.

Now, let's think about the blood cultures. They're gram-positive, which means they're not white blood cells. And they're catalase-negative, which is a clue. Catalase is an enzyme that breaks down lactose into glucose and galactose, so if this enzyme is negative, it means lactose is not being broken down. That's a big red flag.

The fact that the cocci chains don't grow in a 6.5% NaCl medium means they're not resistant to this kind of acid. It's like they're not stable and can be hydrolyzed. This is usually a sign of some kind of acid resistance issue.

Putting all these clues together, it's starting to sound like we're dealing with something

In [49]:
# Save the fine-tuned model locally
new_model_local = "DeepSeek-R1-Medical-COT-1.5B"
model_lora.save_pretrained(new_model_local) # Saves only the LoRA adapter weights, which can be reloaded later on top of the base model
tokenizer.save_pretrained(new_model_local) # Saves the tokenizer files needed to convert text to token IDs

# Merge the trained LoRA adapter weights into the base model and save the result in 16-bit precision
model_lora.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.8G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 18.29 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 44.25it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving DeepSeek-R1-Medical-COT-1.5B/pytorch_model.bin...
Done.
