In [1]:
!pip install unsloth

Collecting unsloth
  Downloading unsloth-2025.12.9-py3-none-any.whl.metadata (65 kB)
Collecting unsloth_zoo>=2025.12.7 (from unsloth)
  Downloading unsloth_zoo-2025.12.7-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-1.0.3-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting bitsandbytes!=0.46.0,!=0.48.0,>=0.45.5 (from unsloth)
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting datasets!=4.0.*,!=4.1.0,<4.4.0,>=3.4.1 (from unsloth)
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.24.0,>=0.18.2 (from u

In [2]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.81G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = 'none',
    use_gradient_checkpointing = 'unsloth',
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.12.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
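
**Worked check (sketch):** The LoRA settings above determine how many weights are actually trained. Each adapted linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The layer dimensions used below are assumptions taken from the public Qwen2-style 1.5B configuration, not values printed in this notebook; with them, the count reproduces the 18,464,768 trainable parameters that the training log reports further down.

```python
# Worked check of the trainable-parameter count for r = 16.
# ASSUMPTION: the dimensions below come from the public Qwen2-style 1.5B
# configuration; they are not printed anywhere in this notebook.
HIDDEN = 1536        # hidden size (assumed)
INTERMEDIATE = 8960  # MLP width (assumed)
KV_DIM = 256         # 2 key/value heads x 128 head dim (assumed)
NUM_LAYERS = 28      # matches "patched 28 layers" in the log
RANK = 16            # r passed to get_peft_model

# (in_features, out_features) of every projection listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,} | all layers: {total:,}")
```

Doubling `r` doubles this count, which is the main lever for trading adapter capacity against memory.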


In [12]:
training_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

Before answering, reason carefully through the problem to ensure a logical, accurate, and clinically sound response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Use evidence-based medicine and professional clinical judgment to answer the following medical question.

### Question:
{}

### Response:
<thinks>
{}
<think>
{}"""

### Advanced Medical Prompt Template & Formatting
**Purpose:** This block establishes the exact structure the model will follow during training. It uses a specialized prompt that forces the model to perform "Chain-of-Thought" (CoT) reasoning before giving a final medical answer.

**Key Components Explained:**
* **`training_prompt_style`:** This is the master template. It sets the "Persona" (Medical Expert) and provides a multi-stage structure:
    * **Instruction:** Defines the high standards of reasoning expected.
    * **Question:** The specific clinical query from your dataset.
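    * **Response:** Wraps the reasoning in tags so that it stays separate from the final clinical answer that follows.
* **Reasoning tags:** As written, the template opens the reasoning with `<thinks>` and closes it with `<think>`. DeepSeek-R1 models natively delimit their reasoning with `<think>` and `</think>`, so the template's pair is a mismatch: the model is trained to emit `<thinks>`, as the formatted sample and the inference output further down show. Using the native pair keeps the fine-tuning data aligned with what the base model already produces. Training on these traces teaches the model to write out its reasoning before answering, which makes its logic easier to inspect.
* **`EOS_TOKEN` (End of Sequence):** This special token marks where the response ends. Without it, the model never learns when to stop and tends to keep generating (often repeating itself) until it hits the token limit at inference.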
* **`formatting_prompt_fun`:** This function iterates through your dataset and merges the raw question, reasoning (`Complex_CoT`), and answer into the master template.
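
**Illustration (sketch):** The snippet below shows how a three-slot template is filled and terminated, plus a small check that the reasoning tags form a proper `<think>` ... `</think>` pair. The shortened `TEMPLATE`, the `<eos>` placeholder, and both helper functions are hypothetical names made up for this example; the notebook itself uses `training_prompt_style` and `tokenizer.eos_token`.

```python
import re

# Shortened stand-in for the training template (hypothetical, illustration only).
TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def build_example(question, reasoning, answer):
    """Fill the three slots in order and append the end-of-sequence marker."""
    return TEMPLATE.format(question, reasoning, answer) + EOS


def has_paired_think_tags(text):
    """True if text has exactly one <think> that comes before exactly one </think>."""
    opens = [m.start() for m in re.finditer(r"<think>", text)]
    closes = [m.start() for m in re.finditer(r"</think>", text)]
    return len(opens) == 1 and len(closes) == 1 and opens[0] < closes[0]


sample = build_example("What causes X?", "Consider A, then B.", "Most likely B.")
print(sample)
print(has_paired_think_tags(sample))
```

Running a check like this on one formatted row before training is a cheap way to catch a mistyped tag.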

In [13]:
EOS_TOKEN = tokenizer.eos_token

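def formatting_prompt_fun(ex):
    """Fill the training template for a batch of rows (called with batched=True)."""
    questions = ex["Question"]
    cots = ex["Complex_CoT"]
    responses = ex["Response"]
    texts = []
    for question, cot, response in zip(questions, cots, responses):
        # Append EOS so the model learns where a response ends.
        text = training_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}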

### Dataset Loading and Mapping
**Purpose:** This block pulls a specific "Reasoning" dataset from the Hugging Face Hub and applies our custom formatting function to every row, processing the rows in batches.

**Key Components Explained:**
* **load_dataset:** We are downloading the `medical-o1-reasoning-SFT` dataset. This is an "o1-style" dataset, meaning it contains complex medical questions and deep reasoning steps (Chain-of-Thought).
* **split="train[0:1000]":** Instead of loading the entire dataset (which could be huge), we are slicing it to take only the first 1,000 examples. This is perfect for a quick fine-tuning run or a proof-of-concept in Colab.
* **batched=True:** This is a performance booster. Instead of calling `formatting_prompt_fun` once per row, the library passes it batches of rows (1,000 per batch by default), each as a dictionary of lists. This cuts the per-call overhead, and it is why the function loops over lists rather than handling a single example.
* **ds.map:** This line executes the transformation. Once finished, `ds` will contain a new column called `"text"` that holds our perfectly formatted prompts, ready for the model to read.
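
**Illustration (sketch):** To make the batched calling convention concrete, the toy code below imitates it in plain Python: the mapped function receives a dictionary of lists and returns a dictionary of lists of the same length. `map_batched` and `format_batch` are hypothetical stand-ins written for this example; they are not the `datasets` implementation.

```python
def format_batch(batch):
    """Batched function: receives a dict of lists, returns a dict of lists."""
    pairs = zip(batch["Question"], batch["Response"])
    return {"text": [f"Q: {q}\nA: {a}" for q, a in pairs]}


def map_batched(rows, fn, batch_size=1000):
    """Toy batched map: slice the rows, call fn once per slice, add the new columns."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{name: col[i] for name, col in new_cols.items()}})
    return out


rows = [{"Question": f"q{i}", "Response": f"a{i}"} for i in range(5)]
mapped = map_batched(rows, format_batch, batch_size=2)
print(mapped[3])
```

With five rows and a batch size of two, `format_batch` is called three times instead of five; at 1,000 rows per batch the saving is correspondingly larger.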

In [14]:
from datasets import load_dataset
ds = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:1000]")
ds = ds.map(formatting_prompt_fun, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [15]:
ds["text"][0]

"Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\n\nBefore answering, reason carefully through the problem to ensure a logical, accurate, and clinically sound response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.\nUse evidence-based medicine and professional clinical judgment to answer the following medical question.\n\n### Question:\nGiven the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?\n\n### Response:\n<thinks>\nOkay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nBut wait, 

### Training Configuration (SFTTrainer)
**Purpose:** This block sets the "Lesson Plan" for the training process. It defines how fast the model learns, how much memory it uses, and how often it checks its progress.

**Key Hyperparameters Explained:**
* **per_device_train_batch_size = 2:** The number of examples the GPU looks at in one single step. We keep this low (2) to prevent the GPU from running out of memory.
* **gradient_accumulation_steps = 4:** This simulates a larger batch size. By accumulating 4 steps of 2 examples each, the model actually learns as if it were seeing 8 examples at once ($2 \times 4 = 8$). This makes training more stable.
* **learning_rate = 2e-4:** This is the step size of each weight update. If it's too high, training becomes unstable and the loss can diverge; if it's too low, the model learns very slowly.
* **bf16 & fp16:** These settings handle "Mixed Precision."
    * **bf16 (Bfloat16)** is a modern format for newer GPUs (like A100/L4).
    * **fp16** is the fallback for older GPUs (like T4).
    * `is_bfloat16_supported()` automatically chooses the best one for your hardware.
* **optim = "adamw_8bit":** This is a memory-saving optimizer. It stores the AdamW optimizer states in 8-bit precision instead of 32-bit, which reduces optimizer memory and helps larger models fit on smaller GPUs.
* **max_steps = 60:** This limits the training to 60 updates. With an effective batch size of 8, 60 steps cover 480 of the 1,000 examples, i.e. just under half an epoch. For a full training run, you would typically increase this or use `num_train_epochs`.
* **warmup_steps = 5:** The learning rate ramps up from near zero over the first 5 steps before reaching its full value, which reduces instability early in training.
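
**Illustration (sketch):** The combination of `warmup_steps = 5`, `max_steps = 60`, and a linear scheduler gives a learning rate that climbs for five steps and then decays in a straight line to zero. The function below is a hand-written approximation of that shape for illustration; `linear_schedule_lr` is a hypothetical name, and the trainer's own scheduler may differ in small details.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

The peak value is reached exactly at the end of warmup, so most of the 60 steps run at a learning rate below `2e-4`.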

In [17]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    train_dataset = ds,
    tokenizer = tokenizer,
    dataset_text_field = 'text',
    max_seq_length = 2048,
    dataset_num_proc = 2,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [18]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768 of 1,795,552,768 (1.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 10 | 2.4776 |
| 20 | 2.0452 |
| 30 | 1.8293 |
| 40 | 1.8074 |
| 50 | 1.8671 |
| 60 | 1.7996 |


TrainOutput(global_step=60, training_loss=1.9710369427998862, metrics={'train_runtime': 269.6008, 'train_samples_per_second': 1.78, 'train_steps_per_second': 0.223, 'total_flos': 3167443042805760.0, 'train_loss': 1.9710369427998862, 'epoch': 0.48})
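
**Worked check (sketch):** The figures in the training output above follow directly from the batch settings. The short calculation below reproduces the total batch size, the `epoch` value of 0.48, and the reported throughput from the numbers printed in the log; the variable names are chosen for this example only.

```python
# Values copied from the configuration and the training log above.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 1000
train_runtime_s = 269.6008

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples
samples_per_second = examples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(f"effective batch size: {effective_batch}")
print(f"examples seen:        {examples_seen}")
print(f"epoch fraction:       {epoch_fraction}")
print(f"samples per second:   {samples_per_second:.2f}")
print(f"steps per second:     {steps_per_second:.3f}")
```

One full pass over the 1,000 examples would need 125 steps at this batch size, so the run stops a little under halfway through the data.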

In [23]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

Before answering, reason carefully through the problem to ensure a logical, accurate, and clinically sound response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Use evidence-based medicine and professional clinical judgment to answer the following medical question.

### Question:
{}

### Response:
{}"""

### Inference with Medical Reasoning (o1 Style)
**Purpose:** This block tests the model's ability to perform complex clinical reasoning. By using a structured prompt, we guide the model to "think" before providing a final diagnosis.

**Key Components Explained:**
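* **Prompt Template:** The inference template (`prompt_style`) is the training template without the reasoning and answer filled in. It ends right after `### Response:`, and the empty string passed to `format` leaves that slot blank. The fine-tuned model is expected to continue from there by opening its reasoning block on its own, mimicking the behavior of reasoning models such as OpenAI's o1 or DeepSeek-R1.
* **Clinical Scenario:** We are testing the model with a specific urogynecology case: a 61-year-old woman with urine loss on coughing or sneezing, a presentation suggestive of stress incontinence.
* **Q-Tip Test:** A diagnostic procedure, mentioned in the prompt, that is used to assess urethral hypermobility.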
* **Generation Parameters:**
    * **max_new_tokens = 1200:** Reasoning models often need more space to "write out" their thoughts before reaching an answer.
    * **use_cache = True:** Reuses the attention keys and values already computed for earlier tokens instead of recomputing them at every step, which speeds up generation.
* **batch_decode:** Converts the generated token IDs back into text. The decoded string contains the prompt as well, so the cell splits on `### Response:` to print only the reasoning steps and the final judgment.
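
**Illustration (sketch):** Once the text is decoded, it is often useful to separate the reasoning from the final answer. The helper below is one way to do that with plain string operations. `split_reasoning` is a hypothetical function written for this example, and the tag names are parameters because they must match whatever tags the model was actually trained to emit.

```python
def split_reasoning(decoded, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) extracted from a decoded generation."""
    # Keep only what follows the response marker; the prompt precedes it.
    head, marker, tail = decoded.partition("### Response:")
    response = tail if marker else head

    _, found_open, after_open = response.partition(open_tag)
    if not found_open:
        # No reasoning block at all: treat everything as the answer.
        return "", response.strip()

    reasoning, found_close, answer = after_open.partition(close_tag)
    if not found_close:
        # Generation was cut off before the reasoning block was closed.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


sample = "### Question:\nQ\n\n### Response:\n<think>\nstep one\n</think>\nFinal answer."
print(split_reasoning(sample))
```

An empty answer is a useful signal that `max_new_tokens` ran out before the model finished reasoning.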

In [24]:
question = "A 61-year old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on the findings, what is your judgement?"

FastLanguageModel.for_inference(model)

inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True
    )

response = tokenizer.batch_decode(outputs)

print(response[0].split("### Response:")[1])


<thinks>
Okay, let's break this down. The woman is 61 and has a history of involuntary urine loss during activities but not at night. That's interesting because usually, when people don't leak at night, they might have a condition that's related to their skin. Skin issues can lead to leaky urination at night, especially if it's something like psoriasis or eczema.

Now, let's think about what the Q-tip test tells us. It's used to check for psoriasis. If the skin shows red and bluish patches, that's a strong indication of psoriasis. So, given the skin findings and the presence of leaky urination at night, it seems like psoriasis is a likely culprit.

This makes sense because psoriasis can explain the leaky urination. But let's also consider if there's another possibility. Could there be something else causing the leak? Maybe an allergic reaction or something else. But psoriasis is a common cause for skin issues like this, so it's a good fit.

So, based on the Q-tip test showing red and 

### Saving and Exporting the Model
**Purpose:** This block saves your fine-tuned model locally or pushes it to the Hugging Face Hub. Each line is guarded by an `if True` / `if False` switch so that only the methods you enable actually run. In the cell below, the two 16-bit lines are enabled and the rest are disabled.

**The Three Export Methods:**

1. **Merged 16-bit (`merged_16bit`):**
   * **What it does:** Merges your LoRA adapters back into the base model and saves it in high-precision (Float16).
   * **Best for:** Professional deployment using **vLLM** or for further fine-tuning. This creates a standard Hugging Face model.

2. **Merged 4-bit (`merged_4bit`):**
   * **What it does:** Merges the adapters and then compresses the model into 4-bit.
   * **Best for:** Saving space while keeping the model ready for Hugging Face's online inference engines. Note: This can slightly reduce accuracy compared to 16-bit.

3. **LoRA Adapters (`lora`):**
   * **What it does:** Saves **only** the small adapter files (typically tens to a few hundred megabytes, depending on the rank and the base model) instead of the whole multi-gigabyte model.
   * **Best for:** Quick backups and sharing your "learning" without re-uploading the entire base model. To use these, you must load the base model first and then apply these adapters.
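
**Worked check (sketch):** The size difference between these methods can be estimated from the parameter counts in the training log. The calculation below assumes that the reported total of 1,795,552,768 parameters includes the adapter weights, and that file size is roughly parameters times bytes per parameter, ignoring metadata. Under those assumptions the merged 16-bit estimate agrees with the 3.55G `model.safetensors` shown in the save output.

```python
TOTAL_PARAMS = 1_795_552_768   # from the training log
TRAINABLE_PARAMS = 18_464_768  # LoRA adapter weights, from the training log

# ASSUMPTION: the logged total counts base weights plus adapter weights.
base_params = TOTAL_PARAMS - TRAINABLE_PARAMS


def size_gb(num_params, bytes_per_param):
    """Approximate on-disk size in decimal gigabytes (metadata ignored)."""
    return num_params * bytes_per_param / 1e9


merged_16bit_gb = size_gb(base_params, 2)          # 2 bytes per float16 weight
adapter_fp32_mb = TRAINABLE_PARAMS * 4 / 1e6       # if adapters are stored as float32
adapter_fp16_mb = TRAINABLE_PARAMS * 2 / 1e6       # if adapters are stored as float16

print(f"merged 16-bit model: {merged_16bit_gb:.2f} GB")
print(f"adapters (float32):  {adapter_fp32_mb:.1f} MB")
print(f"adapters (float16):  {adapter_fp16_mb:.1f} MB")
```

For this run the adapters come to a few dozen megabytes either way, roughly fifty to a hundred times smaller than the merged model.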

**Commands:**
* `save_pretrained_merged`: Saves the files to a folder in your current directory.
* `push_to_hub_merged`: Uploads the model directly to your Hugging Face profile (requires a `token`).


**To run a save method:**
1.  Choose **one** of the three methods (16bit, 4bit, or LoRA).
2.  Change the `False` to `True` for that specific line.


**Naming Convention:**
* **Local Path ("model"):** The folder created in this Colab session to store the files.
* **Hub Path ("hf/model"):** The destination on Hugging Face.
    * *Action Required:* Replace `hf` with your Hugging Face username and `model` with your desired repository name.
* **Token:** You must paste your Hugging Face "Write" token between the quotes `""` to allow the upload to succeed.
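
**Illustration (sketch):** As an alternative to pasting the token into the notebook, it can be read from an environment variable so that it never ends up in a saved or shared copy. The helpers below are hypothetical names written for this example; `HF_TOKEN` is assumed to be the variable you set in your session.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or raise if it is missing or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing.")
    return token


def build_repo_id(username, model_name):
    """Compose the 'username/model' Hub path described above."""
    if not username or not model_name:
        raise ValueError("Both username and model_name are required.")
    if "/" in username or "/" in model_name:
        raise ValueError("Pass the username and the model name separately.")
    return f"{username}/{model_name}"


# Intended use in the export cell (not executed here):
# model.push_to_hub_merged(build_repo_id("hf", "model"), tokenizer,
#                          save_method="merged_16bit", token=resolve_hf_token())
```

Failing early with a clear error is preferable to discovering a missing token only after the merge step has already run.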

In [25]:
# 16bit
if True: model.save_pretrained_merged("DeepSeek-R1-medical", tokenizer, save_method="merged_16bit")
if True: model.push_to_hub_merged("RomanNihal/DeepSeek-R1-medical", tokenizer, save_method="merged_16bit", token="")

# 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_4bit", token="")

# lora
if False: model.save_pretrained_merged("model", tokenizer, save_method="lora")
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token="")


config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [01:00<00:00, 60.28s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:15<00:00, 75.38s/it]


Unsloth: Merge process complete. Saved to `/content/DeepSeek-R1-medical`


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...R1-medical/tokenizer.json:   0%|          | 28.5kB / 11.4MB            


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [01:54<00:00, 114.32s/it]


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...medical/model.safetensors:   1%|1         | 50.3MB / 3.55GB            

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [02:18<00:00, 138.65s/it]


Unsloth: Merge process complete. Saved to `/content/RomanNihal/DeepSeek-R1-medical`
