<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Deep-seek-R1-MedicalSFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning DeepSeek-R1-Distill-Llama-8B

## Objective:
Adapt `DeepSeek-R1-Distill-Llama-8B` for medical chain-of-thought reasoning.

## Key Components:
- **Model:** `unsloth/DeepSeek-R1-Distill-Llama-8B`

- **Dataset:** 500 samples from `medical-o1-reasoning-SFT`
- **Tools:**
  - `Unsloth` (2x faster training)
  - 4-bit quantization
  - LoRA adapters
- **Result:** A 44-minute training run that yields concise medical reasoning with structured `<think>` outputs.

## Performance Improvement:

| **Metric**         | **Before Fine-Tuning** | **After Fine-Tuning** |
|--------------------|------------------------|-----------------------|
| **Response Length** | 450 words              | 150 words             |
| **Reasoning Style** | Verbose                | Focused               |
| **Answer Format**   | Bulleted               | Paragraph             |


### Step-by-step: fine-tuning DeepSeek-R1-Distill-Llama-8B on medical data

##  **1: Install All the Required Packages**

In [None]:
%%capture
# The '%%capture' cell magic suppresses the output of this cell (here, the verbose pip logs).

!pip install kaggle
# Installs the 'kaggle' package using pip. Assumes pip is installed and configured.

!pip install unsloth
# Installs the 'unsloth' package using pip. Similar assumption as above.

!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
# Reinstalls 'unsloth' from the current state of its GitHub repository (the latest development code, not a pinned release).
# '--force-reinstall': Forces reinstallation even if the package is already installed.
# '--no-cache-dir': Avoids caching the installation files.
# '--no-deps': Skips installing dependencies, useful if dependencies are already satisfied.
# 'git+https://github.com/unslothai/unsloth.git': GitHub repository URL from which to install the package.


## **2: Authentication in Google Colab**

In [None]:
from huggingface_hub import login
# Imports the 'login' function from the 'huggingface_hub' package to authenticate with Hugging Face.

from google.colab import userdata
# Imports 'userdata' from Google Colab, which allows access to stored secrets or credentials.

# Retrieve the Hugging Face token from Colab secrets
hf_token = userdata.get('HF_TOKEN')
# Gets the Hugging Face authentication token stored in Google Colab's 'userdata' for secure access.

# Log in to Hugging Face
login(hf_token)
# Uses the retrieved token to authenticate the user with Hugging Face's hub.
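
Outside Colab, `google.colab` cannot be imported, so the cell above fails. One possible fallback, sketched here under the assumption that the same secret names are exported as environment variables, is a small helper (`get_secret` is a hypothetical name) that tries Colab first and the environment second.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # Only importable inside Colab
    except ImportError:
        userdata = None
    if userdata is not None:
        return userdata.get(name)
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment")
    return value
```

With this helper, `hf_token = get_secret("HF_TOKEN")` would work in both settings.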

In [None]:
import wandb
# Imports the 'wandb' library, which is used for experiment tracking and logging in machine learning.

from google.colab import userdata
# Imports 'userdata' from Google Colab to access stored secrets or credentials.

# Retrieve the Weights & Biases (W&B) API token from Colab secrets
wb_token = userdata.get('wandb')
# Gets the stored W&B API key to authenticate with the W&B platform.

wandb.login(key=wb_token)
# Logs into W&B using the retrieved API key for tracking experiments.

# Initialize a new W&B run for experiment tracking
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset',  # Specifies the W&B project name
    job_type="training",  # Labels this run as a training job
    anonymous="allow"  # Allows anonymous logging if authentication isn't provided
)


## **3: Model Initialization**

In [None]:
from unsloth import FastLanguageModel
# Imports the 'FastLanguageModel' class from the 'unsloth' library, which is optimized for efficient language model training and inference.

# Define model configuration parameters
max_seq_length = 2048  # Sets the maximum sequence length for the model.
dtype = None  # Data type for model computation. None lets Unsloth auto-detect (float16 on T4/V100, bfloat16 on Ampere or newer).
load_in_4bit = True  # Enables 4-bit quantization to reduce memory usage.

# Load the pretrained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Llama-8B",  # Specifies the model to be loaded from Hugging Face Hub.
    max_seq_length=max_seq_length,  # Uses the defined max sequence length.
    dtype=dtype,  # Uses the specified data type (None means auto-detect based on the GPU).
    load_in_4bit=load_in_4bit,  # Enables 4-bit quantization if True.
    token=hf_token,  # Uses the Hugging Face authentication token to access private models if necessary.
)
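
A rough calculation shows why 4-bit loading matters on a single Colab GPU. This sketch assumes about 8 billion parameters and counts only the raw weights; quantization metadata, activations, the KV cache, and optimizer state are ignored, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # Assumption: roughly 8 billion parameters

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```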

## **4: Inference Before Fine-Tuning**

In [None]:
from unsloth import FastLanguageModel

# Define the prompt format for inference
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

# Define the medical question for inference
question = (
    "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing "
    "but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would "
    "cystometry most likely reveal about her residual volume and detrusor contractions?"
)

# Prepare the model for inference mode
FastLanguageModel.for_inference(model)

# Tokenize the input prompt and move it to the GPU for efficient processing
inputs = tokenizer(
    [prompt_style.format(question, "")],  # Format the prompt with the question
    return_tensors="pt",  # Return PyTorch tensors
    padding=True,  # Ensure proper padding for batch processing
    truncation=True  # Prevent overly long inputs from causing issues
).to("cuda")  # Move tensors to GPU

# Generate model output based on the input prompt
outputs = model.generate(
    input_ids=inputs.input_ids,  # Input token IDs
    attention_mask=inputs.attention_mask,  # Attention mask for proper token processing
    max_new_tokens=1200,  # Limit the response length to avoid excessive output
    use_cache=True,  # Enable caching for faster inference
)

# Decode the generated output into a human-readable format
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Extract and print the response portion after "### Response:"
print(response[0].split("### Response:")[1].strip())
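
Because the prompt opens a `<think>` block, the generated text mixes reasoning with the final answer. A minimal sketch for separating the two, assuming the model closes its reasoning with `</think>`; `split_reasoning` is an illustrative helper and the sample string is made up.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Split a generated response into (reasoning, answer) around </think>."""
    reasoning, sep, answer = response.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", response.strip()
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


sample = "<think>step one, step two</think> final answer"  # Made-up example text
reasoning, answer = split_reasoning(sample)
print(answer)
```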

## **5: Dataset Preparation**

In [None]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""


In [None]:
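EOS_TOKEN = tokenizer.eos_token  # Appended to every example so the model learns where a response ends


def formatting_prompts_func(examples):
    """Build one training string per row from the question, chain of thought, and response."""
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]
    texts = []
    for question, cot, response in zip(questions, cots, responses):
        # Fill the template slots, then terminate the sequence with EOS.
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }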
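
With `batched=True`, `Dataset.map` passes the function a dictionary that maps each column name to a list of values and expects a dictionary of lists back. The toy run below illustrates that shape without a tokenizer; the shortened `template`, the `<eos>` placeholder, and `format_batch` are stand-ins for this sketch only.

```python
# Abbreviated stand-in for train_prompt_style; the real template is longer.
template = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
eos = "<eos>"  # Placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Turn a column-oriented batch into a list of training strings."""
    rows = zip(examples["Question"], examples["Complex_CoT"], examples["Response"])
    return {"text": [template.format(q, cot, resp) + eos for q, cot, resp in rows]}


batch = {
    "Question": ["Q1", "Q2"],
    "Complex_CoT": ["reasoning 1", "reasoning 2"],
    "Response": ["answer 1", "answer 2"],
}
print(format_batch(batch)["text"][0])
```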

In [None]:
from datasets import load_dataset

# Load the first 500 English training examples and add the formatted "text" column
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched=True)
dataset["text"][0]  # Inspect the first formatted example

## **6: LoRA Configuration**

In [None]:
# Apply Parameter-Efficient Fine-Tuning (PEFT) to the model using LoRA (Low-Rank Adaptation)
# This allows fine-tuning large models with fewer resources by only updating a small subset of parameters
model = FastLanguageModel.get_peft_model(
    model,  # The pre-trained model to which LoRA will be applied
    r=16,  # Rank of the low-rank matrices used in LoRA. Higher values increase capacity but also computational cost.
           # Suggested values: 8, 16, 32, 64, 128. Choose based on your task and resources.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # List of model layers to apply LoRA to.
                                                           # These are typically attention and feed-forward layers.
    lora_alpha=16,  # Scaling factor for LoRA weights. Controls the magnitude of updates.
                    # A higher value increases the impact of LoRA updates.
    lora_dropout=0,  # Dropout rate for LoRA layers. 0 is the value Unsloth's fast path is optimized for.
                     # Dropout can help prevent overfitting but is not used here.
    bias="none",  # Which bias terms to train. "none" trains no biases and is the setting Unsloth optimizes for.
    use_gradient_checkpointing="unsloth",  # Enables gradient checkpointing to save memory during training.
                                           # "unsloth" selects Unsloth's own implementation, which targets long sequences;
                                           # Unsloth reports roughly 30% lower VRAM usage.
    random_state=3407,  # Random seed for reproducibility. Ensures consistent results across runs.
    use_rslora=False,  # Whether to use Rank-Stabilized LoRA (RS-LoRA). Set to False by default.
    loftq_config=None,  # Configuration for LoftQ (if applicable). Set to None as it is not used here.
)
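
To see how small the trainable part is, the adapter size can be computed by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` weights in total. The sketch below assumes the standard Llama 3.1 8B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); these dimensions are not stated in the notebook.

```python
# Assumed Llama-3.1-8B projection shapes as (in_features, out_features).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in SHAPES.values())
    return per_layer * N_LAYERS


r, lora_alpha = 16, 16
print(f"trainable LoRA parameters: {lora_param_count(r):,}")
print(f"update scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions, `r=16` gives about 42 million trainable weights, well under 1% of the base model, and with `use_rslora=False` the adapter update is scaled by `lora_alpha / r`.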

## **7: Training Setup**

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # For a full training run, set num_train_epochs=1 and warmup_ratio instead of max_steps and warmup_steps.
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
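
The arguments above imply an effective batch size and a learning-rate curve that are easy to work out by hand. The sketch below assumes a single GPU and the usual linear-with-warmup shape (ramp from 0 to the base rate, then decay linearly to 0); `linear_schedule_lr` is an illustrative re-implementation, not the scheduler object the trainer builds.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


per_device_batch, grad_accum, max_steps, n_examples = 2, 4, 60, 500
n_devices = 1  # Assumption: a single GPU

effective_batch = per_device_batch * grad_accum * n_devices
examples_seen = effective_batch * max_steps
print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({examples_seen / n_examples:.2f} epochs)")
for step in (0, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step, 2e-4, 5, 60):.2e}")
```

With `max_steps=60` the run sees 480 of the 500 examples, so it stops just short of one full epoch.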

## **8: Start Training**

In [None]:
trainer_stats = trainer.train()
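
`trainer.train()` returns an object whose `metrics` dictionary includes `train_runtime` (in seconds) and `train_loss`. A small sketch for turning that into a readable summary; `summarize_run` is a hypothetical helper and the example values are made up.

```python
def summarize_run(metrics: dict) -> str:
    """Format the runtime and loss from a Trainer metrics dictionary."""
    minutes = metrics["train_runtime"] / 60  # train_runtime is reported in seconds
    return f"{minutes:.1f} min, final train_loss={metrics['train_loss']:.4f}"


# In the notebook: print(summarize_run(trainer_stats.metrics))
example_metrics = {"train_runtime": 120.0, "train_loss": 1.5}  # Made-up values for illustration
print(summarize_run(example_metrics))
```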

## **9: Save & Deploy**

In [None]:
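# Merge the LoRA adapters into the base weights and save a 16-bit copy locally
model.save_pretrained_merged("DeepSeek-R1-Medical-COT", tokenizer, save_method="merged_16bit")

# Push the merged model to the Hugging Face Hub (replace "username" with your own account name)
model.push_to_hub_merged("username/DeepSeek-R1-Medical-COT", tokenizer, save_method="merged_16bit")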