<a href="https://colab.research.google.com/github/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install and Upgrade Unsloth Library to Latest Nightly Version

In [None]:
%%capture
# Install the `unsloth` library using pip
!pip install unsloth

# Uninstall the current version of `unsloth` and install the latest nightly version
# directly from the GitHub repository. This ensures you have the most up-to-date code.
# Flags used:
# - `--upgrade`: Ensures the latest version is installed.
# - `--no-cache-dir`: Disables caching for a fresh installation.
# - `--no-deps`: Skips installing dependencies (useful if dependencies are already installed).
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
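
After the install cell finishes, it can help to confirm which builds actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of Unsloth.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string for `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what pip installed for the packages this notebook relies on.
for name in ("unsloth", "trl", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `unsloth` shows as not installed, the runtime may need a restart before the new package is visible.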

### Load and Configure a Pretrained FastLanguageModel with Custom Settings


In [None]:
# Import the FastLanguageModel module from the unsloth library
from unsloth import FastLanguageModel
import torch

# Define the maximum sequence length for the model (can be customized)
max_seq_length = 4096  # Choose any! RoPE Scaling is auto-supported internally.

# Specify the data type for the model (None for auto-detection, or specify Float16/Bfloat16)
dtype = None  # Float16 for Tesla T4, V100; Bfloat16 for Ampere+; None for auto-detection.

# Enable or disable 4-bit quantization to reduce memory usage
load_in_4bit = True  # Use 4-bit quantization. Set to False if not required.

# Load the pretrained model and tokenizer with the specified parameters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2b-it-bnb-4bit",  # Model name to be loaded
    max_seq_length=max_seq_length,             # Maximum sequence length
    dtype=dtype,                               # Data type
    load_in_4bit=load_in_4bit                  # Enable 4-bit quantization
)

==((====))==  Unsloth 2025.1.7: Fast Gemma patching. Transformers: 4.48.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
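
The `dtype` comment above maps GPU generations to precisions. A simplified sketch of that rule follows, assuming bfloat16 is available from CUDA compute capability 8.0 (Ampere) upward. `pick_dtype` is a hypothetical helper, not the logic Unsloth runs when `dtype = None`; on a live session the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
def pick_dtype(compute_capability: tuple) -> str:
    """Choose a half-precision format from a (major, minor) CUDA compute capability."""
    major, _minor = compute_capability
    # Assumption: Ampere (8.x) and newer support bfloat16; older cards fall back to float16.
    return "bfloat16" if major >= 8 else "float16"


print(pick_dtype((7, 5)))  # Tesla T4, as in the banner above -> float16
print(pick_dtype((8, 0)))  # A100 (Ampere) -> bfloat16
```

This agrees with the banner, which reports `CUDA: 7.5` and `Bfloat16 = FALSE` for the Tesla T4.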


### We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.7 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.
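
To see where the trainable-parameter count comes from, the arithmetic can be worked through by hand. This is a sketch under stated assumptions: the layer dimensions are taken from the published Gemma 2B configuration (they do not appear in this notebook), and each adapted linear layer is assumed to add two matrices, `A` of shape `r x in_features` and `B` of shape `out_features x r`.

```python
# Assumed Gemma 2B dimensions (from the published model config, not from this notebook).
HIDDEN = 2048
INTERMEDIATE = 16384
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 8, 1, 256
NUM_LAYERS = 18  # matches "patched 18 layers" in the log above
RANK = 16        # r passed to get_peft_model

# (in_features, out_features) for each projection listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# 1,089,536 per layer, 19,611,648 in total
```

The total matches the `Number of trainable parameters = 19,611,648` line in the trainer banner further down, which is under 1% of a model with roughly 2.5 billion parameters.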


#### Formatting Medical Reasoning Dataset for ORPOTrainer


In [None]:
# Formatting Medical Reasoning Dataset for ORPOTrainer
# This script formats the dataset by creating prompts based on the given instruction, input, and responses.
# The format follows the Alpaca prompt style, including "instruction", "input", and "response" sections.

# Define the Alpaca-style prompt template for formatting the data
alpaca_prompt = """Below is a medical scenario with an input that describes a situation or a question related to healthcare. Write a response that appropriately completes the medical reasoning request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Use the tokenizer's own End of Sequence (EOS) token so the model learns where a response stops
EOS_TOKEN = tokenizer.eos_token  # A hard-coded "<EOS>" string would be tokenized as ordinary text

def format_prompt(sample):
    # Extract the instruction, input, accepted response, and rejected response from the sample
    instruction = sample["instruction"]  # Instruction on how to approach the medical problem
    input_data = sample["Input"]
    accepted = sample["accepted"]  # The accepted (valid) reasoning response
    rejected = sample["rejected"]  # The rejected (invalid) reasoning response

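    # ORPOTrainer expects keys: prompt (formatted instruction and input), chosen (accepted response), and rejected (rejected response)
    # Create a formatted prompt using the Alpaca template, leaving the response section empty
    prompt = alpaca_prompt.format(instruction, input_data, "")
    sample["prompt"] = prompt  # Key that ORPOTrainer reads
    sample["Input"] = prompt   # Kept so later cells that display row["Input"] show the formatted prompt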

    # Add the accepted response, appending the EOS token at the end
    sample["chosen"] = accepted + EOS_TOKEN

    # Add the rejected response, appending the EOS token at the end
    sample["rejected"] = rejected + EOS_TOKEN

    return sample  # Return the formatted sample for further use

# Placeholder statement, does nothing, but ensures syntactical correctness
pass

# Example of loading and processing the dataset
from datasets import load_dataset

dataset_name = "SURESHBEEKHANI/medical-reasoning-orpo"
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=42).select(range(4200))  # Limit to 4,200 samples (4,000 train + 200 test after the split below)

# Apply the `format_prompt` function to each sample in the dataset to format them correctly
dataset = dataset.map(format_prompt)  # The `map` function applies `format_prompt` across all samples in the dataset

# Split dataset into train and test sets
dataset = dataset.train_test_split(test_size=200)


Map:   0%|          | 0/4200 [00:00<?, ? examples/s]
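
The formatting step can be tried on a single toy record without downloading the dataset. This is a minimal sketch: the shortened template, the `to_orpo_record` name, the sample text, and the literal `"<eos>"` stand-in are all illustrative assumptions. In the notebook, the EOS string should come from `tokenizer.eos_token`.

```python
TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<eos>"  # stand-in only; use tokenizer.eos_token in the real pipeline


def to_orpo_record(sample: dict) -> dict:
    """Build the prompt/chosen/rejected triple that preference trainers consume."""
    return {
        "prompt": TEMPLATE.format(sample["instruction"], sample["Input"], ""),
        "chosen": sample["accepted"] + EOS,
        "rejected": sample["rejected"] + EOS,
    }


toy = {
    "instruction": "Provide the most suitable reasoning.",
    "Input": "Hypothetical question text.",
    "accepted": "A careful answer.",
    "rejected": "A careless answer.",
}
record = to_orpo_record(toy)
print(sorted(record))  # ['chosen', 'prompt', 'rejected']
```

The prompt ends right after `### Response:` so that the chosen and rejected completions both continue from the same point.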

Let's print out some examples to see what the formatted dataset looks like.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Input', 'accepted', 'rejected', 'instruction', 'chosen'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['Input', 'accepted', 'rejected', 'instruction', 'chosen'],
        num_rows: 200
    })
})

In [None]:
import pprint

# Pick one training example to inspect (index 0 is an arbitrary choice)
row = dataset["train"][0]

# Print the instruction section
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["Input"])

# Print the accepted response section
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])

# Print the rejected response section
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])

('Below is a medical scenario with an input that describes a situation or a '
 'question related to healthcare. Write a response that appropriately '
 'completes the medical reasoning request.\n'
 '\n'
 '### Instruction:\n'
 'Given the following medical question or situation, provide the most suitable '
 'reasoning or explanation.\n'
 '\n'
 '### Input:\n'
 'In a 26-year-old G2P1 patient with gestational diabetes treated with insulin '
 'undergoing labor induction at 40 weeks gestation, with the following blood '
 'work results: fasting glucose 92 mg/dL and HbA1c 7.8%, what should be '
 'administered during labor to manage her glucose levels appropriately?\n'
 '\n'
 '### Response:\n')
("Okay, so we have a 26-year-old woman who's currently pregnant with her "
 "second child, and she's 40 weeks along. She's been diagnosed with "
 "gestational diabetes and has been managing it with insulin. Now, we're "
 "getting ready to induce labor. Right, let's look at her current blood "
 'glucose sit

In [None]:
# Enable reward modelling stats
# Import the PatchDPOTrainer helper function from the unsloth module
from unsloth import PatchDPOTrainer

# Call the patch once, before creating the trainer, to enable reward modelling statistics
PatchDPOTrainer()

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `ORPOTrainer`! More docs here: [TRL ORPO docs](https://huggingface.co/docs/trl/main/en/orpo_trainer). The configuration below caps training at `max_steps=1000`; for a single full pass over the data, set `num_train_epochs=1` and remove `max_steps`. Unsloth also supports TRL's `DPOTrainer`!
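
For intuition about what the trainer optimizes, here is a scalar sketch of the ORPO objective: the usual language-modelling loss on the chosen response plus a weighted odds-ratio penalty. It assumes each response is summarized by one average log-probability, and it treats `beta` as the weight on the odds-ratio term, which is how TRL's documentation describes it. The function names are illustrative; this is not TRL's implementation.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(nll_chosen: float, logp_chosen: float, logp_rejected: float, beta: float) -> float:
    """Sketch of ORPO: NLL on the chosen answer + beta * odds-ratio penalty."""
    diff = log_odds(logp_chosen) - log_odds(logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return nll_chosen + beta * odds_ratio_term


# The penalty shrinks when the chosen response is more likely than the rejected one.
print(orpo_loss(1.0, logp_chosen=-0.5, logp_rejected=-2.0, beta=0.2))
print(orpo_loss(1.0, logp_chosen=-2.0, logp_rejected=-0.5, beta=0.2))
```

A larger `beta` puts more weight on separating chosen from rejected responses relative to plain imitation of the chosen ones.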

In [None]:
from trl import ORPOConfig, ORPOTrainer  # Import necessary classes for configuration and training
from unsloth import is_bfloat16_supported  # Import function to check if bfloat16 is supported

# Initialize the ORPOTrainer with model, datasets, tokenizer, and optimized configuration
orpo_trainer = ORPOTrainer(
    model=model,  # Predefined model to be trained
    train_dataset=dataset["train"],  # Training dataset
    eval_dataset=dataset["test"],  # Evaluation dataset
    tokenizer=tokenizer,  # Tokenizer associated with the model
    args=ORPOConfig(  # Configuration for the trainer
        max_length=max_seq_length,  # Maximum sequence length for inputs
        max_prompt_length=max_seq_length // 2,  # Maximum length for the prompt (half of max length)
        max_completion_length=max_seq_length // 2,  # Maximum length for the completion (other half)
        per_device_train_batch_size=8,  # Increase batch size for better gradient stability
        gradient_accumulation_steps=8,  # Increase accumulation steps to effectively simulate a larger batch size
        beta=0.2,  # Slightly higher beta for regularization (tune this as needed)
        logging_steps=10,  # Log less frequently to reduce I/O overhead
        optim="adamw_8bit",  # Optimizer with 8-bit precision to save memory
        learning_rate=5e-5,  # Lower learning rate for stable training
        lr_scheduler_type="cosine",  # Cosine decay for better learning rate adjustment
        weight_decay=0.01,  # Add weight decay to prevent overfitting
        max_steps=1000,  # Increase training steps for convergence
        warmup_steps=100,  # Add a warmup phase to stabilize initial training
        fp16=not is_bfloat16_supported(),  # Use FP16 precision if bfloat16 is not supported
        bf16=is_bfloat16_supported(),  # Use bfloat16 precision if supported
        output_dir="outputs",  # Directory to save outputs such as checkpoints
        report_to="wandb",  # Report metrics to WandB for better tracking
        save_steps=50,  # Save checkpoints regularly to avoid loss of progress
        eval_strategy="steps",  # Evaluate every `eval_steps` steps (this argument was named `evaluation_strategy` in older transformers releases)
        eval_steps=50,  # Evaluate model every 50 steps
        save_total_limit=3,  # Keep only the last 3 checkpoints to save disk space
    ),
)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 500
 "-____-"     Number of trainable parameters = 19,611,648


AttributeError: 'GemmaFixedRotaryEmbedding' object has no attribute 'current_rope_size'

In [None]:
orpo_trainer.train()
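
The batch settings translate into data coverage through simple arithmetic. This sketch uses hypothetical helper names and assumes a single GPU and 4,000 training examples, as in the split above. It covers both the values in the config cell and the values reported in the recorded banner, which differ from each other.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def epochs_covered(max_steps: int, batch: int, num_examples: int) -> float:
    """How many passes over the data a given number of optimizer steps amounts to."""
    return max_steps * batch / num_examples


# Values from the ORPOConfig cell: batch 8, accumulation 8, max_steps 1000.
cfg_batch = effective_batch_size(8, 8)
print(cfg_batch, epochs_covered(1000, cfg_batch, 4000))  # 64 16.0

# Values from the recorded banner: batch 2, accumulation 4, 500 total steps.
log_batch = effective_batch_size(2, 4)
print(log_batch, epochs_covered(500, log_batch, 4000))  # 8 1.0
```

With the config as written, 1,000 steps revisit the 4,000 examples about 16 times, so lowering `max_steps` or switching to `num_train_epochs` may be worth considering.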

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
# This is a placeholder for the prompt template, presumably copied from another section of the code.

# Enable native 2x faster inference using FastLanguageModel
# This allows the model to run inference faster by optimizing internal processes for speed.
FastLanguageModel.for_inference(model)

# Prepare input data by formatting the prompt with specific instructions and input/output placeholders
inputs = tokenizer(
    [
        # Format the prompt with the given instruction, input, and an empty output for generation
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction text to guide the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # The input query/question
            "",  # Empty output to leave space for the generated response
        )
    ],
    # Return tokenized inputs in PyTorch tensor format
    return_tensors = "pt"
).to("cuda")  # Move inputs to GPU for faster computation

# Generate output based on the model's inference from the formatted input
outputs = model.generate(
    **inputs,              # Pass the tokenized inputs to the model
    max_new_tokens = 200,  # Limit the output to a maximum of 200 tokens
    use_cache = True       # Enable caching to speed up the generation process by reusing previous computations
)

# Decode the generated tokens back into text format for human-readable output
tokenizer.batch_decode(outputs)
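
`batch_decode` returns the full sequence, prompt included. A small sketch for pulling out only the generated answer follows; `extract_response` is a hypothetical helper, and the `"<eos>"` default is an assumption about how the end-of-sequence token appears in the decoded text.

```python
def extract_response(decoded: str, eos_token: str = "<eos>") -> str:
    """Return the text after the last '### Response:' marker, without the EOS token."""
    marker = "### Response:"
    # rpartition leaves the whole string in `tail` if the marker is missing.
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos_token, "").strip()


example = "### Instruction:\nExplain.\n\n### Input:\nQ\n\n### Response:\nAn answer.<eos>"
print(extract_response(example))  # An answer.
```

In the notebook this could be applied as `extract_response(tokenizer.batch_decode(outputs)[0], tokenizer.eos_token)`.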

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

In [None]:
# alpaca_prompt = Copied from above
# This is a placeholder for the prompt template, which is presumably defined elsewhere in the code.

# Enable native 2x faster inference by optimizing the model for inference tasks
# This method configures the model to use efficient inference settings, improving the processing speed.
FastLanguageModel.for_inference(model)

# Prepare the input data for the model by formatting the prompt with a specific instruction and input
inputs = tokenizer(
    [
        # Format the prompt by inserting the instruction, input question, and leave the output blank for generation
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction provided to the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # Input question or scenario
            "",  # Blank output, as the model will generate the response
        )
    ],
    return_tensors = "pt"  # Ensure the input is tokenized and returned as a PyTorch tensor
).to("cuda")  # Move the tensor to the GPU for faster processing

# Import the TextStreamer class from the transformers library
# TextStreamer prints tokens to stdout as they are generated, so long outputs appear progressively instead of all at once.
from transformers import TextStreamer

# Initialize the TextStreamer with the tokenizer to handle token-to-text conversion during generation
text_streamer = TextStreamer(tokenizer)

# Generate text from the model based on the provided inputs
# Using the TextStreamer, this will stream the generation process, allowing tokens to be decoded and displayed progressively
_ = model.generate(
    **inputs,                 # Pass the tokenized input data to the model
    streamer = text_streamer, # Use the TextStreamer to handle the output streaming
    max_new_tokens = 128      # Limit the generation to a maximum of 128 new tokens
)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
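
A quick way to confirm that only adapter files were written is to list the output directory. This is a sketch under the assumption that PEFT stores an `adapter_config.json` next to the adapter weights; the helper names are illustrative.

```python
from pathlib import Path


def describe_dir(path: str) -> dict:
    """Map each file name in `path` to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(path).iterdir()) if p.is_file()}


def has_adapter_config(path: str) -> bool:
    """True if the directory looks like a saved LoRA adapter (assumed marker file)."""
    return (Path(path) / "adapter_config.json").is_file()


# Only inspect the directory if the save cell above has already run.
if Path("lora_model").is_dir():
    print(has_adapter_config("lora_model"))
    for name, size in describe_dir("lora_model").items():
        print(f"{name}: {size / 1e6:.1f} MB")
```

The adapter weights should be far smaller than the base model, since only the LoRA matrices are stored.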

To load the LoRA adapters we just saved for inference, pass the saved directory as `model_name`:

In [None]:
# Import FastLanguageModel from the unsloth library, which provides methods for fast inference with language models
from unsloth import FastLanguageModel

# Load the pre-trained model and tokenizer using the specified configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # Specify the model name or path to the model you trained or want to use
    max_seq_length = max_seq_length,  # Set the maximum sequence length for tokenized inputs
    dtype = dtype,  # Define the data type for the model (e.g., float32, float16, etc.)
    load_in_4bit = load_in_4bit,  # If True, load the model with reduced precision (4-bit) for more efficient memory usage
)

# Enable native 2x faster inference by configuring the model for optimized inference operations
FastLanguageModel.for_inference(model)

# alpaca_prompt = You MUST copy from above!
# The alpaca_prompt should be defined elsewhere in the code or copied from previous sections.

# Prepare the input data by formatting the prompt with a specific instruction and input for the model
inputs = tokenizer(
    [
        # Format the prompt string by injecting the instruction, input scenario, and an empty output for the model to generate
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction text to guide the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # Input question or scientific scenario
            "",  # Blank output, as the model will generate the response in place of this empty string
        )
    ],
    return_tensors = "pt"  # Ensure the tokenized input is returned as a PyTorch tensor
).to("cuda")  # Move the tokenized input to the GPU for faster processing during inference

# Generate output from the model based on the tokenized input
outputs = model.generate(
    **inputs,           # Pass the tokenized inputs to the model
    max_new_tokens = 64,  # Set the maximum number of new tokens to be generated in the output
    use_cache = True     # Use the model's caching mechanism to speed up subsequent generations by reusing computations
)

# Decode the generated output tokens back into human-readable text format
tokenizer.batch_decode(outputs)  # Decode the generated tokens into a string and return the result


### Push the trained model to the Hugging Face Model Hub using the GGUF format

In [None]:
import os

# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/Gemma-2B-Medical-O1-SFT-ORPO-RLHF-Fine-Tuning",  # Target repository in "username/repo-name" form. Replace the username with your own.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # Quantization variants to export (one GGUF file per method)
    token=os.environ["HF_TOKEN"],  # Read the token from an environment variable (HF_TOKEN is an assumed name); never hard-code it. Obtain a token at https://huggingface.co/settings/tokens
)