### Install and Upgrade Unsloth Library to Latest Nightly Version

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### Load and Configure a Pretrained FastLanguageModel with Custom Settings


In [2]:
# Import the FastLanguageModel module from the unsloth library
from unsloth import FastLanguageModel
import torch

# Define the maximum sequence length for the model (can be customized)
max_seq_length = 4096  # Choose any! RoPE Scaling is auto-supported internally.

# Specify the data type for the model (None for auto-detection, or specify Float16/Bfloat16)
dtype = None  # Float16 for Tesla T4, V100; Bfloat16 for Ampere+; None for auto-detection.

# Enable or disable 4-bit quantization to reduce memory usage
load_in_4bit = True  # Use 4-bit quantization. Set to False if not required.

# Load the pretrained model and tokenizer with the specified parameters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2b-bnb-4bit",  # Model name to be loaded
    max_seq_length=max_seq_length,             # Maximum sequence length
    dtype=dtype,                               # Data type
    load_in_4bit=load_in_4bit                  # Enable 4-bit quantization
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.8: Fast Gemma patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.07G [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


generation_config.json:   0%|          | 0.00/154 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/40.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

### We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.8 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.


#### Formatting Medical Reasoning Dataset for ORPOTrainer


In [9]:
# Formatting Medical Reasoning Dataset for ORPOTrainer
# This script formats the dataset by creating prompts based on the given instruction, input, and responses.
# The format follows the Alpaca prompt style, including "instruction", "input", and "response" sections.

# Define the Alpaca-style prompt template for formatting the data
alpaca_prompt = """Below is a medical scenario with an input that describes a situation or a question related to healthcare. Write a response that appropriately completes the medical reasoning request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Define the End of Sequence (EOS) token
EOS_TOKEN = "<EOS>"  # Placeholder for EOS token, ensure this matches your tokenizer's EOS token

def format_prompt(sample):
    # Extract the instruction, input, accepted response, and rejected response from the sample
    instruction = sample["instruction"]  # Instruction on how to approach the medical problem
    input_data = sample["Input"]
    accepted = sample["accepted"]  # The accepted (valid) reasoning response
    rejected = sample["rejected"]  # The rejected (invalid) reasoning response

    # ORPOTrainer expects keys: prompt (formatted instruction and input), chosen (accepted response), and rejected (rejected response)
    # Create a formatted prompt using the Alpaca template, leaving the response section empty
    sample["Input"] = alpaca_prompt.format(instruction, input_data, "")

    # Add the accepted response, appending the EOS token at the end
    sample["chosen"] = accepted + EOS_TOKEN

    # Add the rejected response, appending the EOS token at the end
    sample["rejected"] = rejected + EOS_TOKEN

    return sample  # Return the formatted sample for further use

# Placeholder statement, does nothing, but ensures syntactical correctness
pass

# Example of loading and processing the dataset
from datasets import load_dataset

dataset_name = "SURESHBEEKHANI/medical-reasoning-orpo"
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=42).select(range(4200))  # Limit to 1000 samples for a quick demo

# Apply the `format_prompt` function to each sample in the dataset to format them correctly
dataset = dataset.map(format_prompt)  # The `map` function applies `format_prompt` across all samples in the dataset


Let's print out some examples to see how the dataset should look like

In [10]:
dataset

Dataset({
    features: ['Input', 'accepted', 'rejected', 'instruction', 'chosen'],
    num_rows: 4200
})

In [17]:
import pprint
row = dataset[1]
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["Input"])
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])
print('REJECTED: ' + '=' * 50)
pprint.pprint(row["rejected"])

('Below is a medical scenario with an input that describes a situation or a '
 'question related to healthcare. Write a response that appropriately '
 'completes the medical reasoning request.\n'
 '\n'
 '### Instruction:\n'
 'Given the following medical question or situation, provide the most suitable '
 'reasoning or explanation.\n'
 '\n'
 '### Input:\n'
 'A 58-year-old woman presents with a swelling in her right vulva accompanied '
 'by pain during walking and coitus. On pelvic examination, a mildly tender, '
 'fluctuant mass is found just outside the introitus in the right vulva, near '
 "the Bartholin's gland. Given her age and symptoms, what is the definitive "
 'treatment for this condition?\n'
 '\n'
 '### Response:\n')
("Okay, so there's a 58-year-old woman who's having this painful swelling "
 "around her right vulva. She says it's painful when she walks or during sex. "
 "That's got to be really uncomfortable. Let's see what we know from the exam. "
 "There's this tender bump 

In [12]:
# Enable reward modelling stats
# Import the PatchDPOTrainer class from the unsloth module
from unsloth import PatchDPOTrainer

# Instantiate PatchDPOTrainer to enable reward modelling statistics
PatchDPOTrainer()

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `ORPOTrainer`! More docs here: [TRL ORPO docs](https://huggingface.co/docs/trl/main/en/orpo_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [26]:
from trl import ORPOConfig, ORPOTrainer  # Importing necessary classes from trl
from unsloth import is_bfloat16_supported  # Importing function to check for bfloat16 support

orpo_trainer = ORPOTrainer(  # Initializing ORPOTrainer instance
    model=model,  # Assigning the model object
    train_dataset=dataset,  # Assigning the training dataset
    tokenizer=tokenizer,  # Assigning the tokenizer object
    args=ORPOConfig(  # Initializing ORPOConfig with training arguments
        max_length=max_seq_length,  # Setting maximum sequence length
        max_prompt_length=max_seq_length // 2,  # Setting maximum prompt length
        max_completion_length=max_seq_length // 2,  # Setting maximum completion length
        per_device_train_batch_size=1,  # Batch size per device for training
        gradient_accumulation_steps=4,  # Number of gradient accumulation steps
        beta=0.1,  # Beta parameter for ORPO
        logging_steps=1,  # Logging frequency during training
        optim="adamw_8bit",  # Optimizer type (8-bit adamw)
        lr_scheduler_type="linear",  # Learning rate scheduler type
        max_steps=30,  # Maximum training steps (for quick demo, change for full training)
        fp16=not is_bfloat16_supported(),  # Whether to use FP16 (float16) if not bfloat16 supported
        bf16=is_bfloat16_supported(),  # Whether to use BF16 (bfloat16) if supported
        output_dir="outputs",  # Output directory for model checkpoints and logs
        report_to="none",  # Reporting destination (none for no reporting)
    ),
)


Map:   0%|          | 0/4200 [00:00<?, ? examples/s]

Map:   0%|          | 0/4200 [00:00<?, ? examples/s]

In [19]:
orpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,200 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 30
 "-____-"     Number of trainable parameters = 19,611,648


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,3.0769,-0.307604,-0.962221,1.0,0.654617,-9.622208,-3.076042,-136.423187,-24.874859
2,2.7934,-0.278727,-0.971243,1.0,0.692517,-9.712435,-2.787268,-128.502686,-24.926565
3,3.2208,-0.321204,-0.692108,1.0,0.370904,-6.921082,-3.212042,-130.866257,-25.89756
4,3.205,-0.32014,-1.153894,1.0,0.833754,-11.538943,-3.201405,-118.451355,-25.475842
5,2.9804,-0.297967,-0.904319,1.0,0.606353,-9.043194,-2.979667,-129.034119,-25.772474
6,2.8697,-0.274342,-0.422061,0.5,0.147719,-4.220611,-2.743423,-80.0728,-59.113464
7,3.1236,-0.312088,-1.215499,1.0,0.903411,-12.154988,-3.120877,-124.870918,-25.650391
8,2.6344,-0.25291,-0.286387,0.5,0.033477,-2.863869,-2.529102,-82.132278,-43.489361
9,3.4312,-0.336199,-1.046396,0.75,0.710196,-10.463959,-3.361993,-95.447136,-39.933029
10,2.9697,-0.296968,-1.159624,1.0,0.862656,-11.596243,-2.969681,-120.981216,-25.623039


TrainOutput(global_step=30, training_loss=3.017614761988322, metrics={'train_runtime': 161.6462, 'train_samples_per_second': 0.742, 'train_steps_per_second': 0.186, 'total_flos': 0.0, 'train_loss': 3.017614761988322, 'epoch': 0.02857142857142857})

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [20]:
# alpaca_prompt = Copied from above
# This is a placeholder for the prompt template, presumably copied from another section of the code.

# Enable native 2x faster inference using FastLanguageModel
# This allows the model to run inference faster by optimizing internal processes for speed.
FastLanguageModel.for_inference(model)

# Prepare input data by formatting the prompt with specific instructions and input/output placeholders
inputs = tokenizer(
    [
        # Format the prompt with the given instruction, input, and an empty output for generation
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction text to guide the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # The input query/question
            "",  # Empty output to leave space for the generated response
        )
    ],
    # Return tokenized inputs in PyTorch tensor format
    return_tensors = "pt"
).to("cuda")  # Move inputs to GPU for faster computation

# Generate output based on the model's inference from the formatted input
outputs = model.generate(
    **inputs,              # Pass the tokenized inputs to the model
    max_new_tokens = 200,  # Limit the output to a maximum of 200 tokens
    use_cache = True       # Enable caching to speed up the generation process by reusing previous computations
)

# Decode the generated tokens back into text format for human-readable output
tokenizer.batch_decode(outputs)

['<bos>Below is a medical scenario with an input that describes a situation or a question related to healthcare. Write a response that appropriately completes the medical reasoning request.\n\n### Instruction:\nGiven the following medical question or situation, provide the most suitable reasoning or explanation\n\n### Input:\nA Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.\n\n### Response:\n**a) 0.9**\n\nThe required plate spacing for a Fabry-Perot interferometer is given by the formula:\n\n$$d = \\frac{\\lambda}{4n}$$\n\nwhere:\n\n* d is the plate spacing\n* λ is the wavelength of light\n* n is the refractive index of the medium between the mirrors\n\nFor (a) 0.9, n = 1.33, and:\n\n$$d = \\frac{6328 \\times 10^{-8}}{4(1.33)} = 0.019 \\text{ mm}$$\n\n**b) 0.999**\n\

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [22]:
# alpaca_prompt = Copied from above
# This is a placeholder for the prompt template, which is presumably defined elsewhere in the code.

# Enable native 2x faster inference by optimizing the model for inference tasks
# This method configures the model to use efficient inference settings, improving the processing speed.
FastLanguageModel.for_inference(model)

# Prepare the input data for the model by formatting the prompt with a specific instruction and input
inputs = tokenizer(
    [
        # Format the prompt by inserting the instruction, input question, and leave the output blank for generation
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction provided to the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # Input question or scenario
            "",  # Blank output, as the model will generate the response
        )
    ],
    return_tensors = "pt"  # Ensure the input is tokenized and returned as a PyTorch tensor
).to("cuda")  # Move the tensor to the GPU for faster processing

# Import the TextStreamer class from the transformers library
# The TextStreamer is used to stream the output during generation, allowing more efficient generation of long text outputs.
from transformers import TextStreamer

# Initialize the TextStreamer with the tokenizer to handle token-to-text conversion during generation
text_streamer = TextStreamer(tokenizer)

# Generate text from the model based on the provided inputs
# Using the TextStreamer, this will stream the generation process, allowing tokens to be decoded and displayed progressively
_ = model.generate(
    **inputs,                 # Pass the tokenized input data to the model
    streamer = text_streamer, # Use the TextStreamer to handle the output streaming
    max_new_tokens = 1000      # Limit the generation to a maximum of 128 new tokens
)

<bos>Below is a medical scenario with an input that describes a situation or a question related to healthcare. Write a response that appropriately completes the medical reasoning request.

### Instruction:
Given the following medical question or situation, provide the most suitable reasoning or explanation

### Input:
A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.

### Response:
**a) 0.9**

The required plate spacing for a Fabry-Perot interferometer is given by the formula:

$$d = \frac{\lambda}{4n}$$

where:

* d is the plate spacing
* λ is the wavelength of light
* n is the refractive index of the medium between the mirrors

For (a) 0.9, n = 1.33, and:

$$d = \frac{6328 \times 10^{-8}}{4(1.33)} = 0.019 \text{ mm}$$

**b) 0.999**

The required plate spacing for a 

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [23]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference

In [24]:
# Import FastLanguageModel from the unsloth library, which provides methods for fast inference with language models
from unsloth import FastLanguageModel

# Load the pre-trained model and tokenizer using the specified configuration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # Specify the model name or path to the model you trained or want to use
    max_seq_length = max_seq_length,  # Set the maximum sequence length for tokenized inputs
    dtype = dtype,  # Define the data type for the model (e.g., float32, float16, etc.)
    load_in_4bit = load_in_4bit,  # If True, load the model with reduced precision (4-bit) for more efficient memory usage
)

# Enable native 2x faster inference by configuring the model for optimized inference operations
FastLanguageModel.for_inference(model)

# alpaca_prompt = You MUST copy from above!
# The alpaca_prompt should be defined elsewhere in the code or copied from previous sections.

# Prepare the input data by formatting the prompt with a specific instruction and input for the model
inputs = tokenizer(
    [
        # Format the prompt string by injecting the instruction, input scenario, and an empty output for the model to generate
        alpaca_prompt.format(
            "Given the following medical question or situation, provide the most suitable reasoning or explanation",  # Instruction text to guide the model
            "A Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.",  # Input question or scientific scenario
            "",  # Blank output, as the model will generate the response in place of this empty string
        )
    ],
    return_tensors = "pt"  # Ensure the tokenized input is returned as a PyTorch tensor
).to("cuda")  # Move the tokenized input to the GPU for faster processing during inference

# Generate output from the model based on the tokenized input
outputs = model.generate(
    **inputs,           # Pass the tokenized inputs to the model
    max_new_tokens = 64,  # Set the maximum number of new tokens to be generated in the output
    use_cache = True     # Use the model's caching mechanism to speed up subsequent generations by reusing computations
)

# Decode the generated output tokens back into human-readable text format
tokenizer.batch_decode(outputs)  # Decode the generated tokens into a string and return the result


==((====))==  Unsloth 2025.1.8: Fast Gemma patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


['<bos>Below is a medical scenario with an input that describes a situation or a question related to healthcare. Write a response that appropriately completes the medical reasoning request.\n\n### Instruction:\nGiven the following medical question or situation, provide the most suitable reasoning or explanation\n\n### Input:\nA Fabry-Perot interferometer is used to resolve the mode structure of a He-Ne laser operating at 6328 Å with a frequency separation between the modes of 150 MHz. Determine the required plate spacing for cases where the reflectance of the mirrors is (a) 0.9 and (b) 0.999.\n\n### Response:\n**a) 0.9**\n\nThe required plate spacing for a Fabry-Perot interferometer is given by the formula:\n\n$$d = \\frac{\\lambda}{4n}$$\n\nwhere:\n\n* d is the plate spacing\n* λ is the wavelength of light\n* n is the refractive']

### Push the trained model to the Hugging Face Model Hub using the GGUF format

In [None]:
# Push the trained model to the Hugging Face Model Hub using the GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning",  # Specify the model repository path on Hugging Face Hub. Replace "hf" with your Hugging Face username.
    tokenizer,  # Pass the tokenizer associated with the model to ensure compatibility on the hub
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # Specify the quantization methods to apply for optimized model storage (e.g., q4_k_m, q8_0, q5_k_m)
    token="",  # Provide the Hugging Face token for authentication. Obtain a token at https://huggingface.co/settings/tokens
)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.84 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 18/18 [00:01<00:00, 10.35it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning/pytorch_model-00001-of-00002.bin...
Unsloth: Saving SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting gemma model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning into f16 GGUF format.
The output location will be /content/SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q5_K_M.gguf:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/SURESHBEEKHANI/Gemma_2B_Medical_ORPO_RLHF_Fine_Tuning
