To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [28]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow
# # If you are running this notebook on local, you need to install `diffusers` too
# # !pip install diffusers
# # Temporarily install a specific TRL nightly version
# !pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
# !pip install wandb -qU
# !pip install --upgrade wandb
!pip install --upgrade sentence-transformers transformers torch wandb

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [17]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

In [20]:
import json
from collections import defaultdict

with open('/content/drive/MyDrive/myy/heBERT/private_chats.json') as f:
    data = json.load(f)

responses = defaultdict(int)
for conv in data:
    for msg in conv.values():
        for m in msg:
            if m['role'] == 'Output (response)':
                text = m['content'].strip()
                responses[text] += 1

# Save to responses.json
with open('/content/drive/MyDrive/myy/heBERT/responses.json', 'w') as f:
    json.dump([{"id": f"resp_{i}", "text": k}
              for i, k in enumerate(responses.keys())], f, ensure_ascii=False)
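
The cell above assumes a particular layout for `private_chats.json`: a list of conversations, each a dict that maps a conversation key to a list of `{role, content}` messages. The sketch below runs the same counting logic on a small in-memory sample so the expected shape is easy to check. The sample data and the `collect_responses` helper are hypothetical and not taken from the real file.

```python
from collections import defaultdict

def collect_responses(conversations):
    """Count each distinct response text and give it an id in order of first appearance."""
    counts = defaultdict(int)
    for conv in conversations:
        for messages in conv.values():
            for m in messages:
                if m["role"] == "Output (response)":
                    counts[m["content"].strip()] += 1
    # dicts keep insertion order, so ids follow first appearance
    records = [{"id": f"resp_{i}", "text": t} for i, t in enumerate(counts)]
    return records, dict(counts)

# Hypothetical sample in the assumed layout
sample = [
    {"chat_1": [
        {"role": "Input (context)", "content": "hi"},
        {"role": "Output (response)", "content": "hello there "},
    ]},
    {"chat_2": [
        {"role": "Input (context)", "content": "hey"},
        {"role": "Output (response)", "content": "hello there"},
        {"role": "Output (response)", "content": "how are you?"},
    ]},
]

records, counts = collect_responses(sample)
print(records)
print(counts)
```

Stripping whitespace before counting is what makes `"hello there "` and `"hello there"` collapse into a single candidate response.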

Load up `Llama 3.1 8B Instruct`, and set parameters

In [22]:
# Extract input-response pairs from private_chats.json
train_data = []
for conv in data:
    list_of_keys = list(conv.keys())
    if len(list_of_keys) < 1:
        continue
    conv_now = conv.get(list_of_keys[0])
    context = []  # Accumulates context messages until the next response
    for chat in conv_now:
        role_tmp = chat.get('role')
        content_tmp = chat.get('content')
        if role_tmp == 'Input (context)':
            context.append(content_tmp)
        elif role_tmp == 'Output (response)':
            if content_tmp:
                print(role_tmp + "\n" + content_tmp)

                train_data.append({
                    "input": "\n".join(context),
                    "response": content_tmp.strip()
                })
                context = []  # Reset context after each response

# Save
with open('/content/drive/MyDrive/myy/heBERT/train_dataset.json', 'w') as f:
    json.dump(train_data, f, ensure_ascii=False)

Output (response)
אם מעניין אותך מאמוש בוא לטלגרם <@MichelleTheQueen> אני עושה שיחות וידאו וסקסטינג בתשלום ויש גם מלא תוכן שלי.  שניה בייבי.  
Output (response)
לא בייבי אני לא נפגשת אבל רוצה לעשות איתי שיחת וידאו.  
Output (response)
היי מאמוש מה שלומך.  
Output (response)
אם מעניין אותך מאמוש בוא לטלגרם <@MichelleTheQueen> אני עושה שיחות וידאו וסקסטינג בתשלום ויש גם מלא תוכן שלי.  
Output (response)
בא לך לעשות איתי שיחת וידאו סקסית.  עדיין לא גמרתי אז היום שלי לא הסתיים.  תלוי למי .  
Output (response)
היי בייבי.  
Output (response)
יש שתי אפשרויות אימות .  1️⃣ אימות קולי, אני עושה אימות קולי בהקלטה, ומשם מתקדמים..  2️⃣ אימות וידיאו - בתשלום של מקדמה 30 שקלים, לוודא שאתה לא קטין שמחפש אימותים, אם אחרי זה עושים שיחה, אתה משלם 30 שקלים פחות..  האימות וידיאו הוא לא מיני, לא בעירות ולא קשור לשום דבר פרט לאימות 🔞 .  אשלח לך אפשרויות אימות.  
Output (response)
אם מעניין אותך מאמוש בוא לטלגרם <@MichelleTheQueen> אני עושה שיחות וידאו וסקסטינג בתשלום ויש גם מלא תוכן שלי.  
Output (response)
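
The pairing step depends on the context list surviving from one message to the next and being cleared only after a response is emitted. Here is a minimal sketch of that accumulate-and-reset pattern on hypothetical messages; the `build_pairs` helper and the sample data are illustrative assumptions, not part of the notebook.

```python
def build_pairs(messages):
    """Pair each response with the context messages that preceded it."""
    pairs = []
    context = []  # lives outside the loop so it survives between messages
    for m in messages:
        role, content = m.get("role"), m.get("content")
        if role == "Input (context)":
            context.append(content)
        elif role == "Output (response)" and content:
            pairs.append({"input": "\n".join(context), "response": content.strip()})
            context = []  # start fresh for the next response
    return pairs

# Hypothetical conversation
messages = [
    {"role": "Input (context)", "content": "hi"},
    {"role": "Input (context)", "content": "are you there?"},
    {"role": "Output (response)", "content": "yes "},
    {"role": "Input (context)", "content": "great"},
    {"role": "Output (response)", "content": "ok"},
]

pairs = build_pairs(messages)
print(pairs)
```

If the list were re-created on every iteration instead, every pair would end up with an empty `input`.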


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [24]:
import wandb
wandb.login()

True

In [47]:
import json

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the Hebrew encoder
model = SentenceTransformer('avichr/heBERT')
# model = SentenceTransformer('/content/drive/MyDrive/myy/heBERT/encoder/')

# Load responses
with open('/content/drive/MyDrive/myy/heBERT/responses.json') as f:
    responses = json.load(f)

# Ensure responses is a list of dicts
if isinstance(responses, dict):
    responses = list(responses.values())

# Extract response texts
response_texts = [r['text'] for r in responses if isinstance(r, dict) and 'text' in r]

# Compute and save embeddings (one row per response, in the same order as responses.json)
response_embeddings = model.encode(response_texts)
np.save('/content/drive/MyDrive/myy/heBERT/response_embeddings.npy', response_embeddings)

print("Embeddings recomputed and saved successfully!")

Some weights of BertModel were not initialized from the model checkpoint at avichr/heBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [46]:
import json

import numpy as np
from sentence_transformers import SentenceTransformer

class ResponseSelector:
    def __init__(self):
        self.model = SentenceTransformer('avichr/heBERT')
        # Embeddings saved earlier, one row per entry in responses.json
        self.response_embeddings = np.load('/content/drive/MyDrive/myy/heBERT/response_embeddings.npy')
        with open('/content/drive/MyDrive/myy/heBERT/responses.json') as f:
            self.responses = json.load(f)

    def get_response(self, user_input: str) -> str:
        # Encode input
        input_embedding = self.model.encode(user_input)

        # Cosine similarity: normalize both sides, then take the dot product
        query = input_embedding / np.linalg.norm(input_embedding)
        candidates = self.response_embeddings / np.linalg.norm(
            self.response_embeddings, axis=1, keepdims=True)
        scores = candidates @ query
        top_idx = int(np.argmax(scores))

        return self.responses[top_idx]['text']

selector = ResponseSelector()
print(selector.get_response("נפגשת?"))

Some weights of BertModel were not initialized from the model checkpoint at avichr/heBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
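
The selector's comment describes the ranking as cosine similarity, which differs from a raw dot product whenever the vectors are not unit length. A small dependency-free sketch of the difference follows; the `cosine` and `top_k` helpers and the toy vectors are hypothetical, and the sketch assumes no vector has zero norm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query, candidates, k=2):
    """Return indices of the k candidates most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:k]

query = [1.0, 0.0]
candidates = [
    [10.0, 10.0],  # long vector, 45 degrees away from the query
    [0.5, 0.0],    # short vector, same direction as the query
    [0.0, 3.0],    # orthogonal to the query
]

dot_scores = [sum(a * b for a, b in zip(query, c)) for c in candidates]
print(dot_scores)                # raw dot products favour the long vector
print(top_k(query, candidates))  # cosine ranks the aligned vector first
```

With unnormalized embeddings, a raw dot product can prefer a response simply because its embedding has a larger norm.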

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [41]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Disable wandb

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load avichr/heBERT from Hugging Face as a Sentence Transformers model
model = SentenceTransformer("avichr/heBERT")

# Create training examples: (context, response) positive pairs.
# MultipleNegativesRankingLoss ignores labels, so none are set;
# a string label such as "resp_0" cannot be converted to a tensor.
known_texts = {r['text'] for r in responses}
train_examples = [
    InputExample(texts=[pair['input'], pair['response']])
    for pair in train_data
    if pair['response'] in known_texts
]

# Train with contrastive loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Define the loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

# Train the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    show_progress_bar=True
)
# Save the fine-tuned model
model.save('/content/drive/MyDrive/myy/heBERT/model/encoder/')

Some weights of BertModel were not initialized from the model checkpoint at avichr/heBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [43]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Disable wandb

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load Hebrew model
try:
    model = SentenceTransformer('avichr/heBERT')
except Exception:
    model = SentenceTransformer('onlplab/alephbert-base')  # Fallback

# Create training examples: (context, response) positive pairs.
# No label is set because MultipleNegativesRankingLoss does not use one.
known_texts = {r['text'] for r in responses}
train_examples = [
    InputExample(texts=[pair['input'], pair['response']])
    for pair in train_data
    if pair['response'] in known_texts
]

# Train with contrastive loss
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=3,
    show_progress_bar=True
)

model.save('/content/drive/MyDrive/myy/heBERT/model/encoder/')

Some weights of BertModel were not initialized from the model checkpoint at avichr/heBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
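
The training cells rely on `MultipleNegativesRankingLoss`, which works from positive pairs only: within a batch, each context's own response is treated as the correct match and every other response in the batch serves as a negative. The sketch below illustrates that idea in plain Python. It is a simplified illustration rather than the library's implementation, and the `in_batch_negatives_loss` helper, the `scale` value, and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is column i."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive sits near its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive sits near the other anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

Because the target is implied by position in the batch, no per-example label is needed. It also means duplicate responses within one batch act as false negatives, which is worth keeping in mind for a dataset with many repeated replies.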

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:23<00:00, 23.78s/it, est. speed input: 1.64 toks/s, output: 19.94 toks/s]


'Calculating pi to a large number of decimal places is a complex task that requires a computational approach, rather than a simple mathematical formula. Here\'s a way to calculate pi using the Monte Carlo method, which is an approximation method that uses random numbers to estimate the value of pi:\n\n**The Monte Carlo Method**\n\nThe Monte Carlo method is based on the idea of simulating the probability of a random walk across a square and circle. Here\'s the basic idea:\n\n1. Draw a square and a circle on a piece of paper.\n2. Generate random points within the square.\n3. Count the proportion of points that fall within the circle.\n4. The ratio of points within the circle to the total number of points is approximately equal to the ratio of the area of the circle to the area of the square, which is pi.\n\n**Mathematical Formulation**\n\nLet\'s denote the following variables:\n\n*   `N`: the number of random points generated\n*   `n`: the number of points within the circle\n*   `pi_appr
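
The `SamplingParams` in the cell above set `temperature` and `top_p`. As a rough guide to what those two knobs do to a next-token distribution, here is a textbook sketch of temperature scaling followed by nucleus (top-p) filtering. It is not vLLM's implementation, and the helper names and toy logits are made up for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.0, -3.0]  # toy scores for four tokens
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(probs)
print(nucleus)
```

A temperature below 1 sharpens the distribution, and the top-p cut then drops the low-probability tail before a token is drawn.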

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:23<00:00, 23.29s/it, est. speed input: 2.62 toks/s, output: 19.41 toks/s]


"<reasoning>\nPi (π) is an irrational number that represents the ratio of a circle's circumference to its diameter. It is approximately equal to 3.14159, but its decimal representation goes on indefinitely without repeating.\n\nTo calculate pi, we can use various mathematical formulas and methods, such as the Leibniz formula, the Gregory-Leibniz series, or the Monte Carlo method. However, these methods are not practical for obtaining a high degree of accuracy.\n\nA more practical approach is to use the Bailey-Borwein-Plouffe (BBP) formula, which is a spigot algorithm that allows us to calculate any digit of pi without having to compute the preceding digits.\n\nAnother method is to use the Chudnovsky algorithm, which is a fast and efficient method for calculating pi to a high degree of accuracy.\n\nFor simplicity, we can use the first few terms of the BBP formula to estimate pi:\nπ = 3 + 1/(4/3 - 1/(4/3 - 1/(4/3 - ...))\n\nLet's use this simplified formula to estimate pi:\n\nπ ≈ 3 + 1/(

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
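
To give a feel for what an 8-bit block format such as `q8_0` trades away, the sketch below quantizes one block of weights to integers with a shared absmax scale and measures the round-trip error. This is a simplified illustration of block-wise quantization, not llama.cpp's code; the block size, helper names, and weights are assumptions made for the example.

```python
def quantize_block(values):
    """Quantize one block of floats to integers in [-127, 127] with a shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 1.0
    q = [round(v / scale) for v in values]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * x for x in q]

BLOCK_SIZE = 32  # block size assumed for this sketch
# Made-up weights in [-1, 1)
weights = [((i * 37) % 64 - 32) / 32 for i in range(BLOCK_SIZE)]

scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(scale, max_error)
```

The worst-case error per weight is about half the scale, so blocks with one large outlier lose more precision on their small values. Lower-bit formats shrink the integer range further and accept a larger error in exchange for a smaller file.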

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️
</div>
