# Bonus Activity - Unsloth GRPO Inference

### 1. Install Unsloth for Colab

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.7.3
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm==0.7.3

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [3]:
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams
from google.colab import drive
import re
import json
import torch
import random
from google.colab import files
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
from datasets import load_dataset, Dataset, load_from_disk
import gc

# Mount Drive
drive.mount('/content/drive')

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-06 13:21:45 __init__.py:207] Automatically detected platform cuda.
Mounted at /content/drive


In [4]:
# Define system prompt used for training
SYSTEM_PROMPT = """
Solve the following mathematical problem step by step. Your response must follow this format:

<reasoning>
Provide a clear, step-by-step solution to the problem. Include all mathematical steps and logic.
</reasoning>

<answer>
Write your final answer here concisely.
</answer>
"""

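The system prompt fixes a two-part output format, so a response can be split back into its reasoning and answer with a small parser. The following is a minimal sketch, not part of the original notebook: `ParsedResponse` and `parse_response` are hypothetical names, and the parser assumes the tags appear literally as `<reasoning>...</reasoning>` and `<answer>...</answer>`.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    """Hypothetical container for the two sections requested by the system prompt."""
    reasoning: Optional[str]
    answer: Optional[str]


def _section(tag: str, text: str) -> Optional[str]:
    """Return the stripped body of the first <tag>...</tag> block, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning and answer sections."""
    return ParsedResponse(_section("reasoning", text), _section("answer", text))


sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n\n<answer>\n4\n</answer>"
parsed = parse_response(sample)
print(parsed.answer)  # prints: 4
```

A section whose closing tag is missing (for example, a response cut off by `max_tokens`) comes back as `None`, which makes truncated or malformed outputs easy to spot.
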
In [5]:
# Load the base model (no training)
print("Loading base model...")
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
    fast_inference = True,
)
print("Base model loaded successfully!")

Loading base model...
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.7.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.49%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 8.41 GB. Also swap space = 6 GB.
INFO 05-06 13:25:19 config.py:549] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
Unsloth: vLLM Bit

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

INFO 05-06 13:25:26 cuda.py:229] Using Flash Attention backend.
INFO 05-06 13:25:26 model_runner.py:1110] Starting to load model unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit...
INFO 05-06 13:25:27 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 05-06 13:25:29 weight_utils.py:254] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

INFO 05-06 13:26:07 weight_utils.py:270] Time spent downloading weights for unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit: 38.191973 seconds


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-06 13:26:09 model_runner.py:1115] Loading model weights took 2.2405 GB
INFO 05-06 13:26:09 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-06 13:26:18 worker.py:267] Memory profiling takes 8.84 seconds
INFO 05-06 13:26:18 worker.py:267] the current vLLM instance can use total_gpu_memory (22.16GiB) x gpu_memory_utilization (0.49) = 10.97GiB
INFO 05-06 13:26:18 worker.py:267] model weights take 2.24GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 7.64GiB.
INFO 05-06 13:26:19 executor_base.py:111] # cuda blocks: 4469, # CPU blocks: 3510
INFO 05-06 13:26:19 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 34.91x
INFO 05-06 13:26:22 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error oc

Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:42<00:00,  1.37s/it]

INFO 05-06 13:27:05 model_runner.py:1562] Graph capturing finished in 43 secs, took 0.57 GiB
INFO 05-06 13:27:05 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 55.71 seconds






Base model loaded successfully!


In [6]:
# Load the GRPO-trained LoRA model separately
print("Loading GRPO-trained model...")
grpo_model, _ = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
    fast_inference = True,
    lora_path = "/content/drive/MyDrive/math_reasoning_grpo_lora_improved"
)
print("GRPO-trained model loaded successfully!")

Loading GRPO-trained model...
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.7.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 25.18%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 3.02 GB. Also swap space = 5 GB.
INFO 05-06 13:27:25 config.py:549] This model supports multiple tasks: {'embed', 'generate', 'classify', 'score', 'reward'}. Defaulting to 'generate'.
Unsloth: 

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-06 13:27:30 model_runner.py:1115] Loading model weights took 2.2093 GB
INFO 05-06 13:27:32 worker.py:267] Memory profiling takes 0.98 seconds
INFO 05-06 13:27:32 worker.py:267] the current vLLM instance can use total_gpu_memory (22.16GiB) x gpu_memory_utilization (0.25) = 5.58GiB
INFO 05-06 13:27:32 worker.py:267] model weights take 2.21GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 2.63GiB.
INFO 05-06 13:27:32 executor_base.py:111] # cuda blocks: 1539, # CPU blocks: 2925
INFO 05-06 13:27:32 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 12.02x
INFO 05-06 13:27:36 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utili

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:32<00:00,  1.42s/it]

INFO 05-06 13:28:08 model_runner.py:1562] Graph capturing finished in 33 secs, took 0.45 GiB
INFO 05-06 13:28:08 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 37.96 seconds





GRPO-trained model loaded successfully!


In [8]:
# Generate a response with either the base model or the GRPO-trained model
def generate_response(prompt, lora=False):
    """Generate a response using either the base model or GRPO-trained model."""

    # Prepare input for inference
    llm_prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

    text = base_tokenizer.apply_chat_template(llm_prompt, tokenize=False, add_generation_prompt=True)

    # Set generation parameters
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=1024,
    )

    # Choose which model to use based on the lora parameter
    model = grpo_model if lora else base_model

    # Generate response
    output = model.fast_generate(
        [text],
        sampling_params=sampling_params,
    )[0].outputs[0].text

    return output
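
The evaluation loop reads the problem text from `example['prompt'][1]['content']`, which suggests that each dataset prompt is stored as a list of chat messages rather than a plain string. Assuming that layout, a small helper can accept either form and always hand a plain string to the chat template. This is a sketch only: `user_text` and `build_messages` are hypothetical names, not part of Unsloth or the original notebook.

```python
from typing import Dict, List, Union

Message = Dict[str, str]


def user_text(prompt: Union[str, List[Message]]) -> str:
    """Return the user's question from a plain string or a list of chat messages."""
    if isinstance(prompt, str):
        return prompt
    # Assumption: the question is the most recent message with role "user".
    for message in reversed(prompt):
        if message.get("role") == "user":
            return message["content"]
    raise ValueError("prompt contains no user message")


def build_messages(prompt: Union[str, List[Message]], system_prompt: str) -> List[Message]:
    """Build the two-message chat that is passed to apply_chat_template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text(prompt)},
    ]


# Hypothetical dataset record in the assumed [system, user] layout.
record = [
    {"role": "system", "content": "Solve step by step."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(build_messages(record, "Solve step by step.")[1]["content"])  # prints: What is 2 + 2?
```

Normalizing the input this way keeps a nested message list from being stringified into the user turn.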

In [13]:
# Load test dataset
test_dataset = load_from_disk("/content/drive/MyDrive/test_math_dataset_improved")

# Select 4 random examples from the test dataset
random.seed(123)  # For reproducibility
sample_indices = random.sample(range(len(test_dataset)), 4)
sample_examples = [test_dataset[i] for i in sample_indices]

print(f"Selected example indices: {sample_indices}")

Selected example indices: [3, 17, 5, 49]
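
`random.seed(123)` reseeds Python's global generator, so any other code that draws from `random` afterwards shifts with it. One alternative, sketched here with a hypothetical helper name (`pick_indices`), is to keep a private `random.Random` instance for the selection.

```python
import random
from typing import List


def pick_indices(n_items: int, k: int, seed: int = 123) -> List[int]:
    """Pick k distinct indices from range(n_items) using a private, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(n_items), k)


# Calling twice with the same seed gives the same selection.
first = pick_indices(50, 4)
second = pick_indices(50, 4)
print(first == second)  # prints: True
```

The dataset size of 50 above is a placeholder; in the notebook it would be `len(test_dataset)`.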


In [14]:
# Evaluate each example
for i, example in enumerate(sample_examples):
    print(f"\n{'='*40} Example #{i+1} {'='*40}")
    print(f"\nProblem:\n{example['prompt'][1]['content']}")
    print("\n" + "-" * 80 + "\n")

    print(f"Expected answer:\n{example['expected_answer']}")
    print("\n" + "-" * 80 + "\n")

    print("Base model (no training):")
    # Pass only the problem text; example['prompt'] is a list of chat messages
    base_response = generate_response(prompt=example['prompt'][1]['content'], lora=False)
    print(f"{base_response}")
    print("\n" + "-" * 80 + "\n")

    print("GRPO-trained model:")
    grpo_response = generate_response(prompt=example['prompt'][1]['content'], lora=True)
    print(f"{grpo_response}")
    print("\n" + "=" * 80 + "\n\n")
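
The loop above prints the expected answer next to each response for manual inspection. For a rough automatic check, the answer strings can be compared after light normalization. The sketch below is a string-level heuristic under stated assumptions, not a math-aware grader: `normalize_answer` and `answers_match` are hypothetical names, and equivalent forms such as `1.6` and `\frac{8}{5}` are not recognized as equal.

```python
import re


def normalize_answer(answer: str) -> str:
    """Normalize an answer string for a rough, string-level comparison."""
    text = answer.strip()
    text = text.replace("$", "")               # drop inline-math delimiters
    text = re.sub(r"\s+", "", text)            # ignore all whitespace
    text = re.sub(r"^[A-Za-z]\w*=", "", text)  # drop a leading "y=" style prefix
    return text


def answers_match(predicted: str, expected: str) -> bool:
    """Return True when both answers are identical after normalization."""
    return normalize_answer(predicted) == normalize_answer(expected)


print(answers_match("$y = 1$", "1"))                  # prints: True
print(answers_match(r"\frac{8}{5}", r" \frac{8}{5}"))  # prints: True
print(answers_match("1", "2"))                        # prints: False
```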




Problem:
Given circle A: $x^2+y^2=1$, and circle B: $(x-3)^2+(y+4)^2=10$. Let P be a moving point in the plane, and tangents to circle A and circle B are drawn from P, touching the circles at points D and E, respectively. If $PE=PD$, then the minimum distance from P to the origin is \_\_\_\_\_.

--------------------------------------------------------------------------------

Expected answer:
\frac{8}{5}

--------------------------------------------------------------------------------

Base model (no training):


Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.25s/it, est. speed input: 53.75 toks/s, output: 65.56 toks/s]


<reasoning>

To solve this problem, we can start by defining the coordinates of points D and E. We know that the tangent to a circle is perpendicular to the radius drawn to the point of tangency. Let's define the coordinates of point D as (x, y) and the coordinates of point E as (x, y + 2sqrt(9 - 10)).

Since PE = PD, we can write the equation of line DE as y = -sqrt(9 - 10)(x - 0) + (0 + sqrt(9 - 10)) = -sqrt(1)(x) + sqrt(1) = -x + 1.

Next, we can find the equation of the tangent to circle A (x^2 + y^2 = 1) as y = -x + 1 and the equation of the tangent to circle B ((x - 3)^2 + (y + 4)^2 = 10) as y = -(x - 3) + 7.

We can equate these two equations to find the x-coordinate of point P: -x + 1 = -(x - 3) + 7. Solving for x, we get x = 1. Substituting this value back into either of the tangent equations, we get y = 0.

Next, we can use the distance formula to find the distance from P to the origin: d = sqrt((1 - 0)^2 + (0 - 0)^2) = sqrt(1) = 1.

</reasoning>

<answer>

The minimum distan

Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.13s/it, est. speed input: 23.25 toks/s, output: 65.72 toks/s]


<reasoning>
To solve this problem, we first need to understand the geometry involved. We have two circles, A and B, with equations $x^2 + y^2 = 1$ and $(x - 3)^2 + (y + 4)^2 = 10$, respectively. We're looking for the minimum distance from a point P to the origin, given that the tangents from P to the circles A and B have equal length.

Since the circles are given in standard form, we can find their centers and radii. Circle A has center (0, 0) and radius 1, while circle B has center (3, -4) and radius $\sqrt{10}$.

Given that the tangents from P to the circles are equal in length, we can use the fact that the lengths of tangents from an external point to a circle are equal. Therefore, $PD = PE = d$, where $d$ is the distance from P to the origin.

To minimize the distance from P to the origin, we can use the triangle inequality: $d \geq |PE| - |PD|$.

Since $PE = PD$, we have $d \geq 0$.

Now, let's consider the geometry. We can draw a line from the center of circle B to the origin. Th

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.37s/it, est. speed input: 55.61 toks/s, output: 65.22 toks/s]


<reasoning>
To determine if two vectors are parallel, we need to check if one vector is a scalar multiple of the other. In this case, we need to find the value of $y$ such that $\overrightarrow{a}$ is parallel to $\overrightarrow{b}$. We can write this as $\overrightarrow{a} = k\overrightarrow{b}$, where $k$ is a scalar.

We can write the equation $\overrightarrow{a} = k\overrightarrow{b}$ as:

$(-1, 2) = k(1, -2y)$

We can equate the corresponding components to get:

$-1 = k$ and $2 = -2ky$

Now we can solve for $y$:

$2 = -2ky$

$2 = -2k(y)$

Dividing both sides by -2:

$-1 = ky$

Dividing both sides by $k$ (we can ignore $k$ since we already know it's equal to -1):

$-1/y = -1$

Now we can solve for $y$:

$y = 1$

Therefore, $y$ must be equal to 1 in order for $\overrightarrow{a}$ to be parallel to $\overrightarrow{b}$.
</reasoning>

<answer>
$y = 1$
</answer>

--------------------------------------------------------------------------------

GRPO-trained model:


Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.79s/it, est. speed input: 64.21 toks/s, output: 65.27 toks/s]


<reasoning>
To find the value of $y$ if vectors $\overrightarrow{a}$ and $\overrightarrow{b}$ are parallel, we need to use the property that two vectors are parallel if they are scalar multiples of each other. 

We know that $\overrightarrow{a}=(-1,2)$ and $\overrightarrow{b}=(1,-2y)$. If these vectors are parallel, there exists some scalar $k$ such that $\overrightarrow{b} = k\overrightarrow{a}$. We can set up the following system of equations using the components of the vectors:
\begin{align*}
1 &= k(-1)\\
-2y &= k(2)
\end{align*}
We can solve the first equation to find that $k = -1$. Then, we can substitute this value of $k$ into the second equation:
\begin{align*}
-2y &= -1(2)\\
-2y &= -2\\
y &= 1
\end{align*}
Therefore, the value of $y$ is 1.

\end{reasoning}

<answer>
$y=1$





Problem:
The area of the closed region bounded by $xy=1$, $y=x$, and $x=3$ is (　　)
A: $2\ln3$
B: $2+\ln3$
C: $4-2\ln3$
D: $4-\ln3$

------------------------------------------------------------------------

Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.87s/it, est. speed input: 37.72 toks/s, output: 65.25 toks/s]


<reasoning>
To solve this problem, we will use the concept of integration to find the area of the closed region bounded by the curves $xy=1$, $y=x$, and $x=3$.
First, we need to find the intersection points of these curves. We can rewrite the equation $xy=1$ as $y=\frac{1}{x}$. The other two curves are given as $y=x$ and $x=3$.
We can see that the curve $y=\frac{1}{x}$ intersects with $y=x$ at $x=1$ since $1 \cdot 1 = 1$. This means that the region bounded by these curves is the area between $x=1$ and $x=3$, between the curves $y=\frac{1}{x}$ and $y=x$.
Next, we can set up an integral to represent the area of the region.
$A = \int_{1}^{3} (x-\frac{1}{x}) dx$
Now, we will integrate the function with respect to $x$.
$A = \int_{1}^{3} (x-\frac{1}{x}) dx = [\frac{x^{2}}{2} - \ln(x)] \bigg|_{1}^{3}$
Using the fundamental theorem of calculus, we can evaluate the definite integral as follows:
$A = [\frac{x^{2}}{2} - \ln(x)] \bigg|_{1}^{3} = (\frac{3^{2}}{2} - \ln(3)) - (\frac{1^{2}}{2} - \ln(

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.72s/it, est. speed input: 33.57 toks/s, output: 65.58 toks/s]


<reasoning>
To solve this problem, we first need to visualize the region bounded by the curves $xy=1$, $y=x$, and $x=3$. 

We can rewrite the equation $xy=1$ as $y = \frac{1}{x}$. This represents a hyperbola with asymptotes at $x=0$ and $y=0$. 

We can also graph the line $x=3$, which is a vertical line passing through the point (3, 0).

The region of interest is the closed area between the hyperbola $y = \frac{1}{x}$, the line $y=x$, and the line $x=3$.

To find the area of this region, we can use integration.

Let's start by finding the intersection points of the hyperbola and the line $y=x$. Setting $y = \frac{1}{x}$ equal to $y=x$, we can solve for $x$:

$\frac{1}{x} = x$

$x^2 = 1$

$x = 1$ (since $x$ cannot be negative in this case)

Now, we need to integrate the area between the curve $y = \frac{1}{x}$ and the line $y=x$ from $x=1$ to $x=3$.

The area of this region is given by:

$A = \int_1^3 \left( x - \frac{1}{x} \right) dx$

$A = \int_1^3 \left( x - \frac{1}{x} \right) dx$



Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.80s/it, est. speed input: 44.10 toks/s, output: 65.42 toks/s]


<reasoning>
To solve this problem, let's denote the total number of peaches as 'x'. 

The eldest monkey divides the peaches in a 5:4 ratio, taking the larger portion for himself. Therefore, the eldest monkey takes 5/9 of the total peaches.

Let the number of peaches the third monkey gets be 'y'. After the first division, the remaining peaches for the first two monkeys will be (4/9)x.

The second monkey then divides these remaining peaches in a 5:4 ratio, taking the larger portion for himself. Therefore, the second monkey takes 5/9 of the remaining peaches.

We know that the eldest monkey took 29 more peaches than the third monkey. So, we can set up the equation:

5/9x - y = 29

Since the second monkey took 5/9 of the remaining peaches after the first division, we can also write the equation:

5/9 * (4/9)x = 5/9y

We can simplify the second equation to get:

20/81x = 5/9y

Now we can substitute y in the first equation:

5/9x - 20/81x = 29

Multiplying the whole equation by 81 to elimina

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.25s/it, est. speed input: 36.39 toks/s, output: 65.86 toks/s]

<reasoning>
To solve this problem, we need to first determine the number of peaches each monkey took. Let's assume the total number of peaches initially was x.

The eldest monkey divided the peaches into two parts in a 5:4 ratio. Since the eldest monkey took the larger portion, he took 5/9 of the total peaches. So, the number of peaches the eldest monkey took is (5/9)x.

The remaining peaches after the eldest monkey took his share is (4/9)x.

The second monkey then divided these remaining peaches into two parts in a 5:4 ratio. The second monkey took the larger portion, which is 5/9 of the remaining peaches. Let's call the number of peaches the second monkey took y. So, we have (5/9)(4/9)x = y.

The remaining peaches after the second monkey took his share is (4/9)(4/9)x = (16/81)x.

Since the third monkey took the remaining peaches, the number of peaches the third monkey took is (16/81)x.

According to the problem, the eldest monkey took 29 more peaches than the third monkey. So, we can


