In [137]:
%%capture
!pip install unsloth
!pip install vllm
!pip install trl

In [176]:
# Import unsloth first so its patches are applied before transformers and trl load
from unsloth import FastLanguageModel
from unsloth.chat_templates import standardize_sharegpt

import pandas as pd
from datasets import load_dataset, Dataset
from transformers import TextStreamer
from trl import SFTConfig, SFTTrainer

In [78]:
reasoning_data = load_dataset("unsloth/OpenMathReasoning-mini")

reasoning_data

DatasetDict({
    cot: Dataset({
        features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
        num_rows: 19252
    })
})

In [79]:
reasoning_data["cot"][0]

{'expected_answer': '14',
 'problem_type': 'has_answer_extracted',
 'problem_source': 'aops_c4_high_school_math',
 'generation_model': 'DeepSeek-R1',
 'pass_rate_72b_tir': '0.96875',
 'problem': 'Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.',
 'generated_solution': "<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 

In [98]:
non_reasoning_data = load_dataset("mlabonne/FineTome-100k").remove_columns(['source', 'score'])

non_reasoning_data

DatasetDict({
    train: Dataset({
        features: ['conversations'],
        num_rows: 100000
    })
})

In [99]:
non_reasoning_data["train"][0]

{'conversations': [{'from': 'human',
   'value': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.'},
  {'from': 'gpt

In [107]:
fast_model, fast_tokenizer = FastLanguageModel.from_pretrained(model_name="unsloth/Qwen3-1.7B-unsloth-bnb-4bit", load_in_4bit=True)

==((====))==  Unsloth 2025.8.1: Fast Qwen3 patching. Transformers: 4.54.1. vLLM: 0.10.0.
   \\   /|    Tesla P100-PCIE-16GB. Num GPUs = 1. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 6.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
# fast_model

In [108]:
%%capture

quantized_model = FastLanguageModel.get_peft_model(model=fast_model, 
                                                   r=8, 
                                                   target_modules=[
                                                                    "q_proj",
                                                                    "k_proj",
                                                                    "v_proj",
                                                                    "o_proj",
                                                                    "gate_proj",
                                                                    "up_proj",
                                                                    "down_proj"
                                                                ], 
                                                  lora_alpha=16)

Unsloth 2025.8.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
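
The adapter size implied by `r=8` and the seven target modules can be checked by hand. The sketch below is only a back-of-the-envelope count: the layer dimensions are assumed values for Qwen3-1.7B (they are not printed anywhere in this notebook, so confirm them against `fast_model.config`), and it counts just the two low-rank matrices LoRA adds per adapted linear layer.

```python
# Assumed Qwen3-1.7B dimensions (not shown in this notebook; verify with fast_model.config)
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_LAYERS = 28        # matches "patched 28 layers" in the output above
Q_OUT = 16 * 128       # assumed attention heads * head_dim
KV_OUT = 8 * 128       # assumed key/value heads * head_dim
RANK = 8               # the r passed to get_peft_model

# (in_features, out_features) of each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,} | total: {total:,}")
```

Under these assumed dimensions the total agrees with the "Trainable parameters = 8,716,288" figure that the trainer banner reports later in the notebook.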


In [109]:
def FormatReasoningData(batch):
    all_conversations = []
    for problem, generated_solution in zip(batch["problem"], batch["generated_solution"]):
        all_conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": generated_solution}
        ])
    return {"conversations": all_conversations}

In [110]:
formatted_reasoning_data = reasoning_data.map(FormatReasoningData, batched=True).remove_columns(['expected_answer', 'problem_type', 
                                                                                                 'problem_source', 'generation_model',
                                                                                                 'pass_rate_72b_tir', 'problem',
                                                                                                 'generated_solution', 'inference_mode'])

formatted_reasoning_data

DatasetDict({
    cot: Dataset({
        features: ['conversations'],
        num_rows: 19252
    })
})

In [111]:
formatted_reasoning_data["cot"][1]

{'conversations': [{'content': 'Let \\( \\sum a_n \\) be a convergent series with \\( a_n \\geq 0 \\). Determine the nature of the series \\( \\sum \\frac{a_n^2}{1 + a_n^2} \\).',
   'role': 'user'},
  {'content': "<think>\nOkay, so I have this problem here where I need to determine the nature of the series ∑(a_n²)/(1 + a_n²) given that the series ∑a_n is convergent and each a_n is non-negative. Alright, let's start by understanding what the problem is asking. \n\nFirst, I know that ∑a_n is convergent, which means that the sequence a_n approaches zero as n approaches infinity. Because all the terms a_n are non-negative, they must be decreasing to zero, right? Wait, actually, convergence of the series only requires that the terms approach zero, but they don't necessarily have to be decreasing. For example, even if the terms bounce around a bit but still approach zero, the series can still converge if they do so quickly enough. But since the problem states a_n ≥ 0, we don't have to worry

In [112]:
non_reasoning_data["train"] = standardize_sharegpt(dataset=non_reasoning_data["train"], tokenizer=fast_tokenizer)

non_reasoning_data["train"][0]

Unsloth: Standardizing formats (num_proc=4):   0%|          | 0/100000 [00:00<?, ? examples/s]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.',
   'role': 'user'},
  {'content': 
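
Comparing this record with the raw one shows what the standardization step did: the ShareGPT keys `from`/`value` became `role`/`content`, and `human` became `user`. The sketch below reproduces that renaming on a toy conversation. It is a simplified illustration, not Unsloth's implementation, and `ROLE_MAP` and `to_role_content` are hypothetical names.

```python
# Hypothetical mapping from ShareGPT speaker names to chat-template roles
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_role_content(conversation):
    """Rename ShareGPT-style turns ({'from', 'value'}) to {'role', 'content'}."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]


sample = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]
converted = to_role_content(sample)
print(converted)
```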

In [132]:
def ApplyChatTemplateONData(sample_data):
    prep_data = []
    for i in sample_data["conversations"]:
        prep_data.append(fast_tokenizer.apply_chat_template(i, tokenize=False))

    return {"text":prep_data}

In [133]:
formatted_reasoning_data_with_template = formatted_reasoning_data.map(ApplyChatTemplateONData, batched=True).remove_columns(["conversations"])

formatted_reasoning_data_with_template

Map:   0%|          | 0/19252 [00:00<?, ? examples/s]

DatasetDict({
    cot: Dataset({
        features: ['text'],
        num_rows: 19252
    })
})
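
Each row of the new `text` column is one conversation rendered in the ChatML layout that Qwen tokenizers use (`<|im_start|>role ... <|im_end|>`). The function below is a rough, hand-written approximation of that layout for illustration only. The real rendering is done by the tokenizer's Jinja chat template, which also handles system prompts and thinking content, so `render_chatml` is a hypothetical stand-in rather than the actual template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering of a list of {'role', 'content'} messages."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nAdd the numbers.\n</think>\n\n4"},
]
rendered = render_chatml(demo)
print(rendered)
```

For training data the assistant turn is included in full. At inference time only the user turn is rendered and `add_generation_prompt=True` leaves the assistant turn open.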

In [134]:
formatted_reasoning_data_with_template["cot"][0]["text"]

"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² 

In [136]:
formatted_non_reasoning_data_with_template = non_reasoning_data.map(ApplyChatTemplateONData, batched=True).remove_columns(["conversations"])

formatted_non_reasoning_data_with_template

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 100000
    })
})

In [140]:
reasoning_train_df = formatted_reasoning_data_with_template["cot"].to_pandas()
non_reasoning_train_df = formatted_non_reasoning_data_with_template["train"].to_pandas()

print(reasoning_train_df.shape, non_reasoning_train_df.shape)

(19252, 1) (100000, 1)


In [157]:
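# Seed the non-reasoning subsample as well, so the mixed dataset is reproducible
combined_df = pd.concat([reasoning_train_df, non_reasoning_train_df.sample(10000, random_state=2025).reset_index(drop=True)]).sample(frac=1, random_state=2025).reset_index(drop=True)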

combined_df.head()

                                                text
0  <|im_start|>user\nIf \(a, b, c\) are real numb...
1  <|im_start|>user\nFind all positive integer so...
2  <|im_start|>user\nIf the sum of \(n\) consecut...
3  <|im_start|>user\nFind \(\lim_{n \to \infty} \...
4  <|im_start|>user\nFind \(\lim_{x\to 0} \frac{(...


In [161]:
combined_data = Dataset.from_pandas(df=combined_df).shuffle(seed=2025)

combined_data

Dataset({
    features: ['text'],
    num_rows: 29252
})
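
The 29,252 rows are the 19,252 reasoning examples plus the 10,000 sampled chat examples, so roughly two thirds of the mix is reasoning data. The arithmetic below makes that ratio explicit and shows how one might pick a different subsample size for a target share. `non_reasoning_rows_needed` is a hypothetical helper, and the 75% target is only an illustrative value, not a recommendation from this notebook.

```python
def non_reasoning_rows_needed(n_reasoning: int, reasoning_share: float) -> int:
    """Solve n_reasoning / (n_reasoning + n_other) = reasoning_share for n_other."""
    return round(n_reasoning * (1 - reasoning_share) / reasoning_share)


n_reasoning = 19252       # rows in the "cot" split
n_non_reasoning = 10000   # rows sampled from FineTome-100k

share = n_reasoning / (n_reasoning + n_non_reasoning)
print(f"total rows: {n_reasoning + n_non_reasoning}")
print(f"reasoning share: {share:.1%}")

# Illustrative target: how many chat rows would give a 75% reasoning mix?
print(non_reasoning_rows_needed(n_reasoning, 0.75))
```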

In [171]:
args = SFTConfig(
    output_dir="./reasoning_model",
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    max_steps = 30,
    logging_steps=1,
    # num_train_epochs = 10, 
    learning_rate = 2e-4,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 2025,
    report_to = "none", 
    logging_strategy='steps',
    save_strategy="steps"
)

sft_trainer = SFTTrainer(
    model=quantized_model,
    args=args,
    train_dataset=combined_data,
    eval_dataset=None,
    tokenizer=fast_tokenizer
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/29252 [00:00<?, ? examples/s]

In [172]:
training_history = sft_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 29,252 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 8,716,288 of 1,729,291,264 (0.50% trained)
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 29,252 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 8,716,288 of 1,729,291,264 (0.50% trained)


Step,Training Loss
1,0.7588
2,0.3647
3,0.6582
4,0.4849
5,0.4398
6,0.4688
7,0.432
8,0.3717
9,0.5611
10,0.4721
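
With `max_steps = 30` the run touches only a small slice of the dataset. The quick calculation below uses the figures from the two banners above (per-device batch size 2 in the first, 1 in the second) and a single GPU. It is plain arithmetic, not a call into the trainer.

```python
num_examples = 29252   # "Num examples" in the banner
max_steps = 30         # from SFTConfig
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1

coverage = []
for per_device_batch in (2, 1):  # the two values shown in the banners
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    coverage.append((per_device_batch, effective_batch, examples_seen))
    print(
        f"batch {per_device_batch}: effective batch {effective_batch}, "
        f"{examples_seen} examples seen ({examples_seen / num_examples:.2%} of one epoch)"
    )
```

Either way the model sees well under 1% of one epoch, so this run is best read as a smoke test of the pipeline rather than a full fine-tune.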


In [181]:
def SteamResponse(model, tokenizer, query: str, allow_thinking: bool = False):
    """Stream a chat response to `query`, with the template's thinking mode on or off."""
    messages = [{"role": "user", "content": query}]

    # enable_thinking is forwarded to the chat template, so one call covers both modes
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,  # Must add for generation
        enable_thinking=allow_thinking,
    )
    print("Text:", text)

    # Move the inputs to the model's device instead of hard-coding "cuda:0"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    _ = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # temperature/top_p/top_k only take effect when sampling
        temperature=0.6 if allow_thinking else 0.7,
        top_p=0.95 if allow_thinking else 0.8,
        top_k=20,
        streamer=TextStreamer(tokenizer, skip_prompt=True),
    )

##### Without Thinking

In [185]:
SteamResponse(model=sft_trainer.model, tokenizer=sft_trainer.tokenizer, query="If 𝑥2−5𝑥+6=0, find the values of 𝑥. Show all steps")

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


Text: <|im_start|>user
If 𝑥2−5𝑥+6=0, find the values of 𝑥. Show all steps<|im_end|>
<|im_start|>assistant
<think>

</think>


To solve the quadratic equation 𝑥2−5𝑥+6=0, we can factor the left-hand side. We look for two numbers that multiply to 6 and add up to -5. These numbers are -2 and -3.

So, we can rewrite the equation as:

(𝑥 - 2)(𝑥 - 3) = 0

Now, we can set each factor equal to zero and solve for x:

x - 2 = 0
x = 2

x - 3 = 0
x = 3

Therefore, the solutions are x = 2 and x = 3.<|im_end|>


##### With Thinking

In [186]:
SteamResponse(model=sft_trainer.model, tokenizer=sft_trainer.tokenizer, query="If 𝑥2−5𝑥+6=0, find the values of 𝑥. Show all steps", allow_thinking=True)

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


Text: <|im_start|>user
If 𝑥2−5𝑥+6=0, find the values of 𝑥. Show all steps<|im_end|>
<|im_start|>assistant

<think>

</think>

To solve the quadratic equation 𝑥2−5𝑥+6=0, we can factor the quadratic expression. Here's how:

Step 1: Find two numbers that multiply to give the constant term, which is 6, and add up to the coefficient of the x term, which is -5.

These numbers are -2 and -3 because (-2) * (-3) = 6 and (-2) + (-3) = -5.

Step 2: Rewrite the quadratic expression using these numbers:

𝒙² - 5𝒙 + 6 = (𝒙 - 2)(𝒙 - 3)

Step 3: Set each factor equal to zero and solve for x:

𝒙 - 2 = 0 → x = 2

𝒙 - 3 = 0 → x = 3

So, the solutions are x = 2 and x = 3.<|im_end|>


##### Without Thinking

In [187]:
SteamResponse(model=sft_trainer.model, tokenizer=sft_trainer.tokenizer, query="Hello how're you doing?")

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


Text: <|im_start|>user
Hello how're you doing?<|im_end|>
<|im_start|>assistant
<think>

</think>


Hello! I'm doing well, thank you. How can I assist you today?<|im_end|>


##### With Thinking

In [188]:
SteamResponse(model=sft_trainer.model, tokenizer=sft_trainer.tokenizer, query="Hello how're you doing?", allow_thinking=True)

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


Text: <|im_start|>user
Hello how're you doing?<|im_end|>
<|im_start|>assistant

<think>

</think>

Hello! I'm just a helpful AI assistant. How can I assist you today?<|im_end|>
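
In all four runs above the `<think>` block comes back empty, with or without thinking enabled. When working with such generations programmatically, it helps to separate the reasoning from the final answer. The helper below is a minimal sketch for that, assuming the decoded text contains at most one `<think>...</think>` pair. `split_thinking` is a hypothetical name and not part of Unsloth or Transformers.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(generated: str):
    """Return (thinking, answer) from a decoded generation string."""
    text = generated.replace("<|im_end|>", "")
    match = THINK_RE.search(text)
    if match is None:
        # No think block at all: everything is the answer
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer


sample = "<think>\n\n</think>\n\nHello! How can I assist you today?<|im_end|>"
thinking, answer = split_thinking(sample)
print(repr(thinking))
print(repr(answer))
```

An empty first element makes it easy to flag responses where the model skipped its reasoning.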
