<a href="https://colab.research.google.com/github/CHAITSNIPER/reasoning/blob/main/Reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Loading base models and environment

In [None]:
!pip install --upgrade datasets fsspec


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

The paper uses Qwen1.5-1.8B as the base model for the Process-supervised Reward Model (PRM) and the Outcome-supervised Reward Model (ORM).

In [None]:
model_name = "Qwen/Qwen1.5-1.8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)


In [None]:
dataset = load_dataset("gsm8k", "main")

Generating train split: 7473 examples
Generating test split: 1319 examples

Each example in the dataset has a question and a provided answer, but we are not going to use the provided answers for training.

PRM and ORM are two kinds of reward models used to improve the reasoning capabilities of large language models (LLMs) through reinforcement learning.


PRM (Process Reward Model):
A PRM is trained to evaluate the correctness of individual reasoning steps in a solution, so it provides step-by-step feedback during the reasoning process.

For every step k of the solution, it predicts the probability that step k is correct. During RL training, it provides a reward after each reasoning step.
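To make "every step of the solution" concrete, here is a minimal sketch of how a solution could be broken into the prefixes that a PRM scores one at a time. It assumes steps are separated by newlines; the helper name `solution_prefixes` and the sample solution are hypothetical and do not come from the paper.

```python
def solution_prefixes(solution: str) -> list[str]:
    """Return the prefixes ending at step 1, step 2, ..., step K."""
    # Assumption: one reasoning step per non-empty line.
    steps = [line for line in solution.split("\n") if line.strip()]
    return ["\n".join(steps[: k + 1]) for k in range(len(steps))]


solution = "Tom has 3 apples.\nHe buys 2 more.\n3 + 2 = 5"
prefixes = solution_prefixes(solution)
for k, prefix in enumerate(prefixes, start=1):
    # A PRM would be called once per prefix to score step k.
    print(k, repr(prefix))
```

Each prefix would then be tokenized and passed to the reward model, giving one score per step instead of one score per solution.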


ORM (Outcome Reward Model):
An ORM predicts whether a complete solution will lead to a correct final answer. It provides a single reward at the end of generation.

It assesses the entire solution at once rather than step by step. It is trained on pairs of solutions and their final correctness labels. For any solution prefix, it predicts the likelihood that the prefix will lead to the correct answer.


In [None]:
class ProcessRewardModel(torch.nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # Linear head mapping the final hidden state to a single logit.
        self.classifier = torch.nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        # Last layer, last position in the sequence: shape (batch, hidden_size).
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        # Score in (0, 1), shape (batch, 1).
        return torch.sigmoid(self.classifier(last_hidden))

In [None]:
class OutcomeRewardModel(torch.nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        # Linear head mapping the final hidden state to a single logit.
        self.classifier = torch.nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
        # Last layer, last position in the sequence: shape (batch, hidden_size).
        last_hidden = outputs.hidden_states[-1][:, -1, :]
        # Score in (0, 1), shape (batch, 1).
        return torch.sigmoid(self.classifier(last_hidden))

In [None]:
def apply_clip(rewards, eta=0.8):
    # Clip: min(r - eta, 0). Scores above eta earn nothing; lower scores are penalized.
    clipped = torch.minimum(rewards - eta, torch.zeros_like(rewards))
    return clipped
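As a worked example of the clipping arithmetic, the sketch below repeats the same formula on plain Python floats so that it runs without PyTorch. The function name `clip_rewards` and the three step scores are made up for illustration.

```python
def clip_rewards(step_scores, eta=0.8):
    """Apply min(score - eta, 0) to every step score."""
    return [min(score - eta, 0.0) for score in step_scores]


step_scores = [0.95, 0.60, 0.30]  # hypothetical PRM scores for three steps
clipped = clip_rewards(step_scores)
for score, reward in zip(step_scores, clipped):
    print(f"score={score:.2f} -> clipped reward={reward:.2f}")
```

The score above `eta` maps to zero, while the two lower scores become negative. With this shaping the process reward can only penalize weak steps; it never adds a positive bonus for a step.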

In [None]:
def apply_delta(rewards):
    # Delta: r[k] - r[k+1] for every entry but the last, which keeps its own value.
    delta_rewards = torch.zeros_like(rewards)
    for k in range(len(rewards) - 1):
        delta_rewards[k] = rewards[k] - rewards[k + 1]

    delta_rewards[-1] = rewards[-1]
    return delta_rewards
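A useful property of the delta shaping is that the rewards telescope: summed over all steps, they equal the score of the first step, so adding more steps cannot inflate the total. The plain-Python sketch below checks this on made-up scores; `delta_rewards` is a hypothetical stand-in that mirrors the formula in `apply_delta`.

```python
def delta_rewards(step_scores):
    """Return r[k] - r[k+1] for k < K, and r[K] for the last step."""
    deltas = [step_scores[k] - step_scores[k + 1] for k in range(len(step_scores) - 1)]
    deltas.append(step_scores[-1])
    return deltas


step_scores = [0.9, 0.7, 0.4]  # hypothetical PRM scores for three steps
deltas = delta_rewards(step_scores)
total = sum(deltas)
# The sum collapses to the first score (up to floating-point rounding).
print(abs(total - step_scores[0]) < 1e-9)
```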

Now we use PPO (Proximal Policy Optimization) to incorporate these rewards.
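For reference, the core of PPO is a clipped surrogate objective. The sketch below evaluates it for a single token on plain floats. The name `ppo_clipped_objective` and the use of an `advantage` argument are assumptions for illustration; this notebook multiplies the ratio by the reward directly instead of a separately estimated advantage.

```python
import math


def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate for one token: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy raises the token probability from 0.4 to 0.6, so the ratio is 1.5.
gain = ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=1.0)
loss = ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=-1.0)
print(round(gain, 3), round(loss, 3))
```

With a positive advantage the objective is capped at `1 + clip_eps` times the advantage, which removes the incentive to move the policy further. With a negative advantage the unclipped term is the smaller one and is kept, so harmful moves are still fully penalized.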

In [None]:
from torch.optim import AdamW
from torch.utils.data import DataLoader

class PPOTrainer:
    def __init__(self, model, prm, orm, tokenizer, kl_coef=0.1, gamma=1.0, clip_eps=0.2):
        self.model = model
        self.prm = prm
        self.orm = orm
        self.tokenizer = tokenizer
        self.kl_coef = kl_coef
        self.gamma = gamma
        self.clip_eps = clip_eps
        self.optimizer = AdamW(model.parameters(), lr=1e-6)

    def compute_rewards(self, questions, solutions):
        inputs = self.tokenizer(questions, solutions, return_tensors="pt", padding=True)
        # Success reward: 1 if the solution is judged correct, else 0.
        success_rewards = torch.tensor([self.check_correctness(q, s) for q, s in zip(questions, solutions)])

        with torch.no_grad():
            process_rewards = self.prm(inputs["input_ids"], inputs["attention_mask"])
            # outcome_rewards = self.orm(inputs["input_ids"], inputs["attention_mask"])

        process_rewards = apply_clip(process_rewards)
        # apply_delta differences along dim 0. The PRM returns one score per
        # sequence, so dim 0 is the batch here rather than the reasoning steps.
        process_rewards = apply_delta(process_rewards)

        total_rewards = process_rewards + success_rewards.unsqueeze(-1)
        return total_rewards

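    def generate_solutions(self, questions, num_samples=8):
        # num_samples is currently unused: one solution is generated per question.
        solutions = []
        for qn in questions:
            prompt = self.tokenizer(qn, return_tensors="pt")
            outputs = self.model.generate(**prompt, max_length=512, num_return_sequences=1)
            # Drop the prompt tokens and decode the single returned sequence to a string.
            completion = outputs[0, prompt["input_ids"].shape[1]:]
            solutions.append(self.tokenizer.decode(completion, skip_special_tokens=True))
        return solutions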

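    def train(self, batch):
        questions = batch["question"]
        solutions = self.generate_solutions(questions)
        rewards = self.compute_rewards(questions, solutions)  # shape (batch, 1)

        inputs = self.tokenizer(questions, solutions, return_tensors="pt", padding=True, return_attention_mask=True)
        outputs = self.model(**inputs, labels=inputs["input_ids"])

        # Position t predicts token t + 1, so drop the last position.
        log_probs = torch.log_softmax(outputs.logits[:, :-1], dim=-1)
        # Log-probability of each token that actually appears in the sequence.
        targets = inputs["input_ids"][:, 1:].unsqueeze(-1)
        token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
        mask = inputs["attention_mask"][:, 1:].float()  # ignore padding

        # Single on-policy update: the ratio equals 1 in value but still carries the gradient.
        ratios = torch.exp(token_log_probs - token_log_probs.detach())
        clipped_ratios = torch.clamp(ratios, 1 - self.clip_eps, 1 + self.clip_eps)

        surrogate = torch.min(ratios * rewards, clipped_ratios * rewards)
        policy_loss = -(surrogate * mask).sum() / mask.sum()

        with torch.no_grad():
            # The reference logits come from the same model, so the KL term is
            # zero until a separate frozen reference model is used.
            ref_logits = self.model(**inputs).logits[:, :-1]
            ref_log_probs = torch.log_softmax(ref_logits, dim=-1)

        # KL(pi || pi_ref) = sum_x pi(x) * (log pi(x) - log pi_ref(x))
        token_kl = (log_probs.exp() * (log_probs - ref_log_probs)).sum(-1)
        kl_div = (token_kl * mask).sum() / mask.sum()

        loss = policy_loss + self.kl_coef * kl_div

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()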

    def check_correctness(self, question, solution):
        # Placeholder check: looks for the literal string "correct_answer" in the solution.
        return 1 if "correct_answer" in solution else 0
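The `check_correctness` method above only looks for a literal string. One possible way to turn it into a real check is sketched below. It relies on the GSM8K convention that a reference answer ends with `#### <number>`, and it assumes (this is not guaranteed) that generated solutions follow the same convention. The helper names `extract_final_answer` and `is_correct` are hypothetical.

```python
import re


def extract_final_answer(text):
    """Return the number after '####' as a string without commas, or None."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def is_correct(solution, reference_answer):
    """Return 1 if both texts end in the same final answer, else 0."""
    predicted = extract_final_answer(solution)
    expected = extract_final_answer(reference_answer)
    return int(predicted is not None and predicted == expected)


print(is_correct("3 + 2 = 5\n#### 5", "Tom ends up with 5 apples.\n#### 5"))
print(is_correct("3 + 2 = 4\n#### 4", "Tom ends up with 5 apples.\n#### 5"))
```

Used this way, the reference answers serve only to compute the success reward; they are never shown to the policy as training targets.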


Training loop

In [None]:
prm = ProcessRewardModel(base_model)
orm = OutcomeRewardModel(base_model)
trainer = PPOTrainer(base_model, prm, orm, tokenizer)


num_epochs = 5
batch_size = 1024

for epoch in range(num_epochs):
    dataloader = DataLoader(dataset["train"], batch_size=batch_size, shuffle=True)

    for batch in dataloader:
        loss = trainer.train(batch)
        print(f"Epoch {epoch}, Loss: {loss}")

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
Both `max_new_tokens` (=2048) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
KeyboardInterrupt: 