In [1]:
import torch

In [2]:
torch.device("cuda" if torch.cuda.is_available() else "cpu")

device(type='cuda')

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F # Import F for functional operations
import json
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# --- Configuration ---
# Optional alternative: modernbert-base
MODEL_ID = "microsoft/deberta-v3-small"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 5e-5
EPOCHS = 3
BATCH_SIZE = 10
MAX_LENGTH = 128

In [4]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token




In [5]:
class RewardIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, filename, tokenizer):
        self.filename = filename
        self.tokenizer = tokenizer

    def __iter__(self):
        with open(self.filename, 'r') as f:
            for line in f:
                yield json.loads(line)

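    def __getitem__(self, idx):
        # An IterableDataset is consumed through __iter__; indexing is not supported.
        # (Returning self[idx] here would recurse forever.)
        raise NotImplementedError("__getitem__ not implemented for RewardIterableDataset")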

def collate_fn(batch):
    return batch

In [6]:
dataset = RewardIterableDataset("financial_rewards.jsonl", tokenizer)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)

In [7]:
for batch_idx, batch in enumerate(dataloader):
    print(batch)

[{'prompt': 'User: If I invest $10,000 at a 5% annual interest rate compounded annually, what will be the total amount after 2 years?\nAssistant:', 'chosen': ' Reasoning: To calculate the future value with compound interest, we use the formula A = P(1 + r)^n. Here, P = 10,000, r = 0.05, and n = 2. First year: 10,000 * 1.05 = 10,500. Second year: 10,500 * 1.05 = 11,025.\nAnswer: $11,025', 'rejected': ' You will have around $11,000. Just trust me, 5% adds up.'}, {'prompt': 'User: What does a high P/E ratio generally imply about a stock?\nAssistant:', 'chosen': " Reasoning: A Price-to-Earnings (P/E) ratio compares a company's share price to its earnings per share. A high P/E ratio suggests that investors are expecting higher earnings growth in the future, or that the stock might be overvalued compared to its peers.\nAnswer: It implies high growth expectations or potential overvaluation.", 'rejected': ' Reasoning: P/E stands for Price to Equity.\nAnswer: It means the stock is cheap and you

In [8]:
class RewardModel(nn.Module):
    def __init__(self, base_model_id):
        super().__init__()
        # Initialize as a sequence classifier with 1 label (scalar score)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            base_model_id,
            num_labels=1,
            dtype=torch.float32  # Full precision; older transformers releases name this argument torch_dtype
        )
        # Ensure pad token is set
        if self.model.config.pad_token_id is None:
            self.model.config.pad_token_id = self.model.config.eos_token_id

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1) # [batch_size]

In [9]:
reward_model = RewardModel(MODEL_ID).to(DEVICE)
reward_model.train()

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RewardModel(
  (model): DebertaV2ForSequenceClassification(
    (deberta): DebertaV2Model(
      (embeddings): DebertaV2Embeddings(
        (word_embeddings): Embedding(128100, 768, padding_idx=0)
        (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): DebertaV2Encoder(
        (layer): ModuleList(
          (0-5): 6 x DebertaV2Layer(
            (attention): DebertaV2Attention(
              (self): DisentangledSelfAttention(
                (query_proj): Linear(in_features=768, out_features=768, bias=True)
                (key_proj): Linear(in_features=768, out_features=768, bias=True)
                (value_proj): Linear(in_features=768, out_features=768, bias=True)
                (pos_dropout): Dropout(p=0.1, inplace=False)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): DebertaV2SelfOutput(
                (dense): Linear(in_features=768, o

# Model training loop

The model is trained on pairs of chosen and rejected answers.
The loss function used is the Bradley-Terry loss.

## Bradley-Terry loss function

-   The reward model assigns one scalar score to each full sequence (prompt plus chosen answer, and prompt plus rejected answer).
-   The score is the single logit produced by the classification head (`num_labels=1`); it is not an average of per-token log probabilities.
-       loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
-   First, it calculates the difference between the chosen score and the rejected score.
-   Second, it calculates the sigmoid of that difference, which is the modeled probability that the chosen answer is preferred.
-   Third, it calculates the negative log of that probability and averages it over the batch.

-   Hence the loss is the negative log-likelihood of the chosen answer being preferred over the rejected one.
-   Minimizing this loss ***maximizes*** the margin between the chosen and rejected scores.
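
The three steps listed above (difference, sigmoid, negative log) can be traced by hand without PyTorch. The sketch below is only an illustration using the standard library; the helper names `log_sigmoid` and `bradley_terry_loss` are made up for this example and are not part of the notebook's code, which relies on `F.logsigmoid`.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a single float."""
    # Branch on the sign so that math.exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_loss(chosen_scores, rejected_scores) -> float:
    """Mean of -log(sigmoid(chosen - rejected)) over all pairs."""
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)

# Equal scores: the model is undecided, so the loss equals log(2).
print(bradley_terry_loss([0.0], [0.0]))
# A wide margin in favour of the chosen answer gives a small loss.
print(bradley_terry_loss([2.0], [-2.0]))
# Scoring the rejected answer higher is penalised heavily.
print(bradley_terry_loss([-2.0], [2.0]))
```

Because the loss depends only on the difference between the two scores, adding the same constant to every score leaves it unchanged; only the ranking margin matters.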




In [10]:
optimizer = optim.AdamW(reward_model.parameters(), lr=LEARNING_RATE)
for epoch in range(EPOCHS):
    total_loss = 0
    batch_count = 0
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()

        # Prepare chosen and rejected inputs
        prompts = [item['prompt'] for item in batch]
        chosen_texts = [p + item['chosen'] for p, item in zip(prompts, batch)]
        rejected_texts = [p + item['rejected'] for p, item in zip(prompts, batch)]

        # Tokenize
        chosen_enc = tokenizer(chosen_texts, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LENGTH).to(DEVICE)
        rejected_enc = tokenizer(rejected_texts, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LENGTH).to(DEVICE)

        # Forward pass to get scores
        chosen_rewards = reward_model(chosen_enc.input_ids, chosen_enc.attention_mask)
        rejected_rewards = reward_model(rejected_enc.input_ids, rejected_enc.attention_mask)

        # --- Bradley-Terry Loss ---
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        batch_count += 1

    avg_loss = total_loss / batch_count if batch_count > 0 else 0
    print(f"Epoch {epoch+1}/{EPOCHS} | Average Loss: {avg_loss:.4f}")

print("\nTraining complete.")

Epoch 1/3 | Average Loss: 0.5987
Epoch 2/3 | Average Loss: 0.1019
Epoch 3/3 | Average Loss: 0.0247

Training complete.


In [11]:
# --- Quick Verification ---
print("\n--- Verification: Scoring a sample ---")
reward_model.eval()

# Load just one sample for verification manually
with open("financial_rewards.jsonl", 'r') as f:
    sample = json.loads(f.readline())
prompt = sample['prompt']
chosen = prompt + sample['chosen']
rejected = prompt + sample['rejected']

print(f"Prompt: {prompt}")
print(f"Chosen: {chosen}")
print(f"Rejected: {rejected}")

with torch.no_grad():
    c_enc = tokenizer(chosen, return_tensors='pt').to(DEVICE)
    r_enc = tokenizer(rejected, return_tensors='pt').to(DEVICE)

    c_score = reward_model(c_enc.input_ids, c_enc.attention_mask).item()
    r_score = reward_model(r_enc.input_ids, r_enc.attention_mask).item()

print(f"Chosen Score:   {c_score:.4f}")
print(f"Rejected Score: {r_score:.4f}")

if c_score > r_score:
    print("SUCCESS: Chosen response scored higher.")
else:
    print("FAILURE: Rejected response scored higher (expected with very little training).")



--- Verification: Scoring a sample ---
Prompt: User: If I invest $10,000 at a 5% annual interest rate compounded annually, what will be the total amount after 2 years?
Assistant:
Chosen: User: If I invest $10,000 at a 5% annual interest rate compounded annually, what will be the total amount after 2 years?
Assistant: Reasoning: To calculate the future value with compound interest, we use the formula A = P(1 + r)^n. Here, P = 10,000, r = 0.05, and n = 2. First year: 10,000 * 1.05 = 10,500. Second year: 10,500 * 1.05 = 11,025.
Answer: $11,025
Rejected: User: If I invest $10,000 at a 5% annual interest rate compounded annually, what will be the total amount after 2 years?
Assistant: You will have around $11,000. Just trust me, 5% adds up.
Chosen Score:   2.7969
Rejected Score: -4.6204
SUCCESS: Chosen response scored higher.


In [12]:
%%writefile requirements.txt
torch
transformers

Writing requirements.txt


In [13]:
import os
print(os.listdir('.'))

['.config', 'requirements.txt', 'financial_rewards.jsonl', 'sample_data']


In [15]:
%%writefile data_loader.py
import torch
import json
from torch.utils.data import Dataset, DataLoader

class RewardIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, filename, tokenizer):
        self.filename = filename
        self.tokenizer = tokenizer

    def __iter__(self):
        with open(self.filename, 'r') as f:
            for line in f:
                yield json.loads(line)

    def __getitem__(self, idx):
        # IterableDatasets typically do not implement __getitem__
        # for direct indexing. This method is a placeholder if needed
        # for compatibility with some DataLoader features that might
        # indirectly call it, but the primary mode of access is __iter__.
        # For this specific use case with DataLoader, __iter__ is sufficient.
        raise NotImplementedError("__getitem__ not implemented for RewardIterableDataset")

def collate_fn(batch):
    return batch


Writing data_loader.py


In [16]:
%%writefile reward_model.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class RewardModel(nn.Module):
    def __init__(self, base_model_id):
        super().__init__()
        # Initialize as a sequence classifier with 1 label (scalar score)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            base_model_id,
            num_labels=1,
            torch_dtype=torch.float32 # Use torch_dtype for AutoModelForSequenceClassification
        )
        # Ensure pad token is set
        if self.model.config.pad_token_id is None:
            self.model.config.pad_token_id = self.model.config.eos_token_id

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1) # [batch_size]

Writing reward_model.py


In [19]:
%%writefile config.py
import torch

MODEL_ID = "microsoft/deberta-v3-small"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 5e-5
EPOCHS = 3
BATCH_SIZE = 10
MAX_LENGTH = 128
MODEL_SAVE_PATH = "reward_model.pt"


Writing config.py


In [20]:
import os
print(os.listdir('.'))

['.config', 'README.md', 'config.py', 'train_reward_model.py', 'reward_model.py', 'data_loader.py', 'requirements.txt', 'financial_rewards.jsonl', 'sample_data']


In [21]:
%%writefile train_reward_model.py
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import json
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Import modules from our project structure
from data_loader import RewardIterableDataset, collate_fn
from reward_model import RewardModel
import config # Import config.py

# --- Configuration (using config.py) ---
MODEL_ID = config.MODEL_ID
DEVICE = config.DEVICE
LEARNING_RATE = config.LEARNING_RATE
EPOCHS = config.EPOCHS
BATCH_SIZE = config.BATCH_SIZE
MAX_LENGTH = config.MAX_LENGTH
MODEL_SAVE_PATH = config.MODEL_SAVE_PATH # New: Model save path

# --- Tokenizer Initialization ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# --- Data Loading ---
dataset = RewardIterableDataset("financial_rewards.jsonl", tokenizer)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)

# --- Model Initialization ---
reward_model = RewardModel(MODEL_ID).to(DEVICE)
reward_model.train()

# --- Optimizer ---
optimizer = optim.AdamW(reward_model.parameters(), lr=LEARNING_RATE)

# --- Training Loop ---
print("Starting training...")
for epoch in range(EPOCHS):
    total_loss = 0
    batch_count = 0
    for batch_idx, batch in enumerate(dataloader):
        optimizer.zero_grad()

        # Prepare chosen and rejected inputs
        prompts = [item['prompt'] for item in batch]
        chosen_texts = [p + item['chosen'] for p, item in zip(prompts, batch)]
        rejected_texts = [p + item['rejected'] for p, item in zip(prompts, batch)]

        # Tokenize
        chosen_enc = tokenizer(chosen_texts, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LENGTH).to(DEVICE)
        rejected_enc = tokenizer(rejected_texts, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LENGTH).to(DEVICE)

        # Forward pass to get scores
        chosen_rewards = reward_model(chosen_enc.input_ids, chosen_enc.attention_mask)
        rejected_rewards = reward_model(rejected_enc.input_ids, rejected_enc.attention_mask)

        # --- Bradley-Terry Loss ---
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        batch_count += 1

    avg_loss = total_loss / batch_count if batch_count > 0 else 0
    print(f"Epoch {epoch+1}/{EPOCHS} | Average Loss: {avg_loss:.4f}")

print("\nTraining complete.")

# --- Save the trained model ---
print(f"Saving reward model to {MODEL_SAVE_PATH}")
torch.save(reward_model.state_dict(), MODEL_SAVE_PATH)
print("Model saved successfully.")

# --- Quick Verification ---
print("\n--- Verification: Scoring a sample ---")
reward_model.eval()

# Load just one sample for verification manually
with open("financial_rewards.jsonl", 'r') as f:
    sample = json.loads(f.readline())
prompt = sample['prompt']
chosen = prompt + sample['chosen']
rejected = prompt + sample['rejected']

print(f"Prompt: {prompt}")
print(f"Chosen: {chosen}")
print(f"Rejected: {rejected}")

with torch.no_grad():
    c_enc = tokenizer(chosen, return_tensors='pt').to(DEVICE)
    r_enc = tokenizer(rejected, return_tensors='pt').to(DEVICE)

    c_score = reward_model(c_enc.input_ids, c_enc.attention_mask).item()
    r_score = reward_model(r_enc.input_ids, r_enc.attention_mask).item()

print(f"Chosen Score:   {c_score:.4f}")
print(f"Rejected Score: {r_score:.4f}")

if c_score > r_score:
    print("SUCCESS: Chosen response scored higher.")
else:
    print("FAILURE: Rejected response scored higher (expected with very little training).")


Overwriting train_reward_model.py


In [22]:
%%writefile README.md
# Reward Model Training Project

This project implements a reward model using a pre-trained transformer model (`microsoft/deberta-v3-small`) to differentiate between chosen and rejected responses based on a dataset of financial questions and answers.

## Project Structure
- `requirements.txt`: Lists the Python dependencies.
- `README.md`: Project documentation.
- `config.py`: Centralizes all configuration parameters for the training process.
- `data_loader.py`: Contains the `RewardIterableDataset` and `collate_fn` for loading and processing data.
- `reward_model.py`: Defines the `RewardModel` architecture.
- `train_reward_model.py`: Script to train the reward model and save the trained model.

## Setup and Installation
1. Clone this repository (if applicable).
2. Create a virtual environment (optional but recommended):
   `python -m venv venv`
   `source venv/bin/activate` (on Linux/macOS)
   `.\venv\Scripts\activate` (on Windows)
3. Install the required packages:
   `pip install -r requirements.txt`

## Configuration
All training parameters are defined in `config.py`, including `MODEL_ID`, `DEVICE`, `LEARNING_RATE`, `EPOCHS`, `BATCH_SIZE`, `MAX_LENGTH`, and `MODEL_SAVE_PATH`.

## Data
The model is trained on a JSONL file named `financial_rewards.jsonl`, which contains entries with `prompt`, `chosen`, and `rejected` fields.
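
Before training, it can help to confirm that every line of the file carries the three expected fields. The snippet below is a minimal sketch of such a check, not part of the project's scripts: the helper name `parse_preference_lines` and the example record are hypothetical, and only the field names `prompt`, `chosen`, and `rejected` come from the dataset description above.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def parse_preference_lines(lines):
    """Yield one dict per non-empty JSONL line, checking the required fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield record

# Hypothetical record, shaped like the entries in financial_rewards.jsonl.
example = ['{"prompt": "User: What is 2 + 2?\\nAssistant:", "chosen": " 4", "rejected": " 5"}']
for record in parse_preference_lines(example):
    # The training script scores prompt + chosen and prompt + rejected.
    print(record["prompt"] + record["chosen"])
```

To check the real file, pass an open file handle instead of the list, for example `parse_preference_lines(open("financial_rewards.jsonl"))`.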

## Training
To train the reward model, run the `train_reward_model.py` script:
`python train_reward_model.py`

The trained model will be saved to the path specified in `config.MODEL_SAVE_PATH` (default: `reward_model.pt`).

## Model Details
- **Base Model**: `microsoft/deberta-v3-small` (configured in `config.py`)
- **Loss Function**: Bradley-Terry loss
- **Optimizer**: AdamW
- **Learning Rate**: 5e-5 (configured in `config.py`)
- **Epochs**: 3 (configured in `config.py`)
- **Batch Size**: 10 (configured in `config.py`)
- **Max Length**: 128 (configured in `config.py`)

Overwriting README.md


In [23]:
import os
print(os.listdir('.'))

['.config', 'README.md', 'config.py', 'train_reward_model.py', 'reward_model.py', 'data_loader.py', 'requirements.txt', 'financial_rewards.jsonl', 'sample_data']
