## Setting up the Stage

Once the required libraries are installed, start by importing them.

In [1]:
# Import necessary libraries
import logging
import os
import sys
import re
import math
from dataclasses import dataclass, field
from typing import Optional

# Import PyTorch and Hugging Face Transformers
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    TrainerCallback,
    TrainerControl,
    TrainerState,
)

# Import dataset utilities
from datasets import load_dataset

# Import libraries from TRL (Transformer Reinforcement Learning)
from trl import (
    GRPOConfig,
    SFTTrainer,
)

# Import math-related utilities
from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify



## Choosing our Base Model

The DeepSeek team chose DeepSeek-V3 as the base model for R1 Zero and R1, but at **685 GB 💀 in size** it is well out of our reach.

To keep it simple, we will use a much smaller base model, such as [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (0.9 GB in size). Note that the cell below actually loads `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` and leaves the Qwen option commented out. If your GPU has enough memory to load unquantized LLMs, you can go for a bigger model, such as [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).

Let’s take a look at some of the specifications of our base model:

In [2]:
# MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
# MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
OUTPUT_DIR = "data/Qwen-GRPO-training" # For saving our trained model

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize tokenizer with chat template
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right"
)

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")
print(f"Pad token: {tokenizer.pad_token}")
print(f"EOS token: {tokenizer.eos_token}")

Vocabulary size: 151665
Model max length: 16384
Pad token: <｜end▁of▁sentence｜>
EOS token: <｜end▁of▁sentence｜>


That is some basic information about the model's tokenizer. Next, let’s look at the total number of parameters in our base model.

In [3]:
# Initialize base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

print(f"Model parameters: {model.num_parameters():,}")

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Model parameters: 1,777,088,000
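
The parameter count printed above also gives a rough lower bound on the memory needed just to hold the weights. The sketch below is a back-of-the-envelope estimate only: `estimate_weight_memory_gb` is a hypothetical helper (not part of any library), and it ignores activations, the KV cache, gradients, and optimizer state, all of which add to the real footprint during training.

```python
def estimate_weight_memory_gb(num_parameters: int, bytes_per_param: int = 2) -> float:
    """Approximate memory (in GB, 10^9 bytes) needed to store the weights alone."""
    return num_parameters * bytes_per_param / 1e9


# Parameter count reported by model.num_parameters() in the output above
params = 1_777_088_000

bf16_gb = estimate_weight_memory_gb(params, bytes_per_param=2)  # bfloat16: 2 bytes per weight
fp32_gb = estimate_weight_memory_gb(params, bytes_per_param=4)  # float32: 4 bytes per weight

print(f"bf16 weights: ~{bf16_gb:.2f} GB, fp32 weights: ~{fp32_gb:.2f} GB")
```

Loading in `torch.bfloat16`, as the cell above does, halves the weight memory relative to float32.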


That is about 1.78B parameters. Let’s print a simple response from the model and then move on to the next step.

In [4]:
import json
import random

def get_random_half_dict(original_dict):
    """Return a random subset of the dictionary.

    Despite the name, this keeps one tenth of the items (len // 10), not half.
    """
    # Number of items to select (one tenth of the dictionary length)
    num_items = len(original_dict) // 10

    # Convert dictionary items to a list so they can be sampled
    items = list(original_dict.items())

    # Randomly select the items without replacement
    selected_items = random.sample(items, num_items)

    # Convert back to a dictionary
    return dict(selected_items)
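
Because the helper above draws from the global `random` state, each run selects a different subset. If reproducible subsets matter, one option is to pass an explicit seed. The following is a minimal sketch under that assumption; `sample_dict_fraction`, its `fraction` and `seed` parameters, and the toy dictionary are all hypothetical and not part of the original notebook.

```python
import random


def sample_dict_fraction(original, fraction=0.1, seed=None):
    """Return a random subset holding roughly `fraction` of the items (rounded down)."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    num_items = int(len(original) * fraction)
    selected = rng.sample(list(original.items()), num_items)
    return dict(selected)


# Toy data standing in for the real dataset
toy = {f"mol_{i}": i for i in range(50)}

subset_a = sample_dict_fraction(toy, fraction=0.1, seed=0)
subset_b = sample_dict_fraction(toy, fraction=0.1, seed=0)

print(len(subset_a), subset_a == subset_b)
```

With the same seed, both calls return the same subset, which makes training runs easier to compare.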

In [5]:
# Check CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move model to the appropriate device
model.to(device)

# Test basic inference
def test_model_inference(user_input: str):
    """Test basic model inference with the loaded model and tokenizer."""
    messages = [
        # Alternative R1-style system prompt (disabled):
        # "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>."
        {"role": "system", "content": "You are a decision maker chemist that predicts logD for a given small molecule based on its SMILES."},
        {"role": "user", "content": user_input}
    ]

    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize and generate
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.7
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def test_model_inference_v2(text: str):
    """Generate from a raw prompt string, without applying the chat template."""
    # Tokenize and generate
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100000,
        do_sample=True,
        # temperature=0.7
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

Using device: cuda
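
Both helpers above decode `outputs[0]` in full. For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded response repeats the prompt. A minimal sketch of keeping only the new tokens is shown below; `strip_prompt_tokens` is a hypothetical helper and the token IDs are made-up placeholders. In the notebook, the equivalent would be slicing `outputs[0]` by `inputs["input_ids"].shape[1]` before decoding.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens that follow the prompt in a generated sequence."""
    return output_ids[prompt_length:]


# Placeholder token IDs: a 3-token prompt followed by 3 generated tokens
prompt_ids = [11, 22, 33]
generated_ids = prompt_ids + [44, 55, 66]

new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)  # [44, 55, 66]
```

The same slice works on a 1-D tensor, since tensors support the same indexing as lists here.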


In [9]:
# # {'LogD': 1.5, 'KSOL': 340.0, 'MLM': nan, 'MDR1-MDCKII': 1.6, 'HLM': nan}
# # Test the model
# prompt = f"""What are the LogD, KSOL, MLM, MDR1-MDCKII, HLM for this small molecule CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1?
# Here is the scientific publication I found that may help you:
# Multi-Parameter Optimization: Identifying high quality compounds with a balance of properties
# A successful drug that passes the hurdles of clinical trials to gain approval and a strong market position must exhibit a delicate balance of biological and physicochemical properties. Such a compound must, of course, be potent against its intended physiological target(s); however, it must also have appropriate pharmacokinetics to reach the site of the target at a sufficiently high concentration and for an appropriate duration via the intended route of administration. Furthermore, for the compound to be safely administered, it must avoid unintended side-effects, drug-drug interactions and non-specific or idiosyncratic toxicities at the therapeutic dose. The goal of drug discovery is to identify a successful compound as efficiently as possible. But, as the history of drug discovery has proved, this is a challenge of significant proportions [1].
# This task is made even more difficult by the fact that, in drug discovery, data on the behavior of the compound in the ultimate target patient population, i.e. humans, is not available. This has led to the development of a plethora of in silico, in vitro, and in vivo animal models from which we can (hopefully) infer the likely in vivo efficacy, disposition and safety of a compound in humans. These include models for the prediction and measurement of potency and selectivity against molecular targets or off-targets; absorption, distribution, metabolism and elimination (ADME) properties; cell-based measurements of pharmacological activity and toxicity; and animal models of pharmacology, pharmacokinetics and toxicity. The cost and throughput of these techniques vary, from in silico methods which typically have the lowest cost and highest throughput, through in vitro and cell-based assays to lengthy and expensive in vivo studies, the use of which we would also like to minimize for ethical reasons. Therefore, drug discovery is a process of simultaneously optimizing all of these factors as compounds are designed, synthesized and progressed through a cascade of assays to accumulate data.
# This balancing act is difficult to achieve through a purely intellectual process. Psychologists have repeatedly demonstrated that people are very poor at making decisions based on complex and uncertain data when there is a lot at stake, such as in drug discovery. Several biases in decision-making (described as cognitive biases) have been identified that can detrimentally affect efficiency and productivity in drug discovery. A detailed discussion of some of these, with examples, may be found in [2]; however two illustrative examples are:
#  Confirmation bias: The tendency to seek data that confirms a pre-formed hypothesis, rather than perform experiments designed to yield results to challenge the hypothesis. This can lead to a premature focus on a small range of options, which may lead to missed opportunities or late stage failures of compounds that have been progressed too far in the search for the one piece of data that would prove the point.
#  Excess focus on certainty: The tendency to seek additional data to be ‘absolutely certain’ of a critical factor, even when this data adds little value at a high cost. Often a more significant increase in the confidence around a slightly less important factor may have a greater effect on the overall chance of success. This can lead to inefficient use of resources when considering multiple property requirements and to late stage, expensive failures.
# The historical evidence regarding the attrition and productivity of pharmaceutical research and development supports this observation. The increasing complexity and volume of data being generated in drug discovery has not improved success rates in development – 11% in 2000 [3] versus 12% in 2010 [1] – while the cost per marketed drug has continued to escalate – from an estimated fully capitalized cost of $802M in 2001 [4] to $1,778M in 2010 [1] – and productivity, as measured by the number of registered new chemical entities, has fallen [5]. There are a wide range of theories regarding the underlying cause of these effects, but it is safe to conclude that generating additional, early-stage data has not resulted in the improvements anticipated in the outcomes.
# Fortunately, we may learn from other fields that face the same need to balance many factors in the design of a successful solution. These fields range from engineering disciplines, such as aerospace or automotive design, to economics. The resulting methods are commonly described under the broad term “Multi-parameter Optimization” (MPO) or sometimes also “Multi-dimensional Optimization” (MDO) or “Multi-objective Optimization” (MOOP). For convenience, we will use the term MPO to describe all of the methods in this review.
# There is a significant difference between applications of MPO methods to drug discovery and other fields, in particular engineering. This relates to the quality of the data available on the potential designs or prototypes from which a selection must be made. In an engineering discipline, characteristics may commonly be measured to accuracies within parts per million or predicted computationally to within a fraction of a percent. This may be contrasted with drug discovery where measured properties, such as IC50 or Ki values, may have an experimental variability of a factor of two, while predictions may have statistical uncertainties of an order of magnitude. This dramatically increases the challenge because, even if an ideal compound exists among the available options, we cannot expect to identify it with absolute confidence, thus running the risk of missing opportunities for high quality drugs [6].
# In our research into the requirements for an ideal MPO method for drug discovery, we identified the following factors that should be taken into account:
#  Interpretability: The property criteria and their impact on compound priority should be easy to understand. A ‘black box’ method that does not provide an easy way to understand why a compound has been classified in a given way is likely to be discounted. Furthermore, a ‘black box’ does not provide any guidance on the way one should go about making improvements in order to increase the chance of success.
#  Flexibility: Each project will have a different set of property criteria depending on the therapeutic objectives of the project, intended route of administration and competitive conditions in the market. The project team should be able to define appropriate criteria based on their experience or historical evidence.
#  Weighting: The project team should be able to assign different weights to each property criterion, as different criteria will have different degrees of importance to the outcome of the project. For example identifying a compound that is potent against the intended target is critical, while other properties will be less important, particularly early in a project when there is an opportunity for redesign to overcome liabilities.
#  Uncertainty: It is important to avoid rejecting potentially valuable compounds based on a property value that fails to meet a criterion if that value has a high level of uncertainty. The opportunity cost of incorrectly rejecting a good compound may be very high, particularly when the range of alternative options is limited.
# Coincidentally, it seems that the development of a suitable MPO approach for drug discovery is itself an MPO problem!
# One common question is, “Can’t this be easily solved by visualization of the data?” While visualization is necessary to understand and communicate results, it is not sufficient to allow conclusions to be easily drawn, given the complexity of the data at hand. One common approach is to plot multi-dimensional data, for example on a three dimensional graph with additional parameters shown by the colors and sizes of the points. An alternative is a ‘traffic light’ view where the properties of each compound are shown in a table and colored according to whether they ‘pass’ (green), ‘fail’ (red) or are ‘close’ (yellow) to the relevant criterion. However, even with only five-dimensional data, it is difficult to confidently draw a conclusion from these visualizations even before we consider the relative importance of each property or the uncertainty in the data. An MPO method helps a project team to define a set of criteria and use this pro-actively to guide their decisions to quickly target high quality compounds [7].
# In this review, we will explore a range of different MPO approaches that have been applied to drug discovery and compare their strengths and weaknesses relative to the requirements described above. The methods that we will discuss in order of increasing sophistication, include ‘rules-of-thumb’ that provide chemists with guidelines for compound characteristics, simple pass/fail filters, Pareto optimization, desirability functions and probabilistic scoring, which brings together all of the requirements discussed above. We will also consider the role of chemical diversity to mitigate risk when selecting compounds for further investigation. Finally, we will illustrate some of the methods using examples taken from the literature before drawing our conclusions.
# Rules of Thumb
# Perhaps the most common approach used to consider the quality of compounds relative to criteria beyond potency are ‘rules of thumb’ that provide guidelines regarding desirable compound characteristics. The best known is undoubtedly Lipinski’s Rule of Five (RoF) [8], which proposes criteria for four basic characteristics that Lipinski identified as being satisfied by the majority of orally absorbed compounds, namely:
#  Molecular Weight (MW) < 500
#  Logarithm of the octanol:water partition coefficient (logP) < 5
#  Number of Hydrogen Bond Donors (HBD) < 5
#  Number of Hydrogen Bond Acceptors (HBA) < 10
# Subsequently, several other rules have been proposed, for example Veber et al. [9] identified that most of the 1100 compounds they studied with oral bioavailability of greater than 20% in rats had less than 10 rotatable bonds and a Polar Surface Area (PSA) of less than 140 Å². However, Lu et al. [10] repeated this study with a set of 434 compounds and showed that the criteria depended on the method used for calculation, providing one illustration of the need for flexibility in the criteria depending on the source of data.
# Johnson et al. identified rules based on MW and the logarithm of the octanol:buffer partition coefficient at pH7.4 (logD) to achieve permeability and metabolic stability [11]. In this case, rather than expressing these rules as criteria for the individual characteristics, Johnson et al. identified correlations that led them to express the rules in terms of a ‘golden triangle’ that defines an optimal region in (MW,logD) space in which a compound should lie (illustrated in Figure 1).
# Figure 1. An illustration of the “Golden Triangle” [11] proposed by Johnson et al. (axes: MW and elogD). Compounds within the shaded region in (MW, logD) space were found to have a higher chance of achieving better outcomes for permeability and metabolic stability. This is a convenient visual rule-of-thumb for selecting compounds.
# Other rules, involving parameters such as the fraction of carbons which are sp3 hybridized [12] and the number of aromatic rings [13] have been proposed as measures of developability or likelihood of clinical success. Furthermore, Hughes et al. [14] studied the relationship between physicochemical properties and adverse events observed in in vivo toleration studies. They concluded that compounds with both calculated logP (clogP) > 3 and topological polar surface area (TPSA) < 75 Å² had a significantly increased safety risk.
# The undoubted popularity of these rules derives from their simplicity and interpretability, the first requirement for a good MPO method. It is very easy to calculate these characteristics and quickly check if a compound obeys these rules. Similarly it is easy to understand how to modify a compound that fails to meet these rules in order to improve its chance of success; it is clear how MW, HBD or HBA could be reduced and chemists have a good understanding of the influence of chemical functionalities on lipophilicity. Therefore, these rules-of-thumb provide an easy approach to selecting compounds and guiding their redesign.
# The main disadvantage of these rules-of-thumb is also due to their simplicity. There may be a tendency to over-interpret simple rules and apply them with too much rigor. For example, does a compound with a MW of 501 have a significantly worse chance of oral absorption than one with MW of 500? Indeed, Lipinski’s original paper [8] suggested that two or more failures against the RoF criteria were required to significantly decrease the chance of oral absorption, so the rules were not intended to be applied individually.
# These rules are derived from a review of historically successful drugs and are often treated as absolute rules that define ‘drug likeness.’ However, compounds for different therapeutic indications or routes of administration may require different characteristics or be more tolerant to violations of these rules. For example, there has been a tendency for the RoF to be considered as a definition of the conditions for ‘drug likeness’ when it is only based on analysis of the requirements for orally absorbed drugs. Drugs intended for topical, IV, inhaled or other routes of administration can violate some or all of the rules without a significant impact on their chance of success [15]. Therefore, the criteria and weight given to each of these rules-of-thumb should be defined or applied flexibly according to the therapeutic objectives of a project. Unfortunately, this is not a straightforward exercise, as careful statistical analysis of a large number of compounds is required to identify statistically significant criteria.
# The majority of the characteristics used in these rules-of-thumb do not have any underlying uncertainty, as they are simple values calculated from the molecular structure. The principal exception to this is lipophilicity (logP or logD) which, if calculated, typically has an uncertainty (root-mean-square-error) of at least 0.5 log units. Therefore, care should be taken when drawing conclusions regarding compounds close to the criterion for lipophilicity.
# Finally, we should consider the confidence in the ‘prediction’ by a rule-of-thumb. As an illustrative example, we applied the RoF to a set of 1191 marketed drugs labeled according to whether they have been approved for oral administration and the results are shown in Table 1. Although, one should be careful not to over interpret these results, we can see that passing the RoF is not a guarantee of finding an orally available compound. This is not surprising, as the RoF was derived from observations of absorption and other factors such as first pass metabolism can limit oral bioavailability. However, the specificity of the RoF is also low (21%), as more non-orally administered compounds pass the RoF than fail and a significant proportion of compounds that fail the RoF are orally administered.
# Table 1. The results of applying Lipinski’s Rule of Five to 1191 marketed drugs labeled as oral or non-oral according to their approved route of administration.
# | RoF result | Pass (≤1 RoF Failure) | Fail (>1 RoF Failure) |
# | --- | --- | --- |
# | Oral | 709 | 59 |
# | Non-oral | 333 | 90 |
# In summary, rules-of-thumb can provide very convenient and easily applied guidelines for the selection of compounds with a greater chance of yielding successful drugs, if used in the appropriate context. However, one should be careful about being overly rigid regarding their application as this could lead to missed opportunities.
#  Filtering
# Another simple approach to applying multiple criteria to the selection of compounds is sequential filtering. In this process the compounds are compared to series of criteria; those that fail to meet a criterion are discarded while those that meet the criterion are progressed for comparison against the next criterion in the sequence. The hope is that one or more ‘ideal’ compounds will emerge from the sequence of filters, having passed all of the criteria. Filtering offers the benefit that interpretation is straightforward, because if a compound fails one or more criteria this clearly indicates the focus for improving the compound.
# The set of criteria against which compounds are compared can be based on any relevant properties, whether calculated or experimental. This offers the flexibility that a drug discovery project may choose criteria that are tailored to the project objectives, based on the experience of the project team or historical data for successful compounds for the intended therapeutic indication. These criteria are sometimes referred to as a target product profile (TPP) and an illustrative example of such a profile for identification of a lead compound for an orally dosed compound is shown in Table 2. Early in a project, for example when choosing a screening library for high throughput screening, it is also common to apply the criteria indicated by one or more of the rules-of-thumb discussed above as sequential filters.
# Table 2. An example of a target product profile for selection of a lead compound intended for oral administration.
# | Property | Criterion |
# | --- | --- |
# | Pharmacology | |
# | Potency against target (Ki) | <100 nM |
# | Selectivity against related off-targets | >100-fold |
# | Physicochemical | |
# | LogP | <4 |
# | Solubility | >100 µM |
# | MW | <450 Da |
# | ADME | |
# | Caco-2* permeability (Papp) | >10 × 10^-6 cm/s |
# | Intrinsic Clearance in Human Liver Microsomes (Clint) | <25 µL/min/mg protein |
# | Absence of P-glycoprotein transport (Caco2 BA:AB) | <3 |
# | Safety | |
# | Avoid Cytochrome P450-mediated drug-drug interactions (Ki for CYP3A4, CYP2C9, CYP2D6, CYP1A2) | >1 µM |
# | Avoid interaction with hERG potassium ion channel (IC50) | >10 µM |
# | Cytotoxicity in HepG2† cells (LD50) | >1 mM |
# *Human epithelial colorectal adenocarcinoma cell line [16] †Hepatocellular carcinoma cell line [17]
# One challenge of filtering is that it is common for no compounds to emerge from the end of the sequence; there are several possible reasons for this:
#  There are often conflicts between the property criteria; improving one property often leads to an adverse change in another. In these situations, the relative importance of each criterion should be taken into account as this defines acceptable trade-offs against conflicting properties.
#  Simple yes/no criteria may be too strict; For example, if a compound meets all of the criteria in the TPP, except that it has a logP of 5.1 versus a criterion of <5, does it make sense to reject it?
#  There may have been a mis-measurement or mis-prediction; one or more compounds may have been incorrectly rejected due to the experimental variability or statistical error in a prediction.
# The last of these is probably the biggest concern about filtering because, as we discussed above, there is significant uncertainty in almost all of the data which is available in early drug discovery. If we consider a simple illustrative example in which we have 10 filters that are each 90% accurate in passing/failing a compound, the probability of an ideal compound emerging, even if it was present in the set being filtered, is only 35% (p = 0.9^10 assuming independence of the error in each filter). Therefore, even in this optimistic case, sequential filtering is more likely to discard an ideal compound than accept it. Furthermore, there is a significant chance of incorrectly passing a poor compound; in this example, if a compound should correctly fail only one of the criteria, the probability of it being incorrectly accepted is 4%. Given that there are typically many more poor compounds than good, this means that any ideal compound that is fortunate enough to be correctly passed by all of the filters is likely to be swamped by poor compounds incorrectly accepted.
# Therefore, despite the simplicity and easy interpretation of filtering, it should be treated with caution. The process accumulates error without that being transparent, running the risk of rejecting good compounds and missing opportunities to find a high quality drug.
# Calculated Metrics
# Rather than defining criteria for multiple, individual properties these may be combined to calculate a single metric that can be optimized to guide selection or design. One of the earliest and most commonly applied metrics is the Ligand Efficiency (LE) proposed by Hopkins et al. [18], with the goal of mitigating the tendency to focus too heavily on the optimization of potency at the cost of other necessary properties. LE was derived from the observation that smaller compounds tend to have better physicochemical and ADME properties than large compounds. Therefore, given two equally potent compounds it is preferable to choose the smaller. Or, alternatively, increasing potency without significantly increasing compound size is desirable. LE is defined as:
# LE = -ΔG / N_H
# where ΔG is the free energy of binding and N_H is the number of heavy (i.e. non-Hydrogen) atoms in the compound. In more common units, this may be expressed as:
# LE ≈ 1.4 × pIC50 / N_H
# where pIC50 = -log(IC50) and the IC50 is expressed in molar concentration.
# The use of the LE metric is particularly popular in fragment-based drug design [19], where the starting point is typically one or more small fragments with low binding affinity and new compounds are designed by growing or linking these fragments to identify a larger compound with sufficient potency. Although the initial fragments bind only weakly, they have a high LE due to their small size and the optimization process may be guided by increasing the potency while maintaining a high LE.
# The LE metric inspired other calculated optimization metrics, for example Ligand Lipophilicity Efficiency (LLE) [20], also known as Lipophilic Efficiency (LipE):
# LLE = pIC50 - logP
# where a calculated value of logP is often used. This was motivated by the desire to maximize potency while maintaining as low a lipophilicity as possible, due to the association between high lipophilicity and several issues including poor solubility, membrane permeation and metabolic stability, lack of selectivity and a higher risk of non-specific toxicity [21] [22].
# The range of efficiency metrics has been further extended to include percent efficiency index (PEI), defined as the percent inhibition (as a fraction between 0 and 1) divided by MW in kDa; binding efficiency index (BEI), defined as pIC50 divided by MW in kDa; and surface efficiency index (SEI), defined as pIC50 divided by PSA in 100s of Å². All of these combine a measure of potency related to another property representing the ‘drug-likeness’ of the compound and are reviewed in detail in [23]. More complex derivatives of these efficiency indexes have also been proposed including ligand efficiency-dependent lipophilicity (LEDL), defined as logP divided by LE [24] and ‘fit quality’ [25].
# These calculated metrics have the advantage that they are simple to apply, as only a single value must be monitored during optimization. They are also easy to interpret – Increase potency while minimizing the increase in compound size or lipophilicity – although this ease of interpretation may be sacrificed somewhat by the more complex efficiency indexes such as LEDL.
#  In many cases rules-of-thumb have been developed for selection of high quality compounds using these metrics. For example, it has been proposed that a LLE of 6 or higher is preferable, corresponding to a potency of better than 10 nM with a logP of 2. Again these provide useful guidelines when applied in an appropriate context, but the same caveats apply here as to the rules-of-thumb discussed above, in particular:
#  Potency and logP values have significant uncertainty, particularly when predicted, yet it is rare to see the uncertainties propagated through the calculation of the efficiency metric to consider the confidence with which compounds may be chosen based on these metrics.
#  As noted above, increasing compound size, MW and logP significantly increases the chance of encountering issues with poor physicochemical, ADME and toxicity. However, the correlation with these properties is not perfect, so it may be inappropriate to make selections based too strictly on these metrics, particularly when options are limited.
#  These rules are not universal and are typically based on identification of orally administered drugs, so the project’s therapeutic objective should be considered carefully before choosing a criterion.
# It is noteworthy that there is a close relationship between the optimization based on these metrics and a recent trend to optimize compounds based on measurements of the thermodynamic parameters of binding using biophysical measurements [26]. This strategy suggests that it is better to increase binding affinity (or equivalently decrease the free energy ΔG, as strong binding is equivalent to a reduction in free energy) by introducing an interaction dominated by decreasing the enthalpy of binding (ΔH) rather than one dominated by increasing entropy (ΔS) – note ΔG = ΔH - TΔS, where T is the temperature. Decreasing the binding free energy by reducing the enthalpy is achieved by forming a specific interaction with the target, for example a hydrogen bond with a residue in the binding pocket, which will typically improve the LE or LLE. The free energy can also be reduced by increasing the entropy and this can be achieved by displacing coordinated water molecules from the binding pocket into bulk solvent, for example by adding a bulky lipophilic group to occupy the binding pocket. However, this is often detrimental in the long-run, as such a non-specific interaction will increase the chance of off-target binding or non-specific toxicity and this is reflected by a decrease in the LE or LLE [27].
# Pareto Optimization
# The concept known as Pareto optimality was proposed by an Italian economist Vilfredo Pareto in the early 20th Century [28]. He suggested that, when considering multiple parameters, there may not be a single best combination of parameters, but rather a family of solutions that each represents a different, optimal combination. More specifically, a solution to a multi-parameter optimization problem is considered to be a Pareto optimum if there is no other solution that is better in all of the parameters.
# To illustrate this, consider the two-parameter example, illustrated in Figure 2(a), of a hypothetical drug discovery project which wishes to achieve an optimal balance of potency and metabolic stability to achieve good in vivo exposure and hence efficacy. Ideally, the project would like to identify a compound with high potency (pIC50) and good stability in human liver microsomes (% remaining after incubation for 40 minutes). This ideal goal is represented by the top right corner of the plots in Figure 2. However, as this ideal may be difficult or impossible to achieve, the project would like to find a good balance between potency and metabolic stability. The points shown as solid points in Figure 2 represent compounds with different, Pareto optimal combinations of these two parameters. For example, the point labeled A has no points that have both better potency and better metabolic stability, i.e. there are no points to the right and above; such a point is described as ‘non-dominated’. Contrast this with the point labeled B, which has a point, C, to the right and above representing a compound that is better in both parameters; B is ‘dominated’ by C. Note that the non-dominated points define a boundary, known as the ‘Pareto front’ and each represents a candidate for further investigation to identify the best balance of potency and metabolic stability to achieve in vivo efficacy.
# The concept of Pareto optimality may be generalized to Pareto rank, whereby a point is ranked according to the number of points by which it is dominated, a rank-0 point is non-dominated, rank-1 is dominated by only a single point, etc. This allows compounds to be ranked according to how close they are to the optimum front."""

with open("polaris-antiviral-admet-2025.json", "r") as f:
    polaris_dataset = json.load(f)

polaris_dataset_train = polaris_dataset.copy()
polaris_dataset_train.pop("CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1")
polaris_dataset_train = get_random_half_dict(polaris_dataset_train)
# , KSOL, MLM, MDR1-MDCKII, HLM
# prompt = f"""
# LogD <protocol>: like solubility - but then in fatty tissue - LogD is a measure of a molecule's lipophilicity, or how well it dissolves in fat. LogD is calculated by comparing a molecule's solubility in octanol, a fat-like substance, to its solubility in water.
# <protocol> Lipophilicity is possibly the most important physicochemical parameter for any potential drug candidate. Lipophilicity measurements are valuable for understanding how drugs are dissolved in plasma and other aqueous biological fluids. Lipophilicity is typically accessed as the distribution of the tested compound between two solvents - typically non-aqueous organic (1-octanol) and aqueous (pH-buffered water), and then LogP is expressed as a Log of the concentration ratio between two phases. LogP is widely used in cheminformatics and is a component of Lipinski’s “rule of five”, which is a golden standard to evaluate a drug-likeness of a compound. According to this rule, the successful drug candidate should possess LogP value not greater than 5. 

# LogD is a distribution coefficient widely used to measure the lipophilicity of ionizable compounds, where the partition is a function of the pH. 
# For nonionizable compounds LogP = LogD throughout pH range, whereas for ionizable compounds LogD takes into account the partition of both ionized and non-ionized forms. LogD is more convenient for practical measurements, as it takes into account solution pH, which is important for the analysis of the drug candidate properties in various biological media with different pH values. The shake flask method is considered the gold standard technique for determining log D.
# </protocol>

# Mouse Liver Microsomal stability (MLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by mouse liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the mouse body before it gets cleared.
# Human Liver Microsomal stability (HLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by human liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the human body before it gets cleared.
# Solubility (KSOL, protocol): solubility is essential for drug molecules: this heavily affects the pharmacokinetic and dynamics ('PKPD') of the molecule in the human body.
# Cell permeation (MDR1-MDCKII, protocol): MDCKII-MDR1 is a cell line that's used to model cell permeation i.e. how well drug compounds will permeate cell layers. For coronaviruses this is a critical endpoint because there is increasing evidence that afflictions such as long-covid are caused by (remnant) virus particles in the brain, and blood-brain-barrier (BBB) permeation is critical for drug candidates to reach the brain.


# What are the LogD for this small molecule CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1?
# Here is the set of samples that can help make a discovery
# {str(polaris_dataset_train)}
# """
# prompt = f"""
# Background:

# LogD is a measure of a molecule's lipophilicity, or its ability to dissolve in fats. It's a crucial property in drug discovery, as it influences factors like absorption, distribution, metabolism, and excretion (ADME).
# Mouse Liver Microsomal stability (MLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by mouse liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the mouse body before it gets cleared.
# Human Liver Microsomal stability (HLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by human liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the human body before it gets cleared.
# Solubility (KSOL, protocol): solubility is essential for drug molecules: this heavily affects the pharmacokinetic and dynamics ('PKPD') of the molecule in the human body.
# Cell permeation (MDR1-MDCKII, protocol): MDCKII-MDR1 is a cell line that's used to model cell permeation i.e. how well drug compounds will permeate cell layers. For coronaviruses this is a critical endpoint because there is increasing evidence that afflictions such as long-covid are caused by (remnant) virus particles in the brain, and blood-brain-barrier (BBB) permeation is critical for drug candidates to reach the brain.

# Additional Context:

# You are provided with a dataset of similar molecules and their corresponding LogD values:
# {str(polaris_dataset_train)}


# What is the LogD for this small molecule CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1?"""
# # test_input = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
# # <answer> answer here </answer>. User: {prompt}. Assistant:"""
# response = test_model_inference(prompt)
# print(f"Test Input: {prompt}")
# print(f"Model Response: {response}")


# prompt = f"""
# Background:
# LogD is a measure of a molecule's lipophilicity, or its ability to dissolve in fats. It's a crucial property in drug discovery, as it influences factors like absorption, distribution, metabolism, and excretion (ADME).
# Mouse Liver Microsomal stability (MLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by mouse liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the mouse body before it gets cleared.
# Human Liver Microsomal stability (HLM, protocol): This is a stability assay that tests how quickly a molecule gets broken down by human liver microsomes. This is a useful assay that can be used as an estimate on how long a molecule will reside in the human body before it gets cleared.
# Solubility (KSOL, protocol): solubility is essential for drug molecules: this heavily affects the pharmacokinetic and dynamics ('PKPD') of the molecule in the human body.
# Cell permeation (MDR1-MDCKII, protocol): MDCKII-MDR1 is a cell line that's used to model cell permeation i.e. how well drug compounds will permeate cell layers. For coronaviruses this is a critical endpoint because there is increasing evidence that afflictions such as long-covid are caused by (remnant) virus particles in the brain, and blood-brain-barrier (BBB) permeation is critical for drug candidates to reach the brain.

# Additional Context:

# Here is the first batch of molecules we received wet lab results for:
# {str(polaris_dataset_train)}

# Manual Rules Used by Chemists in Drug Discovery

# When evaluating a batch of molecules with ADMET data, experienced chemists use a combination of knowledge, intuition, and established rules to guide their decision-making. Here are some of the key manual rules they often employ:

# Golden Triangle Rule: This rule suggests that compounds with 200 Da ≤ molecular weight ≤ 350 Da and -2 ≤ logD ≤ 5 are more likely to have a favorable ADMET profile.

# Number of rotatable bonds ≤ 10
# Polar surface area ≤ 140 Å² or Total number of hydrogen bond donors and acceptors ≤ 12
# Rule of Three (Ro3) : This rule is often applied in fragment-based drug discovery. It suggests that fragment hits should ideally have:   

# Molecular weight ≤ 300 Da
# cLogP ≤ 3
# Number of hydrogen bond donors and acceptors ≤ 3
# Other Considerations:

# "Lead-likeness" : Similar to drug-likeness, but with more relaxed criteria, as lead compounds are often optimized further.   
# Important Notes:

# These rules are guidelines, not absolute requirements. There are exceptions to every rule, and the specific context of the drug discovery program should always be considered.
# Experienced chemists often develop their own intuition and heuristics based on their experience and knowledge of specific drug targets and therapeutic areas.

# As a chief chemist DM, comprehensively study the results and look at a candidate molecule CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1. Based on its bonds, fragments, atoms, first batch results and your chemistry knowledge, what do you think is its logD value, High-Risk (Pi>thresholdmax) → Requires modification to reduce, Weakly Optimized (Pi<thresholdmin) → Needs enhancement, Acceptable"""
# test_input = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
# # <answer> answer here </answer>. User: {prompt}. Assistant:"""
# prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.

# User: LogD measures a molecule's lipophilicity, influencing its absorption, distribution, metabolism, and excretion (ADME). Given the candidate molecule:

# CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1

# analyze its LogD value based on known chemical properties and ADMET guidelines.

# Output the LogD classification in one of the following categories:
# - High-Risk (LogD > 3)
# - Weakly Optimized (LogD < 1)
# - Acceptable (1 ≤ LogD ≤ 3)

# Assistant: <think> Step-by-step reasoning process for determining LogD. </think> <answer> Final LogD classification. </answer>"""
# prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.

# User: Background:
# LogD is a measure of a molecule's lipophilicity, or its ability to dissolve in fats. It is a crucial property in drug discovery, as it influences factors like absorption, distribution, metabolism, and excretion (ADME).

# Mouse Liver Microsomal Stability (MLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by mouse liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the mouse body before clearance.

# Human Liver Microsomal Stability (HLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by human liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the human body before clearance.

# Solubility (KSOL, protocol): Solubility is essential for drug molecules as it heavily affects the pharmacokinetics and pharmacodynamics ('PKPD') of the molecule in the human body.

# Cell Permeation (MDR1-MDCKII, protocol): MDCKII-MDR1 is a cell line used to model cell permeation, i.e., how well drug compounds permeate cell layers. For coronaviruses, this is a critical endpoint because increasing evidence suggests that afflictions such as long COVID are caused by (remnant) virus particles in the brain. Blood-brain barrier (BBB) permeation is critical for drug candidates targeting the brain.

# Additional Context:

# Here is the first batch of molecules for which we received wet lab results:
# {str(polaris_dataset_train)}

# Manual Rules Used by Chemists in Drug Discovery:

# When evaluating a batch of molecules with ADMET data, experienced chemists use a combination of knowledge, intuition, and established rules to guide their decision-making. Here are some of the key manual rules they often employ:

# Lipinski's Rule of Five (Ro5): This rule helps predict the oral bioavailability of a drug candidate. It states that a molecule is likely to have good oral absorption if it meets the following criteria:
# - Molecular weight ≤ 500 Da
# - Lipophilicity (logP) ≤ 5
# - Number of hydrogen bond acceptors ≤ 10
# - Number of hydrogen bond donors ≤ 5

# Pfizer Rule: This rule focuses on potential toxicity. It suggests that compounds with high lipophilicity (logP > 3) and low polar surface area (TPSA < 75) are more likely to be toxic.

# GSK Rule: This rule proposes that compounds with molecular weight ≤ 400 Da and logP ≤ 4 are more likely to have a favorable ADMET profile.

# Golden Triangle Rule: This rule suggests that compounds with 200 Da ≤ molecular weight ≤ 350 Da and -2 ≤ logD ≤ 5 are more likely to have a favorable ADMET profile.

# Veber's Rule: This rule expands on Lipinski's Ro5 by considering molecular flexibility. It suggests that good oral bioavailability is likely if:   
# - Number of rotatable bonds ≤ 10
# - Polar surface area ≤ 140 Å² or total number of hydrogen bond donors and acceptors ≤ 12

# Rule of Three (Ro3): This rule is often applied in fragment-based drug discovery. It suggests that fragment hits should ideally have:   
# - Molecular weight ≤ 300 Da
# - cLogP ≤ 3
# - Number of hydrogen bond donors and acceptors ≤ 3

# Other Considerations:
# - **Lead-likeness**: Similar to drug-likeness, but with more relaxed criteria, as lead compounds are often optimized further.
# - **Ligand Efficiency (LE)**: A measure of binding affinity relative to the size of the molecule. Higher LE is generally desirable.
# - **Lipophilic Efficiency (LiPE)**: A measure of potency relative to lipophilicity. Higher LiPE suggests a better balance between these properties.
# - **Synthetic Accessibility**: The ease with which a molecule can be synthesized. Easier synthesis is generally preferred.

# Important Notes:
# - These rules are guidelines, not absolute requirements. There are exceptions to every rule, and the specific context of the drug discovery program should always be considered.
# - Experienced chemists often develop their own intuition and heuristics based on their experience and knowledge of specific drug targets and therapeutic areas.
# - Computational tools and predictive models are increasingly used to complement these manual rules and provide more quantitative predictions of ADMET properties.
# - By combining these manual rules with their expertise and the available data, chemists can make informed decisions about which molecules to prioritize for further development and optimization.

# As a chief chemist DM, comprehensively study the results and examine the candidate molecule:
# CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1. 

# Based on its bonds, fragments, atoms, first batch results, and your chemistry knowledge, what do you think is its LogD value?

# Output results in one of the three ranges:

# **LogD (Lipophilicity):**
# - **High-Risk (High Lipophilicity):** LogD > 3. These molecules are highly lipophilic and may have increased metabolism, poor solubility, potential toxicity, and higher risk of off-target binding. Modification to reduce lipophilicity is often needed.
# - **Weakly Optimized (Low Lipophilicity):** LogD < 1. These molecules are very polar and may have poor permeability. Enhancement of lipophilicity might be necessary to improve absorption and distribution.
# - **Acceptable (Optimal Lipophilicity):** 1 ≤ LogD ≤ 3. These molecules have a good balance of lipophilicity and polarity, which is generally favorable for drug-like properties.

# **HLM (Human Liver Microsomal Stability):**
# - High-Risk (Rapid Metabolism): T½ < 15 minutes
# - Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes
# - Acceptable (Good Stability): T½ ≥ 30 minutes

# **MLM (Mouse Liver Microsomal Stability):**
# - High-Risk (Rapid Metabolism): T½ < 15 minutes
# - Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes
# - Acceptable (Good Stability): T½ ≥ 30 minutes

# **KSOL (Solubility):**
# - High-Risk (Poor Solubility): logS < -5
# - Weakly Optimized (Moderate Solubility): -5 ≤ logS < -4
# - Acceptable (Good Solubility): logS ≥ -4

# **MDR1-MDCKII (Cell Permeability):**
# - High-Risk (Poor Permeability): Papp < 1 x 10⁻⁶ cm/s
# - Weakly Optimized (Moderate Permeability): 1 x 10⁻⁶ cm/s ≤ Papp < 2 x 10⁻⁶ cm/s
# - Acceptable (Good Permeability): Papp ≥ 2 x 10⁻⁶ cm/s

# Assistant: <think> Step-by-step reasoning process for evaluating LogD, HLM, MLM, KSOL, and MDR1-MDCKII. </think> <answer> 
# {
#     "LogD": "category_here",
#     "HLM": "category_here",
#     "MLM": "category_here",
#     "KSOL": "category_here",
#     "MDR1-MDCKII": "category_here"
# } 
# </answer>"""

prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. 

User: Background:
LogD is a measure of a molecule's lipophilicity, or its ability to dissolve in fats. It is a crucial property in drug discovery, as it influences factors like absorption, distribution, metabolism, and excretion (ADME).

Mouse Liver Microsomal Stability (MLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by mouse liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the mouse body before clearance.

Human Liver Microsomal Stability (HLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by human liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the human body before clearance.

Solubility (KSOL, protocol): Solubility is essential for drug molecules as it heavily affects the pharmacokinetics and pharmacodynamics ('PKPD') of the molecule in the human body.

Cell Permeation (MDR1-MDCKII, protocol): MDCKII-MDR1 is a cell line used to model cell permeation, i.e., how well drug compounds permeate cell layers. For coronaviruses, this is a critical endpoint because increasing evidence suggests that afflictions such as long COVID are caused by (remnant) virus particles in the brain. Blood-brain barrier (BBB) permeation is critical for drug candidates targeting the brain.

Additional Context:

Here is the first batch of molecules for which we received wet lab results:
{str(polaris_dataset_train)}

Manual Rules Used by Chemists in Drug Discovery:

When evaluating a batch of molecules with ADMET data, experienced chemists use a combination of knowledge, intuition, and established rules to guide their decision-making. Here are some of the key manual rules they often employ:

Lipinski's Rule of Five (Ro5): This rule helps predict the oral bioavailability of a drug candidate. It states that a molecule is likely to have good oral absorption if it meets the following criteria:
- Molecular weight ≤ 500 Da
- Lipophilicity (logP) ≤ 5
- Number of hydrogen bond acceptors ≤ 10
- Number of hydrogen bond donors ≤ 5

Pfizer Rule: This rule focuses on potential toxicity. It suggests that compounds with high lipophilicity (logP > 3) and low polar surface area (TPSA < 75) are more likely to be toxic.

GSK Rule: This rule proposes that compounds with molecular weight ≤ 400 Da and logP ≤ 4 are more likely to have a favorable ADMET profile.

Golden Triangle Rule: This rule suggests that compounds with 200 Da ≤ molecular weight ≤ 350 Da and -2 ≤ logD ≤ 5 are more likely to have a favorable ADMET profile.

Veber's Rule: This rule expands on Lipinski's Ro5 by considering molecular flexibility. It suggests that good oral bioavailability is likely if:   
- Number of rotatable bonds ≤ 10
- Polar surface area ≤ 140 Å² or total number of hydrogen bond donors and acceptors ≤ 12

Rule of Three (Ro3): This rule is often applied in fragment-based drug discovery. It suggests that fragment hits should ideally have:   
- Molecular weight ≤ 300 Da
- cLogP ≤ 3
- Number of hydrogen bond donors and acceptors ≤ 3

Other Considerations:
- **Lead-likeness**: Similar to drug-likeness, but with more relaxed criteria, as lead compounds are often optimized further.
- **Ligand Efficiency (LE)**: A measure of binding affinity relative to the size of the molecule. Higher LE is generally desirable.
- **Lipophilic Efficiency (LiPE)**: A measure of potency relative to lipophilicity. Higher LiPE suggests a better balance between these properties.
- **Synthetic Accessibility**: The ease with which a molecule can be synthesized. Easier synthesis is generally preferred.

Important Notes:
- These rules are guidelines, not absolute requirements. There are exceptions to every rule, and the specific context of the drug discovery program should always be considered.
- Experienced chemists often develop their own intuition and heuristics based on their experience and knowledge of specific drug targets and therapeutic areas.
- Computational tools and predictive models are increasingly used to complement these manual rules and provide more quantitative predictions of ADMET properties.
- By combining these manual rules with their expertise and the available data, chemists can make informed decisions about which molecules to prioritize for further development and optimization.

As a chief chemist DM, comprehensively study the results and examine the candidate molecule whose SMILES the user provides.

Based on its bonds, fragments, atoms, first batch results, and your chemistry knowledge, what do you think is its LogD value?

Output results in one of the three ranges:

LogD (Lipophilicity):
- High-Risk (High Lipophilicity): LogD > 3. These molecules are highly lipophilic and may have increased metabolism, poor solubility, potential toxicity, and higher risk of off-target binding. Modification to reduce lipophilicity is often needed.
- Weakly Optimized (Low Lipophilicity): LogD < 1. These molecules are very polar and may have poor permeability. Enhancement of lipophilicity might be necessary to improve absorption and distribution.
- Acceptable (Optimal Lipophilicity): 1 ≤ LogD ≤ 3. These molecules have a good balance of lipophilicity and polarity, which is generally favorable for drug-like properties.

HLM (Human Liver Microsomal Stability):
- High-Risk (Rapid Metabolism): T½ < 15 minutes
- Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes
- Acceptable (Good Stability): T½ ≥ 30 minutes

MLM (Mouse Liver Microsomal Stability):
- High-Risk (Rapid Metabolism): T½ < 15 minutes
- Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes
- Acceptable (Good Stability): T½ ≥ 30 minutes

KSOL (Solubility):
- High-Risk (Poor Solubility): logS < -5
- Weakly Optimized (Moderate Solubility): -5 ≤ logS < -4
- Acceptable (Good Solubility): logS ≥ -4

MDR1-MDCKII (Cell Permeability):
- High-Risk (Poor Permeability): Papp < 1 x 10⁻⁶ cm/s
- Weakly Optimized (Moderate Permeability): 1 x 10⁻⁶ cm/s ≤ Papp < 2 x 10⁻⁶ cm/s
- Acceptable (Good Permeability): Papp ≥ 2 x 10⁻⁶ cm/s

The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e.,<think> Step-by-step reasoning process for evaluating LogD, HLM, MLM, KSOL, and MDR1-MDCKII. </think> <answer> 
{
    "LogD": "value",
    "HLM": "value",
    "MLM": "value",
    "KSOL": "value",
    "MDR1-MDCKII": "value"
} 
</answer>.
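User: What are the LogD, HLM, MLM, KSOL, and MDR1-MDCKII for CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1? Assistant:
""".replace(
    # The prompt is a plain string, not an f-string (the JSON braces above would
    # break an f-string), so substitute the dataset placeholder explicitly.
    "{str(polaris_dataset_train)}",
    str(polaris_dataset_train),
)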

# HLM (Human Liver Microsomal Stability):

# High-Risk (Rapid Metabolism): T½ < 15 minutes. These molecules are rapidly metabolized and likely to have poor oral bioavailability. Significant optimization is usually needed.
# Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes. These molecules have moderate stability and may require some optimization to improve their pharmacokinetic properties.
# Acceptable (Good Stability): T½ ≥ 30 minutes. These molecules exhibit good stability and are less likely to be limited by hepatic metabolism.
# MLM (Mouse Liver Microsomal Stability):

# High-Risk (Rapid Metabolism): T½ < 15 minutes
# Weakly Optimized (Moderate Metabolism): 15 minutes ≤ T½ < 30 minutes
# Acceptable (Good Stability): T½ ≥ 30 minutes
# KSOL (Solubility):

# High-Risk (Poor Solubility): logS < -5. These molecules have very low solubility and are likely to face significant challenges with formulation and absorption. Major optimization is often needed.
# Weakly Optimized (Moderate Solubility): -5 ≤ logS < -4. These molecules have moderate solubility and may require some formulation efforts or structural modifications to improve their solubility.
# Acceptable (Good Solubility): logS ≥ -4. These molecules exhibit good solubility and are less likely to be limited by solubility issues.
# LogD (Lipophilicity):

# High-Risk (High Lipophilicity): LogD > 3. These molecules are highly lipophilic and may have increased metabolism, poor solubility, potential for toxicity, and higher risk of off-target binding. Modification to reduce lipophilicity is often needed.
# Weakly Optimized (Low Lipophilicity): LogD < 1. These molecules are very polar and may have poor permeability. Enhancement of lipophilicity might be necessary to improve absorption and distribution.
# Acceptable (Optimal Lipophilicity): 1 ≤ LogD ≤ 3. These molecules have a good balance of lipophilicity and polarity, which is generally favorable for drug-like properties.
# MDR1-MDCKII (Cell Permeability):

# High-Risk (Poor Permeability): Papp < 1 x 10⁻⁶ cm/s. These molecules have very low permeability and are likely to face challenges with absorption and reaching their target. Significant optimization is usually needed.
# Weakly Optimized (Moderate Permeability): 1 x 10⁻⁶ cm/s ≤ Papp < 2 x 10⁻⁶ cm/s. These molecules have moderate permeability and may benefit from some optimization to improve their absorption and distribution.
# Acceptable (Good Permeability): Papp ≥ 2 x 10⁻⁶ cm/s. These molecules exhibit good permeability and are less likely to be limited by permeability issues.

In [10]:
response = test_model_inference(prompt)
print(f"Test Input: {prompt}")
print(f"Model Response: {response}")

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Test Input: A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. 

User: Background:
LogD is a measure of a molecule's lipophilicity, or its ability to dissolve in fats. It is a crucial property in drug discovery, as it influences factors like absorption, distribution, metabolism, and excretion (ADME).

Mouse Liver Microsomal Stability (MLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by mouse liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the mouse body before clearance.

Human Liver Microsomal Stability (HLM, protocol): This is a stability assay that tests how quickly a molecule is broken down by human liver microsomes. It is a useful assay that can estimate how long a molecule will reside in the human body before clearance.

Solubility (KSOL, p

In [8]:
response[-200:]

'k>\n\n```json\n{\n    "LogD": "low lipophilicity",\n    "HLM": "moderate metabolism",\n    "MLM": "moderate metabolism",\n    "KSOL": "poor solubility",\n    "MDR1-MDCKII": "good permeability"\n}\n</answer>\n```'
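
The tail of the response shows the JSON wrapped in a Markdown fence, with the closing `</answer>` tag inside that fence, so a strict tag-based parser would likely stumble on it. Below is a minimal sketch of a more forgiving extraction step. The helper name `extract_answer_json` is hypothetical (it is not part of this notebook), and the sketch assumes the answer is a flat JSON object without nested braces.

```python
import json
import re

def extract_answer_json(text: str):
    """Return the first flat {...} block in `text` parsed as JSON, or None."""
    match = re.search(r"\{.*?\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

fence = "`" * 3  # avoids writing a literal code fence inside this example
# Same text as the response tail printed above.
tail = (
    "k>\n\n" + fence + "json\n"
    '{\n    "LogD": "low lipophilicity",\n    "HLM": "moderate metabolism",\n'
    '    "MLM": "moderate metabolism",\n    "KSOL": "poor solubility",\n'
    '    "MDR1-MDCKII": "good permeability"\n}\n</answer>\n' + fence
)

parsed = extract_answer_json(tail)
print(parsed["LogD"])  # low lipophilicity
```

Returning `None` instead of raising keeps a later reward or evaluation step simple: an unparseable answer can just be scored as a miss.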

In [8]:
with open("polaris-antiviral-admet-2025.json", "r") as f:
    polaris_dataset = json.load(f)

In [15]:
polaris_dataset["CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1"]

{'HLM': nan, 'MDR1-MDCKII': 1.6, 'MLM': nan, 'LogD': 1.5, 'KSOL': 340.0}
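
To compare a measured value with the model's answer, the measurement first has to be mapped to one of the three categories from the prompt. Here is a minimal sketch for LogD only, using the thresholds stated in the prompt (LogD > 3, LogD < 1, and 1 ≤ LogD ≤ 3). The function name `logd_category` is hypothetical, and the other properties are left out because the units of the stored values are not stated here.

```python
import math

def logd_category(logd):
    """Map a measured LogD to the prompt's categories; None if missing."""
    if logd is None or math.isnan(logd):
        return None
    if logd > 3:
        return "High-Risk"
    if logd < 1:
        return "Weakly Optimized"
    return "Acceptable"

# Measured values for the candidate molecule, copied from the output above.
measured = {
    "HLM": float("nan"),
    "MDR1-MDCKII": 1.6,
    "MLM": float("nan"),
    "LogD": 1.5,
    "KSOL": 340.0,
}

print(logd_category(measured["LogD"]))  # Acceptable
```

Under these thresholds the measured LogD of 1.5 is "Acceptable", whereas the model answered "low lipophilicity".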

In [9]:
# Qwen/Qwen2.5-0.5B-Instruct response to 
# prompt = "What are the LogD, KSOL, MLM, MDR1-MDCKII, HLM for this small molecule CO[C@H]1C[C@H](N2N=CC3=C(C(=O)NC4=CC=C5CNCC5=C4)C=C(Cl)C=C32)C1?"
# test_input = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
# <answer> answer here </answer>. User: {prompt}. Assistant:"""

# 1. **Identify Key Components**: Look at the molecular formula and identify the functional groups that contribute to these parameters.
# 2. **Use Structure-Based Methods**: If possible, use structural databases or software like ChemDraw to visualize the molecule's structure and calculate key properties based on the known information.
# 3. **Consult Literature**: Refer to literature reviews or databases where similar molecules are described to get insights into their properties.

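So, this tiny model already returns a structured answer covering all five properties, which makes it a workable starting point for our DeepSeek lookalike model training. Its predictions are not accurate yet, though: it reports low lipophilicity, while the measured LogD of 1.5 falls in the Acceptable range (1 ≤ LogD ≤ 3), and it does not use the category labels the prompt asked for.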

## Policy Model (R) In RL Setup

Now that we have selected our base model, next we need to understand how a basic RL setup works for training an LLM.

DeepSeek R1 started from the DeepSeek V3 base model; in our case we start from Qwen2.5-0.5B-Instruct. By starting point I mean that **the base model was first used to create DeepSeek R1 Zero**, an initial version that still had some flaws, before the final R1 was created.

This initial version (R1 Zero) was created using Reinforcement Learning, where the base model (DeepSeek V3/Qwen2.5-0.5B) acts as the RL agent (the actor that takes actions). Let’s first visualize how it works.

![Qwen 2.5 as an agent workflow (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/5410/1*S6YIXu1vIVmQFl-DgRFktg.png)

The RL agent (DeepSeek V3/Qwen2.5-0.5B) starts by taking an **Action**, which means it generates an answer and some reasoning for a given problem that’s put into its **Environment**. The Environment, in this case, is simply the reasoning task itself.

After taking an action, the Environment gives back a **Reward**. This Reward is feedback: it tells our base model (DeepSeek V3/Qwen2.5-0.5B) how good its action was. A positive Reward means it did something right, such as getting the answer correct or reasoning well. This feedback signal then goes back to our base model, helping it learn and adjust how it takes actions in the future to earn even better Rewards.
> In the next section, we will discuss this methodology in more detail.

## GRPO Algorithm for R1 Zero

Now that we understand the basic RL flow, we need to learn which exact RL algorithm DeepSeek uses for R1 Zero.

There are many RL algorithms available, but traditional RL uses something called a **“critic”** to help the main decision-making part (the “actor”, i.e. DeepSeek V3/Qwen2.5-0.5B). This critic is usually just as big and complex as the actor itself, which roughly doubles the computational cost.

DeepSeek instead uses **GRPO** to train the initial model (R1 Zero). GRPO does things differently: it works out a baseline, a kind of reference point for good actions, directly from the results of a **group** of actions sampled for the same problem. Because of this, GRPO doesn’t need a separate critic model at all, which saves a lot of computation.
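To make the group baseline concrete, here is a minimal sketch of how the rewards of one group of completions could be turned into relative advantages. The function name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` constant are assumptions made for this illustration; this is not code taken from DeepSeek or TRL.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Score each reward against its own group's mean and spread.

    Hypothetical helper: the group mean plays the role of the baseline
    that a critic model would otherwise have to estimate.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four completions sampled for the same problem
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and the rest get a negative one. If every completion in a group earns the same reward, all advantages are zero, so that group provides no learning signal.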

Let’s draw a flowchart of how GRPO is used for R1 Zero training, and then we will **interpret** it.

![GRPO Flow for DeepSeek R1 Zero (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/6404/1*8mfNzi-gvasR7mSaseswmg.png)

Let’s understand how the DeepSeek GRPO implementation works with our base model (Qwen2.5-0.5B).

First, the **Problem Input (A)** is given to the **Qwen Model (B)**, which attempts to generate an answer through **Generate Completion (C)**. The final result, called the **Completion Output (D)**, includes reasoning steps in `<think>` tags and the final solution in `<answer>` tags.

Next, the **Problem Input (A)** and the **Ground Truth Solution (E)** are fed into the **Reward Functions (F)**, which act as intelligent graders. These functions compare Qwen’s **Completion Output (D)** with the correct solution and evaluate different aspects such as:

 1. **Accuracy** (is the answer correct?)

 2. **Format** (are the <think> and <answer> tags used properly?)

 3. **Reasoning Steps** (is the logic clear?)

 4. **Cosine Scaling** (is the response concise?)

 5. **Repetition Penalty** (is there unnecessary repetition?).

These evaluations produce **Reward Scores (G)**, which are then passed to the **GRPO Trainer (H)**. The trainer uses gradients to adjust the **Qwen Model (B)**, fine-tuning how it generates answers. GRPO stands for **Group Relative Policy Optimization**: each completion’s reward is judged **relative** to the other completions in its **group**, and the **policy** (the model) is updated to favor the completions that scored above the group’s average.

Finally, the updated **Qwen Model (B)** is tested again on new problems, continuously refining itself through repeated cycles. With each iteration, Qwen becomes a better problem solver.

> In the upcoming section we will start preprocessing our training dataset for GRPO training

## Prompt Template

We are using the same thinking prompt template that DeepSeek uses for the GRPO algorithm to build R1 Zero, so let’s define that:

In [None]:
# DeepSeek system prompt for GRPO based training
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

This **system prompt** tells the base model (Qwen2.5-0.5B) its role: a helpful assistant that reasons step by step before answering.

The `<think>` and `<answer>` tags are used to structure the model response, separating its internal reasoning from the final answer for better evaluation and reward.
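As a small illustration of why the tags are convenient, the sketch below pulls the reasoning and the final answer out of a tagged response. The helper name `split_reasoning_and_answer` is hypothetical and is not used elsewhere in this notebook.

```python
import re

# Capture the text inside <think>...</think> and <answer>...</answer>
TAGGED = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def split_reasoning_and_answer(text):
    """Return (reasoning, answer), or None if the tags are missing."""
    match = TAGGED.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

sample = "<think>2 + 2 equals 4.</think>\n<answer>4</answer>"
parts = split_reasoning_and_answer(sample)
print(parts)
```

A response without the tags returns `None`, which is the kind of case the format reward defined later is meant to discourage.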

## Preprocessing Training Data

Now that we have our system prompt ready, we need to transform our training data according to our template.

![Preprocessing dataset overview (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/6160/1*XnM7v4dPD4LtyAh2MLuInA.png)

We need to create the `make_conversation` function, which wraps each problem in a system/user conversation for us.

In [None]:
# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

It takes the `problem` column value of each row and returns a dictionary whose `prompt` holds the system prompt followed by the problem as a user message. Next, let’s write the function that loads our dataset and applies this conversion.

In [None]:
# Load and prepare dataset
def load_math_dataset():
    """Load and prepare the mathematics dataset."""
    dataset = load_dataset(
        "AI-MO/NuminaMath-TIR",
        name="default",
        split=['train', 'test']
    )
    
    # Convert splits into dictionary
    dataset = {
        'train': dataset[0],
        'test': dataset[1]
    }
    
    # Apply conversation format
    for split in dataset:
        dataset[split] = dataset[split].map(make_conversation)

        # Remove 'messages' column if exists
        if "messages" in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")
    
    return dataset

With everything ready, let’s transform our training data into the required format and print the training and test set sizes.

In [None]:
# Load our training dataset and print the train/test sizes
dataset = load_math_dataset()

print(f"Train set size: {len(dataset['train'])}")
print(f"Test set size: {len(dataset['test'])}")

Now that we have loaded the train and test splits, we need to validate the dataset (**check that the required fields exist and that each prompt starts with a system message followed by a user message**) before moving to the next step.

In [None]:
def validate_dataset(dataset):
    """Perform basic validation checks on the dataset."""
    
    # Define the required fields for the dataset
    required_fields = ["problem", "prompt"]

    # Loop through the 'train' and 'test' splits of the dataset
    for split in ['train', 'test']:
        print(f"\nValidating {split} split:")

        # Retrieve column names from the dataset
        fields = dataset[split].column_names

        # Check if any required fields are missing
        missing = [field for field in required_fields if field not in fields]
        if missing:
            print(f"Warning: Missing fields: {missing}")  # Warn if fields are missing
        else:
            print("✓ All required fields present")  # Confirm all fields are present

        # Retrieve the first sample from the dataset split
        sample = dataset[split][0]

        # Extract the 'prompt' field, which contains a list of messages
        messages = sample['prompt']

        # Validate the prompt format:
        # - It should contain at least two messages
        # - The first message should be from the 'system' role
        # - The second message should be from the 'user' role
        if (len(messages) >= 2 and
            messages[0]['role'] == 'system' and
            messages[1]['role'] == 'user'):
            print("✓ Prompt format is correct")  # Confirm correct format
        else:
            print("Warning: Incorrect prompt format")  # Warn if format is incorrect

# Validate dataset
validate_dataset(dataset)

Our training dataset is validated successfully 🙌, which means we have transformed our dataset into the format needed for training.

## Reward Functions

We already saw in the GRPO section that it evaluates the base model’s answer in five different ways:

![Reward Functions (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/7474/1*kJln8i6Tv4aspnTfMoRW-Q.png)

 1. **Accuracy** (is the answer correct?)

 2. **Format** (are the `<think>` and `<answer>` tags used properly?)

 3. **Reasoning Steps** (is the logic clear?)

 4. **Cosine Scaling** (is the response concise?)

 5. **Repetition Penalty** (is there unnecessary repetition?).

Each of these is a function that calculates the reward for each response, and we need to code them. So, let’s do that first.

### Accuracy Reward

The accuracy reward is the easiest to understand but requires somewhat complex code. In this reward function, we check whether the base model’s response is mathematically equivalent to the ground truth solution.

![Accuracy Reward (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/7860/1*A3tW-OZSZ4m10EEzogjy8Q.png)

If the model answer is mathematically correct, we assign a reward of **1.0**. If it is incorrect, the reward is **0.0**. In cases where the ground truth solution cannot be parsed, we assign a neutral reward of **0.5** to avoid unfair penalties.

Now, let’s implement the function.

In [None]:
def accuracy_reward(completions, **kwargs):
    """
    Reward function to check if the model's response is mathematically 
    equivalent to the ground truth solution.
    Uses latex2sympy2 for parsing and math_verify for validation.
    """
    
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []

    solutions = kwargs.get("solution") # Get solutions from kwargs

    if solutions is None:
        return [0.5] * len(completions) # Return neutral reward if no solution
    
    for content, sol in zip(contents, solutions):
        # Parse the ground truth solution
        gold_parsed = parse(sol, extraction_mode="first_match", 
                            extraction_config=[LatexExtractionConfig()])
        
        if gold_parsed:  # Check if parsing was successful
            # Parse the model's answer with relaxed normalization
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )

            # Reward 1.0 if correct, 0.0 if incorrect
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            # If ground truth cannot be parsed, assign neutral reward (0.5)
            reward = 0.5
            print("Warning: Failed to parse gold solution:", sol)

        rewards.append(reward)
    
    return rewards

In this function, we check whether the model response is **equivalent** to the correct answer. Instead of comparing raw text, we:

 1. Convert the solution into a structured mathematical format using **latex2sympy2**.

 2. If parsing fails, assign a neutral reward of **0.5**.

 3. Extract the model output and normalize it for better robustness.

 4. Use **math_verify** to check if the parsed response matches the parsed solution.

 5. If correct, assign **1**; if incorrect, assign **0**.

This ensures that accuracy evaluation is not just about textual similarity but **true mathematical correctness.**

### Format Reward

Format Reward is all about making sure our model follows instructions and structures its output correctly. We asked it to put its reasoning in `<think>` tags and the final answer in `<answer>` tags, right? This reward function checks exactly that!

![Format Reward (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/6620/1*DbUraziwiOoAj6SvtSJmpw.png)

If the model uses those tags correctly, we give it a reward of 1. If it messes up the format, it gets 0. Simple as that! This encourages the model to pay attention to the output structure we want.

Let’s code this up:

In [None]:
# Implement Format Reward Function
def format_reward(completions, **kwargs):
    """
    Reward function to check if the completion has the correct format:
    <think>...</think> <answer>...</answer>.
    """
    # Define the regex pattern for the desired format
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

    # Extract the content from each completion
    completion_contents = [completion[0]["content"] for completion in completions]

    # Check if each completion matches the pattern
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE)
               for content in completion_contents]

    # Reward 1.0 for correct format, 0.0 otherwise
    return [1.0 if match else 0.0 for match in matches]

In this function:

* We define a pattern using regular expressions (regex). This pattern basically says “the content should *start* with `<think>`, have *anything* inside until `</think>`, then optional *whitespace*, then `<answer>`, *anything* inside until `</answer>`, and then *end* there”.

* We get the actual text content from each model completion.

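* Then we use `re.match` to see if each content matches our pattern from its very first character. `re.DOTALL` lets the `.` in the regex match newlines too, so the reasoning and answer can span several lines. `re.MULTILINE` makes `^` and `$` match at the start and end of each line, not only of the whole string, which means `$` is also satisfied at a line break right after `</answer>`; use `re.fullmatch` without `re.MULTILINE` if nothing may follow the closing tag.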

* Finally, we give a reward 1 if it matched the format perfectly, 0 if it didn’t. This is a strict on/off reward for format correctness.
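To see how the two regex flags affect strictness, here is a small standalone check of the same pattern on three hand-written strings. The helper `matches_format` and the sample strings are made up for this illustration.

```python
import re

PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

def matches_format(text, flags):
    """Return True if the text matches the format pattern under the given flags."""
    return re.match(PATTERN, text, flags) is not None

clean = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
trailing = clean + "\nSome extra text after the answer"
untagged = "The answer is 4"

both_flags = re.DOTALL | re.MULTILINE
for name, text in [("clean", clean), ("trailing", trailing), ("untagged", untagged)]:
    # First column: flags used in format_reward; second column: re.DOTALL only
    print(name, matches_format(text, both_flags), matches_format(text, re.DOTALL))
```

With both flags, the string with extra text on a new line after `</answer>` still matches, because `$` can match at that line break. With `re.DOTALL` alone it is rejected, so dropping `re.MULTILINE` is one option if you want the stricter behavior.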

### Reasoning Steps Reward

Reasoning Steps Reward is a bit clever. We want to encourage our model to show its **“thinking process”**. So, we are going to reward it for including things that *look like* reasoning steps.

![Reasoning Steps Reward Encouragement (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/5406/1*hx0sAVnY58WOYw6rGF64ug.png)

We will look for keywords and patterns that usually show up in step-by-step reasoning, like:

* Step 1, Step 2, etc.

* Numbered lists like 1. and 2. at the start of a line

* Bullet points like - or *

* Transition words like First, Second, Next, Finally

The more of these it includes, the better the reward. It’s like giving points for showing its work!

Let’s code this reasoning encouraging function:

In [None]:
def reasoning_steps_reward(completions, **kwargs):
    r"""
    Reward function to encourage clear step-by-step reasoning.
    It looks for patterns like "Step 1:", numbered lists, bullet points,
    and transition words.
    """
    # Regex pattern to find indicators of reasoning steps
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"

    # Extract completion contents
    completion_contents = [completion[0]["content"] for completion in completions]

    # Count the number of reasoning step indicators in each completion
    matches = [len(re.findall(pattern, content, re.MULTILINE))
               for content in completion_contents]

    # Reward is proportional to the number of reasoning steps, maxing out at 1.0
    # We're using a "magic number" 3 here - encourage at least 3 steps for full reward
    return [min(1.0, count / 3) for count in matches]


We create a slightly more complex regex pattern. It looks for all the reasoning indicators we listed above.

We use `re.findall` to find *all* the matches of our pattern within each content. `len(re.findall(…))` then gives us the *count* of these indicators.

The reward is calculated as `min(1.0, count / 3)`. This means:

* If it finds 3 or more reasoning indicators ( count >= 3), the reward is 1.0 (max reward).

* If it finds fewer (e.g., count = 1 or 2), it gets a *partial* reward (like 1/3 or 2/3).

* If it finds none (count = 0), the reward is 0.0.

The `/ 3` is a bit of a magic number here. We’re saying **“aim for about 3 reasoning steps to get full credit”**. You can tweak this number if you want to encourage more or fewer steps.
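The scoring can be tried on plain strings with the standalone sketch below. It reuses the same regex, while the helper `steps_reward`, its `target_steps` parameter, and the example texts are assumptions made for illustration.

```python
import re

STEP_PATTERN = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"

def steps_reward(text, target_steps=3):
    """Count reasoning indicators and scale the count to a reward in [0, 1]."""
    count = len(re.findall(STEP_PATTERN, text, re.MULTILINE))
    return min(1.0, count / target_steps)

examples = {
    "no structure": "The answer is 4.",
    "one marker": "Step 1: add the numbers. The answer is 4.",
    "three markers": "First, read the problem.\nNext, add 2 and 2.\nFinally, report 4.",
}
for label, text in examples.items():
    print(label, steps_reward(text))
```

Because the reward only counts surface markers, a response can earn full credit without reasoning well, which is why it is combined with the accuracy reward rather than used alone.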

### Cosine Scaled Reward

Cosine Scaled Reward is a bit more advanced. It’s about encouraging *conciseness* in correct answers and being *less harsh* on longer incorrect answers.

![Cosine Scaling Concept (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/7094/1*WmG8r1OVeU4R3jObAy0yCg.png)

Think of it like this:

* **For correct answers:** We want to reward *shorter*, more direct solutions more than long, rambling ones. A short, correct answer is often better.

* **For incorrect answers:** A short, wrong answer is probably worse than a longer, wrong answer that at least *tried* to reason. So, we want to penalize short wrong answers *more* than long wrong answers.

Let’s see the code that does this clever scaling:

In [None]:
# Implement Cosine Scaled Reward Function
def get_cosine_scaled_reward(
    min_value_wrong: float = -0.5,
    max_value_wrong: float = -0.1,
    min_value_correct: float = 0.8,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
):
    """
    Returns a cosine scaled reward function. This function scales the accuracy reward
    based on completion length. Shorter correct solutions get higher rewards,
    longer incorrect solutions get less penalty.
    """
    def cosine_scaled_reward(completions, solution, accuracy_rewards, **kwargs):
        """
        Cosine scaled reward function that adjusts accuracy rewards based on completion length.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []

        for content, sol, acc_reward in zip(contents, solution, accuracy_rewards):
            gen_len = len(content)  # Length of the generated answer
            progress = gen_len / max_len # How far we are to max length
            cosine = math.cos(progress * math.pi) # Cosine value based on progress

            if acc_reward > 0.5: # Assuming accuracy_reward gives ~1.0 for correct answers
                min_value = min_value_correct
                max_value = max_value_correct
            else: # Incorrect answer
                min_value = max_value_wrong  # Note the swap!
                max_value = min_value_wrong

            # Cosine scaling formula!
            reward = min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)
            rewards.append(float(reward))
        return rewards
    return cosine_scaled_reward

`get_cosine_scaled_reward(...)` generates a reward function for training, customizing the scaling with parameters like `min_value_wrong`/`max_value_wrong` (penalty range for incorrect answers) and `min_value_correct`/`max_value_correct` (reward range for correct ones). `max_len` sets the maximum length for scaling.

Inside, `cosine_scaled_reward(...)` calculates rewards from `completions`, `solution`, and `accuracy_rewards`.

It computes `gen_len` (the completion length in characters), normalizes it as `progress = gen_len / max_len`, and derives a cosine value that starts at 1 (short answers) and decreases to -1 (answers of length `max_len`).

If `acc_reward > 0.5`, it uses the correct reward range; otherwise it applies the incorrect range but swaps the min/max values so that longer wrong answers are penalized less.
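The effect of the formula is easiest to see with a few numbers. The sketch below applies the same arithmetic to a single length and a correct/incorrect flag, using the default ranges from above; the standalone helper `cosine_reward` is a simplification written for this illustration.

```python
import math

def cosine_reward(gen_len, correct, max_len=1000,
                  min_value_wrong=-0.5, max_value_wrong=-0.1,
                  min_value_correct=0.8, max_value_correct=1.0):
    """Length-scaled reward for one completion (simplified, hypothetical helper)."""
    cosine = math.cos(gen_len / max_len * math.pi)
    if correct:
        min_value, max_value = min_value_correct, max_value_correct
    else:
        # Swapped on purpose: short wrong answers get the larger penalty
        min_value, max_value = max_value_wrong, min_value_wrong
    return min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)

for gen_len in (0, 500, 1000):
    print(gen_len,
          round(cosine_reward(gen_len, correct=True), 3),
          round(cosine_reward(gen_len, correct=False), 3))
```

As the length grows from 0 to `max_len`, the reward for a correct answer falls from 1.0 to 0.8, while the penalty for a wrong answer eases from -0.5 to -0.1.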

### Repetition Penalty Reward

Repetition Penalty Reward is all about discouraging our model from getting stuck in loops and repeating itself. We want it to generate fresh, varied reasoning and answers, not just copy-paste the same phrases over and over!

![Repetition Penalty Idea (Created by [Fareed Khan](undefined))](https://cdn-images-1.medium.com/max/8608/1*9jBhiz-rI_fRGa77g9RZtQ.png)

This reward function penalizes the model if it uses the same sequences of words (n-grams) too many times. We’ll use n-grams of size 3 (trigrams) in our example, but you can adjust this.

If the model repeats itself a lot, it gets a negative reward (penalty). If it’s more diverse and avoids repetition, the penalty is less.

Let’s implement the code to penalize repetition:

In [None]:
def get_repetition_penalty_reward(ngram_size: int = 3, max_penalty: float = -0.1):
    """
    Returns a repetition penalty reward function. Penalizes repetitions of n-grams
    in the generated text.
    """
    if max_penalty > 0:
        raise ValueError(f"max_penalty {max_penalty} should not be positive")

    def zipngram(text: str, ngram_size: int):
        """Helper function to generate n-grams from text."""
        words = text.lower().split() # Lowercase and split into words
        return zip(*[words[i:] for i in range(ngram_size)]) # Create n-grams

    def repetition_penalty_reward(completions, **kwargs) -> list[float]:
        """
        Repetition penalty reward function.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for completion in contents:
            if completion == "": # No penalty for empty completions
                rewards.append(0.0)
                continue
            if len(completion.split()) < ngram_size: # No penalty for short completions
                rewards.append(0.0)
                continue

            ngrams = set() # Use a set to store unique n-grams
            total = 0
            for ng in zipngram(completion, ngram_size): # Generate n-grams
                ngrams.add(ng) # Add n-gram to the set (duplicates are ignored)
                total += 1 # Count total n-grams

            # Calculate scaling factor: more repetition -> higher scaling
            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty # Apply penalty based on scaling
            rewards.append(reward)
        return rewards
    return repetition_penalty_reward

Our `get_repetition_penalty_reward(...)` creates a reward function to penalize repetition, with parameters like ngram_size (default 3, for trigrams) and max_penalty (a negative value, e.g., -0.1).

A helper function, `zipngram(text, ngram_size)`, generates n-grams by converting text to lowercase, splitting it into words, and using `zip(*[words[i:] for i in range(ngram_size)])` for efficient extraction.

Inside, `repetition_penalty_reward(...)` computes the penalty for each completion. If it's empty or too short, it gets a reward of 0.0.

The penalty factor is computed as `scaling = 1 - len(ngrams) / total`, where `total` is the number of n-grams and `len(ngrams)` is the number of unique ones. More repetition pushes `scaling` toward 1, increasing the penalty.

The final reward is scaling * max_penalty, meaning less repetition results in a smaller penalty, while high repetition leads to a stronger negative reward. 
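Here is a worked example of the same arithmetic on two short strings. The single-string helper `repetition_penalty` is a simplified sketch written for illustration, with the default trigram size and maximum penalty assumed.

```python
def repetition_penalty(text, ngram_size=3, max_penalty=-0.1):
    """Penalty for one string, based on the share of repeated n-grams."""
    words = text.lower().split()
    if len(words) < ngram_size:
        return 0.0  # too short to form a single n-gram
    ngrams = list(zip(*[words[i:] for i in range(ngram_size)]))
    scaling = 1 - len(set(ngrams)) / len(ngrams)
    return scaling * max_penalty

varied = "the cat sat on the mat"
looping = "the cat sat the cat sat"
print(repetition_penalty(varied))
print(repetition_penalty(looping))
```

In `varied`, all four trigrams are unique, so the scaling factor is 0 and there is no penalty. In `looping`, the trigram `("the", "cat", "sat")` appears twice among four trigrams, so the scaling factor is 1 - 3/4 = 0.25 and the penalty is a quarter of `max_penalty`.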

> We have implemented all five reward functions. Let’s move on to the next stage, where we define our training arguments.

## Training Configurations for R1 Zero

Now we need to code a configuration that lets us fine-tune how our *reward functions* actually work. So, let’s define that configuration class:

In [None]:
# Define GRPOScriptArguments for reward function parameters
@dataclass
class GRPOScriptArguments:
    """
    Script arguments for GRPO training, specifically related to reward functions.
    """

    reward_funcs: list[str] = field(
        default_factory=lambda: ["accuracy", "format"],
        metadata={
            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty'"
        },
    )
    cosine_min_value_wrong: float = field(
        default=-0.5,
        metadata={"help": "Minimum reward for cosine scaling for wrong answers"},
    )
    cosine_max_value_wrong: float = field(
        default=-0.1,
        metadata={"help": "Maximum reward for cosine scaling for wrong answers"},
    )
    cosine_min_value_correct: float = field(
        default=0.8,
        metadata={"help": "Minimum reward for cosine scaling for correct answers"},
    )
    cosine_max_value_correct: float = field(
        default=1.0,
        metadata={"help": "Maximum reward for cosine scaling for correct answers"},
    )
    cosine_max_len: int = field(
        default=1000,
        metadata={"help": "Maximum length for cosine scaling"},
    )

    repetition_n_grams: int = field(
        default=3,
        metadata={"help": "Size of the n-grams used for the repetition penalty reward"},
    )
    repetition_max_penalty: float = field(
        default=-0.1,
        metadata={"help": "Maximum (negative) penalty for the repetition penalty reward"},
    )

The `@dataclass` decorator makes it easy to create a class for storing data, while the `GRPOScriptArguments` class holds the reward settings.

The reward_funcs list decides which rewards to use, starting with ["accuracy", "format"], but you can add more like "reasoning_steps", "cosine", "repetition_penalty".

Some settings control how the cosine_scaled_reward and repetition_penalty_reward work, letting you adjust how rewards are given.

Next up, we have TrainingArguments from the transformers library. This is the **main** configuration object that controls almost **everything** about the training process.

In [None]:
# Define TrainingArguments from transformers
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,          # Output directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=1,             # Total number of training epochs
    per_device_train_batch_size=8,  # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    gradient_accumulation_steps=2,  # Accumulate gradients to simulate larger batch size
    learning_rate=5e-5,            # Initial learning rate for AdamW optimizer
    warmup_ratio=0.1,              # Linear warmup over warmup_ratio fraction of training steps
    weight_decay=0.01,             # Apply weight decay to all layers except bias and LayerNorm weights
    logging_steps=10,              # Log every X updates steps
    evaluation_strategy="steps",    # Evaluate every `eval_steps`
    eval_steps=50,                 # Evaluation and logging steps
    save_strategy="steps",         # Save checkpoint every `save_steps`
    save_steps=50,                 # Save checkpoint every X updates steps
    save_total_limit=2,            # Limit the total amount of checkpoints. Deletes the older checkpoints.
    dataloader_num_workers=2,      # Number of subprocesses to use for data loading
    seed=42,                       # Random seed for reproducibility
    bf16=True,                     # Use bfloat16 (BF16) mixed-precision training
    push_to_hub=False,             # Whether to push the final model to Hugging Face Hub
    gradient_checkpointing=True,   # Enable gradient checkpointing
    report_to="none",              # Disable reporting to external trackers
)
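As a quick sanity check on the batch settings above, the sketch below works out how many sequences contribute to one optimizer update. The helper `effective_batch_size` and the device counts are assumptions for illustration; how these sequences map to prompts versus sampled completions depends on the GRPO trainer being used.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Sequences that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Values from the TrainingArguments above, on an assumed single GPU
single_gpu = effective_batch_size(8, 2, num_devices=1)
# The same settings on a hypothetical 4-GPU machine
four_gpus = effective_batch_size(8, 2, num_devices=4)
print(single_gpu, four_gpus)
```

If you run out of GPU memory, you can lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` by the same factor to keep this product unchanged.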

Finally, we need a ModelConfig. This is where we put settings that are specific to the **model itself**, such as which pre-trained model to use, what data type to use (like bfloat16), and whether to trust remote code.

Let’s define our ModelConfig:

In [None]:
@dataclass
class ModelConfig:
    """
    Configuration for the model.
    """
    model_name_or_path: str = field(
        default=MODEL_NAME, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    model_revision: Optional[str] = field(
        default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}
    )
    torch_dtype: Optional[str] = field(
        default="bfloat16", metadata={"help": "Override the default `torch_dtype` and load the model under this dtype."}
    )
    trust_remote_code: bool = field(
        default=True, metadata={"help": "Trust remote code when loading model and tokenizer."}
    )
    attn_implementation: Optional[str] = field(
        default="flash_attention_2", metadata={"help": "Attention implementation to use. 'flash_attention_2' or None"}
    )

Our **ModelConfig** class holds key settings, including `model_name_or_path`, which defaults to the `MODEL_NAME` we chose earlier. We use `torch_dtype="bfloat16"` for efficiency and set `trust_remote_code=True`, which allows custom code shipped with the model repository to run while loading, so enable it only for repositories you trust. Additionally, `attn_implementation="flash_attention_2"` is enabled for potentially faster training if supported.
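If you are unsure whether FlashAttention 2 is installed on your machine, one option is to choose the attention implementation at runtime. The helper `pick_attn_implementation` below is a hypothetical sketch: it only checks that the `flash_attn` package can be imported, not that your GPU supports it, and it falls back to `"sdpa"`, PyTorch's scaled dot-product attention.

```python
import importlib.util

def pick_attn_implementation():
    """Prefer FlashAttention 2 when its package is importable, else use SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn_choice = pick_attn_implementation()
print(attn_choice)
```

The returned string could then be passed as the `attn_implementation` value when the config is instantiated, instead of relying on the hard-coded default.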

Now we need to actually **create** instances of these configuration classes so we can use them:

In [None]:
# Instantiate configuration objects
script_args = GRPOScriptArguments()
model_args = ModelConfig()

Next, we need to get our list of reward functions and any “callbacks” we want to use during training.

Callbacks are like little helpers that can do things at different points in the training process (like logging progress, saving models, etc.). For now, we’ll just use a simple logging callback.

First, let’s gather our reward functions in one place.

In [None]:
# Utility function to get reward functions based on script arguments
def get_reward_functions(script_args):
    """
    Returns a list of reward functions based on the script arguments.
    """
    reward_funcs_list = []
    reward_funcs_registry = {
        "accuracy": accuracy_reward,  # Assuming accuracy_reward is defined in previous steps
        "format": format_reward,      # Assuming format_reward is defined in previous steps
        "reasoning_steps": reasoning_steps_reward, # Assuming reasoning_steps_reward is defined
        "cosine": get_cosine_scaled_reward( # Assuming get_cosine_scaled_reward is defined
            min_value_wrong=script_args.cosine_min_value_wrong,
            max_value_wrong=script_args.cosine_max_value_wrong,
            min_value_correct=script_args.cosine_min_value_correct,
            max_value_correct=script_args.cosine_max_value_correct,
            max_len=script_args.cosine_max_len,
        ),
        "repetition_penalty": get_repetition_penalty_reward( # Assuming get_repetition_penalty_reward is defined
            ngram_size=script_args.repetition_n_grams,
            max_penalty=script_args.repetition_max_penalty,
        ),
    }

    for func_name in script_args.reward_funcs:
        if func_name not in reward_funcs_registry:
            raise ValueError(f"Reward function '{func_name}' not found in registry.")
        reward_funcs_list.append(reward_funcs_registry[func_name])

    return reward_funcs_list

Next comes our callback, which will track the loss and other important info.

In [None]:
logger = logging.getLogger(__name__)

class LoggingCallback(TrainerCallback):
    """
    A simple callback for logging training information at specific steps.
    """
    def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
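        if state.global_step % args.logging_steps == 0:
            # log_history can still be empty the first time this runs, so guard the lookup
            last_log = state.log_history[-1] if state.log_history else {}
            logger.info(f"Step {state.global_step}: Loss = {last_log.get('loss', None)}, Learning Rate = {last_log.get('learning_rate', None)}")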

def get_callbacks(training_args, model_args, script_args):
    """
    Returns a list of callbacks to be used during training.
    For now, it includes only the LoggingCallback. You can extend this to add more callbacks.
    """
    callbacks = [LoggingCallback()] # Instantiate our LoggingCallback
    return callbacks

Finally, we call these functions.

In [None]:
# Get reward functions and callbacks
reward_functions = get_reward_functions(script_args)
callbacks = get_callbacks(training_args, model_args, script_args)

## GRPO Training Loop

The `GRPOTrainer` is the engine that will actually drive our GRPO training. We need to initialize it, giving it all the pieces we’ve prepared: our model, reward functions, training arguments, dataset, and callbacks!

Let’s initialize the GRPOTrainer:

In [None]:
from grpo_trainer_override_tmp import GRPOTrainer


In [None]:
# Create GRPOConfig from TrainingArguments
grpo_config = GRPOConfig(
    **training_args.to_dict(), # Convert TrainingArguments to dictionary and unpack
    **{ 
       # REMOVED model_init_kwargs here 
       # We are passing the instantiated 'model' object, so GRPOTrainer doesn't need model_init_kwargs
    }
)

grpo_trainer = GRPOTrainer(
    model=model,                      # Our initialized Qwen model
    reward_funcs=reward_functions,    # List of reward functions from previous step
    args=grpo_config,                # GRPOConfig (created from TrainingArguments)
    train_dataset=dataset['train'],   # Training dataset
    eval_dataset=dataset['test'],    # Evaluation dataset
    callbacks=callbacks              # List of callbacks
)

We can now start the **Training Loop**! This is as simple as calling the train() method on our grpo_trainer.

In [None]:
# Start the GRPO Training Loop
train_result = grpo_trainer.train()

When you run this cell, you should see the training process begin.

Training will take some time, but since we set **num_train_epochs = 1** and are using a small model, it shouldn’t take *too* long for this example.

But for real-world GRPO DeepSeek R1 Zero training, you’d likely train for many more epochs and steps.

## Saving Tiny R1 Zero LLM

Once training has completed, we can save the trained model so it can be used for inference.

In [None]:
# Define the path to your trained model (same as OUTPUT_DIR)
TRAINED_MODEL_PATH = "data/Qwen-GRPO-training"

# Save the tokenizer
tokenizer.save_pretrained(TRAINED_MODEL_PATH)

# Save the trained model
grpo_trainer.save_model(TRAINED_MODEL_PATH)

print(f"GRPO Trained model saved to {TRAINED_MODEL_PATH}")

Then we can simply load the trained model using:

In [None]:
# Load the tokenizer - make sure to use trust_remote_code=True if needed
tokenizer = AutoTokenizer.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True, # If your model config requires it
    padding_side="right" # Ensure consistent padding side
)

# Set pad token if it wasn't saved or loaded correctly
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the trained model itself
trained_model = AutoModelForCausalLM.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True, # If your model architecture requires it
    torch_dtype=torch.bfloat16 # Keep the same dtype as training for consistency
)

# Move the loaded model to your device (GPU if available)
trained_model.to(device) # 'device' is still our CUDA device from before

In order to use it for inference:

In [None]:
# Testing Inference with the Trained Model
def test_trained_model_inference(user_input: str):
    """Test inference with the loaded trained model and tokenizer."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT}, # Re-use our system prompt
        {"role": "user", "content": user_input}
    ]

    # Apply chat template using our tokenizer
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt").to(device)

    # Generate output using our *trained_model*
    outputs = trained_model.generate(
        **inputs,
        max_new_tokens=200, # Maybe generate a bit longer now
        do_sample=True,
        temperature=0.7
    )

    # Decode the generated tokens back to text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the model
test_input = "how are you?"
response = test_trained_model_inference(test_input)
print(f"Test Input: {test_input}")
print(f"Trained Model Response: {response}")

## Two main problems with R1 Zero

We have now completed our R1 Zero training approach using a small Qwen base model instead of DeepSeek-V3, the original base model.

We cannot identify our own trained model’s problems, but DeepSeek’s researchers saw that the R1 Zero model performed really well on reasoning tests, even scoring similarly to more advanced models like **OpenAI-o1-0912** on tasks like **AIME 2024**.

This showed that using reinforcement learning (RL) to encourage reasoning in language models is a promising approach.

But they also noticed DeepSeek-R1-Zero had some key issues that needed fixing for real world use and wider research.

![Problem with R1 Zero (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/6378/1*_NdVhpb9cgT3-8o3Qn7mMA.png)

DeepSeek’s researchers state that the template is *intentionally simple and structurally focused*. It *avoids* imposing any *content-specific* constraints on the *reasoning process itself*. For example, it doesn’t say:

* “You *must* use step-by-step reasoning” (It just says “reasoning process” leaving it open to the model to define what that means).

* “You *must* use reflective reasoning”

* “You *must* use a specific problem-solving strategy”

The main problem was that the reasoning processes inside the `<think>` tags were hard to read, making it tough for humans to follow and analyze.

Another issue was language mixing: when asked multilingual questions, the model sometimes mixed languages in the same response, leading to inconsistent and confusing outputs.

If you asked it a question in, say, Spanish, its “thinking” would suddenly become a jumbled mix of **English and Spanish**, not exactly polished! These problems, messy reasoning and language confusion, were the clear roadblocks.
> These are the two main reasons they transformed their initial R1 Zero model into R1.

# Preparing Cold Start Data for SFT

So to fix R1 Zero issues and really get DeepSeek reasoning properly, researchers performed a **Cold Start Data Collection and included Supervised Fine Tuning**.

You can think of it as giving the model a good foundation in reasoning before the really intense RL training. Basically, they wanted to teach **DeepSeek-V3 Base** what good reasoning looks like and how to present it clearly.

One example of cold start data is [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k), which we saw earlier and will use for creating R1, but **we need to understand how a cold start dataset is created so we won’t skip any part of the actual training**.


## Few-shot Prompting with Long CoT

One technique is **Few-shot Prompting with Long Chain-of-Thought (CoT)**, in which we show DeepSeek-V3 Base (or in our case, Qwen2.5-0.5B-Instruct) a few examples of questions paired with very detailed, step-by-step solutions. This step-by-step style is called Chain-of-Thought (CoT).

![Long CoT (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/4068/1*SAhvB0JqaK4d45IiIcj1Ow.png)

The goal of this approach is to make the model learn by example and start mimicking this thorough reasoning style.

For our example problem “What is 2 + 3 * 4?”, we can create prompts that include a few solved problems as examples. Let’s see how this looks in Python:
```python
# Loading Model and Tokenizer
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, torch_dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")

# Generate Long COT Response
def generate_response(prompt_text):
    messages = [
        {"role": "system", "content": "You are a helpful assistant that provides step-by-step solutions."},
        {"role": "user", "content": prompt_text}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
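    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False) # Keep it deterministic for example
    # Decode only the newly generated tokens; skip_special_tokens=True strips chat markers
    # such as <|im_start|>, so splitting the full decoded text on them would not isolate the reply
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip() # Extract assistant's response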
```
Let’s define the few-shot examples for our question:
```python
# Example problems with solutions (using <|special_token|> as delimiter)
few_shot_prompt = """
Problem: What's the square root of 9 plus 5?
Solution: <|special_token|> First, find the square root of 9, which is 3. Then, add 5 to 3.  3 + 5 equals 8. <|special_token|> Summary: The answer is 8.

Problem: Train travels at 60 mph for 2 hours, how far?
Solution: <|special_token|> Use the formula: Distance = Speed times Time. Speed is 60 mph, Time is 2 hours. Distance = 60 * 2 = 120 miles. <|special_token|> Summary: Train travels 120 miles.

Problem: What is 2 + 3 * 4?
Solution:
"""
```

Now, using our base model, a sample generation looks like this:
```python
# Generate response for the target problem using few-shot examples
target_problem_prompt = few_shot_prompt  # The target problem is already the last entry in few_shot_prompt
model_response_few_shot = generate_response(target_problem_prompt)

print("Few-shot Prompt:")
print(target_problem_prompt)
print("\nModel Response (Few-shot CoT):")
print(model_response_few_shot)
```

It outputs this structured response:

```
Few-shot Prompt:
Problem: What's the square root of 9 plus 5?
Solution: <|special_token|> First, find the square root of 9, 
which is 3. Then, add 5 to 3.  3 + 5 equals 8. 
<|special_token|> Summary: The answer is 8.

Problem: Train travels at 60 mph for 2 hours, how far?
Solution: <|special_token|> Use the formula: Distance = Speed times Time. 
Speed is 60 mph, Time is 2 hours. Distance = 60 * 2 = 120 miles. 
<|special_token|> Summary: Train travels 120 miles.

Problem: What is 2 + 3 * 4?
Solution: 

Model Response (Few-shot CoT):
<|special_token|> To solve 2 + 3 * 4, we need to follow the order 
of operations (PEMDAS/BODMAS). Multiplication should be performed 
before addition.
Step 1: Multiply 3 by 4, which equals 12.
Step 2: Add 2 to the result from Step 1: 2 + 12 = 14.
<|special_token|> Summary: The answer is 14.
```

See how the model, after seeing examples, starts to structure its answer with <|special_token|> delimiters and provides step-by-step reasoning leading to the summary and final answer!

This is the power of few-shot learning guiding the model towards the desired output format.
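
When there are many solved examples, writing the prompt by hand gets tedious. A minimal sketch of how such a prompt could be assembled from a list of examples is shown below; `build_few_shot_prompt` and `DELIM` are hypothetical names introduced only for illustration, and the layout simply mirrors the prompt used above.

```python
# Hypothetical helper: assembles a few-shot prompt from (problem, reasoning, summary) triples
DELIM = "<|special_token|>"  # same delimiter used in the examples above

def build_few_shot_prompt(examples, target_problem):
    """Format solved examples followed by the unsolved target problem."""
    blocks = []
    for problem, reasoning, summary in examples:
        blocks.append(
            f"Problem: {problem}\n"
            f"Solution: {DELIM} {reasoning} {DELIM} Summary: {summary}"
        )
    # The target problem ends with an open "Solution:" for the model to complete
    blocks.append(f"Problem: {target_problem}\nSolution:")
    return "\n\n".join(blocks)

examples = [
    ("What's the square root of 9 plus 5?",
     "First, find the square root of 9, which is 3. Then, add 5 to 3. 3 + 5 equals 8.",
     "The answer is 8."),
    ("Train travels at 60 mph for 2 hours, how far?",
     "Use the formula: Distance = Speed times Time. Distance = 60 * 2 = 120 miles.",
     "Train travels 120 miles."),
]
prompt = build_few_shot_prompt(examples, "What is 2 + 3 * 4?")
print(prompt)
```

The resulting string can be passed to `generate_response` in place of the hand-written prompt.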


## Direct Prompting

Another method is **Direct Prompting**. Here, we directly instruct the model to not just solve the problem, but also to explicitly show its reasoning step-by-step and then verify its answer. 

This is about encouraging a more deliberate and thoughtful problem-solving approach.

![Example based learning (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/4656/1*IYyk7UWgDNADFe_djWcXow.png)

Let’s craft a prompt for “What is 2 + 3 * 4?” that explicitly asks for reasoning and verification. Here’s the Python code to see it in action:
```python
# Direct prompting example
direct_prompt_text = """
Problem: Solve this, show reasoning step-by-step, and verify:
What is 2 + 3 * 4?
"""

model_response_direct = generate_response(direct_prompt_text)

print("Direct Prompt:")
print(direct_prompt_text)
print("\nModel Response (Direct Prompting):")
print(model_response_direct)
```
The direct prompting output is very easy to understand and this is what it looks like:
```
Direct Prompt:
Problem: Solve this, show reasoning step-by-step, and verify:
What is 2 + 3 * 4?

Model Response (Direct Prompting):
<|special_token|> Reasoning: To solve 2 + 3 * 4, I need to follow 
the order of operations, which states that multiplication should 
be done before addition.
Step 1: Multiply 3 by 4, which equals 12.
Step 2: Add 2 to the result from Step 1: 2 + 12 = 14.
Verification: To verify the answer, I can double-check the 
order of operations and the calculations. Multiplication is 
indeed performed before addition, and the calculations are correct.
<|special_token|> Summary: The answer is 14.
```
As you can see, by directly asking for reasoning and verification, the model provides a more comprehensive output, including a “Verification” section.

This method directly guides the model to produce the kind of detailed reasoning we are looking for.

## Post Processing Refinement

The final technique involves **Post-Processing Refinement**. Interestingly, they even used the outputs from the already trained R1 Zero model for this!

Even with its issues, R1 Zero could reason somewhat. So, they took R1 Zero outputs and had human annotators refine them, making them cleaner, more structured, and correcting any mistakes.

![Processing Refinement (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/4388/1*-GR29EAnTOVBarQ2JrF5sA.png)

Imagine a messy R1 Zero output like this:
```
<think>  ummm... multiply 3 and 4... get 12... then add 2...</think>
<answer> 14 </answer>
```

Human annotators would then refine it to something much clearer and better formatted:
```
<|special_token|> Reasoning: To solve this, we use order of operations, doing multiplication before addition.
Step 1: Multiply 3 by 4, which is 12.
Step 2: Add 2 to the result: 2 + 12 = 14.
<|special_token|> Summary: The answer is 14.
```

While we can’t perfectly simulate human refinement in code, we can demonstrate a basic idea of how you might programmatically reformat and structure a potentially messy output. 

Let’s take a simulated “messy” output and show how we could refine it:
```python
# Simulated messy R1 Zero output
messy_output = "<think>  ummm... multiply 3 and 4... get 12... then add 2...</think>\n<answer> 14 </answer>"

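def refine_output(messy_text):
    think_content = messy_text.split("<think>")[1].split("</think>")[0].strip()
    answer_content = messy_text.split("<answer>")[1].split("</answer>")[0].strip()

    # Drop filler words such as "umm..." or "ummm..." (uses the re module imported at the top)
    # and trim trailing dots so the sentence can be re-punctuated with a single period
    cleaned = re.sub(r"\bu+m+\b\.*\s*", "", think_content, flags=re.IGNORECASE).rstrip(". ")

    refined_text = f"""<|special_token|> Reasoning: {cleaned.capitalize()}.
<|special_token|> Summary: The answer is {answer_content}."""
    return refined_text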

refined_output_text = refine_output(messy_output)

print("Messy Output (Simulated R1 Zero):")
print(messy_output)
print("\nRefined Output:")
print(refined_output_text)
```

This will output:
```
Messy Output (Simulated R1 Zero):
<think>  ummm... multiply 3 and 4... get 12... then add 2...</think>
<answer> 14 </answer>

Refined Output:
<|special_token|> Reasoning: Multiply 3 and 4... get 12... then add 2.
<|special_token|> Summary: The answer is 14.
```

This simple refine_output function is just a basic example. Real refinement by humans involves much more nuanced understanding and correction of reasoning steps.

However, it shows the core idea: taking initial model outputs and improving their quality and structure to create better training data.
> After generating this Cold Start Data, the next crucial step was **Supervised Fine-Tuning (SFT)**, which we’ll explore in the next section!

## SFT Stage 1 With Cold Start Data

Generating proper cold start data to build R1 with supervised fine-tuning would require a dedicated team along with a large amount of code, but thankfully, we already have data ([Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)) that is similar in form to cold start data.
> We need to know what happens inside the SFT Trainer, and how, as it processes our training data.

SFT is a form of supervised learning. This means we’re giving the model pairs of inputs and *desired* outputs.

In our case, the input might be a problem prompt, and the desired output is the well-reasoned, step-by-step solution from our training dataset. **I hope this makes it clear why cold start data is needed.**

The SFT Trainer takes our tokenized training data and feeds it to the model in batches. For each batch, an important set of operations happens. Let’s visualize this internal process:

![SFT WorkFlow (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/6838/1*EsEgATw1aSYPjfGtpId2mQ.png)

First, the model takes an input, a problem prompt, for instance. It processes this input and generates its best guess for the solution, token by token. These are the *predicted tokens*.

Next, the SFT Trainer needs to know how good (or bad) these predictions are. It uses a *loss function*, typically Cross-Entropy Loss. This function mathematically compares the model’s predicted tokens to the *correct* tokens from our training data. Think of it as calculating the “error” of the model’s answer.

This “error” isn’t just discarded. It’s the crucial signal for learning. Through a process called *backpropagation*, this error is used to calculate *gradients*. Gradients are like guides, pointing in the direction of parameter adjustments that would reduce the error.

Finally, an *optimizer*, like **AdamW**, uses these gradients to subtly tweak the model’s internal settings, its parameters. These tweaks are designed to make the model’s next prediction a little bit closer to the correct answer.
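
The predict, measure, backpropagate, update cycle can be sketched on a toy problem. The example below is a simplification under stated assumptions, not what the SFT Trainer actually runs: the "model" is just three logits for a single token position, the gradient of softmax plus cross-entropy is written in closed form instead of using autograd, and plain gradient descent stands in for AdamW.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target_index])

def gradient(logits, target_index):
    """d(loss)/d(logits) for softmax + cross-entropy: probabilities minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

# Toy setup (assumption): the "model" is just three logits for one token position
logits = [0.5, 0.2, -0.1]
target = 2            # index of the correct next token in the training data
learning_rate = 0.5   # plain gradient descent, used here instead of AdamW

losses = []
for step in range(5):
    losses.append(cross_entropy(logits, target))   # 1. forward pass and loss
    grads = gradient(logits, target)               # 2. gradients (closed form here)
    logits = [w - learning_rate * g for w, g in zip(logits, grads)]  # 3. parameter update

print([round(loss, 3) for loss in losses])
```

Each pass through the loop lowers the loss a little, which is the same behaviour we expect to see in the loss values logged during SFT.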


## Stage 1 SFT Trainer Configs for R1

Remember the problems we had with R1 Zero: messy reasoning and language mixing? SFT is designed to fix exactly that. By training on high-quality, refined data, we’re teaching the model:

* **Clear Reasoning Style**: To structure its “thinking” in a way that’s easy to read and follow.

* **Consistent Language**: To stick to one language within a response, avoiding confusing mixes.

We’re using the Bespoke-Stratos-17k dataset for SFT. As we saw earlier, it’s got 17,000 problems focused on math and code, with a format that looks pretty good for our needs.

Let’s quickly remind ourselves of a sample from Bespoke-Stratos-17k:

In [None]:
# Load the "Bespoke-Stratos-17k" dataset from bespokelabs
bespoke_rl = load_dataset("bespokelabs/Bespoke-Stratos-17k", "default")

# Access the first sample in the training set
bespoke_rl['train'][0]

This dataset, with its system prompts and user-assistant conversations, is perfect for showing our model how conversations with reasoning should look.

We’ll use the trl library again, which makes SFT training super easy.

First, we need to set up our configurations, similar to what we did for GRPO, but this time for SFT.

In [None]:
# Model and Output Configuration (same as before, or adjust as needed)
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "data/Qwen-SFT-training" # New output directory for SFT model
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Training Arguments - similar to GRPO, but adjust for SFT
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=1,         # Adjust epochs as needed
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,        # Adjust learning rate for SFT
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="no",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    dataloader_num_workers=2,
    seed=42,
    bf16=True,
    push_to_hub=False,
    gradient_checkpointing=True,
    report_to="none",
)

# Model Configuration - same as before
model_args = ModelConfig(
    model_name_or_path=MODEL_NAME,
    model_revision="main",
    torch_dtype="bfloat16",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)

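These TrainingArguments and ModelConfig are quite similar to what we used for GRPO, but with a few tweaks that are more suitable for SFT, like a slightly different learning rate. Options such as packing=True and max_seq_length=4096, which make training on longer sequences more efficient, are not set in the cell above; in TRL, options of this kind are passed through SFTConfig rather than TrainingArguments, and their exact names depend on the TRL version.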

## Stage 1 SFT Training Loop

Now, let’s load our dataset and tokenizer:

In [None]:
# Load Bespoke-Stratos-17k dataset
dataset_sft = load_dataset("HuggingFaceH4/Bespoke-Stratos-17k", split='train') # Only using train split for simplicity

# Initialize tokenizer - same as before
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right"
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

And finally, we initialize the SFTTrainer and start training!

In [None]:
# Initialize base model for SFT - same as before
model_sft = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Initialize the SFT Trainer
sft_trainer = SFTTrainer(
    model=model_sft,                     # Our initialized Qwen model
    train_dataset=dataset_sft,           # Bespoke-Stratos-17k dataset
    tokenizer=tokenizer,                 # Tokenizer
    args=training_args,                  # Training arguments
)

# Start the SFT Training Loop
sft_train_result = sft_trainer.train()

When you run this code, you’ll see the SFT training process start. It will look similar to the GRPO training output, showing loss and learning rate at each logging step.

Just like with GRPO, training time will depend on your hardware and chosen epochs. Since we’re still using a small model and only 1 epoch for this example, it should be reasonably quick.

## Saving Tiny R1 LLM

After SFT is done, we save our newly fine-tuned model (R1).

In [None]:
# Saving the Trained SFT Model
TRAINED_SFT_MODEL_PATH = "data/Qwen-SFT-training" # Same as OUTPUT_DIR

# Save the tokenizer
tokenizer.save_pretrained(TRAINED_SFT_MODEL_PATH)

# Save the trained model
sft_trainer.save_model(TRAINED_SFT_MODEL_PATH)

print(f"SFT Trained model saved to {TRAINED_SFT_MODEL_PATH}")

And that’s it for the SFT part! We’ve now taken our base model, shown it lots of examples of good reasoning, and fine-tuned it to be better at producing clear, structured responses.
> This model, fine-tuned with SFT, is what we call R1 after SFT Stage 1.

The steps after SFT, especially the RL stages and rejection sampling, are complex to implement from scratch in Python. Focusing on the theory is key to understanding the overall process.


## Reasoning-Oriented RL

After SFT, the model can reason better, but we want to *really* focus on reasoning quality and fix language mixing. This stage uses RL again, but with a smarter reward system.

This new reward checks if the model’s reasoning and answer are in the same language as the question. If you ask in English, the *whole* response should be in English. This fixes language mixing issues.

![Reasoning Oriented RL (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/7468/1*Z2oHDdkWb7RnO5uVHPSvMg.png)

It adds a **Language Consistency Reward** alongside accuracy to ensure the SFT model reasons and answers in the same language as the input.

The GRPO algorithm and training loop from R1 Zero are reused, but the reward signals are improved to specifically target better reasoning and consistent language output.
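
One way to picture such a reward is as the share of words in the reasoning that belong to the target language. The sketch below is only an illustration: `language_consistency_reward` and `is_ascii_word` are hypothetical names, and the ASCII check is a deliberately crude stand-in for real language identification (it can separate English from Chinese, but not English from Spanish).

```python
import re

def is_ascii_word(word):
    """Toy language check (assumption): treat ASCII-only words as English."""
    return word.isascii()

def language_consistency_reward(reasoning, is_target_language=is_ascii_word):
    """Fraction of words in the reasoning that belong to the target language."""
    words = re.findall(r"\w+", reasoning)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_language(w)) / len(words)

consistent = language_consistency_reward("Multiply 3 by 4 then add 2")
mixed = language_consistency_reward("Multiply 3 by 4 然后 add 2")
print(consistent, mixed)
```

A function of this shape could be added to the reward registry next to the accuracy reward, so responses that drift into another language score lower.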

## Rejection Sampling

To get super high-quality reasoning data, DeepSeek uses **Rejection Sampling**. Think of it as a filter to keep only the *best* examples.

![Rejection Sampling (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/8520/1*obG-BrhwtIuOv7YBZIpSwg.png)

The model generates many reasoning examples. These are then evaluated for correctness and reasoning quality (often using a generative reward model and human checks).

Only the *best*, high-quality reasoning examples are kept. Combined with non-reasoning data, this refined dataset is used for a second **SFT Stage 2**, further improving reasoning and general abilities.
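
The filtering idea can be sketched in a few lines. This is a simplified illustration under assumptions: candidates are plain dictionaries, correctness is an exact string match against a reference answer, and `count_steps` is a hypothetical quality score standing in for the generative reward model and human checks mentioned above.

```python
def rejection_sample(candidates, reference_answer, score_fn, keep_top_k=1):
    """Keep the highest-scoring candidates whose final answer is correct."""
    correct = [c for c in candidates if c["answer"] == reference_answer]
    ranked = sorted(correct, key=score_fn, reverse=True)
    return ranked[:keep_top_k]

# Hypothetical quality score: prefer reasoning with more explicit steps
def count_steps(candidate):
    return candidate["reasoning"].count("Step")

candidates = [
    {"reasoning": "2 + 3 = 5, 5 * 4 = 20.", "answer": "20"},
    {"reasoning": "Step 1: 3 * 4 = 12. Step 2: 2 + 12 = 14.", "answer": "14"},
    {"reasoning": "It is 14.", "answer": "14"},
]
kept = rejection_sample(candidates, reference_answer="14", score_fn=count_steps)
print(kept)
```

The wrong answer is rejected outright, and of the two correct candidates only the one with the clearer reasoning survives to become SFT training data.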

## SFT Stage 2 Training

After SFT Stage 2, a final RL stage focuses on making the model a helpful and safe AI assistant for *all* situations, not just reasoning problems. This is about alignment with human values.

**Key Focus: Helpfulness & Harmlessness Rewards**

Not just accuracy, the reward system now includes:

* **Helpfulness:** Is the response useful and informative?

* **Harmlessness:** Is the response safe, unbiased, and ethical?

![SFT Stage 2 (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/7086/1*_u5ALx4VYQpsSgT_0s10HQ.png)

The training data becomes diverse, including reasoning tasks and human preference data (which output is better — more helpful, less harmful?).

The reward system now balances accuracy with **helpfulness and harmlessness**. Iterative RL training (likely GRPO again) optimizes the model to be not just good at reasoning, but also a safe and helpful AI assistant for general use, resulting in DeepSeek R1.

## Distillation

To make DeepSeek R1 accessible, they **distilled** its knowledge into smaller models.

![Distillation Process (Created by Fareed Khan)](https://cdn-images-1.medium.com/max/2500/0*QdOxtvuKaEASreK0.png)

Distillation takes the knowledge of a large, powerful “teacher” model (DeepSeek R1) and transfers it to smaller “student” models. Using a large dataset of reasoning examples, the outputs of DeepSeek R1 are used as the *target* answers.

Smaller models are then trained (SFT) to mimic these outputs. This results in smaller, faster models that retain a significant portion of DeepSeek R1’s reasoning abilities, making them more practical for wider use.
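
The data side of this process can be sketched as follows. This is a minimal illustration under assumptions: `build_distillation_dataset` and `toy_teacher` are hypothetical names, the teacher is a canned stand-in for the large model, and the chat-style `messages` layout is just one convenient format for SFT records.

```python
def build_distillation_dataset(prompts, teacher_generate):
    """Pair each prompt with the teacher's output as the SFT target."""
    records = []
    for prompt in prompts:
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": teacher_generate(prompt)},
            ]
        })
    return records

# Stand-in teacher (assumption): a real pipeline would call the large model here
def toy_teacher(prompt):
    canned = {"What is 2 + 3 * 4?": "<think>3 * 4 = 12, then 2 + 12 = 14.</think> 14"}
    return canned.get(prompt, "")

dataset = build_distillation_dataset(["What is 2 + 3 * 4?"], toy_teacher)
print(dataset[0]["messages"][1]["content"])
```

Records built this way would then be used to fine-tune the student with the same kind of SFT loop shown earlier.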