## **Wordle With GRPO** (RL + Constraint-Aware Inference)

This project explores **reinforcement learning** for symbolic reasoning by training a language model to play Wordle using **GRPO** (Group Relative Policy Optimization).

**Key idea**
* The model learns guess preferences via RL.
* Exact Wordle logic is enforced at inference time using a constraint-aware reranker.
* The secret word is never shown to the model.

**Install libraries**

In [None]:
!pip install  -U -q trl peft math_verify
!pip install -q transformers datasets accelerate
!pip install torch

**Import libraries**

In [None]:
import ast
import re
from collections import Counter, defaultdict

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

**Login to HuggingFace hub**

In [None]:
from huggingface_hub import notebook_login
notebook_login()

**Load the dataset**

In [None]:
df = pd.read_csv('/content/train.csv')

**Preview the dataset**

In [None]:
df.head()

Unnamed: 0,prompt,word_list,past_guess_history,secret
0,"<|im_start|>system\n\nYou are playing Wordle, ...",https://raw.githubusercontent.com/arnavgarg1/a...,[],ABHOR
1,"<|im_start|>system\n\nYou are playing Wordle, ...",https://raw.githubusercontent.com/arnavgarg1/a...,"[['CRANE', 'C(x) R(x) A(-) N(x) E(-)'], ['SWEA...",ALLEY
2,"<|im_start|>system\n\nYou are playing Wordle, ...",https://raw.githubusercontent.com/arnavgarg1/a...,"[['CRANE', 'C(x) R(x) A(-) N(x) E(x)'], ['ADUL...",ALLOT
3,"<|im_start|>system\n\nYou are playing Wordle, ...",https://raw.githubusercontent.com/arnavgarg1/a...,"[['CRANE', 'C(x) R(x) A(-) N(-) E(x)']]",ANNUL
4,"<|im_start|>system\n\nYou are playing Wordle, ...",https://raw.githubusercontent.com/arnavgarg1/a...,"[['CRANE', 'C(x) R(x) A(-) N(x) E(x)'], ['BLOA...",BATTY


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   prompt              76 non-null     object
 1   word_list           76 non-null     object
 2   past_guess_history  76 non-null     object
 3   secret              76 non-null     object
dtypes: object(4)
memory usage: 2.5+ KB


**Parse and process the necessary columns for convenience**

In [None]:
def parse_history(x):
    if x == "[]":
        return []
    return ast.literal_eval(x)

df['history'] = df['past_guess_history'].apply(parse_history)

In [None]:
df['history'].iloc[3]

[['CRANE', 'C(x) R(x) A(-) N(-) E(x)']]

In [None]:
def build_state(prompt, history):
    lines = [prompt.strip(), "\nCurrent guesses:"]
    if len(history) == 0:
        lines.append("None")
    else:
        for i, (guess, fb) in enumerate(history):
            lines.append(f"Guess {i+1}: {guess} -> {fb}")
    lines.append("\nNext guess:")
    return "\n".join(lines)

df['state_text'] = df.apply(
    lambda r: build_state(r['prompt'], r['history']),
    axis=1
)

In [None]:
print(df['state_text'].iloc[1])

<|im_start|>system

You are playing Wordle, a word-guessing game.

### Game Rules:
- You have **6 tries** to guess a secret **5-letter** word.
- Each guess must be a valid **5-letter English word**.
- After each guess, you will receive feedback indicating how close your guess was.

### Feedback Format:
Each letter in your guess will receive one of three symbols:
1. ✓ : The letter is in the word and in the CORRECT position.
2. - : The letter is in the word but in the WRONG position.
3. x : The letter is NOT in the word.

### Example:
Secret Word: BRISK

Guess 1: STORM → Feedback: S(-) T(x) O(x) R(-) M(x)
Guess 2: BRAVE → Feedback: B(✓) R(✓) A(x) V(x) E(x)
Guess 3: BRISK → Feedback: B(✓) R(✓) I(✓) S(✓) K(✓)

### Response Format:
Think through the problem and feedback step by step. Make sure to first add your step by step thought process within <think> </think> tags. Then, return your guessed word in the following format: <guess> guessed-word </guess>.
<|im_end|>
<|im_start|>user
Make a n

**Remove the rows with no past guesses**

In [None]:
def extract_action(history):
    if len(history) == 0:
        return None
    return history[-1][0]

df['action'] = df['history'].apply(extract_action)

In [None]:
df = df[df['action'].notna()].reset_index(drop=True)
print(len(df))

74


In [None]:
trl_ds = Dataset.from_pandas(
    df[['state_text', 'secret']]
).rename_columns({
    "state_text": "prompt",
    "secret": "solution"
})

**Define the reward function**

In [None]:
def wordle_reward(prompts, completions, solution, **kwargs):
    """
    prompts: List[str]
    completions: List[str]
    solution: List[str]  (same length as completions)
    """

    rewards = []

    for guess, secret in zip(completions, solution):
        guess = guess.strip().upper()
        secret = secret.upper()

        # invalid guess
        if len(guess) != 5:
            rewards.append(-0.5)
            continue

        r = 0.0
        for g, s in zip(guess, secret):
            if g == s:
                r += 0.2
            elif g in secret:
                r += 0.05

        if guess == secret:
            r += 1.0

        rewards.append(r)

    return rewards
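
To make the shaping concrete, the per-sample arithmetic inside the loop above can be reproduced in isolation. The helper name `shaped_reward` is hypothetical and introduced only for this illustration; it is meant to mirror the scoring rules of `wordle_reward` for a single guess.

```python
def shaped_reward(guess: str, secret: str) -> float:
    """Score one guess with the same rules as the loop in wordle_reward.

    Hypothetical helper for illustration; not part of the training code.
    """
    guess, secret = guess.strip().upper(), secret.upper()
    if len(guess) != 5:
        return -0.5                      # malformed completion
    reward = 0.0
    for g, s in zip(guess, secret):
        if g == s:
            reward += 0.2                # right letter, right position
        elif g in secret:
            reward += 0.05               # letter occurs somewhere in the secret
    if guess == secret:
        reward += 1.0                    # bonus for solving
    return reward

for guess in ["EARLY", "ALLOY", "ALLEY", "TOO LONG"]:
    print(guess, round(shaped_reward(guess, "ALLEY"), 2))
```

Against the secret ALLEY, EARLY earns 0.35 (one exact match plus three present letters), ALLOY earns 0.8, and the exact answer earns 2.0. The `g in secret` test does not track letter multiplicity, so a guess with repeated letters can collect the 0.05 bonus more often than real Wordle feedback would mark as present.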

**Define the GRPO training arguments**

In [None]:
training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-Wordle-GRPO",
    learning_rate=1e-5,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    fp16=True,
    bf16=False,

    max_prompt_length=256,
    max_completion_length=8,  # 5-letter word + buffer
    num_generations=4,

    logging_steps=5,
    report_to="tensorboard",
    save_strategy="no"
)
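
`num_generations=4` sets the size of the group that GRPO compares. For each prompt, the sampled completions are scored and each reward is judged relative to the rest of its group. The snippet below is a simplified sketch of that normalization, not TRL's internal implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of completions for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions better than the group average get a positive advantage
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical completions for one prompt, scored by a reward function
group_rewards = [2.0, 0.35, -0.5, 0.35]
advantages = group_relative_advantages(group_rewards)
for r, a in zip(group_rewards, advantages):
    print(f"reward={r:+.2f}  advantage={a:+.3f}")
```

If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a shaped reward with partial credit can help on a small model.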

**Define the model and GRPO trainer**

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=trl_ds,
    reward_funcs=wordle_reward
)

**Train the model**

In [None]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
5,-0.0216
10,-0.0082


TrainOutput(global_step=12, training_loss=-0.0033825250963370004, metrics={'train_runtime': 101.6316, 'train_samples_per_second': 2.184, 'train_steps_per_second': 0.118, 'total_flos': 0.0, 'train_loss': -0.0033825250963370004})

Training completed with training_loss=-0.0033825250963370004

**Save the model**

In [None]:
save_dir = "qwen2.5-0.5b-wordle-grpo"

model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

('qwen2.5-0.5b-wordle-grpo/tokenizer_config.json',
 'qwen2.5-0.5b-wordle-grpo/special_tokens_map.json',
 'qwen2.5-0.5b-wordle-grpo/chat_template.jinja',
 'qwen2.5-0.5b-wordle-grpo/vocab.json',
 'qwen2.5-0.5b-wordle-grpo/merges.txt',
 'qwen2.5-0.5b-wordle-grpo/added_tokens.json',
 'qwen2.5-0.5b-wordle-grpo/tokenizer.json')

**Push the model to HuggingFace**

In [None]:
model.push_to_hub("username/qwen2.5-0.5b-wordle-grpo")
tokenizer.push_to_hub("username/qwen2.5-0.5b-wordle-grpo")

**Load the model**

In [None]:
repo_id = "username/qwen2.5-0.5b-wordle-grpo"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
    trust_remote_code=True
).eval()

**Test the model with a simple prompt**

In [None]:
test_prompt = """You are playing Wordle.
Rules:
- Guess a 5-letter word.
- Respond with ONLY the word.

Previous guesses:
CRANE → C(x) R(x) A(-) N(x) E(-)

Your guess:
"""

test_secret = "ALLEY"   # ← NOT passed to model

In [None]:
def single_test(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=8,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,  # nucleus sampling; output is stochastic, not deterministic
            pad_token_id=tokenizer.pad_token_id
        )

    text = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    )

    match = re.search(r"[A-Za-z]{5}", text)
    return match.group(0).upper() if match else "XXXXX"

In [None]:
guess = single_test(test_prompt)
print("Model guess:", guess)


Model guess: CURRE


The RL-trained model returns a five-letter string (CURRE), although it is not a valid English word.
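
One detail of the extraction step is worth checking: `re.search(r"[A-Za-z]{5}", text)` matches the first run of five letters anywhere in the text, so a longer word is silently cut down to its first five letters. The sketch below contrasts that pattern with a word-boundary variant; the function names and sample strings are hypothetical and chosen only for illustration.

```python
import re

def first_five_letters(text: str) -> str:
    """Pattern used in the notebook: any run of five letters, even inside a longer word."""
    match = re.search(r"[A-Za-z]{5}", text)
    return match.group(0).upper() if match else "XXXXX"

def first_five_letter_word(text: str) -> str:
    """Stricter variant: only accept a standalone five-letter token."""
    match = re.search(r"\b[A-Za-z]{5}\b", text)
    return match.group(0).upper() if match else "XXXXX"

for sample in ["current", "wordle: slate", "abc"]:
    print(repr(sample), first_five_letters(sample), first_five_letter_word(sample))
```

With the looser pattern, a completion such as "current" would be reported as CURRE and "wordle: slate" as WORDL, while the word-boundary variant returns the placeholder and SLATE respectively. Whether the stricter pattern is preferable depends on how often the model emits longer words.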

### **Building Gameplay Loop with constraint-aware reranker**

**Define the base prompt**

In [None]:
BASE_PROMPT = """You are playing Wordle.
Rules:
- Guess a 5-letter word.
- Respond with ONLY the word.

Previous guesses:
"""

def build_prompt(history):
    prompt = BASE_PROMPT
    if not history:
        prompt += "(none)\n"
    else:
        for guess, fb in history:
            prompt += f"{guess} → {fb}\n"
    prompt += "\nYour guess:\n"
    return prompt

**Define the guess-generation function**

In [None]:
def generate_guess(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=8,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id
        )

    text = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    )

    match = re.search(r"[A-Za-z]{5}", text)
    return match.group(0).upper() if match else "XXXXX"

**Compute the Wordle feedback for a guess**

In [None]:
def wordle_feedback(guess, secret):
    feedback = []
    secret_chars = list(secret)

    # First pass: correct positions
    for i, g in enumerate(guess):
        if g == secret[i]:
            feedback.append("✓")
            secret_chars[i] = None
        else:
            feedback.append(None)

    # Second pass: wrong positions
    for i, g in enumerate(guess):
        if feedback[i] is None:
            if g in secret_chars:
                feedback[i] = "-"
                secret_chars[secret_chars.index(g)] = None
            else:
                feedback[i] = "x"

    return " ".join(f"{g}({f})" for g, f in zip(guess, feedback))
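
Duplicate letters are the reason for the two passes. A short worked example follows, using a compact Counter-based variant written for this illustration (the name `feedback_symbols` is hypothetical). It is intended to follow the same rules as `wordle_feedback` but returns only the symbols.

```python
from collections import Counter

def feedback_symbols(guess: str, secret: str) -> list:
    """Two-pass Wordle scoring: exact matches first, then leftover letters."""
    symbols = ["x"] * len(guess)
    # Letters of the secret that are not consumed by an exact match
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            symbols[i] = "✓"
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g != s and remaining[g] > 0:
            symbols[i] = "-"
            remaining[g] -= 1            # each secret letter can be claimed once
    return symbols

print(feedback_symbols("LLAMA", "ALLEY"))
print(feedback_symbols("CRANE", "ALLEY"))
```

In LLAMA against ALLEY, the first A is marked `-` and the second A is marked `x`, because the secret's only A has already been claimed. The CRANE result matches the feedback string stored in the dataset for the secret ALLEY.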

**Create a function to parse feedback**

In [None]:
def parse_feedback(feedback_str):
    # "C(x) R(✓) A(-) N(x) E(x)"
    feedback = []
    for token in feedback_str.split():
        letter = token[0]
        status = token[2]  # x, -, or ✓
        feedback.append((letter, status))
    return feedback

**Define a function that checks a guess against the game's constraints**

In [None]:
def satisfies_constraints(guess, history):
    guess = guess.upper()
    guess_counts = Counter(guess)

    # Never repeat guesses
    if guess in {g.upper() for g, _ in history}:
        return False

    min_counts = defaultdict(int)   # letter -> fewest copies the secret must contain
    max_counts = {}                 # letter -> most copies the secret can contain
    forbidden_positions = defaultdict(set)
    required_positions = {}

    for prev_guess, fb_str in history:
        feedback = parse_feedback(fb_str)

        # Copies of each letter confirmed (✓ or -) by this single guess
        present = Counter(l for l, s in feedback if s in ("✓", "-"))

        for i, (letter, status) in enumerate(feedback):
            if status == "✓":
                required_positions[i] = letter
            else:
                # Both "-" and "x" rule out this letter at this position
                forbidden_positions[letter].add(i)
                if status == "x":
                    # An "x" caps the count at what this guess confirmed
                    max_counts[letter] = present[letter]

        # Marks from different guesses refer to the same secret letters,
        # so keep the largest lower bound instead of summing them
        for letter, count in present.items():
            min_counts[letter] = max(min_counts[letter], count)

    # Enforce required positions
    for pos, letter in required_positions.items():
        if guess[pos] != letter:
            return False

    # Enforce forbidden positions
    for letter, positions in forbidden_positions.items():
        for pos in positions:
            if guess[pos] == letter:
                return False

    # Enforce counts
    for l, c in min_counts.items():
        if guess_counts[l] < c:
            return False
    for l, c in max_counts.items():
        if guess_counts[l] > c:
            return False

    return True
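
Letter-count bounds are the subtle part of a check like this. One way to reason about them, sketched below under the assumption that feedback strings use the `L(s)` token format shown earlier, is to count the `✓` and `-` marks per letter within each guess, keep the largest such count across guesses as the lower bound, and treat any `x` as fixing the upper bound at that guess's count. The helper name `letter_bounds` is hypothetical.

```python
from collections import Counter

def letter_bounds(history):
    """Return (min_counts, max_counts) implied by a list of (guess, feedback) pairs."""
    min_counts, max_counts = {}, {}
    for _guess, fb_str in history:
        marks = [(tok[0], tok[2]) for tok in fb_str.split()]
        # Occurrences of each letter confirmed by this single guess
        present = Counter(
            letter for letter, status in marks if status in ("✓", "-")
        )
        for letter, count in present.items():
            min_counts[letter] = max(min_counts.get(letter, 0), count)
        for letter, status in marks:
            if status == "x":
                # An x means the secret holds no more copies than were confirmed
                max_counts[letter] = present[letter]
    return min_counts, max_counts

history = [
    ("DUBLI", "D(x) U(x) B(-) L(x) I(-)"),
    ("FIRST", "F(x) I(-) R(-) S(x) T(x)"),
]
mins, maxs = letter_bounds(history)
print(mins)
print(maxs)
```

Here `I` is marked present in both guesses, yet its lower bound stays at one: the two marks refer to the same letter of the secret, so the per-guess counts are combined with `max` rather than added.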

**Define a function for constraint scoring**

In [None]:
def constraint_score(guess, history):
    score = 0
    guess = guess.upper()

    for prev_guess, fb_str in history:
        feedback = parse_feedback(fb_str)

        for i, (letter, status) in enumerate(feedback):
            if status == "✓" and guess[i] == letter:
                score += 2
            elif status == "-" and letter in guess and guess[i] != letter:
                score += 1
            elif status == "x" and letter not in guess:
                score += 0.5
            else:
                score -= 1

    return score

**Function to generate reranked guess**

In [None]:
def generate_reranked_guess(prompt, history, num_samples=64):
    candidates = [generate_guess(prompt) for _ in range(num_samples)]

    valid = [
        g for g in candidates
        if len(g) == 5 and satisfies_constraints(g, history)
    ]

    if not valid:
        return max(candidates, key=lambda g: constraint_score(g, history))

    # Prefer guesses that satisfy MORE confirmed positions
    def score(g):
        s = 0
        for prev, fb in history:
            for i, (l, st) in enumerate(parse_feedback(fb)):
                if st == "✓" and g[i] == l:
                    s += 2
        return s

    return max(valid, key=score)

**Putting everything together**

In [None]:
def play_wordle(secret, model, tokenizer):
    history = []

    for step in range(6):
        prompt = build_prompt(history)
        guess = generate_reranked_guess(prompt, history)

        feedback = wordle_feedback(guess, secret)
        history.append((guess, feedback))

        print(f"Step {step+1}: {guess} → {feedback}")

        if guess == secret:
            print("✅ Solved!")
            return True, step + 1

    print("❌ Failed. Secret was:", secret)
    return False, 6

**Set the model in evaluation mode**

In [None]:
model.eval()

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896, padding_idx=151643)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
   

**Sanity test**

In [None]:
# make sure generation is not deterministic
print(generate_guess(build_prompt([])))
print(generate_guess(build_prompt([])))

NOODL
OPPON


**Single test to check if the model is behaving as expected**

In [None]:
history = [
    ("CRANE", "C(x) R(x) A(-) N(x) E(-)")
]

prompt = build_prompt(history)
guess = generate_reranked_guess(prompt, history)

print("Prompt:\n", prompt)
print("Model guess:", guess)

Prompt:
 You are playing Wordle.
Rules:
- Guess a 5-letter word.
- Respond with ONLY the word.

Previous guesses:
CRANE → C(x) R(x) A(-) N(x) E(-)

Your guess:

Model guess: SATEW


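The reranked guess SATEW avoids the letters ruled out by the feedback (C, R, N) and moves A and E to positions different from those in CRANE, so it is consistent with the previous guess. It is not a valid English word, however: the reranker checks only the feedback constraints, not a dictionary.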
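
Since the constraint check only looks at feedback, a dictionary filter is one possible extra guard against non-words. The sketch below assumes a set of allowed words is available; the tiny `VOCAB` set and the function name are placeholders for illustration, not the word list referenced in the dataset.

```python
# Placeholder vocabulary (assumption); a real word list would be much larger
VOCAB = {"STALE", "SLATE", "TEASE", "BRICK"}

def filter_to_vocabulary(candidates, vocabulary):
    """Keep candidates found in the vocabulary, preserving order and dropping duplicates."""
    seen = set()
    kept = []
    for word in candidates:
        word = word.upper()
        if word in vocabulary and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept

samples = ["satew", "stale", "NOODL", "slate", "STALE"]
print(filter_to_vocabulary(samples, VOCAB))
```

Applied before the constraint check, such a filter would leave the reranker choosing among real words only, at the cost of returning nothing when the model samples no valid word.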

**Playing the Wordle game**

In [None]:
play_wordle("BRICK", model, tokenizer)

Step 1: DUBLI → D(x) U(x) B(-) L(x) I(-)
Step 2: FIRST → F(x) I(-) R(-) S(x) T(x)
Step 3: BROOK → B(✓) R(✓) O(x) O(x) K(✓)
Step 4: BRACK → B(✓) R(✓) A(x) C(✓) K(✓)
Step 5: BRACK → B(✓) R(✓) A(x) C(✓) K(✓)
Step 6: BRACK → B(✓) R(✓) A(x) C(✓) K(✓)
❌ Failed. Secret was: BRICK


(False, 6)

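In this run the GRPO-trained model, combined with the reranker, fixed four of the five letters by step 4 (B, R, C, K) but then repeated BRACK for the remaining turns and did not reach BRICK. The repeats indicate that no sampled candidate passed the constraint check, so the fallback path, which does not exclude earlier guesses, was used. Given the small training set (74 examples) and the 0.5B-parameter Qwen2.5 model, there is clear room for improvement, but the overall pipeline runs end to end.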

**Takeaways**

* Reinforcement learning teaches preferences.
* Symbolic constraints guarantee correctness.

This project demonstrates how to combine both cleanly for structured reasoning tasks.