# 1. Inference-Time Scaling: Pushing the Model Before Changing It

**Purpose:**

Before committing to model adaptation, test whether spending more compute at inference time can close the gap. This is the last and most rigorous check before escalating to training. If inference-time scaling solves the problem, we avoid the cost and complexity of fine-tuning entirely. If it does not, we have even stronger evidence that the model's weights need to change.


## 1.1 The Idea: Spend Compute at Inference, Not Training

There is a common assumption in AI engagements that when a model gives a wrong answer, the model needs to be retrained. Sometimes that is true. But retraining is expensive, slow, and organizationally complex. Before going there, it is worth asking a simpler question:

What if we just let the model try harder?

That is the core idea behind inference-time scaling. Instead of changing the model's weights (which is what fine-tuning does), you change how much work the model does at inference time. You let it generate multiple candidate answers, evaluate them, and select the best one. The model itself is unchanged. You are spending compute instead of training data.

This matters for two reasons.

First, it is cheaper and faster to try. There is no training pipeline to build, no data to curate, no model to validate. You are using the same model, the same endpoint, the same API key. The only thing that changes is how many times you call it and how you pick the winner.

Second, the results are diagnostic. If the model produces the right answer on attempt 3 out of 5, that tells you something important: the knowledge is accessible but the model's default sampling path does not reliably surface it. That is a very different problem than "the model fundamentally cannot do this." And the fix might be as simple as a better decoding strategy rather than a full training run.

If the model cannot produce the right answer in N attempts with good context, that is also diagnostic. It means the failure is not a sampling issue. The model genuinely does not know how to handle this class of question. That is the strongest possible evidence for model adaptation.

Either way, you learn something. And learning before spending is the entire philosophy of this workshop.
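To put a rough number on "try harder", here is a small worked calculation. It assumes each attempt is an independent sample that is correct with probability `p`, which is a simplification (real samples from the same prompt are correlated). The function name is illustrative only. It gives the chance that at least one of N candidates is correct; whether a judge then picks that candidate is a separate question.

```python
def prob_at_least_one_correct(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n attempts fail with probability (1 - p) ** n
    return 1.0 - (1.0 - p) ** n


for p in (0.0, 0.2, 0.5):
    row = ", ".join(
        f"N={n}: {prob_at_least_one_correct(p, n):.2f}" for n in (1, 3, 5)
    )
    print(f"p={p:.1f} -> {row}")
```

When `p` is zero for a class of question, no budget helps. That is the "model genuinely does not know" case described above.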

## 1.2 Introducing `its_hub`

`its_hub` is an inference-time scaling library built by the Red Hat AI Innovation Team. It implements several algorithms for improving model output quality without modifying model weights, and it works with any OpenAI-compatible API endpoint. That means it works with our MaaS setup without any infrastructure changes.

The library has three layers:

A **language model wrapper** (`OpenAICompatibleLanguageModel`) that handles generation against any OpenAI-compatible endpoint, including batched and async calls.

A **reward model** that scores candidate responses. For Best-of-N, this is `LLMJudgeRewardModel`, which uses a separate LLM to evaluate response quality. For more advanced algorithms like Particle Filtering, it can be a local Process Reward Model that scores each reasoning step.

A **scaling algorithm** that orchestrates the generation and scoring. `BestOfN` generates N candidates and picks the best. `SelfConsistency` generates N candidates and picks the most common answer. `ParticleFiltering` and `BeamSearch` operate at the step level, pruning weak reasoning paths as they go.

For this lab, we will use **Best-of-N with LLM Judge**. The idea is straightforward:

1. Send the same prompt to the model N times (the "budget")
2. Collect N candidate responses
3. Use a separate LLM judge to evaluate and rank the candidates
4. Return the highest-scoring response

The generation model is not retrained. It is not prompted differently. It simply gets multiple chances, and a judge picks the best attempt.
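The four steps above can be sketched in plain Python. This is not the `its_hub` implementation. `best_of_n`, `toy_generate`, and `toy_judge` are hypothetical stand-ins so the control flow can run offline; in the lab, the generator and the judge are both LLM calls.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, List[str]], int],
    budget: int,
) -> Tuple[str, List[str], int]:
    """Generate `budget` candidates, let the judge pick one, return the winner."""
    candidates = [generate(prompt) for _ in range(budget)]  # steps 1 and 2
    winner_index = judge(prompt, candidates)                # step 3
    return candidates[winner_index], candidates, winner_index  # step 4


# Offline stand-ins (assumptions for illustration, not real model calls)
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return rng.choice(
        ["I am not sure.", "The answer is 15.", "The answer is 15, per the table."]
    )


def toy_judge(prompt: str, candidates: List[str]) -> int:
    # Prefers the longest candidate; a real judge would be an LLM call
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


best, candidates, idx = best_of_n(
    "What is the saving throw?", toy_generate, toy_judge, budget=5
)
print(f"picked #{idx + 1} of {len(candidates)}: {best}")
```

Note that the generator function is called with the same prompt every time. All of the variation comes from sampling, and all of the selection comes from the judge.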

### 1.2.1 Environment Setup

Open the `01/01Inference_Time_Scaling/01_BestOfN.ipynb` notebook.


Before we touch any code, we need two values: an API key and an endpoint URL for the MaaS (Model as a Service) platform. If you attended Day 2, you already have these. If you are starting fresh today, follow the setup steps below. If you already have a working `.env` file from Day 2, skip ahead to the verification cell.

**For participants who did not attend Day 2:**

You need access to the Red Hat Demo Platform MaaS service. Using your Red Hat SSO credentials, log in at:

> https://litellm-prod-frontend.apps.maas.redhatworkshops.io/home

Once logged in:

1. Click Subscriptions in the left sidebar, then New Subscription. Subscribe to the following models: granite-3-2-8b-instruct, granite-4-0-h-tiny, microsoft-phi-4, and qwen3-14b. For each one, click the model name to open the model card, then click Subscribe.
2. Click API Keys in the left sidebar, then Create API Key. Give it a name (anything you like), check Select All to attach all your subscribed models, and click Create API Key.
3. When the key appears, click the copy button and save it somewhere. You will not be able to see the full key again after closing the dialog.
4. Before closing the dialog, note the API URL displayed on the key details screen. It will look like `https://litellm-prod.apps.maas.redhatworkshops.io/v1`. Copy this as well.

Now create a file named `.env` in the parent directory of the notebook's folder (the `Day3` directory in the lab environment). The file should contain exactly two lines:

```
API_KEY=your-api-key-here
ENDPOINT_BASE=endpoint-here
```

Replace `your-api-key-here` with the key you copied and `endpoint-here` with the API URL you noted. No quotes around the values. No spaces around the equals sign.

**For everyone:**

If you are unsure where the `.env` file belongs, run this in a notebook cell to confirm the expected path:

In [12]:
from pathlib import Path
import os

expected = Path(os.getcwd()).parent / ".env"
print(f"config.py expects .env at: {expected}")
print(f"File exists: {expected.exists()}")

config.py expects .env at: /opt/app-root/src/Day3/.env
File exists: True


If it shows `File exists: False`, create the file at the printed path and re-run this check.

The lab uses a small config helper that reads the `.env` file. Run the following cell to verify that everything loads correctly.

In [13]:
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")

Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...


The output will show your endpoint URL in full and the first 8 characters of your API key followed by an ellipsis. 

The key is deliberately truncated so it is not exposed in notebook output, screenshots, or screen shares. 

If either value is blank or shows `None`, check that the `.env` file exists in the Day3 directory and that the variable names match exactly: `API_KEY` and `ENDPOINT_BASE`.
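The contents of `config.py` are not shown in this guide, so the following is only a sketch of how a deliberately naive `.env` reader behaves. `parse_env_text` is a hypothetical name, not the lab's helper, and the key and URL are placeholders. It is here to show why the formatting rules (no quotes, no spaces around the equals sign) matter for simple parsers.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Naive KEY=VALUE parser: no quote stripping, no trimming around '='."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        values[name] = value
    return values


good = parse_env_text(
    "API_KEY=sk-example\nENDPOINT_BASE=https://example.invalid/v1\n"
)
bad = parse_env_text('API_KEY = "sk-example"\n')

print(good["API_KEY"])
print(repr(bad))
print("API_KEY" in bad)
```

With the badly formatted line, the parsed name carries a trailing space and the value keeps its quotes, so a lookup for `API_KEY` finds nothing. Some `.env` loaders are more forgiving, but following the two rules works with all of them.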

### 1.2.2 Install and Verify Dependencies

The `its_hub` library is pre-installed in the lab environment along with most of its dependencies. Run the following cell to confirm the installation and install `nest_asyncio`, which is needed for running async code inside Jupyter notebooks.


In [5]:
! pip install its_hub nest_asyncio

You will likely see "Requirement already satisfied" for most packages. That is expected. The output confirms what is present and at what version. If any package fails to install or reports a version conflict, notify the instructor.


A quick note on why `nest_asyncio` is necessary. Under the hood, `its_hub` uses Python's `asyncio` to send generation requests in parallel. That is how Best-of-N can fire off 5 candidate requests concurrently rather than sequentially. The problem is that Jupyter itself already runs an event loop, and Python's default behavior is to reject a second event loop inside an existing one. `nest_asyncio` patches that restriction so the two can coexist. 

Without it, every call to `scaling_alg.infer()` would raise a `RuntimeError: This event loop is already running` exception. This is not specific to `its_hub`. Any async library used inside a Jupyter notebook has the same issue.
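The restriction can be reproduced with the standard library alone. The sketch below is meant to be run as a standalone Python script (not inside a notebook, and without `nest_asyncio` applied). `fetch_candidate` and `notebook_like_context` are made-up names; the outer coroutine plays the role of Jupyter's already-running loop.

```python
import asyncio


async def fetch_candidate(i: int) -> str:
    await asyncio.sleep(0)
    return f"candidate {i}"


async def notebook_like_context() -> str:
    # We are now inside a running event loop, as a notebook cell would be.
    loop = asyncio.get_running_loop()
    coro = fetch_candidate(0)
    try:
        # A library that tries to drive the loop itself hits this restriction.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_like_context())
print(message)
```

Applying `nest_asyncio` patches the loop so that this kind of re-entrant call is allowed.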

In [14]:
import nest_asyncio
nest_asyncio.apply()

No output from that cell means it worked. If it throws an `ImportError`, the `pip install` cell above did not complete successfully.

### 1.2.3 Connecting the Generator and Judge to MaaS

No new infrastructure. No new credentials. We are using the same MaaS endpoint and the same API key from the previous cell.

We need two model connections: one for generating answers and one for judging them. The generator is `granite-3-2-8b-instruct`, the same model we have been using throughout. The judge will be `qwen3-14b`, the largest model available on our MaaS endpoint. Using a separate, larger model for judgment matters because a model is not always the best evaluator of its own output. A bigger model with broader training can often spot weaknesses that the smaller model cannot recognize in its own responses.



In [16]:
from its_hub.lms import OpenAICompatibleLanguageModel
from its_hub.algorithms import BestOfN
from its_hub.integration.reward_hub import LLMJudgeRewardModel

# The generation model: same as Day 2
lm = OpenAICompatibleLanguageModel(
    endpoint=endpoint_base,
    api_key=key,
    model_name="granite-3-2-8b-instruct",
    temperature=0.7,  # Some variation across candidates
)

# The judge: a separate, larger model for evaluating candidates
# Note: LLMJudgeRewardModel uses litellm internally, which requires
# the "openai/" prefix to route to an OpenAI-compatible endpoint.
judge = LLMJudgeRewardModel(
    model="openai/qwen3-14b",
    criterion="overall_quality",
    judge_type="groupwise",
    api_key=key,
    base_url=endpoint_base,
)

# Wire them together
scaling_alg = BestOfN(judge)

print("Generator: granite-3-2-8b-instruct")
print("Judge:     qwen3-14b (groupwise, overall_quality)")
print("Algorithm: Best-of-N")
print("Ready.")

Generator: granite-3-2-8b-instruct
Judge:     qwen3-14b (groupwise, overall_quality)
Algorithm: Best-of-N
Ready.


Let's walk through what each piece does.

`OpenAICompatibleLanguageModel` wraps any OpenAI-compatible API endpoint. It handles batched requests, retries, and concurrency internally. The `temperature=0.7` is important here. In Day 2 we used `temperature=0` for deterministic single-shot answers. Now we want variation. If every candidate is identical, there is nothing for the judge to choose between. A temperature of 0.7 gives the model enough freedom to explore different phrasings and reasoning paths while staying coherent.
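To see why temperature controls variation, here is a minimal sketch of temperature-scaled softmax, the standard way sampling temperature reshapes a next-token distribution. The logits are made-up numbers, and how a particular serving stack treats `temperature=0` (typically as greedy selection) is an assumption rather than something this lab verifies.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (T=0 is usually treated as argmax)")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Low temperatures concentrate almost all probability on the top token, so repeated samples look alike. Higher temperatures flatten the distribution, which is what gives the judge different candidates to choose between.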

`LLMJudgeRewardModel` uses a second LLM to evaluate and rank candidate responses. The `criterion="overall_quality"` tells the judge to evaluate on general quality rather than a narrow metric. The `judge_type="groupwise"` means the judge sees all candidates at once and picks a winner, rather than scoring each one independently. This tends to produce better selections because the judge can compare directly.

Two details on the judge configuration deserve attention.

First, the `base_url` parameter. Without it, `LLMJudgeRewardModel` defaults to calling OpenAI's API. Setting `base_url=endpoint_base` routes the judge's calls through our MaaS endpoint instead. This is a common gotcha when using LLM-as-judge with custom serving infrastructure.

Second, the `openai/` prefix on the model name. The `LLMJudgeRewardModel` uses `litellm` internally for its API calls, and litellm uses that prefix to determine which provider protocol to speak. Without it, litellm does not know that MaaS is an OpenAI-compatible endpoint and the call fails with a "LLM Provider NOT provided" error. The generator does not need this prefix because `OpenAICompatibleLanguageModel` makes direct HTTP requests to the endpoint, bypassing litellm entirely. This is a subtle but important difference between the two components. If you hit a provider routing error on the judge but the generator works fine, check the prefix first.

`BestOfN` is the algorithm that ties them together. It takes the judge at construction time. When we call `scaling_alg.infer()`, it will generate N candidates using the language model, pass all of them to the judge, and return the one the judge ranked highest.

If the cell runs without error, you should see the three confirmation lines followed by `Ready.`. If you get a connection error, verify that your endpoint and API key are correct and that both `granite-3-2-8b-instruct` and `qwen3-14b` are in your MaaS subscriptions.

## 1.3 Best-of-N with LLM Judge

Now we apply this to the questions from Day 2. We will run all 10 questions through the Best-of-N pipeline so we can observe two things: whether the passing questions still pass (no regression), and whether the failing questions improve.


### 1.3.1 Running the Target Questions Through Best-of-N

First, we load the Day 2 results. This version of the evaluation data includes the retrieved context for each question so we can reconstruct the exact RAG prompts from Day 2.

In [18]:
import json

with open("../prebuilt/eval_with_context.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)

results_day2 = eval_data["results"]

print(f"Loaded {len(results_day2)} questions from Day 2 evaluation")
print(f"Passes:   {sum(1 for r in results_day2 if r['classification'] == 'pass')}")
print(f"Failures: {sum(1 for r in results_day2 if r['classification'] != 'pass')}")

Loaded 10 questions from Day 2 evaluation
Passes:   6
Failures: 4


Let's walk through the structure of this loop before running it.

For each of the 10 questions, we build a prompt that includes the system instruction, the retrieved context from Day 2, and the question itself. This is the same prompt structure the model saw during the Day 2 evaluation. The only difference is what happens after the prompt is sent. Instead of a single generation call, `scaling_alg.infer()` sends 5 identical requests (the `budget`), collects the 5 candidate responses, passes them to the `qwen3-14b` judge, and returns the winner.

The `return_response_only=False` flag tells `BestOfN` to return the full result object rather than just the winning answer. That gives us access to all 5 scores, which candidate was selected, and how many candidates were generated. We capture all of this for comparison in the next cell.

Each question generates 5 candidate requests plus 1 judge request, so 10 questions means roughly 60 API calls total. 

Expect this to take 4 to 5 minutes depending on endpoint load, with individual questions ranging from 15 to 60 seconds. The variation is normal: longer answers and more complex comparisons take longer for the judge to evaluate.
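The call count above generalizes to other budgets. A minimal sketch, assuming one groupwise judge call per question as described above and ignoring any retries (`estimate_api_calls` is an illustrative helper, not part of `its_hub`):

```python
def estimate_api_calls(
    n_questions: int, budget: int, judge_calls_per_question: int = 1
) -> int:
    """Generation calls plus judge calls across the whole evaluation set."""
    return n_questions * (budget + judge_calls_per_question)


print(estimate_api_calls(n_questions=10, budget=5))   # 60
print(estimate_api_calls(n_questions=10, budget=10))
```

Doubling the budget nearly doubles the number of calls, so it is worth checking what a small budget buys before raising it.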

The cell below controls whether the loop runs live or loads previously saved results. For the first time through, leave `RUN_LIVE = True` and let it run. If you have already run it once, or if you need to move through the material faster, set it to `False` and the notebook will load the results from `../prebuilt/bon_results.json` instead. Either way, the analysis cells that follow work identically.


> **Facilitator note**: If time is tight or the endpoint is slow, set `RUN_LIVE = False` and move on. This is expected behavior during the workshop, not a failure. The first clean run of the day should save results to the prebuilt directory, and every subsequent participant or session can load from there.



In [22]:
# Set to True to run Best-of-N live. Set to False to load saved results.
RUN_LIVE = True

In [21]:
import time
from its_hub.utils import extract_content_from_lm_response

if RUN_LIVE:
    SYSTEM_PROMPT = (
        "Answer the question using only the provided context. "
        "Be specific and cite rules where possible."
    )

    BUDGET = 5

    bon_results = []

    print(f"\n{'='*70}")
    print(f"RUNNING BEST-OF-N (budget={BUDGET}) WITH LLM JUDGE")
    print(f"{'='*70}")

    total_start = time.time()

    for i, r in enumerate(results_day2):
        print(f"\n  [{r['id']}] {r['question'][:60]}...")

        context = r.get("retrieved_context", "")
        prompt = f"""{SYSTEM_PROMPT}

Context:
{context}

Question: {r['question']}
Answer:"""

        q_start = time.time()
        result = scaling_alg.infer(
            lm, prompt, budget=BUDGET, return_response_only=False
        )
        q_elapsed = time.time() - q_start

        best_answer = extract_content_from_lm_response(result.the_one)

        bon_results.append({
            "id": r["id"],
            "question": r["question"],
            "expected": r["expected"],
            "category": r["category"],
            "day2_answer": r["answer"],
            "day2_classification": r["classification"],
            "bon_answer": best_answer,
            "bon_scores": result.scores,
            "bon_selected_index": result.selected_index,
            "bon_n_candidates": len(result.responses),
        })

        day2_status = "PASS" if r["classification"] == "pass" else "FAIL"
        print(f"    Day 2: {day2_status}")
        print(f"    Best candidate: #{result.selected_index + 1} of {len(result.responses)}")
        print(f"    Answer: {best_answer[:100]}...")
        print(f"    Time: {q_elapsed:.1f}s  ({i+1}/{len(results_day2)})")

    total_elapsed = time.time() - total_start

    print(f"\n{'='*70}")
    print(f"Best-of-N complete. Total time: {total_elapsed:.1f}s ({total_elapsed/60:.1f} min)")
    print(f"{'='*70}")

else:
    with open("../prebuilt/bon_results.json", "r", encoding="utf-8") as f:
        bon_results = json.load(f)
    BUDGET = bon_results[0]["bon_n_candidates"]
    print(f"Loaded {len(bon_results)} pre-built Best-of-N results (budget={BUDGET})")


RUNNING BEST-OF-N (budget=5) WITH LLM JUDGE

  [q01] What happens if a Thief fails an Open Locks attempt?...
    Day 2: PASS
    Best candidate: #3 of 5
    Answer: If a Thief fails an Open Locks attempt, they must wait until they have gained another level of exper...
    Time: 15.5s  (1/10)

  [q02] Why can't Elves roll higher than a d6 for hit points?...
    Day 2: FAIL
    Best candidate: #1 of 5
    Answer: According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. ...
    Time: 24.6s  (2/10)

  [q03] Can a character wear leather armor and cast spells?...
    Day 2: PASS
    Best candidate: #2 of 5
    Answer: Yes, an Elf character can wear leather armor and cast spells. This is because Elves are a combinatio...
    Time: 30.6s  (3/10)

  [q04] What is the saving throw for a 3rd level Fighter against Dra...
    Day 2: FAIL
    Best candidate: #1 of 5
    Answer: 3rd level Fighters have a saving throw of 15 against Dragon Breath. This is lis

In [24]:
# Save results only if we ran live
if RUN_LIVE:
    with open("../prebuilt/bon_results.json", "w", encoding="utf-8") as f:
        json.dump(bon_results, f, indent=2)

    with open("bon_results.json", "w", encoding="utf-8") as f:
        json.dump(bon_results, f, indent=2)

    print(f"Saved {len(bon_results)} results to ../prebuilt/bon_results.json")
    print(f"Saved {len(bon_results)} results to ./bon_results.json")
else:
    print("Using pre-built results. Nothing to save.")

Saved 10 results to ../prebuilt/bon_results.json
Saved 10 results to ./bon_results.json


### 1.3.2 Comparing Day 2 Single-Shot with Best-of-N


Now we put the results side by side. Questions that failed on Day 2 are marked with `>>` so they stand out.

In [25]:
print(f"\n{'='*70}")
print("COMPARISON: DAY 2 SINGLE-SHOT vs. BEST-OF-N")
print(f"{'='*70}")

for br in bon_results:
    day2_status = "PASS" if br["day2_classification"] == "pass" else "FAIL"
    marker = "  " if day2_status == "PASS" else ">>"

    print(f"\n{marker} [{br['id']}] {br['question']}")
    print(f"   Category:     {br['category']}")
    print(f"   Day 2 status: {br['day2_classification']}")
    print()
    print(f"   Day 2 answer:")
    print(f"     {br['day2_answer'][:150]}")
    print()
    print(f"   Best-of-N answer (#{br['bon_selected_index']+1} of {br['bon_n_candidates']}):")
    print(f"     {br['bon_answer'][:150]}")
    print(f"   Scores: {br['bon_scores']}")
    print(f"   {'---'}")


COMPARISON: DAY 2 SINGLE-SHOT vs. BEST-OF-N

   [q01] What happens if a Thief fails an Open Locks attempt?
   Category:     explicit_rule
   Day 2 status: pass

   Day 2 answer:
     If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying again.

   Best-of-N answer (#3 of 5):
     If a Thief fails an Open Locks attempt, they must wait until they have gained another level of experience before trying again. This rule is stated in 
   Scores: [0.0, 0.0, 1.0, 0.0, 0.0]
   ---

>> [q02] Why can't Elves roll higher than a d6 for hit points?
   Category:     terminology
   Day 2 status: implicit_reasoning_failure

   Day 2 answer:
     According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. The reason for this restriction is not explicitly 

   Best-of-N answer (#1 of 5):
     According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. This is a

Read through the output carefully. For each question, ask yourself: did Best-of-N produce a better answer than single-shot?

If yes, the model had the capability. It just did not surface it on the first try. This is a sampling win, not a knowledge win. The model's weights already contain something useful, and multiple attempts with a good judge found it.

If no, the model tried 5 times, a separate larger judge picked the best attempt, and it is still wrong. That means the failure is not about sampling luck. The model genuinely struggles with this class of question.

A note about the scores. Because we are using a groupwise judge, the scores are binary: `1.0` for the selected response and `0.0` for the others. The judge is making a ranking decision, not assigning granular scores. If you see all responses scoring identically, it means even the judge could not differentiate them, which itself is a signal.
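A small helper makes that check mechanical. This is a sketch that works on plain lists like the `Scores:` lines printed above; `selected_index_from_scores` is a hypothetical name, not an `its_hub` function.

```python
from typing import List, Optional


def selected_index_from_scores(scores: List[float]) -> Optional[int]:
    """Return the index of the unique top score, or None if the top is tied."""
    if not scores:
        return None
    top = max(scores)
    winners = [i for i, s in enumerate(scores) if s == top]
    return winners[0] if len(winners) == 1 else None


# One-hot scores, as printed for q01 above: the third candidate was selected
print(selected_index_from_scores([0.0, 0.0, 1.0, 0.0, 0.0]))

# Identical scores: the judge did not differentiate the candidates
print(selected_index_from_scores([0.0, 0.0, 0.0, 0.0, 0.0]))
```

The first call returns the zero-based index `2`, and the second returns `None`, which flags the question for a manual look.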

### 1.3.3 Where It Helps and Where It Doesn't

In [26]:
print(f"\n{'='*70}")
print("BEST-OF-N IMPACT ANALYSIS")
print(f"{'='*70}")

day2_failures = [br for br in bon_results if br["day2_classification"] != "pass"]
day2_passes = [br for br in bon_results if br["day2_classification"] == "pass"]

print(f"\n  Previously passing: {len(day2_passes)} questions")
print(f"  Previously failing: {len(day2_failures)} questions")

print(f"\n  --- Previously Failing Questions ---")
for br in day2_failures:
    print(f"\n  [{br['id']}] {br['question']}")
    print(f"    Category:  {br['category']}")
    print(f"    Expected:  {br['expected'][:120]}")
    print(f"    Day 2:     {br['day2_answer'][:120]}")
    print(f"    Best-of-N: {br['bon_answer'][:120]}")
    print(f"    Scores:    {br['bon_scores']}")


BEST-OF-N IMPACT ANALYSIS

  Previously passing: 6 questions
  Previously failing: 4 questions

  --- Previously Failing Questions ---

  [q02] Why can't Elves roll higher than a d6 for hit points?
    Category:  terminology
    Expected:  Elves use a d6 for hit points because that is the hit die assigned to the Elf combination class in Basic Fantasy RPG.
    Day 2:     According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. The reason for this 
    Best-of-N: According to the provided context, Elves never roll larger than six-sided dice (d6) for hit points. This is a restrictio
    Scores:    [1.0, 0.0, 0.0, 0.0, 0.0]

  [q04] What is the saving throw for a 3rd level Fighter against Dragon Breath?
    Category:  table_lookup
    Expected:  Based on the Fighter saving throw table, a 3rd level Fighter has a Dragon Breath saving throw of 15.
    Day 2:     The context does not provide specific information on the saving throw for a 3rd level Fi

> **Facilitator note:** This is a discussion moment, not just a code output. Walk the room through each of the 4 previously failing questions and ask:
> "Is this answer better? Is it correct? Is it good enough?"

Point participants to the specific patterns.

**q04 (saving throw table lookup):** Day 2 said "the context does not provide specific information." Best-of-N returned the correct answer: 15. One of the 5 candidates actually read the table. The judge picked it. This is the textbook sampling win. The model can do it. It just does not do it reliably on the first try.

**q07 (retainer vs hireling):** Day 2 hedged with "the context does not provide specific details about hirelings." Best-of-N correctly states that hirelings are normal people hired for specific services who do not go on adventures. One of the 5 candidates used both sections of the retrieved context. The judge picked it.

**q06 (Strength bonus):** The answer lands on "+2" which is the correct number, but look at the reasoning. It says "as per the standard D&D rules" or refers to an ability score table "not provided in the context." The model guessed correctly from its general training data, not from the retrieved context. In a customer setting where answers must be grounded in their documents, that is still a failure. The right number for the wrong reason is not trustworthy.

**q02 (Elf hit dice):** All 5 candidates produced the same hedge: "The reason for this restriction is not explicitly stated in the context." The model consistently cannot infer the "why" from the text. Five attempts, same wall. This is the clearest signal that the failure is systematic, not stochastic.

The scorecard:

* 2 clear wins (q04, q07)
* 1 false positive (q06, right answer from the wrong source)
* 1 persistent failure (q02)

Inference-time scaling helped but did not solve everything.
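If you want to carry these verdicts forward, it helps to record them as data rather than prose. The sketch below encodes the manual verdicts from the discussion above; the label strings are illustrative choices, and the verdicts themselves come from human review, not from the judge's scores.

```python
from collections import Counter

# Manual verdicts for the four previously failing questions
verdicts = {
    "q04": "clear_win",
    "q07": "clear_win",
    "q06": "false_positive",      # right number, not grounded in the context
    "q02": "persistent_failure",  # same hedge on every attempt
}

scorecard = Counter(verdicts.values())
grounded_fixes = scorecard["clear_win"]

print(dict(scorecard))
print(f"{grounded_fixes} of {len(verdicts)} previously failing questions "
      f"were fixed with grounded answers")
```

Keeping the false positive as its own label matters: counting q06 as a win would overstate what Best-of-N achieved.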

## 1.4 Other Inference-Time Scaling Strategies (Conceptual Only)

Best-of-N is the simplest algorithm in `its_hub`. There are others, and they are worth understanding conceptually even though we will not implement them all today. The point is not to teach every algorithm. It is to show that inference-time scaling is a family of techniques with different tradeoffs, and Best-of-N is just the entry point.

### 1.4.1 Self-Consistency

Self-consistency generates multiple reasoning paths and selects the answer that appears most frequently. Instead of using a judge to evaluate quality, it uses agreement as a signal. If 4 out of 5 attempts arrive at the same answer, that answer is probably more reliable than one that only appears once.

The implementation in `its_hub` takes a **projection function** that extracts the "answer" from each response. For math problems, that might mean pulling a number out of a paragraph of reasoning. It then votes on the projected answers. The response whose projected answer matches the majority is selected.

In [27]:
# Conceptual only. Do not run this cell.
# Shown to illustrate the Self-Consistency API pattern.

from its_hub.algorithms import SelfConsistency

# For math: extract the boxed answer and vote
# sc = SelfConsistency(lambda s: extract_boxed(s))
# result = sc.infer(lm, prompt, budget=5)

Self-consistency works well when the model can reason through the problem but occasionally takes a wrong turn. The majority vote filters out the unlucky paths. It works poorly when the model consistently reasons in the same incorrect direction, because all 5 attempts will agree on the wrong answer.

For our use case, self-consistency is less useful because our questions have free-text answers, not extractable numeric results. There is no clean projection function for "explain the difference between a retainer and a hireling." The algorithm needs a way to determine when two answers are "the same," and for open-ended text that comparison is not straightforward.
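For intuition, the voting step can be sketched without `its_hub`. The responses below are invented for illustration, and `last_integer` is a hypothetical projection function that only makes sense for answers ending in a number, which is exactly the limitation described above.

```python
import re
from collections import Counter
from typing import Callable, Hashable, List, Optional


def majority_vote(responses: List[str], project: Callable[[str], Hashable]) -> str:
    """Return the first response whose projected answer is the most common one."""
    projected = [project(r) for r in responses]
    winner, _count = Counter(projected).most_common(1)[0]
    return responses[projected.index(winner)]


def last_integer(text: str) -> Optional[str]:
    """Projection: treat the last integer in the text as 'the answer'."""
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None


# Made-up candidate responses for a numeric question
responses = [
    "The table lists 15 for that level.",
    "Fighters at level 3 save on 15.",
    "I believe the value is 17.",
    "Level 3 means the save is 15.",
    "It should be 16.",
]

print(majority_vote(responses, last_integer))
```

Three of the five projections agree on `15`, so the first response carrying that answer is returned. Try writing a projection for a free-text comparison question and the difficulty becomes obvious.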

### 1.4.2 Particle Filtering and Beam Search

These approaches operate at the token or step level rather than the response level. Instead of generating complete responses and selecting among them, they maintain multiple partial generations simultaneously and prune weak reasoning paths as they go.

Particle Filtering treats generation as a search problem: generate one reasoning step at a time, score each partial solution with a Process Reward Model, resample the population toward higher-scoring paths, and continue. Think of it as Best-of-N applied at every sentence rather than at the end. `its_hub` includes both standard and entropic variants.
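The extend, score, resample loop can be sketched with toy components. This is not the `its_hub` implementation: `toy_extend` stands in for the language model generating one step, `toy_score` stands in for the Process Reward Model, and using raw scores as resampling weights is a simplifying assumption.

```python
import random
from typing import Callable, List

Particle = List[str]  # a partial solution: the reasoning steps so far


def particle_filter(
    extend: Callable[[Particle, random.Random], str],
    score: Callable[[Particle], float],
    n_particles: int,
    n_steps: int,
    seed: int = 0,
) -> Particle:
    rng = random.Random(seed)
    particles: List[Particle] = [[] for _ in range(n_particles)]
    for _ in range(n_steps):
        # 1. Extend every partial solution by one step
        particles = [p + [extend(p, rng)] for p in particles]
        # 2. Score every partial solution (the PRM's job in a real system)
        weights = [score(p) for p in particles]
        # 3. Resample: strong particles tend to be duplicated, weak ones dropped
        resampled = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in resampled]
    return max(particles, key=score)


def toy_extend(partial: Particle, rng: random.Random) -> str:
    return rng.choice(["good step", "bad step"])


def toy_score(partial: Particle) -> float:
    # Fraction of good steps, floored so the weights never sum to zero
    good = sum(1 for step in partial if step == "good step")
    return max(good / len(partial), 0.05)


best = particle_filter(toy_extend, toy_score, n_particles=8, n_steps=4)
print(best, toy_score(best))
```

The key difference from Best-of-N is step 3: compute is reallocated toward promising partial answers while generation is still in progress.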

Beam Search maintains the top-K partial sequences at each step and only expands from those. It is more aggressive about pruning than Particle Filtering, keeping only the highest-scoring paths alive at each step.
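A toy version of step-level beam search looks like this. Again, this is a sketch rather than the `its_hub` code: `toy_expand` stands in for the model proposing next steps, and `toy_score` with its `STEP_VALUE` table stands in for a Process Reward Model.

```python
from typing import Callable, List, Tuple

StepPath = Tuple[str, ...]  # a partial solution as a tuple of steps


def beam_search(
    expand: Callable[[StepPath], List[str]],
    score: Callable[[StepPath], float],
    beam_width: int,
    n_steps: int,
) -> StepPath:
    beams: List[StepPath] = [()]
    for _ in range(n_steps):
        # Expand every surviving path by every proposed next step
        candidates = [path + (step,) for path in beams for step in expand(path)]
        # Keep only the top `beam_width` paths; everything else is pruned
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_expand(path: StepPath) -> List[str]:
    return ["read table", "guess", "hedge"]


STEP_VALUE = {"read table": 1.0, "guess": 0.3, "hedge": 0.0}  # made-up values


def toy_score(path: StepPath) -> float:
    return sum(STEP_VALUE[step] for step in path)


best = beam_search(toy_expand, toy_score, beam_width=2, n_steps=3)
print(best)
```

Unlike the resampling in Particle Filtering, the cut here is deterministic: a path outside the top K is gone for good, even if it might have recovered later.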

Both require a **Process Reward Model** (PRM), which scores intermediate reasoning steps, not just final answers. That means they need `pip install its_hub[prm]` and a local GPU to run the PRM. They are most relevant when you control the serving infrastructure and can run reward models locally.

**Why this matters for the field:** When a customer asks "what else can we try before fine-tuning," you now have a layered answer. Best-of-N works with any API endpoint and no extra infrastructure. Self-consistency works when answers are extractable and comparable. Particle Filtering and Beam Search offer the most control but require GPU resources for the reward model. Each is a step up in infrastructure complexity, and each is still cheaper than model training.

> **Facilitator note:** Keep this conceptual. Do not attempt to run Particle Filtering or Beam Search in this lab. The point is awareness, not implementation. If participants ask about these in depth, point them to the `its_hub` documentation and the quick-start examples in the repository.
