# 1. Inference-Time Scaling: Pushing the Model Before Changing It

**Purpose:**

Before committing to model adaptation, test whether spending more compute at inference time can close the gap. This is the last and most rigorous check before escalating to training. If inference-time scaling solves the problem, we avoid the cost and complexity of fine-tuning entirely. If it does not, we have even stronger evidence that the model's weights need to change.


## 1.1 The Idea: Spend Compute at Inference, Not Training

There is a common assumption in AI engagements that when a model gives a wrong answer, the model needs to be retrained. Sometimes that is true. But retraining is expensive, slow, and organizationally complex. Before going there, it is worth asking a simpler question:

What if we just let the model try harder?

That is the core idea behind inference-time scaling. Instead of changing the model's weights (which is what fine-tuning does), you change how much work the model does at inference time. You let it generate multiple candidate answers, evaluate them, and select the best one. The model itself is unchanged. You are spending compute instead of training data.

This matters for two reasons.

First, it is cheaper and faster to try. There is no training pipeline to build, no data to curate, no model to validate. You are using the same model, the same endpoint, the same API key. The only thing that changes is how many times you call it and how you pick the winner.

Second, the results are diagnostic. If the model produces the right answer on attempt 3 out of 5, that tells you something important: the knowledge is accessible but the model's default sampling path does not reliably surface it. That is a very different problem than "the model fundamentally cannot do this." And the fix might be as simple as a better decoding strategy rather than a full training run.

If the model cannot produce the right answer in N attempts with good context, that is also diagnostic. It means the failure is not a sampling issue. The model genuinely does not know how to handle this class of question. That is the strongest possible evidence for model adaptation.
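The two diagnostic outcomes above can be sketched as a small decision rule. This is an illustration only: `diagnose_attempts` and its labels are hypothetical names, not part of any library, and the sketch assumes you already have a pass/fail verdict for each of the N attempts.

```python
def diagnose_attempts(attempt_passed):
    """Classify a question from per-attempt pass/fail verdicts.

    attempt_passed: list of booleans, one per sampled attempt.
    """
    if not attempt_passed:
        raise ValueError("need at least one attempt")
    hits = sum(attempt_passed)  # True counts as 1, False as 0
    if hits == len(attempt_passed):
        return "reliable"  # every sample was right
    if hits > 0:
        return "sampling_issue"  # knowledge is accessible but not surfaced reliably
    return "candidate_for_adaptation"  # no sample was right, even with good context


# Right on attempt 3 of 5, then wrong on all 5 (made-up verdicts).
print(diagnose_attempts([False, False, True, False, False]))
print(diagnose_attempts([False] * 5))
```

The labels are deliberately coarse. In practice the pass/fail verdicts would come from whatever evaluation you already trust for the question set.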

Either way, you learn something. And learning before spending is the entire philosophy of this workshop.

## 1.2 Introducing `its_hub`

`its_hub` is an inference-time scaling library built by the Red Hat AI Innovation Team. It implements several algorithms for improving model output quality without modifying model weights, and it works with any OpenAI-compatible API endpoint. That means it works with our MaaS setup without any infrastructure changes.

The library has three layers:

A **language model wrapper** (`OpenAICompatibleLanguageModel`) that handles generation against any OpenAI-compatible endpoint, including batched and async calls.

A **reward model** that scores candidate responses. For Best-of-N, this is `LLMJudgeRewardModel`, which uses a separate LLM to evaluate response quality. For more advanced algorithms like Particle Filtering, it can be a local Process Reward Model that scores each reasoning step.

A **scaling algorithm** that orchestrates the generation and scoring. `BestOfN` generates N candidates and picks the best. `SelfConsistency` generates N candidates and picks the most common answer. `ParticleFiltering` and `BeamSearch` operate at the step level, pruning weak reasoning paths as they go.
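To make "picks the most common answer" concrete, here is a minimal majority-vote sketch. It is not the `its_hub` implementation of `SelfConsistency`; `majority_answer` is a hypothetical helper, and it assumes each candidate has already been reduced to a short final answer string.

```python
from collections import Counter


def majority_answer(final_answers):
    """Return the most common final answer among the candidates."""
    # Normalize lightly so "42" and "42 " count as the same answer.
    counts = Counter(answer.strip().lower() for answer in final_answers)
    answer, _count = counts.most_common(1)[0]
    return answer


sampled_answers = ["42", "41", "42", "42 ", "40"]  # made-up candidates
print(majority_answer(sampled_answers))
```

Note that majority voting needs no judge at all, which is why it suits tasks with a short, checkable final answer better than open-ended responses.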

For this lab, we will use **Best-of-N with LLM Judge**. The idea is straightforward:

1. Send the same prompt to the model N times (the "budget")
2. Collect N candidate responses
3. Use a separate LLM judge to evaluate and rank the candidates
4. Return the highest-scoring response
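
The four steps above can be sketched in plain Python. This is a generic illustration with toy stand-ins, not the `its_hub` code: `best_of_n`, `toy_generate`, and `toy_judge` are hypothetical names, and the toy judge scores each candidate on its own, whereas the lab configures a judge that compares all candidates together.

```python
import random


def best_of_n(generate, judge_fn, prompt, budget):
    """Sample `budget` candidates and return the top-scored one plus all scores."""
    candidates = [generate(prompt) for _ in range(budget)]  # steps 1 and 2
    scores = [judge_fn(prompt, c) for c in candidates]  # step 3
    best_index = max(range(budget), key=lambda i: scores[i])  # step 4
    return candidates[best_index], scores


# Toy stand-ins so the sketch runs without a model endpoint.
rng = random.Random(0)


def toy_generate(prompt):
    return prompt + " -> " + "x" * rng.randint(1, 10)


def toy_judge(prompt, response):
    return len(response)  # pretend that longer means better


winner, scores = best_of_n(toy_generate, toy_judge, "q", budget=5)
print(winner, scores)
```

Swapping the toy functions for real model calls changes nothing about the control flow, which is the point: the generator is untouched and only the selection step adds quality.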

The generation model is not retrained. It is not prompted differently. It simply gets multiple chances, and a judge picks the best attempt.

### 1.2.1 Environment Setup

Open the `01/01Inference_Time_Scaling/01_BestOfN.ipynb` notebook.


Before we touch any code, we need two values: an API key and an endpoint URL for the MaaS (Model as a Service) platform. If you attended Day 2, you already have these. If you are starting fresh today, follow the setup steps below. If you already have a working `.env` file from Day 2, skip ahead to the verification cell.

**For participants who did not attend Day 2:**

You need access to the Red Hat Demo Platform MaaS service. Using your Red Hat SSO credentials, log in at:

> https://litellm-prod-frontend.apps.maas.redhatworkshops.io/home

Once logged in:

1. Click Subscriptions in the left sidebar, then New Subscription. Subscribe to the following models: `granite-3-2-8b-instruct`, `granite-4-0-h-tiny`, `microsoft-phi-4`, and `qwen3-14b`. For each one, click the model name to open the model card, then click Subscribe.
2. Click API Keys in the left sidebar, then Create API Key. Give it a name (anything you like), check Select All to attach all your subscribed models, and click Create API Key.
3. When the key appears, click the copy button and save it somewhere. You will not be able to see the full key again after closing the dialog.
4. Before closing the dialog, note the API URL displayed on the key details screen. It will look like `https://litellm-prod.apps.maas.redhatworkshops.io/v1`. Copy this as well.

Now create a file named `.env` in the parent directory of the notebook's folder (the `Day3` directory in the lab environment). The file should contain exactly two lines:

```
API_KEY=your-api-key-here
ENDPOINT_BASE=endpoint-here
```

**For everyone:**

Replace `your-api-key-here` with the key you copied and `endpoint-here` with the API URL you noted. Do not put quotes around the values or spaces around the equals sign.

If you are unsure where to create the file, run this in a notebook cell to confirm the expected path:

```python
from pathlib import Path
import os

expected = Path(os.getcwd()).parent / ".env"
print(f"config.py expects .env at: {expected}")
print(f"File exists: {expected.exists()}")
```

```
config.py expects .env at: /opt/app-root/src/Day3/.env
File exists: True
```


If it shows `File exists: False`, create the file at the printed path and re-run the cell above.

The lab uses a small config helper that reads the `.env` file. Run the following cell to verify that everything loads correctly.

```python
import sys
sys.path.insert(0, "..")
from config import API_KEY as key, ENDPOINT_BASE as endpoint_base

print(f"Endpoint: {endpoint_base}")
print(f"API Key:  {key[:8]}...")
```

```
Endpoint: https://litellm-prod.apps.maas.redhatworkshops.io/v1
API Key:  sk-UFHcL...
```


The output will show your endpoint URL in full and the first 8 characters of your API key followed by an ellipsis. 

The key is deliberately truncated so it is not exposed in notebook output, screenshots, or screen shares. 

If either value is blank or shows `None`, check that the `.env` file exists in the Day3 directory and that the variable names match exactly: `API_KEY` and `ENDPOINT_BASE`.
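
If you want to check the file contents yourself, the sketch below parses a `.env` file and reports which required names are missing. It is not the lab's `config.py`, whose contents are not shown here; `read_env_file` and `missing_settings` are hypothetical helpers, and the demo writes a throwaway file with placeholder values.

```python
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse KEY=value lines into a dict, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name] = value  # no stripping around "=", on purpose
    return values


def missing_settings(values, required=("API_KEY", "ENDPOINT_BASE")):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "API_KEY=sk-example\nENDPOINT_BASE=https://example.invalid/v1\n",
        encoding="utf-8",
    )
    demo = read_env_file(env_path)

print(missing_settings(demo))
print(missing_settings({"API_KEY": "sk-example"}))
```

Because this parser does not trim whitespace around the equals sign, a line such as `API_KEY = value` would be stored under the name `API_KEY ` and reported as missing, which is one way the "no spaces" rule can bite.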

### 1.2.2 Install and Verify Dependencies

The `its_hub` library is pre-installed in the lab environment along with most of its dependencies. Run the following cell to confirm the installation and install `nest_asyncio`, which is needed for running async code inside Jupyter notebooks.


```python
!pip install its_hub nest_asyncio
```

You will likely see "Requirement already satisfied" for most packages. That is expected. The output confirms what is present and at what version. If any package fails to install or reports a version conflict, notify the instructor.


A quick note on why `nest_asyncio` is necessary. Under the hood, `its_hub` uses Python's `asyncio` to send generation requests in parallel. That is how Best-of-N can fire off 5 candidate requests concurrently rather than sequentially. The problem is that Jupyter itself already runs an event loop, and Python's default behavior is to reject a second event loop inside an existing one. `nest_asyncio` patches that restriction so the two can coexist. 

Without it, every call to `scaling_alg.infer()` would raise a `RuntimeError: This event loop is already running` exception. This is not specific to `its_hub`. Any library that starts its own event loop from synchronous code hits the same restriction inside a Jupyter notebook.
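
The restriction can be reproduced with the standard library alone. The sketch below is meant to be run as a plain Python script rather than in a notebook cell, since it starts its own loop with `asyncio.run`. It then tries to drive a second coroutine from inside that running loop, which is roughly the situation a synchronous library call creates inside Jupyter. The function names are illustrative.

```python
import asyncio


async def inner():
    return 42


async def outer():
    # We are now inside a running event loop, as notebook code is.
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        loop.run_until_complete(coro)  # blocking-style call on a busy loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches the loop so that this kind of re-entrant call is allowed instead of rejected.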

```python
import nest_asyncio
nest_asyncio.apply()
```

No output from that cell means it worked. If it throws an `ImportError`, the `pip install` cell above did not complete successfully.

### 1.2.3 Connecting the Generator and Judge to MaaS

No new infrastructure. No new credentials. We are using the same MaaS endpoint and the same API key that the config cell loaded earlier.

We need two model connections: one for generating answers and one for judging them. The generator is `granite-3-2-8b-instruct`, the same model we have been using throughout. The judge will be `qwen3-14b`, the largest model available on our MaaS endpoint. Using a separate, larger model for judgment matters because a model is not always the best evaluator of its own output. A bigger model with broader training can often spot weaknesses that the smaller model cannot recognize in its own responses.



```python
from its_hub.lms import OpenAICompatibleLanguageModel
from its_hub.algorithms import BestOfN
from its_hub.integration.reward_hub import LLMJudgeRewardModel

# The generation model: same as Day 2
lm = OpenAICompatibleLanguageModel(
    endpoint=endpoint_base,
    api_key=key,
    model_name="granite-3-2-8b-instruct",
    temperature=0.7,  # Some variation across candidates
)

# The judge: a separate, larger model for evaluating candidates
# Note: LLMJudgeRewardModel uses litellm internally, which requires
# the "openai/" prefix to route to an OpenAI-compatible endpoint.
judge = LLMJudgeRewardModel(
    model="openai/qwen3-14b",
    criterion="overall_quality",
    judge_type="groupwise",
    api_key=key,
    base_url=endpoint_base,
)

# Wire them together
scaling_alg = BestOfN(judge)

print("Generator: granite-3-2-8b-instruct")
print("Judge:     qwen3-14b (groupwise, overall_quality)")
print("Algorithm: Best-of-N")
print("Ready.")
```

```
Generator: granite-3-2-8b-instruct
Judge:     qwen3-14b (groupwise, overall_quality)
Algorithm: Best-of-N
Ready.
```


Let's walk through what each piece does.

`OpenAICompatibleLanguageModel` wraps any OpenAI-compatible API endpoint. It handles batched requests, retries, and concurrency internally. The `temperature=0.7` is important here. In Day 2 we used `temperature=0` for deterministic single-shot answers. Now we want variation. If every candidate is identical, there is nothing for the judge to choose between. A temperature of 0.7 gives the model enough freedom to explore different phrasings and reasoning paths while staying coherent.
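
To see why temperature controls variation, it helps to look at the standard softmax-with-temperature formula that sampling is usually described with. The sketch below is a textbook illustration, not the serving stack's code; the logits are made up and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass sits on the top token, so repeated samples look alike. As temperature rises the distribution flattens and candidates diverge. A temperature of exactly 0 would divide by zero in this formula; serving stacks typically treat it as greedy decoding instead.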

`LLMJudgeRewardModel` uses a second LLM to evaluate and rank candidate responses. The `criterion="overall_quality"` tells the judge to evaluate on general quality rather than a narrow metric. The `judge_type="groupwise"` means the judge sees all candidates at once and picks a winner, rather than scoring each one independently. This tends to produce better selections because the judge can compare directly.

Two details on the judge configuration deserve attention.

First, the `base_url` parameter. Without it, `LLMJudgeRewardModel` defaults to calling OpenAI's API. Setting `base_url=endpoint_base` routes the judge's calls through our MaaS endpoint instead. This is a common gotcha when using LLM-as-judge with custom serving infrastructure.

Second, the `openai/` prefix on the model name. The `LLMJudgeRewardModel` uses `litellm` internally for its API calls, and litellm uses that prefix to determine which provider protocol to speak. Without it, litellm does not know that MaaS is an OpenAI-compatible endpoint and the call fails with a "LLM Provider NOT provided" error. The generator does not need this prefix because `OpenAICompatibleLanguageModel` makes direct HTTP requests to the endpoint, bypassing litellm entirely. This is a subtle but important difference between the two components. If you hit a provider routing error on the judge but the generator works fine, check the prefix first.
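
If you switch judge models often, a tiny guard can keep the prefix consistent. This is a convenience sketch, not something `its_hub` or `litellm` provides; `with_openai_prefix` is a hypothetical helper.

```python
def with_openai_prefix(model_name):
    """Return the model name with an 'openai/' provider prefix, added at most once."""
    prefix = "openai/"
    return model_name if model_name.startswith(prefix) else prefix + model_name


print(with_openai_prefix("qwen3-14b"))
print(with_openai_prefix("openai/qwen3-14b"))
```

Use it only for the judge's model name. The generator takes the bare model name, as described above.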

`BestOfN` is the algorithm that ties them together. It takes the judge at construction time. When we call `scaling_alg.infer()`, it will generate N candidates using the language model, pass all of them to the judge, and return the one the judge ranked highest.

If the cell runs without error, you should see the four confirmation lines printed. If you get a connection error, verify that your endpoint and API key are correct and that both `granite-3-2-8b-instruct` and `qwen3-14b` are in your MaaS subscriptions.

## 1.3 Best-of-N with LLM Judge

Now we apply this to the questions from Day 2. We will run all 10 questions through the Best-of-N pipeline so we can observe two things: whether the passing questions still pass (no regression), and whether the failing questions improve.
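
The two observations (no regression, and improvement on failures) amount to a four-way comparison per question. A minimal sketch of that bookkeeping follows. `compare_runs` is a hypothetical helper and the question ids and verdicts are invented for illustration; they are not lab results.

```python
def compare_runs(baseline, scaled):
    """Compare per-question pass/fail between two runs.

    baseline, scaled: dicts mapping question id -> True (pass) or False (fail).
    """
    summary = {"still_pass": [], "regressed": [], "improved": [], "still_fail": []}
    for qid, passed_before in baseline.items():
        passed_after = scaled[qid]
        if passed_before and passed_after:
            summary["still_pass"].append(qid)
        elif passed_before:
            summary["regressed"].append(qid)
        elif passed_after:
            summary["improved"].append(qid)
        else:
            summary["still_fail"].append(qid)
    return summary


# Invented verdicts, only to show the shape of the comparison.
baseline_demo = {"q1": True, "q2": True, "q3": False, "q4": False}
scaled_demo = {"q1": True, "q2": False, "q3": True, "q4": False}
print(compare_runs(baseline_demo, scaled_demo))
```

An empty `regressed` list is the "no regression" condition, and `improved` holds the questions that inference-time scaling rescued.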


### 1.3.1 Running the Target Questions Through Best-of-N

First, we load the Day 2 results. This version of the evaluation data includes the retrieved context for each question so we can reconstruct the exact RAG prompts from Day 2.

```python
import json

with open("../prebuilt/eval_with_context.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)

results_day2 = eval_data["results"]

print(f"Loaded {len(results_day2)} questions from Day 2 evaluation")
print(f"Passes:   {sum(1 for r in results_day2 if r['classification'] == 'pass')}")
print(f"Failures: {sum(1 for r in results_day2 if r['classification'] != 'pass')}")
```

```
Loaded 10 questions from Day 2 evaluation
Passes:   6
Failures: 4
```


Let's walk through the structure of this loop before running it.

For each of the 10 questions, we build a prompt that includes the system instruction, the retrieved context from Day 2, and the question itself. This is the same prompt structure the model saw during the Day 2 evaluation. The only difference is what happens after the prompt is sent. Instead of a single generation call, `scaling_alg.infer()` sends 5 identical requests (the `budget`), collects the 5 candidate responses, passes them to the `qwen3-14b` judge, and returns the winner.
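
It is worth estimating the request volume before running the loop. The arithmetic below assumes one judge call per question, on the reasoning that a groupwise judge sees all candidates at once. That is an assumption about the library's internals rather than a documented guarantee, and `estimate_calls` is a hypothetical helper.

```python
def estimate_calls(num_questions, budget, judge_calls_per_question=1):
    """Rough request counts for a Best-of-N run (assumption-based)."""
    generation_calls = num_questions * budget
    judge_calls = num_questions * judge_calls_per_question
    return generation_calls, judge_calls


generation_calls, judge_calls = estimate_calls(num_questions=10, budget=5)
print(generation_calls, judge_calls)  # 50 10
```

Compared with the 10 single-shot calls of the Day 2 evaluation, that is the compute being traded for a chance at better answers.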

The `return_response_only=False` flag tells `BestOfN` to return the full result object rather than just the winning answer. That gives us access to all 5 scores, which candidate was selected, and how many candidates were generated. We capture all of this for comparison in the next cell.


