### Notebook Walkthrough & References
Prefer watching the flow? Here is the video companion for this notebook: [YouTube walkthrough](https://www.youtube.com/watch?v=w3aTT4kW318).

Helpful references while you run the demo:
- [LangGraph Agentic RAG guide](https://docs.langchain.com/oss/python/langgraph/agentic-rag)
- [Ollama granite4 model card](https://ollama.com/library/granite4)
- [EmbeddingGemma 300M on Hugging Face](https://huggingface.co/google/embeddinggemma-300m)


### Install Dependencies
Install LangGraph, LangChain, and helper packages used throughout this demo.

Need to finish setting up `uv` or Ollama first? I demo both in this video: [uv install @ 1:14](https://www.youtube.com/watch?v=LXSfjOCYD40&t=74s&pp=0gcJCdcCDuyUWbzu) and [Ollama install @ 1:50](https://www.youtube.com/watch?v=LXSfjOCYD40&t=110s).


In [1]:
!uv pip install -U langgraph "langchain[openai]" langchain-community langchain-text-splitters bs4 langchain_huggingface load_dotenv langchain-chroma sentence_transformers langchain-ollama

Using Python 3.13.5 environment at: langgraph-rag-demo
Resolved 141 packages in 712ms

### Load Source Articles
Fetch three of Lilian Weng's blog posts with `WebBaseLoader` so we have raw content to work with.


In [2]:
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
    "https://lilianweng.github.io/posts/2024-07-07-hallucination/",
    "https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",
]

docs = [WebBaseLoader(url).load() for url in urls]

  from .autonotebook import tqdm as notebook_tqdm
USER_AGENT environment variable not set, consider setting it to identify your requests.
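
The `USER_AGENT` warning in the output above is harmless, but you can silence it by setting the variable yourself. A minimal sketch, assuming you run it in a cell above the `WebBaseLoader` import (the warning appears to be raised when the loader module is imported); the agent string is a made-up placeholder, so use whatever identifies your project:

```python
import os

# Set a User-Agent only if the environment does not already provide one.
# The value below is a placeholder (assumption); any descriptive string works.
os.environ.setdefault("USER_AGENT", "langgraph-rag-demo/0.1")

print(os.environ["USER_AGENT"])
```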


### Preview Raw Content
Glance at the opening portion of the first document to confirm the loader worked.


In [3]:
docs[0][0].page_content.strip()[:1000]

"Reward Hacking in Reinforcement Learning | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n|\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\n\n\n\n\n\n      Reward Hacking in Reinforcement Learning\n    \nDate: November 28, 2024  |  Estimated Reading Time: 37 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nBackground\n\nReward Function in RL\n\nSpurious Correlation\n\n\nLet’s Define Reward Hacking\n\nList of Examples\n\nReward hacking examples in RL tasks\n\nReward hacking examples in LLM tasks\n\nReward hacking examples in real life\n\n\nWhy does Reward Hacking Exist?\n\n\nHacking RL Environment\n\nHacking RLHF of LLMs\n\nHacking the Training Process\n\nHacking the Evaluator\n\nIn-Context Reward Hacking\n\n\nGeneralization of Hacking Skills\n\nPeek into Mitigations\n\nRL Algorithm Improvement\n\nDetecting Reward Hacking\n\nData Analysis of RLHF\
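
The preview shows long runs of `\n` left over from the page's navigation markup. If you want tidier text before chunking, one option is to collapse those runs. This is a sketch only: `collapse_blank_runs` is a hypothetical helper, not part of the notebook or of LangChain, and the sample string is invented to mimic the output above.

```python
import re


def collapse_blank_runs(text: str) -> str:
    """Collapse any run of blank lines into a single blank line."""
    # A newline, optional whitespace, then another newline marks a blank run.
    return re.sub(r"\n\s*\n", "\n\n", text).strip()


sample = "Lil'Log\n\n\n\n\n|\n\n\n\nPosts\n\n\n\n\nArchive"
print(collapse_blank_runs(sample))
```

In the notebook you could apply it to each document's `page_content` before splitting, at the cost of losing the original spacing.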

### Chunk Documents
Split the articles into overlapping chunks that are easier to embed and retrieve.


In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=50
)
doc_splits = text_splitter.split_documents(docs_list)
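
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker. It is a sketch under stated assumptions, not how `RecursiveCharacterTextSplitter` works internally: the real splitter also prefers to break on separators such as paragraph boundaries, and it counts tiktoken tokens, whereas this example uses a plain list of tokens. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end
        start += step
    return chunks


# Ten toy "tokens", windows of 4 with an overlap of 2.
print(sliding_window_chunks(list(range(10)), chunk_size=4, chunk_overlap=2))
```

With `chunk_size=100` and `chunk_overlap=50`, as in the cell above, each chunk can share up to half of its tokens with its neighbour.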

### Inspect First Chunk
Print one chunk to understand the snippets the retriever will see.


In [5]:
print(doc_splits[0].page_content.strip())

Reward Hacking in Reinforcement Learning | Lil'Log

Lil'Log

|

Posts

Archive

Search

Tags

FAQ


### Configure Embeddings
Log in to Hugging Face and initialize the EmbeddingGemma model used for vectorization.

Replace the `'YOUR_HUGGINGFACE_API_TOKEN'` placeholder in the `login(...)` call with your own token (loading it from an environment variable is safest) so the embedding model download can authenticate.
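
One way to keep the token out of the notebook is to read it from the environment. A minimal sketch: `read_hf_token` is a hypothetical helper, and the variable name `HF_TOKEN` is an assumption about how you export the token in your shell.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in env_var, or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# In the notebook, this would feed the existing call:
#   login(token=read_hf_token())
```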


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from huggingface_hub import login

login(token='YOUR_HUGGINGFACE_API_TOKEN')

EMBEDDING_MODEL = 'google/embeddinggemma-300m'  # Google's EmbeddingGemma model
EMBEDDING_DIMS = 256  # Truncated from the native 768 dims: vectors are 3x smaller, so storage and similarity search are cheaper


embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    encode_kwargs={"truncate_dim": EMBEDDING_DIMS}
)
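
For intuition about what truncating to `EMBEDDING_DIMS` does to a vector, here is the general idea in plain Python: keep the leading components, then rescale to unit length. This is an illustrative sketch, not the library's internal code, and `truncate_and_normalize` is a hypothetical name.

```python
import math


def truncate_and_normalize(vector: list, dims: int) -> list:
    """Keep the first `dims` components and rescale them to unit length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# A toy 3-dimensional embedding truncated to 2 dimensions.
print(truncate_and_normalize([3.0, 4.0, 12.0], dims=2))
```

Truncation trades a little retrieval quality for smaller vectors, so it is worth spot-checking results if you change the dimension.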

### Build Vector Store
Store the chunked documents in an in-memory vector index and expose a retriever.


In [7]:
from langchain_core.vectorstores import InMemoryVectorStore

vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits, embedding=embeddings
)

retriever = vectorstore.as_retriever()
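
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity over a toy index. It is an assumption-laden illustration, not `InMemoryVectorStore`'s actual code; the two-dimensional vectors and the `top_k` helper are made up.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, index: list, k: int = 2) -> list:
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index of (text, embedding) pairs with invented 2-d embeddings.
toy_index = [
    ("reward hacking", [1.0, 0.0]),
    ("diffusion video", [0.0, 1.0]),
    ("hallucination", [0.7, 0.7]),
]
print(top_k([0.9, 0.1], toy_index, k=2))
```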


### Create Retriever Tool
Wrap the retriever as a LangChain tool so the agent can call it.


In [8]:
from langchain_classic.tools.retriever import create_retriever_tool

retriever_tool = create_retriever_tool(
    retriever,
    "retrieve_blog_posts",
    "Search and return information about Lilian Weng blog posts.",
)

### Run Sample Retrieval
Issue a reward-hacking query through the tool to fetch supporting passages.


In [9]:
res = retriever_tool.invoke({"query": "types of reward hacking"})

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Display Retrieved Chunks
Pretty-print the retrieval result to inspect what was returned.


In [10]:
from pprint import pprint
pprint(res)

('Detecting Reward Hacking#\n'
 '\n'
 '(Note: Some work defines reward tampering as a distinct category of '
 'misalignment behavior from reward hacking. But I consider reward hacking as '
 'a broader concept here.)\n'
 'At a high level, reward hacking can be categorized into two types: '
 'environment or goal misspecification, and reward tampering.\n'
 '\n'
 'Why does Reward Hacking Exist?#\n'
 '\n'
 'In-Context Reward Hacking#')


### Configure Response Model
Set up a ChatOllama model plus message/state types that power the agent.

Reminder: make sure Ollama is installed, the service is running locally, and the `granite4:350m` model (or whichever model you pick) is already pulled, for example with `ollama pull granite4:350m`. See the Ollama install video linked under Install Dependencies if you need a refresher.


In [11]:
from langgraph.graph import MessagesState
from langchain_ollama import ChatOllama

response_model = ChatOllama(
    model="granite4:350m",
    validate_model_on_init=True,
    temperature=0,
)

### Define Generate-or-Respond Node
Create the node that either answers directly or requests retrieval based on the conversation.


In [12]:
def generate_query_or_respond(state: MessagesState):
    """Call the model to generate a response based on the current state. Given
    the question, it will decide to retrieve using the retriever tool, or simply respond to the user.
    """
    response = (
        response_model
        .bind_tools([retriever_tool]).invoke(state["messages"])  
    )
    return {"messages": [response]}

### Test Direct Response Path
Probe the node with a simple greeting to ensure it can reply without retrieval.


In [13]:
input = {"messages": [{"role": "user", "content": "hello!"}]}
generate_query_or_respond(input)["messages"][-1].pretty_print()


I'm a helpful assistant with access to certain tools. Could you please provide more details about what you need assistance with? I can look up blog posts or perform other tasks using the functions available to me.


### Test Retrieval-triggering Question
Send a domain question to confirm the model issues a retriever tool call.


In [14]:
input = {
    "messages": [
        {
            "role": "user",
            "content": "What does Lilian Weng say about types of reward hacking?",
        }
    ]
}
generate_query_or_respond(input)["messages"][-1].pretty_print()

Tool Calls:
  retrieve_blog_posts (738041e7-d93d-411a-a856-db91c35721a3)
 Call ID: 738041e7-d93d-411a-a856-db91c35721a3
  Args:
    query: types of reward hacking


### Set Up Document Grader
Define the relevance-grading prompt, schema, and logic that decide if context is good.


In [15]:
from pydantic import BaseModel, Field
from typing import Literal

GRADE_PROMPT = (
    "You are a grader assessing relevance of a retrieved document to a user question. \n "
    "Here is the retrieved document: \n\n {context} \n\n"
    "Here is the user question: {question} \n"
    "If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n"
    "Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question."
)


class GradeDocuments(BaseModel):  
    """Grade documents using a binary score for relevance check."""

    binary_score: str = Field(
        description="Relevance score: 'yes' if relevant, or 'no' if not relevant"
    )


grader_model = ChatOllama(
    model="granite4:350m",
    # Alternative small models to try:
    # model="llama3.2:1b",
    # model="qwen3:0.6b",
    validate_model_on_init=True,
    temperature=0,
)


def grade_documents(
    state: MessagesState,
) -> Literal["generate_answer", "rewrite_question"]:
    """Determine whether the retrieved documents are relevant to the question."""
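    # The first message is the original user question;
    # the last message is the retriever tool's output.
    question = state["messages"][0].content
    context = state["messages"][-1].content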

    prompt = GRADE_PROMPT.format(question=question, context=context)
    response = (
        grader_model
        .with_structured_output(GradeDocuments).invoke(  
            [{"role": "user", "content": prompt}]
        )
    )
    score = response.binary_score

    if score == "yes":
        return "generate_answer"
    else:
        return "rewrite_question"
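
The routing above depends on an exact match with the string `"yes"`. Small models sometimes return variants such as `"Yes"` or `"yes."`, so a more forgiving comparison can help. The sketch below is one possible approach, not part of the original notebook; `route_from_score` is a hypothetical helper.

```python
def route_from_score(raw_score: str) -> str:
    """Map a grader score to the next node, tolerating case and punctuation."""
    # Drop surrounding whitespace, quotes, and trailing periods, then lowercase.
    normalized = raw_score.strip().strip("'\".").lower()
    if normalized == "yes":
        return "generate_answer"
    return "rewrite_question"


for raw in ["yes", " Yes. ", "no", "maybe"]:
    print(repr(raw), "->", route_from_score(raw))
```

Anything that is not clearly "yes" falls back to rewriting the question, which matches the original `else` branch.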

### Grade Irrelevant Context
Run the grader on nonsense tool output to watch it route toward question rewriting.


In [17]:
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {"role": "tool", "content": "meow", "tool_call_id": "1"},
        ]
    )
}
grade_documents(input)

'rewrite_question'

### Grade Relevant Context
Run the grader on a good snippet to verify it green-lights answer generation.


In [18]:
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {
                "role": "tool",
                "content": "reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering",
                "tool_call_id": "1",
            },
        ]
    )
}
grade_documents(input)

'generate_answer'

### Define Question Rewriter
Add a prompt/function that rewrites the user query before rerunning retrieval.


In [19]:
REWRITE_PROMPT = (
    "Look at the input and try to reason about the underlying semantic intent / meaning.\n"
    "Here is the initial question:"
    "\n ------- \n"
    "{question}"
    "\n ------- \n"
    "Formulate an improved question:"
)


def rewrite_question(state: MessagesState):
    """Rewrite the original user question."""
    messages = state["messages"]
    question = messages[0].content
    prompt = REWRITE_PROMPT.format(question=question)
    response = response_model.invoke([{"role": "user", "content": prompt}])
    return {"messages": [{"role": "user", "content": response.content}]}
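
Because `rewrite_question` appends a new user message each time it runs, you can estimate how many rewrites have happened and stop looping after a few attempts. The sketch below uses plain dicts with a `role` key to stay self-contained; inside the graph the messages are message objects, so treat this as an illustration. `count_rewrites`, `route_with_cap`, and the default cap of 2 are all hypothetical, and the count assumes the conversation started with a single user question.

```python
def count_rewrites(messages: list) -> int:
    """Count user turns beyond the first one (each rewrite adds a user turn)."""
    user_turns = sum(1 for message in messages if message["role"] == "user")
    return max(user_turns - 1, 0)


def route_with_cap(messages: list, is_relevant: bool, max_rewrites: int = 2) -> str:
    """Answer when context is relevant or the rewrite budget is spent."""
    if is_relevant or count_rewrites(messages) >= max_rewrites:
        return "generate_answer"
    return "rewrite_question"


history = [{"role": "user", "content": "original question"}]
print(route_with_cap(history, is_relevant=False))
```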

### Test Question Rewriter
Pass in a mocked trace to see how the question is rephrased.


In [20]:
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {"role": "tool", "content": "meow", "tool_call_id": "1"},
        ]
    )
}

response = rewrite_question(input)
print(response["messages"][-1]["content"])

How can we better understand Lilian Weng's perspective on the various types of reward hacking?


### Define Answer Generator
Create the concise answer prompt/function that consumes question plus retrieved context.


In [31]:
GENERATE_PROMPT = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Retrieved Context: \n'''{context}\n'''\n"
    "Question: \n{question} \n"
)


def generate_answer(state: MessagesState):
    """Generate an answer."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    prompt = GENERATE_PROMPT.format(question=question, context=context)
    response = response_model.invoke([{"role": "user", "content": prompt}])
    return {"messages": [response]}

### Test Answer Generator
Feed in a mocked retrieval response to ensure the agent can craft a final reply.


In [32]:
input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retrieve_blog_posts",
                        "args": {"query": "types of reward hacking"},
                    }
                ],
            },
            {
                "role": "tool",
                "content": "Lilian Weng says reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering",
                "tool_call_id": "1",
            },
        ]
    )
}

response = generate_answer(input)
response["messages"][-1].pretty_print()


Lilian Weng categorizes reward hacking into two types: environment or goal mis-specification and reward tampering.


### Assemble LangGraph Workflow
Wire all nodes, tools, and routing logic into a LangGraph state machine.


In [33]:
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, tools_condition

workflow = StateGraph(MessagesState)

# Define the nodes we will cycle between
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(rewrite_question)
workflow.add_node(generate_answer)

workflow.add_edge(START, "generate_query_or_respond")

# Decide whether to retrieve
workflow.add_conditional_edges(
    "generate_query_or_respond",
    # Assess LLM decision (call `retriever_tool` tool or respond to the user)
    tools_condition,
    {
        # Translate the condition outputs to nodes in our graph
        "tools": "retrieve",
        END: END,
    },
)

# After retrieval, grade the documents and route to `generate_answer` or `rewrite_question`
workflow.add_conditional_edges(
    "retrieve",
    # `grade_documents` returns the name of the next node
    grade_documents,
)
workflow.add_edge("generate_answer", END)
workflow.add_edge("rewrite_question", "generate_query_or_respond")

# Compile
graph = workflow.compile()
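
If the wiring is hard to picture, this plain-Python trace mimics the routing decisions without calling any model. It is a sketch of the control flow only: `trace_route` is a hypothetical helper, the grades are supplied by hand, and once they run out the trace assumes a "yes" so that it always terminates.

```python
END = "__end__"  # stand-in for LangGraph's END sentinel


def trace_route(wants_tool: bool, grades: list) -> list:
    """Return the node names visited for a given sequence of grader outcomes."""
    remaining = list(grades)
    visited = []
    node = "generate_query_or_respond"
    while node != END:
        visited.append(node)
        if node == "generate_query_or_respond":
            # Mirrors tools_condition: retrieve only if a tool call was made.
            node = "retrieve" if wants_tool else END
        elif node == "retrieve":
            # Mirrors grade_documents: "yes" answers, anything else rewrites.
            grade = remaining.pop(0) if remaining else "yes"
            node = "generate_answer" if grade == "yes" else "rewrite_question"
        elif node == "rewrite_question":
            node = "generate_query_or_respond"
        else:  # generate_answer
            node = END
    return visited


print(trace_route(wants_tool=True, grades=["no", "yes"]))
```

A greeting corresponds to `wants_tool=False`, which ends after the first node, while a single "yes" grade gives the three-node path: query, retrieve, answer.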

### Stream End-to-end Run
Execute the compiled graph on a sample question and view each node's updates.


In [34]:
for chunk in graph.stream(
    {
        "messages": [
            {
                "role": "user",
                "content": "What does Lilian Weng say about types of reward hacking?",
            }
        ]
    }
):
    for node, update in chunk.items():
        print("Update from node", node)
        update["messages"][-1].pretty_print()
        print("\n\n")

Update from node generate_query_or_respond
Tool Calls:
  retrieve_blog_posts (d66958a1-ee7f-4830-a0c6-7704cf750506)
 Call ID: d66958a1-ee7f-4830-a0c6-7704cf750506
  Args:
    query: types of reward hacking



Update from node retrieve
Name: retrieve_blog_posts

Detecting Reward Hacking#

(Note: Some work defines reward tampering as a distinct category of misalignment behavior from reward hacking. But I consider reward hacking as a broader concept here.)
At a high level, reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering.

Why does Reward Hacking Exist?#

In-Context Reward Hacking#



Update from node generate_answer

Lilian Weng states that reward hacking can be categorized into two types: environment or goal misspecification, and reward tampering.



