---
title: "TIL from Stanford CME295 Transformers & LLMs | Lecture 7 - Agentic LLMs"
date: "2025-12-26"
categories: 
    - Agentic LLMs
    - RAG
---

## My notes and reflections on Lecture 7 - Agentic LLMs
Lecture emphasises the shift from "Prompt Engineering" to "Context Engineering"

### RAG is needed for fresh knowledge
1. Model has knowledge cut-off and it is tricky to update it (think catastrophic forgetting, LoRA/ fine-tuning and doing it for every knowledge-update or use-case)
2. even with large context windows of 100k-1M tokens, we still need **retrieval** because there are problems like:
    - finding needle in haystack - high recall for single needle but for multiple needles (more real-world scenario) recall drops massively -> garbage in, garbage out still applies
    - we have rate limits on #tokens; higher costs for more tokens; full corpus can't fit in context window. RAG reduces cost per token.

### RAG pipeline

1. Candidate Retrieval: Millions of chunks to hundreds of candidates using bi-encoder embeddings and Approximate Nearest Neighbors (ANN)
    - Bi-Encoder: Query and document chunk encoded independently via SentenceBERT (SBERT) -> compute fast cosine similarity. Also called siamese
    - Hybrid: embedding search + BM25 
    - hyperparameters:Embedding size, chunk size, overlap between chunks
2. Reranking (optional): rescore candidates using Cross-Encoder -> query and document fed simultaneously -> self-attention magic -> more accurate score than simple cosine similarity
3. Context Composition: 
4. Generation




```{mermaid}
flowchart LR
    C["Retrieve <br>(High Recall w Hybrid)"] --> D["Rerank <br>(High Precision w Cross-Encoder)"]
    D --> n1["Compose Context<br>(Clean, Dedup, Scope)"]
    n1 --> n2["Generate"]
``` 

### High value tweaks to retrieval step
- **Contextual Retrieval**: use cheap LLM to contextualize chunks to make them "self-contained"
    - e.g., "He won the election" -> "Donald Trump won the 2024 election..." before embedding
    - A chunk saying "It increased by 5%" is mathematically useless to an embedding model if the subject ("Revenue" or "Churn"?) was in the previous chunk.
    - widely used, no added latency, solves for "lost in the middle" problem better than overlapping windows
    - **Prompt Caching**: cache activations of static prompt prefixes saves costs (upto ~90%) -> when writing a prompt, put reusable part on top (system, instructions, examples, static context etc.) as it is *decoder* only


- **HyDE**: generate fake document of query Q (as it is usually shorter, a question and doesn't look like document) and then compare with document embeddings
    - niche, introduces latency


### Eval metrics for Retrieval
- NDGC (normalized discounted cumulative gain)
    - NDCG@10:If the right answer is in the top 10 but at rank #9, your system effectively failed (users won't read it). NDCG penalizes this heavily.
- MRR (mean reciprocal rank) - MRR (Mean Reciprocal Rank) is often more honest than Recall. Recall@10 says: "We found the answer in the top 10 results!" (Great, but if it was result #10, the LLM might have ignored it due to attention decay). MRR says:"On average, the answer appeared at Rank 1.2." (This confirms the model actually saw the data).
- MAP (mean average precision)
- Precision@k: Did we fetch only relevant stuff, or did we pollute the context window with noise? (needle in haystack)
- Recall@k: Did we find it at all?

#### MTEB (Massive Text Embedding Benchmark)
- for embedding model performance


### Tool Calling for Structured Data
- tool calling for structured data / deterministic operations, modeled as functions with arguments and return values


### Tool Selection
- Classification problem. Model outputs a probability distribution over available tools or `null` (if no tool required)

### MCP (Model Context Protocol)
- sits between LLM and tool, integration logic is standardised
- Standardizes how an LLM reads a PDF, a SQL row, or a Slack message. It replaces your custom def get_slack_messages(): function.
- [Nov 2025 version](https://modelcontextprotocol.io/specification/2025-11-25)

### Agents and ReAct (Reasoning + Acting) Framework
- *Agent* is a system that autonomously pursues goals and completes tasks on a user's behalf
- Traditional vs Reasoning vs Agent (tool calls)
- `while goal is not achieved`: Observe -> Plan -> Act
- in practice, ReAct -> LangGraph (state machine) - stricter
- hallucinations are a (big) problem
- agents interact with each other - A2A protocol (Google)
    - It's gRPC/REST for Agents.
    - Standardizes how a "Travel Agent" asks a "Calendar Agent" for availability. It handles the negotiation of intent, not just the reading of bytes.


### Safety and Guardrails

- **Prompt Injection** - external content (e.g., website) are vulnerable to indirect prompt injection where hidden text on a webpage hijacks agent's instructions

- **Guardrails** - lighter specialized models that can scan inputs/outputs for toxicity and policy violations before LLM processes them

### strategy

- Good to start simple, then iterate and progressively scale up
- Good to start with capable models, optimize on size later
- Transparency / observability helps with user trust and debuggability

### References


- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)
    - The reason we can search 10M documents in milliseconds (pre-computed embeddings) vs. seconds (Cross-Encoders).
- HyDE paper | [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496)
- Anthropic [Contextual Retrieval](https://www.anthropic.com/engineering/contextual-retrieval)
- [Cross-Encoder](https://sbert.net/examples/cross_encoder/applications/README.html)
- [Automatic Tool Selection to Reduce Large Language Model Latency](https://www.tdcommons.org/cgi/viewcontent.cgi?params=/context/dpubs_series/article/8702/&path_info=Automatic_Tool_Selection_to_Reduce_Large_Language_Model_Latency.pdf)
- [MCP - Nov 2025 - specification](https://modelcontextprotocol.io/specification/2025-11-25)
- ReAct paper | [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)
    - The "While Loop" of AI. The paper that proved LLMs perform better when they "talk to themselves" before acting.
- [A2A: A New Era of Agent Interoperability](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/) | [Google's Agent2Agent Protocol (A2A)](https://a2a-protocol.org/latest/specification/)
    - The first major attempt to standardize inter-agent communication (horizontal) rather than just tool communication (vertical)
- [ToolSword: Unveiling Safety Issues of Large Language Models
in Tool Learning Across Three Stages](https://arxiv.org/pdf/2402.10753)
- ["Towards Tool Use Alignment of Large Language Models", Chen et al., 2024.](https://aclanthology.org/2024.emnlp-main.82.pdf)
- [AGENT-SAFETYBENCH: Evaluating the Safety of
LLM Agents](https://arxiv.org/pdf/2412.14470)
- [Anthropic Cyber Attack](https://www.anthropic.com/news/disrupting-AI-espionage)



>**Licensing Notice**: Text and media: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/); Code: [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)