# 02 - Context Optimization: From 45% to <5%

In the previous notebook, we saw that tool schemas can consume 30-50% of the context window.

Now let's solve it with **semantic vector search** and **on-demand loading**.

## Learning Objectives

After this notebook, you will:

- [ ] Understand how vector embeddings enable semantic search
- [ ] See on-demand tool loading in action
- [ ] Measure the context savings

---

## The Idea: Load Only What You Need

Instead of loading all 120 tools upfront, what if we could:

1. **Index** all tools once (at startup)
2. **Search** for relevant tools when needed (semantic similarity)
3. **Load** only the top 3-5 matching tools

```
User: "Read the config file and create a GitHub issue"
     ↓
Vector Search: Find tools matching this intent
     ↓
Results: [filesystem:read_file, github:create_issue, json:parse]
     ↓
Load: Only these 3 tool schemas (not all 120)
```

## How Vector Search Works

### Step 1: Create Embeddings

Each tool description is converted to a **vector** (array of numbers) that captures its meaning:

```
"read_file: Read contents of a file from the filesystem"
    ↓ BGE-M3 model
[0.023, -0.156, 0.891, ..., 0.044]  (1024 dimensions)
```
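To make the "text in, vector out" idea concrete without downloading a model, here is a toy stand-in: a hashed bag-of-words embedding. This is not how BGE-M3 works (a neural model captures meaning, while this only captures shared words), and both `toyEmbed` and its 16-dimension default are assumptions made for illustration.

```typescript
// Toy stand-in for an embedding model: hashed bag-of-words.
// NOT BGE-M3: it only demonstrates the shape "text in, fixed-length unit vector out".
function toyEmbed(text: string, dims = 16): number[] {
  const vec = new Array<number>(dims).fill(0);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const word of words) {
    // Simple deterministic string hash that picks one bucket per word.
    let hash = 0;
    for (let i = 0; i < word.length; i++) {
      hash = (hash * 31 + word.charCodeAt(i)) >>> 0;
    }
    vec[hash % dims] += 1;
  }
  // L2-normalize so a dot product between two vectors equals their cosine similarity.
  const norm = Math.hypot(...vec);
  return norm === 0 ? vec : vec.map((v) => v / norm);
}

const toyVector = toyEmbed("read_file: Read contents of a file from the filesystem");
console.log(toyVector.length); // prints 16
```

A real embedding model keeps the same contract (one fixed-length vector per text) but places texts with similar meaning close together even when they share no words.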

### Step 2: Index in Database

All embeddings are stored in PGlite with pgvector for fast similarity search.
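As a rough sketch of what such an index could look like, the snippet below builds the SQL for a pgvector table and a cosine-distance search. The table name `tool_embeddings`, its columns, and the helper `toVectorLiteral` are hypothetical; the gateway's real schema is not shown in this notebook. The parts taken from pgvector itself are the `vector(n)` column type, the `[v1,v2,...]` text literal, and the `<=>` cosine distance operator.

```typescript
// Hypothetical schema: the gateway's real table layout is not shown in this notebook.
const EMBEDDING_DIMS = 1024; // BGE-M3 output size, as stated in Step 1

const createTableSql = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS tool_embeddings (
  tool_id   TEXT PRIMARY KEY,
  embedding vector(${EMBEDDING_DIMS}) NOT NULL
);`;

// pgvector accepts vectors as text literals of the form '[0.1,0.2,...]'.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((v) => !Number.isFinite(v))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// `<=>` is pgvector's cosine distance operator; similarity = 1 - distance.
const searchSql = `
SELECT tool_id, 1 - (embedding <=> $1::vector) AS similarity
FROM tool_embeddings
ORDER BY embedding <=> $1::vector
LIMIT $2;`;

console.log(toVectorLiteral([0.023, -0.156, 0.891])); // prints "[0.023,-0.156,0.891]"
```

With a Postgres client, `searchSql` would be run with the query embedding literal and the desired top-k as its two parameters.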

### Step 3: Query

When you ask a question, we:

1. Convert your question to an embedding
2. Find the closest tool embeddings (cosine similarity)
3. Return the top-k matches
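
The three query steps can be sketched as a brute-force scan in plain TypeScript. The names `cosineSimilarity`, `topKBySimilarity`, and `IndexedItem` are illustrative, and the 3-dimensional vectors are hand-made rather than real embeddings; in the gateway the comparison happens inside the database.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as "no similarity"
}

interface IndexedItem {
  id: string;
  embedding: number[];
}

// Score every item against the query, sort descending, keep the best k.
function topKBySimilarity(queryEmbedding: number[], items: IndexedItem[], k: number) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hand-made 3-dimensional vectors, purely illustrative.
const demoIndex: IndexedItem[] = [
  { id: "filesystem:read_file", embedding: [0.9, 0.1, 0.0] },
  { id: "github:create_issue", embedding: [0.1, 0.9, 0.1] },
  { id: "slack:send_message", embedding: [0.0, 0.2, 0.9] },
];
const demoMatches = topKBySimilarity([1, 0, 0], demoIndex, 2);
// filesystem:read_file ranks first, github:create_issue second
console.log(demoMatches.map((m) => m.id));
```

A linear scan like this is fine for a few hundred tools; a vector database does the same ranking with indexing so it stays fast as the catalog grows.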

In [5]:
// Simulate vector search behavior
// (In production, this uses real embeddings from BGE-M3)

interface Tool {
  id: string;
  server: string;
  name: string;
  description: string;
  tokens: number;
}

// Sample of our 120 tools
const allTools: Tool[] = [
  {
    id: "gh-1",
    server: "github",
    name: "create_issue",
    description: "Create a new issue in a GitHub repository",
    tokens: 850,
  },
  {
    id: "gh-2",
    server: "github",
    name: "list_commits",
    description: "List commits from a repository branch",
    tokens: 720,
  },
  {
    id: "gh-3",
    server: "github",
    name: "create_pr",
    description: "Create a pull request",
    tokens: 920,
  },
  {
    id: "fs-1",
    server: "filesystem",
    name: "read_file",
    description: "Read contents of a file from the filesystem",
    tokens: 480,
  },
  {
    id: "fs-2",
    server: "filesystem",
    name: "write_file",
    description: "Write content to a file",
    tokens: 520,
  },
  {
    id: "fs-3",
    server: "filesystem",
    name: "list_directory",
    description: "List files and folders in a directory",
    tokens: 450,
  },
  {
    id: "db-1",
    server: "database",
    name: "query",
    description: "Execute a SQL query on the database",
    tokens: 680,
  },
  {
    id: "db-2",
    server: "database",
    name: "insert",
    description: "Insert rows into a database table",
    tokens: 750,
  },
  {
    id: "sl-1",
    server: "slack",
    name: "send_message",
    description: "Send a message to a Slack channel",
    tokens: 620,
  },
  {
    id: "sl-2",
    server: "slack",
    name: "search_messages",
    description: "Search for messages in Slack",
    tokens: 580,
  },
  {
    id: "pw-1",
    server: "playwright",
    name: "screenshot",
    description: "Take a screenshot of a webpage",
    tokens: 540,
  },
  {
    id: "pw-2",
    server: "playwright",
    name: "click",
    description: "Click an element on a webpage",
    tokens: 490,
  },
  {
    id: "json-1",
    server: "utils",
    name: "parse_json",
    description: "Parse a JSON string into an object",
    tokens: 320,
  },
];

console.log(`Indexed ${allTools.length} tools (sample from 120 total)`);
console.log();
console.log("Tools by server:");
const byServer = allTools.reduce((acc, t) => {
  acc[t.server] = (acc[t.server] || 0) + 1;
  return acc;
}, {} as Record<string, number>);
for (const [server, count] of Object.entries(byServer)) {
  // Pluralize so a single-tool server prints "1 tool"
  console.log(`  ${server}: ${count} ${count === 1 ? "tool" : "tools"}`);
}

Indexed 13 tools (sample from 120 total)

Tools by server:
  github: 3 tools
  filesystem: 3 tools
  database: 2 tools
  slack: 2 tools
  playwright: 2 tools
  utils: 1 tool


In [6]:
// Simulate semantic search (keyword-based approximation for demo)
function searchTools(query: string, topK: number = 5): { tool: Tool; score: number }[] {
  const queryWords = query.toLowerCase().split(/\s+/);

  const scored = allTools.map((tool) => {
    const text = `${tool.name} ${tool.description}`.toLowerCase();
    let score = 0;

    for (const word of queryWords) {
      if (text.includes(word)) score += 0.3;
      if (tool.name.toLowerCase().includes(word)) score += 0.5;
    }

    // Boost exact matches
    if (text.includes(query.toLowerCase())) score += 0.8;

    return { tool, score: Math.min(score, 1.0) };
  });

  return scored
    .filter((r) => r.score > 0.2)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Test search
const query = "read config file and create github issue";
console.log(`Query: "${query}"\n`);

const results = searchTools(query, 5);
console.log("Search Results:\n" + "=".repeat(50));
for (const { tool, score } of results) {
  console.log(`  ${(score * 100).toFixed(0)}% │ ${tool.server}:${tool.name}`);
  console.log(`      └─ ${tool.description}`);
}

Query: "read config file and create github issue"

Search Results:
==================================================
  100% │ github:create_issue
      └─ Create a new issue in a GitHub repository
  100% │ filesystem:read_file
      └─ Read contents of a file from the filesystem
  80% │ github:create_pr
      └─ Create a pull request
  80% │ filesystem:write_file
      └─ Write content to a file
  60% │ filesystem:list_directory
      └─ List files and folders in a directory


## Measuring the Savings

Now let's compare context usage:

In [7]:
// Calculate context savings
const CONTEXT_WINDOW = 200_000;

// Traditional approach: load ALL tools
const totalAllTools = 120;
const avgTokensPerTool = 680;
const traditionalTokens = totalAllTools * avgTokensPerTool;
const traditionalPct = traditionalTokens / CONTEXT_WINDOW * 100;

// On-demand approach: load only matched tools
const matchedTools = results.map((r) => r.tool);
const onDemandTokens = matchedTools.reduce((sum, t) => sum + t.tokens, 0);
const onDemandPct = onDemandTokens / CONTEXT_WINDOW * 100;

console.log("Context Usage Comparison\n" + "=".repeat(50));
console.log();
console.log("TRADITIONAL (load all tools):");
console.log(`  Tools loaded:    ${totalAllTools}`);
console.log(`  Tokens used:     ${traditionalTokens.toLocaleString()}`);
console.log(`  Context %:       ${traditionalPct.toFixed(1)}%`);
console.log();
console.log("ON-DEMAND (vector search):");
console.log(`  Tools loaded:    ${matchedTools.length}`);
console.log(`  Tokens used:     ${onDemandTokens.toLocaleString()}`);
console.log(`  Context %:       ${onDemandPct.toFixed(2)}%`);
console.log();
console.log("─".repeat(50));
const savings = ((traditionalTokens - onDemandTokens) / traditionalTokens * 100).toFixed(0);
const reduction = (traditionalPct / onDemandPct).toFixed(0);
console.log(`🎉 SAVINGS: ${savings}% reduction (${reduction}x less context)`);

Context Usage Comparison
==================================================

TRADITIONAL (load all tools):
  Tools loaded:    120
  Tokens used:     81,600
  Context %:       40.8%

ON-DEMAND (vector search):
  Tools loaded:    5
  Tokens used:     3,220
  Context %:       1.61%

──────────────────────────────────────────────────
🎉 SAVINGS: 96% reduction (25x less context)
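
The same arithmetic can be turned around to ask how many schemas fit inside a given context budget. The helpers below are a small sketch using this notebook's numbers (a 200,000-token window and 680 tokens per schema on average); the function names are illustrative.

```typescript
// Percentage of the context window used by `toolCount` schemas of average size.
function contextShare(toolCount: number, avgTokensPerSchema: number, contextWindow: number): number {
  return (toolCount * avgTokensPerSchema) / contextWindow * 100;
}

// Largest number of average-sized schemas that stays within `budgetPct` percent.
function maxToolsWithinBudget(budgetPct: number, avgTokensPerSchema: number, contextWindow: number): number {
  return Math.floor(((budgetPct / 100) * contextWindow) / avgTokensPerSchema);
}

console.log(maxToolsWithinBudget(5, 680, 200_000)); // prints 14
console.log(contextShare(14, 680, 200_000).toFixed(2)); // prints "4.76"
console.log(contextShare(120, 680, 200_000).toFixed(1)); // prints "40.8"
```

So under these assumptions, a top-k of up to 14 average-sized tools still stays below the 5% target, which leaves headroom above the 3-5 tools loaded here.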


## Real Implementation

Let's use the actual Casys MCP Gateway search functionality:

In [8]:
// Use the real MCP tool for semantic search
// This requires the gateway to be running

try {
  const response = await fetch("http://localhost:3000/mcp", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "tools/call",
      params: {
        name: "pml_search_tools",
        arguments: {
          query: "take a screenshot of a webpage",
          limit: 5,
          include_related: true,
        },
      },
    }),
  });

  const result = await response.json();
  console.log("Real Search Results:");
  console.log(JSON.stringify(result, null, 2));
} catch (e) {
  console.log("Gateway not running - using simulation above.");
  console.log("To test with real search, start the gateway with: deno task dev");
}

Gateway not running - using simulation above.
To test with real search, start the gateway with: deno task dev
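
When the gateway is running, the raw JSON-RPC envelope is verbose. The helper below is a minimal sketch for pulling the text out of a `tools/call` response, assuming the gateway follows the standard MCP result shape (a `content` array of typed items plus an optional `isError` flag). `extractToolText` and the sample response are hypothetical; the actual payload returned by `pml_search_tools` is not shown in this notebook.

```typescript
interface McpContentItem {
  type: string;
  text?: string;
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string | null;
  result?: { content?: McpContentItem[]; isError?: boolean };
  error?: { code: number; message: string };
}

// Return the concatenated text items, or throw on protocol-level or tool-level errors.
function extractToolText(response: JsonRpcResponse): string {
  if (response.error) {
    throw new Error(`JSON-RPC error ${response.error.code}: ${response.error.message}`);
  }
  if (response.result?.isError) {
    throw new Error("Tool call reported an error");
  }
  return (response.result?.content ?? [])
    .filter((item) => item.type === "text" && typeof item.text === "string")
    .map((item) => item.text as string)
    .join("\n");
}

// Hypothetical response for illustration only.
const sampleResponse: JsonRpcResponse = {
  jsonrpc: "2.0",
  id: 1,
  result: { content: [{ type: "text", text: "playwright:screenshot" }] },
};
console.log(extractToolText(sampleResponse)); // prints "playwright:screenshot"
```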


## The Technical Stack

Casys MCP Gateway uses:

| Component  | Technology             | Purpose                          |
| ---------- | ---------------------- | -------------------------------- |
| Embeddings | BGE-M3 (Xenova/bge-m3) | Convert text to 1024-dim vectors |
| Vector DB  | PGlite + pgvector      | Store and search embeddings      |
| Similarity | Cosine distance        | Find closest matches             |
| Caching    | In-memory LRU          | Avoid re-embedding queries       |
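
To illustrate the caching row, here is a minimal LRU cache sketch built on `Map`, which preserves insertion order. It is not the gateway's implementation; the class name `LruCache` and the capacity of 2 in the demo are assumptions chosen to keep the example short.

```typescript
// Minimal LRU cache: Map iterates in insertion order, so the first key is the oldest.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark this key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

// Cache query text -> embedding so repeated queries skip the model.
const embeddingCache = new LruCache<string, number[]>(2);
embeddingCache.set("read a file", [0.1, 0.2]);
embeddingCache.set("create an issue", [0.3, 0.4]);
embeddingCache.get("read a file"); // touch: now the most recently used
embeddingCache.set("send a message", [0.5, 0.6]); // evicts "create an issue"
console.log(embeddingCache.get("create an issue")); // prints undefined
```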

### Why BGE-M3?

- **Local**: Runs on your machine, no API calls
- **Multilingual**: Supports 100+ languages
- **Quality**: State-of-the-art retrieval performance
- **Free**: No token costs

---

## Quick Check

Before moving on:

1. **What is a vector embedding?**
   - An array of numbers that captures the semantic meaning of text

2. **How does on-demand loading save context?**
   - Instead of loading all 120 tools, we search and load only 3-5 relevant ones

3. **What's the typical context reduction?**
   - From 30-50% down to <5% (1.61% in the simulation above)

---

**Next:** [03-dag-execution.ipynb](./03-dag-execution.ipynb) - Parallelize workflows for 5x speedup