# AI-Powered Data Enrichment Agent

## Overview

This notebook demonstrates building an intelligent data enrichment agent using:

- **LangGraph**: Agentic workflow orchestration
- **Bright Data MCP**: Model Context Protocol integration for web data access
- **Claude Sonnet 4**: Advanced reasoning and structured extraction

### What This Agent Does

1. Takes a research topic and JSON schema as input
2. Autonomously searches the web using Bright Data's SERP API
3. Scrapes relevant websites with anti-bot bypass
4. Extracts and structures data matching your schema
5. Returns the extracted data as JSON (the arguments of the agent's `submit_info` call)

---

## Setup

### Install Dependencies

In [1]:
!pip install -q langgraph langchain-openai langchain-mcp-adapters python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Configure Environment Variables

You'll need:
- `BRIGHT_DATA_API_KEY`: Get from [Bright Data Dashboard](https://brightdata.com/cp/api_tokens)
- `ANTHROPIC_API_KEY`: Get from [Anthropic Console](https://console.anthropic.com/)

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

# Verify keys are set
assert os.getenv("BRIGHT_DATA_API_KEY"), "BRIGHT_DATA_API_KEY not set"
assert os.getenv("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY not set"

print("✓ Environment configured")

✓ Environment configured


### Suppress Verbose Warnings

In [3]:
import logging
import warnings

logging.getLogger().addFilter(
    lambda record: "Failed to validate notification" not in record.getMessage()
)
warnings.filterwarnings("ignore", message=".*Failed to validate notification.*")

print("✓ Logging configured")

✓ Logging configured


---

## Agent Implementation

### 1. Define Agent State

The state tracks:
- Research topic
- Target extraction schema
- Conversation messages
- Extracted information

In [4]:
import json
import asyncio
from dataclasses import dataclass, field
from typing import Any, Annotated, List, Optional

from langchain_core.messages import AIMessage, HumanMessage, BaseMessage
from langgraph.graph.message import add_messages

@dataclass
class AgentState:
    """State for the enrichment agent."""
    topic: str
    extraction_schema: dict[str, Any]
    messages: Annotated[List[BaseMessage], add_messages] = field(default_factory=list)
    info: Optional[dict[str, Any]] = None

print("✓ Agent state defined")

✓ Agent state defined


### 2. System Prompt

The system prompt tells the agent which tools it has and what workflow to follow.

In [5]:
SYSTEM_PROMPT = """You are a research agent. Your task is to gather information about a topic and extract structured data.

You have access to these tools:
- search_engine: Search the web for information (Google/Bing/Yandex)
- scrape_as_markdown: Get content from a specific URL with bot detection bypass
- web_data_* tools: Fast, reliable structured data extraction from major platforms
- submit_info: Call this when you have gathered all the required information

Research topic: {topic}

Required information schema:
{schema}

Search for relevant information, scrape important pages, then call submit_info with the extracted data."""

print("✓ System prompt configured")

✓ System prompt configured
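
The placeholders `{topic}` and `{schema}` are filled with `str.format`, with the schema serialized by `json.dumps`. The following is a minimal sketch of that step using a shortened stand-in template; `PROMPT_TEMPLATE`, `render_prompt`, and `example_schema` are illustrative names, not part of the notebook.

```python
import json

# Shortened stand-in for SYSTEM_PROMPT; the placeholder names match the original.
PROMPT_TEMPLATE = """Research topic: {topic}

Required information schema:
{schema}"""


def render_prompt(topic: str, schema: dict) -> str:
    """Fill the template with the topic and a pretty-printed JSON schema."""
    return PROMPT_TEMPLATE.format(topic=topic, schema=json.dumps(schema, indent=2))


# Illustrative schema, smaller than the one used in the demo
example_schema = {"type": "object", "properties": {"company_name": {"type": "string"}}}
rendered = render_prompt("Stripe payments company", example_schema)
print(rendered)
```

The braces inside the serialized schema are safe here because `str.format` only parses the template, not the substituted values.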


### 3. Create the Agent Graph

This builds the LangGraph workflow with:
- MCP client connection to Bright Data
- Claude LLM integration (through `ChatOpenAI` pointed at the Anthropic API base URL)
- Tool execution node
- Conditional routing logic

In [6]:
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
from langchain_mcp_adapters.client import MultiServerMCPClient

async def create_agent():
    """Create the enrichment agent graph."""

    # Configure MCP client
    client = MultiServerMCPClient({
        "bright_data": {
            "url": f"https://mcp.brightdata.com/sse?token={os.getenv('BRIGHT_DATA_API_KEY')}",
            "transport": "sse",
        }
    })

    # Get available tools from MCP
    tools = await client.get_tools()
    print(f"✓ Connected to Bright Data MCP ({len(tools)} tools available)")

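    # Initialize the model: ChatOpenAI is pointed at the Anthropic API base URL
    # and authenticated with the Anthropic key, so the OpenAI client talks to Claude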
    llm = ChatOpenAI(
        openai_api_key=os.getenv("ANTHROPIC_API_KEY"),
        openai_api_base="https://api.anthropic.com/v1",
        model_name="claude-sonnet-4-20250514"
    )

    async def call_model(state: AgentState) -> dict:
        """Call the LLM to decide next action."""
        prompt = SYSTEM_PROMPT.format(
            topic=state.topic,
            schema=json.dumps(state.extraction_schema, indent=2)
        )

        # Build messages: system prompt as first human message, then conversation
        messages = [HumanMessage(content=prompt)] + list(state.messages)

        # Create dynamic submit_info tool based on schema
        info_tool = {
            "name": "submit_info",
            "description": "Submit the extracted information when done researching. Call this with the structured data matching the required schema.",
            "parameters": state.extraction_schema,
        }

        # Bind all tools including the dynamic info tool
        model = llm.bind_tools(list(tools) + [info_tool])

        response = await model.ainvoke(messages)

        # Check if submitting info
        info = None
        if hasattr(response, 'tool_calls') and response.tool_calls:
            for tc in response.tool_calls:
                if tc["name"] == "submit_info":
                    info = tc["args"]
                    break

        return {"messages": [response], "info": info}

    def route(state: AgentState) -> str:
        """Route to next node based on last message."""
        # If we have extracted info, we're done
        if state.info:
            return "__end__"

        # Check the last message
        if not state.messages:
            return "agent"

        last_msg = state.messages[-1]

        if isinstance(last_msg, AIMessage) and hasattr(last_msg, 'tool_calls') and last_msg.tool_calls:
            # Check if it's a submit_info call
            for tc in last_msg.tool_calls:
                if tc["name"] == "submit_info":
                    return "__end__"
            # Otherwise, execute the tools
            return "tools"

        return "agent"

    # Build graph
    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", ToolNode(tools))
    graph.add_edge("__start__", "agent")
    graph.add_conditional_edges("agent", route)
    graph.add_edge("tools", "agent")

    return graph.compile()

print("✓ Agent graph builder defined")

✓ Agent graph builder defined
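
The decision order inside `route` can be hard to follow because it is mixed with message inspection. The sketch below mirrors the same order using plain values instead of LangChain message objects; `route_sketch` and its parameters are hypothetical names introduced only for illustration.

```python
from typing import Optional


def route_sketch(info: Optional[dict], last_tool_calls: Optional[list[str]]) -> str:
    """Mirror the decision order of route() with plain values.

    last_tool_calls holds the tool names requested by the latest AI message,
    or None/empty when the latest message requested no tools.
    """
    if info:
        return "__end__"  # extracted info already present
    if not last_tool_calls:
        return "agent"  # nothing to execute, ask the model again
    if "submit_info" in last_tool_calls:
        return "__end__"  # the model submitted its final answer
    return "tools"  # run the requested MCP tools


print(route_sketch(None, ["search_engine"]))
print(route_sketch(None, ["submit_info"]))
print(route_sketch(None, None))
```

One consequence visible in this form: a model reply with no tool calls routes back to the agent node, so the loop ends only on `submit_info` or when the graph's recursion limit is reached.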


### 4. Enrichment Function

A single async function builds the agent graph, runs it, and returns the extracted info.

In [7]:
async def enrich(topic: str, schema: dict) -> dict:
    """Run the enrichment agent."""
    agent = await create_agent()
    result = await agent.ainvoke({
        "topic": topic,
        "extraction_schema": schema,
    })
    # "info" may be present but None if the agent never called submit_info
    return result.get("info") or {}

print("✓ Enrichment function ready")

✓ Enrichment function ready


---

## Demo: Extract Company Information

### Define Extraction Schema

Specify exactly what information you want to extract, as a JSON Schema object.

In [8]:
company_schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "industry": {"type": "string"},
        "headquarters": {"type": "string"},
        "founded": {"type": "string"},
        "key_products": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["company_name", "industry"]
}

print("Schema defined:")
print(json.dumps(company_schema, indent=2))

Schema defined:
{
  "type": "object",
  "properties": {
    "company_name": {
      "type": "string"
    },
    "industry": {
      "type": "string"
    },
    "headquarters": {
      "type": "string"
    },
    "founded": {
      "type": "string"
    },
    "key_products": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": [
    "company_name",
    "industry"
  ]
}
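
The agent code returns whatever arguments the model passes to `submit_info`, so a light check against the schema can be useful before the result is used. The following is a minimal sketch that covers only `required` keys and top-level types; `check_against_schema`, `sample_schema`, and `sample` are hypothetical names, and the sample record is a made-up placeholder rather than real agent output.

```python
from typing import Any

# JSON Schema type names mapped to Python types (only the subset used in this notebook)
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_against_schema(data: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = _JSON_TYPES.get(spec.get("type"))
        if key in data and expected and not isinstance(data[key], expected):
            problems.append(f"{key}: expected {spec['type']}")
    return problems


# Small schema in the same shape as company_schema
sample_schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "industry": {"type": "string"},
        "key_products": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["company_name", "industry"],
}

# Made-up placeholder record, not real agent output
sample = {"company_name": "Example Corp", "key_products": "not a list"}
problems = check_against_schema(sample, sample_schema)
print(problems)
```

This does not descend into nested objects or array items; a full validator such as the `jsonschema` package would be a better fit for stricter checking.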


### Run the Agent

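Run the agent on a topic and print the extracted information. In the captured run below, the call failed inside `create_agent` while loading tools from the MCP server (`client.get_tools()`); the traceback output is truncated.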

In [9]:
result = await enrich("Stripe payments company", company_schema)

print("\n" + "="*60)
print("EXTRACTED INFORMATION")
print("="*60)
print(json.dumps(result, indent=2))

  + Exception Group Traceback (most recent call last):
  |   File "/home/meirk/BrightAI/Demos/workshop/.venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3699, in run_code
  |     await eval(code_obj, self.user_global_ns, self.user_ns)
  |   File "/tmp/ipykernel_2712937/319134285.py", line 1, in <module>
  |     result = await enrich("Stripe payments company", company_schema)
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/tmp/ipykernel_2712937/1868289780.py", line 3, in enrich
  |     agent = await create_agent()
  |             ^^^^^^^^^^^^^^^^^^^^
  |   File "/tmp/ipykernel_2712937/3043821972.py", line 18, in create_agent
  |     tools = await client.get_tools()
  |             ^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/meirk/BrightAI/Demos/workshop/.venv/lib/python3.11/site-packages/langchain_mcp_adapters/client.py", line 197, in get_tools
  |     tools_list = await asyncio.gather(*load_mcp_tool_tasks)
  |               
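
The traceback above is cut off, so the root cause of the `get_tools()` failure is not visible. If the failure turns out to be transient (for example a dropped connection), one option is to retry the call with exponential backoff. The sketch below is a generic helper under that assumption; `with_retries` and its parameters are hypothetical and not part of the notebook or of any library used here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await make_call(), retrying with exponential backoff on any Exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception as exc:  # ExceptionGroup is an Exception subclass too
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({type(exc).__name__}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    raise AssertionError("not reached")  # the loop always returns or raises
```

Inside `create_agent`, the tool-loading line could then read `tools = await with_retries(client.get_tools)`. Retrying will not help with a permanent problem such as an invalid token, so the final exception is re-raised unchanged.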

---

## Additional Examples

### Example 2: Competitor Analysis

In [None]:
competitor_schema = {
    "type": "object",
    "properties": {
        "competitors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "market_position": {"type": "string"},
                    "key_differentiator": {"type": "string"}
                }
            }
        }
    },
    "required": ["competitors"]
}

result = await enrich("Stripe competitors in payment processing", competitor_schema)

print("\nCompetitor Analysis:")
print(json.dumps(result, indent=2))

### Example 3: Product Feature Extraction

In [None]:
product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "category": {"type": "string"},
        "key_features": {"type": "array", "items": {"type": "string"}},
        "pricing_model": {"type": "string"},
        "target_audience": {"type": "string"}
    },
    "required": ["product_name", "category"]
}

result = await enrich("Claude AI by Anthropic", product_schema)

print("\nProduct Information:")
print(json.dumps(result, indent=2))

---

## Key Advantages

### 1. **Autonomous Research**
The agent independently decides what to search for and which pages to scrape.

### 2. **Schema-Driven Extraction**
Define your data structure once and get consistent JSON output.

### 3. **Enterprise-Grade Infrastructure**
- Bright Data's global proxy network
- Anti-bot detection bypass
- Geo-targeting capabilities
- 99.99% uptime SLA

### 4. **Production Ready**
- Asynchronous execution
- Error handling built-in
- Scalable architecture
- MCP standardization

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│                   LangGraph Agent                   │
│  ┌──────────────┐         ┌──────────────────────┐ │
│  │ Claude LLM   │◄────────┤  Agent Reasoning     │ │
│  │ (Sonnet 4)   │         │  - Plan research     │ │
│  └──────┬───────┘         │  - Choose tools      │ │
│         │                 │  - Extract data      │ │
│         ▼                 └──────────────────────┘ │
│  ┌──────────────┐                                  │
│  │  Tool Node   │                                  │
│  └──────┬───────┘                                  │
└─────────┼───────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────┐
│         Bright Data MCP (Remote SSE)                │
│  ┌──────────────┐  ┌─────────────┐  ┌───────────┐  │
│  │ search_engine│  │ scrape_as_  │  │ web_data_ │  │
│  │              │  │ markdown    │  │ * tools   │  │
│  └──────────────┘  └─────────────┘  └───────────┘  │
└─────────────────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────┐
│          Bright Data Infrastructure                 │
│  • 72M+ residential IPs                             │
│  • Global geo-targeting                             │
│  • Automatic retries & rotation                     │
│  • CAPTCHA solving                                  │
└─────────────────────────────────────────────────────┘
```

---

## Next Steps

- **Scale**: Process batches of topics concurrently
- **Customize**: Add domain-specific extraction logic
- **Integrate**: Connect to your data pipeline
- **Extend**: Add more MCP tools for LinkedIn, GitHub, etc.
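
As a starting point for the **Scale** item, batches can be run concurrently with `asyncio.gather` and a semaphore that caps the number of in-flight runs. This is a minimal sketch: `enrich_many` and `enrich_stub` are hypothetical names, and the stub stands in for `enrich` so the example runs without network access.

```python
import asyncio


async def enrich_stub(topic: str, schema: dict) -> dict:
    """Hypothetical stand-in for enrich(); returns a placeholder without network calls."""
    await asyncio.sleep(0)
    return {"topic": topic}


async def enrich_many(topics, schema, enrich_fn=enrich_stub, max_concurrency=3):
    """Run enrich_fn over topics with at most max_concurrency runs in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(topic):
        async with semaphore:
            return await enrich_fn(topic, schema)

    # return_exceptions=True returns failures as values instead of raising the first one
    results = await asyncio.gather(*(run_one(t) for t in topics), return_exceptions=True)
    return dict(zip(topics, results))
```

In the notebook this would be called as `await enrich_many(topics, company_schema, enrich_fn=enrich)`. Note that `enrich` as written builds a new agent and MCP connection on every call, so sharing one compiled agent across a batch may be worth considering.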

---

## Resources

- [Bright Data MCP Documentation](https://docs.brightdata.com/mcp)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [Claude API Documentation](https://docs.anthropic.com/)

---