# Data Extraction Agent - Interactive Notebook

This notebook demonstrates the Data Extraction Agent built on LangChain's Deep Agents framework.

## Features
- **Multi-Provider LLM Support**: Groq, Together AI, OpenRouter, HuggingFace, Google, Anthropic
- **Open-Source First**: Prioritizes free/open-source models
- **Data Sources**: Web, APIs, Databases (SQL/MongoDB), Files (CSV, Excel, JSON, PDF)
- **Built on Deep Agents**: Planning, file system, sub-agent delegation

## 1. Setup & Configuration

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Check available providers
from data_extraction_agent.providers import list_available_providers

available = list_available_providers()
print("Available LLM Providers:")
for provider, models in available.items():
    print(f"  - {provider}: {len(models)} models")

if not available:
    print("\n⚠️ No providers configured!")
    print("Set at least one API key in your .env file:")
    print("  - GROQ_API_KEY (free at console.groq.com)")
    print("  - TOGETHER_API_KEY")
    print("  - OPENROUTER_API_KEY")
    print("  - GOOGLE_API_KEY")
    print("  - ANTHROPIC_API_KEY")
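
Whether a provider shows up in the list above depends on which API keys are present in the environment. When debugging a `.env` file, a stdlib-only check like the following may help. It is a minimal sketch: the helper `configured_keys` is hypothetical and not part of `data_extraction_agent`, and the variable names are simply the ones printed in the hint above.

```python
import os

# Environment variable names taken from the hints printed in the setup cell.
PROVIDER_KEYS = {
    "groq": "GROQ_API_KEY",
    "together": "TOGETHER_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "google": "GOOGLE_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_keys(env=None):
    """Return provider names whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(configured_keys({"GROQ_API_KEY": "placeholder", "GOOGLE_API_KEY": ""}))
```

Calling `configured_keys()` with no argument inspects the real environment after `load_dotenv()` has run.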

## 2. Create the Agent

The agent automatically selects the best available open-source model.

In [None]:
from data_extraction_agent import create_data_extraction_agent
from data_extraction_agent.providers import RoutingStrategy

# Create an agent that routes to open-source models only
agent = create_data_extraction_agent(
    strategy=RoutingStrategy.OPEN_SOURCE_ONLY
)

print("✅ Agent created successfully!")

## 3. Example: Extract Data from a Public API

Let's extract product data from the Fake Store API, a public mock e-commerce REST API.

In [None]:
# Define extraction request
request = """
Extract all product information from this API:
https://fakestoreapi.com/products

I need:
- Product ID
- Title
- Price
- Category
- Rating score

Return as clean JSON.
"""

print(f"Request: {request}")

In [None]:
# Run the extraction
result = await agent.ainvoke({
    "messages": [{"role": "user", "content": request}]
})

# Display results
for msg in result.get("messages", []):
    if hasattr(msg, "content"):
        print(f"\n[{type(msg).__name__}]")
        print(msg.content[:2000])  # First 2000 chars
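
If the agent returns full product records rather than the trimmed fields, the five requested fields can be projected out deterministically. The sketch below assumes each record follows the Fake Store API shape, in which the rating score is nested as `rating.rate`. The helper `project_product` and the sample record are illustrative only, not real API output.

```python
import json


def project_product(product):
    """Keep only the fields named in the request; the score is rating['rate']."""
    return {
        "id": product.get("id"),
        "title": product.get("title"),
        "price": product.get("price"),
        "category": product.get("category"),
        "rating_score": (product.get("rating") or {}).get("rate"),
    }


# Made-up record for illustration only.
sample = {
    "id": 1,
    "title": "Example backpack",
    "price": 19.99,
    "description": "placeholder",
    "category": "bags",
    "rating": {"rate": 4.2, "count": 10},
}

print(json.dumps(project_product(sample), indent=2))
```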

## 4. Example: Web Search and Data Extraction

Search the web and extract structured data.

In [None]:
# Web search extraction
web_request = """
Search for the top 5 Python data extraction libraries.

For each library, extract:
- Name
- GitHub URL
- Main features
- Use cases

Return as a structured JSON report.
"""

result = await agent.ainvoke({
    "messages": [{"role": "user", "content": web_request}]
})

for msg in result.get("messages", []):
    if hasattr(msg, "content"):
        print(msg.content[:3000])
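
Model replies often wrap the requested JSON in prose or a Markdown fence, so printing the raw content is not enough for downstream use. One possible approach, sketched here with only the standard library, is to scan for the first bracket that starts a valid JSON value. The helper name `extract_json` and the sample reply are hypothetical.

```python
import json


def extract_json(text):
    """Return the first JSON object or array embedded in text, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "{[":
            try:
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # Not valid JSON from this bracket; try the next one.
    return None


# Illustrative reply text, not real agent output.
reply = 'Here is the report:\n```json\n[{"name": "example-lib"}]\n```'
print(extract_json(reply))
```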

## 5. Using Individual Tools

You can also use the extraction tools directly.

In [None]:
from data_extraction_agent.tools import (
    call_api,
    web_search,
    analyze_schema,
    transform_data
)

# Direct API call
api_result = call_api.invoke({
    "url": "https://jsonplaceholder.typicode.com/users",
    "method": "GET"
})

print("API Result:")
print(api_result[:1500])

In [None]:
# Web search
search_result = web_search.invoke({
    "query": "best practices data extraction python 2024"
})

print("Search Results:")
print(search_result[:2000])

## 6. Provider Selection

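Create agents bound to a specific provider. In the cell below, each factory is expected to raise `ValueError` when its provider is not configured.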

In [None]:
from data_extraction_agent.agent import (
    create_groq_agent,
    create_together_agent,
    create_lightweight_agent,
    create_quality_agent
)

# Groq agent (free, ultra-fast)
try:
    groq_agent = create_groq_agent()
    print("✅ Groq agent created")
except ValueError as e:
    print(f"❌ Groq not available: {e}")

# Together agent (budget)
try:
    together_agent = create_together_agent()
    print("✅ Together agent created")
except ValueError as e:
    print(f"❌ Together not available: {e}")

# Lightweight agent (fastest available)
try:
    light_agent = create_lightweight_agent()
    print("✅ Lightweight agent created")
except ValueError as e:
    print(f"❌ No providers available: {e}")

## 7. Fallback Chain Demo

The fallback chain automatically retries with alternative providers if one fails.

In [None]:
from data_extraction_agent.providers import FallbackChain

# Create a budget fallback chain
chain = FallbackChain.budget()

# Test it
result = chain.invoke("Extract the email from: Contact us at support@example.com for help.")

print(f"Success: {result.success}")
print(f"Provider: {result.provider}")
print(f"Model: {result.model_id}")
print(f"Attempts: {result.attempts}")
print(f"Time: {result.total_time_ms:.0f}ms")
print(f"\nResponse:\n{result.content}")

## 8. Stream Agent Output

Stream the agent's responses in real time.

In [None]:
# Streaming extraction
stream_request = """
Extract contact information from this text:

"Our team includes John Smith (john@acme.com, 555-1234) 
and Jane Doe (jane@acme.com, 555-5678). 
Visit us at 123 Main St, NYC."

Return as JSON with fields: name, email, phone, address.
"""

print("Streaming extraction...\n")

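# Assumes the agent is a compiled LangGraph graph: with stream_mode="values",
# each event is the full state, so the newest message is the last in "messages".
async for event in agent.astream(
    {"messages": [{"role": "user", "content": stream_request}]},
    stream_mode="values",
):
    msg = event["messages"][-1]
    # Tool calls are carried on the AI message, not as a top-level event key.
    for tc in getattr(msg, "tool_calls", None) or []:
        print(f"[Tool] {tc.get('name', 'unknown')}")
    if getattr(msg, "content", None):
        print(f"[{type(msg).__name__}] {str(msg.content)[:500]}")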
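
For simple patterns such as the emails and phone numbers in this request, a regex baseline can serve as a quick sanity check on the agent's JSON. The sketch below reuses the sample text from the request. The patterns are deliberately narrow assumptions (seven-digit `NNN-NNNN` phone numbers only) and are not part of the agent.

```python
import re

# Same sample text as in the streaming request above.
sample_text = (
    "Our team includes John Smith (john@acme.com, 555-1234) "
    "and Jane Doe (jane@acme.com, 555-5678). "
    "Visit us at 123 Main St, NYC."
)

# Narrow, illustrative patterns; real-world data needs broader ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")

emails = EMAIL_RE.findall(sample_text)
phones = PHONE_RE.findall(sample_text)

print(emails)
print(phones)
```

Comparing these lists against the fields in the agent's output is a cheap way to spot dropped or hallucinated contacts.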

## 9. Next Steps

- **Deploy**: Run `langgraph dev` to start the LangGraph server
- **UI**: Connect to [Deep Agents UI](https://github.com/langchain-ai/deep-agents-ui)
- **Customize**: Add your own tools in `tools.py`
- **Monitor**: Enable LangSmith tracing for debugging
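
For the monitoring step, LangSmith tracing is normally switched on through environment variables. A minimal sketch, assuming the `LANGSMITH_TRACING` and `LANGSMITH_API_KEY` variables described in the LangSmith documentation; the variables can equally be placed in the `.env` file loaded in the setup cell.

```python
import os

# Assumption: LangSmith reads these variables; set them before the agent runs.
os.environ["LANGSMITH_TRACING"] = "true"

if not os.environ.get("LANGSMITH_API_KEY"):
    print("LANGSMITH_API_KEY is not set; traces will not be uploaded.")
```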