# 🤖 TavilyCrawl Tutorial: Intelligent Documentation Discovery

> **📚 Part of the LangChain Course: Building AI Agents & RAG Apps**  
> [🎓 Get the full course](https://www.udemy.com/course/langchain/?referralCode=D981B8213164A3EA91AC)

This notebook demonstrates the power of **TavilyCrawl** - Tavily's most intelligent web crawling tool that:
- 🕷️ **Crawls intelligently** using graph-based traversal with parallel exploration
- 🧠 **Understands instructions** and filters content based on your specific needs
- 🎯 **Finds exactly what you need** without manual filtering or post-processing
- ⚡ **Processes hundreds of pages** efficiently with built-in content extraction

**Perfect for targeted documentation discovery, research, and intelligent content collection!** 🚀

---

## 📦 Setup & Installation

First, let's install the required packages and set up our environment.


In [None]:
# Install required packages
%pip install langchain-tavily certifi

# For pretty printing and visualization
%pip install rich pandas


In [None]:
import os

import certifi
from langchain_tavily import TavilyCrawl
from rich.console import Console
from rich.panel import Panel
from rich.table import Table

# Point Python's HTTP clients at certifi's CA bundle to avoid SSL certificate errors
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

# Initialize rich console for pretty printing
console = Console()

print("✅ All imports successful!")


## 🔑 API Key Setup

You'll need a Tavily API key to use TavilyCrawl. Get yours at [tavily.com](https://app.tavily.com/home).

Set it as the `TAVILY_API_KEY` environment variable before initializing TavilyCrawl.


In [None]:
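# Use TAVILY_API_KEY if it is already set; otherwise prompt for it.
# Avoid hardcoding a real key in a notebook you might share or commit.
from getpass import getpass

if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass("Enter your Tavily API key: ")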


## 🤖 What is TavilyCrawl?

TavilyCrawl is Tavily's most advanced crawling tool that goes beyond simple URL discovery. Here's what makes it special:

### 🧠 **Intelligent Understanding**
- Takes natural language instructions
- Understands context and intent
- Filters content based on your specific needs

### 🕷️ **Advanced Crawling**
- Graph-based website traversal
- Parallel exploration of hundreds of paths
- Built-in content extraction and cleaning

### 🎯 **Smart Filtering**
- Only returns relevant content
- Eliminates manual post-processing
- Saves time and API calls
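
To make the "graph-based website traversal" bullet above a little more concrete, here is a minimal sketch of a depth-limited breadth-first walk over a toy, in-memory link graph. It only illustrates the general idea behind a depth limit such as the `max_depth` parameter used in the demos; it is not Tavily's actual implementation, and the `LINKS` dictionary and `crawl_bfs` helper are hypothetical names made up for this example.

```python
from collections import deque

# Hypothetical site structure: page -> pages it links to
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/agents", "/docs/chains"],
    "/docs/agents": ["/docs/agents/tools"],
    "/docs/chains": [],
    "/docs/agents/tools": [],
    "/blog": ["/"],  # links back to the root, forming a cycle
}


def crawl_bfs(start, max_depth):
    """Return pages reachable from `start` within `max_depth` link hops."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for linked_page in LINKS.get(page, []):
            if linked_page not in seen:  # skip pages we already visited
                seen.add(linked_page)
                order.append(linked_page)
                queue.append((linked_page, depth + 1))
    return order


print(crawl_bfs("/", max_depth=1))
print(crawl_bfs("/", max_depth=3))
```

A larger depth limit reaches pages further from the start URL, which is why deeper crawls tend to return more pages and take longer.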

Let's see it in action!


In [None]:
# Initialize TavilyCrawl
tavily_crawl = TavilyCrawl()

print("✅ TavilyCrawl initialized successfully!")


## 🎯 Demo 1: TavilyCrawl Without Instructions

Let's first see what happens when we use TavilyCrawl without any specific instructions. This will show us the baseline behavior.


In [None]:
# Basic TavilyCrawl without instructions
target_url = "https://python.langchain.com/"

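# rich renders [bold]...[/bold] markup; Markdown-style ** would be printed literally
console.print(Panel.fit(
    f"🎯 [bold]Target[/bold]: {target_url}\n📋 [bold]Instructions[/bold]: None (baseline crawl)",
    title="Basic Crawl Configuration",
    border_style="yellow"
))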

console.print("🔄 Running TavilyCrawl without instructions...", style="bold yellow")

# Basic crawl without instructions
basic_result = tavily_crawl.invoke({
    "url": target_url,
    "max_depth": 3,
    "extract_depth": "advanced"
})

basic_results = basic_result.get("results", [])
console.print(f"✅ Basic crawl completed! Found {len(basic_results)} pages", style="bold green")

# Show what we got without instructions
console.print(f"\n📊 Without Instructions: TavilyCrawl returned {len(basic_results)} pages", style="bold yellow")
console.print("   📄 Mix of all content types (chains, prompts, agents, guides, etc.)")
console.print("   🔍 No filtering - everything from the crawled sections")
console.print("   ⚠️  Requires manual work to find what you actually need")
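
One quick way to see how mixed an unfiltered crawl is: count the returned pages by the first segment of their URL path. The sketch below assumes each result is a dictionary with a `url` key, as in the cells above; `count_by_section` and the `sample_results` data are hypothetical and stand in for `basic_results`.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(results):
    """Count crawled pages by the first path segment of their URL."""
    counts = Counter()
    for item in results:
        path = urlparse(item.get("url", "")).path
        segments = [segment for segment in path.split("/") if segment]
        counts[segments[0] if segments else "(root)"] += 1
    return counts


# Hypothetical sample shaped like the entries in `basic_results`
sample_results = [
    {"url": "https://example.com/docs/agents", "raw_content": "..."},
    {"url": "https://example.com/docs/chains", "raw_content": "..."},
    {"url": "https://example.com/api/reference", "raw_content": "..."},
    {"url": "https://example.com/", "raw_content": "..."},
]

for section, page_count in count_by_section(sample_results).most_common():
    print(f"{section}: {page_count}")
```

In the notebook you could call `count_by_section(basic_results)` instead of using the sample data.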


In [None]:
# Let's look at what the basic crawl found
from rich.markup import escape  # keeps "[...]" in page text from being parsed as rich markup

console.print("\n📋 Sample Results from Basic Crawl (no filtering):\n", style="bold yellow")

for i, result in enumerate(basic_results[:3], 1):  # Show first 3 results
    title = result.get("title") or "Untitled"
    url = result.get("url") or "No URL"
    # raw_content may be missing or None, so fall back before slicing
    content = (result.get("raw_content") or "No content")[:150] + "..."

    panel_content = f"""
🔗 [bold]URL[/bold]: {escape(url)}

📖 [bold]Content Preview[/bold]:
{escape(content)}
    """.strip()

    console.print(Panel(
        panel_content,
        title=f"📄 {i}. {escape(title)}",
        border_style="yellow"
    ))
    print()

# Only mention remaining results when there are more than the three shown
remaining = max(len(basic_results) - 3, 0)
if remaining:
    console.print(f"... and {remaining} more mixed results", style="italic yellow")
console.print("🔍 Notice: These are mixed content types - not specifically about agents!", style="bold yellow")
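
For contrast, the "manual work" that an unfiltered crawl leaves you with might look like the keyword filter below. This is a deliberately naive sketch, assuming results are dictionaries with `url` and `raw_content` keys; `filter_by_keyword` and `sample_results` are hypothetical names used only for illustration.

```python
def filter_by_keyword(results, keyword):
    """Keep results whose URL or text mentions `keyword` (case-insensitive)."""
    needle = keyword.lower()
    kept = []
    for item in results:
        # raw_content can be None, so fall back to an empty string
        haystack = f"{item.get('url', '')} {item.get('raw_content') or ''}".lower()
        if needle in haystack:
            kept.append(item)
    return kept


# Hypothetical sample shaped like the entries in `basic_results`
sample_results = [
    {"url": "https://example.com/docs/agents", "raw_content": "How to build an agent."},
    {"url": "https://example.com/docs/prompts", "raw_content": "Prompt templates."},
    {"url": "https://example.com/docs/tools", "raw_content": "Tools are called by Agents."},
    {"url": "https://example.com/docs/misc", "raw_content": None},
]

agent_pages = filter_by_keyword(sample_results, "agent")
print([page["url"] for page in agent_pages])
```

Plain substring matching is brittle: it keeps pages that mention the word only in passing and misses pages that cover the topic under a different name.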


## 🎯 Demo 2: TavilyCrawl WITH Instructions - The Magic!

Now let's see the real power of TavilyCrawl! By adding simple instructions, we can get exactly what we need without any manual filtering.

### The Problem We Just Saw:
- Demo 1 returned everything (all content types mixed together)
- We'd need to manually filter through all results
- No way to target specific content

### The TavilyCrawl Solution:
- Add simple natural language instructions
- Let AI do the intelligent filtering
- Get exactly what we need automatically!


In [None]:
# Now let's add intelligent instructions to the same target
instructions = "Find and extract content from pages specifically related to agents"

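# rich renders [bold]...[/bold] markup; Markdown-style ** would be printed literally
console.print(Panel.fit(
    f"🎯 [bold]Target[/bold]: {target_url} (same as Demo 1)\n📋 [bold]Instructions[/bold]: {instructions}",
    title="Intelligent Crawl Configuration",
    border_style="green"
))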


In [None]:
# Execute the intelligent crawl - this demonstrates TavilyCrawl's power!
console.print("🚀 Starting intelligent crawl with instructions...", style="bold blue")
console.print("Watch how instructions transform the results!", style="italic")

# Use TavilyCrawl with instructions
crawl_result = tavily_crawl.invoke({
    "url": target_url,
    "instructions": instructions,
    "max_depth": 3,
    "extract_depth": "advanced"
})

console.print("✅ Intelligent crawl completed successfully!", style="bold green")

# Show the power of intelligent filtering
results = crawl_result.get("results", [])

# Compare with previous demo
console.print("\n🎯 The Magic of Instructions:", style="bold blue")
console.print(f"   📊 Demo 1 (no instructions): {len(basic_results)} mixed pages")
console.print(f"   🎯 Demo 2 (with instructions): {len(results)} targeted agent pages")

# The two crawls can return different pages, so only report a reduction when there is one
page_difference = len(basic_results) - len(results)
if page_difference > 0:
    console.print(f"   ⚡ Instructions returned {page_difference} fewer pages than the baseline crawl", style="bold green")
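
Page counts alone do not show which pages changed between the two crawls, and the instructed crawl is not guaranteed to return a strict subset of the baseline. A small sketch that compares the two result lists by URL is shown below; `compare_crawls` and the sample lists are hypothetical and stand in for `basic_results` and `results`.

```python
def compare_crawls(baseline, targeted):
    """Compare two crawl result lists by URL."""
    base_urls = {item["url"] for item in baseline if item.get("url")}
    target_urls = {item["url"] for item in targeted if item.get("url")}
    return {
        "shared": sorted(base_urls & target_urls),
        "only_baseline": sorted(base_urls - target_urls),
        "only_targeted": sorted(target_urls - base_urls),
    }


# Hypothetical samples shaped like `basic_results` and `results`
sample_baseline = [
    {"url": "https://example.com/docs/agents"},
    {"url": "https://example.com/docs/chains"},
    {"url": "https://example.com/docs/prompts"},
]
sample_targeted = [
    {"url": "https://example.com/docs/agents"},
    {"url": "https://example.com/docs/agents/tools"},
]

comparison = compare_crawls(sample_baseline, sample_targeted)
for group, urls in comparison.items():
    print(f"{group}: {urls}")
```

In the notebook, `compare_crawls(basic_results, results)` would show which baseline pages the instructions dropped and whether the instructed crawl surfaced any pages the baseline missed.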


In [None]:
# Display the agent documentation TavilyCrawl found
from rich.markup import escape  # keeps "[...]" in page text from being parsed as rich markup

console.print("\n🎯 LangChain Agent Documentation Found by TavilyCrawl:\n", style="bold green")

for i, result in enumerate(results, 1):
    title = result.get("title") or "Untitled"
    url = result.get("url") or "No URL"
    # raw_content may be missing or None, so fall back before slicing
    content = (result.get("raw_content") or "No content")[:200] + "..."

    panel_content = f"""
🔗 [bold]URL[/bold]: {escape(url)}

📖 [bold]Content Preview[/bold]:
{escape(content)}
    """.strip()

    console.print(Panel(
        panel_content,
        title=f"📑 {i}. {escape(title)}",
        border_style="green"
    ))
    print()

# Show the comparison table
comparison_table = Table(title="⚡ Crawling Approaches Comparison")
comparison_table.add_column("Approach", style="cyan", no_wrap=True)
comparison_table.add_column("Time Required", style="yellow")
comparison_table.add_column("Expertise Needed", style="red")
comparison_table.add_column("Flexibility", style="blue")
comparison_table.add_column("Precision", style="green")

comparison_table.add_row(
    "Manual Crawling",
    "3-5 hours",
    "High (coding, filtering)",
    "Low",
    "Variable"
)

comparison_table.add_row(
    "TavilyMap + Extract",
    "30-60 minutes",
    "High (caveats, complexity)",
    "High",
    "Good (with expertise)"
)

comparison_table.add_row(
    "🤖 TavilyCrawl",
    "2-5 minutes",
    "Minimal (just instructions)",
    "High",
    "High (AI-powered)"
)

console.print(comparison_table)

console.print("\n🎉 Why TavilyCrawl is the Best of Both Worlds:", style="bold magenta")
console.print("   🚀 [bold]Speed[/bold]: roughly 60-100x faster than manual crawling (per the estimates in the table above)")
console.print("   🧠 [bold]Simplicity[/bold]: No expertise needed - just natural language instructions")
console.print("   🎯 [bold]Flexibility[/bold]: High flexibility without the complexity of Map+Extract")
console.print("   ⚡ [bold]Precision[/bold]: AI-powered filtering without manual post-processing")
console.print("   🔧 [bold]Ready-to-Use[/bold]: Immediate integration into your applications")

console.print("\n💡 TavilyCrawl = TavilyMap + TavilyExtract + AI, minus the complexity!", style="bold blue")


## 🎉 Conclusion: TavilyCrawl - The Best of Both Worlds

This tutorial demonstrated how **TavilyCrawl** combines the flexibility of TavilyMap + TavilyExtract with the simplicity of natural language instructions:

### 🧠 **The Evolution of Web Crawling**:

1. **Manual Crawling**: Slow, requires coding expertise, variable results
2. **TavilyMap + TavilyExtract**: Powerful but complex, lots of caveats, needs expertise
3. **🤖 TavilyCrawl**: **Best of both worlds** - flexible, intelligent, and simple!

### 🚀 **Why TavilyCrawl Wins**:

- **🎯 Combines Power & Simplicity**: All the flexibility without the complexity
- **🧠 Natural Language Instructions**: No need to learn APIs or handle caveats
- **⚡ Instant Intelligence**: AI does the heavy lifting automatically
- **🔧 Zero Expertise Required**: Just describe what you want in plain English
- **📈 Production Ready**: Immediate integration with consistent results

### 🎯 **Perfect For**:

- Building intelligent RAG systems without crawling expertise
- Rapid prototyping of documentation discovery systems
- Production applications that need reliable, filtered content
- Anyone who wants TavilyMap + TavilyExtract power without the complexity
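
As a rough idea of how crawl output could feed the RAG use case listed above, the sketch below splits each page's text into overlapping character chunks and keeps the source URL with every chunk. It assumes results are dictionaries with `url` and `raw_content` keys, as in the demos; `chunk_text`, `results_to_chunks`, and the chunk sizes are hypothetical choices for illustration, not part of TavilyCrawl.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split `text` into chunks of up to `chunk_size` characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def results_to_chunks(results, chunk_size=200, overlap=50):
    """Turn crawl results into chunk records that remember their source URL."""
    records = []
    for item in results:
        text = item.get("raw_content") or ""  # skip pages with no extracted text
        for chunk in chunk_text(text, chunk_size, overlap):
            records.append({"source": item.get("url", ""), "text": chunk})
    return records


# Hypothetical sample shaped like the entries in `results`
sample_results = [
    {"url": "https://example.com/docs/agents", "raw_content": "x" * 500},
    {"url": "https://example.com/docs/empty", "raw_content": None},
]

records = results_to_chunks(sample_results)
print(len(records), "chunks from", len(sample_results), "pages")
```

The resulting records could then be embedded and stored in whichever vector store your RAG pipeline uses; a token-aware or structure-aware splitter would usually be a better choice than fixed character windows.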

---

**TavilyCrawl = TavilyMap + TavilyExtract + AI, minus the complexity!** 🤖✨
