# **AI Research Assistant**

An AI assistant that discovers, filters, and analyzes web content using Crawl4AI's URL Seeder to:

* Discover all available URLs without crawling them first
* Score and rank them by relevance using AI
* Crawl only the most relevant content
* Generate research insights with proper citations

**About the research assistant**:

A smart research assistant that:

1. Takes any research query (e.g., knowledge graphs)
2. Discovers relevant articles from news sites
3. Ranks them by relevance using BM25 scoring
4. Crawls only the top-ranked articles
5. Synthesizes findings into a comprehensive report
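
Step 3 relies on BM25, which rewards documents that contain rare query terms and discounts long documents. The sketch below is a from-scratch illustration of the Okapi BM25 formula over whitespace tokens, not Crawl4AI's implementation; `bm25_scores`, the `k1` and `b` defaults, and the sample titles are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are discounted
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = [
    "Knowledge graphs power new search features",
    "Football transfer news roundup",
    "Why knowledge management matters",
]
scores = bm25_scores("knowledge graphs", titles)
for score, title in sorted(zip(scores, titles), reverse=True):
    print(f"{score:.3f}  {title}")
```

The title that contains both query terms ranks first, the one that shares only "knowledge" ranks second, and the unrelated title scores zero. A threshold such as `score_threshold` would then drop the weakest candidates before any page is crawled.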
   

## Pipeline Overview

User Query -> Query Enhancement -> URL Discovery -> Relevance Scoring -> Smart Crawling -> AI Synthesis -> Research Report

In [None]:
import asyncio
import json
import os
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path

# Rich for beautiful console output
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.progress import Progress, SpinnerColumn, TextColumn

# Crawl4AI imports for intelligent crawling
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    AsyncUrlSeeder,
    SeedingConfig,
    AsyncLogger,
    PruningContentFilter,
    DefaultMarkdownGenerator
)


# LiteLLM for AI capabilities
import litellm

# Initialize Rich console for pretty outputs
console=Console()

print("Environment ready :) All dependencies loaded successfully.")

Environment ready :) All dependencies loaded successfully.


## Step 1: Configuration and Data Classes

Here we define the research pipeline configuration. These dataclasses act as our control center, allowing us to fine-tune every aspect of the research process. Think of them as the settings panel for the research assistant, from discovery limits to AI model choices.


In [14]:
@dataclass
class ResearchConfig:
    """
    Configuration for the research pipeline

    This class controls every aspect of our research assistant:
    - How many URLs to discover and crawl
    - Which scoring methods to use
    - Whether to use AI enhancement
    - Output preferences
    """

    # Core settings
    domain: str = "www.bbc.com/sport"
    max_urls_discovery: int = 100       # Cast a wide net initially
    max_urls_to_crawl: int = 10         # But only crawl the best
    top_k_urls: int = 10                # Focus on top results

    # Scoring and filtering
    score_threshold: float = 0.3        # Minimum relevance score
    scoring_method: str = "bm25"        # BM25 is great for relevance

    # AI and processing
    use_llm_enhancement: bool = True    # Enhance queries with AI
    llm_model: str = "openai/gpt-4o-mini"  # Fast and capable

    # URL discovery options
    extract_head_metada: bool = False   # Get titles, descriptions
    live_check: bool = True             # Verify URLs are accessible
    force_refresh: bool = True          # Bypass cache

    # Crawler settings
    max_concurrent_crawls: int = 5      # Parallel crawling
    timeout: int = 30000                # 30-second timeout, in milliseconds
    headless: bool = True               # No browser window

    # Output settings
    output_dir: Path = Path("research_results")
    verbose: bool = True

@dataclass
class ResearchQuery:
    """Container for research query and metadata """
    original_query: str
    enhanced_query: Optional[str] = None
    search_patterns: Optional[List[str]] = None
    timestamp: Optional[str] = None
    
@dataclass
class ResearchResult:
    """Container for research results"""
    query: ResearchQuery
    discovered_urls: List[Dict]
    crawled_content: List[Dict]
    synthesis: str
    citations: List[Dict]
    metadata: Dict

# Create default configuration
config= ResearchConfig()
console.print(Panel(
    f"[bold cyan]Research Configuration[/bold cyan]\n"
    f" Domain: {config.domain}\n"
    f" Max Discovery: {config.max_urls_discovery}\n"
    f" Max Crawl: {config.max_urls_to_crawl}\n"
    f" AI Model: {config.llm_model}",
    title="Settings"
))
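
Because the settings live in a dataclass, per-run overrides and serialization need little extra machinery. The sketch below uses a cut-down, hypothetical `MiniConfig` as a stand-in for `ResearchConfig` so that it runs on its own; the override values are arbitrary.

```python
import json
from dataclasses import dataclass, asdict, replace
from pathlib import Path


@dataclass
class MiniConfig:
    """Hypothetical cut-down stand-in for ResearchConfig."""
    domain: str = "www.bbc.com/sport"
    max_urls_to_crawl: int = 10
    score_threshold: float = 0.3
    output_dir: Path = Path("research_results")


base = MiniConfig()
# replace() returns a modified copy and leaves the base config untouched
strict = replace(base, max_urls_to_crawl=3, score_threshold=0.5)

# Path is not JSON-serializable, so fall back to str() for such values
config_json = json.dumps(asdict(strict), default=str, indent=2)
print(config_json)
```

Saving the serialized configuration next to each report is one way to keep a record of which settings produced which results.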

    

## Step 2: Query Enhancement with AI

Not all search queries are created equal. Here we use AI to transform simple queries into comprehensive search strategies. The LLM analyzes your query, extracts key concepts, and generates related terms, turning "football news" into a rich set of search patterns.
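
To make the idea of search patterns concrete, here is a small sketch of how extracted terms could become wildcard patterns and be matched against URLs. It uses the standard library's `fnmatch` module purely for illustration; how Crawl4AI's URL seeder interprets patterns is not shown here, and `terms_to_patterns` and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def terms_to_patterns(terms: List[str], limit: int = 10) -> List[str]:
    """Turn terms into de-duplicated '*term*' wildcard patterns (hypothetical helper)."""
    patterns: List[str] = []
    for term in terms:
        # URLs usually use hyphens where queries use spaces
        pattern = f"*{term.lower().strip().replace(' ', '-')}*"
        if pattern not in patterns:
            patterns.append(pattern)
    return patterns[:limit]


patterns = terms_to_patterns(["Football", "Premier League", "football"])
urls = [
    "https://example.com/sport/football/transfers",
    "https://example.com/sport/premier-league-table",
    "https://example.com/sport/tennis/results",
]
# Keep a URL if any pattern matches it (case-insensitive via lower())
matches = [u for u in urls if any(fnmatchcase(u.lower(), p) for p in patterns)]
print(patterns)
print(matches)
```

Capping the list with `limit` keeps a verbose LLM reply from producing dozens of near-duplicate patterns.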

In [16]:
import os
from dotenv import load_dotenv

load_dotenv()

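# load_dotenv() has already copied OPENAI_API_KEY from .env into os.environ.
# Fail early with a clear message if the key is still missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")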

In [21]:
async def enhance_query_with_llm(query: str,config:ResearchConfig) -> ResearchQuery:
    """
    Transform simple queries into comprehensive search strategies
    Why enhance queries?
    - Users often use simple terms ("football news")
    - But relevant content might use varied terminology
    - AI helps capture all relevant variations    
    """
    
    console.print(f"\n[cyan] Enhancing query: '{query}'...[/cyan]")
    try:
        # Ask AI to analyze and expand the query
        response = await litellm.acompletion(
            model=config.llm_model,
            messages=[{
                "role":"user",
                "content":f"""Given this research query: "{query}"
                Extract:
                1. Key terms and concepts (as a list)
                2. Related search terms
                3. A more specific/enhanced version of the query
                
                Return as JSON:
                {{
                    "key_terms":["term1","term2"],
                    "related_terms": ["related1","related2"],
                    "enhanced_query": "enhanced version of query"
                }}
                
               """
            }],
            temperature=0.3,  # Low temperature for consistency
            response_format={"type":"json_object"}
        )
        
        data=json.loads(response.choices[0].message.content)
        
        # Create search patterns from extracted terms
        # These patterns help the URL seeder find relevant pages
        
        all_terms = data["key_terms"] + data["related_terms"]
        #patterns = [f"*{term.lower()}*" for term in all_terms]
        
        result = ResearchQuery(
            original_query=query,
            enhanced_query=data["enhanced_query"],
            search_patterns=[],  # patterns[:10] (pattern generation is disabled above)
            timestamp=datetime.now().isoformat()
        )
        
        # Show the enhancement
        console.print(Panel(
            f"[green] Enhanced Query:[/green] {result.enhanced_query}\n"
            f"[dim] Key terms: {', '.join(data['key_terms'])}[/dim]",
            title = "Query Enhancement"
        ))
    
        return result

    except Exception as e:
        console.print(f"[yellow] Enhancement failed, using original query: {e}[/yellow]")
        #Fallback to simple tokenization
        words= query.lower().split()
        patterns =[f"*{word}*" for word in words if len(word)>2]
        
        return ResearchQuery(
            original_query=query,
            enhanced_query=query,
            search_patterns=patterns,
            timestamp=datetime.now().isoformat()
        )
        
# Example usage
test_query= "knowledge graphs news"
enhanced = await enhance_query_with_llm(test_query,config)
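
The function above reads `key_terms`, `related_terms`, and `enhanced_query` straight from the model's JSON, so a single missing key sends it to the fallback branch and discards whatever the model did return. One possible alternative, sketched here with a hypothetical `parse_enhancement` helper that is not part of the original notebook, is to validate the payload field by field and substitute defaults.

```python
import json
from typing import Dict, List, Union


def parse_enhancement(raw: str, original_query: str) -> Dict[str, Union[str, List[str]]]:
    """Parse the LLM's JSON reply, substituting defaults for missing or malformed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    def as_str_list(value) -> List[str]:
        # Keep only non-empty strings; anything else is dropped
        if not isinstance(value, list):
            return []
        return [v.strip() for v in value if isinstance(v, str) and v.strip()]

    enhanced = data.get("enhanced_query")
    has_enhanced = isinstance(enhanced, str) and bool(enhanced.strip())
    return {
        "key_terms": as_str_list(data.get("key_terms")),
        "related_terms": as_str_list(data.get("related_terms")),
        # Fall back to the user's own wording when no usable rewrite came back
        "enhanced_query": enhanced.strip() if has_enhanced else original_query,
    }


print(parse_enhancement('{"key_terms": ["graphs", 42], "enhanced_query": ""}', "knowledge graphs news"))
print(parse_enhancement("not json", "knowledge graphs news"))
```

With this shape, a partially valid reply still contributes its usable terms, and only a completely unusable one degrades to the original query.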