# GPU Programming Ladder URL Validation System

This notebook orchestrates an AI agent system to validate and update URLs in the GPU Programming Ladder data.

## Features
- **Multi-agent system**: Task creator agent + multiple consumer agents
- **URL validation**: Checks if URLs exist and are accessible
- **Content analysis**: Ensures exercise/video URLs point to specific content, not listings
- **AI-powered replacements**: Uses local LLM (GPT-4o via LM Studio) to find replacements
- **Parallel processing**: Configurable concurrent requests with rate limiting
- **Thread safety**: No race conditions in concurrent operations
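
As a minimal sketch of the parallel-processing feature, a per-agent request cap can be modeled with an `asyncio.Semaphore`. The helper name `run_limited` and the `worker` callable are hypothetical and are not taken from the project's modules; only the limit of 2 mirrors the default `max_concurrent_requests_per_agent`.

```python
import asyncio


async def run_limited(items, worker, max_concurrent=2):
    """Run worker(item) for every item with at most max_concurrent in flight.

    Hypothetical helper: `worker` is any coroutine function taking one item.
    Results are returned in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Only max_concurrent coroutines may be inside this block at once.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```

In a notebook cell this could be called as `results = await run_limited(urls, check, CONFIG['max_concurrent_requests_per_agent'])`, where `check` stands in for whatever coroutine validates a single URL.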

## Requirements
- Python 3.8+
- LM Studio running locally with GPT-4o model
- Required Python packages (installed in virtual environment)

In [1]:
# Import required modules
import sys
import os
import asyncio
import json
from datetime import datetime
from pathlib import Path

# Add current directory to path for imports
sys.path.append('.')

# Import our custom modules
from url_validation_orchestrator import URLValidationOrchestrator
from url_extractor import extract_urls_from_data_js
from task_creator_agent import TaskCreatorAgent

## Configuration

Configure the validation system parameters below:

In [2]:
# Load configuration from config.json

# Fallback values used when config.json is missing or cannot be parsed
DEFAULT_CONFIG = {
    "lm_studio_url": "http://localhost:1234",
    "num_consumer_agents": 2,
    "max_concurrent_requests_per_agent": 2,
    "data_js_file": "../data.js",
    "urls_file": "urls_to_validate.json",
    "tasks_file": "validation_tasks.json",
    "results_file": "validation_results.json",
    "update_data_js": True,
    "force_revalidation": False
}

def load_config():
    """Load configuration from config.json, falling back to DEFAULT_CONFIG."""
    try:
        with open("config.json", "r") as f:
            config = json.load(f)

        # Convert config structure to match notebook expectations
        return {
            "lm_studio_url": config["lm_studio"]["url"],
            "num_consumer_agents": config["agents"]["num_consumer_agents"],
            "max_concurrent_requests_per_agent": config["agents"]["max_concurrent_requests_per_agent"],
            "data_js_file": config["files"]["data_js"],
            "urls_file": config["files"]["urls_to_validate"],
            "tasks_file": config["files"]["validation_tasks"],
            "results_file": config["files"]["validation_results"],
            "update_data_js": config["processing"]["update_data_js"],
            "force_revalidation": config["processing"]["force_revalidation"]
        }
    except FileNotFoundError:
        print("❌ config.json not found. Using default configuration.")
    except Exception as e:
        # Covers malformed JSON and missing keys
        print(f"❌ Error loading config.json: {e}. Using default configuration.")
    return dict(DEFAULT_CONFIG)

# Load configuration
CONFIG = load_config()

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")


Configuration loaded:
  lm_studio_url: http://localhost:1234
  num_consumer_agents: 2
  max_concurrent_requests_per_agent: 2
  data_js_file: ../data.js
  urls_file: urls_to_validate.json
  tasks_file: validation_tasks.json
  results_file: validation_results.json
  update_data_js: True
  force_revalidation: False


## Step 1: Extract URLs from data.js

First, let's extract all URLs from the data.js file that need validation.

In [3]:
# Extract URLs from data.js
print("🔍 Extracting URLs from data.js...")

try:
    urls = extract_urls_from_data_js(CONFIG['data_js_file'])
    print(f"✅ Extracted {len(urls)} URLs")

    # Show URL type distribution
    url_types = {}
    for url in urls:
        url_type = url['url_type']
        url_types[url_type] = url_types.get(url_type, 0) + 1

    print("\n📊 URL distribution by type:")
    for url_type, count in url_types.items():
        print(f"   {url_type}: {count}")

    # Show sample URLs
    print("\n📋 Sample URLs to validate:")
    for i, url in enumerate(urls[:5]):
        print(f"   {i+1}. [{url['url_type']}] {url['topic_title'][:50]}...")
        print(f"      URL: {url['url']}")

except Exception as e:
    print(f"❌ Error extracting URLs: {e}")
    urls = []

🔍 Extracting URLs from data.js...
✅ Extracted 260 URLs

📊 URL distribution by type:
   article: 62
   paper: 26
   video: 61
   exercise: 65
   python: 37
   cpp: 9

📋 Sample URLs to validate:
   1. [article] CPU vs GPU: Why GPUs for ML and HPC...
      URL: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
   2. [paper] CPU vs GPU: Why GPUs for ML and HPC...
      URL: https://dl.acm.org/doi/10.1145/1365490.1365500
   3. [video] CPU vs GPU: Why GPUs for ML and HPC...
      URL: https://www.youtube.com/watch?v=-P28LKWTzrI
   4. [exercise] CPU vs GPU: Why GPUs for ML and HPC...
      URL: https://leetgpu.com/challenges
   5. [article] GPU Architecture: SMs, warps, cores...
      URL: https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/kernel_sm
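
The per-type tally in the cell above can also be written with `collections.Counter`. This is only an alternative sketch: the helper name `count_by_type` is hypothetical, and the sample entries reuse the `url_type` and `url` fields shown in the output.

```python
from collections import Counter


def count_by_type(entries):
    """Count URL entries per 'url_type' (hypothetical helper)."""
    return Counter(entry["url_type"] for entry in entries)


sample_entries = [
    {"url_type": "article", "url": "https://developer.nvidia.com/blog/even-easier-introduction-cuda/"},
    {"url_type": "paper", "url": "https://dl.acm.org/doi/10.1145/1365490.1365500"},
    {"url_type": "article", "url": "https://cvw.cac.cornell.edu/gpu-architecture/gpu-characteristics/kernel_sm"},
]

# most_common() yields (type, count) pairs sorted by descending count
for url_type, count in count_by_type(sample_entries).most_common():
    print(f"   {url_type}: {count}")
```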


## Step 2: Create Validation Tasks

Create validation tasks for URLs that haven't been validated yet.
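
A rough sketch of what "not validated yet" could mean in code follows. It assumes each task wraps a URL entry under the `url_entry` key (as the cell below reads it) and carries a UUID like the ones in the agent logs; the function name `build_tasks`, the `task_id` field, and the `validated_urls` argument are hypothetical and may differ from `TaskCreatorAgent`.

```python
import uuid


def build_tasks(url_entries, validated_urls=(), force_revalidation=False):
    """Create one task per URL entry that still needs validation.

    Hypothetical sketch: entries whose URL is already in validated_urls are
    skipped unless force_revalidation is set.
    """
    already_done = set() if force_revalidation else set(validated_urls)
    tasks = []
    for entry in url_entries:
        if entry["url"] in already_done:
            continue
        tasks.append({"task_id": str(uuid.uuid4()), "url_entry": entry})
    return tasks
```

With `force_revalidation` set to `True`, every entry produces a task again, which matches the intent of the configuration flag of the same name.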

In [4]:
# Create validation tasks
print("📋 Creating validation tasks...")

try:
    task_creator = TaskCreatorAgent(CONFIG['urls_file'])

    # Run task creation
    tasks = await task_creator.run()

    print(f"\n✅ Created {len(tasks)} validation tasks")

    # Show task distribution
    task_types = {}
    for task in tasks:
        url_type = task['url_entry']['url_type']
        task_types[url_type] = task_types.get(url_type, 0) + 1

    print("\n📊 Tasks by URL type:")
    for url_type, count in task_types.items():
        print(f"   {url_type}: {count}")

except Exception as e:
    print(f"❌ Error creating tasks: {e}")
    tasks = []

📋 Creating validation tasks...
🔄 Task Creator Agent starting...
📋 Loaded 260 URLs from urls_to_validate.json
✅ Created 260 validation tasks
💾 Saved tasks to validation_tasks.json

📊 Task Summary:
   Total tasks: 260
   Tasks by type:
     article: 62
     paper: 26
     video: 61
     exercise: 65
     python: 37
     cpp: 9

✅ Created 260 validation tasks

📊 Tasks by URL type:
   article: 62
   paper: 26
   video: 61
   exercise: 65
   python: 37
   cpp: 9


## Step 3: Run URL Validation

Run the multi-agent URL validation system. This will:
- Check if URLs exist and are accessible
- Analyze content to ensure appropriate targeting
- Use AI to find replacements for broken/inappropriate URLs
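
The first check in the list above (does the URL exist and respond?) can be approximated without an LLM. The sketch below uses only the standard library; the names `classify_status` and `check_url`, the status buckets, and the User-Agent string are assumptions for illustration and are not the agents' actual logic, which drives tools such as `curl` through the model.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code to a coarse verdict (illustrative buckets)."""
    if 200 <= status < 400:
        return "reachable"
    if status in (401, 403, 429):
        return "blocked"   # the site answered but refused an automated request
    if status in (404, 410):
        return "missing"
    return "error"


def check_url(url, timeout=10.0):
    """Send a HEAD request and classify the outcome."""
    try:
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "url-validator-sketch/0.1"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions that carry the status code
        return classify_status(exc.code)
    except (OSError, ValueError):
        # DNS failures, refused connections, timeouts, malformed URLs
        return "error"
```

Some servers reject HEAD requests, so a "blocked" or "error" verdict is a reason to retry with GET or another tool rather than proof that the link is dead.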

In [5]:
# Run URL validation with orchestrator
print("🚀 Starting URL validation pipeline...")
print(f"   Using {CONFIG['num_consumer_agents']} consumer agents")
print(f"   Max {CONFIG['max_concurrent_requests_per_agent']} concurrent requests per agent")
print(f"   LM Studio URL: {CONFIG['lm_studio_url']}")

try:
    orchestrator = URLValidationOrchestrator(
        num_consumer_agents=CONFIG['num_consumer_agents'],
        max_concurrent_requests_per_agent=CONFIG['max_concurrent_requests_per_agent'],
        lm_studio_url=CONFIG['lm_studio_url'],
        tasks_file=CONFIG['tasks_file'],
        results_file=CONFIG['results_file']
    )

    # Run validation pipeline
    start_time = datetime.now()
    summary = await orchestrator.run_validation_pipeline()
    end_time = datetime.now()

    duration = end_time - start_time
    print(f"\n⏱️  Validation completed in {duration.total_seconds():.1f} seconds")

    # Add timing information to summary for enhanced reporting
    if 'error' not in summary:
        summary['total_duration_seconds'] = duration.total_seconds()

except Exception as e:
    print(f"❌ Error during validation: {e}")
    summary = {'error': str(e)}

2025-12-23 15:33:12,452 - URLValidatorAgent_agent_1 - INFO - Agent agent_1 initialized with log file: url_validation_agent_agent_1_20251223_153312.log
2025-12-23 15:33:12,452 - URLValidatorAgent_agent_1 - INFO - Starting to process 130 tasks
2025-12-23 15:33:12,453 - URLValidatorAgent_agent_1 - INFO - [e177a8d8-b909-4d79-8ac5-9adf3c1d9b9d] Processing task 1/130: https://developer.nvidia.com/blog/even-easier-introduction-cuda/ (type: article)
2025-12-23 15:33:12,453 - URLValidatorAgent_agent_1 - INFO - Starting generic content analysis
2025-12-23 15:33:12,453 - URLValidatorAgent_agent_1 - INFO - Starting iteration 1/10
2025-12-23 15:33:12,454 - URLValidatorAgent_agent_2 - INFO - Agent agent_2 initialized with log file: url_validation_agent_agent_2_20251223_153312.log
2025-12-23 15:33:12,455 - URLValidatorAgent_agent_2 - INFO - Starting to process 130 tasks
2025-12-23 15:33:12,463 - URLValidatorAgent_agent_2 - INFO - [f4a1aa80-fc84-495c-b0fe-6bc3dfefd2d7] Processing task 1/130: https://a

🚀 Starting URL validation pipeline...
   Using 2 consumer agents
   Max 2 concurrent requests per agent
   LM Studio URL: http://localhost:1234
🚀 Starting URL Validation Pipeline
   Consumer agents: 2
   Max concurrent requests per agent: 2
   LM Studio URL: http://localhost:1234

📋 Step 1: Creating validation tasks...
🔄 Task Creator Agent starting...
📋 Loaded 260 URLs from urls_to_validate.json
✅ Created 260 validation tasks
💾 Saved tasks to validation_tasks.json

📊 Task Summary:
   Total tasks: 260
   Tasks by type:
     article: 62
     paper: 26
     video: 61
     exercise: 65
     python: 37
     cpp: 9

🤖 Step 2: Starting consumer agents...
   🟢 Agent agent_1 starting with 130 tasks
🔍 Processing task: https://developer.nvidia.com/blog/even-easier-introduction-cuda/
   🟢 Agent agent_2 starting with 130 tasks
🔍 Processing task: https://arxiv.org/abs/2205.05198


2025-12-23 15:33:15,539 - URLValidatorAgent_agent_1 - INFO - Iteration 1 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:15,540 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:33:15,540 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:16,340 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:16,341 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_scrape completed in 0.80s
2025-12-23 15:33:16,341 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:33:16,341 - URLValidatorAgent_agent_1 - INFO - Starting iteration 2/10
2025-12-23 15:33:18,333 - URLValidatorAgent_agent_2 - INFO - Iteration 1 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:18,333 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:33:18,334 - URLValidatorAgent_agent_2 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:19,064 - URLValidatorAgent_agent_2 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:19,065 - URLValidatorAgent_agent_2 - INFO - Tool firecrawl_scrape completed in 0.73s
2025-12-23 15:33:19,066 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:33:19,066 - URLValidatorAgent_agent_2 - INFO - Starting iteration 2/10
2025-12-23 15:33:21,450 - URLValidatorAgent_agent_1 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:21,451 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:33:21,452 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:21,720 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:21,721 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_scrape completed in 0.27s
2025-12-23 15:33:21,722 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:33:21,722 - URLValidatorAgent_agent_1 - INFO - Starting iteration 3/10
2025-12-23 15:33:25,987 - URLValidatorAgent_agent_2 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:25,988 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:33:25,988 - URLValidatorAgent_agent_2 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:26,250 - URLValidatorAgent_agent_2 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:26,251 - URLValidatorAgent_agent_2 - INFO - Tool firecrawl_scrape completed in 0.26s
2025-12-23 15:33:26,253 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:33:26,253 - URLValidatorAgent_agent_2 - INFO - Starting iteration 3/10
2025-12-23 15:33:30,175 - URLValidatorAgent_agent_1 - INFO - Iteration 3 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:30,177 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:33:30,177 - URLValidatorAgent_agent_1 - INFO - Executing tool: curl


🔧 Executing tool: curl - Using curl tool as requested by LLM


2025-12-23 15:33:32,551 - URLValidatorAgent_agent_1 - INFO - Tool curl completed in 2.37s
2025-12-23 15:33:32,553 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:33:32,553 - URLValidatorAgent_agent_1 - INFO - Forcing final conclusion after successful tool results
2025-12-23 15:33:32,554 - URLValidatorAgent_agent_1 - INFO - [e177a8d8-b909-4d79-8ac5-9adf3c1d9b9d] Analysis completed in 20.10s
2025-12-23 15:33:32,554 - URLValidatorAgent_agent_1 - INFO - [e177a8d8-b909-4d79-8ac5-9adf3c1d9b9d] Result: valid=True, replacement=False
2025-12-23 15:33:32,558 - URLValidatorAgent_agent_1 - INFO - [26c26581-36ec-4be7-b00e-0c3675070320] Processing task 2/130: https://dl.acm.org/doi/10.1145/1365490.1365500 (type: paper)
2025-12-23 15:33:32,559 - URLValidatorAgent_agent_1 - INFO - Starting generic content analysis
2025-12-23 15:33:32,559 - URLValidatorAgent_agent_1 - INFO - Starting iteration 1/10


🔍 Processing task: https://dl.acm.org/doi/10.1145/1365490.1365500


2025-12-23 15:33:35,328 - URLValidatorAgent_agent_2 - INFO - Iteration 3 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:35,329 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:33:35,330 - URLValidatorAgent_agent_2 - INFO - Executing tool: curl
2025-12-23 15:33:35,423 - URLValidatorAgent_agent_2 - INFO - Tool curl completed in 0.09s
2025-12-23 15:33:35,424 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:33:35,425 - URLValidatorAgent_agent_2 - INFO - Forcing final conclusion after successful tool results
2025-12-23 15:33:35,425 - URLValidatorAgent_agent_2 - INFO - [f4a1aa80-fc84-495c-b0fe-6bc3dfefd2d7] Analysis completed in 22.96s
2025-12-23 15:33:35,426 - URLValidatorAgent_agent_2 - INFO - [f4a1aa80-fc84-495c-b0fe-6bc3dfefd2d7] Result: valid=True, replacement=False
2025-12-23 15:33:35,426 - URLValidatorAgent_agent_2 - INFO - [d3249a9c-c856-438d-8916-0c1996c8db4a] Processing task 2/130: https://www.youtube.com/watch?v=0QwZ9BtVu

🔧 Executing tool: curl - Using curl tool as requested by LLM
🔍 Processing task: https://www.youtube.com/watch?v=0QwZ9BtVu0E


2025-12-23 15:33:39,375 - URLValidatorAgent_agent_1 - INFO - Iteration 1 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:39,377 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:33:39,378 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:40,224 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:40,225 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_scrape completed in 0.85s
2025-12-23 15:33:40,225 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:33:40,226 - URLValidatorAgent_agent_1 - INFO - Starting iteration 2/10
2025-12-23 15:33:42,608 - URLValidatorAgent_agent_2 - INFO - Iteration 1 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:42,609 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:33:42,610 - URLValidatorAgent_agent_2 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:33:43,345 - URLValidatorAgent_agent_2 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:33:43,345 - URLValidatorAgent_agent_2 - INFO - Tool firecrawl_scrape completed in 0.73s
2025-12-23 15:33:43,346 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:33:43,346 - URLValidatorAgent_agent_2 - INFO - Starting iteration 2/10
2025-12-23 15:33:47,805 - URLValidatorAgent_agent_1 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:47,806 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:33:47,806 - URLValidatorAgent_agent_1 - INFO - Executing tool: curl
2025-12-23 15:33:47,870 - URLValidatorAgent_agent_1 - INFO - Tool curl completed in 0.06s
2025-12-23 15:33:47,871 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:33:47,872 - URLValidatorAgent_agent_1 - INFO - Starting iteration 3/10


🔧 Executing tool: curl - Using curl tool as requested by LLM


2025-12-23 15:33:57,085 - URLValidatorAgent_agent_2 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:33:57,086 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:33:57,086 - URLValidatorAgent_agent_2 - INFO - Executing tool: curl


🔧 Executing tool: curl - Using curl tool as requested by LLM


2025-12-23 15:33:57,635 - URLValidatorAgent_agent_2 - INFO - Tool curl completed in 0.55s
2025-12-23 15:33:57,638 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:33:57,639 - URLValidatorAgent_agent_2 - INFO - Forcing final conclusion after successful tool results
2025-12-23 15:33:57,640 - URLValidatorAgent_agent_2 - INFO - [d3249a9c-c856-438d-8916-0c1996c8db4a] Analysis completed in 22.21s
2025-12-23 15:33:57,640 - URLValidatorAgent_agent_2 - INFO - [d3249a9c-c856-438d-8916-0c1996c8db4a] Result: valid=True, replacement=False
2025-12-23 15:33:57,650 - URLValidatorAgent_agent_2 - INFO - [6ab14ab5-c7ec-4bd0-8c62-4b3c5a9dff69] Processing task 3/130: https://github.com/NVIDIA/Megatron-LM (type: exercise)
2025-12-23 15:33:57,651 - URLValidatorAgent_agent_2 - INFO - Starting generic content analysis
2025-12-23 15:33:57,653 - URLValidatorAgent_agent_2 - INFO - Starting iteration 1/10


🔍 Processing task: https://github.com/NVIDIA/Megatron-LM


2025-12-23 15:34:06,308 - URLValidatorAgent_agent_1 - INFO - Iteration 3 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:06,309 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:34:06,310 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_search


🔧 Executing tool: firecrawl_search - Using firecrawl_search tool as requested by LLM


2025-12-23 15:34:07,065 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:34:07,066 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_search completed in 0.76s
2025-12-23 15:34:07,067 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:34:07,067 - URLValidatorAgent_agent_1 - INFO - Starting iteration 4/10
2025-12-23 15:34:10,518 - URLValidatorAgent_agent_2 - INFO - Iteration 1 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:10,520 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:34:10,521 - URLValidatorAgent_agent_2 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:34:11,324 - URLValidatorAgent_agent_2 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:34:11,325 - URLValidatorAgent_agent_2 - INFO - Tool firecrawl_scrape completed in 0.80s
2025-12-23 15:34:11,325 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:34:11,326 - URLValidatorAgent_agent_2 - INFO - Starting iteration 2/10
2025-12-23 15:34:15,261 - URLValidatorAgent_agent_1 - INFO - Iteration 4 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:15,262 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:34:15,262 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:34:15,519 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:34:15,521 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_scrape completed in 0.26s
2025-12-23 15:34:15,521 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:34:15,522 - URLValidatorAgent_agent_1 - INFO - Starting iteration 5/10
2025-12-23 15:34:20,541 - URLValidatorAgent_agent_2 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:20,542 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:34:20,542 - URLValidatorAgent_agent_2 - INFO - Executing tool: curl


🔧 Executing tool: curl - Using curl tool as requested by LLM


2025-12-23 15:34:21,362 - URLValidatorAgent_agent_2 - INFO - Tool curl completed in 0.82s
2025-12-23 15:34:21,365 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:34:21,365 - URLValidatorAgent_agent_2 - INFO - Forcing final conclusion after successful tool results
2025-12-23 15:34:21,369 - URLValidatorAgent_agent_2 - INFO - [6ab14ab5-c7ec-4bd0-8c62-4b3c5a9dff69] Analysis completed in 23.72s
2025-12-23 15:34:21,370 - URLValidatorAgent_agent_2 - INFO - [6ab14ab5-c7ec-4bd0-8c62-4b3c5a9dff69] Result: valid=False, replacement=False
2025-12-23 15:34:21,375 - URLValidatorAgent_agent_2 - INFO - [042e86f2-00ff-4735-ba1f-0275d0d23c9f] Processing task 4/130: https://github.com/NVIDIA/Megatron-LM (type: exercise)
2025-12-23 15:34:21,376 - URLValidatorAgent_agent_2 - INFO - Starting generic content analysis
2025-12-23 15:34:21,377 - URLValidatorAgent_agent_2 - INFO - Starting iteration 1/10


🔍 Processing task: https://github.com/NVIDIA/Megatron-LM


2025-12-23 15:34:25,118 - URLValidatorAgent_agent_1 - INFO - Iteration 5 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:25,119 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:34:25,119 - URLValidatorAgent_agent_1 - INFO - Executing tool: curl
2025-12-23 15:34:25,190 - URLValidatorAgent_agent_1 - INFO - Tool curl completed in 0.07s
2025-12-23 15:34:25,191 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:34:25,191 - URLValidatorAgent_agent_1 - INFO - Starting iteration 6/10


🔧 Executing tool: curl - Using curl tool as requested by LLM


2025-12-23 15:34:35,196 - URLValidatorAgent_agent_2 - INFO - Iteration 1 - Action: continue, Tool calls: 0
2025-12-23 15:34:35,198 - URLValidatorAgent_agent_2 - INFO - Starting iteration 2/10
2025-12-23 15:34:45,666 - URLValidatorAgent_agent_1 - INFO - Iteration 6 - Action: continue, Tool calls: 0
2025-12-23 15:34:45,667 - URLValidatorAgent_agent_1 - INFO - Starting iteration 7/10
2025-12-23 15:34:49,868 - URLValidatorAgent_agent_2 - INFO - Iteration 2 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:49,869 - URLValidatorAgent_agent_2 - INFO - Executing 1 tool calls
2025-12-23 15:34:49,869 - URLValidatorAgent_agent_2 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:34:50,609 - URLValidatorAgent_agent_2 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:34:50,610 - URLValidatorAgent_agent_2 - INFO - Tool firecrawl_scrape completed in 0.74s
2025-12-23 15:34:50,610 - URLValidatorAgent_agent_2 - INFO - Completed executing 1 tools
2025-12-23 15:34:50,611 - URLValidatorAgent_agent_2 - INFO - Starting iteration 3/10
2025-12-23 15:34:55,672 - URLValidatorAgent_agent_1 - INFO - Iteration 7 - Action: use_tools, Tool calls: 1
2025-12-23 15:34:55,673 - URLValidatorAgent_agent_1 - INFO - Executing 1 tool calls
2025-12-23 15:34:55,673 - URLValidatorAgent_agent_1 - INFO - Executing tool: firecrawl_scrape


🔧 Executing tool: firecrawl_scrape - Using firecrawl_scrape tool as requested by LLM


2025-12-23 15:34:56,425 - URLValidatorAgent_agent_1 - ERROR - Firecrawl API error 401: {"success":false,"error":"Unauthorized: Invalid token"}
2025-12-23 15:34:56,426 - URLValidatorAgent_agent_1 - INFO - Tool firecrawl_scrape completed in 0.75s
2025-12-23 15:34:56,427 - URLValidatorAgent_agent_1 - INFO - Completed executing 1 tools
2025-12-23 15:34:56,427 - URLValidatorAgent_agent_1 - INFO - Starting iteration 8/10
2025-12-23 15:35:05,879 - URLValidatorAgent_agent_2 - INFO - Iteration 3 - Action: continue, Tool calls: 0
2025-12-23 15:35:05,880 - URLValidatorAgent_agent_2 - INFO - Starting iteration 4/10


⚠️  Agent agent_1 was cancelled, attempting to save partial results...
⚠️  Agent agent_2 was cancelled, attempting to save partial results...


CancelledError: 

## Step 4: Review Validation Results

Review the validation results and summary statistics.

In [None]:
# Display validation summary
if 'error' not in summary:
    print("📊 Validation Summary:")
    print(f"   Total URLs processed: {summary['total_urls_processed']}")
    print(f"   Valid URLs: {summary['valid_urls']}")
    print(f"   Invalid URLs: {summary['invalid_urls']}")
    print(f"   URLs with replacements: {summary['replaced_urls']}")
    print(f"   URLs to be removed: {summary['removed_urls']}")
    print(f"   Success rate: {summary['success_rate']:.1f}%")

    # Time-related statistics
    if 'total_duration_seconds' in summary and summary['total_urls_processed'] > 0:
        avg_time_per_url = summary['total_duration_seconds'] / summary['total_urls_processed']
        urls_per_minute = (summary['total_urls_processed'] / summary['total_duration_seconds']) * 60

        print("\n⏱️  Performance Statistics:")
        print(f"   Total processing time: {summary['total_duration_seconds']:.1f} seconds")
        print(f"   Average time per URL: {avg_time_per_url:.2f} seconds")
        print(f"   Processing rate: {urls_per_minute:.1f} URLs/minute")

        # Phase breakdown if available
        if 'phase_durations' in summary:
            phases = summary['phase_durations']
            print("\n📈 Time Breakdown by Phase:")
            for phase, duration in phases.items():
                percentage = (duration / summary['total_duration_seconds']) * 100
                print(f"   {phase}: {duration:.1f}s ({percentage:.1f}%)")

    print("\n📋 Breakdown by URL type:")
    for url_type, stats in summary['urls_by_type'].items():
        total = stats['total']
        valid = stats['valid']
        replaced = stats['replaced']
        removed = stats['invalid'] - stats['replaced']

        # Time stats per URL type if available
        type_stats = f"{url_type}: {valid}/{total} valid, {replaced} replaced, {removed} to remove"
        if 'avg_time_per_url' in stats:
            type_stats += f" (avg: {stats['avg_time_per_url']:.2f}s/URL)"
        print(f"   {type_stats}")

    # Show agent performance if available
    if 'agent_performance' in summary:
        print("\n🤖 Agent Performance:")
        for agent_id, perf in summary['agent_performance'].items():
            urls_processed = perf.get('urls_processed', 0)
            time_spent = perf.get('time_spent', 0)
            avg_time = time_spent / urls_processed if urls_processed > 0 else 0
            print(f"   Agent {agent_id}: {urls_processed} URLs, {time_spent:.1f}s total, {avg_time:.2f}s/URL")
else:
    print(f"❌ Validation failed: {summary['error']}")
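
To make the top-level summary fields concrete, here is a hedged sketch of how they could be derived from a list of per-URL results. The result shape (`valid` flag plus an optional `replacement_url`) and the definition of `success_rate` as valid over total are assumptions; the orchestrator may compute these differently.

```python
def summarize(results):
    """Aggregate per-URL results into the summary keys printed above.

    Assumed result shape: {"valid": bool, "replacement_url": str or None}.
    """
    total = len(results)
    valid = sum(1 for r in results if r["valid"])
    invalid = total - valid
    # An invalid URL either has a replacement or is marked for removal
    replaced = sum(1 for r in results if not r["valid"] and r.get("replacement_url"))
    removed = invalid - replaced
    return {
        "total_urls_processed": total,
        "valid_urls": valid,
        "invalid_urls": invalid,
        "replaced_urls": replaced,
        "removed_urls": removed,
        # Assumption: success rate is the share of valid URLs, in percent
        "success_rate": 100.0 * valid / total if total else 0.0,
    }
```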

## Step 5: Update data.js File (Optional)

Update the original data.js file with the validation results.

In [None]:
# Update data.js with validation results
if CONFIG['update_data_js'] and 'error' not in summary:
    print("🔄 Updating data.js with validation results...")

    try:
        success = await orchestrator.update_data_js_with_results(CONFIG['data_js_file'])
        if success:
            print("✅ data.js updated successfully")
        else:
            print("❌ Failed to update data.js")
    except Exception as e:
        print(f"❌ Error updating data.js: {e}")
else:
    print("⏭️  Skipping data.js update")
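
The internals of `update_data_js_with_results` are not shown in this notebook. As a sketch under that caveat, an update step could swap quoted URL literals and keep a backup copy; `apply_replacements`, `update_file`, and the `.bak` suffix are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def apply_replacements(text, replacements):
    """Swap quoted URL literals; returns (new_text, number_of_urls_changed).

    Matching the surrounding quotes avoids rewriting a URL that is only a
    prefix of a longer one.
    """
    changed = 0
    for old_url, new_url in replacements.items():
        hit = False
        for quote in ('"', "'"):
            old_literal = f"{quote}{old_url}{quote}"
            if old_literal in text:
                text = text.replace(old_literal, f"{quote}{new_url}{quote}")
                hit = True
        if hit:
            changed += 1
    return text, changed


def update_file(path, replacements):
    """Rewrite the file in place, saving a '.bak' copy before any change."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    updated, changed = apply_replacements(original, replacements)
    if changed:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(updated, encoding="utf-8")
    return changed
```

Keeping the backup makes the manual review suggested later in this notebook cheaper, since the old and new files can be diffed directly.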

## MCP Tools Configuration

To configure MCP (Model Context Protocol) tools for enhanced functionality:

### Available MCP Tools
1. **firecrawl**: Web scraping and content extraction
2. **context7**: Library documentation search
3. **brave-search**: Web search capabilities

### Configuration Steps
1. Install MCP server packages
2. Configure MCP client in your LLM setup
3. Update the `find_replacement_url` method to use MCP tools

### Example MCP Integration
```python
# Add to URLValidatorAgent.__init__
self.mcp_client = MCPClient(
    server_configs={
        'firecrawl': {'url': 'http://localhost:3000'},
        'context7': {'url': 'http://localhost:3001'}
    }
)

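# Use in find_replacement_url method
# Note: 'brave-search' is not listed in server_configs above and must be
# registered there first. MCPClient and its search/scrape methods are
# illustrative placeholders, not a specific library's API.
search_results = await self.mcp_client.search('brave-search', query)
scraped_content = await self.mcp_client.scrape('firecrawl', url)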
```

### Benefits of MCP Integration
- **Enhanced search**: Use Brave Search for finding replacements
- **Better scraping**: Use Firecrawl for content validation
- **Documentation lookup**: Use Context7 for library-specific resources
- **Fallback mechanisms**: Multiple tools for robust URL finding
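
The fallback idea in the last bullet matches what the Step 3 logs show, where `firecrawl_scrape` fails with a 401 and the agent then succeeds with `curl`. A minimal sketch of such a chain follows; `first_successful` and the `(name, callable)` tool format are hypothetical and are not part of any MCP client.

```python
def first_successful(tools, url):
    """Try each (name, callable) tool in order; return the first result.

    Hypothetical helper: a tool signals failure by raising an exception.
    """
    errors = []
    for name, tool in tools:
        try:
            return name, tool(url)
        except Exception as exc:
            # Remember why this tool failed, then move on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tools failed: " + "; ".join(errors))
```

Ordering the list from most to least capable tool keeps the cheap fallback as a last resort instead of the default.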

## Troubleshooting

### Common Issues

**LM Studio Connection Issues**
```bash
# Check LM Studio status
curl http://localhost:1234/v1/models

# Verify model is loaded
# Restart LM Studio if needed
```
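
The same check can be run from a notebook cell. This sketch queries the `/v1/models` endpoint used in the `curl` command above and assumes the response follows the OpenAI-style `{"data": [{"id": ...}]}` shape; the function name `list_lm_studio_models` is hypothetical.

```python
import json
import urllib.request


def list_lm_studio_models(base_url="http://localhost:1234", timeout=5.0):
    """Return the model ids reported by the server, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body
        return None
    if not isinstance(payload, dict):
        return None
    # Assumed response shape: {"data": [{"id": "<model name>"}, ...]}
    return [model.get("id") for model in payload.get("data", [])]
```

A return value of `None` means the server did not answer; an empty list suggests it is running but no model is loaded.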

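**Rate Limiting**

Lower `max_concurrent_requests_per_agent` in config.json. The notebook default is 2, so a value of 1 serializes requests within each agent. JSON does not allow comments, so keep `//` lines out of the file:
```json
{
  "agents": {
    "max_concurrent_requests_per_agent": 1
  }
}
```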

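**Memory Issues**

Process URLs in smaller batches. Note that `load_config` in this notebook does not read a `batch_size` key, so this setting only takes effect if the pipeline modules read it from config.json themselves:
```json
{
  "processing": {
    "batch_size": 25
  }
}
```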

**URL Validation Issues**
- Some sites block automated requests - consider using proxies
- PDFs and binary content may not be properly analyzed
- GitHub rate limiting may affect repository checks
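
For transient blocks and rate limits such as those listed above, retrying with exponential backoff is a common mitigation. The sketch below is generic and not taken from the agents' code; `retry_with_backoff` and its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    The wait doubles after each failure (base_delay, 2x, 4x, ...) plus a
    random jitter of up to base_delay so parallel agents do not retry in
    lockstep. The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be stubbed out in tests; in normal use the default `time.sleep` applies.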

## Next Steps

After running this validation system:

1. **Review changes**: Check the updated data.js file
2. **Manual verification**: Spot-check some URLs to ensure replacements are appropriate
3. **Re-run periodically**: URLs can break over time, so re-validation is recommended
4. **Extend functionality**: Add more URL types or validation rules as needed

### Potential Enhancements
- **Content freshness checking**: Verify that content is still relevant
- **Duplicate detection**: Find and remove duplicate URLs
- **Quality scoring**: Rate URLs by content quality
- **Automated scheduling**: Set up regular validation runs
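
Duplicate detection is the easiest of these to prototype: the Step 3 log already shows `https://github.com/NVIDIA/Megatron-LM` being processed as two separate tasks. The sketch below groups entries by a normalized URL; `normalize_url` and `find_duplicates` are hypothetical helpers, and the normalization rules (lowercase scheme and host, no trailing slash, no fragment) are assumptions that may be too aggressive for some sites.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_duplicates(entries):
    """Group entries by normalized URL and keep groups with more than one item."""
    groups = {}
    for entry in entries:
        groups.setdefault(normalize_url(entry["url"]), []).append(entry)
    return {url: items for url, items in groups.items() if len(items) > 1}
```

Repeated URLs may be intentional when the same resource serves several topics, so a report of duplicates is a prompt for review (or for caching validation results per URL) rather than for automatic removal.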