# GPU Programming Ladder URL Validation System

This notebook orchestrates an AI agent system to validate and update URLs in the GPU Programming Ladder data.

## Features
- **Multi-agent system**: Task creator agent + multiple consumer agents
- **URL validation**: Checks if URLs exist and are accessible
- **Content analysis**: Ensures exercise/video URLs point to specific content, not listings
- **AI-powered replacements**: Uses a local LLM (GPT-4o via LM Studio) to find replacements
- **Parallel processing**: Configurable concurrent requests with rate limiting
- **Thread safety**: No race conditions in concurrent operations
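
The parallel-processing and thread-safety bullets can be illustrated with a minimal sketch. It assumes an asyncio-based design in which a semaphore caps in-flight requests and a lock guards shared results; `bounded_gather` and `fake_check` are hypothetical names, not taken from the project's modules.

```python
import asyncio


async def bounded_gather(items, worker, max_concurrent=5):
    """Run worker(item) for every item, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    lock = asyncio.Lock()  # guards the shared results dict
    results = {}

    async def run_one(item):
        async with semaphore:  # waits here once the cap is reached
            outcome = await worker(item)
        async with lock:  # one writer at a time
            results[item] = outcome

    await asyncio.gather(*(run_one(item) for item in items))
    return results


async def fake_check(url):
    """Stand-in for a real HTTP check (hypothetical): no network involved."""
    await asyncio.sleep(0.01)
    return url.startswith("https://")
```

In single-threaded asyncio the lock is largely illustrative; it starts to matter once shared state is modified across an `await` or from real threads.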

## Requirements
- Python 3.8+
- LM Studio running locally with the GPT-4o model loaded
- Required Python packages (installed in a virtual environment)

In [None]:
# Import required modules
import sys
import os
import asyncio
import json
from datetime import datetime
from pathlib import Path

# Add current directory to path for imports
sys.path.append('.')

# Import our custom modules
from url_validation_orchestrator import URLValidationOrchestrator
from url_extractor import extract_urls_from_data_js
from task_creator_agent import TaskCreatorAgent

## Configuration

Configure the validation system parameters below:

In [None]:
# System Configuration
CONFIG = {
    # LM Studio configuration
    'lm_studio_url': 'http://localhost:1234',  # Update if LM Studio runs on different port
    
    # Agent configuration
    'num_consumer_agents': 3,  # Number of parallel consumer agents
    'max_concurrent_requests_per_agent': 5,  # Concurrent requests per agent (rate limiting)
    
    # File paths
    'data_js_file': '../data.js',
    'urls_file': '../urls_to_validate.json',
    'tasks_file': '../validation_tasks.json',
    'results_file': '../validation_results.json',
    
    # Processing options
    'update_data_js': True,  # Whether to update the original data.js file
    'force_revalidation': False  # Set to True to revalidate all URLs
}

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## Step 1: Extract URLs from data.js

First, let's extract all URLs from the data.js file that need validation.

In [None]:
# Extract URLs from data.js
print("🔍 Extracting URLs from data.js...")

try:
    urls = extract_urls_from_data_js(CONFIG['data_js_file'])
    print(f"✅ Extracted {len(urls)} URLs")
    
    # Show URL type distribution
    url_types = {}
    for url in urls:
        url_type = url['url_type']
        url_types[url_type] = url_types.get(url_type, 0) + 1
    
    print("\n📊 URL distribution by type:")
    for url_type, count in url_types.items():
        print(f"   {url_type}: {count}")
    
    # Show sample URLs
    print("\n📋 Sample URLs to validate:")
    for i, url in enumerate(urls[:5]):
        print(f"   {i+1}. [{url['url_type']}] {url['topic_title'][:50]}...")
        print(f"      URL: {url['url']}")
    
except Exception as e:
    print(f"❌ Error extracting URLs: {e}")
    urls = []
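
The internals of `extract_urls_from_data_js` are not shown in this notebook. As a rough sketch of one way such extraction could work, the snippet below assumes data.js stores links as quoted string literals and pulls them out with a regular expression. The name `find_urls_in_js` is hypothetical, and this version does not recover fields such as `url_type` or `topic_title`.

```python
import re

# Matches http(s) URLs inside single-quoted, double-quoted, or backtick JS strings.
URL_IN_STRING = re.compile(r"""["'`](https?://[^"'`\s]+)["'`]""")


def find_urls_in_js(js_text):
    """Return unique URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_text):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```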

## Step 2: Create Validation Tasks

Create validation tasks for URLs that haven't been validated yet.

In [None]:
# Create validation tasks
print("📋 Creating validation tasks...")

try:
    task_creator = TaskCreatorAgent(CONFIG['urls_file'])
    
    # Run task creation
    tasks = await task_creator.run()
    
    print(f"\n✅ Created {len(tasks)} validation tasks")
    
    # Show task distribution
    task_types = {}
    for task in tasks:
        url_type = task['url_entry']['url_type']
        task_types[url_type] = task_types.get(url_type, 0) + 1
    
    print("\n📊 Tasks by URL type:")
    for url_type, count in task_types.items():
        print(f"   {url_type}: {count}")
    
except Exception as e:
    print(f"❌ Error creating tasks: {e}")
    tasks = []
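
How `TaskCreatorAgent` decides which URLs still need validation is not shown here. A minimal sketch of the idea, assuming previously validated URLs are available as a set, might look like this. The `url_entry` key mirrors the task shape used in the cell above; `create_tasks`, `task_id`, and `status` are hypothetical.

```python
def create_tasks(url_entries, validated_urls, force_revalidation=False):
    """Build one task per URL entry that still needs validation."""
    tasks = []
    for entry in url_entries:
        if not force_revalidation and entry["url"] in validated_urls:
            continue  # already validated in an earlier run
        tasks.append({
            "task_id": f"task-{len(tasks) + 1}",  # hypothetical field
            "status": "pending",                  # hypothetical field
            "url_entry": entry,
        })
    return tasks
```

Passing `force_revalidation=True` ignores the set of validated URLs, which is presumably what the `force_revalidation` option in `CONFIG` is meant to control.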

## Step 3: Run URL Validation

Run the multi-agent URL validation system. This will:
- Check if URLs exist and are accessible
- Analyze content to ensure appropriate targeting
- Use AI to find replacements for broken/inappropriate URLs
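
As an illustration of the targeting check, the sketch below flags URLs that look like listings rather than specific items. It inspects only the URL shape, not the page content, and the rules are assumptions chosen for the example; `looks_like_listing` is a hypothetical helper, not part of the project's agents.

```python
from urllib.parse import parse_qs, urlparse


def looks_like_listing(url):
    """Heuristic: True when a URL likely points at a listing, not a single item."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.rstrip("/")
    query = parse_qs(parsed.query)

    if host.endswith("youtube.com"):
        # A single video is /watch?v=<id>; channels, playlists, and searches are listings.
        return not (path == "/watch" and "v" in query)
    if host == "youtu.be":
        return path == ""  # short links carry the video id in the path
    # Generic fallback: a bare domain or a search page is treated as a listing.
    return path in ("", "/search")
```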

In [None]:
# Run URL validation with orchestrator
print("🚀 Starting URL validation pipeline...")
print(f"   Using {CONFIG['num_consumer_agents']} consumer agents")
print(f"   Max {CONFIG['max_concurrent_requests_per_agent']} concurrent requests per agent")
print(f"   LM Studio URL: {CONFIG['lm_studio_url']}")

try:
    orchestrator = URLValidationOrchestrator(
        num_consumer_agents=CONFIG['num_consumer_agents'],
        max_concurrent_requests_per_agent=CONFIG['max_concurrent_requests_per_agent'],
        lm_studio_url=CONFIG['lm_studio_url'],
        tasks_file=CONFIG['tasks_file'],
        results_file=CONFIG['results_file']
    )
    
    # Run validation pipeline
    start_time = datetime.now()
    summary = await orchestrator.run_validation_pipeline()
    end_time = datetime.now()
    
    duration = end_time - start_time
    print(f"\n⏱️  Validation completed in {duration.total_seconds():.1f} seconds")
    
    # Add timing information to summary for enhanced reporting
    if 'error' not in summary:
        summary['total_duration_seconds'] = duration.total_seconds()
    
except Exception as e:
    print(f"❌ Error during validation: {e}")
    summary = {'error': str(e)}
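
The orchestrator's internals are not part of this notebook. One common way to let several consumer agents share a task list is an `asyncio.Queue` that each agent drains; the sketch below assumes that pattern. `consumer`, `run_agents`, and the `validate` callback are hypothetical names.

```python
import asyncio


async def consumer(agent_id, queue, results, validate):
    """Pull tasks until the queue is empty, recording which agent handled each."""
    while True:
        try:
            task = queue.get_nowait()
        except asyncio.QueueEmpty:
            return  # nothing left to do
        results.append({"agent": agent_id, "url": task, "valid": await validate(task)})


async def run_agents(tasks, validate, num_consumer_agents=3):
    """Fill a queue with tasks and let the agents drain it concurrently."""
    queue = asyncio.Queue()
    for task in tasks:
        queue.put_nowait(task)
    results = []
    await asyncio.gather(*(
        consumer(i, queue, results, validate) for i in range(num_consumer_agents)
    ))
    return results
```

Because each task is removed from the queue by exactly one `get_nowait` call, no URL is processed twice even when several agents run at once.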

## Step 4: Review Validation Results

Review the validation results and summary statistics.

In [None]:
# Display validation summary
if 'error' not in summary:
    print("📊 Validation Summary:")
    print(f"   Total URLs processed: {summary['total_urls_processed']}")
    print(f"   Valid URLs: {summary['valid_urls']}")
    print(f"   Invalid URLs: {summary['invalid_urls']}")
    print(f"   URLs with replacements: {summary['replaced_urls']}")
    print(f"   URLs to be removed: {summary['removed_urls']}")
    print(f"   Success rate: {summary['success_rate']:.1f}%")
    
    # Time-related statistics
    if summary.get('total_duration_seconds', 0) > 0 and summary['total_urls_processed'] > 0:
        avg_time_per_url = summary['total_duration_seconds'] / summary['total_urls_processed']
        urls_per_minute = (summary['total_urls_processed'] / summary['total_duration_seconds']) * 60
        
        print("\n⏱️  Performance Statistics:")
        print(f"   Total processing time: {summary['total_duration_seconds']:.1f} seconds")
        print(f"   Average time per URL: {avg_time_per_url:.2f} seconds")
        print(f"   Processing rate: {urls_per_minute:.1f} URLs/minute")
        
        # Phase breakdown if available
        if 'phase_durations' in summary:
            phases = summary['phase_durations']
            print("\n📈 Time Breakdown by Phase:")
            for phase, duration in phases.items():
                percentage = (duration / summary['total_duration_seconds']) * 100
                print(f"   {phase}: {duration:.1f}s ({percentage:.1f}%)")
    
    print("\n📋 Breakdown by URL type:")
    for url_type, stats in summary['urls_by_type'].items():
        total = stats['total']
        valid = stats['valid']
        replaced = stats['replaced']
        removed = stats['invalid'] - stats['replaced']
        
        # Time stats per URL type if available
        type_stats = f"{url_type}: {valid}/{total} valid, {replaced} replaced, {removed} to remove"
        if 'avg_time_per_url' in stats:
            type_stats += f" (avg: {stats['avg_time_per_url']:.2f}s/URL)"
        print(f"   {type_stats}")
        
    # Show agent performance if available
    if 'agent_performance' in summary:
        print("\n🤖 Agent Performance:")
        for agent_id, perf in summary['agent_performance'].items():
            urls_processed = perf.get('urls_processed', 0)
            time_spent = perf.get('time_spent', 0)
            avg_time = time_spent / urls_processed if urls_processed > 0 else 0
            print(f"   Agent {agent_id}: {urls_processed} URLs, {time_spent:.1f}s total, {avg_time:.2f}s/URL")
else:
    print(f"❌ Validation failed: {summary['error']}")
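
To make the summary fields concrete, here is a sketch of how a summary with the keys printed above could be aggregated from per-URL results. The result record shape (`url_type`, `is_valid`, `replacement_url`) and the definition of `success_rate` as the share of valid URLs are assumptions; the orchestrator may compute them differently.

```python
def summarize(results):
    """Aggregate per-URL results into the summary shape printed above."""
    by_type = {}
    for r in results:
        stats = by_type.setdefault(
            r["url_type"], {"total": 0, "valid": 0, "invalid": 0, "replaced": 0}
        )
        stats["total"] += 1
        if r["is_valid"]:
            stats["valid"] += 1
        else:
            stats["invalid"] += 1
            if r.get("replacement_url"):
                stats["replaced"] += 1

    total = len(results)
    valid = sum(s["valid"] for s in by_type.values())
    invalid = total - valid
    replaced = sum(s["replaced"] for s in by_type.values())
    return {
        "total_urls_processed": total,
        "valid_urls": valid,
        "invalid_urls": invalid,
        "replaced_urls": replaced,
        "removed_urls": invalid - replaced,  # invalid with no replacement found
        # Assumption: success rate is the share of URLs that were valid as found.
        "success_rate": (valid / total * 100) if total else 0.0,
        "urls_by_type": by_type,
    }
```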

## Step 5: Update data.js File (Optional)

Update the original data.js file with the validation results.

In [None]:
# Update data.js with validation results
if CONFIG['update_data_js'] and 'error' not in summary:
    print("🔄 Updating data.js with validation results...")
    
    try:
        success = await orchestrator.update_data_js_with_results(CONFIG['data_js_file'])
        if success:
            print("✅ data.js updated successfully")
        else:
            print("❌ Failed to update data.js")
    except Exception as e:
        print(f"❌ Error updating data.js: {e}")
else:
    print("⏭️  Skipping data.js update")
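
What `update_data_js_with_results` does internally is not shown. A cautious way to rewrite a file like data.js is to keep a backup and replace only exact, quoted URL matches. The sketch below assumes that approach; `apply_url_changes` and `update_file_with_backup` are hypothetical helpers.

```python
import re
import shutil
from pathlib import Path

# A quoted http(s) URL; the closing quote must match the opening one.
URL_IN_STRING = re.compile(r"""(["'`])(https?://[^"'`\s]+)\1""")


def apply_url_changes(js_text, replacements):
    """Swap each quoted URL that appears as a key in replacements."""
    def swap(match):
        quote, url = match.group(1), match.group(2)
        return f"{quote}{replacements.get(url, url)}{quote}"

    # One pass over the text, so a replacement is never itself re-replaced.
    return URL_IN_STRING.sub(swap, js_text)


def update_file_with_backup(path, replacements):
    """Write a .bak copy next to the file, then rewrite the file in place."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    original = path.read_text(encoding="utf-8")
    path.write_text(apply_url_changes(original, replacements), encoding="utf-8")
```

Matching whole quoted strings avoids accidentally rewriting a longer URL that merely starts with one of the replaced URLs.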

## MCP Tools Configuration

To configure MCP (Model Context Protocol) tools for enhanced functionality:

### Available MCP Tools
1. **firecrawl**: Web scraping and content extraction
2. **context7**: Library documentation search
3. **brave-search**: Web search capabilities

### Configuration Steps
1. Install MCP server packages
2. Configure MCP client in your LLM setup
3. Update the `find_replacement_url` method to use MCP tools

### Example MCP Integration
```python
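# Add to URLValidatorAgent.__init__
# MCPClient stands in for whichever MCP client wrapper you use;
# the server URLs and ports below are examples only.
self.mcp_client = MCPClient(
    server_configs={
        'firecrawl': {'url': 'http://localhost:3000'},
        'context7': {'url': 'http://localhost:3001'},
        'brave-search': {'url': 'http://localhost:3002'}
    }
)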

# Use in find_replacement_url method
search_results = await self.mcp_client.search('brave-search', query)
scraped_content = await self.mcp_client.scrape('firecrawl', url)
```

### Benefits of MCP Integration
- **Enhanced search**: Use Brave Search for finding replacements
- **Better scraping**: Use Firecrawl for content validation
- **Documentation lookup**: Use Context7 for library-specific resources
- **Fallback mechanisms**: Multiple tools for robust URL finding
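
The fallback idea in the last bullet can be sketched as a chain that tries each tool in order and returns the first usable result. The finder callables here are hypothetical stand-ins for MCP-backed lookups, and `find_replacement` is not the project's `find_replacement_url` method.

```python
import asyncio


async def find_replacement(query, finders):
    """Try each (name, finder) pair in order; return the first non-empty hit."""
    errors = {}
    for name, finder in finders:
        try:
            candidate = await finder(query)
        except Exception as exc:  # a failing tool should not stop the chain
            errors[name] = str(exc)
            continue
        if candidate:
            return {"url": candidate, "source": name, "errors": errors}
    return {"url": None, "source": None, "errors": errors}
```

Recording which tool produced the result, and which ones failed, makes it easier to spot a misconfigured server later.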

## Troubleshooting

### Common Issues

**LM Studio Connection Issues**
```bash
# Check LM Studio status
curl http://localhost:1234/v1/models

# Verify model is loaded
# Restart LM Studio if needed
```
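
The same check can be run from a notebook cell. This sketch assumes LM Studio's OpenAI-compatible `/v1/models` endpoint returns a JSON object with a `data` list of models, each carrying an `id`; `parse_model_ids` and `list_loaded_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [model["id"] for model in json.loads(payload).get("data", [])]


def list_loaded_models(base_url="http://localhost:1234", timeout=5):
    """Return model ids, or None when the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return parse_model_ids(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None
```

A `None` result suggests the server is not running or is listening on a different port; an empty list suggests no model is loaded.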

**Rate Limiting**

Reduce the concurrent requests per agent in config.json, or lower `max_concurrent_requests_per_agent` in the `CONFIG` cell above:

```json
{
  "agents": {
    "max_concurrent_requests_per_agent": 3
  }
}
```

**Memory Issues**

Process URLs in smaller batches by lowering the batch size:

```json
{
  "processing": {
    "batch_size": 25
  }
}
```

**URL Validation Issues**
- Some sites block automated requests - consider using proxies
- PDFs and binary content may not be properly analyzed
- GitHub rate limiting may affect repository checks
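
For transient failures such as rate limiting, retrying with exponential backoff is a common mitigation. The sketch below is one generic way to do it; `with_backoff` and its parameters are hypothetical and not part of the validation agents.

```python
import asyncio
import random


async def with_backoff(operation, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Await operation(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
```

Narrowing `retry_on` to connection or rate-limit errors avoids retrying failures that will never succeed, such as a genuinely missing page.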

## Next Steps

After running this validation system:

1. **Review changes**: Check the updated data.js file
2. **Manual verification**: Spot-check some URLs to ensure replacements are appropriate
3. **Re-run periodically**: URLs can break over time, so re-validation is recommended
4. **Extend functionality**: Add more URL types or validation rules as needed

### Potential Enhancements
- **Content freshness checking**: Verify that content is still relevant
- **Duplicate detection**: Find and remove duplicate URLs
- **Quality scoring**: Rate URLs by content quality
- **Automated scheduling**: Set up regular validation runs
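
As a starting point for the duplicate-detection enhancement, URLs can be grouped by a normalized form. The normalization rules below (lowercase scheme and host, drop the trailing slash and fragment) are assumptions chosen for the example, and both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the trailing slash and the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_duplicates(urls):
    """Map each normalized URL to its original spellings when there are two or more."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```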