In [None]:
# Web Scraping and Analysis Toolkit

A configurable web scraping toolkit that analyzes a website's crawlability, extracts content, handles JavaScript-rendered pages, and generates detailed reports.

## Features
- 🤖 Robots.txt analysis and compliance checking
- 📰 Generic content extraction with configurable selectors
- 🌐 JavaScript support with Playwright and Selenium
- 📡 RSS feed parsing and analysis
- 🗺️ Sitemap structure analysis
- 📊 Data export (CSV, JSON) and visualization
- 📈 Crawlability scoring and recommendations

## Quick Start

1. Install dependencies: `pip install -r requirements.txt`
2. Copy `config.example.json` to `config.json` and configure for your target website
3. Run the analysis!
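Before running the analysis, it can help to sanity-check `config.json`. The sketch below is one possible validator and is not part of the toolkit: `validate_config` and `load_config` are hypothetical names, and the key names are assumed from the demo configuration used later in this notebook.

```python
import json
from urllib.parse import urlparse

# Keys assumed from the demo configuration shown later in this notebook.
REQUIRED_KEYS = ("base_url", "max_articles", "crawl_delay", "selectors")


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    parsed = urlparse(str(config.get("base_url", "")))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("base_url must be an absolute http(s) URL")

    max_articles = config.get("max_articles", 1)
    if not isinstance(max_articles, int) or max_articles < 1:
        problems.append("max_articles must be a positive integer")

    crawl_delay = config.get("crawl_delay", 0)
    if not isinstance(crawl_delay, (int, float)) or crawl_delay < 0:
        problems.append("crawl_delay must be a non-negative number")

    if not isinstance(config.get("selectors", {}), dict):
        problems.append("selectors must map names to CSS selectors")

    return problems


def load_config(path: str = "config.json") -> dict:
    """Read a JSON config file and raise ValueError if validation fails."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

Calling `load_config()` once at the top of a run surfaces a mistyped URL or a negative delay before any request is sent.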


In [None]:
# Install required packages
!pip install -r requirements.txt

# If playwright is needed for JavaScript content
!playwright install chromium

## 🚀 Quick Demo - Setup and Configuration

Let's start by importing our modular web scraper and setting up configuration.

In [None]:
import sys
import os
sys.path.append('src')

# Import our web scraper components
from src.web_scraper import WebScraper
from src.config import Config

# For demonstration, we'll create a sample configuration
# In practice, you would modify config.json
demo_config = {
    "base_url": "https://example-news-site.com",  # Replace with your target website
    "news_section_path": "/news",
    "max_articles": 50,
    "crawl_delay": 2,
    "output_formats": ["csv", "json"],
    "user_agent": "WebScraper/1.0 (Educational Research)",
    "timeout": 30,
    "headless_browser": True,
    "respect_robots_txt": True,
    "selectors": {
        "article_links": "article a[href*='/news/']",  # Adjust for your target site
        "title": "h1, .article-title, .headline",
        "content": "p, .article-content, .story-body",
        "date": "time, .publish-date, .timestamp",
        "category": ".category, .topic, .section",
        "image": ".article-image img, .featured-image img"
    },
    "rss_feeds": []  # Add RSS feed URLs if available
}

print("✅ Configuration loaded!")
print(f"Target website: {demo_config['base_url']}")
print(f"Max articles: {demo_config['max_articles']}")
print(f"Respect robots.txt: {demo_config['respect_robots_txt']}")

In [None]:
# Create a temporary config file for this demo
import json

# Save the demo config to a file in the working directory
# (it is removed again in the cleanup step at the end of the notebook)
temp_config_path = "temp_config.json"
with open(temp_config_path, 'w') as f:
    json.dump(demo_config, f, indent=2)

print(f"✅ Created temporary configuration file: {temp_config_path}")

# Initialize the web scraper with our configuration
try:
    scraper = WebScraper(config_path=temp_config_path, log_level="INFO")
    print("✅ WebScraper initialized successfully!")
except Exception as e:
    print(f"❌ Error initializing scraper: {e}")
    print("Make sure to update the base_url in the config to a real website.")

In [None]:
## 🔍 Step 1: Analyze Robots.txt and Crawlability

# Analyze the target website's robots.txt
robots_analysis = scraper.analyze_robots_txt()

# Print detailed analysis
scraper.crawlability_analyzer.print_analysis_summary(robots_analysis)

# Calculate crawlability score
crawlability_score = scraper.crawlability_analyzer.calculate_crawlability_score(robots_analysis)
print(f"\n🎯 Crawlability Score: {crawlability_score}/100")

if crawlability_score >= 80:
    print("✅ Excellent crawlability!")
elif crawlability_score >= 60:
    print("⚠️ Good crawlability with some restrictions")
elif crawlability_score >= 40:
    print("⚠️ Moderate crawlability - check restrictions carefully")
else:
    print("❌ Limited crawlability - proceed with caution")


🔍 robots.txt Analysis Summary
URL: https://www.aljazeera.net/robots.txt
Status: robots.txt analyzed successfully
Crawling Allowed: ✅ Yes

📜 Crawl Rules for User-Agent '*':

✅ Allowed Paths:
  - /search/$

❌ Disallowed Paths:
  - /api
  - /asset-manifest.json
  - /search/
  - /home/search?q=

🕓 Crawl Delay:
  Not specified

🗺️ Sitemap Links:
  1. https://www.aljazeera.net/sitemap.xml
  2. https://www.aljazeera.net/news-sitemap.xml
  3. https://www.aljazeera.net/sitemaps/article-archive.xml
  4. https://www.aljazeera.net/sitemaps/article-new.xml
  5. https://www.aljazeera.net/sitemaps/video-archive.xml
  6. https://www.aljazeera.net/sitemaps/video-new.xml
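
For readers who want to see what a robots.txt check involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the toolkit's own analyzer: the helper `is_allowed` is a hypothetical name, and the rules are a subset copied from the summary above. The standard-library parser does plain prefix matching, so the `/search/$` allow rule is left out of the sample because `$` anchors are not interpreted.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the analysis summary above. A live run would instead
# call parser.set_url(".../robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /api
Disallow: /asset-manifest.json
Disallow: /search/
Sitemap: https://www.aljazeera.net/sitemap.xml
Sitemap: https://www.aljazeera.net/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://www.aljazeera.net/news/"))         # True
print(is_allowed("https://www.aljazeera.net/api/articles"))  # False
print(parser.crawl_delay("*"))                               # None (not specified)
print(parser.site_maps())                                    # the two Sitemap URLs
```

Checking each candidate URL with a helper like this before fetching is one way to honor the `respect_robots_txt` setting.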



## 📰 Step 2: Extract Articles and Content

Now let's extract articles from the website using our intelligent content extractor.
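
To make the idea of selector-driven extraction concrete, the sketch below mimics the demo `article_links` selector (`article a[href*='/news/']`) using only the standard library. It is an illustration under assumptions, not the toolkit's extractor: `ArticleLinkParser` and `extract_article_links` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect links inside <article> elements whose href contains a marker."""

    def __init__(self, base_url: str, marker: str = "/news/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.article_depth = 0  # > 0 while inside an <article> element
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "a" and self.article_depth > 0:
            href = dict(attrs).get("href") or ""
            if self.marker in href:
                absolute = urljoin(self.base_url, href)
                if absolute not in self.links:  # de-duplicate, keep order
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1


def extract_article_links(html: str, base_url: str) -> list[str]:
    """Return absolute article URLs found in the given HTML."""
    link_parser = ArticleLinkParser(base_url)
    link_parser.feed(html)
    return link_parser.links


sample_html = """
<article><a href="/news/2025/story-1">Story 1</a></article>
<article><a href="/sport/match">Match</a></article>
<a href="/news/outside">Outside any article</a>
"""
print(extract_article_links(sample_html, "https://example-news-site.com"))
# ['https://example-news-site.com/news/2025/story-1']
```

Only the first link matches: the second lacks `/news/` in its `href`, and the third sits outside any `<article>` element.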

In [None]:
# Note: Selenium is already included in requirements.txt
# If you need to install ChromeDriver separately, uncomment the next line:
# !pip install webdriver-manager

print("ℹ️ Selenium should already be installed from requirements.txt")
print("ℹ️ Make sure Chrome browser is installed on your system")


In [None]:
# Extract articles from the website
print("🔄 Starting article extraction...")
print("ℹ️ This may take a few minutes depending on the website and number of articles")

# For demo purposes, let's extract a smaller number of articles
max_articles_demo = 10  # Reduce for demo

articles = []  # Default so later cells still run if extraction fails

try:
    articles = scraper.extract_articles(
        max_articles=max_articles_demo,
        max_show_more_clicks=2  # Limit for demo
    )

    print(f"\n✅ Successfully extracted {len(articles)} articles!")

    # Display sample of extracted articles
    if articles:
        print("\n📋 Sample of extracted articles:")
        for i, article in enumerate(articles[:3], 1):
            print(f"\n--- Article {i} ---")
            # `or` also covers keys that are present but set to None
            print(f"Title: {(article.get('title') or 'N/A')[:100]}...")
            print(f"Link: {article.get('link', 'N/A')}")
            print(f"Category: {article.get('category', 'N/A')}")
            print(f"Published: {article.get('published_at', 'N/A')}")
            print(f"Content Length: {len(article.get('content') or [])} paragraphs")
    else:
        print("⚠️ No articles were extracted. This might be because:")
        print("  - The website structure doesn't match our selectors")
        print("  - The target website is not accessible")
        print("  - Robots.txt restrictions prevent crawling")
        print("  - The website requires JavaScript (try updating selectors)")

except Exception as e:
    print(f"❌ Error during article extraction: {e}")
    print("💡 Try updating the CSS selectors in the configuration to match your target website")

‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (1)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (2)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (3)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (4)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (5)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (6)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (7)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (8)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (9)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (10)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (11)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (12)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (13)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (14)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (15)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (16)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (17)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (18)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (19)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (20)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (21)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (22)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (23)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (24)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' (25)
‚úÖ Clicked 'ÿπÿ±ÿ∂ ÿßŸÑŸÖÿ≤ŸäÿØ' 

In [None]:
# Generate comprehensive analysis report
print("📊 Generating comprehensive analysis report...")

# Create analysis summary
if articles:
    analysis_summary = scraper.report_generator.generate_analysis_summary(
        articles=articles,
        crawlability_result=robots_analysis,
        sitemap_structure=None  # We'll add sitemap analysis later
    )

    # Print formatted report
    scraper.report_generator.print_summary_report(analysis_summary)

    # Export results
    export_paths = scraper.export_results({
        'articles': articles,
        'robots_analysis': robots_analysis,
        'analysis_summary': analysis_summary,
        'crawlability_score': crawlability_score
    })

    print("\nResults exported to:")
    for format_type, path in export_paths.items():
        print(f"  - {format_type.upper()}: {path}")

else:
    print("⚠️ No articles to analyze. Skipping report generation.")
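
As a rough picture of what a CSV/JSON export step can look like, here is a standard-library sketch. It does not reproduce `scraper.export_results`: the function name `export_articles`, the file names, and the choice to join list values into one cell are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def export_articles(articles: list[dict], output_dir: str = "output") -> dict[str, Path]:
    """Write articles to articles.json and articles.csv; return the file paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # ensure_ascii=False keeps non-Latin text (e.g. Arabic titles) readable.
    json_path = out / "articles.json"
    json_path.write_text(
        json.dumps(articles, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    # Use the union of all keys so articles with missing fields still export.
    fieldnames = sorted({key for article in articles for key in article})
    csv_path = out / "articles.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for article in articles:
            # Flatten list values (such as paragraphs) into a single cell.
            row = {
                key: " ".join(map(str, value)) if isinstance(value, list) else value
                for key, value in article.items()
            }
            writer.writerow(row)

    return {"json": json_path, "csv": csv_path}
```

JSON preserves nested fields such as paragraph lists, while the CSV variant trades that structure for easy loading into a spreadsheet.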

In [None]:
# Generate visualizations
print("📈 Creating data visualizations...")

if articles:
    try:
        viz_path = scraper.report_generator.create_visualization(articles)
        if viz_path:
            print(f"✅ Visualization saved to: {viz_path}")

            # Display the visualization in the notebook
            from IPython.display import Image, display
            display(Image(filename=str(viz_path)))
        else:
            print("⚠️ Could not create visualization")

    except Exception as e:
        print(f"❌ Error creating visualization: {e}")
        print("This might be due to missing matplotlib or data issues")
else:
    print("⚠️ No articles available for visualization")

## 🌐 Step 3: JavaScript Content and RSS Feeds

Let's explore JavaScript handling and RSS feed parsing capabilities.
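
One common way to decide whether a page needs a browser is to compare the length of the raw HTTP response with the length of the browser-rendered page. The sketch below shows that comparison as a plain function. The 50% growth threshold and the name `requires_javascript` are assumptions for illustration; the toolkit's own criterion may differ.

```python
def requires_javascript(static_length: int, rendered_length: int,
                        threshold: float = 0.5) -> bool:
    """Guess whether a page depends on JavaScript for its content.

    Returns True when the rendered content is more than `threshold`
    (as a fraction) longer than the content of the raw HTTP response.
    """
    if rendered_length <= 0:
        return False  # nothing rendered, so no evidence either way
    if static_length <= 0:
        return True   # all content appeared only after rendering
    growth = (rendered_length - static_length) / static_length
    return growth > threshold


print(requires_javascript(static_length=12_000, rendered_length=13_000))  # False
print(requires_javascript(static_length=2_000, rendered_length=40_000))   # True
```

A length comparison is a coarse signal: a page can load its key content through JavaScript without growing much, so spot-checking the extracted text remains worthwhile.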

In [None]:
# Analyze JavaScript requirements for the website
print("🔍 Analyzing JavaScript content requirements...")

try:
    js_analysis = scraper.js_handler.analyze_javascript_content(demo_config['base_url'])

    print("\n📋 JavaScript Analysis Results:")
    print(f"URL: {js_analysis.get('url', 'N/A')}")
    print(f"Static Content Length: {js_analysis.get('static_content_length', 'N/A')} characters")

    if 'js_content_length' in js_analysis:
        print(f"JavaScript Content Length: {js_analysis['js_content_length']} characters")
        print(f"Content Difference: {js_analysis.get('content_difference', 0)} characters")

        requires_js = js_analysis.get('requires_javascript', False)
        if requires_js:
            print("⚠️ This website heavily relies on JavaScript for content rendering")
            print("💡 Consider using Playwright or Selenium for better content extraction")
        else:
            print("✅ This website renders most content without JavaScript")
    else:
        print("ℹ️ Playwright not available - install for JavaScript content analysis")

except Exception as e:
    print(f"❌ Error analyzing JavaScript content: {e}")
    print("This might be due to network issues or website restrictions")

الجزيرة نت: آخر أخبار اليوم حول العالم

[Raw page text: navigation menu and section links, truncated in the original output]

In [None]:
# Install Playwright browsers (if not already done)
# This is optional and only needed for JavaScript-heavy websites

print("ℹ️ To enable full JavaScript support, run:")
print("  pip install playwright")
print("  playwright install chromium")
print("")
print("⚠️ Note: This requires a few hundred MB of browser downloads")

# Uncomment the next line to install Playwright browsers
# !playwright install chromium

CompletedProcess(args=['playwright', 'install'], returncode=0)

In [None]:
# Demonstrate Playwright usage for JavaScript content (if available)
print("🎭 Testing Playwright JavaScript handling...")

try:
    # Jupyter already runs an event loop, so asyncio.run() would raise
    # RuntimeError here; await the coroutine directly instead.
    # (In a plain script, wrap this call in asyncio.run(...).)
    content = await scraper.js_handler.fetch_with_playwright(demo_config['base_url'])

    if content:
        print("✅ Successfully fetched content with Playwright")
        print(f"Content length: {len(content)} characters")
        print(f"Sample content: {content[:200]}...")
    else:
        print("⚠️ Playwright available but couldn't fetch content")

except ImportError:
    print("ℹ️ Playwright not installed - using standard HTTP requests only")
    print("💡 Install with: pip install playwright && playwright install")

except Exception as e:
    print(f"❌ Error with Playwright: {e}")
    print("This might be due to network issues or missing browser installation")

<!DOCTYPE html><html lang="ar" dir="rtl" class="theme-aja" style="--vh: 7.2px; --clientWidth: 1280px;"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta http-equiv="content-type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><link rel="shortcut icon" href="/favicon_aja.ico"><title data-reactroot="">أخبار العالم | الجزيرة نت</title><script async="" src="https://connect.facebook.net/en_US/fbevents.js"></script><script type="text/javascript" async="" src="https://www.googletagmanager.com/gtag/js?id=AW-947178488&amp;cx=c&amp;gtm=45je55k1h1v894171536za200&amp;tag_exp=101509157~103116026~103130495~103130497~103136993~103136995~103200004~103233427~103252644~103252646~103301114~103301116"></script><script type="text/javascript" async="" src="https://static.ads-twitter.com/uwt.js"></script><script type="text/javascript" async="" src="https://secure.quantserv

In [None]:
# Demonstrate RSS feed parsing
print("📡 Testing RSS feed parsing capabilities...")

# Example RSS feeds (replace with actual feeds from your target website)
example_rss_feeds = [
    "https://feeds.bbci.co.uk/news/rss.xml",  # BBC News (example)
    "https://rss.cnn.com/rss/edition.rss",    # CNN (example)
]

print("ℹ️ For demonstration, using example RSS feeds")
print("💡 Update the 'rss_feeds' in your config.json with actual feeds from your target website")

try:
    # Fetch RSS entries
    rss_entries = scraper.fetch_rss_feeds(example_rss_feeds[:1])  # Just one for demo

    if rss_entries:
        print(f"\n✅ Successfully fetched {len(rss_entries)} RSS entries")

        # Display sample entries
        for i, entry in enumerate(rss_entries[:3], 1):
            print(f"\n--- RSS Entry {i} ---")
            print(f"Title: {entry.get('title', 'N/A')[:80]}...")
            print(f"Link: {entry.get('link', 'N/A')}")
            print(f"Published: {entry.get('published', 'N/A')}")
            print(f"Source: {entry.get('feed_title', 'N/A')}")
    else:
        print("⚠️ No RSS entries found or feeds not accessible")

except Exception as e:
    print(f"❌ Error fetching RSS feeds: {e}")
    print("This might be due to network issues or invalid RSS URLs")

Title: حادث "خطير" خلال تدشين سفينة حربية بكوريا الشمالية وكيم غاضب
Link: https://www.ajnet.me/news/2025/5/22/%d8%b9%d8%a7%d8%ac%d9%84-%d8%b1%d9%88%d9%8a%d8%aa%d8%b1%d8%b2-%d8%b9%d9%86-%d9%88%d9%83%d8%a7%d9%84%d8%a9-%d8%a7%d9%84%d8%a3%d9%86%d8%a8%d8%a7%d8%a1-%d8%a7%d9%84%d9%85%d8%b1%d9%83%d8%b2%d9%8a%d8%a9-2?traffic_source=rss

Title: إسرائيل: اعتراض صاروخ أُطلق من اليمن
Link: https://www.ajnet.me/news/2025/5/22/%d8%b9%d8%a7%d8%ac%d9%84-%d8%a7%d9%84%d8%ac%d8%a8%d9%87%d8%a9-%d8%a7%d9%84%d8%af%d8%a7%d8%ae%d9%84%d9%8a%d8%a9-%d8%a7%d9%84%d8%a5%d8%b3%d8%b1%d8%a7%d8%a6%d9%8a%d9%84%d9%8a%d8%a9-8?traffic_source=rss

Title: نتنياهو ينفي صحة ما يتردد عن خلاف مع ترامب
Link: https://www.ajnet.me/news/2025/5/22/%d9%86%d8%aa%d9%86%d9%8a%d8%a7%d9%87%d9%88-%d9%8a%d9%86%d9%81%d9%8a-%d8%b5%d8%ad%d8%a9-%d9%85%d8%a7-%d9%8a%d8%aa%d8%b1%d8%af%d8%af-%d8%b9%d9%86-%d8%ae%d9%84%d8%a7%d9%81-%d9%85%d8%
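
To show where fields like `title`, `link`, `published`, and `feed_title` come from, here is a minimal RSS 2.0 parser built on the standard library. It is a sketch, not the toolkit's `fetch_rss_feeds`: the function name `parse_rss` and the sample feed are made up, and a dedicated feed library would cope better with Atom feeds and malformed XML.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/news/1</link>
      <pubDate>Thu, 22 May 2025 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss(xml_text: str) -> list[dict]:
    """Return one dict per <item>, using 'N/A' for missing fields."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return []
    feed_title = channel.findtext("title", default="N/A")
    entries = []
    for item in channel.findall("item"):
        entries.append({
            "title": item.findtext("title", default="N/A"),
            "link": item.findtext("link", default="N/A"),
            "published": item.findtext("pubDate", default="N/A"),
            "feed_title": feed_title,
        })
    return entries


for entry in parse_rss(SAMPLE_RSS):
    print(entry["title"], "|", entry["published"])
```

The second item has no `pubDate`, so its `published` value falls back to `'N/A'`, matching the defaults used when printing entries in the cell above.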

## 🗺️ Step 4: Sitemap Analysis and Full Report

Let's analyze the website's sitemap structure and generate a comprehensive final report.

In [None]:
# Analyze website sitemap
print("🗺️ Analyzing website sitemap structure...")

try:
    sitemap_analysis = scraper.analyze_sitemap()

    print("\n📋 Sitemap Analysis Results:")
    print(f"Sitemap URL: {sitemap_analysis.get('sitemap_url', 'N/A')}")
    print(f"Total URLs found: {sitemap_analysis.get('total_urls', 0)}")

    # Display sitemap structure
    structure = sitemap_analysis.get('structure', {})
    if structure:
        print("\n🏗️ Website Structure:")
        for section, subsections in structure.items():
            print(f"  📁 {section}:")
            for subsection in subsections[:5]:  # Show first 5
                print(f"    - {subsection}")
            if len(subsections) > 5:
                print(f"    ... and {len(subsections) - 5} more")

    # Display sample URLs
    sample_urls = sitemap_analysis.get('sample_urls', [])
    if sample_urls:
        print("\n🔗 Sample URLs:")
        for url in sample_urls[:3]:
            print(f"  - {url}")
        if len(sample_urls) > 3:
            print(f"  ... and {len(sample_urls) - 3} more URLs")

except Exception as e:
    print(f"❌ Error analyzing sitemap: {e}")
    print("This might be due to:")
    print("  - No sitemap.xml file available")
    print("  - Network connectivity issues")
    print("  - Invalid sitemap format")

In [None]:
# Run comprehensive full analysis
print("🔄 Running complete website analysis...")
print("This will combine all previous analyses into a comprehensive report")

try:
    # Run full analysis with all components
    full_results = scraper.run_full_analysis(
        max_articles=max_articles_demo,
        include_rss=True,
        include_sitemap=True,
        create_visualizations=True
    )

    print("\n🎉 Full analysis completed successfully!")

    # Display key metrics
    print("\n📊 Key Metrics:")
    print(f"  - Crawlability Score: {full_results.get('crawlability_score', 'N/A')}/100")
    print(f"  - Articles Extracted: {len(full_results.get('articles', []))}")
    print(f"  - RSS Entries: {len(full_results.get('rss_entries', []))}")

    sitemap_info = full_results.get('sitemap_analysis', {})
    if sitemap_info:
        print(f"  - Sitemap URLs: {sitemap_info.get('total_urls', 'N/A')}")

    # Show export information
    if 'visualization_path' in full_results:
        print(f"  - Visualization: {full_results['visualization_path']}")

    print("\n💾 All results have been exported to the output directory")

except Exception as e:
    print(f"❌ Error during full analysis: {e}")
    print("This might be due to configuration issues or website accessibility problems")

Constructed sitemap dict:
{'News': ['Liveblog', '2025'], 'Politics': ['2025'], 'Videos': ['2025'], 'Opinions': ['2025'], 'Sport': ['2025'], 'Programs': ['Behindthenews'], 'Culture': ['2025'], 'Misc': ['2025'], 'Encyclopedia': ['2025'], 'Climate': ['2025'], 'Ebusiness': ['2025'], 'Science': ['2025'], 'Health': ['2025'], 'Tech': ['2025'], 'Lifestyle': ['2025'], 'Family': ['2025']}
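
A dictionary shaped like the one above can be derived from sitemap URLs by grouping on path segments. The sketch below shows one way to do that with the standard library. The function name `build_structure`, the capitalization rule, and the sample sitemap are assumptions; the toolkit's actual grouping logic is not shown in this notebook.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A made-up sitemap standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/liveblog/2025/5/22/a</loc></url>
  <url><loc>https://example.com/news/2025/5/22/b</loc></url>
  <url><loc>https://example.com/sport/2025/5/21/c</loc></url>
</urlset>
"""


def build_structure(sitemap_xml: str) -> dict[str, list[str]]:
    """Group URLs by first path segment, listing distinct second segments."""
    root = ET.fromstring(sitemap_xml)
    structure: dict[str, list[str]] = {}
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlparse((loc.text or "").strip()).path
        segments = [segment for segment in path.split("/") if segment]
        if not segments:
            continue  # skip the site root
        subsections = structure.setdefault(segments[0].capitalize(), [])
        if len(segments) > 1:
            subsection = segments[1].capitalize()
            if subsection not in subsections:
                subsections.append(subsection)
    return structure


print(build_structure(SAMPLE_SITEMAP))
# {'News': ['Liveblog', '2025'], 'Sport': ['2025']}
```

Year segments such as `2025` show up as subsections because many news sites place the publication date directly after the section name.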


In [None]:
# Get personalized recommendations
print("💡 Getting personalized scraping recommendations...")

try:
    recommendations = scraper.get_recommendations()

    print(f"\n📋 Recommendations for scraping {demo_config['base_url']}:")
    for i, rec in enumerate(recommendations, 1):
        print(f"  {i}. {rec}")

    print("\n🔧 Configuration Tips:")
    print("  - Update CSS selectors in config.json to match your target website")
    print("  - Adjust crawl_delay based on robots.txt recommendations")
    print("  - Test with a small number of articles first")
    print("  - Monitor for website structure changes")
    print("  - Always respect robots.txt and terms of service")

except Exception as e:
    print(f"❌ Error getting recommendations: {e}")

In [None]:
# 🎯 How to Use This Toolkit for Your Own Website

print("🎯 To use this toolkit for your own website:")
print()
print("1️⃣ Update Configuration:")
print("   - Copy config.example.json to config.json")
print("   - Change 'base_url' to your target website")
print("   - Update CSS selectors to match the website's structure")
print("   - Add RSS feed URLs if available")
print()
print("2️⃣ Test and Iterate:")
print("   - Start with a small number of articles (max_articles: 10)")
print("   - Test different CSS selectors")
print("   - Check robots.txt compliance")
print()
print("3️⃣ Scale Up:")
print("   - Increase max_articles gradually")
print("   - Implement proper error handling")
print("   - Add rate limiting and delays")
print()
print("4️⃣ Best Practices:")
print("   - Always respect robots.txt")
print("   - Use appropriate crawl delays")
print("   - Monitor website terms of service")
print("   - Test regularly for structure changes")
print()
print("🔗 GitHub Repository: Upload your configured version to GitHub!")
print("📚 Documentation: See README.md for detailed instructions")

# Cleanup temporary files
import os
if os.path.exists(temp_config_path):
    os.remove(temp_config_path)
    print("\n🧹 Cleaned up temporary config file")

Crawlability Score: 26



In [None]:
## 🚀 Ready for GitHub!

This generic web scraping toolkit is now ready for GitHub upload. Here's what you have:

### ✅ **What's Included:**
- **Modular Python code** in `src/` directory
- **Configuration system** with `config.example.json`
- **Requirements file** with all dependencies
- **Comprehensive README** with installation and usage instructions
- **Generic notebook** that works with any website

### 🔧 **Key Improvements Made:**
- ✅ Removed hardcoded URLs (now configurable)
- ✅ Added proper error handling and logging
- ✅ Modular architecture for easy maintenance
- ✅ Type hints and documentation
- ✅ Configurable CSS selectors for any website
- ✅ Rate limiting and robots.txt compliance
- ✅ Multiple export formats (CSV, JSON)
- ✅ Data visualization capabilities
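
Rate limiting, mentioned in the list above, usually comes down to enforcing a minimum gap between requests. The class below is a small sketch of that idea and is not taken from the toolkit's source: `RateLimiter` is a hypothetical name, and the injectable `clock` and `sleep` parameters exist only to make the behavior easy to test.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay: float, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep just long enough to honor min_delay; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept
```

A crawler could create `RateLimiter(2)` to match the demo `crawl_delay` of 2 seconds and call `wait()` immediately before each request; time already spent parsing the previous page counts toward the delay.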

### 📁 **Repository Structure:**
```
your-repo/
├── src/                          # Modular Python code
│   ├── __init__.py
│   ├── web_scraper.py           # Main scraper class
│   ├── config.py                # Configuration management
│   ├── crawlability_analyzer.py # Robots.txt analysis
│   ├── content_extractor.py     # Article extraction
│   ├── js_handler.py           # JavaScript & RSS handling
│   └── report_generator.py     # Data export & visualization
├── config.example.json         # Example configuration
├── requirements.txt            # Python dependencies
├── README.md                   # Comprehensive documentation
├── IR_Project.ipynb           # This demo notebook
└── output/                    # Generated reports and data
```

### 🌟 **Next Steps:**
1. Test with your target website
2. Customize CSS selectors in `config.json`
3. Add your own features and improvements
4. Share with the community!

**Happy scraping! 🕷️✨**