# Web Crawling for Policy Analysis
## GCAP3226: Empowering Citizens through Data

**Learning Objectives:**
1. Understand ethical web crawling practices
2. Analyze robots.txt and sitemaps
3. Identify pages with quantitative data
4. Extract and organize data from government websites
5. Apply these techniques to policy research

**Case Study:** Cyberdefender.hk - Hong Kong Government Cybersecurity Portal

## Part 1: Understanding Web Crawling Ethics

### What is robots.txt?
The `robots.txt` file, served from the root of a site, tells web crawlers which paths they may request. It is advisory rather than technically enforced, so honoring it is a fundamental part of web crawling ethics.

**Key Rules:**
- Always check robots.txt before crawling
- Respect the directives (Disallow, Allow)
- Implement rate limiting to avoid overloading servers
- Identify your crawler with a descriptive User-Agent

Let's check cyberdefender.hk's robots.txt:

In [1]:
import requests
import time
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import re
import pandas as pd
from urllib.parse import urljoin, urlparse

# Target website
BASE_URL = "https://cyberdefender.hk"

# Prepare a session with a polite User-Agent
session = requests.Session()
session.headers.update({
    "User-Agent": "GCAP3226-Student-Crawler/1.0 (Educational purpose)"
})

# Fetch and analyze robots.txt
robots_url = f"{BASE_URL}/robots.txt"
print("robots.txt content:")
print("=" * 70)
try:
    response = session.get(robots_url, timeout=10)
    response.raise_for_status()
    text = response.text
    print(text)
    print("=" * 70)

    # Parse robots.txt (simple parser focused on essentials)
    rules = {
        "user_agents": {},
        "sitemaps": []
    }
    current_ua = None

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition(':')
        key = key.strip().lower()
        value = value.strip()

        if key == 'user-agent':
            current_ua = value
            rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
        elif key in ("allow", "disallow"):
            if current_ua is None:
                continue
            rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
            rules["user_agents"][current_ua][key].append(value)
        elif key == "crawl-delay":
            if current_ua is None:
                continue
            try:
                delay_val = float(value)
            except ValueError:
                delay_val = value  # keep as-is if not numeric
            rules["user_agents"][current_ua]["crawl-delay"] = delay_val
        elif key == 'sitemap':
            rules["sitemaps"].append(value)

    # Summarize for generic user agent '*'
    ua_all = rules["user_agents"].get('*')
    if ua_all:
        disallows = [p for p in ua_all['disallow'] if p is not None and p.strip() != '']
        allows = [p for p in ua_all['allow'] if p is not None and p.strip() != '']
        print("\nSummary for User-agent: *")
        print(f"  Disallow rules: {len(disallows)}")
        print(f"  Allow rules: {len(allows)}")
        print(f"  Crawl-delay: {ua_all['crawl-delay'] if ua_all['crawl-delay'] is not None else 'not specified'}")
    else:
        print("\nNo explicit section for User-agent: * found in robots.txt")

    # List sitemaps if present
    if rules["sitemaps"]:
        print("\nSitemaps found:")
        for sm in rules["sitemaps"][:10]:
            print(f"  - {sm}")
        if len(rules["sitemaps"]) > 10:
            print(f"  ... (+{len(rules['sitemaps']) - 10} more)")

    # Analysis message (non-committal, based on Disallow count)
    if ua_all and len([p for p in ua_all['disallow'] if p.strip() != '']) == 0:
        print("\n✓ Analysis: No Disallow directives for User-agent: *. Crawling appears permitted, subject to site terms and ethical rate limiting.")
    elif ua_all:
        print(f"\n⚠ Analysis: Found {len([p for p in ua_all['disallow'] if p.strip() != ''])} Disallow rule(s) for User-agent: *. Respect these paths.")
    else:
        print("\nℹ Analysis: robots.txt does not specify rules for User-agent: *. Proceed cautiously and respect site terms.")

    # Recommend a conservative delay
    recommended_delay = 1.0
    if ua_all and isinstance(ua_all.get('crawl-delay'), (int, float)):
        recommended_delay = max(1.0, float(ua_all['crawl-delay']))
    print(f"Recommended delay between requests: {recommended_delay} seconds")

except requests.RequestException as e:
    print("Error fetching robots.txt:", e)
    print("=" * 70)
    print("ℹ Tip: Check your internet connection or try again later.")

robots.txt content:
# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:

Sitemap: https://cyberdefender.hk/sitemap_index.xml
# ---------------------------
# END YOAST BLOCK

Summary for User-agent: *
  Disallow rules: 0
  Allow rules: 0
  Crawl-delay: not specified

Sitemaps found:
  - https://cyberdefender.hk/sitemap_index.xml

✓ Analysis: No Disallow directives for User-agent: *. Crawling appears permitted, subject to site terms and ethical rate limiting.
Recommended delay between requests: 1.0 seconds
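
The hand-written parser above is enough for a summary, but it does not answer the practical question "may I fetch this specific URL?". Python's standard library includes `urllib.robotparser` for that. The following is a minimal sketch that parses robots.txt text offline. The first rule set is copied from the output above (comments omitted); `EXAMPLE_RESTRICTED` and the example.org URLs are made up for illustration, and `site_maps()` assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "GCAP3226-Student-Crawler/1.0 (Educational purpose)"

# robots.txt content copied from the output above (comments omitted)
CYBERDEFENDER_ROBOTS = """\
User-agent: *
Disallow:

Sitemap: https://cyberdefender.hk/sitemap_index.xml
"""

# Hypothetical rules, only to show what a restriction looks like
EXAMPLE_RESTRICTED = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_text):
    """Build a RobotFileParser from robots.txt text without network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


open_rules = load_rules(CYBERDEFENDER_ROBOTS)
print(open_rules.can_fetch(USER_AGENT, "https://cyberdefender.hk/whatsapp-scam/"))
print(open_rules.site_maps())

restricted = load_rules(EXAMPLE_RESTRICTED)
print(restricted.can_fetch(USER_AGENT, "https://example.org/private/report.html"))
print(restricted.can_fetch(USER_AGENT, "https://example.org/public/report.html"))
```

In a crawl loop, a `can_fetch` check before each request would skip disallowed paths instead of only counting the rules.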


### Exercise 1.1: Check robots.txt for other government websites

Try checking robots.txt for these Hong Kong government sites:
- https://www.info.gov.hk
- https://www.censtatd.gov.hk (Census and Statistics Department)
- https://www.epd.gov.hk (Environmental Protection Department)

**Question:** Do they all allow crawling? Are there any restrictions?

In [3]:
# Exercise 1.1 solution: fetch and summarize robots.txt for multiple HK government sites
import requests
from typing import Dict, Any

sites = [
    "https://www.info.gov.hk",
    "https://www.censtatd.gov.hk",
    "https://www.epd.gov.hk",
]

session = requests.Session()
session.headers.update({
    "User-Agent": "GCAP3226-Student-Crawler/1.0 (Educational purpose)"
})


def summarize_robots(base_url: str) -> Dict[str, Any]:
    url = base_url.rstrip('/') + "/robots.txt"
    out: Dict[str, Any] = {
        "site": base_url,
        "status": "ok",
        "ua_star": None,
        "disallow_count": None,
        "allow_count": None,
        "crawl_delay": None,
        "sitemaps": [],
        "note": "",
    }
    try:
        r = session.get(url, timeout=12)
        r.raise_for_status()
        text = r.text
        rules = {"user_agents": {}, "sitemaps": []}
        current_ua = None
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith('#'):
                continue
            key, _, value = line.partition(':')
            key = key.strip().lower()
            value = value.strip()
            if key == 'user-agent':
                current_ua = value
                rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
            elif key in ("allow", "disallow"):
                if current_ua is None:
                    continue
                rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
                rules["user_agents"][current_ua][key].append(value)
            elif key == 'crawl-delay':
                if current_ua is None:
                    continue
                try:
                    delay_val = float(value)
                except ValueError:
                    delay_val = value
                rules["user_agents"][current_ua]["crawl-delay"] = delay_val
            elif key == 'sitemap':
                rules["sitemaps"].append(value)
        ua_all = rules["user_agents"].get('*')
        out["ua_star"] = bool(ua_all)
        if ua_all:
            out["disallow_count"] = len([p for p in ua_all['disallow'] if p and p.strip()])
            out["allow_count"] = len([p for p in ua_all['allow'] if p and p.strip()])
            out["crawl_delay"] = ua_all.get('crawl-delay')
        out["sitemaps"] = rules["sitemaps"]
    except requests.RequestException as e:
        out["status"] = "error"
        out["note"] = str(e)
    return out


results = [summarize_robots(s) for s in sites]

print("Robots.txt summary for HK government sites:\n")
for res in results:
    print(f"Site: {res['site']}")
    if res["status"] != "ok":
        print(f"  ✗ Error: {res['note']}")
        print()
        continue
    if res["ua_star"]:
        print("  User-agent '*': present")
        print(f"  Disallow rules: {res['disallow_count']}")
        print(f"  Allow rules: {res['allow_count']}")
        print(f"  Crawl-delay: {res['crawl_delay'] if res['crawl_delay'] is not None else 'not specified'}")
        if (res['disallow_count'] or 0) == 0:
            print("  ✓ Crawling generally permitted for '*', subject to terms.")
        else:
            print("  ⚠ Restrictions present — respect Disallow paths.")
    else:
        print("  ℹ No explicit rules for User-agent '*'. Proceed cautiously.")
    if res["sitemaps"]:
        print("  Sitemaps:")
        for sm in res["sitemaps"][:5]:
            print(f"    - {sm}")
        if len(res["sitemaps"]) > 5:
            print(f"    ... (+{len(res['sitemaps'])-5} more)")
    print()

Robots.txt summary for HK government sites:

Site: https://www.info.gov.hk
  User-agent '*': present
  Disallow rules: 1
  Allow rules: 0
  Crawl-delay: not specified
  ⚠ Restrictions present — respect Disallow paths.

Site: https://www.censtatd.gov.hk
  User-agent '*': present
  Disallow rules: 1
  Allow rules: 0
  Crawl-delay: not specified
  ⚠ Restrictions present — respect Disallow paths.

Site: https://www.epd.gov.hk
  ✗ Error: 404 Client Error: Not Found for url: https://www.epd.gov.hk/robots.txt



## Part 2: Discovering Content with Sitemaps

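Sitemaps are XML files in which a site lists the pages it wants crawlers to find. They are not guaranteed to be complete, but starting from them is far more efficient than following links page by page.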

### Sitemap Index
Large websites often have a `sitemap_index.xml` that points to multiple sitemaps.
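
A sitemap file can therefore be one of two kinds: an index (root element `<sitemapindex>`) or a list of pages (root element `<urlset>`). As a rough sketch, the root tag can be inspected to decide which one was downloaded. The XML documents and the `classify_sitemap` helper below are hypothetical and use example.org URLs; no network access is needed.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def classify_sitemap(xml_bytes):
    """Return ('index', child sitemap URLs) or ('urlset', page URLs)."""
    root = ET.fromstring(xml_bytes)
    # ElementTree writes namespaced tags as '{namespace}localname'
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        kind = "index"
        locs = root.findall("ns:sitemap/ns:loc", SITEMAP_NS)
    else:
        kind = "urlset"
        locs = root.findall("ns:url/ns:loc", SITEMAP_NS)
    return kind, [loc.text.strip() for loc in locs if loc.text]


# Hypothetical, trimmed-down documents for illustration
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.org/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""

URLSET_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/about/</loc></url>
</urlset>"""

print(classify_sitemap(INDEX_XML))
print(classify_sitemap(URLSET_XML))
```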

In [4]:
# Fetch sitemap index (reuse the session so the request carries our User-Agent)
sitemap_index_url = f"{BASE_URL}/sitemap_index.xml"
response = session.get(sitemap_index_url, timeout=10)
response.raise_for_status()

# Parse XML
root = ET.fromstring(response.content)
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Extract sitemap URLs
sitemaps = []
for sitemap in root.findall('.//ns:sitemap', namespace):
    loc = sitemap.find('ns:loc', namespace)
    lastmod = sitemap.find('ns:lastmod', namespace)
    if loc is not None:
        sitemaps.append({
            'url': loc.text,
            'last_modified': lastmod.text if lastmod is not None else 'N/A',
            'name': loc.text.split('/')[-1]
        })

# Display as DataFrame
df_sitemaps = pd.DataFrame(sitemaps)
print(f"Found {len(sitemaps)} sitemaps:\n")
print(df_sitemaps.to_string(index=False))

Found 6 sitemaps:

                                               url             last_modified                      name
         https://cyberdefender.hk/post-sitemap.xml 2025-04-07T15:28:23+00:00          post-sitemap.xml
         https://cyberdefender.hk/page-sitemap.xml 2025-10-14T03:20:38+00:00          page-sitemap.xml
https://cyberdefender.hk/gm_menu_block-sitemap.xml 2021-06-21T09:48:00+00:00 gm_menu_block-sitemap.xml
          https://cyberdefender.hk/r3d-sitemap.xml 2022-06-27T05:26:00+00:00           r3d-sitemap.xml
https://cyberdefender.hk/sdm_downloads-sitemap.xml 2025-10-09T02:33:07+00:00 sdm_downloads-sitemap.xml
     https://cyberdefender.hk/category-sitemap.xml 2025-10-14T03:20:38+00:00      category-sitemap.xml


### Key Observation
Notice `sdm_downloads-sitemap.xml`: it lists downloadable files, which often hold quantitative material such as reports, statistics, and datasets.

In [5]:
# Function to fetch and parse a sitemap
def fetch_sitemap(sitemap_url):
    """Fetch and parse a sitemap XML file"""
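    # Use the shared session (polite User-Agent) and fail fast on HTTP errors
    response = session.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)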
    namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    
    urls = []
    for url_elem in root.findall('.//ns:url', namespace):
        loc = url_elem.find('ns:loc', namespace)
        if loc is not None:
            urls.append(loc.text)
    
    return urls

# Fetch all URLs from all sitemaps
all_urls = {}
for sitemap in sitemaps:
    time.sleep(1)  # Rate limiting
    urls = fetch_sitemap(sitemap['url'])
    all_urls[sitemap['name']] = urls
    print(f"{sitemap['name']}: {len(urls)} URLs")

total_urls = sum(len(urls) for urls in all_urls.values())
print(f"\nTotal URLs to crawl: {total_urls}")

post-sitemap.xml: 6 URLs
page-sitemap.xml: 102 URLs
gm_menu_block-sitemap.xml: 1 URLs
r3d-sitemap.xml: 4 URLs
sdm_downloads-sitemap.xml: 21 URLs
category-sitemap.xml: 15 URLs

Total URLs to crawl: 149


## Part 3: Detecting Quantitative Data

Not all pages have useful quantitative data. We need to be selective!

### Detection Criteria:
1. **Numbers:** Percentages, currency, statistics
2. **Tables:** Structured data
3. **Keywords:** "statistics", "data", "analysis", "report", "survey"
4. **Charts/Graphs:** Visual data representations
5. **Downloadable Files:** PDFs, Excel files, CSV files

In [7]:
def detect_quantitative_data(html_content, url):
    """
    Analyze HTML content to detect quantitative data
    Returns: (has_data, metadata)
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    text = soup.get_text()
    
    metadata = {
        'url': url,
        'has_numbers': False,
        'has_tables': False,
        'has_charts': False,
        'number_count': 0,
        'table_count': 0,
        'keywords': []
    }
    
    # 1. Count meaningful numbers
    # A single alternation counts each figure once. With separate patterns,
    # "HK$1,000" would be counted three times (HK$, $, and comma-grouped).
    number_pattern = re.compile(
        r'\d+(?:\.\d+)?%'             # Percentages
        r'|(?:HK)?\$\d+(?:,\d{3})*'   # Currency, with optional HK prefix
        r'|\d{1,3}(?:,\d{3})+'        # Numbers with thousands separators
    )
    metadata['number_count'] = len(number_pattern.findall(text))
    
    if metadata['number_count'] > 5:
        metadata['has_numbers'] = True
    
    # 2. Check for tables
    tables = soup.find_all('table')
    metadata['table_count'] = len(tables)
    metadata['has_tables'] = len(tables) > 0
    
    # 3. Check for keywords
    keywords = ['statistics', 'data', 'analysis', 'report', 'survey',
                'percentage', 'total', 'average', 'rate']
    
    for keyword in keywords:
        if keyword.lower() in text.lower():
            metadata['keywords'].append(keyword)
    
    # 4. Check for charts
    chart_indicators = ['chart', 'graph', 'diagram']
    for indicator in chart_indicators:
        if indicator.lower() in text.lower():
            metadata['has_charts'] = True
            break
    
    # Determine if page has quantitative data
    has_data = (
        metadata['has_numbers'] or 
        metadata['has_tables'] or
        (metadata['has_charts'] and len(metadata['keywords']) > 0)
    )
    
    return has_data, metadata

print("✓ Quantitative data detection function defined")

✓ Quantitative data detection function defined
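
One caveat about the keyword check: `keyword in text` is a substring test, so a short keyword such as "rate" also matches inside "strategy" or "generate". A possible refinement, sketched here with made-up sample sentences and hypothetical helper names, is to match whole words with a regular expression.

```python
import re

KEYWORDS = ["statistics", "data", "report", "rate"]


def substring_hits(text, keywords):
    """Keyword check as used in the detection function (substring test)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def whole_word_hits(text, keywords):
    """Stricter variant: the keyword must appear as a complete word."""
    return [
        kw for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", text, flags=re.IGNORECASE)
    ]


noisy = "Our strategy is to celebrate the launch and generate awareness."
relevant = "The crime rate fell, according to police data."

print(substring_hits(noisy, KEYWORDS))      # false positive from 'strategy'
print(whole_word_hits(noisy, KEYWORDS))
print(whole_word_hits(relevant, KEYWORDS))
```

The trade-off is that whole-word matching misses plurals such as "reports" unless the pattern allows for them, so the stricter check is not automatically better for every keyword list.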


### Example: Test the Detection Function

In [8]:
# Test on a sample page
test_url = all_urls['post-sitemap.xml'][0] if 'post-sitemap.xml' in all_urls else BASE_URL

print(f"Testing URL: {test_url}\n")
response = session.get(test_url, timeout=10)  # shared session sends our User-Agent
has_data, metadata = detect_quantitative_data(response.content, test_url)

print(f"Has quantitative data: {has_data}")
print(f"\nDetection details:")
print(f"  Numbers found: {metadata['number_count']}")
print(f"  Tables found: {metadata['table_count']}")
print(f"  Has charts: {metadata['has_charts']}")
print(f"  Keywords: {', '.join(metadata['keywords'][:5])}")

Testing URL: https://cyberdefender.hk/%e3%80%8c%e8%b5%b7%e5%ba%95%e3%80%8d%e8%a1%8c%e7%82%ba%e5%88%91%e4%ba%8b%e5%8c%96%e2%9d%97%ef%b8%8f/

Has quantitative data: True

Detection details:
  Numbers found: 0
  Tables found: 1
  Has charts: False
  Keywords: data, rate
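
The test URL above looks unreadable because its Chinese title is percent-encoded. For logs and reports it can help to decode it with `urllib.parse.unquote`; the sketch below does that and then re-encodes the result to show that nothing is lost. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

encoded_url = (
    "https://cyberdefender.hk/%e3%80%8c%e8%b5%b7%e5%ba%95%e3%80%8d"
    "%e8%a1%8c%e7%82%ba%e5%88%91%e4%ba%8b%e5%8c%96%e2%9d%97%ef%b8%8f/"
)

# Human-readable form, for display only; keep requesting the encoded URL
readable_url = unquote(encoded_url)
print(readable_url)

# Re-encoding yields an equivalent URL (quote() emits upper-case hex digits)
round_trip = quote(readable_url, safe=":/")
print(round_trip.lower() == encoded_url)
```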


## Part 4: Implementing Ethical Crawling

### Best Practices:
1. **Rate Limiting:** Wait between requests (1-2 seconds)
2. **User-Agent:** Identify your crawler
3. **Error Handling:** Handle 404s, timeouts gracefully
4. **Logging:** Track what you're doing
5. **Respect robots.txt:** Always follow the rules
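
A fixed `time.sleep()` after every request is the simplest form of rate limiting. A slightly more precise option, sketched below, tracks when the last request was made and sleeps only for the remaining time, so slow responses do not add to the delay. The `RateLimiter` class and its `clock` and `sleep` parameters are assumptions introduced for this example; the demo uses a fake clock so it runs instantly.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep only as long as needed, then record the request time."""
        waited = 0.0
        if self._last_request is not None:
            remaining = self.min_interval - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request = self._clock()
        return waited


# Demo with a fake clock so that the example does not actually sleep
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    fake_now[0] += seconds


limiter = RateLimiter(2.0, clock=fake_clock, sleep=fake_sleep)
first = limiter.wait()    # no earlier request, so no waiting
fake_now[0] += 0.5        # pretend the download took 0.5 seconds
second = limiter.wait()   # sleeps for the remaining 1.5 seconds
print(first, second)
```

In real use, `RateLimiter(2.0)` with the default clock would be created once and `limiter.wait()` called immediately before each `session.get(...)`.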

In [9]:
def crawl_with_ethics(urls, max_pages=10):
    """
    Ethically crawl a list of URLs
    """
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'GCAP3226-Student-Crawler/1.0 (Educational Purpose)'
    })
    
    results = []
    
    for i, url in enumerate(urls[:max_pages], 1):
        print(f"[{i}/{min(len(urls), max_pages)}] Crawling: {url}")
        
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            
            has_data, metadata = detect_quantitative_data(response.content, url)
            results.append({
                'url': url,
                'has_data': has_data,
                'numbers': metadata['number_count'],
                'tables': metadata['table_count'],
                'keywords': ', '.join(metadata['keywords'][:3])
            })
            
            if has_data:
                print(f"  ✓ Found quantitative data!")
            
        except Exception as e:
            print(f"  ✗ Error: {str(e)}")
            results.append({
                'url': url,
                'has_data': False,
                'numbers': 0,
                'tables': 0,
                'keywords': f'Error: {str(e)}'
            })
        
        # Rate limiting - wait 2 seconds between requests
        time.sleep(2)
    
    return pd.DataFrame(results)

print("✓ Ethical crawler function defined")

✓ Ethical crawler function defined


### Exercise 4.1: Crawl Sample Pages

In [10]:
# Crawl first 5 pages from the posts sitemap
sample_urls = all_urls.get('post-sitemap.xml', [])[:5]

print(f"Crawling {len(sample_urls)} sample pages...\n")
results_df = crawl_with_ethics(sample_urls, max_pages=5)

print("\n" + "="*80)
print("RESULTS")
print("="*80)
print(results_df.to_string(index=False))

# Summary statistics (pages that failed to load count as "no data")
flagged = int(results_df['has_data'].sum())
print(f"\nPages with data: {flagged} / {len(results_df)}")
print(f"Share of pages flagged: {flagged / len(results_df) * 100:.1f}%")

Crawling 5 sample pages...

[1/5] Crawling: https://cyberdefender.hk/%e3%80%8c%e8%b5%b7%e5%ba%95%e3%80%8d%e8%a1%8c%e7%82%ba%e5%88%91%e4%ba%8b%e5%8c%96%e2%9d%97%ef%b8%8f/
  ✓ Found quantitative data!
[2/5] Crawling: https://cyberdefender.hk/whatsapp-scam/
  ✓ Found quantitative data!
[3/5] Crawling: https://cyberdefender.hk/sms-scam/
  ✓ Found quantitative data!
[4/5] Crawling: https://cyberdefender.hk/2024-dragon-year-fortune/
  ✗ Error: 404 Client Error: Not Found for url: https://cyberdefender.hk/2024-dragon-year-fortune/
[5/5] Crawling: https://cyberdefender.hk/carousell-scam/

## Part 5: Comparing with Google Site-Specific Search

Let's compare our crawler with Google's site-specific search.

### Google Site Search Syntax:
```
site:cyberdefender.hk "statistics" OR "data" OR "report"
```

### Advantages of Crawler:
1. **Systematic:** Covers all pages via sitemap
2. **Programmable:** Automated data extraction
3. **Customizable:** Your own detection criteria
4. **Downloadable:** Save pages and files locally
5. **Reproducible:** Can rerun with same parameters

### Advantages of Google Search:
1. **Fast:** Instant results
2. **Smart:** Better keyword matching
3. **No coding:** User-friendly interface
4. **No load on the target site:** Queries run against Google's index, so you send no requests to the site itself

**Conclusion:** Use both! Google for quick discovery, crawlers for systematic data collection.

## Part 6: Your Assignment - Crawl a Government Website

Choose one of these Hong Kong government websites and apply what you've learned:

1. **Census and Statistics Department:** https://www.censtatd.gov.hk
   - Rich in quantitative data
   - Population, economy, social indicators

2. **Environmental Protection Department:** https://www.epd.gov.hk
   - Air quality data
   - Waste statistics
   - Environmental reports

3. **Transport Department:** https://www.td.gov.hk
   - Traffic statistics
   - Public transport data
   - Road safety reports

4. **Food and Health Bureau:** https://www.fhb.gov.hk (renamed the Health Bureau in July 2022, so check that this address still resolves)
   - Healthcare statistics
   - Disease surveillance data
   - Hospital data

### Assignment Tasks:

1. **Check robots.txt** - Are you allowed to crawl?
2. **Find sitemap** - Does the site have a sitemap?
3. **Identify target pages** - What pages likely have quantitative data?
4. **Crawl ethically** - Implement rate limiting and proper User-Agent
5. **Extract data** - Download pages and files with quantitative data
6. **Analyze results** - What did you find? What data is useful for policy analysis?
7. **Document** - Write a brief report on your findings

### Deliverables:
1. Python code (use the functions from this notebook)
2. Data files collected
3. Summary report (500-1000 words)
4. Reflection on ethical considerations

In [13]:
# Assignment solution: FHB (Food and Health Bureau) site crawl
# Goal: Follow the 7 tasks and produce artifacts under ./fhb_data
import os
import time
import json
import re
import requests
import xml.etree.ElementTree as ET
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse

BASE = "https://www.fhb.gov.hk"
DATA_DIR = "fhb_data"
FILES_DIR = os.path.join(DATA_DIR, "files")
REPORT_PATH = os.path.join(DATA_DIR, "report.md")

os.makedirs(FILES_DIR, exist_ok=True)

session = requests.Session()
session.headers.update({
    "User-Agent": "GCAP3226-Student-Crawler/1.0 (Educational purpose)"
})

# 1) Check robots.txt
robots_url = f"{BASE}/robots.txt"
robots_text = ""
robots_summary = {}
try:
    r = session.get(robots_url, timeout=12)
    if r.status_code == 200:
        robots_text = r.text
    else:
        robots_text = f"<HTTP {r.status_code}>"
except Exception as e:
    robots_text = f"<error: {e}>"

# Very simple robots parser
rules = {"user_agents": {}, "sitemaps": []}
current_ua = None
for raw in robots_text.splitlines():
    line = raw.strip()
    if not line or line.startswith('#'):
        continue
    key, _, value = line.partition(':')
    key = key.strip().lower()
    value = value.strip()
    if key == 'user-agent':
        current_ua = value
        rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
    elif key in ("allow", "disallow"):
        if current_ua is None:
            continue
        rules["user_agents"].setdefault(current_ua, {"allow": [], "disallow": [], "crawl-delay": None})
        rules["user_agents"][current_ua][key].append(value)
    elif key == 'crawl-delay':
        if current_ua is None:
            continue
        try:
            delay_val = float(value)
        except ValueError:
            delay_val = value
        rules["user_agents"][current_ua]["crawl-delay"] = delay_val
    elif key == 'sitemap':
        rules["sitemaps"].append(value)

ua_all = rules["user_agents"].get('*')
recommended_delay = 1.0
if ua_all and isinstance(ua_all.get('crawl-delay'), (int, float)):
    recommended_delay = max(1.0, float(ua_all['crawl-delay']))

robots_summary = {
    "robots_url": robots_url,
    "ua_star_present": bool(ua_all),
    "disallow_count": len([p for p in (ua_all['disallow'] if ua_all else []) if p and p.strip()]),
    "allow_count": len([p for p in (ua_all['allow'] if ua_all else []) if p and p.strip()]),
    "crawl_delay": ua_all.get('crawl-delay') if ua_all else None,
    "sitemaps": rules["sitemaps"],
    "recommended_delay": recommended_delay,
}

print("Robots.txt summary for FHB:")
print(json.dumps(robots_summary, indent=2, ensure_ascii=False))

# 2) Find sitemap(s)
sitemap_candidates = [
    f"{BASE}/sitemap.xml",
    f"{BASE}/sitemap_index.xml",
]

found_sitemaps = []
for u in dict.fromkeys(robots_summary["sitemaps"] + sitemap_candidates):  # dedupe, keep order
    try:
        resp = session.get(u, timeout=12)
        if resp.status_code == 200 and resp.text.strip().startswith("<?xml"):
            found_sitemaps.append(u)
    except Exception:
        pass

print("\nSitemap endpoints detected:")
for sm in found_sitemaps:
    print(" -", sm)

# Helper to parse XML sitemap and return URLs
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def parse_sitemap(url: str) -> list:
    try:
        r = session.get(url, timeout=15)
        r.raise_for_status()
        root = ET.fromstring(r.content)
        urls = []
        # urlset
        for url_elem in root.findall('.//ns:url', NS):
            loc = url_elem.find('ns:loc', NS)
            if loc is not None and loc.text:
                urls.append(loc.text)
        # nested sitemap index
        for sm in root.findall('.//ns:sitemap', NS):
            loc = sm.find('ns:loc', NS)
            if loc is not None and loc.text:
                urls.extend(parse_sitemap(loc.text))
        return list(dict.fromkeys(urls))
    except Exception:
        return []

# 3) Identify target pages likely to have quantitative data
keywords = [
    "statistics", "data", "report", "survey", "indicator", "table",
    "health", "disease", "hospital", "performance", "cases", "rate", "percentage"
]

all_page_urls = []
for sm in found_sitemaps[:3]:  # keep it bounded
    time.sleep(recommended_delay)
    all_page_urls.extend(parse_sitemap(sm))

all_page_urls = list(dict.fromkeys(all_page_urls))
print(f"\nTotal URLs from sitemaps (deduped): {len(all_page_urls)}")

# Filter by keyword in URL to prioritize likely quantitative content
candidate_urls = [u for u in all_page_urls if any(k in u.lower() for k in keywords)]
# Fallback: if no candidates, sample first N
if not candidate_urls:
    candidate_urls = all_page_urls[:30]

print(f"Candidate URLs to crawl: {len(candidate_urls)} (showing up to 10)")
for u in candidate_urls[:10]:
    print(" -", u)

# 4) Crawl ethically + 5) Extract data/files
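number_patterns = [
    # One alternation so that each figure is counted once; separate patterns
    # would count "HK$1,000" three times.
    r"\d+(?:\.\d+)?%"             # percentages
    r"|(?:HK)?\$\d+(?:,\d{3})*"   # currency, with optional HK prefix
    r"|\b\d{1,3}(?:,\d{3})+\b",   # numbers with thousands separators
]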

file_exts = (".pdf", ".xlsx", ".xls", ".csv")

def detect_quantitative_data(html: bytes, url: str):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text(" ")
    meta = {
        'url': url,
        'number_count': 0,
        'table_count': len(soup.find_all('table')),
        'keywords': [],
        'has_charts': any(w in text.lower() for w in ["chart", "graph", "diagram"]),
        'files': [],
    }
    for pattern in number_patterns:
        meta['number_count'] += len(re.findall(pattern, text))
    for kw in keywords:
        if kw in text.lower():
            meta['keywords'].append(kw)
    # Find downloadable files
    for a in soup.find_all('a', href=True):
        href = a['href']
        if any(href.lower().endswith(ext) for ext in file_exts):
            # Only keep absolute URLs on gov.hk hosts (relative links are skipped)
            if href.startswith('http'):
                host = urlparse(href).netloc.lower()
                if host == "gov.hk" or host.endswith(".gov.hk"):
                    meta['files'].append(href)
    # bool() keeps has_data a true boolean so that .sum() on the column works
    has_data = bool(meta['number_count'] > 5 or meta['table_count'] > 0 or (meta['has_charts'] and meta['keywords']))
    return has_data, meta

results = []
MAX_PAGES = 20
for i, u in enumerate(candidate_urls[:MAX_PAGES], 1):
    print(f"[{i}/{min(len(candidate_urls), MAX_PAGES)}] {u}")
    try:
        resp = session.get(u, timeout=15)
        resp.raise_for_status()
        has_data, meta = detect_quantitative_data(resp.content, u)
        results.append({
            'url': u,
            'has_data': has_data,
            'numbers': meta['number_count'],
            'tables': meta['table_count'],
            'keywords': ','.join(meta['keywords'][:5]),
            'file_count': len(meta['files'])
        })
        # Download files (limit to a few to be polite)
        for j, furl in enumerate(meta['files'][:2], 1):
            try:
                time.sleep(recommended_delay)
                fr = session.get(furl, timeout=20)
                fr.raise_for_status()
                fname = os.path.join(FILES_DIR, re.sub(r"[^A-Za-z0-9_.-]", "_", os.path.basename(urlparse(furl).path)))
                with open(fname, 'wb') as fh:
                    fh.write(fr.content)
                print(f"    ↳ downloaded: {fname}")
            except Exception as e:
                print(f"    ↳ file download error: {e}")
    except Exception as e:
        print("  ✗ Error:", e)
    time.sleep(recommended_delay)

results_df = pd.DataFrame(results)
print("\nCrawl summary:")
if not results_df.empty:
    print(results_df.head(10).to_string(index=False))
    print(f"Pages with data: {int(results_df['has_data'].sum())} / {len(results_df)}")
else:
    print("No pages processed.")

# 6) Analyze results: simple aggregates
analysis = {}
if not results_df.empty:
    analysis = {
        'pages_scanned': int(len(results_df)),
        'pages_with_data': int(results_df['has_data'].sum()),
        'avg_numbers': float(results_df['numbers'].mean()),
        'avg_tables': float(results_df['tables'].mean()),
        'top_keywords': (results_df['keywords']
                         .str.split(',')
                         .explode()
                         .str.strip()
                         .replace('', pd.NA)
                         .dropna()
                         .value_counts()
                         .head(10)
                         .to_dict()),
        'total_files_found': int(results_df['file_count'].sum()),
    }

print("\nAnalysis summary:")
print(json.dumps(analysis, indent=2, ensure_ascii=False))

# 7) Document: write a brief report
report_lines = []
report_lines.append("# FHB Crawl Report\n")
report_lines.append(f"Date: {pd.Timestamp.now(tz='UTC').isoformat()}\n")  # timezone-aware; utcnow() is deprecated
report_lines.append("\n## Robots.txt\n")
report_lines.append(f"URL: {robots_url}\n")
if robots_text:
    report_lines.append("```\n" + robots_text + "\n```\n")
else:
    report_lines.append("(robots.txt not available)\n")
report_lines.append("\n### Summary\n")
report_lines.append(json.dumps(robots_summary, indent=2, ensure_ascii=False) + "\n")

report_lines.append("\n## Sitemaps\n")
if found_sitemaps:
    for sm in found_sitemaps:
        report_lines.append(f"- {sm}\n")
else:
    report_lines.append("No sitemap endpoints detected.\n")

report_lines.append("\n## Crawl Results (sample)\n")
if not results_df.empty:
    report_lines.append("```\n" + results_df.head(20).to_string(index=False) + "\n```\n")
else:
    report_lines.append("No pages processed.\n")

report_lines.append("\n## Analysis\n")
report_lines.append(json.dumps(analysis, indent=2, ensure_ascii=False) + "\n")

report_lines.append("\n## Reflection on Ethical Considerations\n")
report_lines.append("- Identified ourselves via a clear User-Agent.\n")
report_lines.append("- Applied a delay of at least 1s; would increase it if the server showed signs of load or robots.txt specified a longer crawl-delay.\n")
report_lines.append("- Limited to 20 pages and at most 2 file downloads per page.\n")
report_lines.append("- Focused on publicly accessible content and avoided personal data.\n")
report_lines.append("- Preferred sitemaps and conservative filtering; would stop if errors or signs of rate limiting occur.\n")

with open(REPORT_PATH, 'w', encoding='utf-8') as f:
    f.write('\n'.join(report_lines))

print(f"\nReport written to: {REPORT_PATH}")

Robots.txt summary for FHB:
{
  "robots_url": "https://www.fhb.gov.hk/robots.txt",
  "ua_star_present": false,
  "disallow_count": 0,
  "allow_count": 0,
  "crawl_delay": null,
  "sitemaps": [],
  "recommended_delay": 1.0
}

Sitemap endpoints detected:

Total URLs from sitemaps (deduped): 0
Candidate URLs to crawl: 0 (showing up to 10)

Crawl summary:
No pages processed.

Analysis summary:
{}

Report written to: fhb_data/report.md
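
One limitation of the solution above is that it only keeps file links written as absolute URLs, while many sites link to documents with relative paths. A possible refinement, sketched here, resolves each `href` against the page URL with `urljoin` before filtering. The `resolve_file_links` helper, the sample hrefs, and the `www.example.gov.hk` host are hypothetical.

```python
from urllib.parse import urljoin, urlparse

FILE_EXTS = (".pdf", ".xlsx", ".xls", ".csv")


def resolve_file_links(hrefs, page_url, allowed_domain="gov.hk"):
    """Resolve hrefs against page_url; keep data files on the allowed domain."""
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        host = (parsed.hostname or "").lower()
        # Test the path only, so query strings such as '?download=1' are ignored
        if not parsed.path.lower().endswith(FILE_EXTS):
            continue
        if host == allowed_domain or host.endswith("." + allowed_domain):
            links.append(absolute)
    return list(dict.fromkeys(links))  # dedupe, keep order


# Hypothetical hrefs as they might appear in a page's <a> tags
sample_hrefs = [
    "/docs/annual_report.pdf",          # root-relative
    "tables/cases.csv?download=1",      # relative, with a query string
    "https://example.com/other.pdf",    # external host, dropped
    "/about.html",                      # not a data file, dropped
]
page = "https://www.example.gov.hk/statistics/index.html"

for link in resolve_file_links(sample_hrefs, page):
    print(link)
```

In the notebook, the hrefs would come from `a['href']` for each anchor found by BeautifulSoup.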


## Part 7: Ethical Considerations and Best Practices

### Legal and Ethical Framework:

1. **Copyright:** Respect intellectual property rights
2. **Privacy:** Don't collect personal data
3. **Terms of Service:** Read and follow website terms
4. **Server Load:** Don't overwhelm servers
5. **Purpose:** Use data for legitimate research/analysis

### Rate Limiting Guidelines:
- Small sites: 2-3 seconds between requests
- Medium sites: 1-2 seconds
- Large sites (e.g., government): 0.5-1 second
- **Never** wait less than 0.5 seconds between requests, and use the site's `Crawl-delay` whenever it is longer
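
These guidelines can be combined with a site's own `Crawl-delay` by always taking the larger value. The sketch below uses the upper end of each range to stay conservative; the `choose_delay` function and the size labels are hypothetical names introduced for this example.

```python
# Upper end of each guideline range, in seconds (a conservative choice)
GUIDELINE_DELAYS = {"small": 3.0, "medium": 2.0, "large": 1.0}
MINIMUM_DELAY = 0.5  # never go below this, whatever the inputs


def choose_delay(site_size, crawl_delay=None):
    """Pick the largest of the floor, the guideline, and any Crawl-delay."""
    candidates = [MINIMUM_DELAY, GUIDELINE_DELAYS[site_size]]
    if isinstance(crawl_delay, (int, float)):
        candidates.append(float(crawl_delay))
    return max(candidates)


print(choose_delay("large"))                 # guideline only
print(choose_delay("large", crawl_delay=5))  # robots.txt asks for more
print(choose_delay("small", crawl_delay=1))  # guideline is stricter
```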

### When NOT to Crawl:
- robots.txt explicitly disallows
- Site requires login
- Terms of service prohibit it
- Data is available via API
- Site is slow or unstable

### Alternative: APIs
Many government sites offer APIs for data access:
- **Hong Kong Open Data:** https://data.gov.hk
- Usually better than crawling!
- Structured data, legal access, official support

## Summary

**What You Learned:**
1. ✓ How to check robots.txt and respect crawling rules
2. ✓ How to discover content using sitemaps
3. ✓ How to detect quantitative data in web pages
4. ✓ How to implement ethical crawling practices
5. ✓ How to compare crawler vs. Google search approaches
6. ✓ How to apply these skills to government websites

**Next Steps:**
- Complete the assignment
- Explore Hong Kong Open Data portal
- Learn about web scraping libraries (Scrapy, Selenium)
- Study data analysis techniques for policy research

**Resources:**
- [Web Scraping Best Practices](https://www.scrapehero.com/web-scraping-best-practices/)
- [Hong Kong Open Data](https://data.gov.hk)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/)
- [Robots.txt Specification](https://www.robotstxt.org/)