# 00 - Scrape UNCTAD Investment Dispute Settlement Database

This notebook scrapes case data directly from the UNCTAD website, including ITA Law links.
This is an alternative to using the UNCTAD Excel file, which may be outdated.

**Output:** `data/unctad_cases.csv` with one row per UNCTAD case.

In [None]:
import pandas as pd
from scraper.unctad_scraper import scrape_all_cases, scrape_single_case

## Test Single Case
Verify the scraper is working before running the full scrape.

In [None]:
# Test with a single case first
test_result = scrape_single_case(1)
test_result

## Full Scrape

Scrape all UNCTAD cases. With a 1-second delay, the waiting alone adds up to about 25 minutes for 1500 case IDs (1500 × 1 s); request time brings the total to roughly 25-30 minutes.

**Parameters:**
- `start_id`: First case ID (default: 1)
- `end_id`: Last case ID (default: 1500, adjust if more cases exist)
- `delay`: Seconds between requests (default: 1.0 - be polite!)
- `save_interval`: Save intermediate results every N cases (default: 100)
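
To make the parameters above concrete, here is a minimal sketch of how a polite scrape loop with periodic checkpoints might be structured. It is not the actual implementation of `scrape_all_cases`: `scrape_range`, `write_rows`, and the `fetch_case` callable are hypothetical names introduced for illustration, and the sketch uses only the standard library so it runs without network access.

```python
import csv
import time


def scrape_range(fetch_case, start_id=1, end_id=1500, delay=1.0,
                 save_interval=100, output_path=None):
    """Hypothetical stand-in for scrape_all_cases (an assumption, not the real code).

    fetch_case is any callable that takes a case ID and returns a dict.
    """
    rows = []
    for case_id in range(start_id, end_id + 1):
        try:
            row = dict(fetch_case(case_id))
            row.setdefault("fetch_status", "success")
        except Exception as exc:
            # Record the failure and keep going, so one bad ID does not stop the run.
            row = {"unctad_case_id": case_id, "fetch_status": f"error: {exc}"}
        rows.append(row)

        # Checkpoint every save_interval rows so an interrupted run loses little work.
        if output_path and len(rows) % save_interval == 0:
            write_rows(rows, output_path)

        # Pause between requests, but not after the last one.
        if case_id < end_id:
            time.sleep(delay)

    if output_path:
        write_rows(rows, output_path)
    return rows


def write_rows(rows, path):
    """Write all rows to CSV, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

In this sketch, `delay=1.0` over IDs 1 to 1500 means 1499 one-second pauses, which is where the roughly 25-minute floor on runtime comes from. Rewriting the whole file at each checkpoint is simple and cheap at this scale; appending would only matter for much larger runs.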

In [None]:
# Scrape all cases - adjust end_id if needed
# As of 2024, there are ~1440 cases, so 1500 gives headroom
df = scrape_all_cases(
    start_id=1,
    end_id=1500,
    delay=1.0,
    save_interval=100,
    output_path='data/unctad_cases.csv'
)

## Inspect Results

In [None]:
ok = df['fetch_status'] == 'success'
print(f"Total rows: {len(df)}")
print(f"Successfully fetched: {ok.sum()}")
print(f"Cases with ITA Law link: {df['italaw_link'].notna().sum()}")
# Count only successful fetches, so failed or nonexistent IDs are not reported as missing links
print(f"Cases without ITA Law link: {(ok & df['italaw_link'].isna()).sum()}")

In [None]:
# Sample of cases with ITA Law links
df.loc[df['italaw_link'].notna(), ['unctad_case_id', 'case_name', 'italaw_link', 'respondent_state']].head(10)

In [None]:
# Sample of successfully fetched cases without ITA Law links
df.loc[df['italaw_link'].isna() & (df['fetch_status'] == 'success'), ['unctad_case_id', 'case_name', 'respondent_state']].head(10)