# 1.8a: Thai Wikipedia Corpus

This notebook builds a corpus of Thai Wikipedia articles for checking whether cluster tokens appear in natural Thai text.

## The Question

We've discovered that:
- 71.4% of cluster tokens (1,579 / 2,212) are Thai script
- This represents 61.4% of ALL Thai tokens in the vocabulary
- The cluster is 47.9× enriched for Thai compared to the full vocabulary

**But are these tokens actually USED in Thai text?**

Hypothesis: Maybe cluster tokens are legitimate but rare (technical terms, archaic language, etc.)

Test: Sample 100 random Thai Wikipedia articles and check if cluster tokens appear.

## Method

1. **Sampling strategy**: Use Wikipedia's random article API
   - Fetch 100 random Thai Wikipedia articles
   - Save page IDs for reproducibility
   - On subsequent runs, re-fetch same articles by ID

2. **Data storage**: Save each article as JSON with metadata
   - Page ID, title, URL
   - Fetch date
   - Raw text (cleaned)
   - Character count

3. **Reproducibility**: The first run is non-deterministic (random sampling), but page IDs are saved so subsequent runs fetch the same articles. The text itself can still differ if an article was edited between runs.

## Why 100 articles?

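- Large enough that a token in ordinary use should show up at least once
- Random sampling should cover diverse topics (science, culture, history, etc.)
- Expected token count: ~200k-500k tokens (an estimate made before fetching; see Corpus Statistics for the size actually obtained)
- If cluster tokens don't appear in 100 diverse articles, that is strong evidence they are absent from everyday Thai text, not just "rare"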
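
To make "absent versus rare" concrete, the sketch below computes the largest per-token frequency that is still consistent with seeing a token zero times in a corpus of a given size. It assumes token occurrences are independent draws, which is a simplification for natural text, and the function name `max_rate_given_zero_hits` is hypothetical.

```python
def max_rate_given_zero_hits(n_tokens, confidence=0.95):
    """Upper bound on a token's per-position rate after zero hits in n_tokens.

    Assumes independent draws: P(zero hits) = (1 - p) ** n_tokens.
    Solving (1 - p) ** n_tokens = 1 - confidence for p gives the bound.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tokens)


# Corpus sizes are illustrative, taken from the estimate in the list above
for n in (200_000, 500_000):
    bound = max_rate_given_zero_hits(n)
    print(f"{n:>7,} tokens: rate below {bound:.2e} (about 1 in {1 / bound:,.0f})")
```

At 95% confidence the bound is close to 3 / n_tokens (the "rule of three"), so a smaller corpus can only rule out correspondingly more frequent tokens.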

## Parameters

In [1]:
# Sampling parameters
NUM_ARTICLES = 100
LANGUAGE = 'th'  # Thai Wikipedia

# File paths
PAGE_IDS_FILE = '../data/1.8a_thai_wiki_page_ids.txt'
OUTPUT_DIR = '../data/1.8a_thai_wiki_corpus/'

# Wikipedia API
WIKI_API_URL = 'https://th.wikipedia.org/w/api.php'
WIKI_BASE_URL = 'https://th.wikipedia.org/wiki/'

# User-Agent (Wikipedia requires this for API requests)
USER_AGENT = 'AzimuthTokenResearch/1.0 (https://github.com/yourname/azimuth; contact@example.com)'

# Rate limiting (be nice to Wikipedia)
REQUEST_DELAY = 0.1  # seconds between requests

## Imports

In [2]:
import requests
import json
import time
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
import re

## Helper Functions

In [3]:
def fetch_random_page_ids(n, delay=0.1):
    """
    Fetch N random Thai Wikipedia page IDs using the random API.
    
    Returns list of page IDs (integers).
    """
    page_ids = []
    headers = {'User-Agent': USER_AGENT}
    
    print(f"Fetching {n} random Thai Wikipedia page IDs...")
    
    for i in tqdm(range(n), desc="Fetching page IDs"):
        params = {
            'action': 'query',
            'list': 'random',
            'rnnamespace': 0,  # Main namespace only (articles)
            'rnlimit': 1,
            'format': 'json'
        }
        
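        response = requests.get(WIKI_API_URL, params=params, headers=headers, timeout=30)
        response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
        data = response.json()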
        
        page_id = data['query']['random'][0]['id']
        page_ids.append(page_id)
        
        # Be nice to Wikipedia
        time.sleep(delay)
    
    return page_ids


def fetch_article_by_id(page_id, delay=0.1):
    """
    Fetch article content by page ID.
    
    Returns dict with:
    - page_id
    - title
    - url
    - text (cleaned)
    - char_count
    - fetch_date
    """
    headers = {'User-Agent': USER_AGENT}
    
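    params = {
        'action': 'query',
        'pageids': page_id,
        'prop': 'extracts|info',
        # Full article: 'exintro' is deliberately omitted. MediaWiki treats a
        # boolean parameter as true whenever it is present, so sending
        # exintro=False would still return only the intro section.
        'explaintext': True,  # Plain text, no HTML
        'inprop': 'url',
        'format': 'json'
    }
    
    response = requests.get(WIKI_API_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error page
    data = response.json()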
    
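    page_data = data['query']['pages'][str(page_id)]
    if 'missing' in page_data:
        # Page was deleted after its ID was saved; let the caller record the failure
        raise ValueError(f"Page {page_id} no longer exists")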
    
    # Extract text and clean
    text = page_data.get('extract', '')
    
    # Remove excessive whitespace
    text = re.sub(r'\n\n+', '\n\n', text)
    text = text.strip()
    
    article = {
        'page_id': page_id,
        'title': page_data.get('title', ''),
        'url': page_data.get('fullurl', ''),
        'text': text,
        'char_count': len(text),
        'fetch_date': datetime.now().isoformat()
    }
    
    time.sleep(delay)
    
    return article


def save_article(article, output_dir):
    """
    Save article as JSON file.
    
    Filename: {page_id}.json
    """
    output_path = Path(output_dir) / f"{article['page_id']}.json"
    
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(article, f, ensure_ascii=False, indent=2)


print("✓ Helper functions defined")

✓ Helper functions defined
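
One detail worth knowing when building `params` dicts like those above: the MediaWiki Action API treats a boolean parameter as true whenever it is present in the request, whatever its value, and `requests` serializes Python's `False` as the string `False`. A minimal sketch of a guard follows; `clean_api_params` is a hypothetical helper and not part of the notebook.

```python
def clean_api_params(params):
    """Prepare a params dict for the MediaWiki Action API.

    Boolean flags are true when present, so False-valued flags are dropped
    entirely and True-valued flags are sent as '1'. Other values pass through.
    """
    cleaned = {}
    for key, value in params.items():
        if value is False or value is None:
            continue  # Omit the flag: absence is the only way to say "false"
        cleaned[key] = '1' if value is True else value
    return cleaned


params = clean_api_params({
    'action': 'query',
    'exintro': False,     # dropped, so the full article is requested
    'explaintext': True,  # kept as '1'
    'rnnamespace': 0,     # integer 0 is a real value and is preserved
})
print(params)
```

The identity checks (`is False`, `is True`) are intentional: an equality check would also drop legitimate integer values such as `rnnamespace=0`.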


## Setup Output Directory

In [4]:
# Create output directories
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
Path(PAGE_IDS_FILE).parent.mkdir(parents=True, exist_ok=True)

print(f"✓ Output directory: {OUTPUT_DIR}")
print(f"✓ Page IDs file: {PAGE_IDS_FILE}")

✓ Output directory: ../data/1.8a_thai_wiki_corpus/
✓ Page IDs file: ../data/1.8a_thai_wiki_page_ids.txt


## Get Page IDs (Reproducible)

In [5]:
print(f"\n{'='*70}")
print("SAMPLING STRATEGY")
print(f"{'='*70}\n")

if Path(PAGE_IDS_FILE).exists():
    print("✓ Found existing page IDs (REPRODUCIBLE MODE)")
    print(f"  Loading from: {PAGE_IDS_FILE}\n")
    
    with open(PAGE_IDS_FILE, 'r') as f:
        page_ids = [int(line.strip()) for line in f if line.strip()]
    
    print(f"✓ Loaded {len(page_ids)} page IDs")
    print(f"  Page ID range: [{min(page_ids)}, {max(page_ids)}]")
    
else:
    print("⚠ No existing page IDs found (FIRST RUN)")
    print(f"  Fetching {NUM_ARTICLES} random Thai Wikipedia articles...\n")
    
    # Fetch random page IDs
    page_ids = fetch_random_page_ids(NUM_ARTICLES, delay=REQUEST_DELAY)
    
    # Save for reproducibility
    with open(PAGE_IDS_FILE, 'w') as f:
        for page_id in page_ids:
            f.write(f"{page_id}\n")
    
    print(f"\n✓ Saved {len(page_ids)} page IDs to {PAGE_IDS_FILE}")
    print(f"  Page ID range: [{min(page_ids)}, {max(page_ids)}]")
    print(f"\n  Subsequent runs will use these same page IDs (reproducible)")


SAMPLING STRATEGY

⚠ No existing page IDs found (FIRST RUN)
  Fetching 100 random Thai Wikipedia articles...

Fetching 100 random Thai Wikipedia page IDs...


Fetching page IDs: 100%|██████████| 100/100 [00:33<00:00,  3.03it/s]


✓ Saved 100 page IDs to ../data/1.8a_thai_wiki_page_ids.txt
  Page ID range: [5449, 1489839]

  Subsequent runs will use these same page IDs (reproducible)
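
Each call to the random API is independent, so nothing prevents the same page from being drawn twice. The sketch below keeps requesting until it has `n` distinct IDs. `collect_unique_ids` and its `fetch_batch` argument are hypothetical, with `fetch_batch` standing in for a wrapper around the `list=random` query.

```python
def collect_unique_ids(n, fetch_batch, max_calls=1000):
    """Collect n distinct page IDs, preserving first-seen order.

    fetch_batch: zero-argument callable returning a list of page IDs
                 (in the notebook, a wrapper around the list=random query).
    max_calls:   safety cap so a misbehaving fetcher cannot loop forever.
    """
    if n <= 0:
        return []
    seen = set()
    ordered = []
    for _ in range(max_calls):
        for page_id in fetch_batch():
            if page_id not in seen:
                seen.add(page_id)
                ordered.append(page_id)
                if len(ordered) == n:
                    return ordered
    raise RuntimeError(f"Only {len(ordered)} unique IDs after {max_calls} calls")


# Stand-in fetcher that repeats an ID, to show the deduplication
batches = iter([[11, 22], [22, 33], [44]])
ids = collect_unique_ids(4, lambda: next(batches))
print(ids)
```

A quick check on an already saved ID file is `len(set(page_ids)) == len(page_ids)`.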





## Fetch Articles by ID

In [6]:
print(f"\n{'='*70}")
print("FETCHING ARTICLES")
print(f"{'='*70}\n")

print(f"Fetching {len(page_ids)} articles from Thai Wikipedia...\n")

articles = []
failed = []

for page_id in tqdm(page_ids, desc="Fetching articles"):
    try:
        article = fetch_article_by_id(page_id, delay=REQUEST_DELAY)
        save_article(article, OUTPUT_DIR)
        articles.append(article)
    except Exception as e:
        print(f"\n⚠ Failed to fetch page {page_id}: {e}")
        failed.append(page_id)

print(f"\n✓ Successfully fetched {len(articles)} articles")

if failed:
    print(f"⚠ Failed to fetch {len(failed)} articles: {failed}")
else:
    print("✓ All articles fetched successfully")


FETCHING ARTICLES

Fetching 100 articles from Thai Wikipedia...



Fetching articles: 100%|██████████| 100/100 [00:53<00:00,  1.85it/s]


✓ Successfully fetched 100 articles
✓ All articles fetched successfully
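
The loop above records failures but does not retry them, so a single transient network error is enough to drop an article. A minimal retry sketch with exponential backoff follows. `fetch_with_retry` is a hypothetical wrapper, and its `sleep` argument is injectable only so the example can run without waiting.

```python
import time


def fetch_with_retry(fetch, page_id, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(page_id), retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once all attempts are used up.
    """
    for attempt in range(retries):
        try:
            return fetch(page_id)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in fetcher that fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch(page_id):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError("simulated network error")
    return {'page_id': page_id}


waits = []
result = fetch_with_retry(flaky_fetch, 42, sleep=waits.append)
print(result, waits)
```

In the notebook this would wrap the existing call, for example `fetch_with_retry(fetch_article_by_id, page_id)`.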





## Corpus Statistics

In [7]:
print(f"\n{'='*70}")
print("CORPUS STATISTICS")
print(f"{'='*70}\n")

# Count statistics
total_chars = sum(a['char_count'] for a in articles)
avg_chars = total_chars / len(articles) if articles else 0
min_chars = min(a['char_count'] for a in articles) if articles else 0
max_chars = max(a['char_count'] for a in articles) if articles else 0

print(f"Corpus size: {len(articles)} articles")
print(f"\nCharacter counts:")
print(f"  Total: {total_chars:,} characters")
print(f"  Average: {avg_chars:,.0f} characters per article")
print(f"  Range: [{min_chars:,}, {max_chars:,}]")

# Show sample of titles
print(f"\nSample of article titles (first 10):")
for i, article in enumerate(articles[:10], 1):
    print(f"  {i:2d}. {article['title']}")

if len(articles) > 10:
    print(f"  ... ({len(articles) - 10} more)")

# Character count distribution
print(f"\nArticle length distribution:")
bins = [0, 1000, 5000, 10000, 50000, float('inf')]
labels = ['<1k', '1-5k', '5-10k', '10-50k', '>50k']

for i, (lower, upper) in enumerate(zip(bins[:-1], bins[1:])):
    count = sum(1 for a in articles if lower <= a['char_count'] < upper)
    pct = 100 * count / len(articles) if articles else 0
    print(f"  {labels[i]:>6s}: {count:3d} articles ({pct:5.1f}%)")


CORPUS STATISTICS

Corpus size: 100 articles

Character counts:
  Total: 62,243 characters
  Average: 622 characters per article
  Range: [41, 5,534]

Sample of article titles (first 10):
   1. โอลด์เฟิร์ม
   2. ฤทธิ เบญจฤทธิ์
   3. แม่น้ำอามูร์
   4. อาทิศังกราจารย์
   5. พระเจ้าเฮนรีที่ 7 แห่งอังกฤษ
   6. พ.ศ. 421
   7. ทางหลวงแผ่นดินหมายเลข 1356
   8. อาสนวิหารการเสด็จขึ้นสู่สวรรค์ (อัลมาเตอ)
   9. เก๋งนุกิจราชบริหาร
  10. เร ดัง
  ... (90 more)

Article length distribution:
     <1k:  82 articles ( 82.0%)
    1-5k:  17 articles ( 17.0%)
   5-10k:   1 articles (  1.0%)
  10-50k:   0 articles (  0.0%)
    >50k:   0 articles (  0.0%)
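
Since 82% of the articles fall under 1,000 characters, the distribution is skewed toward short pages, and the median is a useful companion to the mean. A minimal sketch using the standard library follows; `summarize_lengths` is a hypothetical helper and the input list is made-up illustration data, not the corpus.

```python
import statistics


def summarize_lengths(char_counts):
    """Return mean, median, and the share of texts shorter than 1,000 chars."""
    if not char_counts:
        raise ValueError("char_counts is empty")
    short = sum(1 for c in char_counts if c < 1000)
    return {
        'mean': statistics.mean(char_counts),
        'median': statistics.median(char_counts),
        'share_under_1k': short / len(char_counts),
    }


# Made-up lengths with one long outlier
summary = summarize_lengths([100, 200, 300, 400, 5000])
print(summary)
```

On the real corpus the input would be `[a['char_count'] for a in articles]`.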


## Save Corpus Metadata

In [8]:
# Create corpus-level metadata
metadata = {
    'corpus_name': '1.8a Thai Wikipedia Sample',
    'num_articles': len(articles),
    'language': LANGUAGE,
    'source': 'Thai Wikipedia (th.wikipedia.org)',
    'sampling_method': 'Random article API',
    'creation_date': datetime.now().isoformat(),
    'total_characters': total_chars,
    'page_ids_file': PAGE_IDS_FILE,
    'articles': [
        {
            'page_id': a['page_id'],
            'title': a['title'],
            'char_count': a['char_count'],
            'url': a['url']
        }
        for a in articles
    ]
}

metadata_path = Path(OUTPUT_DIR) / 'corpus_metadata.json'
with open(metadata_path, 'w', encoding='utf-8') as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)

print(f"\n✓ Saved corpus metadata to {metadata_path}")


✓ Saved corpus metadata to ../data/1.8a_thai_wiki_corpus/corpus_metadata.json


## Summary

This notebook has created a reproducible Thai Wikipedia corpus for testing cluster token usage.

**What we've done:**

1. Sampled 100 random Thai Wikipedia articles
2. Saved page IDs for reproducibility
3. Fetched and stored article text with metadata
4. Created corpus statistics

**Files created:**

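- `../data/1.8a_thai_wiki_page_ids.txt` (`PAGE_IDS_FILE`): List of 100 page IDs (one per line)
- `../data/1.8a_thai_wiki_corpus/*.json` (under `OUTPUT_DIR`): Individual article JSON files, named `{page_id}.json`
- `../data/1.8a_thai_wiki_corpus/corpus_metadata.json`: Corpus-level metadata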

**Reproducibility:**

- First run: Fetches random articles (non-deterministic)
- Subsequent runs: Re-fetch the same articles by ID (same pages every time; the text reflects each page's current revision)
- Page IDs committed to git for transparency

**Next step:**

**1.8b**: Tokenize corpus with Qwen tokenizer and count cluster token appearances.
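
For the next step, the saved articles need to be read back from disk. One thing to watch: `corpus_metadata.json` lives in the same directory as the per-article files, so a plain `*.json` glob picks it up too. A minimal loader sketch, with `load_corpus` as a hypothetical helper name:

```python
import json
from pathlib import Path


def load_corpus(output_dir, metadata_name='corpus_metadata.json'):
    """Load per-article JSON files from output_dir, skipping the metadata file.

    Returns a list of article dicts sorted by file name for a stable order.
    """
    articles = []
    for path in sorted(Path(output_dir).glob('*.json')):
        if path.name == metadata_name:
            continue  # Corpus-level metadata, not an article
        with open(path, encoding='utf-8') as f:
            articles.append(json.load(f))
    return articles
```

With the paths used in this notebook, the call would be `load_corpus('../data/1.8a_thai_wiki_corpus/')`.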