# PhoneticPriorModel Training

This notebook runs the full paper-style experiments for the PhoneticPriorModel
(Luo et al. 2021, TACL) decipherment system.

**Experiments:**
- **Ugaritic** (Table 3): Ugaritic → Hebrew decipherment, P@1 metric
- **Gothic** (Table 2 + 4): Gothic → Proto-Germanic / Old Norse / Old English, P@10 metric
- **Iberian Names** (Figure 4a): Iberian personal names → Latin, P@K curves
- **Validation branches**: 55 language family branches from ancient-scripts-datasets
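
The P@K metrics listed above are commonly computed as the fraction of queries whose gold cognate appears among the top K ranked candidates. The sketch below follows that common definition. The function name and data layout are illustrative assumptions, not taken from the repository, and the toy rankings are not model output.

```python
from typing import Dict, List, Set


def precision_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one gold cognate in the top-k candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        return 0.0
    hits = sum(
        1
        for query, answers in gold.items()
        if answers & set(ranked.get(query, [])[:k])
    )
    return hits / len(gold)


# Hypothetical toy rankings, for illustration only.
ranked = {"w1": ["a", "b", "c"], "w2": ["x", "y", "z"]}
gold = {"w1": {"a"}, "w2": {"z"}}
print(precision_at_k(ranked, gold, k=1))  # 0.5: only w1 is correct at rank 1
print(precision_at_k(ranked, gold, k=3))  # 1.0: both gold answers are in the top 3
```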

**Estimated training times** (with batch_size=8, 3 restarts):

| Experiment | Runs | Est. time |
|---|---|---|
| Ugaritic | 6 (2 variants × 3 restarts) | ~2-3 hours |
| Iberian Names | 6 (2 variants × 3 restarts) | ~1-2 hours |
| Gothic | 36+ (3 variants × 4 whitespace ratios × 3 restarts) | ~8-12 hours |

**Runtime:** Select **T4 GPU** or **CPU**. The core bottleneck is a set of CPU-bound
dynamic-programming (DP) loops, so a GPU speeds up the tensor operations only marginally.
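
To illustrate why a DP loop gains little from a GPU, here is a generic edit-distance recurrence. It is not the model's actual DP, only a sketch of the same pattern: each cell depends on cells computed just before it, so the work runs as a sequential Python loop rather than as one large batched tensor operation.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via a row-by-row DP table."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            # current[j] needs current[j - 1], so cells in a row are filled in order.
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```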

## 1. Setup: Clone repo and install dependencies

In [None]:
import os

# Clone the repository
if not os.path.exists('/content/ProjectPhaistos'):
    !git clone https://github.com/Nacryos/ProjectPhaistos.git /content/ProjectPhaistos
else:
    !cd /content/ProjectPhaistos && git pull

# Set working directory
os.chdir('/content/ProjectPhaistos/repro_decipher_phonetic_prior')

# Install dependencies
!pip install -q panphon pyyaml matplotlib scipy numpy torch

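# Force UTF-8 mode, required for panphon IPA feature extraction. Python reads
# PYTHONUTF8 at interpreter startup, so this applies to the `!python`
# subprocesses launched in later cells, not to the already-running kernel.
os.environ['PYTHONUTF8'] = '1'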

print('Setup complete.')

## 2. Verify data files and dependencies

In [None]:
import sys
sys.path.insert(0, '.')

from pathlib import Path
from datasets.registry import get_corpus, list_corpora

# Check core data files exist
critical_files = [
    'third_party/NeuroDecipher/data/uga-heb.small.no_spe.cog',
    'third_party/DecipherUnsegmented/data/iberian.csv',
    'data_external/rodriguez_ramos_2014_personal_names.tsv',
    'data_external/wiktionary_descendants_pg.tsv',
    'data_external/wiktionary_descendants_on.tsv',
    'data_external/wiktionary_descendants_oe.tsv',
    'configs/gothic.yaml',
    'configs/ugaritic.yaml',
    'configs/iberian.yaml',
    'configs/validation.yaml',
]

print('Data file check:')
all_ok = True
for f in critical_files:
    exists = Path(f).exists()
    status = 'OK' if exists else 'MISSING'
    if not exists:
        all_ok = False
    print(f'  {status}: {f}')

# Check corpora load
print(f'\nAvailable corpora: {list_corpora()[:10]}...')

# Quick load test
for name in ['ugaritic', 'gothic', 'iberian']:
    c = get_corpus(name)
    print(f'  {name}: {len(c.lost_text)} training texts')

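# Check imports
from importlib.metadata import version
import torch
import panphon  # imported only to confirm the package is installed
print(f'\nPyTorch: {torch.__version__} (CUDA: {torch.cuda.is_available()})')
# Read the version from package metadata rather than relying on a __version__ attribute
print(f'panphon: {version("panphon")}')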

if all_ok:
    print('\nAll checks passed. Ready to train.')
else:
    print('\nSome files missing. Check the output above.')

## 3. Smoke test (quick validation that everything works)

In [None]:
import time
from pathlib import Path

# Run a quick smoke test on Ugaritic to verify the pipeline
print('Running Ugaritic smoke test...')
t0 = time.time()
!python -m repro.run_experiment ugaritic --smoke --restarts 1 --output-root outputs_smoke
print(f'Smoke test completed in {time.time() - t0:.1f}s')

# Show results
import json
summary_path = Path('outputs_smoke/ugaritic/run_summary.json')
if summary_path.exists():
    summary = json.loads(summary_path.read_text())
    rows = summary.get('rows', [])
    print('\nSmoke test results:')
    for row in rows:
        print(f"  {row.get('method', '?')}: P@1={row.get('score_best', 0):.3f}")
    if rows:
        print('\nPipeline is working correctly.')
    else:
        # An empty summary means the run produced no scores
        print('\nSummary contains no result rows. Check for errors above.')
else:
    print('Smoke test output not found. Check for errors above.')

## 4. Save results to Google Drive (recommended)

Colab sessions are **temporary**: when the session ends, all local files are deleted.
To keep your results, save them to Google Drive.

**How to use:** Change `SAVE_TO_DRIVE = False` to `SAVE_TO_DRIVE = True` in the cell below,
then run it. A popup will ask you to sign in and grant Drive access; approve it to continue.
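
If results were already produced under local storage before Drive was enabled, they can be copied over afterwards. This is a minimal sketch, assuming Python 3.8+ (for `dirs_exist_ok`). The helper name `copy_outputs` is illustrative, and the paths in the commented usage mirror the ones set in the cell below.

```python
import shutil
from pathlib import Path


def copy_outputs(src: str, dst: str) -> int:
    """Copy the tree at src into dst, merging with existing files; return the file count at dst."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"No outputs directory at {src_path}")
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return sum(1 for p in dst_path.rglob("*") if p.is_file())


# Example call in Colab, after drive.mount('/content/drive'):
# copy_outputs(
#     '/content/ProjectPhaistos/repro_decipher_phonetic_prior/outputs',
#     '/content/drive/MyDrive/PhoneticPriorModel/outputs',
# )
```

Files with the same relative path at the destination are overwritten, so copy before starting a new run that writes to Drive.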

In [None]:
# ======================================================
# CHANGE THIS TO True TO SAVE RESULTS TO GOOGLE DRIVE
SAVE_TO_DRIVE = False
# ======================================================

import os

if SAVE_TO_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    OUTPUT_ROOT = '/content/drive/MyDrive/PhoneticPriorModel/outputs'
    print('Google Drive mounted. Results will persist across sessions.')
else:
    OUTPUT_ROOT = '/content/ProjectPhaistos/repro_decipher_phonetic_prior/outputs'
    print('Using local storage. Results will be LOST when this session ends.')
    print('To keep results, set SAVE_TO_DRIVE = True above and re-run this cell.')

os.makedirs(OUTPUT_ROOT, exist_ok=True)
print(f'\nOutput directory: {OUTPUT_ROOT}')

---
## 5. Ugaritic Experiment (Table 3)

The primary evaluation from the paper. Deciphers Ugaritic text using Hebrew as the known language.

- **Paper baselines**: Bayesian (P@1=0.054), NeuroCipher (P@1=0.654)
- **Paper result**: base=0.583, full=0.672
- **Variants**: `base` (no prior), `full` (full mapping prior)
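
Each variant is trained several times, and the scores are then summarized as best, mean, and standard deviation over restarts. The sketch below shows one way to do that aggregation, assuming per-restart P@1 scores are available as a list. Whether the repository uses the sample or the population standard deviation is not stated here, so the sample version is an assumption, and the scores are placeholders.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_restarts(scores: List[float]) -> Dict[str, float]:
    """Collapse per-restart scores into best / mean / sample standard deviation."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "score_best": max(scores),
        "score_mean": mean(scores),
        # stdev() needs two or more values; a single restart has no spread.
        "score_std": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Placeholder scores for illustration only, not experimental results.
print(summarize_restarts([0.50, 0.60, 0.70]))
```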

In [None]:
import time

RESTARTS = 3  # Paper uses 5; reduce for faster initial runs

print(f'Starting Ugaritic experiment ({RESTARTS} restarts, ~2-3 hours)...')
print(f'Output: {OUTPUT_ROOT}/ugaritic/')
t0 = time.time()

!python -m repro.run_experiment ugaritic \
    --restarts {RESTARTS} \
    --output-root {OUTPUT_ROOT}

elapsed = time.time() - t0
print(f'\nUgaritic experiment completed in {elapsed/60:.1f} minutes')

In [None]:
# Display Ugaritic results
import csv
from pathlib import Path

table3_path = Path(f'{OUTPUT_ROOT}/ugaritic/table3_ugaritic.csv')
if table3_path.exists():
    print('=== Ugaritic Results (Table 3 style) ===')
    print(f'{"Method":<20} {"P@1 Best":>10} {"P@1 Mean":>10} {"P@1 Std":>10}')
    print('-' * 52)
    with table3_path.open() as f:
        for row in csv.DictReader(f):
            score_std = float(row.get('score_std') or 0)  # tolerate a missing or empty column
            print(f"{row['method']:<20} {float(row['score_best']):>10.3f} {float(row['score_mean']):>10.3f} {score_std:>10.3f}")
else:
    print('Results not found. Check experiment output above.')

---
## 6. Iberian Personal Names Experiment (Figure 4a)

Evaluates the model on Iberian personal names from the Bronze of Ascoli,
compared against Latin cognates.

- **Metrics**: P@1, P@3, P@5, P@10 curves
- **Variants**: `base` (no prior), `full` (full mapping prior)
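
A P@K curve is the per-variant sequence of (K, precision) points. The sketch below groups CSV rows into such curves, assuming columns named `variant`, `k`, and `p_at_k_mean`, as read by the display cell in this section. The helper name is illustrative and the sample rows are placeholders, not real results.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def load_curves(
    csv_text: str, value_column: str = "p_at_k_mean"
) -> Dict[str, List[Tuple[int, float]]]:
    """Group rows by variant and return each curve sorted by k."""
    curves: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        curves[row["variant"]].append((int(row["k"]), float(row[value_column])))
    return {variant: sorted(points) for variant, points in curves.items()}


# Placeholder values for illustration only.
sample = """variant,k,p_at_k_mean
base,10,0.4
base,1,0.1
full,1,0.2
full,10,0.5
"""
for variant, points in load_curves(sample).items():
    print(variant, points)
```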

In [None]:
import time

print(f'Starting Iberian names experiment ({RESTARTS} restarts, ~1-2 hours)...')
print(f'Output: {OUTPUT_ROOT}/iberian_names/')
t0 = time.time()

!python -m repro.run_experiment iberian-names \
    --restarts {RESTARTS} \
    --output-root {OUTPUT_ROOT}

elapsed = time.time() - t0
print(f'\nIberian names experiment completed in {elapsed/60:.1f} minutes')

In [None]:
# Display Iberian P@K results
import csv
from pathlib import Path

pk_path = Path(f'{OUTPUT_ROOT}/iberian_names/p_at_k.csv')
if pk_path.exists():
    print('=== Iberian Names P@K Results ===')
    print(f'{"Variant":<15} {"K":>5} {"P@K Best":>10} {"P@K Mean":>10}')
    print('-' * 42)
    with pk_path.open() as f:
        for row in csv.DictReader(f):
            print(f"{row['variant']:<15} {int(row['k']):>5} {float(row['p_at_k_best']):>10.3f} {float(row['p_at_k_mean']):>10.3f}")
else:
    print('Results not found.')

# Show the P@K curve plot if generated
from IPython.display import Image, display
fig_path = Path(f'{OUTPUT_ROOT}/iberian_names/p_at_k_curve.png')
if fig_path.exists():
    display(Image(filename=str(fig_path), width=600))

---
## 7. Gothic Experiment (Table 2 + 4)

The most comprehensive experiment. Tests across 3 known languages (PG, ON, OE),
4 whitespace ratios (0%, 25%, 50%, 75%), and 3 model variants.

**This is the longest experiment.** Consider:
- Reducing variants: `--variants base,full` (skip partial)
- Reducing restarts: already using 3 instead of 5
- Running overnight on Colab Pro for longer session limits
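
To see how the options above shrink the workload, the run grid can be enumerated directly. This sketch mirrors the run count in the timing table (variants × whitespace ratios × restarts). It assumes the three known languages do not add separate training runs, which this notebook does not confirm, and the function name is illustrative.

```python
from itertools import product
from typing import List, Sequence, Tuple

WHITESPACE_RATIOS = (0, 25, 50, 75)  # percent


def gothic_runs(variants: Sequence[str], restarts: int) -> List[Tuple[str, int, int]]:
    """All (variant, whitespace ratio, restart index) combinations."""
    return list(product(variants, WHITESPACE_RATIOS, range(restarts)))


all_variants = gothic_runs(["base", "partial", "full"], restarts=3)
reduced = gothic_runs(["base", "full"], restarts=3)
print(len(all_variants))  # 36
print(len(reduced))       # 24
```

Under these assumptions, dropping `partial` removes a third of the runs.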

In [None]:
import time

# For faster initial results, run with fewer variants
GOTHIC_VARIANTS = 'base,full'  # Skip 'partial' to save time; add it back with 'base,partial,full'

print(f'Starting Gothic experiment ({RESTARTS} restarts, variants={GOTHIC_VARIANTS})...')
print(f'Output: {OUTPUT_ROOT}/gothic/')
print('This will take several hours. Consider running overnight.')
t0 = time.time()

!python -m repro.run_experiment gothic \
    --restarts {RESTARTS} \
    --variants {GOTHIC_VARIANTS} \
    --output-root {OUTPUT_ROOT}

elapsed = time.time() - t0
print(f'\nGothic experiment completed in {elapsed/3600:.1f} hours')

In [None]:
# Display Gothic Table 2 results
import csv
from pathlib import Path

t2_path = Path(f'{OUTPUT_ROOT}/gothic/table2.csv')
if t2_path.exists():
    print('=== Gothic Results (Table 2 style) ===')
    print(f'{"WR%":<6} {"Known":<8} {"Base":>8} {"Partial":>8} {"Full":>8}')
    print('-' * 42)
    with t2_path.open() as f:
        for row in csv.DictReader(f):
            base = f"{float(row['base']):.3f}" if row.get('base') else '-'
            partial = f"{float(row['partial']):.3f}" if row.get('partial') else '-'
            full = f"{float(row['full']):.3f}" if row.get('full') else '-'
            print(f"{row['whitespace_ratio']:<6} {row['known_language']:<8} {base:>8} {partial:>8} {full:>8}")
else:
    print('Results not found.')

---
## 8. Generate visualizations

Auto-detect experiment outputs and generate all applicable plots:
- Character distribution heatmaps
- Training loss curves
- P@K curves
- Branch comparison bar charts

In [None]:
from pathlib import Path

# Generate visualizations for each completed experiment
for experiment in ['ugaritic', 'gothic', 'iberian_names']:
    exp_dir = Path(f'{OUTPUT_ROOT}/{experiment}')
    if exp_dir.exists():
        print(f'\nGenerating visualizations for {experiment}...')
        !python -m repro.run_experiment visualize {exp_dir}
    else:
        print(f'Skipping {experiment} (not yet run)')

In [None]:
# Display generated figures
from pathlib import Path
from IPython.display import Image, display

for experiment in ['ugaritic', 'gothic', 'iberian_names']:
    fig_dir = Path(f'{OUTPUT_ROOT}/{experiment}/figures')
    if fig_dir.exists():
        pngs = sorted(fig_dir.glob('*.png'))
        if pngs:
            print(f'\n=== {experiment} visualizations ===')
            for png in pngs:
                print(f'\n{png.name}:')
                display(Image(filename=str(png), width=700))

---
## 9. Validation branch experiments (optional)

Run the model against validation language family branches from
ancient-scripts-datasets. Each branch tests decipherment within a language
family (e.g., Germanic, Semitic, Celtic).

These experiments test generalization: how well the phonetic prior
works across diverse language families.

In [None]:
# List available validation branches
from datasets.registry import list_corpora

val_branches = [c.replace('validation_', '') for c in list_corpora() if c.startswith('validation_')]
print(f'Available validation branches ({len(val_branches)}):')
for i, b in enumerate(sorted(val_branches)):
    print(f'  {i+1:3d}. {b}')

In [None]:
import time

# Run a selection of key validation branches
# Uncomment or add branches you want to test
VALIDATION_BRANCHES = [
    'germanic_expanded',
    'semitic',
    # 'celtic',
    # 'romance',
    # 'slavic',
    # 'turkic',
]

for branch in VALIDATION_BRANCHES:
    print(f'\n{"="*60}')
    print(f'Running validation: {branch} ({RESTARTS} restarts)...')
    t0 = time.time()
    !python -m repro.run_experiment validation \
        --branch {branch} \
        --restarts {RESTARTS} \
        --output-root {OUTPUT_ROOT}
    print(f'Completed {branch} in {(time.time()-t0)/60:.1f} minutes')

---
## 10. Download results

Download the full outputs directory as a zip file.

In [None]:
import shutil
from pathlib import Path

output_dir = Path(OUTPUT_ROOT)
if output_dir.exists() and any(output_dir.iterdir()):
    archive_base = '/content/phaistos_results'  # make_archive appends '.zip'
    archive_path = shutil.make_archive(archive_base, 'zip', output_dir)
    print(f'Results archived to {archive_path}')
    print(f'Size: {Path(archive_path).stat().st_size / 1024 / 1024:.1f} MB')

    # Auto-download in Colab
    try:
        from google.colab import files
        files.download(archive_path)
    except ImportError:
        print('Not in Colab. Download the zip manually.')
else:
    print('No results to download yet.')