# Aetheria O1 — Full Training Pipeline

This notebook runs the complete **Original Sin O1** pipeline:

| Step | What happens | Output |
|------|-------------|--------|
| 1 | Mount Drive + env setup | — |
| 2 | Install deps | — |
| 3 | GPU check | — |
| 4 | **Envy** — calls Groq LLM, generates seed conversations | `data/envy_conv.txt` |
| 5 | **Gluttony** — TinyLlama generates training pairs | `data/gluttony_conv.txt` |
| 6 | Clean both conv files and combine them | `data/envy_cleaned.txt`, `data/gluttony_cleaned.txt`, `data/o1_combined.txt` |
| 7 | Train **SPM tokenizer** on combined data | `data/spm.model`, `data/spm.vocab` |
| 8 | Train model on envy data | `models/envy.pt` |
| 9 | Train model on gluttony data | `models/gluttony.pt` |
| 10 | **Merge** envy.pt + gluttony.pt (50/50) | `models/aetheria_o1.pt` |
| 11 | Copy all artefacts to Drive, then download | Drive folder + browser download |

**Runtime → Change runtime type → T4 GPU** before running.

Upload `Aetheria/` to `MyDrive/AetheriaAI/Aetheria/` in Google Drive first.

## Step 1 — Mount Google Drive & set paths

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

# ── Edit this if your Drive path is different ──────────────────────
AETHERIA_ROOT = '/content/drive/MyDrive/AetheriaAI/Aetheria'
# ───────────────────────────────────────────────────────────────────

os.chdir(AETHERIA_ROOT)
print('Working directory:', os.getcwd())
print('Files:', os.listdir('.'))

## Step 2 — Install dependencies & set API key

In [None]:
# Pre-install sympy so torch.distributed doesn't stall on it during transformers import
!pip install -q sympy

# Install remaining deps (no version pin — let Colab resolve normally)
!pip install -q sentencepiece transformers accelerate requests

# Verify transformers imports cleanly
import transformers
print(f'transformers {transformers.__version__} ✓')

# ── Set your Groq API key (needed for Envy) ────────────────────────
# Get a free key at https://console.groq.com
import os
os.environ['GROQ_API_KEY'] = ''   # ← paste your key here

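# Or load it from a .env file in your Drive folder (needs `pip install -q python-dotenv`).
# override=True matters here: the line above already set the variable to '',
# and load_dotenv does not replace existing variables by default.
# from dotenv import load_dotenv; load_dotenv('.env', override=True)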

if os.environ.get('GROQ_API_KEY'):
    print('Groq key: set ✓')
else:
    print('WARNING: GROQ_API_KEY not set — Envy will fail. Set it above.')

## Step 3 — Check GPU

In [None]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')
if device == 'cuda':
    print(torch.cuda.get_device_name(0))
else:
    print('WARNING: No GPU. Go to Runtime → Change runtime type → T4 GPU.')

## Step 4 — Envy: call Groq → `data/envy_conv.txt`

Envy asks the free Groq LLM seed questions in Aetheria's style.  
Increase `ENVY_ROUNDS` (passed to the script as `--rounds`) for more data; each round is one API call with a ~1.5 s delay.
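
To pick a round count, it helps to estimate the minimum wall-clock time first. The sketch below only multiplies the round count by the ~1.5 s delay quoted above; `envy_min_runtime_s` is a hypothetical helper, not part of `envy.py`, and it assumes one delay per round while ignoring API latency and retries, so real runs will take longer.

```python
def envy_min_runtime_s(rounds: int, delay_s: float = 1.5) -> float:
    """Lower bound on Envy's run time: one fixed delay per round.

    Assumption: the delay applies once per round. API latency and
    retries are not modelled, so this is a floor, not a prediction.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return rounds * delay_s


for rounds in (80, 200, 500):
    minutes = envy_min_runtime_s(rounds) / 60
    print(f"{rounds} rounds: at least {minutes:.1f} min")
```

With the default of 80 rounds this gives a floor of 2.0 minutes; 500 rounds gives at least 12.5 minutes.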

In [None]:
ENVY_ROUNDS = 80   # ← adjust: more rounds = more data

!python Original_sin/envy/envy.py \
    --provider groq \
    --rounds {ENVY_ROUNDS} \
    --output data/envy_conv.txt

import os
size = os.path.getsize('data/envy_conv.txt') if os.path.exists('data/envy_conv.txt') else 0
print(f'envy_conv.txt: {size/1024:.1f} KB')

## Step 5 — Gluttony: devour TinyLlama → `data/gluttony_conv.txt`

Gluttony feeds seed prompts to TinyLlama and records (prompt, response) pairs.  
TinyLlama (~2.2 GB) is downloaded on the first run. A T4 handles 100 rounds in about 5 minutes.
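
The "feed a prompt, record the pair" loop can be pictured roughly as follows. This is an illustrative sketch, not the code in `gluttony.py`: `collect_pairs` and the `generate` callable are hypothetical names, and the stub below stands in for the real TinyLlama call so the example runs without a model download.

```python
from itertools import cycle, islice
from typing import Callable, Iterable, List, Tuple


def collect_pairs(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    rounds: int,
) -> List[Tuple[str, str]]:
    """Cycle through the seed prompts for `rounds` iterations and keep
    every (prompt, response) pair whose response is non-empty."""
    pairs = []
    for prompt in islice(cycle(prompts), rounds):
        response = generate(prompt).strip()
        if response:  # drop empty generations
            pairs.append((prompt, response))
    return pairs


def stub_generate(prompt: str) -> str:
    """Stand-in for the model call (assumption: the real script calls TinyLlama here)."""
    return f"(reply to: {prompt})"


seeds = ["What is memory?", "Describe the sea."]
for prompt, response in collect_pairs(seeds, stub_generate, rounds=3):
    print(prompt, "->", response)
```

Passing the generator in as a callable keeps the loop testable on CPU; how the real script batches prompts or formats the output file is not shown here.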

In [None]:
GLUTTONY_ROUNDS = 100   # ← adjust

!python Original_sin/gluttony/gluttony.py \
    --model tinyllama \
    --mode generate \
    --rounds {GLUTTONY_ROUNDS} \
    --output data/gluttony_conv.txt

size = os.path.getsize('data/gluttony_conv.txt') if os.path.exists('data/gluttony_conv.txt') else 0
print(f'gluttony_conv.txt: {size/1024:.1f} KB')

## Step 6 — Clean both conv files

In [None]:
import sys
from pathlib import Path

sys.path.insert(0, 'scripts')  # make scripts/clean_data.py importable
from clean_data import clean_text

def clean_file(src: str, dst: str):
    raw = Path(src).read_text(encoding='utf-8', errors='ignore')
    paras = clean_text(raw)
    Path(dst).write_text('\n\n'.join(paras), encoding='utf-8')
    print(f'{src} → {dst}  ({len(paras)} paragraphs, {Path(dst).stat().st_size/1024:.1f} KB)')

clean_file('data/envy_conv.txt',     'data/envy_cleaned.txt')
clean_file('data/gluttony_conv.txt', 'data/gluttony_cleaned.txt')

# Also write a combined file for SPM training
combined = Path('data/envy_cleaned.txt').read_text(encoding='utf-8') + '\n\n' + \
           Path('data/gluttony_cleaned.txt').read_text(encoding='utf-8')
Path('data/o1_combined.txt').write_text(combined, encoding='utf-8')
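# Report the size on disk (bytes), not the character count of the string
print(f"o1_combined.txt: {Path('data/o1_combined.txt').stat().st_size/1024:.1f} KB (used for tokenizer training)")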

## Step 7 — Train SentencePiece tokenizer on combined data

In [None]:
!python scripts/train_spm.py \
    --input data/o1_combined.txt \
    --model_prefix data/spm \
    --vocab_size 8000

print('spm.model exists:', os.path.exists('data/spm.model'))
print('spm.vocab exists:', os.path.exists('data/spm.vocab'))

## Step 8 — Train on Envy data → `models/envy.pt`

This model learns the voice/style that came from the Groq LLM (Envy's harvest).

In [None]:
ENVY_EPOCHS = 20   # ← adjust

!python scripts/prototype_model.py \
    --data    data/envy_cleaned.txt \
    --spm     data/spm.model \
    --vocab_size 8000 \
    --epochs  {ENVY_EPOCHS} \
    --batch_size 32 \
    --seq_len 128 \
    --device  cuda \
    --ckpt_name envy.pt

size = os.path.getsize('models/envy.pt') if os.path.exists('models/envy.pt') else 0
print(f'envy.pt: {size/1024/1024:.2f} MB')

## Step 9 — Train on Gluttony data → `models/gluttony.pt`

This model learns from what TinyLlama generated (Gluttony's harvest).

In [None]:
GLUTTONY_EPOCHS = 20   # ← adjust

!python scripts/prototype_model.py \
    --data    data/gluttony_cleaned.txt \
    --spm     data/spm.model \
    --vocab_size 8000 \
    --epochs  {GLUTTONY_EPOCHS} \
    --batch_size 32 \
    --seq_len 128 \
    --device  cuda \
    --ckpt_name gluttony.pt

size = os.path.getsize('models/gluttony.pt') if os.path.exists('models/gluttony.pt') else 0
print(f'gluttony.pt: {size/1024/1024:.2f} MB')

## Step 10 — Merge envy.pt + gluttony.pt → `models/aetheria_o1.pt`

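Weight averaging (a "model soup") blends the two checkpoints, parameter by parameter, into a single model. This is possible because both models share the same architecture and the tokenizer trained in Step 7.
`--weight_a` sets envy's share: raise it (e.g. 0.7) to favour envy, lower it (e.g. 0.3) to favour gluttony.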
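
As a rough illustration of what weight averaging means, the sketch below blends two parameter dictionaries with a linear interpolation. It is not the code in `merge_models.py`: `average_state_dicts` is a hypothetical helper, and it assumes gluttony receives the remaining `1 - weight_a`. Plain floats are used so it runs anywhere; the same arithmetic applies to tensors of matching shape.

```python
from typing import Any, Dict


def average_state_dicts(
    a: Dict[str, Any],
    b: Dict[str, Any],
    weight_a: float = 0.5,
) -> Dict[str, Any]:
    """Return weight_a * a + (1 - weight_a) * b for every parameter.

    Both dictionaries must contain exactly the same parameter names,
    which is only true when the two models share one architecture.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if a.keys() != b.keys():
        raise KeyError("checkpoints have different parameter names")
    return {
        name: weight_a * a[name] + (1.0 - weight_a) * b[name]
        for name in a
    }


envy = {"layer.weight": 1.0, "layer.bias": 0.0}
gluttony = {"layer.weight": 3.0, "layer.bias": 2.0}
print(average_state_dicts(envy, gluttony, weight_a=0.5))
```

With real checkpoints, integer buffers (step counters, for example) usually need to be copied rather than averaged; the sketch leaves that case out.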

In [None]:
!python scripts/merge_models.py \
    --a        models/envy.pt \
    --b        models/gluttony.pt \
    --out      models/aetheria_o1.pt \
    --weight_a 0.5

size = os.path.getsize('models/aetheria_o1.pt') if os.path.exists('models/aetheria_o1.pt') else 0
print(f'aetheria_o1.pt: {size/1024/1024:.2f} MB')

## Step 11 — Copy to Drive & download everything

Files to download and place locally:
```
models/aetheria_o1.pt   → Aetheria/models/aetheria_o1.pt
models/envy.pt          → Aetheria/models/envy.pt       (keep for reference)
models/gluttony.pt      → Aetheria/models/gluttony.pt   (keep for reference)
data/spm.model          → Aetheria/data/spm.model
data/spm.vocab          → Aetheria/data/spm.vocab
```
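
Checkpoint files are large, so after placing them locally it may be worth confirming they arrived intact. One option, sketched here with the standard library only, is to print a SHA-256 digest for each file in Colab and again on the local machine, then compare the two. `sha256_of` is a hypothetical helper, and the paths are the ones listed above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


ARTEFACTS = [
    "models/aetheria_o1.pt",
    "models/envy.pt",
    "models/gluttony.pt",
    "data/spm.model",
    "data/spm.vocab",
]

for name in ARTEFACTS:
    if Path(name).exists():
        print(f"{sha256_of(name)}  {name}")
    else:
        print(f"(missing)  {name}")
```

Matching digests on both machines mean the copy is byte-for-byte identical.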

Then talk to Aetheria:
```powershell
python Original_sin/aetheria_core.py talk --ckpt models\aetheria_o1.pt
```

In [None]:
import shutil

DRIVE_DEST = '/content/drive/MyDrive/AetheriaAI_output'
os.makedirs(DRIVE_DEST, exist_ok=True)

artefacts = [
    'models/aetheria_o1.pt',
    'models/envy.pt',
    'models/gluttony.pt',
    'data/spm.model',
    'data/spm.vocab',
    'data/envy_conv.txt',
    'data/gluttony_conv.txt',
]

for f in artefacts:
    if os.path.exists(f):
        dest = os.path.join(DRIVE_DEST, os.path.basename(f))
        shutil.copy(f, dest)
        print(f'  ✓ {f}  →  {dest}')
    else:
        print(f'  ✗ {f}  (not found)')

In [None]:
# Direct browser download (optional — Drive copy above is usually enough)
from google.colab import files

for f in ['models/aetheria_o1.pt', 'data/spm.model', 'data/spm.vocab']:
    if os.path.exists(f):
        print(f'Downloading {f}...')  # announce before the download starts
        files.download(f)