# Aetheria — Colab Training Notebook

This notebook handles the **heavy** parts of the Original Sin pipeline:
- **Gluttony** — generate conversations with TinyLlama and distil it into the student model
- **Train** — train TinyTransformerLM on all collected data
- **Download** — save the `.pt` checkpoints and tokenizer files so you can copy them back to `models/` and `data/`

Select **Runtime → Change runtime type → T4 GPU** before running.

---
Upload the `Aetheria/` folder to your Google Drive under `MyDrive/AetheriaAI/Aetheria/`,
or use the zip upload cell below.
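
Before running the steps, it can help to confirm that the upload is complete. The sketch below checks for the scripts that later cells invoke. The helper name `missing_paths` and the `EXPECTED` list are illustrative and not part of the project; the paths themselves are taken from the commands in this notebook.

```python
from pathlib import Path

# Scripts invoked by later cells in this notebook; adjust if your layout differs.
EXPECTED = [
    "Original_sin/gluttony/gluttony.py",
    "scripts/clean_data.py",
    "scripts/train_spm.py",
    "scripts/prototype_model.py",
]

def missing_paths(root, expected=EXPECTED):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths(AETHERIA_ROOT)` after Step 1 should return an empty list when everything is in place; any paths it returns are the ones to re-upload.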

## Step 1 — Mount Google Drive & set paths

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

# ── Edit this if your Drive path is different ──────────────────────────────
AETHERIA_ROOT = '/content/drive/MyDrive/AetheriaAI/Aetheria'
# ──────────────────────────────────────────────────────────────────────────

os.chdir(AETHERIA_ROOT)
print('Working directory:', os.getcwd())
print('Files:', os.listdir('.'))

### Alternative — Upload a zip instead of using Drive
If you prefer to upload a zip of your `Aetheria/` folder, run the next cell.
Skip it if you already mounted Drive above.

In [None]:
# SKIP THIS CELL if you used Drive above
# from google.colab import files
# import zipfile, os
#
# uploaded = files.upload()          # upload Aetheria.zip
# zip_name = list(uploaded.keys())[0]
# with zipfile.ZipFile(zip_name, 'r') as z:
#     z.extractall('/content/')
# AETHERIA_ROOT = '/content/Aetheria'
# os.chdir(AETHERIA_ROOT)
# print('Extracted to', AETHERIA_ROOT)

## Step 2 — Install dependencies

In [None]:
!pip install -q sentencepiece transformers accelerate
print('Dependencies installed.')

## Step 3 — Check GPU

In [None]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')
if device == 'cuda':
    print(torch.cuda.get_device_name(0))
else:
    print('WARNING: No GPU found. Go to Runtime → Change runtime type → T4 GPU')

## Step 4 — Gluttony: devour TinyLlama

**Two modes run back to back:**
1. `generate` — TinyLlama answers seed prompts → saved as text training pairs
2. `distill` — TRUE distillation: student trains on TinyLlama's full probability distributions via KL loss → saved as `models/aetheria_distilled.pt`

TinyLlama (~2.2 GB) downloads on first run. A T4 GPU handles both modes in about 10 minutes.
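
To make the `--temperature` and `--alpha` flags concrete, here is a plain-Python sketch of one common distillation loss for a single token position. It is an assumption about the general technique, not a copy of what `gluttony.py` does: it assumes teacher and student share one vocabulary, that `alpha` weights the soft (KL) term, and that the soft term is scaled by the squared temperature. The function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of `logits` after dividing each one by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, target, temperature=3.0, alpha=0.7):
    """Blend a softened KL term with hard-label cross-entropy (assumed formulation)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Scaling by T^2 is a common convention that keeps the soft term's
    # gradients on a scale comparable to the hard term.
    soft = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Hard term: cross-entropy of the student against the true next token.
    hard = -math.log(softmax(student_logits)[target])
    return alpha * soft + (1 - alpha) * hard
```

Under these assumptions, a higher temperature flattens both distributions so the student also learns from the teacher's low-probability tokens, and `alpha = 0.7` puts most of the weight on matching the teacher rather than the hard labels.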

In [None]:
import os

# ── 4a: generate text pairs (feeds into clean → train pipeline) ──────────────
!python Original_sin/gluttony/gluttony.py \
    --model tinyllama \
    --mode generate \
    --rounds 100 \
    --output data/gluttony_conversations.txt

size = os.path.getsize('data/gluttony_conversations.txt') if os.path.exists('data/gluttony_conversations.txt') else 0
print(f'gluttony_conversations.txt: {size/1024:.1f} KB')

# ── 4b: true KL distillation → aetheria_distilled.pt ────────────────────────
!python Original_sin/gluttony/gluttony.py \
    --model tinyllama \
    --mode distill \
    --epochs 5 \
    --temperature 3.0 \
    --alpha 0.7

size2 = os.path.getsize('models/aetheria_distilled.pt') if os.path.exists('models/aetheria_distilled.pt') else 0
print(f'aetheria_distilled.pt: {size2/1024/1024:.1f} MB')

## Step 5 — Merge and clean all data
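
The merge cell in this step opens `data/conversations.txt` in append mode, so running it twice writes the same sources twice. If that matters for your run, a rerun-safe variant could look like the sketch below. It assumes one training example per line and that blank lines carry no meaning, which may not hold for your data format; `merge_new_lines` is a hypothetical helper, not part of the project.

```python
import os

def merge_new_lines(sources, dest):
    """Append lines from `sources` to `dest`, skipping lines already present.

    Returns the number of lines appended.
    """
    seen = set()
    needs_newline = False
    if os.path.exists(dest):
        with open(dest, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        seen = set(text.splitlines())
        # Avoid gluing the first new line onto an unterminated last line.
        needs_newline = bool(text) and not text.endswith("\n")

    added = 0
    with open(dest, "a", encoding="utf-8") as out:
        if needs_newline:
            out.write("\n")
        for src in sources:
            if not os.path.exists(src):
                continue  # like the merge cell, skip sources that are missing
            with open(src, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line and line not in seen:
                        out.write(line + "\n")
                        seen.add(line)
                        added += 1
    return added
```

Note the trade-off: line-level deduplication also drops lines that legitimately repeat across conversations, so it only suits data where each line is an independent example.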

In [None]:
# Merge envy + gluttony data into conversations.txt
raw_path = 'data/conversations.txt'
sources = ['data/envy_conversations.txt', 'data/gluttony_conversations.txt']

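# Note: mode 'a' appends, so re-running this cell adds the same sources again.
with open(raw_path, 'a', encoding='utf-8') as out:
    for src in sources:
        if os.path.exists(src):
            # Read through a context manager so the source file is closed promptly
            with open(src, encoding='utf-8', errors='ignore') as f:
                text = f.read()
            out.write(text + '\n')
            print(f'  merged: {src}')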

!python scripts/clean_data.py

size = os.path.getsize('data/cleaned_conversations.txt')
print(f'cleaned_conversations.txt: {size/1024:.1f} KB')

## Step 6 — Train SentencePiece tokenizer

In [None]:
!python scripts/train_spm.py \
    --input data/cleaned_conversations.txt \
    --model_prefix data/spm \
    --vocab_size 8000

print('SPM model:', os.path.exists('data/spm.model'))

## Step 7 — Train TinyTransformerLM

Adjust `--epochs` and `--batch_size` to taste.  
With a T4 GPU, 20 epochs on ~1 MB of text takes ~10-20 minutes.
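
When changing `--batch_size`, `--seq_len`, or `--epochs`, a rough step count helps predict how run time will scale. The sketch below assumes the training script cuts the token stream into non-overlapping windows of `seq_len` tokens and keeps the last partial batch; `prototype_model.py` may batch differently, and both function names are illustrative.

```python
import math

def steps_per_epoch(num_tokens, seq_len=128, batch_size=32):
    """Optimizer steps per epoch, assuming non-overlapping windows of `seq_len` tokens."""
    sequences = num_tokens // seq_len
    return math.ceil(sequences / batch_size)

def total_steps(num_tokens, epochs=20, seq_len=128, batch_size=32):
    """Total optimizer steps across all epochs under the same assumption."""
    return epochs * steps_per_epoch(num_tokens, seq_len, batch_size)
```

For a hypothetical corpus of 250,000 tokens, the defaults give `steps_per_epoch(250_000) == 62` and `total_steps(250_000) == 1240`; doubling the batch size to 64 halves the steps per epoch to 31.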

In [None]:
!python scripts/prototype_model.py \
    --data data/cleaned_conversations.txt \
    --spm  data/spm.model \
    --vocab_size 8000 \
    --epochs 20 \
    --batch_size 32 \
    --seq_len 128 \
    --device cuda

## Step 8 — Copy checkpoint back to Drive & download

Training saves the `.pt` file inside `models/`. The first cell below copies it to the Drive root so you can grab it easily;
the second offers a direct browser download.
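
Once the copy cell has run, comparing checksums is one way to catch a truncated or partial copy before the runtime is recycled. This is an optional sketch using only the standard library; `sha256_of` and `copies_match` are hypothetical helper names.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large checkpoints are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(src, dest):
    """True when both files have identical contents."""
    return sha256_of(src) == sha256_of(dest)
```

For example, `copies_match(latest, drive_dest)` uses the variables defined in the copy cell. The same `sha256_of` value can be compared against the file on your local machine after download.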

In [None]:
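import glob, os, shutil

pts = glob.glob('models/*.pt')
if not pts:
    print('No checkpoint found. Did training finish?')
else:
    # Pick the most recently modified checkpoint; an alphabetical sort
    # would not reliably put the newest file last.
    latest = max(pts, key=os.path.getmtime)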
    size_mb = os.path.getsize(latest) / (1024*1024)
    print(f'Found checkpoint: {latest}  ({size_mb:.1f} MB)')

    # Copy to Drive root for easy access
    drive_dest = '/content/drive/MyDrive/aetheria_latest.pt'
    shutil.copy(latest, drive_dest)
    print(f'Copied to Drive: {drive_dest}')

    # Also copy tokenizer files
    for tok_file in ['data/spm.model', 'data/spm.vocab']:
        if os.path.exists(tok_file):
            dest = f'/content/drive/MyDrive/{os.path.basename(tok_file)}'
            shutil.copy(tok_file, dest)
            print(f'Copied tokenizer: {dest}')

In [None]:
# Direct browser download
import glob, os
from google.colab import files

# Download every checkpoint in models/
pts = sorted(glob.glob('models/*.pt'))
for pt in pts:
    print(f'Downloading: {pt}')
    files.download(pt)

# Download tokenizer files
for tok_file in ['data/spm.model', 'data/spm.vocab']:
    if os.path.exists(tok_file):
        print(f'Downloading: {tok_file}')
        files.download(tok_file)

print('\nPlace files locally:')
print('  models/aetheria_colab.pt')
print('  models/aetheria_distilled.pt')
print('  data/spm.model  (replace existing)')
print('  data/spm.vocab  (replace existing)')

---
## Done!

Once downloaded, place the files in your local project:
```
Aetheria/models/aetheria_latest.pt   ← rename if needed
Aetheria/data/spm.model
Aetheria/data/spm.vocab
```
Then talk to Aetheria locally:
```powershell
python Original_sin/aetheria_core.py talk
```