# Poetry-EEBO-BERT Training: Layer 2 Architecture

**This notebook trains the proper Layer 2 model:**
- ✅ Starts FROM EEBO-BERT (Layer 1)
- ✅ Fine-tunes on 17.7M lines of poetry
- ✅ Creates Poetry-EEBO-BERT (Layer 1 → Layer 2)

## Architecture Path:
```
bert-base-uncased
    ↓ Fine-tune on EEBO 1595-1700
EEBO-BERT (Layer 1) ✓ COMPLETE
    ↓ Fine-tune on poetry corpus (THIS NOTEBOOK)
Poetry-EEBO-BERT (Layer 2) ⏳ TRAINING
```

## Before you start:

**Required files in Google Drive:**
1. **EEBO-BERT model** (Layer 1): `MyDrive/AI and Poetry/EEBO_1595-1700/eebo_bert_finetuned/`
2. **Poetry database**: `MyDrive/AI and Poetry/poetry_unified.db` (or build it below)

**Estimated time:** 6-8 hours on Colab A100 GPU

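**Cost:** Colab's free tier typically offers a T4 at best; an A100 generally requires a paid plan or compute units. The notebook also runs on a T4, only more slowly.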

---

## Step 1: Setup - Install Dependencies

In [None]:
!pip install transformers datasets accelerate tqdm -q
print("✓ Dependencies installed")

## Step 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✓ Google Drive mounted")

## Step 3: Check GPU

**IMPORTANT:** Request an **A100 GPU** for fastest training:
1. Runtime → Change runtime type
2. Hardware accelerator: GPU
3. GPU type: A100

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    if 'A100' in torch.cuda.get_device_name(0):
        print("\n✓ A100 detected! Training will be FAST (~6 hours)")
    elif 'T4' in torch.cuda.get_device_name(0):
        print("\n⚠ T4 detected. Training will be slower (~12-14 hours)")
        print("  Consider requesting A100 in Runtime settings")
else:
    print("\n❌ No GPU detected! Please enable GPU in Runtime settings")

## Step 4: Verify EEBO-BERT Model

Check that Layer 1 (EEBO-BERT) exists in Google Drive before going further.

In [None]:
from pathlib import Path

# EEBO-BERT path (Layer 1)
EEBO_BERT_PATH = Path("/content/drive/MyDrive/AI and Poetry/EEBO_1595-1700/eebo_bert_finetuned")

print("Checking for EEBO-BERT model...")
if EEBO_BERT_PATH.exists():
    config_file = EEBO_BERT_PATH / "config.json"
    # Weights are named model.safetensors (newer transformers) or pytorch_model.bin (older)
    candidates = [EEBO_BERT_PATH / "model.safetensors", EEBO_BERT_PATH / "pytorch_model.bin"]
    model_file = next((p for p in candidates if p.exists()), None)
    
    if config_file.exists() and model_file is not None:
        size_mb = model_file.stat().st_size / (1024**2)
        print(f"✓ EEBO-BERT found ({model_file.name}): {size_mb:.1f} MB")
        print(f"  Path: {EEBO_BERT_PATH}")
    else:
        print("❌ EEBO-BERT folder exists but missing model files!")
        print("  Expected: config.json plus model.safetensors or pytorch_model.bin")
else:
    print(f"❌ EEBO-BERT not found at {EEBO_BERT_PATH}")
    print("  Please check Google Drive path!")
    print("  Expected: MyDrive/AI and Poetry/EEBO_1595-1700/eebo_bert_finetuned/")

## Step 5: Check/Prepare Poetry Database

Option A: Use existing `poetry_unified.db` from Google Drive

Option B: Build it (see Step 5B below)

In [None]:
import sqlite3

# Check for existing database
DB_PATH_DRIVE = Path("/content/drive/MyDrive/AI and Poetry/poetry_unified.db")
DB_PATH_LOCAL = Path("/content/poetry_unified.db")

print("Checking for poetry database...")

if DB_PATH_DRIVE.exists():
    size_gb = DB_PATH_DRIVE.stat().st_size / (1024**3)
    print(f"✓ Found in Google Drive: {size_gb:.2f} GB")
    
    # Copy to local for faster access
    print("  Copying to local storage for faster access...")
    !cp "{DB_PATH_DRIVE}" "{DB_PATH_LOCAL}"
    print("  ✓ Copied to local")
    
    # Check line count
    conn = sqlite3.connect(DB_PATH_LOCAL)
    cursor = conn.cursor()
    line_count = cursor.execute("SELECT COUNT(*) FROM lines WHERE is_blank = 0").fetchone()[0]
    conn.close()
    
    print(f"  ✓ Database contains {line_count:,} non-blank lines")
    
else:
    print("❌ poetry_unified.db not found in Google Drive")
    print("  Expected: MyDrive/AI and Poetry/poetry_unified.db")
    print("  Run Step 5B below to build it, or upload it manually")
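
The queries in this notebook read `line_id`, `line_text`, and `is_blank` from a table named `lines`. A minimal sketch of a pre-flight schema check follows, assuming those three columns are all that matter; `missing_columns` is a hypothetical helper name, not part of the original pipeline, and the in-memory database only stands in for `poetry_unified.db`.

```python
import sqlite3

REQUIRED_COLUMNS = {"line_id", "line_text", "is_blank"}  # columns the notebook's queries use

def missing_columns(conn, table="lines", required=REQUIRED_COLUMNS):
    """Return the required column names that `table` lacks (all of them if the table is absent)."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # Table names cannot be bound as SQL parameters, so `table` must be a trusted identifier.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return sorted(set(required) - present)

# Demo: an in-memory table that deliberately lacks is_blank
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE lines (line_id INTEGER PRIMARY KEY, line_text TEXT)")
demo_missing = missing_columns(demo)
print(demo_missing)  # ['is_blank']
```

In the notebook this could be called as `missing_columns(sqlite3.connect(DB_PATH_LOCAL))`; an empty list would suggest the export query in Step 6 can run.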

## Step 5B (Optional): Build Poetry Database

**Only run this if you don't have `poetry_unified.db` in Google Drive**

Required JSONL files in `MyDrive/AI and Poetry/Data/Databases/poetry_corpus/`:
- `shakespeare_complete_works.jsonl`
- `gutenberg_reconstructed.jsonl`
- `core_poets_complete.jsonl`
- `poetrydb.jsonl`

This will take 2-4 hours. Skip if you already have the database!

In [None]:
# Uncomment and run ONLY if you need to build the database:

# !pip install tqdm
# 
# # Run the database builder from the previous notebook
# # (Copy the UnifiedDatabaseBuilder class and build code here)
# print("See poetry_bert_training_full_pipeline.ipynb for database building code")
# print("Or upload poetry_unified.db directly to Google Drive")

## Step 6: Export Training Corpus from Database

In [None]:
import sqlite3
from tqdm import tqdm

CORPUS_OUTPUT = "/content/poetry_training_corpus.txt"

print("Exporting training corpus from database...")
print(f"Database: {DB_PATH_LOCAL}")
print(f"Output: {CORPUS_OUTPUT}")

# Connect to database
conn = sqlite3.connect(DB_PATH_LOCAL)
cursor = conn.cursor()

# Get total line count
total_lines = cursor.execute("SELECT COUNT(*) FROM lines WHERE is_blank = 0").fetchone()[0]
print(f"\nExporting {total_lines:,} non-blank lines...")

# Export lines
with open(CORPUS_OUTPUT, 'w', encoding='utf-8') as f:
    cursor.execute("SELECT line_text FROM lines WHERE is_blank = 0 ORDER BY line_id")
    
    batch_size = 10000
    pbar = tqdm(total=total_lines, desc="Exporting")
    
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        
        for (line_text,) in rows:
            f.write(line_text + '\n')
        
        pbar.update(len(rows))
    
    pbar.close()

conn.close()

# Check file size
import os
size_gb = os.path.getsize(CORPUS_OUTPUT) / (1024**3)
print(f"\n✓ Corpus exported: {size_gb:.2f} GB")
print(f"✓ {total_lines:,} lines ready for training")
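
The export writes one database row per output line, so the newline count of the file should equal `total_lines`. A larger count would suggest that some `line_text` values contain embedded newlines, which the `text` loader would later treat as separate examples. The sketch below is one way to check this; `count_file_lines` is a hypothetical helper and the temporary file only stands in for the real corpus.

```python
import os
import tempfile

def count_file_lines(path, chunk_size=1 << 20):
    """Count newline bytes by streaming the file in binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

# Demo on a two-line temporary file standing in for the exported corpus
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Shall I compare thee to a summer's day?\n")
    f.write("Thou art more lovely and more temperate:\n")
demo_count = count_file_lines(demo_path)
print(demo_count)  # 2
```

In the notebook the comparison would be `count_file_lines(CORPUS_OUTPUT) == total_lines`.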

## Step 7: Load EEBO-BERT (Layer 1) as Starting Point

**Critical:** We load EEBO-BERT, NOT bert-base-uncased

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

print("="*70)
print("LOADING EEBO-BERT (LAYER 1) AS STARTING POINT")
print("="*70)
print(f"Path: {EEBO_BERT_PATH}")
print()

# Load tokenizer and model from EEBO-BERT
tokenizer = BertTokenizer.from_pretrained(str(EEBO_BERT_PATH))
model = BertForMaskedLM.from_pretrained(str(EEBO_BERT_PATH))

print(f"✓ EEBO-BERT loaded: {model.num_parameters():,} parameters")
print(f"✓ Tokenizer vocab size: {len(tokenizer):,}")
print()
print("This model will now be fine-tuned on poetry (Layer 1 → Layer 2)")
print("="*70)

## Step 8: Load and Tokenize Poetry Corpus

In [None]:
from datasets import load_dataset

MAX_LENGTH = 512

# Load corpus
print(f"Loading corpus from {CORPUS_OUTPUT}...")
dataset = load_dataset('text', data_files={'train': CORPUS_OUTPUT}, split='train')
print(f"✓ Loaded {len(dataset):,} lines")

# Tokenize
print("\nTokenizing corpus...")
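def tokenize_function(examples):
    # No padding here: DataCollatorForLanguageModeling pads each batch to its
    # longest line, which avoids storing every short verse line as 512 tokens
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=MAX_LENGTH,
        return_special_tokens_mask=True
    )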

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text'],
    desc="Tokenizing"
)
print("✓ Tokenization complete")
print(f"  Total examples: {len(tokenized_dataset):,}")
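
A single verse line is usually far shorter than `MAX_LENGTH`, so how the corpus is padded has a large effect on storage and compute. The arithmetic below compares padding every line to 512 tokens with a lower bound for per-batch padding. The 17.7M figure comes from this notebook; the mean of 12 tokens per line is a hypothetical placeholder that should be replaced with a value measured on a sample of the real corpus.

```python
def token_slots(num_lines, tokens_per_line):
    """Total token positions stored or processed for the whole corpus."""
    return num_lines * tokens_per_line

NUM_LINES = 17_700_000       # corpus size quoted in this notebook
MAX_LENGTH = 512
ASSUMED_MEAN_TOKENS = 12     # hypothetical; measure this on the real corpus

fixed = token_slots(NUM_LINES, MAX_LENGTH)             # every line padded to MAX_LENGTH
dynamic = token_slots(NUM_LINES, ASSUMED_MEAN_TOKENS)  # lower bound for per-batch padding
print(f"{fixed:,} vs {dynamic:,}")  # 9,062,400,000 vs 212,400,000
```

Per-batch padding pads to the longest line in each batch, so its real cost sits somewhat above the lower bound shown here.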

## Step 9: Configure Training

**Training parameters optimized for poetry fine-tuning:**

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Training configuration
BATCH_SIZE = 8  # Good for A100
NUM_EPOCHS = 3  # Standard for fine-tuning
LEARNING_RATE = 5e-5  # Standard BERT fine-tuning rate
SAVE_STEPS = 1000
LOGGING_STEPS = 100

print("Training Configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Max length: {MAX_LENGTH}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Save every: {SAVE_STEPS} steps")
print(f"  Log every: {LOGGING_STEPS} steps")

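# Calculate total optimizer steps (the final partial batch of each epoch still counts)
steps_per_epoch = -(-len(tokenized_dataset) // BATCH_SIZE)  # ceiling division
total_steps = steps_per_epoch * NUM_EPOCHS
print(f"\nTotal training steps: {total_steps:,}")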
print(f"Estimated time on A100: ~6 hours")
print(f"Estimated time on T4: ~12-14 hours")
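
The time estimates can be cross-checked against the step count. The sketch below assumes a single GPU, no gradient accumulation, and the 17.7M-line corpus size quoted in this notebook; both function names are hypothetical. It reports the throughput a run would need to finish within a given number of hours, which can then be compared with the it/s figure on the Trainer progress bar.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Steps for a full run; the last partial batch of each epoch counts as a step."""
    return math.ceil(num_examples / batch_size) * epochs

def required_steps_per_second(total_steps, hours):
    """Throughput needed to finish `total_steps` within `hours`."""
    return total_steps / (hours * 3600)

steps = optimizer_steps(17_700_000, 8, 3)
rate = required_steps_per_second(steps, 6)
print(f"{steps:,} steps")                                # 6,637,500 steps
print(f"{rate:.1f} steps/s needed for a 6-hour run")     # 307.3 steps/s needed for a 6-hour run
```

If the measured rate is well below the required one, a larger batch size or a subsample of the corpus may be worth considering before committing to a full run.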

## Step 10: Setup Trainer

In [None]:
# Data collator for MLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./poetry_eebo_bert_checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    save_steps=SAVE_STEPS,
    save_total_limit=2,  # Keep only 2 most recent checkpoints
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,  # Mixed precision for speed
    logging_dir='./logs',
    report_to='none',
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

print("✓ Training setup complete")
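
For readers unfamiliar with what `mlm_probability=0.15` implies, the following is a plain-Python sketch of BERT-style masking on one sequence: each non-special token is selected with probability 0.15, and a selected token is replaced by `[MASK]` 80% of the time, by a random id 10% of the time, and left unchanged otherwise. It illustrates the scheme and is not the library's implementation; `mask_tokens` is a hypothetical name, and the special-token ids are those of `bert-base-uncased`, on the assumption that EEBO-BERT keeps that vocabulary.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (inputs, labels) after BERT-style masking of a single sequence."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)   # not selected: excluded from the loss
            continue
        labels.append(tok)                # selected: the model must recover this id
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                     # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # 10%: replace with a random id
        # remaining 10%: leave the original token in place
    return inputs, labels

# bert-base-uncased ids: [CLS]=101, [SEP]=102, [MASK]=103, vocabulary size 30522
sequence = [101, 2000, 2001, 2002, 2003, 2004, 102]   # arbitrary ids between the specials
masked, labels = mask_tokens(sequence, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
print(masked)
print(labels)
```

Only positions whose label is not `-100` contribute to the loss, which is why roughly 15% of tokens drive each update.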

## Step 11: Start Training (6-8 hours)

⚠️ **This will take 6-8 hours on A100 GPU**

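Keep the tab open unless your Colab plan supports background execution; a closed or idle session may be disconnected and the run lost.

**Checkpoints are saved every 1000 steps to `./poetry_eebo_bert_checkpoints` on the runtime's local disk, not to Google Drive, so they do not survive a recycled runtime. The final model is copied to Google Drive in Step 12.**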

In [None]:
import time
from datetime import datetime

print("="*70)
print("STARTING POETRY-EEBO-BERT TRAINING (LAYER 1 → LAYER 2)")
print("="*70)
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Estimated duration: 6-8 hours on A100")
print("Keep this tab open unless background execution is enabled")
print("="*70)
print()

start_time = time.time()

# Train
trainer.train()

total_time = time.time() - start_time
print()
print("="*70)
print(f"✓ TRAINING COMPLETE!")
print(f"Total time: {total_time/3600:.2f} hours")
print(f"End time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*70)
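
If a run is interrupted while its checkpoint directory still exists, training can continue from the most recent checkpoint instead of starting over. The Trainer names checkpoint folders `checkpoint-<step>`; the sketch below picks the one with the highest step number. `latest_checkpoint` is a hypothetical helper, and the temporary directory only imitates the layout of `./poetry_eebo_bert_checkpoints`.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically: a string sort would rank 2000 above 12000
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path

# Demo on a temporary directory laid out like a Trainer output_dir
demo_dir = Path(tempfile.mkdtemp())
for step in (1000, 12000, 2000):
    (demo_dir / f"checkpoint-{step}").mkdir()
found = latest_checkpoint(demo_dir)
print(found.name)  # checkpoint-12000

# Usage in this notebook (not executed here):
# ckpt = latest_checkpoint("./poetry_eebo_bert_checkpoints")
# trainer.train(resume_from_checkpoint=str(ckpt) if ckpt else None)
```

The `transformers.trainer_utils.get_last_checkpoint` function does a similar job and may be preferable when it is available in the installed version.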

## Step 12: Save Final Model

In [None]:
# Save to local Colab storage
output_dir = "./poetry_eebo_bert_trained"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✓ Model saved to local: {output_dir}")

# Save to Google Drive (IMPORTANT!)
drive_output = "/content/drive/MyDrive/AI and Poetry/poetry_eebo_bert_trained"
!mkdir -p "{drive_output}"
!cp -r {output_dir}/* "{drive_output}/"
print(f"✓ Model saved to Google Drive: {drive_output}")
print()
print("IMPORTANT: Download this model to your local machine!")
print("  Location: MyDrive/AI and Poetry/poetry_eebo_bert_trained/")
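
Because `cp` into a mounted Drive folder can fail quietly on large files, it may be worth confirming the copy before the runtime is shut down. The sketch below compares file names and sizes between the two directories. It is a cheap check rather than a checksum, and `copy_mismatches` is a hypothetical helper; the temporary directories only stand in for `output_dir` and `drive_output`.

```python
import tempfile
from pathlib import Path

def copy_mismatches(src_dir, dst_dir):
    """List files under src_dir that are missing from dst_dir or differ in size."""
    src, dst = Path(src_dir), Path(dst_dir)
    problems = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        if not dst_file.is_file():
            problems.append(f"missing: {rel}")
        elif dst_file.stat().st_size != src_file.stat().st_size:
            problems.append(f"size differs: {rel}")
    return problems

# Demo: the destination lacks one of the two source files
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
(src / "config.json").write_text("{}")
(src / "vocab.txt").write_text("[PAD]\n")
(dst / "config.json").write_text("{}")
result = copy_mismatches(src, dst)
print(result)  # ['missing: vocab.txt']
```

In the notebook, `copy_mismatches(output_dir, drive_output)` returning an empty list would indicate that every file arrived with the expected size.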

## Step 13: Test the Model

A quick sanity check that the model produces plausible predictions.

In [None]:
from transformers import pipeline

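# Reuse the in-memory model that was just trained; pass the device explicitly
# so the pipeline's inputs land on the same device as the model
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
)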

print("Testing Poetry-EEBO-BERT (Layer 2)...")
print("="*60)

# Test on Shakespeare (historical + poetry)
test_text = "Shall I compare thee to a [MASK] day?"
print(f"\nTest 1: '{test_text}'")
predictions = fill_mask(test_text)
for i, pred in enumerate(predictions[:5], 1):
    print(f"  {i}. {pred['token_str']}: {pred['score']:.4f}")

# Test on iambic line
test_text2 = "The [MASK] of spirit in a waste of shame"
print(f"\nTest 2: '{test_text2}'")
predictions2 = fill_mask(test_text2)
for i, pred in enumerate(predictions2[:5], 1):
    print(f"  {i}. {pred['token_str']}: {pred['score']:.4f}")

# Test on rhyming couplet
test_text3 = "So long as men can breathe or eyes can [MASK]"
print(f"\nTest 3: '{test_text3}'")
predictions3 = fill_mask(test_text3)
for i, pred in enumerate(predictions3[:5], 1):
    print(f"  {i}. {pred['token_str']}: {pred['score']:.4f}")

print("\n" + "="*60)
print("✓ Model is working!")
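
The printed predictions are easier to judge against the words Shakespeare actually wrote: "expense" in Test 2 (Sonnet 129) and "see" in Test 3 (Sonnet 18). For Test 1 the original "summer's" spans several WordPiece tokens under the standard BERT tokenizer, so a single `[MASK]` can at best recover "summer". The sketch below looks up the rank of an expected word in a fill-mask result; `rank_of` is a hypothetical helper and the `fake` list is made-up data in the pipeline's output shape, not real model output.

```python
def rank_of(predictions, expected):
    """1-based rank of `expected` among fill-mask predictions, or None if absent."""
    for rank, pred in enumerate(predictions, start=1):
        if pred["token_str"].strip().lower() == expected.lower():
            return rank
    return None

# Made-up predictions in the same shape the fill-mask pipeline returns
fake = [
    {"token_str": "winter", "score": 0.31},
    {"token_str": "summer", "score": 0.27},
]
print(rank_of(fake, "summer"))  # 2
print(rank_of(fake, "spring"))  # None
```

In the notebook this could be called as `rank_of(predictions2, "expense")`. The pipeline returns only its top five candidates by default, so `None` means the word fell outside that list rather than that the model cannot predict it.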

## Step 14: Create Documentation File

In [None]:
from datetime import datetime

# Create README for the model
readme_content = f"""# Poetry-EEBO-BERT (Layer 2)

**Trained:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Architecture Path:

```
bert-base-uncased (110M parameters)
    ↓ Fine-tune on EEBO 1595-1700
EEBO-BERT (Layer 1 - Historical Semantics)
    ↓ Fine-tune on 17.7M poetry lines (THIS MODEL)
Poetry-EEBO-BERT (Layer 2 - Poetry + Historical)
```

## Training Details:

- **Base model:** EEBO-BERT (Layer 1)
- **Training corpus:** 17.7M lines of poetry
- **Sources:** Shakespeare, Gutenberg, Core 27 Poets, PoetryDB
- **Epochs:** {NUM_EPOCHS}
- **Batch size:** {BATCH_SIZE}
- **Learning rate:** {LEARNING_RATE}
- **Max length:** {MAX_LENGTH}
- **Total steps:** {total_steps:,}
- **Training time:** {total_time/3600:.2f} hours
- **GPU:** {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}

## Purpose:

This model captures both:
1. **Historical semantics** (from EEBO-BERT training on 1595-1700 texts)
2. **Poetic conventions** (from poetry corpus training)

## Usage:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('./poetry_eebo_bert_trained')
model = BertModel.from_pretrained('./poetry_eebo_bert_trained')
```

## Next Steps:

1. Run Layer 3 analysis with prosodic conditioning
2. Compare with Base BERT, EEBO-BERT, and Poetry-BERT
3. Analyze Shakespeare sonnets trajectory tortuosity

## Citation:

```bibtex
@unpublished{{stecher2025poetry_eebo_bert,
  title={{Poetry-EEBO-BERT: A Layered Architecture for Historical Poetry Analysis}},
  author={{Stecher, Justin}},
  year={{2025}},
  note={{Layer 2 of three-layer BERT architecture}}
}}
```
"""

# Save README
readme_path_local = f"{output_dir}/README.md"
readme_path_drive = f"{drive_output}/README.md"

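# The README contains non-ASCII characters (arrows), so set the encoding explicitly
with open(readme_path_local, 'w', encoding='utf-8') as f:
    f.write(readme_content)

with open(readme_path_drive, 'w', encoding='utf-8') as f:
    f.write(readme_content)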

print("✓ README.md created in model directory")

## Summary

✅ **Loaded EEBO-BERT (Layer 1) as starting point**

✅ **Fine-tuned on 17.7M lines of poetry**

✅ **Saved Poetry-EEBO-BERT (Layer 2) to Google Drive**

✅ **Model tested and working**

---

## Next Steps:

1. **Download the model** from Google Drive:
   - `MyDrive/AI and Poetry/poetry_eebo_bert_trained/`

2. **Run Layer 3 analysis** on local machine:
   ```bash
   python scripts/layer3_bert_prosody.py --model poetry_eebo
   ```

3. **Compare all models**:
   - Base BERT (baseline)
   - EEBO-BERT (Layer 1 - historical)
   - Poetry-BERT (Layer 2 - independent poetry path)
   - **Poetry-EEBO-BERT (Layer 2 - proper layered path)** ← NEW!

4. **Analyze results** in `notebooks/complete_layered_analysis.ipynb`

5. **Start writing Paper 1** (DH venue) with complete architecture!

---

**Model location:** `MyDrive/AI and Poetry/poetry_eebo_bert_trained/`

**Training completed:** See timestamp above

**Ready for:** Shakespeare sonnets trajectory tortuosity analysis with Layer 3 prosodic conditioning