# Train BERT on Gutenberg Poetry Corpus

**Corpus:** 3 million lines of English poetry (15th-20th centuries)

**Input:** `gutenberg_poetry_corpus.jsonl.gz` on Google Drive (full path set in `CORPUS_PATH`, Step 3)

**Output:** Fine-tuned BERT model saved to Google Drive

**Hardware:** A100 GPU (High-RAM recommended)

**Estimated time:** 1-2 hours (tokenization) + 3-4 hours (training)

---

## Setup Instructions

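1. **Runtime:** In Colab, open Runtime → Change runtime type, then select the A100 GPU and High-RAM.
2. Run the cells in order; later steps depend on variables defined in earlier ones.
3. Monitor training progress: loss is logged every 100 steps and a checkpoint is saved every 5,000 steps.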

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted")

## Step 2: Install Dependencies

In [None]:
!pip install -q transformers datasets accelerate

print("✓ Dependencies installed")

## Step 3: Configuration

In [None]:
import os
import torch

# Paths
CORPUS_PATH = "/content/drive/MyDrive/AI and Poetry/Historical Embeddings/gutenberg_poetry_corpus.jsonl.gz"
OUTPUT_DIR = "/content/drive/MyDrive/AI and Poetry/Historical Embeddings/gutenberg_bert_finetuned"
CHECKPOINT_DIR = "/content/drive/MyDrive/AI and Poetry/Historical Embeddings/gutenberg_bert_checkpoints"

# Training parameters
BATCH_SIZE = 8
EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_LENGTH = 512
SAVE_STEPS = 5000

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
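As a rough sanity check on the time estimates, the number of optimizer steps follows directly from the corpus size and the parameters above. The sketch below assumes roughly 3 million examples (the figure quoted at the top of this notebook), a single GPU, and no gradient accumulation. `estimate_steps` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
import math

def estimate_steps(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for single-GPU training."""
    # The last partial batch is kept by default, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Assumed values: ~3M lines, BATCH_SIZE = 8, EPOCHS = 3
steps_per_epoch, total_steps = estimate_steps(3_000_000, 8, 3)
print(f"Steps per epoch: {steps_per_epoch:,}")
print(f"Total steps:     {total_steps:,}")
```

Under those assumptions this gives 375,000 steps per epoch and 1,125,000 steps in total. Multiplying the total by the seconds-per-step shown in the training progress bar gives a projected run time to compare against the estimate quoted above.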

## Step 4: Load BERT Model and Tokenizer

In [None]:
from transformers import BertTokenizer, BertForMaskedLM

print("Loading BERT model...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

print(f"✓ Model loaded: {sum(p.numel() for p in model.parameters()):,} parameters")

## Step 5: Load and Prepare Corpus

In [None]:
import gzip
import json
from datasets import Dataset

print(f"\nLoading corpus from {CORPUS_PATH}...")

# Read compressed JSONL, keeping only the text of each record
texts = []
with gzip.open(CORPUS_PATH, 'rt', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        texts.append(data['s'])  # 's' is the poetry line

print(f"✓ Loaded {len(texts):,} lines")

# Create HuggingFace dataset
dataset = Dataset.from_dict({'text': texts})
print(f"✓ Dataset created: {len(dataset):,} examples")
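The loader above assumes every line of the file is valid JSON with an `'s'` field. If the corpus might contain blank or malformed records, a more defensive reader could look like the sketch below. `read_poetry_lines` is a hypothetical helper, and silently skipping bad records is an assumption about what is wanted here; counting or logging them may be preferable.

```python
import gzip
import json

def read_poetry_lines(path, text_key="s"):
    """Yield non-empty text fields from a gzipped JSONL file, skipping bad records."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue  # blank line
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # malformed JSON
            # Only accept JSON objects that carry a non-empty string under text_key
            text = record.get(text_key) if isinstance(record, dict) else None
            if isinstance(text, str) and text.strip():
                yield text.strip()
```

It could be used as `Dataset.from_dict({'text': list(read_poetry_lines(CORPUS_PATH))})`.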

## Step 6: Tokenize Corpus (with Parallel Processing)

In [None]:
print("\nTokenizing corpus...")
print("This will take ~1-2 hours with parallel processing\n")

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=MAX_LENGTH,
        padding=False  # Handled by data collator
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4,  # Parallel processing
    remove_columns=['text']
)

print(f"✓ Tokenized {len(tokenized_dataset):,} examples")
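Setting `padding=False` in the tokenizer call defers padding to the data collator, which pads each batch only to its longest sequence rather than to `MAX_LENGTH`. For short poetry lines this keeps batches much smaller than 512 tokens wide. The plain-Python illustration below shows the idea; `pad_batch` is a hypothetical helper, not the collator's actual code, and the token IDs are placeholders apart from 101 and 102 (`[CLS]` and `[SEP]`) and the pad ID 0 used by `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token ID lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two tokenized lines of different length (placeholder IDs between CLS and SEP)
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```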

## Step 7: Prepare Data Collator

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

print("✓ Data collator configured")
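For intuition, `mlm_probability=0.15` means about 15% of non-special tokens are selected as prediction targets in each batch. Under the standard BERT recipe, 80% of the selected tokens are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sketch below illustrates that rule in plain Python. `mask_tokens` is a hypothetical helper; the library's implementation is vectorized in PyTorch and differs in detail. The values 103 (`[MASK]`) and 30522 (vocabulary size) are those of `bert-base-uncased`.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) after BERT-style masking of one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 positions are ignored by the loss
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# Placeholder token IDs wrapped in [CLS] (101) and [SEP] (102)
ids = [101, 2000, 2001, 2002, 2003, 2004, 102]
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
print(inputs)
print(labels)
```

Because masking is re-sampled every time a batch is built, the model sees different masked positions for the same line in each epoch.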

## Step 8: Configure Training

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    logging_steps=100,
    fp16=True,  # Mixed precision training
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

print("✓ Trainer configured")

## Step 9: Start Training

**This will take 3-4 hours on an A100 GPU.**

In [None]:
import time

print("="*60)
print("STARTING TRAINING")
print("="*60)
print("This will take approximately 3-4 hours.")
print("Keep this tab open: Colab may disconnect the runtime if it is closed.")
print("="*60)

start_time = time.time()

trainer.train()

training_time = (time.time() - start_time) / 3600
print(f"\n✓ Training complete! Total time: {training_time:.2f} hours")
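If the Colab session disconnects mid-run, training can usually be resumed from the most recent checkpoint in `CHECKPOINT_DIR` instead of starting over. The sketch below finds the highest-numbered `checkpoint-<step>` folder, which is the naming scheme the `Trainer` uses. `latest_checkpoint` is a hypothetical helper; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.

```python
import os
import re

def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        # Only directories count; stray files with a matching name are ignored
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare by step number so checkpoint-10000 beats checkpoint-5000
    return os.path.join(checkpoint_dir, max(candidates)[1])
```

After re-running Steps 1 through 8 in a fresh session, `trainer.train(resume_from_checkpoint=latest_checkpoint(CHECKPOINT_DIR))` would pick up from the saved step. If no checkpoint exists the helper returns `None`, and training starts from the beginning.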

## Step 10: Save Final Model

In [None]:
print(f"\nSaving model to {OUTPUT_DIR}...")

model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("✓ Model saved successfully!")
print(f"\nModel location: {OUTPUT_DIR}")
print("\nYou can now use this model for:")
print("- Phase 2: Prosody-conditioned training")
print("- Semantic analysis of poetry across centuries")
print("- Comparative studies of poetic language evolution")

## Step 11: Test the Model (Optional)

In [None]:
# Test masked language modeling
from transformers import pipeline

print("Testing model...\n")

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

test_sentences = [
    "The [MASK] is sweet",
    "O [MASK], where art thou?",
    "I wandered lonely as a [MASK]"
]

for sentence in test_sentences:
    print(f"Input: {sentence}")
    results = fill_mask(sentence, top_k=3)
    for r in results:
        print(f"  {r['token_str']}: {r['score']:.3f}")
    print()

---

## Training Complete!

Your Gutenberg Poetry BERT model is now trained and saved to Google Drive.

**Next steps:**
1. Download the model to a local machine (once the M4 Max arrives)
2. Proceed to Phase 2: Prosody-conditioned training
3. Run pilot studies on canonical poems

**Model details:**
- Base model: BERT-base-uncased
- Training data: 3 million lines of poetry (15th-20th centuries)
- Epochs: 3
- Final loss: see the training output logged in Step 9
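
Rather than reading the final loss off the scrolling log, it can be summarized from the trainer's recorded history. The sketch below averages the last few logged losses and converts the result to a perplexity over masked tokens. `summarize_loss` is a hypothetical helper, and it assumes each logging entry is a dict with a `"loss"` key, as in `trainer.state.log_history`.

```python
import math

def summarize_loss(log_history, window=10):
    """Average the last `window` logged losses and report the matching perplexity."""
    # Entries without a "loss" key (e.g. the final run summary) are skipped
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if not losses:
        return None
    recent = losses[-window:]
    mean_loss = sum(recent) / len(recent)
    return {"mean_loss": mean_loss, "perplexity": math.exp(mean_loss)}
```

Calling `summarize_loss(trainer.state.log_history)` after Step 9 would give a single number to record under the model details. Averaging over a window smooths out batch-to-batch noise in the logged values.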