# Hierarchical Multi-Objective Poetry-EEBO-BERT Training

Train EEBO-BERT on Google Colab with a weighted sum of four hierarchical losses:
- **0.5** × MLM (token level)
- **0.2** × Line contrastive
- **0.2** × Quatrain contrastive
- **0.1** × Sonnet contrastive
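
The weighting above amounts to a plain weighted sum of the four per-level losses. A minimal sketch, assuming each level's loss has already been reduced to a scalar; `combine_losses` and the example values are hypothetical and are not taken from `hierarchical_losses.py`:

```python
# Hypothetical helper: combine per-level losses with the weights listed above.
WEIGHTS = {"mlm": 0.5, "line": 0.2, "quatrain": 0.2, "sonnet": 0.1}

def combine_losses(losses, weights=WEIGHTS):
    """Return the weighted sum of scalar losses keyed by level name."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing losses for: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)

# Illustrative values only, not real training output.
example = {"mlm": 2.0, "line": 1.0, "quatrain": 1.0, "sonnet": 1.0}
total = combine_losses(example)
print(f"total loss: {total:.2f}")
```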

## Requirements
- Google Colab with GPU runtime (T4/A100)
- Google Drive mounted with EEBO-BERT checkpoint
- Training data (Shakespeare sonnets)

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q transformers==4.36.0 torch==2.1.0 datasets==2.15.0 tensorboard

In [None]:
# Authenticate with HuggingFace
!pip install -q huggingface_hub

from huggingface_hub import login
login(token='YOUR_HF_TOKEN_HERE')
print("✓ Authenticated with HuggingFace")
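
A token pasted into a cell is easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the token has been exported in an environment variable named `HF_TOKEN`; the helper `get_hf_token` is hypothetical:

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the token from an environment variable; raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token

# In the notebook, pass the result to login() instead of a literal string:
# login(token=get_hf_token())
```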

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

## 2. Upload Training Modules and Data

### Upload Required Files

Use the file upload button in Colab's left sidebar (📁 icon) to upload:

**Training Modules:**
1. `training/hierarchical_dataset.py`
2. `training/hierarchical_losses.py`
3. `training/hierarchical_trainer.py`

**Training Data:**
4. `Data/eebo_sonnets_hierarchical_train.jsonl`
5. `Data/eebo_sonnets_hierarchical_val.jsonl`

**Where to find these files:**
All files are in `/Users/justin/Repos/AI Project/`

**Upload location in Colab:**
- Upload the 3 `.py` files to `/content/training/`
- Upload the 2 `.jsonl` files to `/content/Data/`

Or just drag and drop all 5 files to the Files panel and run:
```python
!mkdir -p training Data
!mv hierarchical_*.py training/
!mv *.jsonl Data/
```
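
Before importing the modules, it can help to confirm that all five files landed where the later cells expect them. A small sketch using only the standard library; `find_missing` is a hypothetical helper, and the paths are the ones listed above:

```python
from pathlib import Path

# Paths follow the upload layout described above.
EXPECTED_FILES = [
    "training/hierarchical_dataset.py",
    "training/hierarchical_losses.py",
    "training/hierarchical_trainer.py",
    "Data/eebo_sonnets_hierarchical_train.jsonl",
    "Data/eebo_sonnets_hierarchical_val.jsonl",
]

def find_missing(paths, root="."):
    """Return the subset of paths that do not exist as files under root."""
    return [p for p in paths if not (Path(root) / p).is_file()]

missing = find_missing(EXPECTED_FILES)
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All 5 files are in place.")
```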

## 3. Configuration

In [None]:
# Configuration
CONFIG = {
    # Model - using HuggingFace hosted model
    'base_model': 'jts3et/eebo-bert',
    'hf_token': 'YOUR_HF_TOKEN_HERE',  # Your HuggingFace token
    
    # Data paths (will be uploaded to Colab)
    'train_data': 'Data/eebo_sonnets_hierarchical_train.jsonl',
    'val_data': 'Data/eebo_sonnets_hierarchical_val.jsonl',
    
    # Output
    'output_dir': 'models/poetry_eebo_hierarchical_bert',
    'save_to_drive': '/content/drive/MyDrive/AI and Poetry/poetry_eebo_hierarchical_bert',
    
    # Training hyperparameters (optimized for GPU)
    'batch_size': 8,  # Good for A100/T4
    'num_epochs': 10,
    'learning_rate': 2e-5,
    'warmup_steps': 100,
    'max_length': 128,
    
    # Loss weights
    'mlm_weight': 0.5,
    'line_weight': 0.2,
    'quatrain_weight': 0.2,
    'sonnet_weight': 0.1,
    'temperature': 0.07,
    
    # Other
    'seed': 42
}

print("Configuration:")
for key, value in CONFIG.items():
    if key == 'hf_token':
        print(f"  {key}: {'*' * 20} (hidden)")
    else:
        print(f"  {key}: {value}")

## 4. Import Training Modules

In [None]:
import sys
sys.path.append('.')

import torch
from transformers import BertTokenizer, TrainingArguments
from training.hierarchical_dataset import HierarchicalPoetryDataset, collate_hierarchical
from training.hierarchical_losses import HierarchicalLoss
from training.hierarchical_trainer import HierarchicalBertModel, HierarchicalTrainer

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 5. Load Data and Model

In [None]:
# Set random seed
torch.manual_seed(CONFIG['seed'])

# Load tokenizer
print(f"Loading tokenizer from {CONFIG['base_model']}...")
tokenizer = BertTokenizer.from_pretrained(CONFIG['base_model'])
print("✓ Tokenizer loaded")

# Load datasets
print(f"\nLoading training data...")
train_dataset = HierarchicalPoetryDataset(
    data_path=CONFIG['train_data'],
    tokenizer=tokenizer,
    max_length=CONFIG['max_length'],
    mlm_probability=0.15
)

print(f"Loading validation data...")
val_dataset = HierarchicalPoetryDataset(
    data_path=CONFIG['val_data'],
    tokenizer=tokenizer,
    max_length=CONFIG['max_length'],
    mlm_probability=0.15
)

print(f"\nDataset sizes:")
print(f"  Train: {len(train_dataset)} sonnets")
print(f"  Val: {len(val_dataset)} sonnets")
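
The record schema of the `.jsonl` files is not shown in this notebook, so it can be worth looking at the first record before training. A minimal sketch with the standard library; `peek_jsonl` is a hypothetical helper and makes no assumption about the field names:

```python
import json

def peek_jsonl(path, n=1):
    """Return the first n records of a JSON Lines file as Python objects."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records

# Usage in the notebook (path from CONFIG):
# for record in peek_jsonl(CONFIG['train_data']):
#     print(record)
```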

In [None]:
# Initialize model
print(f"\nInitializing model from {CONFIG['base_model']}...")
model = HierarchicalBertModel(base_model_path=CONFIG['base_model'])
print("✓ Model initialized")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nModel parameters:")
print(f"  Total: {total_params:,}")
print(f"  Trainable: {trainable_params:,}")

## 6. Setup Training

In [None]:
# Initialize loss function
loss_fn = HierarchicalLoss(
    temperature=CONFIG['temperature'],
    mlm_weight=CONFIG['mlm_weight'],
    line_weight=CONFIG['line_weight'],
    quatrain_weight=CONFIG['quatrain_weight'],
    sonnet_weight=CONFIG['sonnet_weight']
)

print("Loss configuration:")
print(f"  MLM weight: {CONFIG['mlm_weight']}")
print(f"  Line weight: {CONFIG['line_weight']}")
print(f"  Quatrain weight: {CONFIG['quatrain_weight']}")
print(f"  Sonnet weight: {CONFIG['sonnet_weight']}")
print(f"  Temperature: {CONFIG['temperature']}")
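
The `temperature` setting suggests an InfoNCE-style contrastive objective, although the internals of `HierarchicalLoss` are not shown here, so treat that as an assumption. The sketch below illustrates how temperature scales cosine similarities for a single anchor in plain Python; the functions and the toy vectors are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's logit."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_denominator - logits[0]

# Toy 2-D embeddings, for illustration only.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_sharp = info_nce(anchor, positive, negatives, temperature=0.07)
loss_soft = info_nce(anchor, positive, negatives, temperature=1.0)
print(f"temperature=0.07: {loss_sharp:.4f}")
print(f"temperature=1.00: {loss_soft:.4f}")
```

A lower temperature sharpens the softmax, so a positive that is already the nearest neighbour contributes almost no loss, while hard negatives are penalised more heavily.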

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=CONFIG['output_dir'],
    num_train_epochs=CONFIG['num_epochs'],
    per_device_train_batch_size=CONFIG['batch_size'],
    per_device_eval_batch_size=CONFIG['batch_size'],
    learning_rate=CONFIG['learning_rate'],
    warmup_steps=CONFIG['warmup_steps'],
    weight_decay=0.01,
    logging_dir=f"{CONFIG['output_dir']}/logs",
    logging_steps=10,
    evaluation_strategy="epoch",  # renamed to eval_strategy in transformers >= 4.41; 4.36.0 is pinned above
    save_strategy="epoch",
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,  # Mixed precision training
    dataloader_num_workers=2,
    remove_unused_columns=False,
    report_to=["tensorboard"],
    seed=CONFIG['seed']
)

# Initialize trainer
trainer = HierarchicalTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collate_hierarchical,
    loss_fn=loss_fn
)

print("✓ Trainer initialized")
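
`TrainingArguments` defaults to a linear learning-rate schedule, so with `warmup_steps=100` the rate should ramp up linearly and then decay linearly to zero, assuming `HierarchicalTrainer` keeps the stock `Trainer` scheduler. The sketch below reproduces that shape in plain Python; `linear_warmup_lr` is a hypothetical helper and `total_steps=200` is a placeholder, since the real step count depends on the dataset size:

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=200):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# total_steps=200 is a placeholder, not the actual number of training steps.
for step in (0, 50, 100, 150, 200):
    print(f"step {step:>3}: lr = {linear_warmup_lr(step):.2e}")
```

If the run turns out to have only a few hundred optimizer steps, 100 warmup steps is a large share of training, and a smaller value may be worth trying.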

## 7. Start Training

In [None]:
print("="*70)
print("STARTING TRAINING")
print("="*70)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Epochs: {CONFIG['num_epochs']}")
print(f"Learning rate: {CONFIG['learning_rate']}")
print(f"Device: {training_args.device}")
print("="*70)

# Train
trainer.train()

## 8. Save Model

In [None]:
# Save final model locally
final_model_path = f"{CONFIG['output_dir']}/final"
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)
print(f"✓ Model saved locally to {final_model_path}")

# Copy to Google Drive
import shutil
if CONFIG['save_to_drive']:
    print(f"\nCopying model to Google Drive: {CONFIG['save_to_drive']}")
    shutil.copytree(final_model_path, CONFIG['save_to_drive'], dirs_exist_ok=True)
    print("✓ Model saved to Google Drive")

## 9. View Training Metrics

In [None]:
# Load tensorboard
%load_ext tensorboard
%tensorboard --logdir {CONFIG['output_dir']}/logs

In [None]:
# Plot loss history and print final averages
import matplotlib.pyplot as plt

if trainer.loss_history['total']:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    axes[0, 0].plot(trainer.loss_history['total'])
    axes[0, 0].set_title('Total Loss')
    axes[0, 0].set_xlabel('Step')
    
    axes[0, 1].plot(trainer.loss_history['mlm'], label='MLM', color='blue')
    axes[0, 1].set_title('MLM Loss')
    axes[0, 1].set_xlabel('Step')
    
    axes[1, 0].plot(trainer.loss_history['line'], label='Line', color='green')
    axes[1, 0].plot(trainer.loss_history['quatrain'], label='Quatrain', color='orange')
    axes[1, 0].set_title('Line & Quatrain Contrastive Loss')
    axes[1, 0].set_xlabel('Step')
    axes[1, 0].legend()
    
    axes[1, 1].plot(trainer.loss_history['sonnet'], label='Sonnet', color='red')
    axes[1, 1].set_title('Sonnet Contrastive Loss')
    axes[1, 1].set_xlabel('Step')
    
    plt.tight_layout()
    plt.savefig(f"{CONFIG['output_dir']}/loss_curves.png", dpi=300)
    plt.show()
    
    # Average over the steps actually recorded, so runs shorter than 10 steps are not skewed
    print("\nFinal losses (average over up to the last 10 steps):")
    for label, key in [('Total', 'total'), ('MLM', 'mlm'), ('Line', 'line'),
                       ('Quatrain', 'quatrain'), ('Sonnet', 'sonnet')]:
        recent = trainer.loss_history[key][-10:]
        if recent:
            print(f"  {label}: {sum(recent) / len(recent):.4f}")
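
Per-step losses at batch size 8 tend to be noisy, so a trailing average can make the curves easier to read. A minimal sketch; `moving_average` is a hypothetical helper that could be applied to any list in `trainer.loss_history` before plotting:

```python
def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values at each step."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Usage in the notebook:
# axes[0, 0].plot(moving_average(trainer.loss_history['total']))
```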

## 10. Test Trained Model

In [None]:
# Quick test: encode a sonnet line
test_line = "Shall I compare thee to a summer's day?"

model.eval()
with torch.no_grad():
    inputs = tokenizer(test_line, return_tensors='pt').to(training_args.device)
    outputs = model.bert(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1)
    
print(f"Test line: {test_line}")
print(f"Embedding shape: {embedding.shape}")
print(f"Embedding norm: {embedding.norm().item():.4f}")
print("\n✓ Forward pass succeeded")

## Training Complete!

Your hierarchical Poetry-EEBO-BERT model is now trained and saved to:
- Local: `models/poetry_eebo_hierarchical_bert/final`
- Google Drive: (path specified in CONFIG)

Next steps:
1. Download model from Google Drive to local machine
2. Run validation scripts to compare with baseline models
3. Analyze trajectory tortuosity on Shakespeare's sonnets
4. Generate results for Paper 1