# Poetry BERT Training: Complete Pipeline

**This notebook runs the complete pipeline:**
1. ✅ Build unified poetry database from 4 JSONL sources
2. ✅ Export 6.2M line training corpus
3. ✅ Train Poetry BERT on unified corpus
4. ✅ Save and test trained model

## Before you start:

**All files should be in Google Drive at:** `MyDrive/AI and Poetry/Data/Databases/poetry_corpus/`
- `shakespeare_complete_works.jsonl`
- `gutenberg_reconstructed.jsonl`
- `core_poets_complete.jsonl`
- `poetrydb.jsonl`

**Estimated total time:** 8-12 hours on Colab GPU
- Database build: 2-4 hours
- BERT training: 6-8 hours
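
Since the database build alone takes hours, it can be worth spot-checking the JSONL files first. The importers in Step 4 index `title` directly for every source, and the Gutenberg importer also indexes `lines`. The following is a minimal sketch, assuming one JSON object per line; `check_record`, `REQUIRED_FIELDS`, and the sample records are hypothetical and not part of the pipeline:

```python
import json

# Assumed minimum: fields the importers index without a fallback.
REQUIRED_FIELDS = ("title", "lines")

def check_record(raw_line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "lines" in record and not isinstance(record["lines"], list):
        problems.append("'lines' is not a list")
    return problems

# Hypothetical sample records for illustration only.
good = json.dumps({"title": "Sonnet 18", "author": "William Shakespeare",
                   "lines": ["Shall I compare thee to a summer's day?"]})
bad = json.dumps({"author": "Anonymous"})

print(check_record(good))
print(check_record(bad))
```

Running `check_record` over the first few hundred lines of each file is usually enough to catch a wrong file or a schema mismatch before the long build starts.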

## Step 1: Setup - Install Dependencies

In [None]:
!pip install transformers datasets accelerate tqdm -q
print("✓ Dependencies installed")

## Step 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✓ Google Drive mounted")

## Step 3: Check GPU

In [None]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Step 4: Build Unified Poetry Database

This combines all 4 sources into one SQLite database.
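
The `lines` table holds one row per line of verse, so it receives millions of inserts. The builder below issues one `execute` call per line; a common alternative is `executemany`, which passes all of a work's lines in a single call and may shorten the build. The following is a minimal sketch against an in-memory database with a trimmed-down `lines` table; `insert_lines` is a hypothetical helper, not part of the notebook:

```python
import sqlite3

def insert_lines(conn, work_id, lines):
    """Insert all lines of one work with a single executemany call."""
    rows = [
        (work_id, line_num, text, len(text.strip()) == 0)
        for line_num, text in enumerate(lines, 1)
    ]
    conn.executemany(
        "INSERT INTO lines (work_id, line_num, line_text, is_blank) VALUES (?, ?, ?, ?)",
        rows,
    )
    return len(rows)

# Trimmed-down version of the notebook's lines table (assumption: only the
# columns the importers actually fill are needed for this demo).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lines (
        line_id INTEGER PRIMARY KEY AUTOINCREMENT,
        work_id TEXT NOT NULL,
        line_num INTEGER NOT NULL,
        line_text TEXT NOT NULL,
        is_blank BOOLEAN DEFAULT 0
    )
""")

inserted = insert_lines(
    conn, "demo_work",
    ["Tyger Tyger, burning bright,", "", "In the forests of the night;"],
)
conn.commit()

non_blank = conn.execute("SELECT COUNT(*) FROM lines WHERE is_blank = 0").fetchone()[0]
print(inserted, non_blank)
```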

In [None]:
import json
import sqlite3
import logging
from pathlib import Path
from typing import Dict, Optional
from tqdm import tqdm

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

# Configure paths
DRIVE_BASE = Path("/content/drive/MyDrive/AI and Poetry/Data/Databases/poetry_corpus")
DB_PATH = "/content/poetry_unified.db"

SOURCES = {
    'shakespeare': DRIVE_BASE / "shakespeare_complete_works.jsonl",
    'gutenberg': DRIVE_BASE / "gutenberg_reconstructed.jsonl",
    'core_poets': DRIVE_BASE / "core_poets_complete.jsonl",
    'poetrydb': DRIVE_BASE / "poetrydb.jsonl"
}

print("Checking source files...")
for name, path in SOURCES.items():
    if path.exists():
        size_mb = path.stat().st_size / (1024**2)
        print(f"  ✓ {name}: {size_mb:.1f} MB")
    else:
        print(f"  ✗ {name}: NOT FOUND at {path}")
        print(f"     Please check Google Drive path!")

In [None]:
class UnifiedDatabaseBuilder:
    """Build unified SQLite database from all poetry sources."""

    def __init__(self, db_path: str):
        self.db_path = Path(db_path)
        self.conn = None
        self.cursor = None

    def normalize_text(self, text: str) -> str:
        """Normalize text for searching."""
        import re
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        return text.strip()

    def generate_work_id(self, source: str, title: str, author: str = "", index: int = 0) -> str:
        """Generate a work_id if one doesn't exist."""
        import hashlib
        # Create a deterministic ID from title + author
        id_base = f"{source}_{self.normalize_text(title)}_{self.normalize_text(author)}"
        if index > 0:
            id_base += f"_{index}"
        # Use the first 8 hex chars of the MD5 hash as a short, deterministic suffix
        hash_suffix = hashlib.md5(id_base.encode()).hexdigest()[:8]
        return f"{source}_{hash_suffix}"

    def create_schema(self):
        """Create unified database schema."""
        logging.info("Creating database schema...")

        # Works table
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS works (
                work_id TEXT PRIMARY KEY,
                title TEXT NOT NULL,
                author TEXT,
                first_appearance_date INTEGER,
                publication_date INTEGER,
                composition_date INTEGER,
                period TEXT,
                genre TEXT,
                source TEXT NOT NULL,
                line_count INTEGER,
                metadata_complete BOOLEAN,
                full_text TEXT,
                career_period TEXT,
                gutenberg_id INTEGER,
                poetrydb_id TEXT,
                title_normalized TEXT,
                author_normalized TEXT
            )
        """)

        # Lines table
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS lines (
                line_id INTEGER PRIMARY KEY AUTOINCREMENT,
                work_id TEXT NOT NULL,
                line_num INTEGER NOT NULL,
                line_text TEXT NOT NULL,
                is_blank BOOLEAN DEFAULT FALSE,
                meter TEXT,
                feet INTEGER,
                stresses TEXT,
                rhyme_scheme TEXT,
                FOREIGN KEY (work_id) REFERENCES works(work_id)
            )
        """)

        # Authors table
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS authors (
                author_id INTEGER PRIMARY KEY AUTOINCREMENT,
                name TEXT UNIQUE NOT NULL,
                name_normalized TEXT,
                birth_year INTEGER,
                death_year INTEGER,
                period TEXT,
                work_count INTEGER
            )
        """)

        # Metadata table
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS metadata (
                work_id TEXT NOT NULL,
                key TEXT NOT NULL,
                value TEXT,
                PRIMARY KEY (work_id, key),
                FOREIGN KEY (work_id) REFERENCES works(work_id)
            )
        """)

        # Create indexes
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_works_author ON works(author)")
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_works_period ON works(period)")
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_works_genre ON works(genre)")
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_works_source ON works(source)")
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_lines_work ON lines(work_id)")
        self.cursor.execute("CREATE INDEX IF NOT EXISTS idx_lines_text ON lines(line_text)")

        self.conn.commit()
        logging.info("✓ Schema created")

    def import_shakespeare(self, input_path: str):
        """Import Shakespeare corpus."""
        input_path = Path(input_path)
        if not input_path.exists():
            logging.warning(f"Shakespeare corpus not found at {input_path}")
            return

        logging.info(f"Importing Shakespeare from {input_path}...")

        works = []
        with open(input_path, 'r', encoding='utf-8') as f:
            for line in f:
                works.append(json.loads(line))

        generated_ids = 0
        for idx, work in enumerate(tqdm(works, desc="Shakespeare")):
            # Generate work_id if missing
            if 'work_id' not in work or not work['work_id']:
                work['work_id'] = self.generate_work_id('shakespeare', work['title'], 'William Shakespeare', idx)
                generated_ids += 1

            career_period = self._guess_shakespeare_career_period(work.get('date'))
            date = work.get('date')
            
            self.cursor.execute("""
                INSERT OR IGNORE INTO works (
                    work_id, title, author, first_appearance_date, publication_date,
                    composition_date, period, genre, source, line_count,
                    metadata_complete, full_text, career_period,
                    title_normalized, author_normalized
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                work['work_id'], work['title'], 'William Shakespeare',
                date, date, date, 'early_modern', work.get('genre', 'drama'),
                work.get('source', 'shakespeare'), work.get('line_count', 0), True, work.get('text'),
                career_period, self.normalize_text(work['title']),
                self.normalize_text('William Shakespeare')
            ))

            if 'lines' in work:
                for i, line_text in enumerate(work['lines'], 1):
                    is_blank = len(line_text.strip()) == 0
                    self.cursor.execute("""
                        INSERT INTO lines (work_id, line_num, line_text, is_blank)
                        VALUES (?, ?, ?, ?)
                    """, (work['work_id'], i, line_text, is_blank))

        self.conn.commit()
        if generated_ids > 0:
            logging.info(f"✓ Imported {len(works)} Shakespeare works ({generated_ids} work_ids generated)")
        else:
            logging.info(f"✓ Imported {len(works)} Shakespeare works")

    def _guess_shakespeare_career_period(self, year: Optional[int]) -> str:
        if not year:
            return 'unknown'
        if year < 1594:
            return 'early'
        elif year < 1601:
            return 'middle'
        elif year < 1608:
            return 'late'
        else:
            return 'final'

    def import_gutenberg(self, input_path: str):
        """Import Gutenberg reconstructed works."""
        input_path = Path(input_path)
        if not input_path.exists():
            logging.warning(f"Gutenberg not found at {input_path}")
            return

        logging.info(f"Importing Gutenberg from {input_path}...")

        works = []
        with open(input_path, 'r', encoding='utf-8') as f:
            for line in tqdm(f, desc="Reading Gutenberg"):
                works.append(json.loads(line))

        generated_ids = 0
        for idx, work in enumerate(tqdm(works, desc="Gutenberg")):
            # Generate work_id if missing
            if 'work_id' not in work or not work['work_id']:
                work['work_id'] = self.generate_work_id('gutenberg', work['title'], work.get('author', ''), idx)
                generated_ids += 1

            pub_date = work.get('publication_date')
            
            self.cursor.execute("""
                INSERT OR IGNORE INTO works (
                    work_id, title, author, first_appearance_date, publication_date,
                    composition_date, period, genre, source, line_count,
                    metadata_complete, full_text, gutenberg_id,
                    title_normalized, author_normalized
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                work['work_id'], work['title'], work.get('author'),
                pub_date, pub_date, work.get('composition_date'),
                work.get('period', 'unknown'), 'poetry', work.get('source', 'gutenberg'),
                work.get('line_count', 0), work.get('metadata_complete', False),
                work.get('text'), work.get('gutenberg_id'),
                self.normalize_text(work['title']),
                self.normalize_text(work.get('author', ''))
            ))

            for i, line_text in enumerate(work.get('lines', []), 1):
                is_blank = len(line_text.strip()) == 0
                self.cursor.execute("""
                    INSERT INTO lines (work_id, line_num, line_text, is_blank)
                    VALUES (?, ?, ?, ?)
                """, (work['work_id'], i, line_text, is_blank))

            if 'subjects' in work:
                for subject in work['subjects']:
                    self.cursor.execute("""
                        INSERT OR IGNORE INTO metadata (work_id, key, value)
                        VALUES (?, ?, ?)
                    """, (work['work_id'], 'subject', subject))

        self.conn.commit()
        if generated_ids > 0:
            logging.info(f"✓ Imported {len(works)} Gutenberg works ({generated_ids} work_ids generated)")
        else:
            logging.info(f"✓ Imported {len(works)} Gutenberg works")

    def import_core_poets(self, input_path: str):
        """Import Core 27 Poets corpus."""
        input_path = Path(input_path)
        if not input_path.exists():
            logging.warning(f"Core poets not found at {input_path}")
            return

        logging.info(f"Importing Core Poets from {input_path}...")

        works = []
        with open(input_path, 'r', encoding='utf-8') as f:
            for line in f:
                works.append(json.loads(line))

        generated_ids = 0
        for idx, work in enumerate(tqdm(works, desc="Core Poets")):
            # Generate work_id if missing
            if 'work_id' not in work or not work['work_id']:
                work['work_id'] = self.generate_work_id('core_poets', work['title'], work['author'], idx)
                generated_ids += 1

            pub_date = work.get('publication_date')
            
            self.cursor.execute("""
                INSERT OR IGNORE INTO works (
                    work_id, title, author, first_appearance_date, publication_date,
                    composition_date, period, genre, source, line_count,
                    metadata_complete, full_text, title_normalized, author_normalized
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                work['work_id'], work['title'], work['author'],
                pub_date, pub_date, work.get('composition_date'),
                work.get('period', 'unknown'), work.get('genre', 'poetry'),
                work.get('source', 'core_poets'), work.get('line_count', 0),
                work.get('metadata_complete', True), work.get('text'),
                self.normalize_text(work['title']),
                self.normalize_text(work['author'])
            ))

            if 'lines' in work:
                for i, line_text in enumerate(work['lines'], 1):
                    is_blank = len(line_text.strip()) == 0
                    self.cursor.execute("""
                        INSERT INTO lines (work_id, line_num, line_text, is_blank)
                        VALUES (?, ?, ?, ?)
                    """, (work['work_id'], i, line_text, is_blank))

        self.conn.commit()
        if generated_ids > 0:
            logging.info(f"✓ Imported {len(works)} Core Poets works ({generated_ids} work_ids generated)")
        else:
            logging.info(f"✓ Imported {len(works)} Core Poets works")

    def import_poetrydb(self, input_path: str):
        """Import PoetryDB corpus."""
        input_path = Path(input_path)
        if not input_path.exists():
            logging.warning(f"PoetryDB not found at {input_path}")
            return

        logging.info(f"Importing PoetryDB from {input_path}...")

        works = []
        with open(input_path, 'r', encoding='utf-8') as f:
            for line in f:
                works.append(json.loads(line))

        generated_ids = 0
        for idx, work in enumerate(tqdm(works, desc="PoetryDB")):
            # Generate work_id if missing
            if 'work_id' not in work or not work['work_id']:
                work['work_id'] = self.generate_work_id('poetrydb', work['title'], work['author'], idx)
                generated_ids += 1

            pub_date = work.get('publication_date')
            
            self.cursor.execute("""
                INSERT OR IGNORE INTO works (
                    work_id, title, author, first_appearance_date, publication_date,
                    period, genre, source, line_count, metadata_complete, full_text,
                    poetrydb_id, title_normalized, author_normalized
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                work['work_id'], work['title'], work['author'],
                pub_date, pub_date, work.get('period', 'unknown'),
                'poetry', work.get('source', 'poetrydb'), work.get('line_count', 0),
                True, work.get('text'), work.get('poetrydb_id'),
                self.normalize_text(work['title']),
                self.normalize_text(work['author'])
            ))

            if 'lines' in work:
                for i, line_text in enumerate(work['lines'], 1):
                    is_blank = len(line_text.strip()) == 0
                    self.cursor.execute("""
                        INSERT INTO lines (work_id, line_num, line_text, is_blank)
                        VALUES (?, ?, ?, ?)
                    """, (work['work_id'], i, line_text, is_blank))

        self.conn.commit()
        if generated_ids > 0:
            logging.info(f"✓ Imported {len(works)} PoetryDB works ({generated_ids} work_ids generated)")
        else:
            logging.info(f"✓ Imported {len(works)} PoetryDB works")

    def build_author_table(self):
        """Populate authors table from works."""
        logging.info("Building authors table...")

        self.cursor.execute("""
            INSERT OR IGNORE INTO authors (name, name_normalized, work_count)
            SELECT author, author_normalized, COUNT(*) as work_count
            FROM works
            WHERE author IS NOT NULL
            GROUP BY author, author_normalized
        """)

        self.conn.commit()
        author_count = self.cursor.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
        logging.info(f"✓ Built authors table with {author_count} authors")

    def print_summary(self):
        """Print database summary statistics."""
        logging.info("\n" + "="*60)
        logging.info("DATABASE SUMMARY")
        logging.info("="*60)

        total_works = self.cursor.execute("SELECT COUNT(*) FROM works").fetchone()[0]
        total_lines = self.cursor.execute("SELECT COUNT(*) FROM lines").fetchone()[0]
        total_authors = self.cursor.execute("SELECT COUNT(*) FROM authors").fetchone()[0]

        logging.info(f"\nTotal works: {total_works:,}")
        logging.info(f"Total lines: {total_lines:,}")
        logging.info(f"Total authors: {total_authors:,}")

        logging.info("\nBy source:")
        sources = self.cursor.execute("""
            SELECT source, COUNT(*), COALESCE(SUM(line_count), 0)
            FROM works
            GROUP BY source
            ORDER BY COUNT(*) DESC
        """).fetchall()

        for source, count, lines in sources:
            logging.info(f"  {source}: {count:,} works, {lines:,} lines")

        logging.info("\n" + "="*60)

    def build(self, sources: Dict[str, str]):
        """Build the unified database from all sources."""
        logging.info(f"Creating database at {self.db_path}...")
        
        # Delete existing database if it exists
        if self.db_path.exists():
            self.db_path.unlink()
            logging.info("✓ Deleted existing database")
        
        self.conn = sqlite3.connect(self.db_path)
        self.cursor = self.conn.cursor()

        try:
            self.create_schema()

            # Import in order: Shakespeare, Gutenberg, Core Poets, PoetryDB
            if 'shakespeare' in sources:
                self.import_shakespeare(sources['shakespeare'])

            if 'gutenberg' in sources:
                self.import_gutenberg(sources['gutenberg'])

            if 'core_poets' in sources:
                self.import_core_poets(sources['core_poets'])

            if 'poetrydb' in sources:
                self.import_poetrydb(sources['poetrydb'])

            self.build_author_table()
            self.print_summary()

            logging.info(f"\n✓ Database built successfully: {self.db_path}")

        finally:
            self.conn.close()

print("✓ UnifiedDatabaseBuilder class defined")

### Run Database Build

In [None]:
# Build the database
builder = UnifiedDatabaseBuilder(DB_PATH)

print("="*60)
print("BUILDING UNIFIED POETRY DATABASE")
print("="*60)
print("")

builder.build(SOURCES)

print("\n" + "="*60)
print("BUILD COMPLETE")
print("="*60)

## Step 5: Export Training Corpus

Export all non-blank lines from the database as a text file for BERT training.
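
One cheap check after an export like this is that the number of lines in the text file matches the number of non-blank rows in the database. A mismatch can occur, for example, if a stored `line_text` contains an embedded newline. The following is a minimal sketch on a throwaway in-memory table and a temporary file; `count_file_lines` is a hypothetical helper:

```python
import os
import sqlite3
import tempfile

def count_file_lines(path):
    """Count newline-terminated lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

# Tiny stand-in for the real lines table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (line_text TEXT NOT NULL, is_blank BOOLEAN)")
conn.executemany(
    "INSERT INTO lines VALUES (?, ?)",
    [("Because I could not stop for Death", 0), ("", 1), ("He kindly stopped for me", 0)],
)
expected = conn.execute("SELECT COUNT(*) FROM lines WHERE is_blank = 0").fetchone()[0]

with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as f:
        for (text,) in conn.execute("SELECT line_text FROM lines WHERE is_blank = 0"):
            f.write(text + "\n")
    written = count_file_lines(corpus_path)

print(expected, written, expected == written)
```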

In [None]:
import sqlite3
from tqdm import tqdm

CORPUS_OUTPUT = "/content/poetry_training_corpus.txt"

print("Exporting training corpus...")
print(f"Database: {DB_PATH}")
print(f"Output: {CORPUS_OUTPUT}")

# Connect to database
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Get total line count
total_lines = cursor.execute("SELECT COUNT(*) FROM lines WHERE is_blank = 0").fetchone()[0]
print(f"\nExporting {total_lines:,} non-blank lines...")

# Export lines
with open(CORPUS_OUTPUT, 'w', encoding='utf-8') as f:
    cursor.execute("SELECT line_text FROM lines WHERE is_blank = 0 ORDER BY line_id")
    
    batch_size = 10000
    pbar = tqdm(total=total_lines, desc="Exporting")
    
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        
        for (line_text,) in rows:
            f.write(line_text + '\n')
        
        pbar.update(len(rows))
    
    pbar.close()

conn.close()

# Check file size
import os
size_gb = os.path.getsize(CORPUS_OUTPUT) / (1024**3)
print(f"\n✓ Corpus exported: {size_gb:.2f} GB")
print(f"✓ Location: {CORPUS_OUTPUT}")

## Step 6: Configure BERT Training
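
The batch size and epoch count determine how many optimizer steps a run takes, which in turn determines how often checkpoints and log lines appear. The following is a rough sketch of that arithmetic, assuming a single GPU, no gradient accumulation, and the approximate 6.2M-line corpus size quoted in this notebook; `training_steps` is a hypothetical helper:

```python
import math

def training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run that keeps the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

corpus_lines = 6_200_000  # approximate corpus size quoted in this notebook
total = training_steps(corpus_lines, batch_size=8, epochs=3)
checkpoints = total // 1000  # one save every 1000 steps

print(f"{total:,} optimizer steps, {checkpoints:,} checkpoint saves")
```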

In [None]:
# Training configuration
BATCH_SIZE = 8
MAX_LENGTH = 512
NUM_EPOCHS = 3
LEARNING_RATE = 5e-5
SAVE_STEPS = 1000
LOGGING_STEPS = 100

print("Training Configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Max length: {MAX_LENGTH}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Save every: {SAVE_STEPS} steps")
print(f"  Log every: {LOGGING_STEPS} steps")

## Step 7: Load and Tokenize Corpus
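
Individual lines of verse are usually far shorter than `MAX_LENGTH`, so the padding strategy has a large effect on how many token slots each batch carries. The following is a rough sketch comparing two options: padding every example to a fixed length, and padding each batch only to its longest member. The token counts are invented for illustration and both functions are hypothetical:

```python
def padded_slots_fixed(token_counts, max_length):
    """Token slots used when every example is padded to max_length."""
    return len(token_counts) * max_length

def padded_slots_dynamic(token_counts, batch_size, max_length):
    """Token slots used when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(token_counts), batch_size):
        batch = token_counts[start:start + batch_size]
        longest = min(max(batch), max_length)  # truncation caps the length
        total += longest * len(batch)
    return total

# Hypothetical token counts for eight short lines of verse.
token_counts = [12, 9, 14, 11, 10, 13, 8, 12]

fixed = padded_slots_fixed(token_counts, max_length=512)
dynamic = padded_slots_dynamic(token_counts, batch_size=8, max_length=512)
print(fixed, dynamic)
```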

In [None]:
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling
from datasets import load_dataset
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load model and tokenizer
print("Loading BERT model...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
print(f"✓ Model loaded: {model.num_parameters():,} parameters")

# Load corpus
print(f"\nLoading corpus from {CORPUS_OUTPUT}...")
dataset = load_dataset('text', data_files={'train': CORPUS_OUTPUT}, split='train')
print(f"✓ Loaded {len(dataset):,} lines")

# Tokenize
print("\nTokenizing corpus...")
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=MAX_LENGTH,
        # No padding here: the MLM data collator pads each batch to its longest line
        return_special_tokens_mask=True
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text'],
    desc="Tokenizing"
)
print("✓ Tokenization complete")

## Step 8: Setup Training
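
Masked language modeling selects a fraction of tokens (`mlm_probability`) as prediction targets. Under the standard BERT recipe, 80% of the selected tokens are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. Unselected positions get the label `-100` so the loss ignores them. The following is a simplified pure-Python sketch of that rule, working on token strings rather than IDs and ignoring special tokens; `mask_tokens` and the toy vocabulary are hypothetical:

```python
import random

MASK, IGNORE = "[MASK]", -100

def mask_tokens(tokens, vocab, mlm_probability=0.15, rng=None):
    """Apply BERT-style 80/10/10 masking; returns (inputs, labels)."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token, exclude it from the loss.
            inputs.append(tok)
            labels.append(IGNORE)
            continue
        labels.append(tok)  # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(tok)                # 10%: leave unchanged
    return inputs, labels

tokens = "shall i compare thee to a summer day".split()
vocab = ["rose", "night", "sea", "stone"]  # toy vocabulary for illustration

# mlm_probability is raised here so that a short line shows some masking.
inputs, labels = mask_tokens(tokens, vocab, mlm_probability=0.5, rng=random.Random(0))
print(inputs)
print(labels)
```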

In [None]:
from transformers import Trainer, TrainingArguments

# Data collator for MLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./poetry_bert_checkpoints",
    overwrite_output_dir=True,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # Mixed precision (requires a CUDA GPU)
    logging_dir='./logs',
    report_to='none',
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

print("✓ Training setup complete")
print(f"Total training steps: {len(tokenized_dataset) // BATCH_SIZE * NUM_EPOCHS:,}")

## Step 9: Start Training (6-8 hours)

⚠️ **This will take 6-8 hours on Colab GPU**

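Keep the tab open and connected while training runs. On a standard Colab runtime, closing it can get the session disconnected and stop training, unless your Colab plan includes background execution.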
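
A run of this length may be interrupted. The training arguments save a checkpoint every `SAVE_STEPS` steps under `./poetry_bert_checkpoints`, in directories named `checkpoint-<step>`, so an interrupted run can resume from the most recent one instead of restarting. The following is a minimal sketch of locating that directory; `latest_checkpoint` is a hypothetical helper, and the demo uses empty temporary directories in place of real checkpoints:

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo with empty stand-in directories (no real checkpoints involved).
with tempfile.TemporaryDirectory() as tmp:
    for step in (1000, 2000, 10000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(tmp))

print(found)

# In the notebook, the result could then be passed to the Trainer:
# trainer.train(resume_from_checkpoint=latest_checkpoint("./poetry_bert_checkpoints"))
```

Steps are compared numerically, not as strings, so `checkpoint-10000` ranks above `checkpoint-2000`. Checkpoints written to local Colab storage are lost if the runtime is reclaimed, so resuming across sessions would require pointing `output_dir` at persistent storage such as Google Drive.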

In [None]:
import time

print("="*60)
print("STARTING POETRY BERT TRAINING")
print("="*60)
print("Estimated time: 6-8 hours on Colab GPU")
print("Keep this tab open unless your Colab plan supports background execution")
print("="*60)

start_time = time.time()

# Train
trainer.train()

total_time = time.time() - start_time
print(f"\n✓ Training complete! Total time: {total_time/3600:.2f} hours")

## Step 10: Save Final Model

In [None]:
# Save to local Colab storage
output_dir = "./poetry_bert_trained"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✓ Model saved to {output_dir}")

# Also save to Google Drive
drive_output = "/content/drive/MyDrive/poetry_bert_trained"
!cp -r {output_dir} {drive_output}
print(f"✓ Model also saved to Google Drive: {drive_output}")

# Save the database too
!cp {DB_PATH} /content/drive/MyDrive/poetry_unified.db
print(f"✓ Database saved to Google Drive")

## Step 11: Test the Model

In [None]:
from transformers import pipeline

# Build a fill-mask pipeline from the trained model still in memory
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Test on Shakespeare
test_text = "Shall I compare thee to a [MASK] day?"
print(f"Test: '{test_text}'\n")

predictions = fill_mask(test_text)
for i, pred in enumerate(predictions[:5], 1):
    print(f"{i}. {pred['token_str']}: {pred['score']:.4f}")

print("\n---\n")

# Test on Modernist-style line
test_text2 = "The [MASK] hangs in fragments over the broken town"
print(f"Test: '{test_text2}'\n")

predictions2 = fill_mask(test_text2)
for i, pred in enumerate(predictions2[:5], 1):
    print(f"{i}. {pred['token_str']}: {pred['score']:.4f}")

## Step 12: Download Files (Optional)

Download the trained model and database to your local machine.

In [None]:
# Zip the model for easier download
!zip -r poetry_bert_trained.zip poetry_bert_trained/

# Download via Colab files panel (left sidebar)
from google.colab import files

print("To download:")
print("  1. Model: poetry_bert_trained.zip (~500MB)")
print("  2. Database: poetry_unified.db (~3-4GB)")
print("\nUse the Files panel on the left to download, or run:")
print("  files.download('poetry_bert_trained.zip')")
print("  files.download('poetry_unified.db')")

## Summary

- ✅ Built unified poetry database (4,444 works, 6.2M lines)
- ✅ Exported training corpus
- ✅ Trained Poetry BERT
- ✅ Saved model to Google Drive

**Next steps:**
1. Download model from Google Drive
2. Use for semantic trajectory analysis
3. Extract embeddings for specific poems
4. Compare with EEBO-BERT for diachronic analysis