<a href="https://colab.research.google.com/github/Text-Machine/data-processing-code/blob/main/colab_training_gpt2_sentences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 Pretraining on Historical Texts (Google Colab)

This notebook continues pretraining the released GPT-2 checkpoint on historical text data (EEBO, ECCO, EVAN) using Google Colab's free GPU.

**Key difference from chunk-based approach:** Training examples are assembled from whole sentences rather than fixed-size token chunks, so they break at natural language boundaries.

## Setup Steps:
1. **Enable GPU**: Runtime → Change runtime type → GPU (T4 recommended)
2. **Upload Data**: Put your CSV files in Google Drive, or upload them directly to the Colab session
3. **Run All Cells**: Runtime → Run all

## What this does:
- Installs required packages
- Loads CSV data with columns: `author`, `place`, `date`, `page_text`
- Splits text into sentences
- Creates training sequences with `<date> [TIME]` prefix followed by one or more sentences
- Trains GPT-2 with causal language modeling (next-token prediction)
- Saves checkpoints to a local output directory, pushes the model to the Hugging Face Hub, and optionally downloads it as a zip file
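
The sequence-building steps above can be sketched in plain Python. This is a simplified illustration rather than the notebook's actual preprocessing: it counts whitespace-separated words instead of GPT-2 tokens, and both the regex sentence splitter and the `pack_sentences` helper are stand-ins introduced here for illustration.

```python
import re

def split_sentences(text):
    """Naive stand-in for a real sentence tokenizer: split after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pack_sentences(date, text, max_words=12):
    """Greedily group sentences under a '<date> [TIME]' prefix.

    Lengths are counted in whitespace-separated words (an assumption made
    to keep the sketch dependency-free); the notebook counts GPT-2 tokens.
    """
    prefix = [str(date), "[TIME]"]
    examples, current = [], list(prefix)
    for sentence in split_sentences(text):
        words = sentence.split()
        # Start a new example when the next sentence would not fit
        if len(current) + len(words) > max_words and len(current) > len(prefix):
            examples.append(" ".join(current))
            current = list(prefix)
        current.extend(words)
    # Keep the final, partially filled example
    if len(current) > len(prefix):
        examples.append(" ".join(current))
    return examples

page = "The king rode north. The army followed him. Winter came early that year."
for example in pack_sentences(1650, page, max_words=10):
    print(example)
```

Every example repeats the date prefix, so each one carries its temporal context even when a page is split into several examples. Unlike the notebook's function, this sketch does not truncate a single sentence that is longer than the limit.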

## Why sentences instead of chunks?
- **Natural boundaries**: Sentences respect linguistic structure
- **Better context**: Models learn from semantically complete units
- **Improved quality**: Fewer artificial breaks in the middle of clauses
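
For contrast, a toy version of fixed-size chunking shows where the artificial breaks come from. The `fixed_chunks` helper is hypothetical and counts words rather than tokens; it is not the code of the chunk-based notebook, only an illustration of the idea.

```python
def fixed_chunks(words, size):
    """Chunk-based split: cut every `size` words regardless of punctuation."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = "The king rode north. The army followed him. Winter came early that year."
chunks = fixed_chunks(text.split(), size=5)
for chunk in chunks:
    print(repr(chunk))

# Count chunks that stop before a sentence-final period
broken = [c for c in chunks if not c.endswith(".")]
print(f"{len(broken)} of {len(chunks)} chunks end mid-sentence")
```

Grouping whole sentences avoids these cuts, at the cost of examples that vary in length.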

In [None]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ No GPU detected. Go to Runtime → Change runtime type → GPU")

In [None]:
# Install required packages
!pip install -q transformers datasets pandas accelerate nltk

## Option 1: Download Data from Google Drive

The cell below downloads the data from Google Drive with `gdown`. To use your own files instead, uncomment the Drive-mount lines and replace the file ID with your own.
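
Once the files are in the working directory, a quick header check can catch a mismatched schema before tokenization starts. The sketch below assumes the four column names listed earlier in this notebook; the `missing_columns` helper and the demo file are illustrative and not part of the notebook's pipeline.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = {"author", "place", "date", "page_text"}

def missing_columns(csv_path):
    """Return the required columns absent from the CSV header (empty set if none)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}

# Demo on a throwaway file; replace `demo` with the path to your own CSV
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_text("author,place,date\nAnon,London,1650\n", encoding="utf-8")
    print(missing_columns(demo))
```

Only the header row is read, so the check stays fast even for large files.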

In [None]:
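# Optional: mount Google Drive to read your own CSV files
# from google.colab import drive
# drive.mount('/content/drive')

# Optional: download and unpack your own archive by Drive file ID
# !gdown YOUR_FILE_ID_HERE
# !unzip data.zip

# Download the file with this Drive file ID
!gdown 11wfdV7j1TBv_i9XOiT8G8V4NxnJTxezz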

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from pathlib import Path
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset
import logging
import re

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("✓ Libraries imported successfully")

In [None]:
# Download NLTK Punkt data for sentence splitting
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load Punkt from this resource
from nltk.tokenize import sent_tokenize

print("✓ NLTK punkt downloaded")

In [None]:
# Define preprocessing functions

def load_csv_as_dataset(csv_paths):
    """Load CSV files and convert to Hugging Face Dataset."""
    all_data = []
    
    for csv_path in csv_paths:
        logger.info(f"Loading {Path(csv_path).name}...")
        df = pd.read_csv(csv_path)
        logger.info(f"  Rows: {len(df)}, Columns: {list(df.columns)}")
        all_data.append(df)
    
    combined_df = pd.concat(all_data, ignore_index=True)
    logger.info(f"Total rows: {len(combined_df)}")
    
    dataset = Dataset.from_pandas(combined_df)
    return dataset


def tokenize_and_sentence_function(examples, tokenizer, max_length=512):
    """
    Tokenize text and create training examples based on sentences.
    
    Format: <date> [TIME] <sentence1> <sentence2> ...
    
    Sentences are combined until we reach near max_length.
    Each example has the date prefix for temporal context.
    """
    input_ids_list = []
    attention_masks_list = []

    time_id = tokenizer.convert_tokens_to_ids("[TIME]")

    for date, text in zip(examples["date"], examples["page_text"]):

        if not text or pd.isna(text) or pd.isna(date):
            continue

        date_str = str(date).strip()
        text_str = str(text).strip()

        # Create prefix: "<date> [TIME] "
        prefix = f"{date_str} [TIME] "
        prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
        reserved_tokens = len(prefix_ids)  # Space reserved for prefix

        # Split text into sentences
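        try:
            sentences = sent_tokenize(text_str)
        except Exception:
            # Fallback: split on periods if sent_tokenize fails (e.g. missing Punkt data).
            # Catching Exception rather than using a bare except lets KeyboardInterrupt through.
            sentences = [s.strip() for s in text_str.split('.') if s.strip()]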

        if not sentences:
            continue

        # Group sentences to create training examples
        space_id = tokenizer.encode(" ", add_special_tokens=False)[0]  # separator between sentences
        current_ids = prefix_ids.copy()

        for sentence in sentences:
            sentence_ids = tokenizer.encode(sentence, add_special_tokens=False)

            # If adding this sentence would exceed max_length, save the
            # current example and start a new one from the prefix
            if len(current_ids) + len(sentence_ids) + 1 > max_length:  # +1 for separator
                if len(current_ids) > reserved_tokens:  # Only save if we have sentences
                    input_ids_list.append(current_ids[:max_length])
                    attention_masks_list.append([1] * len(input_ids_list[-1]))
                current_ids = prefix_ids.copy()

            # Add the sentence and a trailing separator to the current example
            current_ids.extend(sentence_ids)
            current_ids.append(space_id)

        # Save last example (truncated if a single sentence exceeds max_length)
        if len(current_ids) > reserved_tokens:
            input_ids_list.append(current_ids[:max_length])
            attention_masks_list.append([1] * len(input_ids_list[-1]))

    return {
        "input_ids": input_ids_list,
        "attention_mask": attention_masks_list,
    }


print("✓ Preprocessing functions defined")

In [None]:
# Configuration
BATCH_SIZE = 8      # Smaller due to GPT-2's memory requirements
EPOCHS = 3
LEARNING_RATE = 5e-5
MAX_SAMPLES = None  # Set to e.g., 10000 for quick testing
MAX_LENGTH = 512    # Max tokens per training example

print("Training Configuration:")
print(f"  Max length: {MAX_LENGTH} tokens")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {EPOCHS}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Max samples: {MAX_SAMPLES or 'All'}")

In [None]:
# Load data
DATA_DIR = '.'
csv_files = list(Path(DATA_DIR).glob('*.csv'))

if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {DATA_DIR}. Please upload data first.")

print(f"Found {len(csv_files)} CSV file(s):")
for f in csv_files:
    size_mb = f.stat().st_size / (1024 * 1024)
    print(f"  - {f.name} ({size_mb:.1f} MB)")

dataset = load_csv_as_dataset(csv_files)
print(f"\nDataset loaded: {len(dataset)} rows")

# Limit samples if specified
if MAX_SAMPLES and MAX_SAMPLES < len(dataset):
    dataset = dataset.select(range(MAX_SAMPLES))
    print(f"Limited to {MAX_SAMPLES} samples for testing")

In [None]:
# Load tokenizer and model
print("Loading GPT-2 tokenizer and model...")
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

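# Add [TIME] as a new vocabulary token so it is encoded as a single ID
# (add_tokens registers a regular added token, not a special token)
if "[TIME]" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["[TIME]"])
    print("Added [TIME] token to vocabulary")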

# Set padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
print(f"Set pad_token to: {tokenizer.pad_token}")

print(f"Vocabulary size: {len(tokenizer)}")

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))
print(f"Model loaded: {model.num_parameters():,} parameters")

In [None]:
# Preprocess data (tokenize by sentences)
print("Tokenizing and creating sentence-based training examples (this may take several minutes)...")

tokenized_dataset = dataset.map(
    lambda examples: tokenize_and_sentence_function(
        examples,
        tokenizer,
        max_length=MAX_LENGTH
    ),
    batched=True,
    batch_size=50,
    remove_columns=dataset.column_names,
    num_proc=1,
    desc="Tokenizing by sentences"
)

print(f"Tokenized dataset size: {len(tokenized_dataset)} samples")

# Show sample
if len(tokenized_dataset) > 0:
    sample = tokenized_dataset[0]
    print(f"\nSample input (first 100 tokens):")
    print(tokenizer.decode(sample['input_ids'][:100]))

In [None]:
# Split train/validation
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset['train']
val_dataset = split_dataset['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

In [None]:
# Setup training
OUTPUT_DIR = 'gpt2_ecco_pretrained'
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_steps=500,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to="none",
)

# Data collator for causal language modeling (NOT masked LM)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # False = causal LM (GPT-2), True = masked LM (BERT)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("✓ Trainer configured")
print(f"\nStarting training with {EPOCHS} epochs...")

In [None]:
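# Train the model, then save the final weights and tokenizer to OUTPUT_DIR
trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
# push_to_hub() needs Hugging Face credentials (e.g. huggingface_hub.notebook_login())
trainer.push_to_hub()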

In [None]:
# Evaluate on validation set
print("Evaluating on validation set...")
eval_results = trainer.evaluate()

print("\nValidation Results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

## Test the Trained Model

Let's test the model with text generation.
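
Prompts should follow the same `<date> [TIME] <text>` layout as the training examples. A minimal sketch of two small helpers for this is shown here; `make_prompt` and `strip_prefix` are hypothetical names introduced for illustration and are not used elsewhere in the notebook.

```python
TIME_TOKEN = "[TIME]"

def make_prompt(year, opening):
    """Format a prompt with the same prefix layout as the training examples."""
    return f"{str(year).strip()} {TIME_TOKEN} {opening}"

def strip_prefix(generated):
    """Split generated text into (date, continuation) at the first [TIME] marker."""
    date, sep, rest = generated.partition(TIME_TOKEN)
    if not sep:
        # No marker found: return the text unchanged with no date
        return None, generated.strip()
    return date.strip(), rest.strip()

prompts = [make_prompt(y, o) for y, o in [(1650, "The king"), (1800, "The parliament")]]
print(prompts)
print(strip_prefix("1650 [TIME] The king did ride"))
```

Separating the date from the continuation makes it easier to compare outputs generated for the same opening under different years.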

In [None]:
# Test text generation
from transformers import pipeline

# Create text generation pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Test prompts with historical context
test_prompts = [
    "1650 [TIME] The king",
    "1800 [TIME] The parliament",
    "1700 [TIME] In London",
]

print("Testing text generation:\n")
for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    outputs = generator(
        prompt,
        max_length=60,
        num_return_sequences=2,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    for i, output in enumerate(outputs, 1):
        generated_text = output['generated_text']
        print(f"  {i}. {generated_text}")
    print()

## Download Model (Optional)

If you want to download the trained model to your local machine, run the cell below.

In [None]:
# Zip and download model
import shutil
from google.colab import files

# Create zip file
zip_path = '/content/gpt2_pretrained'
shutil.make_archive(zip_path, 'zip', OUTPUT_DIR)

print(f"Model zipped. Size: {Path(f'{zip_path}.zip').stat().st_size / 1e6:.1f} MB")
print("Downloading...")

# Download
files.download(f'{zip_path}.zip')