# Penn Treebank Language Modeling

## Overview

This notebook trains a language model on the Penn Treebank (PTB) dataset with GPU acceleration in Google Colab.

**Features:**
- Automatic environment setup and dependency installation
- **Runtime data extraction** to minimize Google Drive upload time
- GPU verification and optimization
- Two training configurations: quick test and full training
- Automatic model checkpointing to Google Drive

**Requirements:**
- Google Colab Pro (recommended for longer training sessions)
- Project folder uploaded to Google Drive (containing only the compressed archive, not the extracted data)
- GPU runtime enabled in Colab

In [None]:
# Mount Google Drive
from google.colab import drive
import os

drive.mount('/content/drive')

# Verify mount was successful
if os.path.exists('/content/drive/MyDrive'):
    print("✅ Google Drive mounted successfully!")
    print(f"Drive contents: {os.listdir('/content/drive/MyDrive')[:5]}...")  # Show first 5 items
else:
    print("❌ Failed to mount Google Drive. Please try again.")

## 🔧 Setup Project Directory

**Instructions:**
1. Locate your `Proj-2-Penn Treebank (PTB)` folder in Google Drive
2. Update the path in the next cell
3. **Important:** Remove the extracted `LDC99T42/` folder from your project before uploading. This shortens the upload, and the notebook extracts the data at runtime.

**Common paths:**
- If uploaded to root: `/content/drive/MyDrive/Proj-2-Penn Treebank (PTB)`
- If in a subfolder: `/content/drive/MyDrive/YourFolder/Proj-2-Penn Treebank (PTB)`
- If in Colab Notebooks folder: `/content/drive/MyDrive/Colab Notebooks/Proj-2-Penn Treebank (PTB)`
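
If you are unsure which of these locations applies, a small helper can try them in order. The following is a minimal sketch: `find_project_dir` and the candidate list are illustrative names, not part of the project code.

```python
from pathlib import Path
from typing import Iterable, Optional

PROJECT_NAME = "Proj-2-Penn Treebank (PTB)"


def find_project_dir(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that exists and is a directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


# Candidate locations from the list above; add your own subfolder if needed.
drive_root = Path("/content/drive/MyDrive")
candidates = [
    drive_root / PROJECT_NAME,
    drive_root / "Colab Notebooks" / PROJECT_NAME,
]

project_dir = find_project_dir(str(c) for c in candidates)
print(project_dir if project_dir else "Project folder not found; set the path manually.")
```

Checking a short list of known locations is much faster than walking the whole Drive, which can take a long time on large accounts.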

In [None]:
import os
import sys

project_path = '/content/drive/MyDrive/Proj-2-Penn Treebank (PTB)'

# Verify the project directory exists
if not os.path.exists(project_path):
    print(f"❌ Project directory not found: {project_path}")
    print("\n🔍 Searching for the project folder...")

    # Search for the project folder
    drive_root = '/content/drive/MyDrive'
    for root, dirs, files in os.walk(drive_root):
        if 'Proj-2-Penn Treebank (PTB)' in dirs:
            suggested_path = os.path.join(root, 'Proj-2-Penn Treebank (PTB)')
            print(f"Found project at: {suggested_path}")
            break

    print("\nPlease update the 'project_path' variable above with the correct path.")
    sys.exit(1)

# Change to project directory
os.chdir(project_path)
print(f"✅ Current working directory: {os.getcwd()}")

# Verify project structure - now only requiring compressed archive
required_items = ['src', 'config', 'data', 'scripts', 'requirements.txt']
required_data_files = ['data/ptb/LDC99T42_Penn_Treebank_3.tar.zst']

missing_items = [item for item in required_items if not os.path.exists(item)]
missing_data = [item for item in required_data_files if not os.path.exists(item)]

if missing_items:
    print(f"⚠️ Missing required files or directories: {missing_items}")
if missing_data:
    print(f"⚠️ Missing required data files: {missing_data}")
    print("   Make sure LDC99T42_Penn_Treebank_3.tar.zst is in the data/ptb/ directory")

if not missing_items and not missing_data:
    print("✅ Project structure verified!")
    print(f"Project contents: {os.listdir('.')[:10]}...")  # Show first 10 items

    # Check if data is already extracted
    if os.path.exists('data/ptb/LDC99T42'):
        print("📁 Data already extracted")
    else:
        print("📦 Data needs to be extracted (will be done in next step)")

## 📦 Extract Penn Treebank Data

This step extracts the compressed Penn Treebank archive at runtime, so only the archive, not the extracted corpus, has to be uploaded to Google Drive. This shortens the upload without changing what the notebook can do.

**Process:**
1. Decompress and extract `LDC99T42_Penn_Treebank_3.tar.zst` into `data/ptb/`
2. Run `scripts/preprocess_ptb.py` on the raw files to create the train/valid/test splits
3. Delete the extracted `LDC99T42/` folder and keep only the processed text files (much smaller)

**Benefits:**
- ⚡ Faster Google Drive uploads (no large extracted data)
- 💾 Efficient storage usage
- 🔄 Extraction is skipped when the processed `ptb.*.txt` files already exist
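
Extracting into Colab's local disk rather than the Drive-mounted project folder keeps the large intermediate files off Google Drive. Below is a minimal sketch of that idea. It assumes the `zstd` command-line tool is on the PATH, and `extract_to_scratch` and the scratch location are hypothetical names, not part of the project.

```python
import shutil
import subprocess
import tarfile
import tempfile
from pathlib import Path


def extract_to_scratch(archive: Path, scratch_root: Path) -> Path:
    """Unpack a .tar or .tar.zst archive into a fresh directory under scratch_root."""
    scratch_root.mkdir(parents=True, exist_ok=True)
    workdir = Path(tempfile.mkdtemp(prefix="ptb_", dir=scratch_root))

    tar_path = archive
    if archive.suffix == ".zst":
        if shutil.which("zstd") is None:
            raise RuntimeError("zstd is not installed (e.g. apt-get install zstd)")
        # Write the intermediate .tar into the scratch directory, not next to the archive.
        tar_path = workdir / archive.with_suffix("").name
        subprocess.run(
            ["zstd", "-d", "-q", str(archive), "-o", str(tar_path)], check=True
        )

    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(path=workdir)

    if tar_path != archive:
        tar_path.unlink()  # drop the intermediate .tar
    return workdir
```

In this notebook, a call such as `extract_to_scratch(archive_path, Path('/content/ptb_scratch'))` would return a directory whose `LDC99T42` subfolder could be passed as `--ptb_root`, while `--output_dir` still points at `data/ptb` on Drive.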

In [None]:
import tarfile
import subprocess
import sys
import time
import shutil
from pathlib import Path

# Check if data extraction is needed
data_dir = Path('data/ptb')
archive_path = data_dir / 'LDC99T42_Penn_Treebank_3.tar.zst'
extracted_dir = data_dir / 'LDC99T42'
processed_files = ['ptb.train.txt', 'ptb.valid.txt', 'ptb.test.txt']

# Check if processed files already exist
all_processed_exist = all((data_dir / f).exists() for f in processed_files)

if all_processed_exist:
    print("✅ Processed Penn Treebank data already exists")
    for f in processed_files:
        file_path = data_dir / f
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"   📄 {f}: {size_mb:.1f} MB")
else:
    print("🚀 Extracting and processing Penn Treebank data...")
    start_time = time.time()

    # Check if archive exists
    if not archive_path.exists():
        print(f"❌ Archive not found: {archive_path}")
        print("Please ensure LDC99T42_Penn_Treebank_3.tar.zst is in the data/ptb/ directory")
        sys.exit(1)

    print(f"📦 Found archive: {archive_path.name} ({archive_path.stat().st_size / (1024*1024):.1f} MB)")

    # Install zstd if not available (for .zst decompression)
    try:
        subprocess.run(['zstd', '--version'], capture_output=True, check=True)
        print("✅ zstd already installed")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("📥 Installing zstd for archive extraction...")
        !apt-get update -qq && apt-get install -y zstd

    # Extract the archive
    print("🔄 Extracting archive...")
    try:
        # First decompress .zst to .tar (-f overwrites a leftover .tar from an interrupted run)
        tar_path = archive_path.with_suffix('')  # Remove .zst extension
        subprocess.run(['zstd', '-d', '-f', str(archive_path), '-o', str(tar_path)], check=True)

        # Then extract the tar file
        with tarfile.open(tar_path, 'r') as tar:
            tar.extractall(path=data_dir)

        # Clean up the intermediate .tar file
        tar_path.unlink()

        print(f"✅ Archive extracted to: {extracted_dir}")

    except Exception as e:
        print(f"❌ Extraction failed: {e}")
        sys.exit(1)

    # Process the extracted data using the preprocessing script
    print("🔄 Processing raw data into train/valid/test splits...")
    try:
        subprocess.run([
            sys.executable, 'scripts/preprocess_ptb.py',  # use the notebook's own interpreter
            '--ptb_root', str(extracted_dir),
            '--output_dir', str(data_dir),
            '--verify'
        ], check=True)

        print("✅ Data processing completed")

    except Exception as e:
        print(f"❌ Data processing failed: {e}")
        sys.exit(1)

    # Clean up extracted directory to save space (keep only processed files)
    if extracted_dir.exists():
        print("🧹 Cleaning up extracted files to save space...")
        shutil.rmtree(extracted_dir)
        print("✅ Cleanup completed")

    end_time = time.time()
    print(f"\n⏱️ Data extraction and processing completed in {end_time - start_time:.1f} seconds")

    # Show final processed files
    print("\n📊 Final processed data files:")
    for f in processed_files:
        file_path = data_dir / f
        if file_path.exists():
            size_mb = file_path.stat().st_size / (1024 * 1024)
            print(f"   📄 {f}: {size_mb:.1f} MB")
        else:
            print(f"   ❌ {f}: Missing")

print("\n🎉 Penn Treebank data is ready for training!")

## 🚀 Environment Setup

### GPU Verification
First, let's verify that a GPU is available and properly configured:

In [None]:
import torch
print("GPU Available: ", torch.cuda.is_available())
print("GPU: ", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")

In [None]:
import torch
import subprocess

# Check GPU availability
print("🔍 GPU Information:")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("❌ No GPU detected. Make sure to enable GPU runtime in Colab!")
    print("Go to Runtime > Change runtime type > Hardware accelerator > GPU")

# Check nvidia-smi for additional info
try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    print("\n📊 GPU Status:")
    print(result.stdout)
except FileNotFoundError:
    print("Could not run nvidia-smi")

### Install Dependencies

In [None]:
# Install project dependencies
print("📦 Installing dependencies from requirements.txt...")
!pip install -q -r requirements.txt

# Verify key installations
import torch
import numpy as np
import yaml
from tqdm import tqdm

print("\n✅ Key packages verified:")
print(f"PyTorch: {torch.__version__}")
print(f"NumPy: {np.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 🧪 Quick Test Run

This short run validates your environment before you start the full training.
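
The training commands in this notebook pass `--device cuda` unconditionally. If you want the quick test to also work on a CPU-only runtime, the device can be chosen at runtime. This is only a sketch: `pick_device` and `build_train_command` are hypothetical helpers, and it assumes `src/train.py` accepts `--device cpu`, which the notebook does not confirm.

```python
import sys
from typing import List


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet (e.g. before the dependency cell has run).
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def build_train_command(config_path: str, device: str) -> List[str]:
    """Assemble the argument list for src/train.py using the current interpreter."""
    return [sys.executable, "src/train.py", "--config", config_path, "--device", device]


command = build_train_command("config/config_colab_quick.yaml", pick_device())
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` in place of the `!python ...` line.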

In [None]:
import time
import os

# Pre-run validation
print("🔍 Pre-run validation:")
required_files = [
    'src/train.py',
    'config/config_colab_quick.yaml',
    'data/ptb/ptb.train.txt'
]

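missing = []
for file in required_files:
    if os.path.exists(file):
        print(f"✅ {file}")
    else:
        print(f"❌ {file} - Missing!")
        missing.append(file)

# Stop here rather than launching a run that is bound to fail
if missing:
    raise SystemExit(f"Missing required files: {missing}")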

print("\n🚀 Starting quick test training...")
start_time = time.time()

# Run training with the quick-test configuration
!python src/train.py --config config/config_colab_quick.yaml --device cuda

end_time = time.time()
print(f"\n⏱️ Quick test completed in {end_time - start_time:.1f} seconds")
print("📁 Check the 'checkpoints' directory for saved models")
print("📊 TensorBoard logs saved in 'runs' directory")

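## 🎯 Full Training Run

Once the quick test succeeds, run the full training configuration.

In [None]:
import time

print("\n🚀 Starting full training run...")
start_time = time.time()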

# Run training with the full configuration
!python src/train.py --config config/config_colab_full.yaml --device cuda

end_time = time.time()
print(f"\n⏱️ Full training completed in {end_time - start_time:.1f} seconds")
print("📁 Check the 'checkpoints' directory for saved models")
print("📊 TensorBoard logs saved in 'runs' directory")

### Model Evaluation
Evaluate your trained model on the test set:

In [None]:
# Evaluate the most recent checkpoint
import glob
import os

# Find the latest checkpoint by modification time
checkpoint_files = glob.glob('checkpoints_full/*.pth')
if checkpoint_files:
    latest_checkpoint = max(checkpoint_files, key=os.path.getmtime)
    print(f"📁 Latest checkpoint: {latest_checkpoint}")

    # Run evaluation
    print("\n🧮 Evaluating model on test set...")
    !python src/evaluate.py --checkpoint "{latest_checkpoint}" --config config/config_colab_full.yaml --device cuda
else:
    print("❌ No checkpoints found. Make sure training completed successfully.")