# Download Gutenberg Poetry Corpus to Google Drive

**Corpus:** 3 million lines of English poetry from Project Gutenberg

**Source:** Hugging Face - biglam/gutenberg-poetry-corpus

**Size:** ~500MB compressed

**Output:** Saved to Google Drive as `MyDrive/gutenberg_poetry_corpus.jsonl.gz` (in Colab: `/content/drive/MyDrive/gutenberg_poetry_corpus.jsonl.gz`)

---

## Instructions

1. Upload this notebook to Google Colab
2. Select **Runtime → Change runtime type** and set the hardware accelerator to **None** (no GPU is needed; CPU is fine for downloading)
3. Run cells in order
4. Download will take ~10-15 minutes

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted successfully")
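 
If the mount silently fails or the authorization prompt is dismissed, the later save step only fails after the full download. A minimal sketch of a pre-flight check follows. The helper name `drive_is_ready` is hypothetical, and the default path assumes the standard Colab mount point used in this notebook.

```python
import os


def drive_is_ready(root="/content/drive/MyDrive"):
    """Return True if `root` exists as a directory and is writable.

    `drive_is_ready` is an illustrative helper, not part of google.colab.
    """
    return os.path.isdir(root) and os.access(root, os.W_OK)
```

In the notebook this could be called as `drive_is_ready()` right after `drive.mount(...)`, stopping early if it returns `False`.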

## Step 2: Install Dependencies

In [None]:
!pip install -q datasets

print("✓ Dependencies installed")

## Step 3: Download Gutenberg Poetry Corpus

This will download the corpus from Hugging Face and save it to your Google Drive.

In [None]:
from datasets import load_dataset
import json
import gzip
from tqdm import tqdm

# Load the dataset
print("Downloading Gutenberg Poetry Corpus from Hugging Face...")
print("This will take ~10-15 minutes\n")

dataset = load_dataset("biglam/gutenberg-poetry-corpus")

print(f"✓ Dataset loaded: {len(dataset['train']):,} lines of poetry")
print(f"\nSample line: {dataset['train'][0]}\n")

## Step 4: Save to Google Drive (Compressed)

In [None]:
# Output path
output_path = "/content/drive/MyDrive/gutenberg_poetry_corpus.jsonl.gz"

print(f"Saving to Google Drive: {output_path}")
print("This may take 5-10 minutes...\n")

# Write compressed JSONL
with gzip.open(output_path, 'wt', encoding='utf-8') as f:
    for item in tqdm(dataset['train'], desc="Writing lines"):
        f.write(json.dumps(item) + '\n')

print("\n✓ Save complete!")
print(f"✓ Saved to: {output_path}")
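 
Writing millions of small lines straight into the mounted Drive folder can be slow, and an interrupted run leaves a truncated `.gz` file at the final path. One possible alternative, sketched here under the assumption that the Colab VM has enough local disk space, is to write to a local temporary file first and move it to Drive only once it is complete. The function name `write_jsonl_gz` is hypothetical.

```python
import gzip
import json
import os
import shutil
import tempfile


def write_jsonl_gz(records, final_path):
    """Write records as gzip-compressed JSONL, then move the file into place.

    Returns the number of records written.
    """
    # Stage the output on local disk (assumption: faster than the Drive mount).
    fd, tmp_path = tempfile.mkstemp(suffix=".jsonl.gz")
    os.close(fd)
    count = 0
    try:
        with gzip.open(tmp_path, "wt", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
        # Only a fully written file ever appears at the final path.
        shutil.move(tmp_path, final_path)
    finally:
        # Clean up the staging file if the write or the move failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return count
```

With the variables from this notebook, the call would be `write_jsonl_gz(dataset['train'], output_path)`.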

## Step 5: Verify File Size and Sample Data

In [None]:
import os

# Check file size
file_size_mb = os.path.getsize(output_path) / (1024 * 1024)
print(f"File size: {file_size_mb:.2f} MB")

# Read first 5 lines to verify
print("\nFirst 5 lines:")
print("=" * 50)

with gzip.open(output_path, 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        data = json.loads(line)
        print(f"{i+1}. {data}")

print("\n✓ Corpus ready for BERT training!")
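 
Reading the first five lines confirms the file opens, but not that every record was written. A fuller check, sketched here with a hypothetical helper `count_jsonl_gz`, is to count the lines that parse as JSON and compare the total with the dataset length.

```python
import gzip
import json


def count_jsonl_gz(path):
    """Return (valid, invalid) line counts for a gzip-compressed JSONL file."""
    valid = 0
    invalid = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                # Blank or truncated lines end up here.
                invalid += 1
    return valid, invalid
```

In this notebook, `valid, invalid = count_jsonl_gz(output_path)` should give `valid == len(dataset['train'])` and `invalid == 0` if the save finished cleanly.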

## Optional: Create Plain Text Version

If you want a plain text file (one line of poetry per line) instead of JSON:

In [None]:
# Uncomment to create plain text version

# output_txt = "/content/drive/MyDrive/gutenberg_poetry_corpus.txt"

# print(f"Creating plain text version: {output_txt}")

# with open(output_txt, 'w', encoding='utf-8') as f:
#     for item in tqdm(dataset['train'], desc="Writing lines"):
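#         # Extract just the text of each line (field name 's' as listed under
#         # Corpus Metadata; check dataset['train'].column_names on a KeyError)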
#         f.write(item['s'] + '\n')

# print("✓ Plain text version created!")

---

## Next Steps

1. **View in Google Drive:** Check your Drive to confirm the file is there
2. **Prepare for BERT Training:** Use this corpus to train period-specific BERT models
3. **Clean/Filter (optional):** You may want to filter by time period or poet
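 
The corpus records carry only the line text and a book ID, so filtering by period or poet needs a list of book IDs obtained elsewhere (for example from the Project Gutenberg catalog). Assuming such a list is available, a filtering pass over the saved file might look like this sketch. The function name `filter_by_gid` and the `id_field` parameter are hypothetical.

```python
import gzip
import json


def filter_by_gid(in_path, out_path, keep_gids, id_field="gid"):
    """Copy only the records whose book ID is in `keep_gids`.

    IDs are compared as strings so that integer and string IDs both match.
    `in_path` and `out_path` must be different files.
    Returns the number of records kept.
    """
    keep = {str(g) for g in keep_gids}
    kept = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src:
        with gzip.open(out_path, "wt", encoding="utf-8") as dst:
            for line in src:
                record = json.loads(line)
                if str(record.get(id_field)) in keep:
                    dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                    kept += 1
    return kept
```

The pass streams the file line by line, so it does not need the full corpus in memory.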

---

## Corpus Metadata

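Each record in the corpus contains:
- `s`: The line of poetry (string)
- `gid`: Project Gutenberg book ID (integer)

You can use `gid` to group lines by book (and, via the Project Gutenberg catalog, by poet) or to filter specific works. If your copy of the dataset uses different column names, `dataset['train'].column_names` lists them.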
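 
As a rough illustration of grouping by book, the sketch below counts lines per book ID. The helper name `lines_per_book` is hypothetical and the sample records are made up for the example; the field name defaults to `gid` as described above.

```python
from collections import Counter


def lines_per_book(records, id_field="gid"):
    """Count how many lines of poetry each book ID contributes."""
    return Counter(record[id_field] for record in records)


# Made-up records in the shape described above (not real corpus data).
sample = [
    {"s": "An example line", "gid": 19},
    {"s": "Another example line", "gid": 19},
    {"s": "A line from a different book", "gid": 20},
]
counts = lines_per_book(sample)
```

On the full corpus, `lines_per_book(dataset['train']).most_common(10)` would list the ten books with the most lines.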