# Building a Hugging Face Dataset

## Why `datasets.Dataset`?

Hugging Face's `datasets` library provides:
- **Efficient data loading:** Lazy loading, memory mapping, streaming
- **Built-in splits:** Easy train/validation splitting
- **Hub integration:** Push and pull datasets with a single call
- **Tokenization helpers:** `Dataset.map` applies a tokenizer across the whole dataset in batches

For our hybrid workflow, converting to a `DatasetDict` lets us:
1. Push to the Hub from a CPU-only machine
2. Pull the dataset in Colab for GPU training
3. Keep the train/validation splits together under named keys

## Train/Validation Split Strategy

We hold out a small validation set (5% by default) because:
- Language models need as much training data as possible
- Validation is mainly for monitoring overfitting
- A small validation set is sufficient for perplexity checks

Use a **fixed random seed** so the same rows land in each split on every run.


In [None]:
# === TODO (you code this) ===
# Convert DataFrame to DatasetDict with train/validation split.
# Hints:
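#   - Use datasets.Dataset.from_pandas() to convert the DataFrame
#     (preserve_index=False avoids an extra index column)
#   - Use .train_test_split() with test_size=val_split
#   - train_test_split() names the held-out split 'test'; re-key it as 'validation'
#   - Set seed for reproducibility
# Acceptance:
#   - DatasetDict with 'train' and 'validation' keys and expected split sizes
#     (val ~ config.val_split)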

from datasets import Dataset, DatasetDict
import pandas as pd

def to_hf_dataset(df: pd.DataFrame, val_split: float, seed: int = 42) -> DatasetDict:
    """
    Convert DataFrame to Hugging Face DatasetDict with train/val split.
    
    Args:
        df: DataFrame with 'text' column
        val_split: Fraction for validation (e.g., 0.05)
        seed: Random seed for splitting
        
    Returns:
        DatasetDict: Dictionary with 'train' and 'validation' splits
    """
    raise NotImplementedError

# Load cleaned data and convert
df = pd.read_csv("data/processed/frankenstein_cleaned.csv")
dset = to_hf_dataset(df, val_split=0.05, seed=42)
print(f"Train: {len(dset['train'])}, Val: {len(dset['validation'])}")


## Pushing to Hugging Face Hub

Pushing to the Hub makes the dataset portable:
- Pull it in Colab without file transfers
- Share with collaborators
- Version control your data

**Note:** You'll need a Hugging Face access token with write permission. Set it as an environment variable (for example `HF_TOKEN`) or call `huggingface_hub.login()`.
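One possible shape for an optional push is sketched here. It is an illustration under stated assumptions, not the reference solution: the helper name `push_if_configured` is hypothetical, the token is assumed to live in the `HF_TOKEN` environment variable, and a recent `datasets` release is assumed, where `push_to_hub` accepts `private` and `token` keyword arguments.

```python
import os


def push_if_configured(dset, repo_id: str, private: bool = True) -> bool:
    """Hypothetical helper: push `dset` only when a token is available.

    Returns True if a push was attempted, False if it was skipped.
    """
    # HF_TOKEN is the environment variable name assumed in this sketch.
    token = os.environ.get("HF_TOKEN")
    if not token:
        # No credentials configured: skip cleanly instead of raising.
        print("HF_TOKEN not set; skipping push.")
        return False
    # push_to_hub uploads every split in the DatasetDict to the given repo.
    dset.push_to_hub(repo_id, private=private, token=token)
    return True
```

Returning a boolean rather than raising lets the notebook run end to end on machines without Hub credentials, which matches the "skipped cleanly" acceptance criterion.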


In [None]:
# === TODO (you code this) ===
# (Optional) Push dataset to the HF Hub.
# Hints:
#   - Read the token from an env var or prompt for it; call push_to_hub with repo_id
#   - Handle authentication (huggingface_hub.login() or token from env)
#   - Use private=True if you want to keep it private
# Acceptance:
#   - Dataset appears on the Hub, or the push is skipped cleanly if not configured

import os
from huggingface_hub import login

def maybe_push_dataset(dset, repo_id: str):
    """
    Optionally push dataset to Hugging Face Hub.
    
    Args:
        dset: DatasetDict to push
        repo_id: Hub repository ID (e.g., "username/dataset-name")
    """
    raise NotImplementedError

# Push if configured
repo_id = "YOURUSER/frankenstein-fanfic-snippets"  # Update this!
maybe_push_dataset(dset, repo_id)
