# WikiText-103 with Hugging Face + Needle

This notebook does the following:

1. Uses **Hugging Face `datasets`** to download WikiText-103 (`wikitext-103-v1`).
2. Writes the splits into `wiki.train.tokens`, `wiki.valid.tokens`, `wiki.test.tokens` in a local folder.
3. Uses your existing **`needle.data.datasets.wikitext_dataset`** `Corpus` + `batchify` utilities.
4. Trains and evaluates a language model using your **`train_wikitext`** and **`evaluate_wikitext`** functions from `apps/simple_ml.py`.

In [None]:
import os, math, time, sys
import numpy as np

# Make `python/` visible as a package root
sys.path.append("python")

import needle as ndl
import needle.nn as nn
from needle import Tensor

# Training / evaluation helpers from your homework
from apps.simple_ml import train_wikitext, evaluate_wikitext

# Your language model definition (adjust class / args as needed)
from apps.models import LanguageModel  # change if your LM class has a different name

# Your WikiText dataset helpers
from needle.data.datasets import wikitext_dataset as wt

device = ndl.cpu()  # or ndl.cuda() if you wired up a GPU backend


## Download WikiText-103 using Hugging Face `datasets`

The next cell defines a small helper, `download_wikitext103_hf`, that

- calls `load_dataset("wikitext", "wikitext-103-v1")`
- writes `wiki.train.tokens`, `wiki.valid.tokens`, `wiki.test.tokens`
  into `data_dir`, skipping the download if all three files already exist.

In [None]:
def download_wikitext103_hf(data_dir: str = "./wikitext-103", overwrite: bool = False) -> str:
    """Download WikiText-103 via Hugging Face `datasets`.

    Creates three files in `data_dir`:
        - wiki.train.tokens
        - wiki.valid.tokens
        - wiki.test.tokens

    Returns:
        data_dir (str): directory containing the .tokens files.
    """
    os.makedirs(data_dir, exist_ok=True)

    train_f = os.path.join(data_dir, "wiki.train.tokens")
    valid_f = os.path.join(data_dir, "wiki.valid.tokens")
    test_f  = os.path.join(data_dir, "wiki.test.tokens")

    if (not overwrite
        and os.path.exists(train_f)
        and os.path.exists(valid_f)
        and os.path.exists(test_f)):
        print(f"[wikitext] Files already exist in {data_dir}, skipping download.")
        return data_dir

    try:
        from datasets import load_dataset
    except ImportError as e:
        raise RuntimeError(
            "Hugging Face `datasets` is not installed. "
            "Install it with `pip install datasets`."
        ) from e

    print("[wikitext] Downloading WikiText-103 via Hugging Face datasets...")
    ds = load_dataset("wikitext", "wikitext-103-v1")

    def _write_split(split_name: str, out_path: str):
        with open(out_path, "w", encoding="utf-8") as f:
            for row in ds[split_name]:
                # Defensive: treat a missing (None) text field as an empty line
                text = row["text"] if row["text"] is not None else ""
                f.write(text.rstrip() + "\n")

    _write_split("train", train_f)
    _write_split("validation", valid_f)
    _write_split("test", test_f)

    print(f"[wikitext] Saved splits to {data_dir}")
    return data_dir

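Before building a `Corpus`, it can help to confirm that the written files look sane. The sketch below is a hypothetical helper (`summarize_tokens_file` is not part of Needle or `datasets`) that counts lines, whitespace-separated tokens, and distinct tokens. It assumes plain whitespace tokenization, so its numbers may differ from what your `Corpus` reports if that class adds markers such as `<eos>` or uses subwords.

```python
import os
import tempfile
from collections import Counter


def summarize_tokens_file(path, max_lines=None):
    """Count lines, whitespace-separated tokens, and distinct tokens in a text file."""
    n_lines = 0
    n_tokens = 0
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Optionally stop early so large files can be sampled quickly.
            if max_lines is not None and i >= max_lines:
                break
            tokens = line.split()
            n_lines += 1
            n_tokens += len(tokens)
            counts.update(tokens)
    return {"lines": n_lines, "tokens": n_tokens, "unique": len(counts)}


# Tiny self-contained demo on a temporary file (made-up text, not real WikiText data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tokens")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(" = Title = \n\nthe cat sat on the mat\n")
    demo_stats = summarize_tokens_file(demo_path)
    print(demo_stats)
```

On the real data you would call something like `summarize_tokens_file(os.path.join(data_dir, "wiki.valid.tokens"))`; passing `max_lines` keeps the check fast on the much larger training split.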

## `Corpus` and `batchify`

We now:

1. Call the downloader (only downloads the first time).
2. Use `wt.Corpus` + `wt.batchify` to get language-model training data.

In [None]:
# Directory where .tokens files will live
data_dir = "./wikitext-103"  # you can change this

# 1) Download (does nothing if files already exist and overwrite=False)
download_wikitext103_hf(data_dir, overwrite=False)

# 2) Build Corpus
# NOTE: adjust use_subword / vocab size to match your wikitext_dataset implementation.
corpus = wt.Corpus(
    data_dir,
    max_lines=None,          # set to a small int to debug on fewer lines
    use_subword=False,       # True if you added BPE/subword support there
)

vocab_size = corpus.vocab_size
print("Vocab size:", vocab_size)

batch_size = 32

train_data = wt.batchify(corpus.train, batch_size, device=device, dtype="float32")
valid_data = wt.batchify(corpus.valid, batch_size, device=device, dtype="float32")
test_data  = wt.batchify(corpus.test,  batch_size, device=device, dtype="float32")

print("Train data shape:", train_data.shape)
print("Valid data shape:", valid_data.shape)
print("Test  data shape:", test_data.shape)

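To make the printed shapes easier to interpret, here is a NumPy sketch of the layout that `batchify` helpers conventionally produce for word-level language modeling. It is an assumption that your `wt.batchify` follows this convention (it also takes `device` and `dtype`, which this sketch ignores); `batchify_sketch` is a hypothetical name used only for illustration.

```python
import numpy as np


def batchify_sketch(token_ids, batch_size):
    """Arrange a 1-D id stream into shape (nbatch, batch_size).

    Tokens that do not fill a complete row are dropped.
    """
    token_ids = np.asarray(token_ids)
    nbatch = len(token_ids) // batch_size
    trimmed = token_ids[: nbatch * batch_size]
    # After the transpose, each column holds one contiguous chunk of the stream.
    return trimmed.reshape(batch_size, nbatch).T


demo = batchify_sketch(np.arange(10), batch_size=3)
print(demo.shape)
print(demo)
```

Under this layout, reading down a column walks forward through the text, so the first dimension of `train_data.shape` is the number of time steps and the second is `batch_size`.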

## Language model

The next cell assumes a `LanguageModel` class in `apps/models.py` whose constructor takes
`vocab_size`, `embedding_size`, `hidden_size`, `num_layers`, `device`, `dtype`.

Change the class name or constructor arguments if your implementation differs.

In [None]:
embedding_size = 512
hidden_size = 512
num_layers = 2

model = LanguageModel(
    vocab_size=vocab_size,
    embedding_size=embedding_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    device=device,
    dtype="float32",
)

print(model)

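With a word-level vocabulary as large as WikiText-103's, the vocabulary-sized layers tend to dominate memory. The sketch below gives a rough estimate under the assumption that the model has an embedding table of shape `(vocab_size, embedding_size)` and an output linear layer from `hidden_size` to `vocab_size` with a bias; recurrent layers, gradients, and optimizer state are not counted. Both function names are hypothetical.

```python
def estimate_vocab_params(vocab_size, embedding_size, hidden_size):
    """Parameters in an embedding table plus a hidden-to-vocab linear layer with bias."""
    embedding = vocab_size * embedding_size
    output = hidden_size * vocab_size + vocab_size
    return embedding + output


def float32_gib(n_params):
    """Memory in GiB for n_params values stored as 4-byte float32."""
    return n_params * 4 / 2**30


# Small made-up sizes, just to show the arithmetic.
demo_params = estimate_vocab_params(vocab_size=1000, embedding_size=8, hidden_size=16)
print(demo_params, float32_gib(demo_params))
```

Calling `float32_gib(estimate_vocab_params(vocab_size, embedding_size, hidden_size))` with the notebook's values gives a lower bound on the model's footprint; if it is too large for your device, reducing `embedding_size` / `hidden_size` or setting `max_lines` in `Corpus` are the obvious knobs.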

## Train on WikiText-103

Call the `train_wikitext` function from `apps/simple_ml.py`.
Feel free to tweak `n_epochs`, `lr`, the optimizer, and so on.

In [None]:
seq_len = 40         # BPTT length
n_epochs = 5
learning_rate = 4.0
weight_decay = 0.0
clip = 0.25

start_time = time.time()
train_acc, train_loss = train_wikitext(
    model,
    train_data,
    seq_len=seq_len,
    n_epochs=n_epochs,
    optimizer=ndl.optim.SGD,   # or ndl.optim.Adam
    lr=learning_rate,
    weight_decay=weight_decay,
    loss_fn=nn.SoftmaxLoss,
    clip=clip,
    device=device,
    dtype="float32",
)
end_time = time.time()

print(f"Training finished in {end_time - start_time:.2f} seconds.")
print(f"Final train loss: {train_loss:.4f}, train acc: {train_acc:.4f}")
print(f"Train perplexity: {math.exp(train_loss):.4f}")

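The `seq_len` argument controls how many time steps each truncated-BPTT window covers. How `train_wikitext` slices the data internally depends on your implementation; the NumPy sketch below shows the common pattern, assuming batched data of shape `(nbatch, batch_size)`. `get_batch_sketch` is a hypothetical name.

```python
import numpy as np


def get_batch_sketch(batches, i, seq_len):
    """Return (inputs, targets) for the window starting at row i.

    inputs has shape (length, batch_size); targets holds the next token for
    every input position, flattened to shape (length * batch_size,).
    """
    # The last row has no next token, so the window may be shorter than seq_len.
    length = min(seq_len, len(batches) - 1 - i)
    inputs = batches[i : i + length]
    targets = batches[i + 1 : i + 1 + length].reshape(-1)
    return inputs, targets


# Toy stream: 4 time steps, batch size 3; column j is the chunk [4j, 4j+1, 4j+2, 4j+3].
stream = np.arange(12).reshape(3, 4).T
x0, y0 = get_batch_sketch(stream, 0, seq_len=2)
x1, y1 = get_batch_sketch(stream, 2, seq_len=2)
print(x0.tolist(), y0.tolist())
print(x1.tolist(), y1.tolist())
```

A training loop would typically visit window starts `range(0, len(batches) - 1, seq_len)`, so the final window is shorter whenever the number of rows minus one is not a multiple of `seq_len`.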

## Evaluate on validation and test

Call `evaluate_wikitext` on the held-out splits, using the same `seq_len` as in training.

In [None]:
val_acc, val_loss = evaluate_wikitext(
    model,
    valid_data,
    seq_len=seq_len,
    loss_fn=nn.SoftmaxLoss,
    device=device,
    dtype="float32",
)
print(f"Valid loss: {val_loss:.4f}, valid acc: {val_acc:.4f}")
print(f"Valid perplexity: {math.exp(val_loss):.4f}")

test_acc, test_loss = evaluate_wikitext(
    model,
    test_data,
    seq_len=seq_len,
    loss_fn=nn.SoftmaxLoss,
    device=device,
    dtype="float32",
)
print(f"Test loss: {test_loss:.4f}, test acc: {test_acc:.4f}")
print(f"Test perplexity: {math.exp(test_loss):.4f}")

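The perplexity lines above apply `math.exp` to the reported loss, which is meaningful when that loss is the average per-token cross-entropy in nats (an assumption about what `train_wikitext` and `evaluate_wikitext` return). The sketch below restates that relationship and adds a uniform-guess baseline for comparison; both function names are hypothetical.

```python
import math


def perplexity(avg_nll):
    """Perplexity from the average per-token negative log-likelihood (natural log)."""
    return math.exp(avg_nll)


def uniform_baseline_perplexity(vocab_size):
    """Perplexity of a model that assigns probability 1 / vocab_size to every token."""
    return perplexity(math.log(vocab_size))


# A loss of ln(10) means the model is, on average, as uncertain as a
# uniform choice among 10 tokens.
demo_ppl = perplexity(math.log(10))
demo_baseline = uniform_baseline_perplexity(1000)
print(demo_ppl, demo_baseline)
```

Comparing the validation perplexity with `uniform_baseline_perplexity(vocab_size)` is a quick way to see whether the model has learned anything beyond chance.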