# NL-HOPE: Nested Learning & HOPE Sequence Model (Colab)

This notebook shows how to:

1. Set up the **NL-HOPE** PyTorch implementation.
2. Train the HOPE model on **any Hugging Face text dataset** (WikiText-2 by default).
3. Evaluate perplexity.
4. Generate samples from a trained checkpoint.

Repository: `https://github.com/AMIRMOHAMMAD-OSS/Google_NestedLearning`

> **Tip:** You can copy this notebook file into your repo as `notebooks/NL_HOPE_colab.ipynb` and add an "Open in Colab" badge in your README.

In [None]:
import torch
print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    !nvidia-smi
else:
    print("No GPU detected. Training will be slow. In Colab, go to Runtime -> Change runtime type -> GPU.")

## 1. Setup: Clone repo & install dependencies

This cell will:

- Clone the `Google_NestedLearning` repository (if not already cloned)
- Install the Python dependencies listed in `requirements.txt`
- Set `PYTHONPATH` so that the repository's scripts can `import nl`
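
Python reads `PYTHONPATH` only at interpreter startup, so `%env PYTHONPATH=...` affects processes launched from the notebook (such as `!python scripts/train.py`) but not the kernel that is already running. If you also want to import `nl` directly in a notebook cell, one option is to extend `sys.path` yourself. A minimal sketch, assuming the repo lives at `/content/Google_NestedLearning` as in the setup cell:

```python
import sys

REPO_DIR = "/content/Google_NestedLearning"  # clone location used in this notebook

# Put the repo first on the kernel's module search path (safe to run repeatedly).
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)

# Once the clone step has run, `import nl` should resolve inside the notebook too.
```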

In [None]:
import os

%cd /content

if not os.path.exists("Google_NestedLearning"):
    !git clone https://github.com/AMIRMOHAMMAD-OSS/Google_NestedLearning.git
else:
    print("Repo already cloned.")

%cd /content/Google_NestedLearning
%env PYTHONPATH=/content/Google_NestedLearning

!pip install -q -r requirements.txt
print("Setup complete.")

## 2. Choose your dataset

This implementation uses **Hugging Face Datasets** under the hood. You have two main options:

1. **Use a built-in HF dataset** (recommended)
   - e.g. `wikitext`, `imdb`, `bookcorpus`, or any other dataset whose samples contain at least one text field.
2. **Wrap your own data as a HF dataset**
   - e.g. upload text files, then use `datasets.load_dataset("text", data_files=...)` in your own script, and point the config at that dataset.

The next cell writes a small **data config** YAML file that tells the training script which tokenizer, dataset, and text field to use.

> By default, we use **WikiText-2** with the GPT-2 tokenizer, a small dataset that makes for a quick first run.

In [None]:
%%writefile configs/data_custom.yaml
data:
  # Hugging Face tokenizer name
  tokenizer_name: gpt2

  # Hugging Face dataset name and config
  # Example: WikiText-2
  dataset_name: wikitext
  dataset_config: wikitext-2-raw-v1

  # Name of the text field in the dataset
  text_field: text

  # You can change these to use another dataset. For example:
  #  - dataset_name: imdb
  #  - dataset_config: plain_text
  #  - text_field: text
  # Make sure the dataset exists on Hugging Face and that `text_field` matches a column name.

## 3. Model & training config (HOPE + CMS)

Next, we create a model/training config suitable for Colab.

- **`d_model`**, **`n_layers`**, and **`max_seq_len`** are kept moderate so that training fits on a Colab GPU.
- **`max_steps`** is small by default so you can quickly verify everything works.

You can always increase these values later for more serious experiments.
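
To get a feel for how small the default run is, the arithmetic below estimates how many tokens the model sees. It is a rough sketch that uses the default values from the config in this section and assumes every training sequence is filled to `max_seq_len`; the actual training script may pack or pad sequences differently.

```python
# Default values from the `model` and `train` sections of the config in this section.
batch_size = 2
grad_accum_steps = 1
max_seq_len = 128
max_steps = 500

# Tokens consumed per optimizer step, assuming full-length sequences.
tokens_per_step = batch_size * grad_accum_steps * max_seq_len
total_tokens = tokens_per_step * max_steps

print(f"tokens per step: {tokens_per_step}")  # 256
print(f"total tokens:    {total_tokens}")     # 128000
```

Raising `batch_size`, `grad_accum_steps`, or `max_steps` scales the token budget linearly, which is a simple way to plan longer experiments.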

In [None]:
%%writefile configs/hope_colab_user.yaml
exp_name: hope_colab_user
seed: 1337

model:
  vocab_size: 50257          # GPT-2 vocab size
  d_model: 256               # model hidden size
  d_ff: 1024                 # feedforward size for highest CMS level
  n_layers: 2                # number of HOPE/CMS layers
  max_seq_len: 128           # context length
  dropout: 0.1
  d_kv: 64                   # key/value dimension for fast memory

  # Continuum Memory System (CMS) levels
  # Each level has its own feedforward size and update frequency
  cms_levels:
    - { d_ff: 512,  update_every: 1 }
    - { d_ff: 1024, update_every: 8 }

  # Inner (fast) learning rule hyperparameters
  inner_lr: 5.0e-4
  inner_scale_xtx: 1.0
  inner_apply_during_eval: true
  inner_apply_during_sampling: false

train:
  batch_size: 2
  grad_accum_steps: 1
  lr: 3.0e-4
  weight_decay: 0.1

  # Keep these small at first; increase after you verify training works
  max_steps: 500
  warmup_steps: 50

  eval_every: 100
  ckpt_every: 500
  log_every: 10


## 4. Train HOPE on your chosen dataset

Now we call the repository's training script with:

- `--config configs/hope_colab_user.yaml` for model & training hyperparameters
- `--data configs/data_custom.yaml` for dataset & tokenizer

If you changed the dataset in `data_custom.yaml`, the script trains on that dataset instead of WikiText-2.

In [None]:
%cd /content/Google_NestedLearning
%env PYTHONPATH=/content/Google_NestedLearning

!python scripts/train.py \
  --config configs/hope_colab_user.yaml \
  --data configs/data_custom.yaml

## 5. Evaluate perplexity

After training, you can evaluate the model on the validation set using the same data config.

This expects a checkpoint at `outputs/hope_colab_user/last.ckpt` (the default location used by the training script).
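
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition on its own; `perplexity` is an illustrative helper, not a function from the repository, and `scripts/eval_ppl.py` may aggregate losses differently.

```python
import math

def perplexity(total_nll: float, num_tokens: int) -> float:
    """Perplexity from a summed negative log-likelihood (in nats) over num_tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll / num_tokens)

# Sanity check: a model that is uniform over the 50257-token GPT-2 vocabulary
# pays log(50257) nats per token, so its perplexity equals the vocabulary size.
vocab_size = 50257
uniform_ppl = perplexity(total_nll=1000 * math.log(vocab_size), num_tokens=1000)
print(round(uniform_ppl))  # 50257
```

A trained model should score well below this uniform baseline; a value near the vocabulary size suggests the checkpoint did not load or training did not run.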

In [None]:
%cd /content/Google_NestedLearning
%env PYTHONPATH=/content/Google_NestedLearning

ckpt_path = "outputs/hope_colab_user/last.ckpt"
print("Using checkpoint:", ckpt_path)

!python scripts/eval_ppl.py \
  --checkpoint "$ckpt_path" \
  --config_model configs/hope_colab_user.yaml \
  --data configs/data_custom.yaml

## 6. Generate text samples

Use `scripts/sample.py` to generate text from your trained HOPE model.

- `--prompt` sets the starting text.
- `--max_new_tokens` sets how many tokens to generate.
- `--temperature`, `--top_k`, and `--top_p` control how random the sampling is; lower values give more conservative text.
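
To make the three sampling knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It illustrates the standard technique only; `sampling_distribution` is a hypothetical helper, and `scripts/sample.py` may apply the filters in a different order or with different edge-case handling.

```python
import math
import random

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return per-token probabilities after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide the logits before the softmax; values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; everything else gets probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a 4-token vocabulary
dist = sampling_distribution(logits, temperature=0.8, top_k=50, top_p=0.95)
next_token = random.choices(range(len(dist)), weights=dist)[0]
print(dist, next_token)
```

With these toy logits, the least likely token is dropped by the top-p filter, so it can never be sampled.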

In [None]:
%cd /content/Google_NestedLearning
%env PYTHONPATH=/content/Google_NestedLearning

ckpt_path = "outputs/hope_colab_user/last.ckpt"
prompt = "Nested learning suggests"

!python scripts/sample.py \
  --checkpoint "$ckpt_path" \
  --prompt "$prompt" \
  --max_new_tokens 100 \
  --temperature 0.8 \
  --top_k 50 \
  --top_p 0.95

## 7. Training on your own data

To train HOPE on **your own dataset**:

### Option A: Use an existing Hugging Face dataset

1. Find a dataset on [https://huggingface.co/datasets](https://huggingface.co/datasets).
2. Identify the `dataset_name`, optional `dataset_config`, and the text column name.
3. Edit `configs/data_custom.yaml` and set:
   - `dataset_name` to the HF dataset name (e.g. `imdb`, `ag_news`, `bookcorpusopen`)
   - `dataset_config` to the appropriate config (or leave blank if none)
   - `text_field` to the column containing the text (e.g. `text`, `content`, etc.)
4. Rerun the **Train** cell.
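
The edit in step 3 can also be scripted. The sketch below builds the config text from the same four keys used earlier in this notebook; `render_data_config` is a hypothetical helper, and leaving `dataset_config` empty assumes the training script accepts a blank value, as the step above suggests.

```python
from typing import Optional

def render_data_config(dataset_name: str, text_field: str,
                       dataset_config: Optional[str] = None,
                       tokenizer_name: str = "gpt2") -> str:
    """Build the text of a data config with the keys used in this notebook."""
    lines = [
        "data:",
        f"  tokenizer_name: {tokenizer_name}",
        f"  dataset_name: {dataset_name}",
        # With no dataset config, the value is left blank.
        f"  dataset_config: {dataset_config or ''}".rstrip(),
        f"  text_field: {text_field}",
    ]
    return "\n".join(lines) + "\n"

config_text = render_data_config("ag_news", text_field="text")
print(config_text)

# To use it, write the text to the path read by the Train cell:
# with open("configs/data_custom.yaml", "w") as f:
#     f.write(config_text)
```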

---

### Option B: Wrap your own text into a HF dataset (advanced)

If you have raw text files, one workflow is:

1. Create a small Python script or notebook that uses `datasets.load_dataset` with the `"text"` builder, for example:

   ```python
   from datasets import load_dataset

   dataset = load_dataset(
       "text",
       data_files={
           "train": "path/to/train.txt",
           "validation": "path/to/val.txt",
           "test": "path/to/test.txt",
       }
   )
   print(dataset)
   ```

2. Either:
   - Use this custom dataset directly in your own training script, **or**
   - Push it to the Hugging Face Hub and reference it with `dataset_name` and `dataset_config` in `configs/data_custom.yaml`.

3. Ensure that `text_field` matches the name of the column that contains your text (often `"text"`).

Once your dataset is exposed via `datasets` in this way, you can train HOPE with the **same training and evaluation commands** used in this notebook.
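
The `data_files` example in step 1 assumes you already have separate train, validation, and test files. If you start from a single raw text file, one way to produce them is a contiguous split by line, since the `"text"` builder treats each line as one example. The sketch below is only an illustration: `split_text_file`, the file names, and the split fractions are assumptions, not part of the repository.

```python
from pathlib import Path

def split_text_file(source, out_dir, val_frac=0.05, test_frac=0.05):
    """Split the non-empty lines of `source` into train/val/test files, in order."""
    text = Path(source).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]

    n_val = int(len(lines) * val_frac)
    n_test = int(len(lines) * test_frac)
    n_train = len(lines) - n_val - n_test

    # Keep the original order so neighbouring lines stay in the same split.
    splits = {
        "train": ("train.txt", lines[:n_train]),
        "validation": ("val.txt", lines[n_train:n_train + n_val]),
        "test": ("test.txt", lines[n_train + n_val:]),
    }

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split, (filename, chunk) in splits.items():
        paths[split] = out_dir / filename
        paths[split].write_text("".join(line + "\n" for line in chunk), encoding="utf-8")
    return paths

# Example (paths are placeholders):
# paths = split_text_file("path/to/raw.txt", "path/to")
# The returned dict can be passed as `data_files` after converting the paths to strings.
```

A contiguous split keeps each document's lines together; if your file has no meaningful order, shuffling the lines before splitting is a reasonable alternative.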