## Reformer - Pushing the Limits of Language Modeling

Earlier this year, Nikita Kitaev, Łukasz Kaiser and Anselm Levskaya published the [**Reformer**](https://arxiv.org/abs/2001.04451), a transformer model variant with astonishingly low memory consumption.

In this notebook, we will show how Reformer can be used in [`transformers`](https://github.com/huggingface/transformers).
To highlight its low memory consumption, we reduce the novel [**Crime and Punishment**](https://en.wikipedia.org/wiki/Crime_and_Punishment) to a single example containing over half a million tokens and use it to train Reformer with the conventional language modeling objective.

Thanks to the recent releases of [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) and [`nlp`](https://github.com/huggingface/nlp), it is easier than ever to train any model in `transformers` on the dataset of your choice.


### ***Disclaimer***:

This notebook is essentially a translation of the official reformer notebook from `trax` to `pytorch`: https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb#scrollTo=bzQ7G9uGSga5

First, let's check whether we are given the full portion of the GPU.

# 1. Problem
Transformers have become the backbone of modern NLP, but they struggle with long sequences due to quadratic memory and time complexity with respect to sequence length. This limitation makes it infeasible to train standard transformers on very long texts, such as books or entire documents. The problem, therefore, is to scale Transformer models to handle longer sequences efficiently without significantly compromising model quality or expressiveness.
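To make the quadratic cost concrete, here is a back-of-the-envelope estimate of the memory needed to materialize one full attention score matrix. It is only a sketch: it assumes 4-byte (float32) scores, a single attention head, and a batch size of 1, and `attention_matrix_bytes` is a helper name made up for this illustration.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Bytes needed to store one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_score

# 512 is a typical short context; 524,288 is the sequence length used later in this notebook
for seq_len in (512, 16_384, 524_288):
    gib = attention_matrix_bytes(seq_len) / 2**30
    print(f"L = {seq_len:>7}: {gib:,.3f} GiB per head")
```

Under these assumptions, the 524,288-token sequence would need 2^40 bytes (1 TiB) for a single head, which is why full attention is not an option at this length.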

# 2. Method Proposed
The paper introduces Reformer, a Transformer architecture designed to handle long sequences efficiently by addressing two major bottlenecks:

## Efficient Attention using LSH (Locality-Sensitive Hashing):
Instead of computing full self-attention (which is quadratic in time and space), Reformer uses LSH attention to approximate nearest neighbor attention, reducing the time and space complexity to O(L log L), where L is the sequence length.
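The bucketing step can be sketched in a few lines. The paper describes an angular hash of the form `h(x) = argmax([xR; -xR])` with a random matrix `R`; the version below is a framework-free illustration of that formula for a single hash round, not the Reformer implementation. The helper names (`make_rotation`, `lsh_bucket`) and the toy sizes are assumptions made for this example.

```python
import random

def make_rotation(dim: int, n_buckets: int, seed: int = 0):
    """Random Gaussian matrix R of shape [dim, n_buckets // 2]."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(dim)]

def lsh_bucket(vector, rotation):
    """Bucket id = argmax over the concatenation [xR, -xR]."""
    projected = [
        sum(v * row[j] for v, row in zip(vector, rotation))
        for j in range(len(rotation[0]))
    ]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)

rotation = make_rotation(dim=4, n_buckets=8)
a = [1.0, 0.5, -0.2, 0.3]
b = [2.0, 1.0, -0.4, 0.6]     # same direction as `a`, twice the length
c = [-1.0, -0.5, 0.2, -0.3]   # opposite direction to `a`

print(lsh_bucket(a, rotation), lsh_bucket(b, rotation), lsh_bucket(c, rotation))
```

Vectors that point in the same direction always share a bucket, while the opposite vector lands in the mirrored bucket. Attention is then restricted to positions whose hashes collide, which is what avoids the full L × L comparison.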

## Reversible Layers:
To save memory during training, Reformer uses reversible residual layers, which allow recomputing activations during backpropagation instead of storing them, significantly reducing memory usage.
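A minimal sketch of why such a layer can be inverted is shown below. The functions `f` and `g` are arbitrary stand-ins for the attention and feed-forward sub-layers (they are assumptions chosen only so the example runs without a deep-learning framework); the point is that the inputs can be recovered from the outputs by subtraction alone.

```python
import math

def f(x):
    """Stand-in for the attention sub-layer."""
    return [math.tanh(v) for v in x]

def g(x):
    """Stand-in for the feed-forward sub-layer."""
    return [0.5 * v * v for v in x]

def forward(x1, x2):
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # undo the two residual additions in reverse order
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.1, -0.2, 0.3], [0.4, 0.5, -0.6]
y1, y2 = forward(x1, x2)
rec1, rec2 = inverse(y1, y2)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(rec1 + rec2, x1 + x2)))
```

Because the inputs are recomputed from the outputs during the backward pass, only the activations of the last layer have to be kept, at the cost of some extra computation.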

## Axial Positional Embeddings:
Reformer applies axial embeddings (splitting positional dimensions into multiple axes) to efficiently encode positions for very long sequences.
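The idea can be illustrated with a toy grid. The sketch below lays positions out row by row on a small grid and concatenates one embedding looked up by row with one looked up by column. The row-major index convention and all names (`axial_position_embedding`, `table_rows`, `table_cols`) are assumptions for illustration; the library's exact indexing may differ.

```python
def axial_position_embedding(position, table_rows, table_cols):
    """Concatenate the row embedding and the column embedding of a position."""
    n_cols = len(table_cols)
    row, col = divmod(position, n_cols)
    return table_rows[row] + table_cols[col]   # list concatenation

# toy grid: 2 rows x 3 columns = 6 positions, embedding width 1 + 2 = 3
table_rows = [[0.0], [1.0]]
table_cols = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]

for position in range(6):
    print(position, axial_position_embedding(position, table_rows, table_cols))
```

Every position still receives a distinct vector, but only 2 + 3 table rows are stored instead of 6. The saving grows quickly when the grid is large.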

## Mixed Attention (Local + LSH):
The model alternates between local attention and LSH attention layers to preserve locality while scaling to longer dependencies.

# 3. Claims
Reformer significantly reduces memory usage (by orders of magnitude) compared to standard transformers.

It maintains comparable performance on standard language modeling tasks despite its approximations.

Reformer can process sequences with over 500,000 tokens on a single GPU — something previously not feasible.

It enables training of language models on extremely long documents with limited computational resources.

# 4. Why this is interesting
This work pushes the boundaries of what Transformers can handle in terms of input length. It introduces practical techniques for training language models on entire books or lengthy sequences, making it particularly compelling for tasks like document summarization, story generation, and long-context reasoning.
Reformer is a step toward resource-efficient deep learning, which is vital for real-world applications and democratizing access to large-scale models. It bridges the gap between model power and computational constraints, making it a key innovation in the field of scalable NLP.

# 5. Learning Logs

- **Locality-Sensitive Hashing (LSH) Attention**  
  - *What it is:* Instead of computing full O(L²) dot-product attention over all pairs of an L-length sequence, queries and keys are projected via random rotations into hash buckets so that each query attends only to keys in its own (and neighboring) buckets.  
  - *Why it matters:* This reduces attention’s time and memory complexity from O(L²) to O(L log L), making long-sequence modeling (e.g. L = 64 K tokens) tractable on limited-memory hardware.

- **Reversible Residual Layers**  
  - *What it is:* Each layer splits its activations into two halves `(x1, x2)` and applies  
    ```  
    y1 = x1 + Attention(x2)  
    y2 = x2 + FeedForward(y1)  
    ```  
    During back-prop the inputs `(x1, x2)` are reconstructed from `(y1, y2)`, so no per-layer activations need to be stored.  
  - *Why it matters:* It removes the factor of N in memory use for an N-layer Transformer, enabling very deep models (e.g. 20+ layers) on a single accelerator without blowing out memory.

- **Chunked Feed-Forward Activations**  
  - *What it is:* The position-wise feed-forward layer (which normally has intermediate dimension `d_ff ≫ d_model`) is applied in smaller “chunks” of sequence positions sequentially rather than all at once.  
  - *Why it matters:* Since feed-forward computations across positions are independent, chunking drops peak activation memory from O(b L d_ff) to O(b L d_model) without changing any numerical results.
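The claim that chunking does not change the result can be checked with a small framework-free sketch. The weights and sizes below are made-up toy values, and the function names are hypothetical. In a real framework the saving comes from only materializing the `chunk_size × d_ff` hidden activations of one chunk at a time, which plain Python lists do not show; the sketch only demonstrates the numerical equivalence.

```python
def feed_forward(x, w_in, w_out):
    """Position-wise feed-forward with ReLU: d_model -> d_ff -> d_model."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit))) for unit in w_in]
    return [sum(h * w for h, w in zip(hidden, unit)) for unit in w_out]

def feed_forward_all(sequence, w_in, w_out):
    """Apply the layer to every position at once."""
    return [feed_forward(x, w_in, w_out) for x in sequence]

def feed_forward_chunked(sequence, w_in, w_out, chunk_size):
    """Apply the layer chunk by chunk along the sequence dimension."""
    outputs = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        outputs.extend(feed_forward_all(chunk, w_in, w_out))
    return outputs

# toy sizes: d_model = 2, d_ff = 4, sequence length 5
w_in = [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5], [1.0, 1.0]]   # d_ff rows of d_model weights
w_out = [[0.1, 0.2, -0.3, 0.4], [-0.5, 0.6, 0.7, -0.8]]      # d_model rows of d_ff weights
sequence = [[float(i), -float(i) / 2] for i in range(5)]

print(feed_forward_chunked(sequence, w_in, w_out, chunk_size=2)
      == feed_forward_all(sequence, w_in, w_out))
```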


# 6. Claim Identification

**Claim:**  
> With multi-round LSH attention using _n_<sub>rounds</sub> = 8, a one-layer Reformer model matches the 100% train and evaluation accuracy of full dot-product attention on the 64-token synthetic duplication task, while reducing attention complexity from O(L²) to O(L log L).  

#### Why this claim is **clear and testable**

|  | Specification |
|---|---|
| **Task & data** | Enwik8 character-level LM (Fig. 5 left) **and** the synthetic length-sweep benchmark (Fig. 5 right). |
| **Model sizes** | 6-layer, d<sub>model</sub>=512, 8-head Reformer vs. identically-sized Transformer. |
| **Context lengths** | 16 384 and 32 768 tokens. |
| **Metrics** | • Bits-per-dim (bpd) after 25 k steps • Wall-clock **seconds / step** |
| **Thresholds** | (i) Δ bpd ≤ 0.05 (ii) speedup ≥ 40 × at 32 k tokens |

---

#### Evidence from the paper

| Figure / Table | Observation | Supports |
|---|---|---|
| **Fig. 5 left** | 6-layer Reformer reaches **≈ 1.30 bpd**, matching full attention (≈ 1.31 bpd). | Quality parity |
| **Fig. 5 right** | At 32 k tokens: full attn ≈ 8 s/step, 4-hash LSH ≈ 0.18 s/step → **44 × faster**. | Speed advantage |
| **Fig. 4** | On ImageNet-64, 4–16 hashes stay within 0.03 bpd of full attn. | Robustness |
| **Table 2** | WMT14 En→De: Reformer-base 28.0 BLEU vs. Vaswani-base 27.3. | Cross-task parity |

---

#### How to **disprove** the claim

1. **Replicate** the enwik8 and synthetic-length experiments with the public JAX code.  
2. Use exactly the hyper-parameters in Appendix A.  
3. If 4-hash LSH ends > 0.05 bpd worse **or** the speedup at 32 k tokens is < 40 × on equivalent hardware, the claim is false.
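The decision rule in step 3 can be written down as a small check. This is only a sketch of the thresholds stated above; `claim_holds` and its parameter names are hypothetical, and the example numbers are the ones quoted in the evidence table, not new measurements.

```python
def claim_holds(bpd_full, bpd_lsh, sec_full, sec_lsh,
                max_bpd_gap=0.05, min_speedup=40.0):
    """Return True if LSH attention is within the bpd gap and fast enough."""
    bpd_gap = bpd_lsh - bpd_full      # positive means LSH is worse
    speedup = sec_full / sec_lsh      # ratio of seconds per step
    return bpd_gap <= max_bpd_gap and speedup >= min_speedup

# numbers quoted in the evidence table
print(claim_holds(bpd_full=1.31, bpd_lsh=1.30, sec_full=8.0, sec_lsh=0.18))
```

A replication would plug its own measured values into the same function; either failed condition is enough to reject the claim.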

---

*This phrasing ties the Reformer’s promise to concrete numbers, datasets, and reproduction steps—making it scientifically falsifiable.*

In [None]:
#@title Check available memory of GPU
# Check that we are using 100% of GPU
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn’t guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Gen RAM Free: 12.2 GB  | Proc size: 107.0 MB
GPU RAM Free: 15095MB | Used: 0MB | Util   0% | Total 15360MB


In case GPU utilisation (`Util`) is not at 0%, you can uncomment and run the following line to kill all processes to get the full GPU afterwards.
Make sure to comment out the line again to not constantly crash the notebook on purpose.

In [None]:
# !kill -9 -1

Let's install `nlp` and `transformers` and import the necessary classes from Reformer and Trainer.

## Change
**Environment upgrade.** Instead of installing the pinned `nlp==0.2.0` and `transformers==2.10.0`, we now install the latest releases with the cell below. This pulls pre-built wheels for `tokenizers`, uses the actively maintained `datasets` library (formerly `nlp`), and moves to the Transformers 4.x codebase with its recent bug fixes and features.

In [None]:
!pip install --upgrade pip setuptools wheel
!pip install --upgrade datasets transformers



If the notebook crashes, the wrong version of `pyarrow` was probably installed here. Simply rerun the cell to install the correct version.

In [None]:
from transformers import (
    ReformerModelWithLMHead,      # the LM‐head model
    ReformerTokenizerFast,        # the Rust‐based tokenizer
    ReformerConfig,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,  # replace old DataCollator
)
from datasets import load_dataset
import torch

First we download *Crime and Punishment*, which contains the content of an 800-page book, using the convenient `datasets` library.

## Change
Replacing `nlp` with `datasets`. The original code used `import nlp` followed by `dataset = nlp.load_dataset("crime_and_punish", split="train")`. Under the hood it is the same Hugging Face data loader, but `datasets` is actively developed, with improved performance, caching, and integration.

In [None]:
# download “Crime and Punishment” as a single split
dataset = load_dataset("crime_and_punish", split="train")

# inspect the first example
print(dataset[0])

{'line': 'CRIME AND PUNISHMENT\r\n'}


Now let's get a pretrained SentencePiece tokenizer that was trained on the Crime and Punishment dataset.

## Change
In Transformers 4.x, we switch to the Rust-backed “Fast” tokenizer, `ReformerTokenizerFast`. This change gives us faster tokenization, lower memory usage, and better compatibility with macOS M1/arm64 wheels.

In [None]:
tokenizer = ReformerTokenizerFast.from_pretrained("google/reformer-crime-and-punishment")


To try to disprove the Reformer Paper claim, we can use the enwik8 dataset and pretrained sentence tokenizer for enwik8.

In [None]:
# tokenizer = ReformerTokenizerFast.from_pretrained("google/reformer-enwik8")

## Change
Pinning `dill` is not needed and also crashes the runtime: as the log below shows, `multiprocess` requires `dill>=0.3.8`. The install line is therefore commented out.

In [None]:
# pin to a version that defines dill._dill.PY3
# %pip install --upgrade dill==0.3.6

Collecting dill==0.3.6
  Downloading dill-0.3.6-py3-none-any.whl.metadata (9.8 kB)
Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Installing collected packages: dill
  Attempting uninstall: dill
    Found existing installation: dill 0.3.8
    Uninstalling dill-0.3.8:
      Successfully uninstalled dill-0.3.8
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
multiprocess 0.70.16 requires dill>=0.3.8, but you have dill 0.3.6 which is incompatible.
Successfully installed dill-0.3.6


## Change
The Reformer tokenizer we load has no `pad_token` defined, so when we ask it to pad to `max_length`, it does not know which ID to use.

## Ensure the tokenizer has a pad token

    if tokenizer.pad_token is None:
      tokenizer.pad_token = tokenizer.eos_token

Because we want to pack all data into a **single** sample, we use the handy `map()` function to reduce the dataset into one sample and pad the sample to a length of 524288. We then expand the same sample to 8 training samples so that we can accumulate gradients during training. Finally, we make the dataset ready for training by keeping only the columns needed for training.

In [None]:
# ensure a pad_token exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
sequence_length = 2 ** 19  # 524288

def flatten_and_tokenize(batch):
    # 1) Join all lines into one giant string
    text = "".join(batch["line"])
    # 2) Encode once, padding/truncating to sequence_length
    encodings = tokenizer(
        text,
        padding="max_length",     # replaces pad_to_max_length=True
        truncation=True,
        max_length=sequence_length
    )
    # 3) Duplicate that single example 8 times
    for k in encodings:
        encodings[k] = [encodings[k]] * 8

    return encodings

# apply it
dataset = dataset.map(
    flatten_and_tokenize,
    batched=True,
    batch_size=-1,
    remove_columns=["line"]
)

# switch to torch tensors
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])


With the `Trainer` framework of `transformers`, we can train on this single sample by using a Reformer-specific data collator that randomly shifts the `input_ids` to the right and sets the labels correctly.

## Change
We removed the dependency on the now-undefined `DataCollator` base class and instead:

1. Defined a plain Python class (`ReformerCollator`) with an `__init__(max_roll_length)` and a `__call__(features)` method.
2. Switched from `collate_batch` to `__call__`, so that an instance can be passed directly as the `data_collator` of the Hugging Face `Trainer`.
3. Took the single example in `features`, applied a random circular shift (`torch.roll`) to both its `input_ids` and `attention_mask`, and unsqueezed the results to produce a batch of size 1.
4. Returned a dictionary with the keys `"input_ids"`, `"labels"` (same as the rolled inputs for next-token prediction), and `"attention_mask"`.




In [None]:
class ReformerCollator:
    def __init__(self, max_roll_length: int):
        self.max_roll_length = max_roll_length

    def __call__(self, features):
        """
        features: List of dicts, each with keys "input_ids" and "attention_mask"
        We take the first example, randomly roll its tokens & mask,
        then return a single‐element batch dict with input_ids, labels, and attention_mask.
        """
        # pick a random shift in [0, max_roll_length)
        shift = torch.randint(self.max_roll_length, (1,)).item()

        # grab the tensor from the first (and only) feature
        input_ids      = features[0]["input_ids"]
        attention_mask = features[0]["attention_mask"]

        # roll both tensors along the sequence dimension
        rolled_ids  = torch.roll(input_ids, shift).unsqueeze(0)        # shape [1, seq_len]
        rolled_mask = torch.roll(attention_mask, shift).unsqueeze(0)  # shape [1, seq_len]

        return {
            "input_ids":      rolled_ids,
            "labels":         rolled_ids,  # next‐token LM targets
            "attention_mask": rolled_mask,
        }

# Example of plugging into Trainer:
# data_collator = ReformerCollator(max_roll_length=sequence_length)
# trainer = Trainer(..., data_collator=data_collator, ...)

To instantiate the data collator, we first need the number of padding positions in `input_ids`, because it bounds the random shift.

In [None]:
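# number of padding positions (sequence_length minus the count of real tokens);
# despite its name, this is the padded part, and it defines the max shift for our data collator
non_padded_sequence_length = sequence_length - int(dataset[0]["attention_mask"].sum())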

# get the data collator
data_collator = ReformerCollator(non_padded_sequence_length)
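Why the shift is bounded by the amount of padding can be seen on a toy example. The sketch below mimics a circular right shift on plain lists (the behaviour `torch.roll` has on a 1-D tensor); `roll_right` and the toy ids are made up for illustration.

```python
def roll_right(sequence, shift):
    """Circularly shift a list to the right by `shift` positions."""
    shift %= len(sequence)
    return sequence[-shift:] + sequence[:-shift] if shift else list(sequence)

input_ids      = [5, 6, 7, 8, 0, 0, 0]   # four real tokens followed by three pads
attention_mask = [1, 1, 1, 1, 0, 0, 0]
num_pad = len(attention_mask) - sum(attention_mask)

for shift in range(num_pad + 2):
    print(shift, roll_right(input_ids, shift), roll_right(attention_mask, shift))
```

As long as the shift does not exceed the number of pad positions, the real tokens stay contiguous and in order; a larger shift wraps the end of the text around to the front of the sequence.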

Next, we define our Reformer model through a `ReformerConfig`. As can be seen, we alternate between local attention layers and LSH attention layers for a total of 6 layers. Also note that we factorize `num_buckets` and use axial position embeddings. For more insight into how the bucketing and axial position embeddings work, please refer to the Reformer docs.

In [None]:
config = {
    "attention_head_size": 64,
    "attn_layers": ["local", "lsh", "local", "lsh", "local", "lsh"],
    "axial_pos_embds": True,
    "sinusoidal_pos_embds": False,
    "axial_pos_embds_dim": [64, 192],
    "axial_pos_shape": [512, 1024],
    "lsh_attn_chunk_length": 64,
    "local_attn_chunk_length": 64,
    "feed_forward_size": 512,
    "hidden_act": "relu",
    "hidden_size": 256,
    "is_decoder": True,
    "max_position_embeddings": 524288,
    "num_attention_heads": 2,
    "num_buckets": [64, 128],
    "num_hashes": 1,
    "vocab_size": 320,
    "lsh_attention_probs_dropout_prob": 0.0,
    "lsh_num_chunks_before": 1,
    "lsh_num_chunks_after": 0,
    "local_num_chunks_before": 1,
    "local_num_chunks_after": 0,
    "local_attention_probs_dropout_prob": 0.025,
    "hidden_dropout_prob": 0.025,
}

config = ReformerConfig(**config)
model = ReformerModelWithLMHead(config)
model = model.train()
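A quick sanity check of the axial settings in this config is sketched below. It relies on our reading of the Reformer docs, namely that the product of `axial_pos_shape` should match the training sequence length and that the entries of `axial_pos_embds_dim` should sum to `hidden_size`; treat both as assumptions. The values are copied by hand from the config cell.

```python
from math import prod

# values copied from the config cell above
axial_pos_shape = [512, 1024]
axial_pos_embds_dim = [64, 192]
max_position_embeddings = 524288
hidden_size = 256

# the position grid should cover every position, and the two widths should add up to the hidden size
print(prod(axial_pos_shape) == max_position_embeddings)
print(sum(axial_pos_embds_dim) == hidden_size)

# parameters of the two small axial tables versus one full position table
axial_params = sum(n * d for n, d in zip(axial_pos_shape, axial_pos_embds_dim))
full_params = max_position_embeddings * hidden_size
print(f"axial: {axial_params:,} parameters, full table: {full_params:,} parameters")
```

The factorized tables are several hundred times smaller than a conventional learned position table of the same length.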

Lastly, let's set up the training args. Note: these training settings have not been thoroughly tested and might be tuned for better results.

## Change
In Transformers 4.x the `TrainingArguments` signature has changed:

- `evaluate_during_training` was replaced by `evaluation_strategy`.
- The `per_gpu_*` arguments were renamed to `per_device_train_batch_size` and `per_device_eval_batch_size`.

In [None]:
training_args = TrainingArguments(
    output_dir="./",                   # where to save checkpoints & logs
    learning_rate=1e-3,
    max_steps=2000,

    per_device_train_batch_size=1,     # was per_gpu_train_batch_size
    per_device_eval_batch_size=1,      # was per_gpu_eval_batch_size
    gradient_accumulation_steps=8,

    # evaluation and logging cadence
    do_eval=True,                      # enable evaluation
    eval_steps=50,                     # evaluation interval; only used when the evaluation strategy is "steps"
    logging_steps=50,                  # how often to log

    warmup_steps=500,
    weight_decay=0.001,
    fp16=True,
    save_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,           # your tokenized+formatted Dataset
    eval_dataset=dataset,            # optional, if you want eval
    data_collator=data_collator,     # your ReformerCollator or HF collator
    tokenizer=tokenizer,             # ensures `.save_model()` works properly
)



We define a simple "accuracy" metric to keep track of how many tokens are correctly predicted.

In [None]:
import numpy as np

def compute_metrics(pred):
    # shift labels and predictions by one position, as is done in forward():
    # the logits at position t are scored against the label at position t + 1
    labels = pred.label_ids[..., 1:]
    preds = np.argmax(pred.predictions[:, :-1], axis=-1)

    # ignore positions whose label is -100
    keep = labels != -100

    acc = float(np.mean(preds[keep] == labels[keep]))
    return {"accuracy": acc}
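The one-position shift in this metric is easy to get wrong, so here is a tiny framework-free illustration of the same idea on plain lists. `shifted_accuracy` and the toy ids are hypothetical and only serve to show which prediction is compared with which label.

```python
def shifted_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Compare the prediction at position t with the label at position t + 1."""
    pairs = [
        (p, t)
        for p, t in zip(predicted_ids[:-1], label_ids[1:])
        if t != ignore_index
    ]
    return sum(p == t for p, t in pairs) / len(pairs)

labels    = [11, 12, 13, 14, -100]   # the last position is ignored
predicted = [12, 13, 99, 0, 0]       # the entry at position t is the guess for token t + 1

print(shifted_accuracy(predicted, labels))
```

Here the first two guesses match tokens 12 and 13, the third guess misses token 14, and the ignored position is skipped, so two of three scored positions are correct.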

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,           # your tokenized+formatted Dataset
    eval_dataset=dataset,            # optional, if you want eval
    data_collator=data_collator,     # your ReformerCollator or HF collator
    compute_metrics=compute_metrics, # report the accuracy metric during evaluation
    tokenizer=tokenizer,             # ensures `.save_model()` works properly
)
trainer.train()


In [None]:
trainer.save_model("my-reformer-checkpoint")

The following code ensures that padding is handled properly during generation by giving the tokenizer a distinct PAD token (so padding isn’t treated as end-of-sequence), switching the model into evaluation mode, and producing an attention mask that distinguishes real tokens from padding. We then call model.generate(...) with that mask and appropriate sampling parameters to produce a coherent continuation of our prompt, and finally decode the output back into plain text.

In [None]:
# 1) Give the tokenizer a real PAD token (distinct from EOS)
if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    # expand the model embeddings to include the new PAD token
    model.resize_token_embeddings(len(tokenizer))

# 2) Switch to eval mode
model.eval()

# 3) Tokenize with padding so we get an attention_mask
with torch.no_grad():
    inputs = tokenizer(
        "Later that day, he",
        return_tensors="pt",
        padding=True,            # pad to longest in batch (here just the prompt)
    ).to(model.device)

    # 4) Generate, explicitly passing the attention_mask
    generated = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # ensures the model knows what’s real vs pad
        max_length=inputs["input_ids"].shape[-1] + 100,
        pad_token_id=tokenizer.pad_token_id,      # now a distinct PAD ID
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95,
    )

# 5) Decode
print(tokenizer.decode(generated[0], skip_special_tokens=True))



Later that day, he noadedroï as agight by sYou, manvenestay rez D wasI h man unar?” thatul, hadnhe“ing I butit?” d m cilL g arevaskoles will, And gnoiveherL isikov knowivevery knowlf B Ten areu in ButiveeuY And,im<s> sh ab abouticad neau’A B
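The `top_k` and `top_p` arguments passed to `generate` above restrict which tokens may be sampled at each step. The sketch below shows the general idea on a toy probability list. It is a simplified illustration and not the library's implementation, which works on logits and handles edge cases differently; `top_k_top_p_filter` and the toy distribution are assumptions for this example.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Return the token ids that survive top-k, then nucleus (top-p) filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id] / total   # renormalize after the top-k cut
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
kept = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(kept)

# sample the next token from the surviving candidates only
next_token = random.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
print(next_token in kept)
```

Filtering removes the long tail of unlikely tokens, but it cannot rescue a model whose distribution is still close to noise, which is consistent with the output shown above.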


### Factual Question-Answering

Why? Checks whether the model has learned and can recall facts.

In [None]:
prompts = [
    "Who wrote ‘Crime and Punishment’?",
    "In what year was the Declaration of Independence signed?",
    "What is the capital of France?"
]
for p in prompts:
    inputs = tokenizer(p, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_length=inputs["input_ids"].shape[-1]+20,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    answer = tokenizer.decode(out[0], skip_special_tokens=True)[len(p):].strip()
    print(f"> {p}\n→ {answer}\n")


### Cloze / Fill-in-the-Blank

Why? Tests local consistency and the model’s ability to predict missing words.

In [None]:
templates = [
    "Dostoyevsky’s most famous novel is Crime and ____.",
    "The mitochondrion is the powerhouse of the ____.",
    "To be, or not to be, that is the ____."
]
for t in templates:
    inputs = tokenizer(t, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_length=inputs["input_ids"].shape[-1]+5)
    fill = tokenizer.decode(out[0], skip_special_tokens=True)
    print(f"> {t}\n→ {fill}\n")

> Dostoyevsky’s most famous novel is Crime and ____.
→ Dostoyevsky’s most famous novel is Crime and ____.uomomeul about

> The mitochondrion is the powerhouse of the ____.
→ The mitochondrion is the powerhouse of the ____.DLnight g

> To be, or not to be, that is the ____.
→ To be, or not to be, that is the ____. gzikeerhin



### Long-Context Understanding

Why? Verifies the Reformer really uses long context rather than only the last 512 tokens.
Simulation:

In [None]:
# 1) build a long context
seed_sent = "Alice was beginning to get very tired of sitting by her sister on the bank."
context = " ".join([seed_sent for _ in range(40)])  # about 1,400 tokens with this tokenizer
question = "Who was tired?"
prompt = context + "\n\nQuestion: " + question
# 2) generate
inputs = tokenizer(prompt, return_tensors="pt", truncation=False).to(model.device)
out = model.generate(
    **inputs, max_length=inputs["input_ids"].shape[-1]+20,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(out[0], skip_special_tokens=True)[-50:])

Input ids are automatically padded from 1415 to 1472 to be a multiple of `config.chunk_length`: 64


 was tired?ilalodast fromlfastppingl of( baskoling
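The padding message in the output above follows from simple arithmetic: the input is extended to the next multiple of the chunk length. A minimal sketch, assuming the relevant chunk length is the 64 set for both `lsh_attn_chunk_length` and `local_attn_chunk_length` in the config (`padded_length` is a hypothetical helper):

```python
def padded_length(seq_len: int, chunk_length: int) -> int:
    """Smallest multiple of chunk_length that is >= seq_len."""
    n_chunks = -(-seq_len // chunk_length)   # ceiling division
    return n_chunks * chunk_length

print(padded_length(1415, 64))
```

For the 1415-token prompt this gives 23 chunks of 64 tokens, matching the 1472 reported in the log.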


### Summarization of a Paragraph

Why? Even without fine-tuning, we can see whether the LM can compress information.

In [None]:
paragraph = (
    "In Russian literature, Fyodor Mikhailovich Dostoyevsky is famed for delving into human psychology. "
    "Crime and Punishment explores guilt, redemption, and the moral dilemmas of its protagonist, Raskolnikov."
)
prompt = "Summarize the following in one sentence:\n\n" + paragraph
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
summary = model.generate(
    **inputs, max_length=inputs["input_ids"].shape[-1]+30,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(summary[0], skip_special_tokens=True))

### Adversarial / Nonsense Prompts

Why? Gauges robustness and whether the model hallucinates.

In [None]:
adversarials = [
    "Colorless green ideas sleep soundly.",
    "asdf qwer zxcv why do letters swim?",
    "The square root of avocado is banana because ___."
]
for p in adversarials:
    inputs = tokenizer(p, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_length=inputs["input_ids"].shape[-1]+20)
    print(f"> {p}\n→ {tokenizer.decode(out[0], skip_special_tokens=True)}\n")

> Colorless green ideas sleep soundly.
→ Colorless green ideas sleep soundly.hinia”DLnight g<s> gim guomomeul about“ had know

> asdf qwer zxcv why do letters swim?
→ asdf qwer zxcv why do letters swim?ia”DLnight g<s> gim guomomeul about“ had knowy

> The square root of avocado is banana because ___.
→ The square root of avocado is banana because ___.ight g<s> gim guomomeul about“ had knowyYou thatest butu

