# NLP LAB2 - Training with Hugging Face

This lab is focused on training models from scratch using the Hugging Face environment, touching a few topics important within NLP, like Tokenizers, in-context learning, and Encoder-Decoder models.

We will try to implement a simple modulo calculator using a Transformer.

To accelerate your training with a GPU, click on "Change runtime type" on the "Runtime" tab and select T4 GPU. Note, however, that Colab will allow you to use GPU only for a limited time.


### Install Dependencies

In [23]:
%pip install gradio langchain openai datasets tokenizers datasets transformers numpy accelerate transformers[torch]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


### Generate Data

The code given below in the
 `generate_data_demo` function returns two strings, one representing a calculation using `length=3` digits with random operators from `['+', "-"]` and the other representing the results of this calculation modulo `modulo=100`.

 Example `generate_data_demo()`:

```X: "2 + 2 + 2" Y: "6"```


Your task is to write a `generate_data()` function that will expand this calculation step-by-step, starting from the left side:

Example `generate_data()`:

```X: "2 + 2 + 2" Y: "4 + 2 = 6"```


But first, finish the notebook using demo implementation, by assigning `generate_data = generate_data_demo`.

The goal of the notebook is to observe how differently these implementations behave.

In [24]:
import numpy as np
from datasets import Dataset

legal_ops = ['+', "-"]
size = 10000
length = 3
max_num = 10
modulo = 100
test_size = 0.3

def format_operation(numbers, ops):
    while len(ops) < len(numbers):
        ops = list(ops) + [""]

    return " ".join(f"{x} {op}" for x, op in zip(numbers, ops))

def eval_code(code):
    return str(eval(f"({code})%{modulo}"))

def eval_part(code):
    code.split(' ')

    return 

# DEMO generate_data
def generate_data_demo():
    numbers = np.random.randint(0, max_num, length)
    ops = np.random.choice(legal_ops, length - 1)
    formatted = format_operation(numbers, ops)
    out = eval_code(formatted)
    return formatted, out

# Implement a generate_data function that has the same interface as generate_data_demo function
## INSERT YOUR CODE HERE  ---------------------

def generate_data():
    numbers = np.random.randint(0, max_num, length)
    ops = np.random.choice(legal_ops, length - 1)
    formatted = format_operation(numbers[:2], ops[0])
    formatted_end = format_operation(numbers[2:], ops[1:])
    print(formatted, formatted_end)
    out = eval_code(formatted)
    return formatted, out

## END OF CODE ---------------------

# generate_data = generate_data_demo
# Uncomment the line above for a demo calculator

generate_data()

6 + 3  7 +


('6 + 3 ', '9')

#### Create a Hugging Face Dataset

Sample data using the generator and create a dataset.

In [25]:
# this ensures data does not repeat and therefore does not leak from train to test
raw_dataset = list({generate_data() for i in range(size)})

x_data, y_data = zip(*raw_dataset)
dataset = Dataset.from_dict({"text":x_data, "out":y_data})
dataset = dataset.train_test_split(test_size=test_size)
print(f"Train size = {len(dataset['train'])}, test size = {len(dataset['test'])}")
dataset["train"][10]

6 + 9  2 +
7 - 4  3 -
2 - 5  4 -
5 + 1  4 -
9 + 5  8 +
9 - 2  6 -
8 + 2  4 +
4 - 8  6 -
8 + 1  9 -
4 - 1  3 +
6 - 7  2 +
3 - 1  7 -
5 - 5  9 -
1 - 9  1 -
3 - 7  6 +
7 + 4  1 +
7 - 9  8 -
8 + 0  8 +
7 - 0  7 +
2 + 0  7 +
0 + 4  9 -
8 - 6  8 -
1 + 0  6 -
7 - 4  2 -
5 + 2  0 +
2 - 0  4 +
6 - 8  9 -
2 - 6  0 +
3 + 4  6 +
3 - 6  2 -
1 + 9  8 -
3 - 9  6 +
6 + 0  0 +
8 + 3  8 +
5 - 7  8 +
0 - 2  9 -
5 - 7  8 +
0 - 9  3 -
6 + 1  2 +
0 + 7  0 +
1 + 1  5 +
0 - 0  2 -
4 - 9  5 +
3 + 6  7 +
5 - 7  4 -
5 + 5  0 -
5 - 2  3 +
3 + 2  9 +
2 - 3  6 +
0 - 7  6 -
0 - 8  8 +
9 - 2  6 -
9 - 8  3 +
1 + 0  4 -
6 + 8  8 +
2 - 3  7 -
0 + 7  3 +
7 - 3  5 -
2 + 8  2 +
1 - 1  1 +
8 - 3  0 +
4 - 3  7 -
6 - 2  0 +
2 - 5  6 -
5 - 5  2 -
1 - 4  0 +
0 - 4  2 +
0 - 0  4 +
8 + 4  7 +
4 - 2  0 +
4 - 6  0 +
1 + 8  9 -
9 - 2  7 -
1 - 5  6 +
9 + 1  9 -
0 - 8  5 +
9 + 6  9 -
8 + 7  9 +
3 - 3  0 -
2 - 6  1 -
6 + 5  2 -
5 - 9  9 +
3 + 9  5 -
4 + 0  7 -
4 - 6  3 +
3 - 2  6 -
1 + 9  2 -
2 - 9  6 +
9 + 4  6 +
4 - 0  9 +
1 - 5  8 +

{'text': '8 - 9 ', 'out': '99'}

### Create the Tokenizer


As a first step, we need to tokenize our input string into a series of discrete tokens. We're going to use a Byte Pair Encoding Tokenizer, explained in the lecture, that is trained on a given dataset. Moreover, to train our model to work with sequences of different lengths (there is no guarantee how BPE will tokenize our dataset), we need to add a few special tokens:
- `[EOS]` a special token inserted at the end of the desired output, that is used to end the variable sequence length generation process.
- `[UNK]` or tokens unseen during the training.
- `[CLS]` which is a token separating the encoder and the decoder part.
- `[PAD]`  a padding token, inserted after `[EOS]` to match the desired sequence length in a batch, actually masked during training.

In [26]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[EOS]", "[UNK]", "[CLS]", "[PAD]"])

tokenizer.train_from_iterator(raw_dataset, trainer=trainer)
tokenizer.post_processor = TemplateProcessing(
    single="$0 [EOS]",
    special_tokens=[("[EOS]", tokenizer.model.token_to_id("[EOS]"))],
)
pretokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
pretokenizer.add_special_tokens({'pad_token': '[PAD]', 'cls_token': "[CLS]", 'eos_token': "[EOS]"})
print(f"Size of the dictionary: {len(pretokenizer)}")
print(tokenizer.encode(*generate_data()).tokens)



Size of the dictionary: 34


7 + 3  2 +
['7', '+', '3', '10']


In [27]:
def preprocess_function(examples):
    model_inputs = pretokenizer(examples["text"], text_target=examples["out"])
    return model_inputs

tokenized_data = dataset.map(preprocess_function, batched=True)
print(tokenized_data["train"][10])
print(tokenized_data["train"][11])

Map:   0%|          | 0/140 [00:00<?, ? examples/s]

Map:   0%|          | 0/60 [00:00<?, ? examples/s]

{'text': '8 - 9 ', 'out': '99', 'input_ids': [14, 5, 15, 0], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1], 'labels': [17, 0]}
{'text': '0 + 8 ', 'out': '8', 'input_ids': [6, 4, 14, 0], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1], 'labels': [14, 0]}


## Create the Model

There are three basic types of Transformers models:
- Decoder-only: sequence is processed from left to right, each token sees only the previous ones, like GPT.
- Encoder-only: i.e. BERT, sequence is processed without any order. This does not allow autoregressive inference, however is better for sequence classification, since each token can get information from any other one.
- Encoder-Decoder, with two sequences, one input sequence processed using an Encoder, and an output sequence is generated sequentially using a Decoder that can also attend to the output of the Encoder.


The general rule of thumb is to use Encoder-Decoder architectures for a Sequence to Sequence tasks, for example for a translation.


Here, we're also having a Seq2Seq task, but we could also use a Decoder-only architecture. Why it might be a worse idea? Note that, for a `demo` task, Encoder-only architecture could also be utilized.


In [28]:
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

args = dict(
    vocab_size=len(pretokenizer),
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,  # 4*hidden_size is the standard
)

config_encoder = BertConfig(**args)
config_decoder = BertConfig(**args)
config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

# rewrite special token id's to the model
config.decoder_start_token_id = pretokenizer.cls_token_id
config.pad_token_id = pretokenizer.pad_token_id
config.eos_token_id = pretokenizer.eos_token_id
config.unk_token_id = pretokenizer.unk_token_id

# Initializing a Bert2Bert model (with random weights) from the bert-base-uncased style configurations
model = EncoderDecoderModel(config=config)

Config of the encoder: <class 'transformers.models.bert.modeling_bert.BertModel'> is overwritten by shared encoder config: BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.47.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 34
}

Config of the decoder: <class 'transformers.models.bert.modeling_bert.BertLMHeadModel'> is overwritten by shared decoder config: BertConfig {
  "add_cross_attention": true,
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "interme

## Create a Hugging Face Repository
Remember that Google Colab does not have any permanent storage and after you restart your session your model will be lost. Hugging Face provides a convenient way to store and share your models with others through repositories.


To create a repository you first need to have an account: https://huggingface.co/join


Then, create an HF authentication token: https://huggingface.co/settings/tokens


In [None]:
from huggingface_hub import create_repo
import os

repo_name = "calculator_model_test"
# insert your token here
access_token = ""

os.environ["HF_TOKEN"] = access_token
create_repo(repo_name, exist_ok=True)

RepoUrl('https://huggingface.co/Jokilos/calculator_model_test', endpoint='https://huggingface.co', repo_type='model', repo_id='Jokilos/calculator_model_test')

## Train the Model

This might take a while if you do not use GPU acceleration.


How is the training different from the `demo` version? In theory, the `step-by-step` calculation forces the model to produce complex output, which might be harder to learn. On the other hand, it also uses more computation to generate the output (that might be beneficial to performance) and can benefit from in-context learning. Is loss enough to compare these two models?


Note that, training loss is averaged from the whole epoch, while validation loss is calculated after the epoch ends, therefore when the training curve is steep, especially at the beginning, training loss may exceed the validation loss. Moreover, label smoothing and data augmentation can also influence this.



In [None]:
#@title Insert your own Credentials

from clearml import Task

web_server = 'https://app.clear.ml'
api_server = 'https://api.clear.ml'
files_server = 'https://files.clear.ml'
access_key = 'ZFPJL4XWHPUU24S70P3YVFQIBNGQY4'#@param {type:"string"}
secret_key = ''#@param {type:"string"}

Task.set_credentials(web_host=web_server,
                     api_host=api_server,
                     files_host=files_server,
                     key=access_key,
                     secret=secret_key)

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq


data_collator = DataCollatorForSeq2Seq(tokenizer=pretokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="calculator_model_test",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-3,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=512,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,
    predict_with_generate=True,
    fp16=False,  # you can change it to True for a faster training on GPU
    push_to_hub=True,
    hub_token=access_token,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    data_collator=data_collator,
    tokenizer=pretokenizer,
)

from transformers import BertModel

original_forward = BertModel.forward

def patched_forward(self, *args, **kwargs):
    """Patched forward method that accepts **kwargs."""
    kwargs.pop("num_items_in_batch", None)
    return original_forward(self, *args, **kwargs)

# Apply the patch
BertModel.forward = patched_forward

trainer.train()

  trainer = Seq2SeqTrainer(


ClearML Task: created new task id=a457361262f840dfb79cf393836965d3
2025-03-04 11:49:19,259 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/2239cef3aefd4d71b8c23178f7921860/experiments/a457361262f840dfb79cf393836965d3/output/log


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring


  0%|          | 0/40 [00:00<?, ?it/s]

RecursionError: maximum recursion depth exceeded

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start


Retrying (Retry(total=237, connect=237, read=240, redirect=240, status=240)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x715ff026f530>: Failed to resolve 'api.clear.ml' ([Errno -3] Temporary failure in name resolution)")': /v2.23/tasks.ping




Retrying (Retry(total=237, connect=237, read=240, redirect=240, status=240)) after connection broken by 'NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x716138bd9010>: Failed to resolve 'api.clear.ml' ([Errno -3] Temporary failure in name resolution)")': /v2.23/events.add_batch


2025-03-04 16:19:16,994 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=stopped)> (task=a457361262f840dfb79cf393836965d3, artifacts=[{'key': 'notebook preview', 'type': 'custom', 'uri': 'https://files.clear.ml/HuggingFace%20Transformers/Trainer.a457361262f840dfb79cf393836965d3/artifacts/notebook%20preview/notebook_a457361262f840dfb79cf393836965d3.html', 'content_size': 499402, 'hash': 'aa024470c6a814b070d47e2611b8fbb3fbc5a27ceddcaddd02c93b5dfa7e8fd7', 'timestamp': 1741085369, 'type_data': {'preview': 'Click `FILE PATH` link', 'content_type': 'text/html'}, 'display_data': [('UPDATE', '2025-03-04 10:49:28')]}, {'key': 'notebook', 'type': 'custom', 'uri': 'https://files.clear.ml/HuggingFace%20Transformers/Trainer.a457361262f840dfb79cf393836965d3/artifacts/notebook/NLP_LAB2_Training_with_Hugging_Face.ipynb', 'content_size': 344938, 'hash': 'b2484941bf2900c3b594b70d3ad0bae9e1849e21b4ada5c65d2ebb32acdf225b', 

: 

## Save the Model to HF

We can easily save the trained model to the repository. This will create a new commit, with the model and the tokenizer stored as Git LFS files.

In [None]:
trainer.push_to_hub()

## Download the Model from HF

Now we can download and try to evaluate the model. You can also download a model someone else has created by providing their username.


In [None]:
from transformers import EncoderDecoderModel
from transformers import PreTrainedTokenizerFast

# insert your (or someone else's) username here
login="ludziej"

pretokenizer = PreTrainedTokenizerFast.from_pretrained(f"{login}/{repo_name}")
model = EncoderDecoderModel.from_pretrained(f"{login}/{repo_name}")

## Evaluate

Note that, `model.generate` function will stop after the model returns the `[EOS]` token.

In [None]:
def evaluate(text, skip_special_tokens=True):
  inputs = pretokenizer(text, return_tensors="pt").input_ids
  inputs = inputs.to(model.device.type)
  outputs = model.generate(inputs, max_new_tokens=10)
  return pretokenizer.decode(outputs[0], skip_special_tokens=skip_special_tokens)

print(evaluate("3 + 2 - 2", skip_special_tokens=False))
print(evaluate("1 + 2 + 6"))

## Gradio Inference API

Using GradIO, we can easily create an interface that will allow us to test the app. This type of interface is available online while this cell is running and could also be embedded on the Hugging Face website with proper integration within the repository.


In [None]:
import gradio as gr

app = gr.Interface(
    fn=evaluate,
    inputs=["text"],
    outputs=["text"],
    description="Calculator"
)
app.launch(debug=True)

## Improve the Model

Try to manually assess the model's performance using the app above. Where are the weak points? What may be a possible solution? You can try to use this notebook to fix identified problems. First, try to change training hyperparameters. What's the best validation accuracy you can get?

You can also try to change the dataset to create some other calculators that i.e allow multiplication or a wider range of numbers.

## Further reading

The lecture inspired by Grokking Blogpost by Neel Nanda:  https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking


Grokking is a phenomenon, where neural networks at first learn to memorize the training dataset and then, within a phase change, learn the features that allow them to extrapolate to the validation dataset. This can be observed in a simple scenario of a 2-layer MLP trained on modular addition and was part of Mechanistic Intepretability research by Neel Nanda. However, this is an advanced reading, beyond the scope of the lecture, recommended to advanced students.
This notebook may be an environment to experiment with similar ideas.


