<a href="https://colab.research.google.com/github/AsRumi/Colab-Notebooks/blob/main/Rumis_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a Tokenizer

## Tokenizer Model Implementation From Scratch (From HuggingFace)

We use tokenizers because Transformer-based language models cannot process raw text: a tokenizer translates text into numerical data that the model can process. Many of these models use subword tokenization algorithms. A tokenizer can be trained on any comprehensive corpus of the target language. The goal is to find the most meaningful representation of the text.

A "vocabulary" is the set of unique tokens that make up a corpus. In a word-based tokenizer, each word is assigned an ID, starting from 0 and going up to the size of the vocabulary minus one. The tokenizer uses these IDs to identify words. You also need a token to represent unseen words that the tokenizer may encounter after training. This is usually called the "unknown" token and is often written as `<unk>`. Seeing your tokenizer produce a lot of these is a bad sign, because every `<unk>` discards the original word, so craft your vocabulary in a way that keeps `<unk>` tokens rare.
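To make the ID assignment concrete, here is a minimal sketch of a word-based vocabulary with an unknown token. The two-sentence corpus, the choice of ID 0 for `<unk>`, and the `encode` helper are assumptions made for illustration, not part of any library.

```python
# Toy corpus, made up for this illustration.
corpus = ["the cat sat", "the dog sat"]

UNK = "<unk>"
# Reserve ID 0 for the unknown token, then number words in order of first appearance.
vocab = {UNK: 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    """Map each whitespace-separated word to its ID, falling back to <unk>."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

print(vocab)                   # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(encode("the bird sat"))  # [1, 0, 3]
```

The word "bird" never appeared in the corpus, so it collapses to ID 0 and its meaning is lost, which is why a high `<unk>` rate is a warning sign.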

One way to avoid that is to use a character-based tokenizer. This technique has two benefits: a much smaller vocabulary and a much lower chance of encountering the `<unk>` token. This approach is not that great either, since a character on its own does not mean much and handling punctuation becomes a daunting task too. The in-between solution is "subword" tokenization.

subword -> sub + word: this is the technique. Words are split into smaller meaningful pieces, which works well for highly agglutinative languages like Turkish.
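One way to picture subword splitting is a greedy longest-match lookup against a vocabulary. The sketch below assumes a hand-written toy vocabulary and the "##" continuation prefix used by BERT-style tokenizers; `wordpiece_split` and `toy_vocab` are hypothetical names, and a real tokenizer learns its vocabulary from data.

```python
def wordpiece_split(word, vocab, unk="<unk>"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"sub", "##word", "token", "##ize", "##r", "##s"}
print(wordpiece_split("subword", toy_vocab))     # ['sub', '##word']
print(wordpiece_split("tokenizers", toy_vocab))  # ['token', '##ize', '##r', '##s']
print(wordpiece_split("xyz", toy_vocab))         # ['<unk>']
```

With only six vocabulary entries, the sketch can still represent words it has never seen as whole units, which is the main appeal of subword tokenization.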

Converting text to numbers is known as encoding. It is a two-step process: tokenization, followed by conversion of the tokens to input IDs.

In [1]:
!pip install datasets transformers[sentencepiece]


In [2]:
from transformers.utils import send_example_telemetry

send_example_telemetry("tokenizer_training_notebook", framework="none")

In [3]:
# You can import the BERT Tokenizer in the following way:
from transformers import BertTokenizer

bertTokenizer = BertTokenizer.from_pretrained("bert-base-cased")


In [4]:
# There is also a class called AutoTokenizer that picks the right tokenizer class based on the checkpoint name:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
# You can now use the above tokenizers to tokenize your text:
bertTokenizer("Tokinze this text please!")

{'input_ids': [101, 1706, 4314, 3171, 1142, 3087, 4268, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
sequence = "My chair is squeaking."
tokens = bertTokenizer.tokenize(sequence)
print(tokens)

# Since the tokenizer used here is a subword tokenizer, rare words are split into several pieces ("squeaking" becomes three).
# The next step is to convert these tokens to token IDs:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

['My', 'chair', 'is', 'sq', '##ue', '##aking', '.']
[1422, 2643, 1110, 4816, 4175, 13024, 119]


In [7]:
# Last step is to decode the ids that the tokenizer has allocated:

mySentence = tokenizer.decode(ids)
print(mySentence)

# Notice that decode() not only converts the IDs back to tokens, but also merges the subword pieces into a coherent sentence.

My chair is squeaking.


In [8]:
# Saving a tokenizer is similar to saving a model:

bertTokenizer.save_pretrained("/path")

('/path/tokenizer_config.json',
 '/path/special_tokens_map.json',
 '/path/vocab.txt',
 '/path/added_tokens.json')

## Loading the Dataset

In [9]:
# Using the HuggingFace datasets library is easy:
from datasets import load_dataset

dataset = load_dataset("wikitext", name = "wikitext-2-raw-v1", split = "train")

print(dataset) # 36718 texts


Dataset({
    features: ['text'],
    num_rows: 36718
})


In [10]:
dataset[1]
dataset[5:9]

{'text': [" It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 . \n",
  '',
  ' = = Gameplay = = \n',
  '']}

In [None]:
# We need to access our texts in batches. A list comprehension builds every batch up front:
batch_size = 1000
all_texts = [dataset[i:i+batch_size]["text"] for i in range(0, len(dataset), batch_size)]

print(all_texts) # This is now a list of lists, with each inner list containing up to 1000 texts.

In [12]:
# To avoid loading everything into memory all at once, define a generator function that yields one batch at a time:

def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i+batch_size]["text"]
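The difference between the list and the generator is easier to see on a small example. This sketch uses a plain Python list in place of the dataset, and `iter_batches` is a hypothetical stand-in for `batch_iterator` that takes its inputs as arguments.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, building only one batch at a time."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

texts = [f"text {i}" for i in range(7)]
batches = iter_batches(texts, batch_size=3)

# Nothing has been sliced yet; each call to next() produces one batch on demand.
print(next(batches))              # ['text 0', 'text 1', 'text 2']
print([len(b) for b in batches])  # [3, 1]
```

A generator can only be consumed once, so create a fresh one each time you need to iterate over the batches again.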

## Using an existing tokenizer

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [14]:
tokenizer.is_fast

True

Training a new tokenizer on our corpus:

In [15]:
newTokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size = 25000) # We can either pass the iterator function or the list we created earlier.

That is all there is to it: your tokenizer is now ready to be used with a language model.

In [16]:
newTokenizer(dataset[:5]["text"]) # Pass in sample texts to see how the tokenizer works on your input.

{'input_ids': [[], [301, 8639, 9504, 3050, 301, 315], [], [4720, 74, 4825, 889, 8639, 491, 529, 672, 6944, 475, 267, 9504, 374, 2809, 529, 10879, 231, 100, 162, 255, 113, 14391, 4046, 113, 4509, 95, 18351, 4509, 256, 4046, 99, 4046, 22234, 96, 19, 264, 6437, 272, 8639, 281, 261, 3518, 2035, 491, 373, 264, 5162, 3305, 290, 344, 8639, 9504, 3050, 2616, 1822, 264, 364, 259, 14059, 1559, 340, 2393, 1527, 737, 1961, 370, 805, 3604, 288, 7577, 14, 54, 782, 337, 261, 4840, 15585, 272, 19958, 284, 1404, 1696, 284, 1822, 264, 385, 364, 261, 1431, 737, 284, 261, 8639, 906, 272, 2531, 1858, 286, 261, 1112, 9658, 281, 14059, 288, 1626, 340, 645, 6556, 344, 520, 14434, 264, 261, 1485, 3436, 7515, 290, 261, 518, 737, 288, 4750, 261, 302, 22039, 302, 264, 259, 21720, 1743, 3836, 5654, 261, 4259, 281, 4742, 490, 724, 261, 3581, 1351, 283, 1114, 579, 952, 4010, 1985, 2563, 288, 453, 2128, 807, 935, 261, 7655, 3836, 302, 2038, 314, 271, 89, 22414, 302, 272, 315], [324, 737, 1022, 1984, 284, 1525, 264, 7

In [17]:
newTokenizer("Please convert this text to tokens!")

{'input_ids': [48, 297, 689, 15005, 589, 4036, 290, 290, 75, 642, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
# You can save the tokenizer locally too:
newTokenizer.save_pretrained("my_new_tokenizer")

('my_new_tokenizer/tokenizer_config.json',
 'my_new_tokenizer/special_tokens_map.json',
 'my_new_tokenizer/vocab.json',
 'my_new_tokenizer/merges.txt',
 'my_new_tokenizer/added_tokens.json',
 'my_new_tokenizer/tokenizer.json')

If you want instructions on how to upload your tokenizer to HuggingFace Hub so you can access it from anywhere, follow the official guide.

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb

## Building your tokenizer from scratch

To build your own tokenizer from scratch, you need to dive a little deeper into the tokenization pipeline, which consists of several steps:


*   Normalization: Cleans up the raw text, for example by lowercasing it and stripping accents.
*   Pre-tokenization: Splits the initial input string into chunks, for example on spaces.
*   Model: Handles all the subtoken discovery and generation. This is the trainable part, which depends on your input data.
*   Post-processing: Makes the tokens compatible with Transformers models. E.g., BERT needs the tokens wrapped with [CLS] and [SEP].
*   Decoder: Maps the tokens back to readable text.

To train the model, use one of the trainer classes from the tokenizers library.

Let us see how to create a WordPiece tokenizer like the one used for training BERT.





In [19]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token = "[UNK]")) # Creating a tokenizer instance which assigns unknown tokens [UNK]

This tokenizer is not ready for training yet, since you are yet to set up normalization parameters and pre-tokenization.

In [20]:
# Since BERT is such a popular model, it has its own normalizer, which can be used as follows:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase = True)

# This normalizer performs NFD (Normalization Form Canonical Decomposition) normalization, lowercasing, and accent stripping. If you want to avoid BertNormalizer, you can reproduce these steps with a sequence of normalizers:
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD(),
                                             normalizers.Lowercase(),
                                             normalizers.StripAccents()])

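*NFD (canonical decomposition) splits a precomposed character into a base character followed by combining marks. For example, "é" (U+00E9) becomes "e" followed by the combining acute accent (U+0301). This gives the same character one consistent representation however it was typed, and it is what lets the StripAccents step remove the accents afterwards.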
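You can observe what NFD does with Python's standard `unicodedata` module. This is a minimal sketch that imitates the NFD, lowercase, and accent-stripping steps outside the tokenizers library; `strip_accents` is a hypothetical helper written for this illustration, not the library's implementation.

```python
import unicodedata

def strip_accents(text):
    """Decompose with NFD, then drop the combining marks (Unicode category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

composed = "\u00e9"  # "é" stored as a single code point
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(ch)) for ch in decomposed])  # ['0x65', '0x301']

# Same order as the normalizer sequence: NFD, lowercase, strip accents.
print(strip_accents("Àccents").lower())     # accents
```

Accent stripping only works after decomposition: on the composed form there is no separate combining mark to remove.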

In [21]:
# There is also a BERT pre-tokenizer, which splits the text on whitespace and punctuation:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Just like we did for the normalizer, we can also chain multiple pre-tokenizers in a sequence to replicate the effect of BertPreTokenizer.

In [22]:
# To test how the pre-tokenizer splits a string, use the pre_tokenize_str() method:
tokenizer.pre_tokenizer.pre_tokenize_str("Please pre-tokenize my string.")

[('Please', (0, 6)),
 ('pre', (7, 10)),
 ('-', (10, 11)),
 ('tokenize', (11, 19)),
 ('my', (20, 22)),
 ('string', (23, 29)),
 ('.', (29, 30))]

The offsets of the pre-tokens let the tokenizer map each token back to its position in the input string, which matters for tasks like question answering or token classification.
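To see why offsets are useful, here is a rough standard-library approximation of whitespace-and-punctuation splitting that also records character spans. The regular expression is an assumption that only approximates the BERT pre-tokenizer (for instance, it treats underscores as word characters), and `pre_tokenize_with_offsets` is a hypothetical name.

```python
import re

def pre_tokenize_with_offsets(text):
    """Split into runs of word characters or single punctuation marks, keeping spans."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Please pre-tokenize my string."
pre_tokens = pre_tokenize_with_offsets(text)
print(pre_tokens)

# Each offset pair slices the original string back to exactly that pre-token.
for token, (start, end) in pre_tokens:
    assert text[start:end] == token
```

Because every span indexes into the untouched input, a prediction made on a token can always be mapped back to the characters it came from.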

Now we need to train the model before we can move on to the post-processor in the pipeline. For this task, we use a WordPieceTrainer. The key thing to remember is to pass in the special tokens, since they won't be seen anywhere in the corpus.

In [23]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", '[SEP]', '[MASK]']
trainer = trainers.WordPieceTrainer(vocab_size = 25000, special_tokens = special_tokens) # WordPiece is a subword tokenization algorithm.

We can either train from a text file, or use an iterator like we used before.

In [24]:
tokenizer.train_from_iterator(batch_iterator(), trainer = trainer)

For post-processing, we need to add the [CLS] token at the beginning of every sentence and the [SEP] token at the end, or, for a pair of sentences, a [SEP] token after each of the two. Now that our tokenizer is trained, we can grab the IDs of the [CLS] and [SEP] tokens, which will be needed when we use TemplateProcessing.

In [25]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


Now we build the post-processor. Since we insert the [CLS] token at the start of every sequence and the [SEP] token at the end of every sequence, we use the following template.

In the template, $A indicates the first sequence and $B indicates the second sequence. The number after each colon is the token type ID assigned to that part.

In [26]:
tokenizer.post_processor = processors.TemplateProcessing(single = "[CLS]:0 $A:0 [SEP]:0",
                                                         pair = "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
                                                         special_tokens = [("[CLS]", cls_token_id),
                                                                           ("[SEP]", sep_token_id)])

We can also check if the tokenizer is working as intended:

In [27]:
my_encodings = tokenizer.encode("This is one sentence.", "Sentence 2 along with àccents.")
my_encodings.tokens

['[CLS]',
 'this',
 'is',
 'one',
 'sentence',
 '.',
 '[SEP]',
 'sentence',
 '2',
 'along',
 'with',
 'accents',
 '.',
 '[SEP]']

Notice how your text has been normalized (lowercased and stripped of accents).

We can also check that the token type IDs are correct for the pair of sentences.

In [28]:
my_encodings.type_ids

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
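The template logic can be mirrored in a few lines of plain Python. This is a sketch of what the BERT-style template produces, assuming the tokens have already been split; `apply_bert_template` is a hypothetical helper, not the tokenizers implementation.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build the matching type IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)           # [CLS], sequence A and its [SEP] get type 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # sequence B and its [SEP] get type 1
    return tokens, type_ids

first = ["this", "is", "one", "sentence", "."]
second = ["sentence", "2", "along", "with", "accents", "."]
tokens, type_ids = apply_bert_template(first, second)
print(tokens)
print(type_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

The type IDs line up with the output above: seven zeros for `[CLS]`, the first sentence and its `[SEP]`, then seven ones for the second sentence and the final `[SEP]`.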

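Now we need the decoder. WordPiece marks tokens that continue a word with the special prefix "##", so we pass that prefix to the decoder, which uses it to merge the pieces back into words.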

In [29]:
tokenizer.decoder = decoders.WordPiece(prefix = "##")
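As a rough picture of what the decoder does with the prefix, the sketch below glues "##" pieces onto the preceding token. `decode_wordpiece` is a hypothetical helper; unlike the library decoder it does no cleanup of the spaces around punctuation.

```python
def decode_wordpiece(tokens, prefix="##"):
    """Merge continuation pieces into the previous token, then join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # continuation: append without a space
        else:
            words.append(token)               # start of a new word
    return " ".join(words)

print(decode_wordpiece(["My", "chair", "is", "sq", "##ue", "##aking", "."]))
# My chair is squeaking .
```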

Now that we have successfully created our tokenizer, we need to wrap it in a Transformers object to be able to use it with the Transformers library. More specifically, we need to put it inside the "TokenizerFast" class that corresponds to our model. Here that is the **BertTokenizerFast** class. If your tokenizer does not match any of the model-specific fast classes, use the generic **PreTrainedTokenizerFast** class instead.

In [30]:
from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object = tokenizer)

### DONE! Your tokenizer is ready to be used to train a language model!

You can also create other kinds of tokenizers (for example GPT-2 or ALBERT style) in the same way you did for BERT.

# Train a Language Model using your Tokenizer

### In case you want to upload your work to your HuggingFace repo.

In [31]:
from huggingface_hub import notebook_login

notebook_login()


In [32]:
!apt install git-lfs



In [33]:
import transformers

print(transformers.__version__)

4.44.2


### Training the model

This section covers training Transformers for language modelling tasks. There are two types of language modelling tasks:
* Causal Language Modelling: The model has to predict the next token in the sentence. To make sure the model does not cheat by looking at the tokens ahead, a causal attention mask hides them from each position.
* Masked Language Modelling: Some tokens in the input are masked, and these need to be predicted. Since it has access to the entire sentence, the model can use the tokens both before and after a masked token to make this prediction.
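The two objectives can be sketched on a toy token list. This is a conceptual illustration only: the mask is a plain nested list, the masked position is chosen by hand, and none of the names below come from the Transformers library.

```python
def causal_mask(seq_len):
    """Row i marks which positions token i may attend to (1 = visible, 0 = hidden)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

tokens = ["My", "chair", "is", "squeaking"]

# Each token sees itself and everything before it, never the tokens ahead.
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>10}: {row}")

# Causal LM: each position is trained to predict the token that follows it.
clm_inputs, clm_targets = tokens[:-1], tokens[1:]
print(list(zip(clm_inputs, clm_targets)))

# Masked LM: hide one token and predict it from the context on both sides.
masked_position = 2
mlm_inputs = ["[MASK]" if i == masked_position else t for i, t in enumerate(tokens)]
print(mlm_inputs, "->", tokens[masked_position])
```

The lower-triangular shape of the mask is what prevents a causal model from peeking at the answer it is being trained to predict.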

In [34]:
# Using the same dataset we used to train our tokenizer - Wikitext2
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

In [35]:
# In case you want to use your own dataset/files for training, do so this way:
# datasets = load_dataset("text", data_files = {"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load a dataset from a CSV or JSON file; see this [documentation](https://huggingface.co/docs/datasets/en/loading) for more information.

In [None]:
# Check the content of the dataset:
datasets["train"][10]

In [43]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples = 10):
    assert num_examples <= len(dataset)
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

## Causal Language Modelling (CLM)

For CLM, we are going to concatenate all the texts in the dataset after they are tokenized, then split the result into chunks of a fixed length. This gives a contiguous stream in which one original text may fill less than one chunk, exactly one chunk, or span several of the chunks used for training.

To perform this task, we will use the GPT-2 architecture and a GPT-2-style tokenizer.

In [45]:
model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

In [46]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)


Now we need to run the tokenizer on our dataset. To do so, we will first define a function:

In [47]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Now we apply this function to all the splits in our dataset, using the "map" method of the datasets library.

In [48]:
tokenized_datasets = datasets.map(tokenize_function, batched = True, num_proc = 4, remove_columns = ["text"])


The tokenized datasets now contain the input IDs that the model requires, rather than strings.

In [50]:
tokenized_datasets["train"][1]

{'input_ids': [238, 8576, 9441, 2987, 238, 252],
 'attention_mask': [1, 1, 1, 1, 1, 1]}

Now that the dataset has been tokenized, we need to concatenate it and split it into chunks of ```block_size``` tokens. We can use the map method again to perform this step. The tokenizer's maximum sequence length (```model_max_length```) is much larger than we need, so we use a smaller chunk size of 128.


In [52]:
print(f"Pretrained chunk size of the tokenizer: {tokenizer.model_max_length}")
block_size = 128
print(f"Our chunk size: {block_size}")

Pretrained chunk size of the tokenizer: 1024
Our chunk size: 128


In [55]:
# Function to concatenate our datasets:
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size # Drop the remainder so that total_length is an exact multiple of block_size
    # Split by chunks of max_length
    result = {
        k: [t[i: i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

By default, the map function sends 1000 examples at a time to the preprocessing function. Here we pass batch_size = 128 instead, so the texts are concatenated 128 examples at a time and the remainder that does not fill a whole block is dropped once per batch.
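To see the concatenate-and-chunk step on numbers small enough to check by hand, here is the same logic run on a toy batch. `group_texts_demo` is a hypothetical copy of the function that takes `block_size` as an argument, and the token IDs are made up.

```python
def group_texts_demo(examples, block_size):
    """Concatenate every column, drop the tail, and cut into blocks of block_size."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size  # drop the remainder
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# One batch of four tokenized texts (one of them empty), 10 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [], [1, 1, 1, 1, 1], [1, 1]],
}
out = group_texts_demo(batch, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(out["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Four texts went in and two blocks came out: block boundaries ignore text boundaries, and the last two tokens (9 and 10) are discarded because they do not fill a block.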

In [56]:
lm_datasets = tokenized_datasets.map(group_texts,
                                     batched = True,
                                     batch_size = 128,
                                     num_proc = 4)


If you check the data now, you will see that every example has the same chunk size, and that a chunk may span several of the original texts.

In [57]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Ozawa. A large'

Now we can instantiate our ```Trainer```. First we create the model using the same config as our checkpoint, but initialized with random weights.

In [59]:
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)

In [62]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(f"{model_checkpoint}-wikitext2",
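                                  evaluation_strategy = "epoch", # Evaluate at the end of every epoch (newer Transformers releases name this argument eval_strategy)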
                                  learning_rate = 2e-5,
                                  weight_decay = 0.01,
                                  push_to_hub = True)

In [63]:
trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = lm_datasets["train"],
                  eval_dataset = lm_datasets["validation"])

In [64]:
trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 6.5402        | 6.469583        |
| 2     | 6.1999        | 6.205528        |
| 3     | 6.0167        | 6.116378        |


TrainOutput(global_step=6705, training_loss=6.391947838017338, metrics={'train_runtime': 2427.9426, 'train_samples_per_second': 22.084, 'train_steps_per_second': 2.762, 'total_flos': 3502554365952000.0, 'train_loss': 6.391947838017338, 'epoch': 3.0})

We may evaluate the model once its training is completed.

In [65]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 453.22


We find that our perplexity is high because our dataset was small and we trained for only a few epochs.
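Perplexity is the exponential of the average cross-entropy loss, which is why the cell above applies `math.exp` to the evaluation loss. The sketch below works through the formula on three made-up token probabilities and then applies it to the epoch 3 validation loss from the training log; the probabilities are purely illustrative.

```python
import math

# Hypothetical probabilities a model assigned to the correct next token at three positions.
token_probs = [0.5, 0.25, 0.125]

# Cross-entropy loss is the mean negative log-likelihood of the correct tokens.
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(loss)
print(f"loss={loss:.4f}, perplexity={perplexity:.2f}")

# The same formula applied to the epoch 3 validation loss reported by the Trainer.
logged_eval_loss = 6.116378
print(f"Perplexity: {math.exp(logged_eval_loss):.2f}")
```

In the toy case the perplexity works out to 4, meaning the model is on average as uncertain as if it were choosing uniformly among 4 tokens. By the same reading, a perplexity in the hundreds means the model is still spreading its probability over hundreds of plausible next tokens.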

In [66]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/AsRumi/gpt2-wikitext2/commit/b12e36e6b4c92509526ae9d9241ace6aa5ccfcb7', commit_message='End of training', commit_description='', oid='b12e36e6b4c92509526ae9d9241ace6aa5ccfcb7', pr_url=None, pr_revision=None, pr_num=None)

Once your model has been pushed to the Hub, you can load it back in just a couple of lines of code:

In [69]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("AsRumi/gpt2-wikitext2")
