If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so:

In [1]:
! pip install datasets transformers[sentencepiece] torch accelerate



# Pretraining a language model

1. Preparing a dataset
2. Training a tokenizer
3. Training a language model

Reference: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb#scrollTo=q4Qt6GM8nj73

https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb#scrollTo=HFASsisvIrIb

# 1. Preparing a dataset

We will need texts to train our tokenizer and language model. We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download our text data, which can be easily done with the `load_dataset` function:

In [2]:
from datasets import load_dataset

For this example, we will use Wikitext-2 (which contains 4.5MB of texts, so training goes fast) but you can use any dataset you want (and in any language, not just English).

In [3]:
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")



We can have a look at the dataset, which has 36,718 texts:

In [4]:
dataset

Dataset({
    features: ['text'],
    num_rows: 36718
})

To access an element, we just have to provide its index:

In [5]:
dataset[1]

{'text': ' = Valkyria Chronicles III = \n'}

We can also access a slice directly, in which case we get a dictionary with the key `"text"` and a list of texts as value:

In [6]:
dataset[:5]

{'text': ['',
  ' = Valkyria Chronicles III = \n',
  '',
  ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n',
  " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making th

The API to train our tokenizer requires an iterator over batches of texts, for instance a list of lists of texts:

In [7]:
batch_size = 1000
all_texts = [dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size)]

To avoid loading everything into memory (the Datasets library keeps the elements on disk and only loads them into memory when requested), we define a Python iterator. This is particularly useful if you have a huge dataset:

In [8]:
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
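To make the batching behaviour concrete, here is a minimal sketch of the same generator pattern on a plain Python list. The `toy_texts` corpus and the explicit `texts` and `batch_size` parameters are illustrative assumptions, not part of the notebook's own code:

```python
def batch_iterator(texts, batch_size):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]


# Stand-in corpus (hypothetical); the notebook uses Wikitext-2 instead.
toy_texts = [f"text {i}" for i in range(7)]

# Only one batch is materialized at a time while iterating.
batches = list(batch_iterator(toy_texts, batch_size=3))
print([len(b) for b in batches])
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size.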

Now let's see how we can use this corpus to train a new tokenizer! There are two APIs to do this: the first one uses an existing tokenizer and trains a new version of it on your corpus in one line of code; the second builds your tokenizer block by block, which lets you customize every step.

# 2. Training a tokenizer

## 2-1. Using an existing tokenizer

If you want to train a tokenizer with the exact same algorithms and parameters as an existing one, you can just use the `train_new_from_iterator` API. For instance, let's train a new version of a GPT-2-style tokenizer (here, the one shipped with the LiteLlama checkpoint) on Wikitext-2 using the same tokenization algorithm.

First we need to load the tokenizer we want to use as a model:

In [9]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write)

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ahxt/LiteLlama-460M-1T")


Make sure that the tokenizer you picked has a *fast* version (backed by the 🤗 Tokenizers library), otherwise the rest of the notebook will not run:

In [11]:
tokenizer.is_fast

True

**Fast Tokenizer vs. "Slow" Tokenizer**

* Slow tokenizers are those written in Python inside the 🤗 Transformers library.
* The fast versions are the ones provided by 🤗 Tokenizers, which are written in Rust.
  * The speed difference becomes clear when you tokenize many texts in parallel as a batch.
  * They always keep track of the span of the original text that each final token comes from (the "offset mapping").
  * This allows mapping each word to the tokens it generated, or each character of the original text to the token it is inside, and vice versa.

In [12]:
from transformers import AutoTokenizer

tknz_test = AutoTokenizer.from_pretrained("bert-base-cased")
print(tknz_test.is_fast) # The AutoTokenizer class picks a fast tokenizer by default.

example = "My name is Sana and I work at SNU in Seoul."
encoding = tknz_test(example)
print(type(encoding))

print(encoding.tokens())
print(encoding.word_ids())

True
<class 'transformers.tokenization_utils_base.BatchEncoding'>
['[CLS]', 'My', 'name', 'is', 'San', '##a', 'and', 'I', 'work', 'at', 'S', '##NU', 'in', 'Seoul', '.', '[SEP]']
[None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, None]
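As a small illustration of what the word-to-token mapping makes possible, the sketch below regroups the sub-word tokens into whole words using only the `tokens` and `word_ids` lists printed above. The helper name `merge_wordpieces` is hypothetical, and the `##` handling assumes BERT-style WordPiece continuation markers:

```python
from itertools import groupby

# Values copied from the output above.
tokens = ['[CLS]', 'My', 'name', 'is', 'San', '##a', 'and', 'I', 'work', 'at',
          'S', '##NU', 'in', 'Seoul', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, None]


def merge_wordpieces(tokens, word_ids):
    """Rebuild words by joining tokens that share the same word id."""
    # Special tokens such as [CLS] and [SEP] have a word id of None and are skipped.
    pairs = [(w, t) for w, t in zip(word_ids, tokens) if w is not None]
    words = []
    for _, group in groupby(pairs, key=lambda pair: pair[0]):
        pieces = [t for _, t in group]
        # Strip the "##" continuation prefix from every piece after the first.
        tail = "".join(p[2:] if p.startswith("##") else p for p in pieces[1:])
        words.append(pieces[0] + tail)
    return words


words = merge_wordpieces(tokens, word_ids)
print(words)
```

Here `San` and `##a` share word id 3, so they are joined back into a single word, and likewise for `S` and `##NU`.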






Then we feed the training corpus (either the list of lists or the iterator we defined earlier) to the `train_new_from_iterator` method. We also have to specify the vocabulary size we want to use:

In [13]:
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

In [14]:
len(new_tokenizer)

25000

And that's all there is to it! The training goes very fast thanks to the 🤗 Tokenizers library, backed by Rust.

You now have a new tokenizer ready to preprocess your data and train a language model. You can feed it input texts as usual:

In [15]:
new_tokenizer(dataset[:5]["text"])

{'input_ids': [[], [301, 8639, 9504, 3050, 301, 315], [], [4720, 74, 4825, 889, 8639, 491, 529, 672, 6944, 475, 267, 9504, 374, 2809, 529, 10879, 231, 100, 162, 255, 113, 14391, 4046, 113, 4509, 95, 18351, 4509, 256, 4046, 99, 4046, 22234, 96, 19, 264, 6437, 272, 8639, 281, 261, 3518, 2035, 491, 373, 264, 5162, 3305, 290, 344, 8639, 9504, 3050, 2616, 1822, 264, 364, 259, 14059, 1559, 340, 2393, 1527, 737, 1961, 370, 805, 3604, 288, 7577, 14, 54, 782, 337, 261, 4840, 15585, 272, 19958, 284, 1404, 1696, 284, 1822, 264, 385, 364, 261, 1431, 737, 284, 261, 8639, 906, 272, 2531, 1858, 286, 261, 1112, 9658, 281, 14059, 288, 1626, 340, 645, 6556, 344, 520, 14434, 264, 261, 1485, 3436, 7515, 290, 261, 518, 737, 288, 4750, 261, 221, 0, 22039, 221, 0, 264, 259, 21720, 1743, 3836, 5654, 261, 4259, 281, 4742, 490, 724, 261, 3581, 1351, 283, 1114, 579, 952, 4010, 1985, 2563, 288, 453, 2128, 807, 935, 261, 7655, 3836, 221, 0, 2038, 314, 271, 89, 22414, 221, 0, 272, 315], [324, 737, 1022, 1984, 284, 

You can save it locally with the `save_pretrained` method:

In [16]:
new_tokenizer.save_pretrained("my-new-tokenizer")

('my-new-tokenizer/tokenizer_config.json',
 'my-new-tokenizer/special_tokens_map.json',
 'my-new-tokenizer/vocab.json',
 'my-new-tokenizer/merges.txt',
 'my-new-tokenizer/added_tokens.json',
 'my-new-tokenizer/tokenizer.json')

The tokenizer can now be reloaded on this machine with:

In [17]:
tok = new_tokenizer.from_pretrained("my-new-tokenizer")

## 2-2. Building your tokenizer from scratch

To understand how to build your tokenizer from scratch, we have to dive a little deeper into the 🤗 Tokenizers library and the tokenization pipeline. This pipeline takes several steps:

- **Normalization**: Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, strip it, or apply one of the common Unicode normalization processes, you add a Normalizer.
- **Pre-tokenization**: In charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example would be to split on spaces.
- **Model**: Handles all the sub-token discovery and generation. This is the part that is trainable and really dependent on your input data.
- **Post-Processing**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence in [CLS] and [SEP] tokens.

And to go in the other direction:

- **Decoding**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the `PreTokenizer` we used previously.

For the training of the model, the 🤗 Tokenizers library provides a `Trainer` class that we will use.

All of these building blocks can be combined to create working tokenization pipelines.
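The sketch below walks one string through toy versions of these stages in plain Python. Every function name and the five-entry `TOY_VOCAB` are illustrative assumptions; in particular, the "model" stage here is a whole-word lookup standing in for a trained sub-word model, and nothing in it uses the 🤗 Tokenizers API:

```python
import unicodedata

# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def normalize(text):
    # Normalization: Unicode NFKC, lowercasing, stripping outer whitespace.
    return unicodedata.normalize("NFKC", text).lower().strip()


def pre_tokenize(text):
    # Pre-tokenization: the simplest rule, split on whitespace.
    return text.split()


def model(words):
    # Model: map each pre-token to an id (unknown words fall back to [UNK]).
    return [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]


def post_process(ids):
    # Post-processing: wrap the sequence in [CLS] ... [SEP], as BERT expects.
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


def decode(ids):
    # Decoding: drop special tokens and join the remaining tokens with spaces.
    special = {TOY_VOCAB["[CLS]"], TOY_VOCAB["[SEP]"]}
    return " ".join(ID_TO_TOKEN[i] for i in ids if i not in special)


ids = post_process(model(pre_tokenize(normalize("  Hello   WORLD "))))
print(ids)
print(decode(ids))
```

Note that decoding recovers the normalized text, not the original string: information removed by the normalizer (case, extra spaces) is lost.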

### BPE model like litellama

Let's now have a look at how we can create a BPE tokenizer like the one used for training GPT-2. The first step is to create a `Tokenizer` with an empty `BPE` model:

In [18]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.BPE())

We can add an optional normalizer (GPT-2 does not use one), and we need to specify a pre-tokenizer before training. In the case of GPT-2, the pre-tokenizer used is a byte-level pre-tokenizer:

In [19]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

If we want to have a quick look at how it preprocesses the inputs, we can call the `pre_tokenize_str` method:

In [20]:
tokenizer.pre_tokenizer.pre_tokenize_str("This is an example!")

[('This', (0, 4)),
 ('Ġis', (4, 7)),
 ('Ġan', (7, 10)),
 ('Ġexample', (10, 18)),
 ('!', (18, 19))]

We used the same default as for GPT-2 for the prefix space, so you can see that each word gets an initial `'Ġ'` added at the beginning, except the first one.
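To build intuition for that output, here is a rough pure-Python imitation of the splitting step. It is only a sketch under simplifying assumptions: the regular expression is a reduced stand-in for the pattern GPT-2 actually uses, only the space character is remapped (the real byte-level pre-tokenizer remaps every byte to a printable character), and the function name is hypothetical:

```python
import re


def toy_byte_level_pre_tokenize(text):
    """Split text into words that keep their leading space, shown as 'Ġ'."""
    pieces = []
    # A word or a run of punctuation, each optionally preceded by one space;
    # leftover whitespace becomes its own piece.
    for match in re.finditer(r" ?\w+| ?[^\w\s]+|\s+", text):
        pieces.append((match.group().replace(" ", "Ġ"), match.span()))
    return pieces


result = toy_byte_level_pre_tokenize("This is an example!")
for piece in result:
    print(piece)
```

Because the space travels with the word that follows it, the offsets of consecutive pieces are contiguous and the original string can be rebuilt without guessing where the spaces were.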

We can now train our tokenizer! This time we use a `BpeTrainer`.

In [21]:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)
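For readers curious about what a BPE trainer does conceptually, the following is a toy, character-level version of the merge-learning loop. It is a sketch, not the `BpeTrainer` implementation: the real trainer works on byte-level pre-tokens, handles special tokens and a target vocabulary size, and may break ties differently. The function names and the small corpus are assumptions for illustration:

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(words, num_merges):
    """Learn `num_merges` merges by repeatedly merging the most frequent pair."""
    word_freqs = {tuple(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # Ties are broken by comparing the pairs themselves (an arbitrary but deterministic choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges, word_freqs


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = train_toy_bpe(corpus, num_merges=3)
print(merges)
print(segmented)
```

Each learned merge adds one symbol to the vocabulary, which is why a larger `vocab_size` means more merges and longer tokens.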

To finish the whole pipeline, we have to include the post-processor and decoder:

In [22]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()

And we finish by wrapping this in a Transformers tokenizer object:

In [23]:
from transformers import LlamaTokenizerFast

new_tokenizer = LlamaTokenizerFast(tokenizer_object=tokenizer)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


We can also save and reload the trained tokenizer:

In [24]:
new_tokenizer.save_pretrained('./my_new_tokenizer')

('./my_new_tokenizer/tokenizer_config.json',
 './my_new_tokenizer/special_tokens_map.json',
 './my_new_tokenizer/tokenizer.json')

In [25]:
load_tokenizer = AutoTokenizer.from_pretrained('./my_new_tokenizer')

# 3. Training a language model

In this section, we'll see how to train a [🤗 Transformers](https://github.com/huggingface/transformers) model on a language modeling task. There are two common variants, of which we will use the first, causal language modeling:

- Causal language modeling: the model has to predict the next token in the sentence (so the labels are the inputs shifted by one position: the label for position i is the token at position i+1). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.

We will see how to easily load and preprocess the dataset for each one of those tasks, and how to use the `Trainer` API to train a model on it.
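The causal setup described above can be sketched in a few lines of plain Python. The helper `causal_mask` and the four-word example are illustrative assumptions; real models build the mask as a tensor and apply it inside the attention layers:

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

# Next-token prediction: the target for position i is the token at position i + 1.
targets = tokens[1:]

for i, target in enumerate(targets):
    visible = [tok for tok, allowed in zip(tokens, mask[i]) if allowed]
    print(visible, "->", target)
```

Each row of the mask exposes only the current token and the ones before it, so the target is never visible when it is being predicted.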

## 3-1. Preparing the dataset

For each of those tasks, we will use the [Wikitext 2]() dataset as an example. You can load it very easily with the 🤗 Datasets library.

In [26]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [None]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

In [27]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [28]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [29]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,
1,"On a whim , Zoe Barnes ( Kate Mara ) , a young reporter for the Washington Herald who is stuck covering trivial “ human interest ” stories , pays a late @-@ night visit to Frank at his home . She offers to be Frank ’ s undercover mouthpiece in the press in exchange for the elevated profile that she would gain from breaking substantive stories . Meanwhile , Peter Russo ( Corey Stoll ) , a young , inexperienced congressman from Philadelphia , is arrested for drunk driving . Stamper finds out about the arrest and immediately contacts the D.C. police commissioner , offering Underwood ’ s support for his mayoral campaign in exchange for releasing Russo and completely covering up the incident . Russo is picked up from jail by his secretary and romantic partner , Christina Gallagher ( Kristen Connolly ) . He lies to her , telling her that he was alone when he was arrested when , in fact , there was a prostitute in the car ( Rachel Brosnahan ) . \n"
2,"At 07 : 30 on 15 December , two companies of the Carleton and York Regiment attacked . After little more than an hour of fighting , however , the Canadians were forced to call the attack off . In the afternoon , the two heavily depleted companies of the Royal 22e Régiment fought off a large German counterattack on Casa Berardi , with the Royal Canadian Horse Artillery firing 5 @,@ 398 rounds in support of Canadian forces . \n"
3,
4,= = Recording and commercial reception = = \n
5,
6,
7,= = Certifications = = \n
8,Monster Rancher ( 1999 – 2001 ) \n
9,"Structurally , I love this French movie by Max Ophüls called Le Plaisir . It 's three or four Guy de Maupassant stories that are told by a narrator , and then characters start to appear behind each other , their stories overlap and they are just walking through , and you realize it 's a complete world . What I loved about that was just telling the story from that one person ’ s point of view . In Peggy 's story , she 's in every scene , nothing happens without her there . And it 's the same thing with Don and the same thing with Roger . So you 're really getting this very private perspective , and then thematically holding it together by saying , "" Here , this is about the status of the relationship . "" We weren 't sure that it was going to work . The hardest part was breaking it up for commercials so that the Peggy and the Roger stories would be in the same segment and you wouldn 't come back and think you were in the middle of another episode . \n"


As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## 3-2. Causal Language Modeling: Continual Pre-training

* Starting from a pretrained checkpoint lets us reuse the general linguistic knowledge it already encodes.


For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left.

We will use the LiteLlama architecture for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead. For the tokenizer, you can replace the checkpoint with the one you trained yourself.

In [48]:
model_checkpoint = "ahxt/LiteLlama-460M-1T"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [50]:
from transformers import AutoTokenizer

tokenizer_litellama = AutoTokenizer.from_pretrained("ahxt/LiteLlama-460M-1T")

In [52]:
len(tokenizer_litellama)

50257

If you're using a pre-trained model checkpoint and your new dataset needs more unseen tokens to be included in the model vocabulary:

In [54]:
new_tokens = ["new_token1", "new_token2", "new_token3", "new_token4", "new_token5"] # your custom tokens
# new_tokens = new_tokenizer.vocab.keys() # vocab learned (only) from the wiki data above

# check if the tokens are already in the vocabulary
new_tokens = set(new_tokens) - set(tokenizer_litellama.vocab.keys())
print(len(new_tokens))

5


In [55]:
# add the tokens to the tokenizer vocabulary
tokenizer_litellama.add_tokens(list(new_tokens))
print(len(tokenizer_litellama))

50262


We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [56]:
def tokenize_function(examples):
    return tokenizer_litellama(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [57]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])


If we now look at an element of our datasets, we will see the texts have been replaced by the `input_ids` the model will need:

In [58]:
tokenized_datasets["train"][1]

{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now for the harder part: we need to concatenate all our texts together, then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

We could use the maximum length our model was pretrained with as the block size, but this might be a bit too big to fit in your GPU RAM, so here we take much less, just 128.

In [59]:
# block_size = tokenizer_litellama.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [60]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
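To see the effect on a scale that fits on screen, here is a self-contained toy variant of the same grouping logic. The function name `group_texts_toy`, the explicit `block_size` argument, and the tiny batch are assumptions made for illustration:

```python
def group_texts_toy(examples, block_size):
    """Concatenate the sequences of each column, then cut them into fixed-size blocks."""
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "texts" of 3, 5 and 2 tokens: 10 tokens in total.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])
```

Three input rows become two output rows, blocks can straddle text boundaries, and the last two tokens are discarded because they do not fill a block.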

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library apply the one-position shift between inputs and labels internally, so we don't need to do it manually.

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [61]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)


In [62]:
lm_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2227
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18888
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1950
    })
})

In [63]:
print(lm_datasets['train'][1])

{'input_ids': [983, 290, 5679, 262, 220, 1, 17871, 5321, 220, 1, 837, 257, 23634, 2422, 4326, 7351, 262, 3277, 286, 7096, 544, 1141, 262, 5498, 1898, 6839, 1810, 508, 1620, 3200, 2042, 4560, 290, 389, 46852, 1028, 262, 11773, 4326, 220, 1, 2199, 321, 265, 88, 12552, 220, 1, 764, 220, 198, 383, 983, 2540, 2478, 287, 3050, 837, 6872, 625, 257, 1588, 6903, 286, 262, 670, 1760, 319, 569, 18354, 7496, 17740, 2873, 764, 2893, 340, 17383, 262, 3210, 3033, 286, 262, 2168, 837, 340, 635, 25289, 3294, 16895, 837, 884, 355, 1642, 262, 983, 517, 43486, 329, 2168, 29661, 764, 15684, 11915, 371, 4548, 64, 8835, 73, 280, 290, 26777, 7286, 13704, 13231, 43354, 1111, 4504, 422, 2180, 12784, 837, 1863, 351, 569, 18354, 7496, 17740, 2873], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [64]:
# Decode with the tokenizer that produced these ids; `tokenizer` was reassigned in section 2-2 and has a different vocabulary
tokenizer_litellama.decode(lm_datasets["train"][1]["input_ids"])

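Now that the data has been prepared, we're ready to instantiate our `Trainer`. First we create the model. Note that `AutoModelForCausalLM.from_config` below builds the architecture from the checkpoint's config with randomly initialized weights, which amounts to training from scratch; to continue pre-training from the pretrained weights, load the model with `AutoModelForCausalLM.from_pretrained(model_checkpoint)` instead: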

In [65]:
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(config)

In [66]:
# check model vocab size
model.get_input_embeddings().weight.shape[0]

50304

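If you extended the vocabulary, resize the model's embedding matrix to match the tokenizer. (In this run the embedding matrix has 50,304 rows, more than the 50,262 tokens in the tokenizer, so the call below actually shrinks it.)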

In [67]:
# add new, random embeddings for the new tokens
model.resize_token_embeddings(len(tokenizer_litellama))

Embedding(50262, 1024, padding_idx=0)

In [68]:
# check model vocab size
model.get_input_embeddings().weight.shape[0]

50262
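Conceptually, resizing an embedding matrix keeps the rows of the tokens that already exist and either truncates the matrix or appends freshly initialized rows. The sketch below shows this idea on nested Python lists; the function name, the Gaussian initialization, and the `init_std=0.02` value are assumptions for illustration, and the initialization actually used by 🤗 Transformers depends on the model:

```python
import random


def resize_embeddings(weights, new_num_tokens, init_std=0.02, seed=0):
    """Return a copy of `weights` with exactly `new_num_tokens` rows."""
    rng = random.Random(seed)
    dim = len(weights[0])
    # Keep (a copy of) the existing rows, truncating if the new size is smaller.
    kept = [row[:] for row in weights[:new_num_tokens]]
    # Append randomly initialized rows if the new size is larger.
    num_extra = new_num_tokens - len(kept)
    extra = [[rng.gauss(0.0, init_std) for _ in range(dim)] for _ in range(num_extra)]
    return kept + extra


old = [[float(i)] * 3 for i in range(5)]  # 5 tokens, embedding dimension 3
grown = resize_embeddings(old, 7)
shrunk = resize_embeddings(old, 3)
print(len(grown), len(shrunk))
```

The appended rows carry no learned information, so tokens added to the vocabulary only become useful after further training.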

In [69]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [70]:
model.to(device)
print(model.device)

cuda:0


And we will need some `TrainingArguments`:

In [71]:
from transformers import Trainer, TrainingArguments

In [72]:
training_args = TrainingArguments(
    f"litellama-wikitext2-continual",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=1
)



We pass along all of those to the `Trainer` class:

In [73]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

And we can train our model:

In [74]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,6.2578,6.271849


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=2361, training_loss=6.592139147540767, metrics={'train_runtime': 2720.4776, 'train_samples_per_second': 6.943, 'train_steps_per_second': 0.868, 'total_flos': 5949360338632704.0, 'train_loss': 6.592139147540767, 'epoch': 1.0})
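The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a per-device batch size of 8 (the `TrainingArguments` default, since none was set above), a single device, and no gradient accumulation; the other values are copied from the outputs shown earlier:

```python
import math

num_train_examples = 18888   # rows in lm_datasets["train"]
per_device_batch_size = 8    # assumption: default value, one device, no gradient accumulation
num_epochs = 1
train_runtime = 2720.4776    # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(total_steps)
print(round(total_steps / train_runtime, 3))
print(round(num_train_examples * num_epochs / train_runtime, 3))
```

Under these assumptions the step count and both throughput figures agree with the reported `global_step`, `train_steps_per_second`, and `train_samples_per_second`.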

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [75]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 529.46
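Perplexity is the exponential of the average negative log-likelihood per token, which is what `math.exp(eval_loss)` computes. A minimal sketch on made-up probabilities (the `perplexity` helper and its inputs are illustrative, not taken from the model above):

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always gives the correct token probability 1/4
# is as uncertain as a uniform choice among 4 tokens.
uniform = perplexity([0.25] * 10)
# A model that is always certain and correct has perplexity 1.
perfect = perplexity([1.0] * 5)
print(uniform, perfect)
```

Read this way, a perplexity in the hundreds means the model is on average about as uncertain as a uniform choice among several hundred tokens.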


In [76]:
prompt = 'Today is the'

# Move the inputs to the same device as the model
model_inputs = tokenizer_litellama([prompt], return_tensors='pt').to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=128, do_sample=True)
# Decode with the same tokenizer that encoded the prompt
tokenizer_litellama.batch_decode(generated_ids)[0]
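The `do_sample=True` flag in the `generate` call above makes the model draw each next token at random from its predicted distribution instead of always taking the most likely one. A minimal sketch of that single step in plain Python, where the function name, the three-entry `logits` list, and the `temperature` parameter are illustrative assumptions:

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=None):
    """Draw one token id from softmax(logits / temperature)."""
    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token_id = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token_id, probs


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
draws = [sample_next_token(logits, rng=rng)[0] for _ in range(1000)]
greedy = max(range(len(logits)), key=logits.__getitem__)

print("greedy choice:", greedy)
print("sampled counts:", [draws.count(i) for i in range(len(logits))])
```

Greedy decoding would return token 0 every time, while sampling also picks the less likely tokens in proportion to their probability, which is why repeated runs of the generation cell give different text.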