## Fine-tune a Language Model
___

#### Types of Language Model

##### Causal Language Model
1. The model has to predict the next token in the sentence. 
2. To make sure that the model does not cheat, it gets an attention mask that prevents it from accessing any token after position $i$ when it predicts the token at position $i+1$.
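
The masking rule above can be sketched as a lower-triangular matrix. This is a minimal illustration in plain Python, not the code Transformers uses internally; `causal_mask` is a hypothetical helper, and positions are 0-based here.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len mask: mask[i][j] is 1 if position i may attend to position j."""
    # Position i sees itself and everything before it, never anything after it
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row `i` lists the positions that are visible when the model predicts the token that follows position `i`.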

##### Masked Language Model
1. The model has to predict some tokens that are masked in the input.
2. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their values.
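
A rough sketch of how masked inputs and their targets could be built is shown below. The helper `mask_tokens`, the `[MASK]` string, and the masking probability are illustrative assumptions; real data collators work on token ids and apply additional rules.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token and record the targets."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model has to recover this token
        else:
            masked.append(tok)
            labels.append(None)  # unmasked positions carry no target
    return masked, labels


tokens = "the game 's battle system is carried over directly".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Because unmasked tokens stay in place on both sides of each mask, the model can use context from the left and from the right.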

### Library
___

In [1]:
import random
import pandas as pd
from IPython.display import display, HTML

import transformers
print(transformers.__version__)
from transformers import AutoTokenizer

from datasets import load_dataset
from datasets import ClassLabel

4.31.0


### Preparing the Dataset
___
1. We will use the `WikiText-2` dataset as an example.

In [2]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Found cached dataset wikitext (/home/kccheng1988/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)



In [3]:
# View a sample of the dataset
datasets['train'][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

In [4]:
# Show a few randomly picked samples from the dataset
def show_random_elements(dataset, num_examples = 10):
    assert num_examples <= len(dataset)

    # Draw num_examples distinct row indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i : typ.names[i])
    display(HTML(df.to_html())) # Renders a dataframe as an HTML table.


In [5]:
show_random_elements(datasets['train'])

Unnamed: 0,text
0,= = Common names = = \n
1,"Peshkin 's findings show a "" total world "" where the lessons of religion and education are intertwined into an "" interrelated , interdependent "" philosophy . The academy 's intent is to make Christian professionals as what Peshkin describes as "" a vocational school directed to work in the Lord 's service "" . When compared to the work of public schools , the private school 's instructors said both kinds of institutions impose a lifestyle and set of values as a kind of "" brainwashing "" . Peshkin notes that while students "" largely identify with "" and uphold the fundamentalist teachings , they permit themselves the option of having "" individual interpretations "" and minor beliefs . Some students either dissent against the academy 's rules or are regarded as too pious , but most students are moderate . \n"
2,
3,2O + XeO \n
4,= = Fate of the DuMont stations = = \n
5,
6,"The evacuation by train from Romani was carried out in a manner which caused much suffering and shock to the wounded . It was not effected till the night of August 6 – the transport of prisoners of war being given precedence over that of the wounded – and only open trucks without straw were available . The military exigencies necessitated shunting and much delay , so that five hours were occupied on the journey of twenty @-@ five miles . It seemed a cruel shame to shunt a train full of wounded in open trucks , but it had to be done . Every bump in our springless train was extremely painful . \n"
7,
8,"The fieldfare is omnivorous . Animal food in the diet includes snails and slugs , earthworms , spiders and insects such as beetles and their larvae , flies and grasshoppers . When berries ripen in the autumn these are taken in great number . Hawthorn , holly , rowan , yew , juniper , dog rose , Cotoneaster , Pyracantha and Berberis are all relished . Later in the winter windfall apples are eaten , swedes attacked in the field and grain and seeds eaten . When these are exhausted , or in particularly harsh weather , the birds may move to marshes or even the foreshore where molluscs are to be found . \n"
9,= = = Modern era = = = \n


##### Notes:
1. Some of the texts are full paragraphs of a Wikipedia article, while others are just titles or empty lines.

### Causal Language Modeling
___
* We are going to take all the texts in our dataset and concatenate them after they are tokenized.
* Then, we split them into examples of a fixed sequence length.
* This way, the model will receive chunks of contiguous text that may look like:

`part of text 1` <br>
`end of text 1 [BOS_TOKEN] beginning of text 2`

* We will use the `distilgpt2` model for this example.

In [6]:
model_checkpoint = 'distilgpt2'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast = True)

In [7]:
# We can now call the tokenizer on our texts
def tokenize_function(examples):
    return tokenizer(examples['text'])

tokenized_datasets = datasets.map(tokenize_function, batched = True, num_proc = 4, remove_columns = ['text'])

Loading cached processed dataset at /home/kccheng1988/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-7a451edd1b9d6a54_*_of_00004.arrow
Loading cached processed dataset at /home/kccheng1988/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-47d068062a7caf0a_*_of_00004.arrow
Loading cached processed dataset at /home/kccheng1988/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-8f73ed8d3cd21d84_*_of_00004.arrow


In [8]:
# View a sample of the tokenized datasets
tokenized_datasets['train'][1]

{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

##### Notes:
* Now we need to concatenate all our texts and then split the result into small chunks of a fixed `block_size`.
* We will use `map` again, with the option `batched = True`.
* With this option, the mapped function may return a different number of examples than it received, which changes the number of examples in the dataset.
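
To make the last point concrete, here is a small stand-in written in plain Python. It only mimics the shape of a batched function (a dict of lists in, a dict of lists out) and does not call the `datasets` library; `chunk_batch` and the toy ids are assumptions made for illustration.

```python
def chunk_batch(batch, block_size):
    """Concatenate every list in the batch, then cut the result into block_size pieces."""
    out = {}
    for key, rows in batch.items():
        flat = [tok for row in rows for tok in row]        # join all rows of this column
        usable = (len(flat) // block_size) * block_size    # drop the incomplete tail
        out[key] = [flat[i:i + block_size] for i in range(0, usable, block_size)]
    return out


batch = {"input_ids": [[1, 2, 3], [4], [5, 6, 7, 8, 9]]}
chunks = chunk_batch(batch, block_size=4)
print(len(batch["input_ids"]), "rows in")                             # 3 rows in
print(len(chunks["input_ids"]), "rows out:", chunks["input_ids"])     # 2 rows out: [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input rows become two output rows, and the leftover token `9` is discarded because it does not fill a whole block.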

In [13]:
block_size = tokenizer.model_max_length
print('Maximum input length of model: ', block_size)

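# The model's maximum input length may be too large to fit in GPU RAM, so we use a smaller block size.
# A value of 2 is only useful for inspecting the output; something like 128 is more practical for training.
block_size = 2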

Maximum input length of model:  1024


In [14]:
## -- Group texts -- ##
def group_texts(examples):
    # Using sum(ls, []) to concatenate a list of lists.
    concatenated_examples = {k : sum(ls, []) for k, ls in examples.items()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # We drop the small residual block at the end.
    # We could pad it instead if the model supports padding (customizable).
    total_length = (total_length // block_size) * block_size

    # Split into chunks of block_size
    result = {
        k : [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
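    # Labels are a copy of the inputs: causal LM heads in Transformers shift them by one position internally.
    result['labels'] = result['input_ids'].copy()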
    return result 
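
The labels are a plain copy of `input_ids` because, for causal LM heads in Transformers such as GPT-2's, the one-position shift between inputs and targets is applied inside the model when the loss is computed. The sketch below illustrates that pairing in plain Python. It is not the library's code, and `shifted_pairs` is a hypothetical helper.

```python
def shifted_pairs(input_ids, labels):
    """Pair each growing context with the token that follows it."""
    # The context ending at position t is scored against labels[t + 1],
    # so the last position has no target.
    return [(input_ids[: t + 1], labels[t + 1]) for t in range(len(input_ids) - 1)]


block = [796, 569, 18354, 7496]  # first four ids of the tokenized sample shown earlier
labels = block.copy()
for context, target in shifted_pairs(block, labels):
    print(context, "->", target)
# [796] -> 569
# [796, 569] -> 18354
# [796, 569, 18354] -> 7496
```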