# <font color='#2B4865'>**Hugging Face 🤗 Transformers Tutorial III**</font>

---
### Natural Language Processing
Date: Jan 11, 2023

Author: Lorena Calvo-Bartolomé (lcalvo@pa.uc3m.es)

Version 1.0

---
This notebook is based on the [Hugging Face course](https://huggingface.co/course/chapter1/1) and documentation available at the Hugging Face website.

It is the third and last tutorial notebook on the usage of the Hugging Face libraries and their application to a series of NLP tasks. In particular, we will cover how to carry out Text Generation.

---

<font color='#E0144C'>**To run this notebook, we strongly encourage you to use Google Colaboratory. A GPU is not required for the inference part, but it will greatly speed up execution. To enable one, follow these steps:**</font>

<font color='#E0144C'>**1. Connect to hosted runtime**</font>

<font color='#E0144C'>**2. Enable GPU setting by clicking Edit -> Notebook Settings -> Select GPU in Hardware Acceleration Tab -> Save**</font>

We will focus on the "Text Generation" task.

### PRACTICAL 4.4 - NATURAL LANGUAGE PROCESSING - MASTER IN APPLIED ARTIFICIAL INTELLIGENCE

### JOSÉ LORENTE LÓPEZ - DNI: 48842308Z

## <font color='#2B4865'>Installing necessary packages, imports and auxiliary functions</font>

In [1]:
# Install necessary packages
import importlib

necessary_packages = ['transformers[sentencepiece]', 'datasets', 'colored', 'evaluate', 'gradio', 'accelerate']
def import_missing(packages):
  missing = []
  for p in packages:
    try:
      # Strip extras such as "[sentencepiece]" to get the importable module name
      importlib.import_module(p.split("[")[0])
      print(f"Package {p} already installed!")
    except ModuleNotFoundError:
      print(f"Installing package {p}")
      missing.append(p)
  # Write and install only the packages that are actually missing
  if missing:
    with open("requirements.txt", 'w') as f:
      f.write("\n".join(missing))
    %pip install --quiet -r "requirements.txt"

import_missing(necessary_packages)

Installing package transformers[sentencepiece]
Installing package datasets
Installing package colored
Installing package evaluate
Installing package gradio
Installing package accelerate

In [2]:
# Common imports 
import os
import numpy as np
import pandas as pd
from termcolor import colored
import seaborn as sns
import matplotlib.pyplot as plt
import tqdm
import scipy
from colored import fore, back, style 
import torch
import json

# Figures plotted inside the notebook
%matplotlib inline 
# High quality figures
%config InlineBackend.figure_format = 'retina' 
# Figures style
plt.style.use('seaborn-whitegrid')
sns.set_style("darkgrid")
sns.color_palette("deep")
# Figure size
plt.rcParams['figure.figsize'] = [8, 6]

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore',module='gradio')

In [3]:
# To wrap long text lines
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# For fancy table Display
%load_ext google.colab.data_table

We are going to save all the files generated in this notebook to Drive. Set the variable ``path_to_folder`` in the cell after the next one to the Drive folder in which you want to save the files.

In [4]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
path_to_folder = '/content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII'  # UPDATE THIS ACCORDING TO WHERE YOU WANT TO SAVE THE FILES!!!!

# Change to assignment directory
os.chdir(path_to_folder) 

## <font color='#2B4865'>**4. Text Generation**
---
</font>

Text Generation is the task of producing new text. These models can, for instance, fill in incomplete text or paraphrase. 

Some use cases include:

* **Story Generation**: By receiving an input like "*Once upon a time*", a story generation model can create a story based on those words as we saw with GPT-2.
* **Code Generation**: We can train a text generation model on code from scratch to help with repetitive coding tasks.

* **Text-to-Text Generation Models**, which are trained to learn the mapping between a pair of texts (e.g. translation from one language to another). Because they are trained in a multi-task fashion, they can accomplish a wide range of tasks, including summarization, translation, and text classification.

These models can complete unfinished text or paraphrase. They can be used to generate stories, code, or even translations. Let's look at a demo of an existing model built with Gradio and Hugging Face:

##### <font color='#2B4865'>**Demo**</font>

In [7]:
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed, pipeline

#https://huggingface.co/spaces/lvwerra/codeparrot-generation

title = "CodeParrot Generator 🦜"
description = "This is a subspace to make code generation with [CodeParrot](https://huggingface.co/lvwerra/codeparrot), it is used in a larger [space](https://huggingface.co/spaces/loubnabnl/Code-generation-models-v1) for model comparison. For more flexibility in sampling, you can find another demo for CodeParrot [here](https://huggingface.co/spaces/lvwerra/codeparrot-generation)."
example = [
    ["def print_hello_world():", 8, 0.6, 42],
    ["def get_file_size(filepath):", 40, 0.6, 42],
    ["def count_lines(filename):", 40, 0.6, 42],
    ["def count_words(filename):", 40, 0.6, 42]]
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small", low_cpu_mem_usage=True)


def code_generation(gen_prompt, max_tokens, temperature=0.6, seed=42):
    set_seed(seed)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    generated_text = pipe(gen_prompt, do_sample=True, top_p=0.95, temperature=temperature, max_new_tokens=max_tokens)[0]['generated_text']
    return generated_text


iface = gr.Interface(
    fn=code_generation, 
    inputs=[
        gr.Textbox(lines=10, label="Input code"),
        gr.inputs.Slider(
            minimum=8,
            maximum=256,
            step=1,
            default=8,
            label="Number of tokens to generate",
        ),
        gr.inputs.Slider(
            minimum=0,
            maximum=2,
            step=0.1,
            default=0.6,
            label="Temperature",
        ),
        gr.inputs.Slider(
            minimum=0,
            maximum=1000,
            step=1,
            default=42,
            label="Random seed to use for the generation"
        )
    ],
    outputs=gr.Textbox(label="Predicted code", lines=10),
    examples=example,
    layout="horizontal",
    theme="peach",
    description=description,
    title=title
)
iface.launch()


##### <font color='#2B4865'>**Architecture for approaching the task**</font>

The most popular models for this task are GPT-based models (such as GPT-2). Since these models are trained on unlabeled data, we just need plain text to train our own model to generate a wide variety of documents, from code to stories. As for Text-to-Text Generation Models, the most popular variants are T5, T0 and BART.

##### <font color='#2B4865'>**Evaluation metrics**</font>

The quality of these models is usually measured with cross-entropy (the difference between probability distributions) and perplexity (the exponential of the cross-entropy, which reflects the probability the model assigns to the next word; the lower, the better the model).

**Typical metrics** for Text Generation are:

* **Cross Entropy**, which measures the difference between two probability distributions: here, the distribution over next words predicted by the model and the distribution observed in the data (the actual next word).
* **Perplexity**, which is the exponential of the cross-entropy loss. It evaluates the probabilities assigned to the next word by the model. Lower perplexity indicates better performance.
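
To make the relation between the two metrics concrete, here is a minimal sketch in plain Python. The probabilities in `token_probs` are made-up values standing in for the probability a model assigned to the correct next token at each position; they do not come from any real model.

```python
import math

def cross_entropy(true_token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

# Hypothetical probabilities assigned by a model to the correct next token
# at four consecutive positions (illustrative values only).
token_probs = [0.5, 0.25, 0.5, 0.25]

loss = cross_entropy(token_probs)
perplexity = math.exp(loss)  # perplexity is the exponential of the cross-entropy

print(f"Cross-entropy: {loss:.4f}")
print(f"Perplexity: {perplexity:.4f}")
```

A perplexity of about 2.83 here can be read as the model being, on average, as uncertain as if it were choosing uniformly among roughly 2.83 words at each step.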

### <font color='#2B4865'>*4.1. Inference*</font>

Performing inference for a text generation task is pretty similar to what we saw in former tutorials for the studied tasks. For basic text generation tasks (i.e., stories or code generation), we need to provide the model to use in conjunction with the task identifier ``"text-generation"``:

Let's see an example with the pre-trained "gpt2" model:

In [7]:
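from transformers import AutoTokenizer, pipeline

checkpoint_name = "gpt2"
# Load GPT-2's own tokenizer so that the PAD id below matches this model
# (the `tokenizer` variable from the demo belongs to CodeParrot)
gpt2_tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
text_generator = pipeline("text-generation",
                          model=checkpoint_name,
                          pad_token_id=gpt2_tokenizer.eos_token_id)             # Prevents warning during decoding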

In [None]:
prompt = "Once upon a time"
print(text_generator(prompt, max_length=50, do_sample=True, top_p=0.9))

[{'generated_text': "Once upon a time it seemed like every single player had a clear objective that was the key to achieving the goal of getting all the items at the top, and that they'd done it in less than ten minutes.\n\nBut now, even as"}]


The model generates random text from the context “*Once upon a time*”; ``max_length=50`` caps the total length, prompt included, at $50$ tokens. To create text, the pipeline object invokes the method ``PreTrainedModel.generate()``. The default arguments for this method can be overridden in the pipeline.

Below is an example of text generation that uses the GPT-2 model and its tokenizer directly and calls ``generate()`` explicitly. Note that when instantiating the model on its own, we use a ``CausalLM`` head.

We load the GPT-2 tokenizer and the model separately and use them in a hand-written example:

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
gpt2 = AutoModelForCausalLM.from_pretrained(checkpoint_name)
gpt2.config.pad_token_id = gpt2.config.eos_token_id                            

In [None]:
tokenized_prompt = gpt2_tokenizer(prompt, return_tensors="pt")
output = gpt2.generate(**tokenized_prompt, max_length=50, do_sample=True, top_p=0.9)
print(f"{gpt2_tokenizer.batch_decode(output)[0]}")

Once upon a time the human race was going through a period of chaos, they all had the capability to live in an anarchic and free society, and this was a very difficult society to rule, but the people were free to act freely and in


The following code snippet creates a **pipeline for Code generation**, based on CodeParrot, similar to what is done above for the Gradio app.

We create a pipeline based on CodeParrot:

In [9]:
from transformers import AutoTokenizer, pipeline
  
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")

codeparrot = pipeline("text-generation", model="codeparrot/codeparrot-small")
outputs = codeparrot("def hello_world():", max_length=50, pad_token_id=tokenizer.eos_token_id)
print(outputs)

[{'generated_text': 'def hello_world(): # This test case passes if the server needs an empty\n    print "Hello World!"\n# -*- coding: utf-8 -*-\n\n# Copyright(C) 2010-2011 Romain Bignon\n#\n'}]


**Text-to-Text generation models** have a separate pipeline called ``text2text-generation``. This pipeline takes an input containing the sentence including the task and returns the output of the accomplished task.

We create a text2text generation pipeline.

In [None]:
from transformers import pipeline

text2text_generator = pipeline("text2text-generation", model="t5-base")

# Question Answering
question = "What is 42 ?"
context = "42 is the answer to life, the universe and everything"
output = text2text_generator("question: " + question + " " + context)
print(f"{question} --> {output[0]['generated_text']}")

# Translation
source_sent = "I'm very happy"
output = text2text_generator("translate from English to French: " + source_sent)
print(f"{source_sent} --> {output[0]['generated_text']}")

What is 42 ? --> the answer to life, the universe and everything
I'm very happy --> Je suis très heureux


### <font color='#2B4865'>*4.2. Decoding methods for language generation*</font>

**Auto-regressive language generation models** (i.e., decoder models) can use different decoding strategies, the most prominent of which are currently the following:

1. **Greedy search**: At each time step, it selects **the word with the highest probability as its next word**, which is given by:

  $$w_t = \arg\max_{w} P(w | w_{1:t-1})$$

    <br><center><img src="https://drive.google.com/uc?id=1TPmWTY4bljqyRAB0qTXJVtf3qQ5CRBbT" width="40%"></center><br>

  Starting from the word "*The*", the algorithm greedily chooses the next word of highest probability "*nice*" and so on, so that the final generated word sequence is *{"The", "nice", "woman"}*, with a probability of $0.5 \times 0.4 = 0.2$.

  The **way of generating text following a greedy search approach** is as follows:

  ```
  greedy_output = model.generate(input_ids, do_sample=False)
  ```

  The main drawbacks of this approach are:
  * The model starts repeating itself quickly
  * It misses high-probability words hidden behind a low-probability word, e.g., the word "*has*" with its high conditional probability of $0.9$ is hidden behind the word "*dog*", which has only the second-highest conditional probability, so that greedy search misses the word sequence *{"The", "dog", "has"}*.

2. **Beam search**: It reduces the risk of missing hidden high-probability word sequences by keeping the ``num_beams`` most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. For example, with ``num_beams=2``:

    <br><center><img src="https://drive.google.com/uc?id=1Hi0dDz4PKe6dOHOZLbmPmP-1g6_7J1Ia" width="40%"></center><br>

  At time step 1, besides the most likely hypothesis *{"The", "nice"}*, beam search also keeps track of the second most likely one, *{"The", "dog"}*. At time step 2, it finds that the word sequence *{"The", "dog", "has"}* has, with $0.4 \times 0.9 = 0.36$, a higher probability than *{"The", "nice", "woman"}*, which has $0.2$, thus finding the most likely word sequence.

  To generate text with beam search, we set ``num_beams > 1`` and ``early_stopping=True``, so that generation finishes when all beam hypotheses have reached the EOS token. We can further reduce repetitions of the same word sequences by **introducing an n-gram penalty**. The most common n-gram penalty makes sure that no n-gram appears twice by manually setting to 0 the probability of any next word that would create an already seen n-gram, e.g.:
  ```
  beam_output = model.generate(input_ids, num_beams=5, no_repeat_ngram_size=2, early_stopping=True, do_sample=False)
  ```

3. **Sampling**: It refers to **randomly picking the next word $w_t$ according to its conditional probability** distribution $w_t \sim P(w|w_{1:t-1})$:

    <br><center><img src="https://drive.google.com/uc?id=14rILPEf-c43F8vXc_o2hFcRNoWQlCvod" width="40%"></center><br>

  The word "*car*" is sampled from $P(w | \text{"The"})$, followed by sampling "*drives*" from $P(w | \text{"The"}, \text{"car"})$.

  We can activate sampling by setting ``do_sample=True``.

  One issue with sampling is that the models tend to produce incomprehensible nonsense. This may be solved by sharpening the conditional probability distribution by raising the likelihood of high-probability words and decreasing the likelihood of low-probability words, i.e.,  by lowering the softmax's so-called ``temperature``:
  
      <br><center><img src="https://drive.google.com/uc?id=1nz12FfdJIcf92LhS3pg6K5-PeUMQ_jrI" width="40%"></center><br>

  The conditional next word distribution of step $t=1$ becomes much sharper leaving almost no chance for the word "*car*" to be selected.

  In practice, the Transformers library provides two different sampling methods:

  *   **Top-k sampling:** Only the $K$ most likely next words are kept, and the probability mass is redistributed among those $K$ words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation:
  
    <br><center><img src="https://drive.google.com/uc?id=1PrJR5ZCe8-T4-1lsPt1tV6NCq7bSen95" width="70%"></center><br>

    Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words.

    We activate top-K by setting ``top_k`` to the size of the candidate set:

    ```
    sample_output = model.generate(input_ids, do_sample=True, top_k=50)
    ```

  *   **Top-p (nucleus) sampling:** Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set can dynamically increase and decrease according to the next word's probability distribution:

    <br><center><img src="https://drive.google.com/uc?id=11bTzQPA_cNIPfJwaTMcwPMHfXF6aRYUn" width="70%"></center><br>

    Having set $p=0.92$, it picks the minimum number of words to exceed together $p=92\%$ of the probability mass ($V_{\text{top-p}}$). In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%.
  
    We activate Top-p sampling by setting ``top_p`` to a value with ``0 < top_p < 1`` and ``top_k`` to 0:

    ```
    sample_output = model.generate(input_ids, do_sample=True, top_p=0.92, top_k=0)
    ```
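
The difference between greedy search and beam search described above can be reproduced on a toy next-word table. The sketch below is plain Python and does not use the Transformers library. The probabilities for "*nice*", "*dog*", "*woman*" and "*has*" are the ones quoted in the text; every other entry in `NEXT_WORD_PROBS` is an illustrative assumption added only so that each distribution sums to 1.

```python
# Toy next-word model. Values for "nice", "dog", "woman" and "has" come from
# the example in the text; all other probabilities are illustrative assumptions.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def greedy_search(start, steps):
    """At every step, append the single most likely next word."""
    sequence, prob = (start,), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence += (word,)
    return sequence, prob

def beam_search(start, steps, num_beams):
    """Keep the num_beams most likely partial sequences at every step."""
    beams = [((start,), 1.0)]
    for _ in range(steps):
        expanded = [
            (seq + (word,), prob * p)
            for seq, prob in beams
            for word, p in NEXT_WORD_PROBS[seq].items()
        ]
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:num_beams]
    return beams[0]  # hypothesis with the highest overall probability

greedy_seq, greedy_prob = greedy_search("The", steps=2)
beam_seq, beam_prob = beam_search("The", steps=2, num_beams=2)

print("Greedy:", " ".join(greedy_seq), round(greedy_prob, 2))
print("Beam:  ", " ".join(beam_seq), round(beam_prob, 2))
```

Greedy search commits to "*nice*" and ends with a sequence of probability $0.2$, whereas beam search with two beams keeps "*dog*" alive and recovers the sequence with probability $0.36$.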

###### **Exercise 4.1**

Take the prompt we defined in 4.1 (``prompt = "Once upon a time"``) and generate text with the GPT-2 model using each of the decoding methods explained above. Where possible, tune the parameters (e.g., ``num_beams`` for beam search, ``temperature`` for sampling methods, etc.).

Compare the results that you obtain for the different methods.

In [None]:
prompt = "Once upon a time"
prompt_ids = gpt2_tokenizer.encode(prompt, return_tensors='pt')

In [None]:
#<SOL>
# Greedy search
outputs = gpt2.generate(prompt_ids, do_sample=False)
output_text = gpt2_tokenizer.convert_tokens_to_string(gpt2_tokenizer.convert_ids_to_tokens(outputs[0]))
print(output_text)
#</SOL>

Once upon a time, the world was a place of great beauty and great danger. The world was


In [None]:
#<SOL>
# Beam search
beam_output = gpt2.generate(prompt_ids, num_beams=5, no_repeat_ngram_size=2, early_stopping=True, do_sample=False)
output_text = gpt2_tokenizer.convert_tokens_to_string(gpt2_tokenizer.convert_ids_to_tokens(beam_output[0]))
output_text
#</SOL>

'Once upon a time, it was said, there was a man in the house of the Lord,'

In [None]:
#<SOL>
# Top-k sampling
sample_output = gpt2.generate(prompt_ids, do_sample=True, top_k=50)
output_text = gpt2_tokenizer.convert_tokens_to_string(gpt2_tokenizer.convert_ids_to_tokens(sample_output[0]))
output_text
#</SOL>



'Once upon a time all mankind knew that all was to be taken away from this earth...I knew'

In [None]:
#<SOL>
# Top-p sampling
sample_output = gpt2.generate(prompt_ids, do_sample=True, top_p=0.92, top_k=0)
output_text = gpt2_tokenizer.convert_tokens_to_string(gpt2_tokenizer.convert_ids_to_tokens(sample_output[0]))
output_text
#</SOL>



'Once upon a time there was war, the household of Ennegan of Gnarena accepted the'
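
The two sampling runs above differ only in how the pool of candidate words is built before a word is drawn. As a rough illustration of that step, the sketch below applies temperature scaling, top-k filtering and top-p filtering to a small distribution in plain Python. The vocabulary and logits are hypothetical, and the helper functions are simplified stand-ins written for this example; they are not the implementation used inside ``generate()``.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in keep)
    return {w: probs[w] / total for w in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        kept.append(word)
        cumulative += probs[word]
        if cumulative >= p:
            break
    total = sum(probs[w] for w in kept)
    return {w: probs[w] / total for w in kept}

# Hypothetical vocabulary and next-word scores (illustrative values only).
vocab = ["nice", "dog", "car", "woman", "guy"]
logits = [2.0, 1.5, 0.5, 0.0, -1.0]

probs = dict(zip(vocab, softmax(logits)))
sharp_probs = dict(zip(vocab, softmax(logits, temperature=0.5)))

top_k_pool = top_k_filter(probs, k=3)
top_p_pool = top_p_filter(probs, p=0.92)
sharp_top_p_pool = top_p_filter(sharp_probs, p=0.92)

random.seed(42)
sampled = random.choices(list(top_p_pool), weights=list(top_p_pool.values()), k=1)[0]

print("top-k pool:", list(top_k_pool))
print("top-p pool:", list(top_p_pool))
print("top-p pool at temperature 0.5:", list(sharp_top_p_pool))
print("sampled word:", sampled)
```

With these numbers, the top-k pool always has three words, while the top-p pool shrinks from four words to two once the temperature is lowered, which shows how the nucleus adapts to the shape of the distribution.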

### <font color='#2B4865'>*4.3. Fine-tuning*</font>

Fine-tuning a model for text generation is not different from what we have seen for previous tasks, just with a couple of peculiarities regarding preprocessing. Let's see the latter by using the [WikiText](https://huggingface.co/datasets/wikitext) language modeling dataset:

In [11]:
from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')


In [None]:
datasets

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

As checkpoint for this example, we will use the ``distilgpt2`` model:

In [None]:
checkpoint_name = "distilgpt2"

A **standard preprocessing step** for both **auto-regressive and masked language modeling** is to **concatenate all the instances** and then **split the whole corpus into chunks of similar size**. This is quite different from our usual approach, where we simply tokenize individual examples. The reason for this is that if individual samples are too long, they may be truncated, resulting in the loss of information that may be valuable for our task.

The approach is then to first tokenize our corpus as usual, but without setting the ``truncation=True`` option in our tokenizer:

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)

print(tokenizer.is_fast)

True


In [23]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

      


Having our text tokenized, the next step is to concatenate all our texts together and then split the result into small chunks of certain block size. The desired block size would be the maximum length with which our model was pretrained; yet, this might be too big to fit in the GPU RAM, so let's just work with $128$:

In [None]:
print(tokenizer.model_max_length)

block_size = 128

1024


The following function will carry out the grouping:

In [None]:
def group_texts(examples):

  # Concatenate all texts
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])

  # We make all input samples have the same length equal to block_size 
  # by dropping the small remainder, we could add padding if the model
  # supported it instead of this drop
  # You can customize this part to your needs.
  total_length = (total_length // block_size) * block_size

  # Split by chunks of max_len
  result = {
      k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
      for k, t in concatenated_examples.items()
  }

  # Create a new labels column as copy of the input_ids
  result["labels"] = result["input_ids"].copy()

  return result

Note that in the last step of the function above, we create a new ``labels`` column that is a copy of ``input_ids``. This is because in causal language modeling the target at each position is simply the next token of the same sequence: the model shifts the labels by one position internally when computing the loss, so a plain copy of the inputs gives the ground truth for our language model to learn from.
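
To see why a plain copy of ``input_ids`` is enough as ``labels`` for a causal language model, the sketch below pairs each position with the token it is trained to predict. It is a plain-Python illustration under the assumption that the one-position shift happens inside the model's loss computation, as is the case for the causal LM heads in 🤗 Transformers; the helper `next_token_pairs` and the token ids are hypothetical.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input position with the label it is trained to predict."""
    shifted_inputs = input_ids[:-1]  # every position that has a next token
    shifted_labels = labels[1:]      # the token that follows each position
    return list(zip(shifted_inputs, shifted_labels))

# Hypothetical token ids for one block (illustrative values only).
input_ids = [101, 7, 42, 13, 99]
labels = input_ids.copy()  # same construction as in group_texts

pairs = next_token_pairs(input_ids, labels)
print(pairs)
```

Each pair reads as "given this token (and everything before it), predict that token"; the last input token has no target and the first label is never used.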

In [None]:
grouped_datasets = tokenized_datasets.map(group_texts, batched=True, batch_size=1000, num_proc=4)

        


Here we are sending a batch of $1,000$ examples to be treated by the preprocessing function; this means that we will drop the remainder to make the concatenated tokenized texts a multiple of ``block_size`` every $1,000$ examples. You can adjust this behavior with a higher batch size, but note that this will make the processing slower.
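
The effect of the batch size on the dropped remainder can be checked on a toy example. The sketch below re-implements the concatenate-and-chunk step in plain Python on made-up token lists; `group_token_lists` is a simplified, hypothetical stand-in for ``group_texts`` that handles a single list of token ids instead of a dictionary of columns.

```python
def group_token_lists(token_lists, block_size):
    """Concatenate token lists and split them into full blocks, dropping the remainder."""
    concatenated = sum(token_lists, [])
    usable = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, usable, block_size)]
    dropped = len(concatenated) - usable
    return blocks, dropped

# Hypothetical tokenized texts (token ids are made up).
texts = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# One batch holding all four texts: 12 tokens, nothing is dropped.
blocks_one, dropped_one = group_token_lists(texts, block_size=4)

# Two batches of two texts each: a remainder is dropped in every batch.
blocks_two, dropped_two = [], 0
for batch in (texts[:2], texts[2:]):
    blocks, dropped = group_token_lists(batch, block_size=4)
    blocks_two.extend(blocks)
    dropped_two += dropped

print(len(blocks_one), "blocks,", dropped_one, "tokens dropped (one batch)")
print(len(blocks_two), "blocks,", dropped_two, "tokens dropped (two batches)")
print("First block spans two texts:", blocks_one[0])
```

Larger batches waste fewer tokens because the remainder is discarded once per batch rather than once for the whole corpus.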

We can see that our datasets have changed: the samples now consist of blocks of ``block_size`` contiguous tokens that may span several of our original texts:

In [None]:
tokenizer.decode(grouped_datasets["train"][1]["input_ids"])

' game and follows the " Nameless ", a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven ". \n The game began development in 2010, carrying over a large portion of the work done on Valkyria Chronicles II. While it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more forgiving for series newcomers. Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries, along with Valkyria Chronicles II director Takeshi Oz'

Having the dataset fully preprocessed, we have everything we need to carry out the fine-tuning of our model. Yet, there are not many differences from what we have been doing until now.

In the next and last exercise of this notebook, your task will be to carry out the complete fine-tuning of a model on a dataset of your choice (different from WikiText), including the dataset preparation and preprocessing it requires. Here are some instructions that you may find useful:

##### <font color='#2B4865'>**Some instructions on fine-tuning for text generation**</font>

You can mimic the training we carried out in the first notebook tutorial of Hugging Face Transformers, but note that:

- The model that you need to use is one with a language modeling head (``AutoModelForCausalLM``).
- There is no need for the ``compute_metrics`` function, as evaluation for text generation can be performed accurately based on the loss.
- As DataCollator, use an instance of the class ``DataCollatorForLanguageModeling``. Since we are performing Causal Language modeling, you need to set the argument ``mlm`` to False. Before creating the collator, you will need to set the tokenizer's PAD token as the EOS token. The following code shows how to make the instantiation:

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

##### <font color='#2B4865'>**Instructions on how to create a custom dataset**</font>

When fine-tuning a Transformer model, it is not strictly necessary to use one of the Datasets available at the Hub. There are a few ways to go about defining our datasets, by creating and loading them from local or remote datasets.

🤗 Datasets supports several common data formats for loading local datasets, such as **CSV & TSV**, **Text files**, **JSON & JSON Lines** and **Pickled DataFrames**, and they can be loaded with ``load_dataset`` and the file type identifier. For example, if we want to load a CSV file, we can do it as follows:

```
load_dataset("csv", data_files="my_file.csv")
```

You may also want to use datasets that are stored in a remote server such as GitHub. It turns out that loading these files is as simple as doing so for local ones: we only need to point the ``data_files`` argument of ``load_dataset()`` to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can just point data_files to the SQuAD_it-*.json.gz URLs as follows:

```
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```

You can check more details about these functionalities [here](https://huggingface.co/course/chapter5/2?fw=pt).

In some scenarios, you may need to build a dataset from scratch, because you need to solve a certain NLP application and no suitable dataset exists. [In this link](https://huggingface.co/course/chapter5/5?fw=pt), you have a tutorial on how to create a corpus of GitHub issues, first by getting the data via the GitHub REST API, and then constructing the dataset. In this scenario, you will probably need to carry out the cleaning of the data.
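
As a small illustration of the cleaning and construction step, the sketch below takes a few raw text snippets, normalizes their whitespace, drops the empty ones and writes the result as a JSON Lines file using only the standard library. The snippets, the file name `my_corpus.jsonl`, the field name `text` and the helper functions are all hypothetical choices made for this example.

```python
import json
import os
import re
import tempfile

def clean_text(text):
    """Collapse runs of whitespace into single spaces and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()

def write_jsonl(records, path, text_field="text"):
    """Write one JSON object per line, skipping records that are empty after cleaning."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for raw in records:
            cleaned = clean_text(raw)
            if cleaned:
                f.write(json.dumps({text_field: cleaned}, ensure_ascii=False) + "\n")
                written += 1
    return written

# Hypothetical raw snippets, e.g. lines obtained by scraping (made-up data).
raw_records = ["  Hello   there. \n", "", "General\tKenobi!", "   "]

output_path = os.path.join(tempfile.mkdtemp(), "my_corpus.jsonl")
num_written = write_jsonl(raw_records, output_path)

# Read the file back to check its content
with open(output_path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

print(num_written, rows)
```

A file written this way can then be loaded with ``load_dataset("json", data_files=output_path)``, in the same way as the local and remote files shown above.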

###### **Exercise 4.2**

Your task here is to choose an autoregressive model from the Hub (try to use a different one from what we have been using in this tutorial) and a dataset to fine-tune for text generation (e.g., stories, code, etc.). 

As **dataset**, you **cannot use one provided in the Hub**. This means that you will need to find one from other resources (e.g., Kaggle) or construct it yourself (e.g., you may want to generate movie dialogues in the Star Wars style; for this, you can scrape the dialogues from IMSDb). If necessary, carry out the cleaning of the dataset.

Once you have your dataset ready, proceed with its pre-processing as we have seen above, and fine-tune the model on it following the specified guidelines. Choose the hyperparameters that best fit your dataset.

Once the model is trained, you can check its performance via the following code snippet:

```python
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```
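To make the formula concrete: perplexity is the exponential of the average negative log-likelihood per token, which is what the snippet above computes from `eval_loss` (assuming the loss is the mean cross-entropy in nats). The toy function below is only an illustration with made-up token probabilities; `perplexity` is a hypothetical helper, not a library call.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Higher probabilities on the observed tokens give a lower perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(f"uniform: {uniform:.2f}")
print(f"confident: {confident:.2f}")
```

Lower is better, and the minimum possible value is 1 (every observed token predicted with probability 1).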

Once you have the final fine-tuned model, create a pipeline and generate at least 10 different texts with it. To create the pipeline, specify the model checkpoint and the tokenizer, and set the following argument:

```python
pad_token_id=tokenizer.eos_token_id
```

As the tokenizer for the pipeline, use that of the checkpoint you fine-tuned.
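A minimal sketch of how these pieces could fit together is shown below. It assumes the fine-tuned model is a causal language model saved to a local directory; `generate_samples`, its parameters, and the directory name in the usage comment are hypothetical, and the sampling settings are just one reasonable choice.

```python
def generate_samples(model_dir, prompt, tokenizer_name=None, n=10, max_new_tokens=50):
    """Sample n continuations of `prompt` from a fine-tuned causal LM."""
    # Imported here so that defining the function does not load the library
    from transformers import AutoTokenizer, pipeline

    # Assumption: if no tokenizer name is given, the tokenizer files were
    # saved next to the model (Trainer does this when given a tokenizer).
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name or model_dir)

    generator = pipeline(
        "text-generation",
        model=model_dir,
        tokenizer=tokenizer,
        pad_token_id=tokenizer.eos_token_id,
    )
    outputs = generator(
        prompt,
        do_sample=True,            # sample instead of greedy decoding
        num_return_sequences=n,    # n different continuations
        max_new_tokens=max_new_tokens,
    )
    return [o["generated_text"] for o in outputs]


# Hypothetical usage; "my-finetuned-model" is a placeholder directory:
# for text in generate_samples("my-finetuned-model", "Once upon a time"):
#     print(text)
```

Sampling (`do_sample=True`) is used here because greedy decoding would return the same text for every call with the same prompt.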

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["sentence"])

In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("ptb_text_only")

checkpoint_name = "distilgpt2"
    
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)


In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence'],
        num_rows: 42068
    })
    test: Dataset({
        features: ['sentence'],
        num_rows: 3761
    })
    validation: Dataset({
        features: ['sentence'],
        num_rows: 3370
    })
})

In [11]:
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns='sentence')

        


In [12]:
def group_texts(examples):

  # Concatenate all texts
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])

  # Make every input sample exactly block_size tokens long by dropping the
  # small remainder at the end. Padding the last block instead would also
  # work if the model supports it; customize this part to your needs.
  total_length = (total_length // block_size) * block_size

  # Split into chunks of block_size
  result = {
      k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
      for k, t in concatenated_examples.items()
  }

  # Create a new labels column as copy of the input_ids
  result["labels"] = result["input_ids"].copy()

  return result
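
To see what this grouping does, here is a self-contained toy run. It is a sketch that mirrors the logic of `group_texts` but takes `block_size` as a parameter; the function name `group_into_blocks` and the tiny batch are invented for the example.

```python
def group_into_blocks(examples, block_size):
    """Concatenate each column, drop the remainder, and split into blocks."""
    concatenated = {k: [tok for seq in examples[k] for tok in seq] for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size  # drop remainder

    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


# Three "sentences" with 3 + 2 + 5 = 10 tokens in total
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
grouped = group_into_blocks(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With 10 tokens and a block size of 4, two full blocks are produced and the last two tokens (9 and 10) are dropped. Note that blocks can span sentence boundaries.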

In [13]:
print(tokenizer.model_max_length)

block_size = 128

1024


In [14]:
grouped_datasets = tokenized_datasets.map(group_texts, batched=True, batch_size=1000, num_proc=4)

        


In [15]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
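
For intuition, the following pure-Python sketch imitates what a collator configured with `mlm=False` does to a batch: it pads the sequences to a common length and builds labels as a copy of the inputs, with padded positions replaced by -100 so that the loss ignores them. This is a simplified stand-in assuming right padding, not the library's implementation; `collate_causal` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def collate_causal(sequences, pad_token_id):
    """Pad to the longest sequence; labels copy input_ids with pads masked out."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids = list(seq) + [pad_token_id] * n_pad
        input_ids.append(ids)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Every position holding the pad id is excluded from the loss
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in ids])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

Because the pad token is set to the EOS token in the cell above, genuine EOS tokens are masked in the labels as well. In this notebook every block already has `block_size` tokens, so no padding is actually needed; the shift of labels by one position happens inside the model.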

In [20]:
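from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Note: mT0 is an encoder-decoder (seq2seq) model, hence AutoModelForSeq2SeqLM.
# This cell also rebinds `tokenizer` to the mT0 tokenizer, whereas the dataset
# above was tokenized with the distilgpt2 tokenizer.
checkpoint = "bigscience/mt0-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)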

In [21]:
from transformers import TrainingArguments

# Specify the training arguments
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII',          # Directory where training results are saved
    evaluation_strategy = "steps",  # Evaluate every fixed number of steps
    per_device_train_batch_size=1,  # Training batch size per device
    num_train_epochs=1,             # Number of training epochs
    save_steps=2000,                # Save a checkpoint every 2000 steps
    save_total_limit=2,             # Keep at most 2 checkpoints on disk
)

In [24]:
grouped_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8528
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 765
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 674
    })
})

In [26]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=grouped_datasets['train'],
    eval_dataset = grouped_datasets['validation'],
    tokenizer=tokenizer,
    data_collator = data_collator
)

In [27]:
trainer.train()

***** Running training *****
  Num examples = 8528
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 8528
  Number of trainable parameters = 300176768


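| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 0.6397 | 0.247901 |
| 1000 | 0.6092 | 0.180119 |
| 1500 | 0.4994 | 0.137262 |
| 2000 | 0.3967 | 0.112302 |
| 2500 | 0.3561 | 0.096956 |
| 3000 | 0.3034 | 0.080527 |
| 3500 | 0.2672 | 0.070296 |
| 4000 | 0.2332 | 0.065205 |
| 4500 | 0.22   | 0.058722 |
| 5000 | 0.2087 | 0.053795 |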


***** Running Evaluation *****
  Num examples = 674
  Batch size = 8
***** Running Evaluation *****
  Num examples = 674
  Batch size = 8
***** Running Evaluation *****
  Num examples = 674
  Batch size = 8
***** Running Evaluation *****
  Num examples = 674
  Batch size = 8
Saving model checkpoint to /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000
Configuration saved in /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000/config.json
Model weights saved in /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000/special_tokens_map.json
Copy vocab file to /content/drive/MyDrive/Cosas/NLP_IA/Cosas_LAB_4/TUTOIII/checkpoint-2000/spiece.model
***** Running Evaluation *****
  Num examples = 674

TrainOutput(global_step=8528, training_loss=0.29136470295698513, metrics={'train_runtime': 2115.6282, 'train_samples_per_second': 4.031, 'train_steps_per_second': 4.031, 'total_flos': 1127294340956160.0, 'train_loss': 0.29136470295698513, 'epoch': 1.0})
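
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes a single device and reuses only values printed above; since the final `eval_loss` does not appear in the log, the perplexity line uses the last logged validation loss (step 5000) as a stand-in.

```python
import math

# Values copied from the training log above
num_examples = 8528
per_device_batch_size = 1
grad_accum_steps = 1
num_epochs = 1
train_runtime_s = 2115.6282
last_logged_eval_loss = 0.053795  # validation loss at step 5000

# Optimization steps: batches per epoch, divided by accumulation, times epochs
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = (batches_per_epoch // grad_accum_steps) * num_epochs

# Throughput reported as train_samples_per_second
samples_per_second = num_examples * num_epochs / train_runtime_s

# Perplexity, following the formula used in the exercise statement
ppl = math.exp(last_logged_eval_loss)

print(total_steps)                  # 8528
print(f"{samples_per_second:.3f}")  # 4.031
print(f"Perplexity: {ppl:.2f}")
```

With a batch size of 1, steps per second and samples per second coincide, which matches the two identical 4.031 values in `TrainOutput`.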