<a href="https://colab.research.google.com/github/SpaceSapiens/Study/blob/main/F21CA_Transformers_for_Dialogue_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# F21CA - Dialogue Generation with Huggingface Transformers

## Background

In this tutorial, we will explore the topic of dialogue generation. This is a recent area of Conversational AI in which we assume that the agent learns from a set of previous conversations 𝔻 made of several (history, response) pairs. The agent learns to encode the history and to generate the corresponding response token by token.

Since the introduction of [Sequence-to-Sequence](https://arxiv.org/abs/1409.3215) models, we can design conversational systems in a completely end-to-end fashion, without the pipeline approach that we have seen during the lecture. In this tutorial, we will use [DialoGPT](https://arxiv.org/abs/1911.00536), a large-scale Transformer trained on a very large dataset of conversations derived from Reddit. This model is based on GPT-2, an autoregressive model that defines the following probability distribution over the response tokens $y$ given the history tokens $x$:

$p(y_1,\ldots,y_{T'} | x_1,\ldots,x_T) = \prod_{t=1}^{T'} p(y_t | x_1,\ldots,x_T, y_1, \ldots, y_{t-1})$

This is achieved in a Transformer architecture by using a dedicated causal mask:

![image.png](https://sshleifer.github.io/blog_v2/images/copied_from_nb/diagram_bartpost_gpt2.jpg)

This prevents past tokens from seeing future tokens, which makes this model suitable for conditional generation (i.e., it generates the response conditioned on the dialogue history).
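
To make the masking concrete, here is a minimal sketch in plain Python (no deep learning library is assumed) of the lower-triangular mask for a sequence of four tokens. The helper name `causal_mask` is ours and purely illustrative; real GPT-2 implementations build an equivalent mask as a tensor.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1 if
    position i may attend to position j (that is, j <= i) and 0 otherwise."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row `i` lists the positions that token `i` can attend to: itself and everything before it, but nothing after it.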

## Learning End-to-End Goal-Oriented Dialog

We will use the model DialoGPT to train a task-oriented conversational agent able to support a conversation about restaurants. To do so, we will use the [Dialog-bAbI](https://arxiv.org/abs/1605.07683) dataset released by Facebook. The resulting agent will be able to support a conversation like the following:

```
1 User: hi    System: hello what can i help you with today
2 User: can you make a restaurant reservation with italian cuisine for six people in a cheap price range    System: i'm on it
3 User: <SILENCE>    System: where should it be
4 User: rome please    System: ok let me look into some options for you
5 User: <SILENCE>    System: api_call italian rome six cheap
```

The system's main objective is to correctly predict the final API call, `api_call italian rome six cheap`, which will then be used to retrieve the restaurants that match the details requested by the user.
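
As an illustration of how such an API call could be consumed downstream, the sketch below splits it into named slots. The function `parse_api_call` and the slot names are hypothetical; the slot order (cuisine, location, party size, price range) is an assumption read off the example dialogue above.

```python
def parse_api_call(utterance):
    """Split an `api_call` utterance into named slots.

    The slot names and their order are assumptions based on the example
    `api_call italian rome six cheap`.
    """
    tokens = utterance.split()
    if len(tokens) != 5 or tokens[0] != "api_call":
        raise ValueError(f"not a well-formed api_call: {utterance!r}")
    slot_names = ("cuisine", "location", "party_size", "price_range")
    return dict(zip(slot_names, tokens[1:]))


slots = parse_api_call("api_call italian rome six cheap")
print(slots)
```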

As stated above, we will create a dataset of (history, response) pairs. This will be created from the full dialogue by progressively extending the dialogue history for each turn, and using the system utterance as response for that turn.
For instance, given the full dialogue above, the set of examples generated from it will look like this:

```
[
    ("[user] hi", "[system] hello what can i help you with today"),
    ("[user] hi [system] hello what can i help you with today [user] can you make a restaurant reservation with italian cuisine for six people in a cheap price range", "[system] i'm on it"),
     ...
]
```
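
The procedure described above can be sketched in a few lines of plain Python. The helper `build_pairs` is a simplified, illustrative stand-in for the code used later in this notebook; it takes the turns of one dialogue as (user, system) tuples.

```python
USER, SYSTEM = "[user]", "[system]"


def build_pairs(turns):
    """Turn one dialogue, given as (user_utterance, system_utterance) tuples,
    into a list of (history, response) pairs."""
    history, pairs = [], []
    for user_utterance, system_utterance in turns:
        history.append(f"{USER} {user_utterance}")
        response = f"{SYSTEM} {system_utterance}"
        pairs.append((" ".join(history), response))
        # the system response becomes part of the history of the next turn
        history.append(response)
    return pairs


dialogue = [
    ("hi", "hello what can i help you with today"),
    ("can you make a restaurant reservation with italian cuisine for six people in a cheap price range", "i'm on it"),
    ("<SILENCE>", "where should it be"),
]
pairs = build_pairs(dialogue)
for history, response in pairs:
    print(repr(history), "->", repr(response))
```

Each dialogue with N turns therefore contributes N training examples, and the history grows by one user turn and one system turn at every step.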

## PyTorch-Lightning

This notebook is intended as a playground to let you understand how these large language models are implemented and trained. We have intentionally written all the code for you so that you can simply run it. To digest it, you should familiarise yourself with [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html). This is a self-contained deep learning library that provides all the boilerplate code required to train Deep Learning models using PyTorch.

PyTorch-Lightning provides the abstractions that are common to any ML workflow. You can think of it as a three-step process:

1. Dataset definition (i.e., how do I read my raw data?)
2. Data module definition (i.e., how do I create batches of data for my DL model?)
3. ML module definition (i.e., what does my model look like?)

PyTorch-Lightning (and PyTorch) provides you with three abstractions to easily implement these 3 steps:

1. Define a dataset class that inherits from `Dataset` (available in `torch.utils.data`)
2. Define a data module class that inherits from `LightningDataModule` (available in `pytorch_lightning`)
3. Define a module class that inherits from `LightningModule`

Once you have all these components, you can pass them to a `Trainer` (available in `pytorch_lightning`) which will complete the model training process.

### Setup Google Colab

Before running this notebook, make sure to request a GPU to train your models. You can do so by clicking on "Runtime" > "Change runtime type" > "GPU". This will make sure that we run the code using GPU acceleration.

In [None]:
!pip install pytorch-lightning transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch-lightning
  Downloading pytorch_lightning-1.9.4-py3-none-any.whl (827 kB)
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting lightning-utilities>=0.6.0.post0
  Downloading lightning_utilities-0.7.1-py3-none-any.whl (18 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.11.3-py3-none-any.whl (518 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [None]:
# This command downloads the dialog-babi dataset from Dropbox
!wget https://www.dropbox.com/s/20rgyj8rryvos9l/dialog-bAbI-tasks-1_6.zip?dl=1 -O dialog-babi-tasks-1_6.zip
!unzip dialog-babi-tasks-1_6.zip

--2023-03-05 15:12:41--  https://www.dropbox.com/s/20rgyj8rryvos9l/dialog-bAbI-tasks-1_6.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/20rgyj8rryvos9l/dialog-bAbI-tasks-1_6.zip [following]
--2023-03-05 15:12:41--  https://www.dropbox.com/s/dl/20rgyj8rryvos9l/dialog-bAbI-tasks-1_6.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4fbce1d59051a935b8ebc4d304.dl-eu.dropboxusercontent.com/cd/0/get/B3qlsK5v9uwaFj1j0IvH0AXonvh_lTu_w-KwkYYxM0mfHJLRVPXge2T9mLUpOMzPjjs97oWrKAHs5dGajshPxAn4tnVv_b4_4-d3KHwDfTT7aCdCvIs-o7QSrfoS0hUT3G3acHUmnNaKFnReZDc8QdrnQ4xxfQQLdDV3O9EEYwuQ3xL6x-p1tKkYKUnANZIOtL0/file?dl=1# [following]
--2023-03-05 15:12:42--  https://uc4fbce1d59051a935b8ebc4d304.dl-eu.dropboxusercontent.com/cd/0/get/B3qlsK5v9uwa

In [None]:
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning import LightningDataModule
from transformers import AutoTokenizer
import torch
import math
import itertools


# We will use these tokens to delimit the user and system utterances
USER_DELIMITER = "[user]"
SYS_DELIMITER = "[system]"

class DialogBabiAPICalls(Dataset):
    """This represents the dataset for Dialog-bAbI task1"""
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        return self._format_dialogue(**self.dataset[idx])

        
    def _format_dialogue(self, history, response):
        history_str = " ".join(history)

        return {
            "history": history_str,
            "response": response
        }

class BatchCollateFn():
    """This represents the collate function implemented for the dialogue generation task.
    
    This is a slightly more complex implementation because we need to appropriately format the input 
    to the model by combining both history and response in a single input.
    This code has been created by following the original DialoGPT implementation: 
    https://github.com/microsoft/DialoGPT/blob/master/data_loader.py#L171
        
    """

    def __init__(self, tokenizer, max_history_length=60, max_response_length=10):
        self.tokenizer = tokenizer
        self.max_history_length = max_history_length
        self.max_response_length = max_response_length

    def __call__(self, batch):
        transposed_batch = {k: [dic[k] for dic in batch] for k in batch[0]}

        # we first tokenize the history, keeping only its most recent tokens
        # (i.e., we truncate from the left, not from the right)
        encoded_history = tokenize_history(
            transposed_batch["history"],
            self.tokenizer,
            self.max_history_length
        )
            
        # we tokenize the response as well
        encoded_response = self.tokenizer(
            transposed_batch["response"],
            max_length=self.max_response_length,
            truncation=True
        )

        # at this stage both encoded_history and encoded_response are lists of tokens
        # with this step we can combine them together in a single sequence 
        # [context_tokens] <eos_token> [response_tokens] <eos_token> 
        encoded_context = {
            "input_ids" : [torch.tensor(h + [self.tokenizer.eos_token_id] + r + [self.tokenizer.eos_token_id], dtype=torch.long) for h, r in zip(encoded_history["input_ids"], encoded_response["input_ids"])],
            "attention_mask": [torch.tensor(h + [1] + r + [1], dtype=torch.long) for h, r in zip(encoded_history["attention_mask"], encoded_response["attention_mask"])]
        }
        
        labels = []

        # when creating the target labels for the model we want to mask out 
        # all the tokens that are not part of the current system response
        # in this way we make sure that the agent will be penalised only for the 
        # tokens that belong to the system response.
        for idx, hist in enumerate(encoded_history["input_ids"]):
            labels.append(torch.tensor(
                [-100] * len(hist) + [-100] + encoded_response["input_ids"][idx] + [self.tokenizer.eos_token_id],
                dtype=torch.long
            ))
 
        encoded_context["labels"] = labels

        # once we create all the data, we make sure to add some extra padding 
        # to the tensors so that they all have the same size (batch_size, sequence_length)
        return {
            "input_ids": torch.nn.utils.rnn.pad_sequence(encoded_context["input_ids"], batch_first=True, padding_value=self.tokenizer.eos_token_id),
            "attention_mask": torch.nn.utils.rnn.pad_sequence(encoded_context["attention_mask"], batch_first=True),
            "labels": torch.nn.utils.rnn.pad_sequence(encoded_context["labels"], batch_first=True, padding_value=-100),
        }

class DialogBabiAPICallsDataModule(LightningDataModule):
    def __init__(self, tokenizer_name="microsoft/DialoGPT-small", batch_size=3, num_workers=0):
        super().__init__()
        self.tokenizer_name = tokenizer_name
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage=None):
        self.training_dataset = DialogBabiAPICalls(
            self._generate_examples(load_dataset("/content/dialog-bAbI-tasks-1_6/task1-API-calls/dialog-babi-task1-API-calls-trn.txt", "train"))
        )
        self.validation_dataset = DialogBabiAPICalls(
            self._generate_examples(load_dataset("/content/dialog-bAbI-tasks-1_6/task1-API-calls/dialog-babi-task1-API-calls-dev.txt", "valid"))
        )
        self.test_dataset = DialogBabiAPICalls(
            self._generate_examples(load_dataset("/content/dialog-bAbI-tasks-1_6/task1-API-calls/dialog-babi-task1-API-calls-tst.txt", "test"))
        )
        self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name)

        # We want to add the special tokens that we use to delimit the turns
        self.tokenizer.add_tokens([USER_DELIMITER, SYS_DELIMITER])
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def train_dataloader(self):
        return DataLoader(
            self.training_dataset,  
            shuffle=True,
            batch_size=self.batch_size, 
            num_workers=self.num_workers, 
            collate_fn=BatchCollateFn(self.tokenizer)
        )


    def val_dataloader(self):
        return DataLoader(
            self.validation_dataset, 
            batch_size=self.batch_size, 
            num_workers=self.num_workers, 
            collate_fn=BatchCollateFn(self.tokenizer)
        )


    def test_dataloader(self):
        return DataLoader(
            self.test_dataset, 
            batch_size=self.batch_size, 
            num_workers=self.num_workers, 
            collate_fn=BatchCollateFn(self.tokenizer)
        )

    def _generate_examples(self, dataset):
        """Given a dataset containing full dialogues, it generates a list of 
        (history, response) pairs"""
        examples_iterator = map(lambda x: generate_examples_from_dialogue(x[1]), dataset)

        return list(itertools.chain.from_iterable(examples_iterator))

def generate_examples_from_dialogue(dialogue):
    """Given a dialogue, generates a list of (history, response) pairs"""
    assert len(dialogue["user_turns"]) == len(dialogue["system_turns"])

    num_turns = len(dialogue["user_turns"])

    history = []
    examples = []

    for idx in range(num_turns):
        user_utterance = dialogue["user_turns"][idx]
        history.append(f"{USER_DELIMITER} {user_utterance}")
        response = f"{SYS_DELIMITER} {dialogue['system_turns'][idx]}"
        examples.append({"history": history.copy(), "response": response})
        history.append(response)

    return examples

def load_dataset(filepath, split):
    """Loads a dataset saved in the file specified by `filepath`"""
    with open(filepath, encoding="utf-8") as f:
        dialogue_rows = []
        dialogue_id = 1

        for row in f:
            turn = row.strip()
            if not turn:
                yield format_dialogue(dialogue_id, split, dialogue_rows)
                dialogue_rows.clear()
                dialogue_id += 1
            else:
                dialogue_rows.append(turn)
        
        if dialogue_rows:
            yield format_dialogue(dialogue_id, split, dialogue_rows)

def format_dialogue(dialogue_id, split, dialogue_rows):
    """Given a sequence of lines read from a file generates a formatted dialogue"""
    user_turns = []
    system_turns = []

    for turn in dialogue_rows:
        rest_turn, sys_turn = turn.split("\t")
        _, user_turn = rest_turn.split(" ", 1)
        user_turns.append(user_turn)
        system_turns.append(sys_turn)


    example_key = f"{split}-{dialogue_id}"
    return example_key, {
        "user_turns": user_turns,
        "system_turns": system_turns
    }

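def tokenize_history(history, tokenizer, max_history_length=20):
    """Tokenizes the history, keeping only the last `max_history_length` tokens.

    `history` can be a single string or a list of strings (a batch).
    """
    encoded_history = tokenizer(history)
    is_batch = not isinstance(history, str)

    # We truncate manually from the left so that the most recent turns,
    # which sit at the end of the sequence, are the ones we keep
    for k, v in encoded_history.items():
        if is_batch:
            # v is a list of sequences: truncate each sequence, not the batch
            encoded_history[k] = [seq[-max_history_length:] for seq in v]
        else:
            encoded_history[k] = v[-max_history_length:]

    return encoded_history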
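
The input layout built by `BatchCollateFn` can be checked on a toy example. The sketch below uses made-up token ids and a hypothetical helper `build_example`; only the end-of-sequence id (50256 for the GPT-2 vocabulary) and the ignore index (-100, which the cross-entropy loss skips) reflect the real setup.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss
EOS = 50256          # id of <|endoftext|> in the GPT-2 vocabulary


def build_example(history_ids, response_ids, max_history_length=60):
    """Build (input_ids, labels) for a single (history, response) pair."""
    # keep only the most recent history tokens
    history_ids = history_ids[-max_history_length:]
    # [history] <eos> [response] <eos>
    input_ids = history_ids + [EOS] + response_ids + [EOS]
    # the history and the first <eos> are masked out, the response is not
    labels = [IGNORE_INDEX] * (len(history_ids) + 1) + response_ids + [EOS]
    return input_ids, labels


# toy ids, chosen only for readability
input_ids, labels = build_example([11, 12, 13, 14], [21, 22], max_history_length=3)
print(input_ids)  # [12, 13, 14, 50256, 21, 22, 50256]
print(labels)     # [-100, -100, -100, -100, 21, 22, 50256]
```

`input_ids` and `labels` always have the same length, so the model is only penalised on the positions that belong to the system response and its closing end-of-sequence token.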

In [None]:
# We create the datamodule by specifying the batch size
dm = DialogBabiAPICallsDataModule(batch_size=32)

dm.prepare_data()
dm.setup()
 
# Uncomment this part to visualise the content of the batch
# You should reduce the batch size if you want to properly visualise the content
# for batch in dm.train_dataloader():
#     print("---")
#     print(batch["input_ids"])
#     print("---")
#     print(batch["labels"])
#     print("---")
#     print(batch["attention_mask"])
#     break

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/641 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [None]:
from torch.nn.modules.loss import CrossEntropyLoss
from pytorch_lightning import LightningModule
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup
import torch


class DialoGPTModule(LightningModule):
    """This class defines a wrapper for the DialoGPT model."""
    def __init__(self, model_name, lr=5e-5):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.save_hyperparameters()
    
    def training_step(self, batch, batch_idx):
        model_output = self.model(
            input_ids=batch["input_ids"], 
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )
        self.log("train_loss", model_output.loss)

        # Lightning only needs the loss to run the backward pass
        return model_output.loss

    def validation_step(self, batch, batch_idx):
        model_output = self.model(
            input_ids=batch["input_ids"], 
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )

        self.log("valid_loss", model_output.loss, on_epoch=True)

        return model_output
    

    def test_step(self, batch, batch_idx):
        model_output = self.model(
            input_ids=batch["input_ids"], 
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )

        self.log("test_loss", model_output.loss, on_epoch=True)

        return model_output

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
        num_training_steps, num_warmup_steps = self.compute_warmup(
            num_training_steps=-1,
            num_warmup_steps=0.1,
        )
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step", "frequency": 1},
        }

    @property
    def num_training_steps(self):
        return self.trainer.estimated_stepping_batches

    def compute_warmup(self, num_training_steps, num_warmup_steps):
        if num_training_steps < 0:
            # less than 0 specifies to infer number of training steps
            num_training_steps = self.num_training_steps
        if isinstance(num_warmup_steps, float):
            # Convert float values to percentage of training steps to use as warmup
            num_warmup_steps = int(num_warmup_steps * num_training_steps)
        return num_training_steps, num_warmup_steps
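
To get a feel for the schedule configured in `configure_optimizers`, the sketch below re-implements a linear warmup followed by a linear decay in plain Python. It is our own approximation of what `get_linear_schedule_with_warmup` computes, not the library code, and the number of training steps is a made-up value.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decrease from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_training_steps = 1000                         # hypothetical value
num_warmup_steps = int(0.1 * num_training_steps)  # 10% of the steps, as in `configure_optimizers`
base_lr = 5e-5

for step in (0, 50, 100, 550, 1000):
    multiplier = linear_warmup_decay(step, num_warmup_steps, num_training_steps)
    print(step, base_lr * multiplier)
```

The learning rate starts at zero, reaches its peak value `base_lr` at the end of the warmup, and then decays back to zero by the last training step.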


In [None]:
from pytorch_lightning import Trainer

# we create the model by loading the pretrained DialoGPT-small
# You can load bigger versions of the model but you might lack GPU memory on Colab
# Visit https://huggingface.co/models to explore the ones available
model = DialoGPTModule("microsoft/DialoGPT-small")
# we have to resize the token embeddings because we added some extra ones 
model.model.resize_token_embeddings(len(dm.tokenizer))
# We initialise the Trainer to complete the training on GPU
trainer = Trainer(gradient_clip_val=1.0, accelerator="gpu", devices="1", max_epochs=5)

trainer.fit(model, dm)

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/351M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loading `train_dataloader` to estimate number of stepping batches.
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type            | Params
------------------------------------------
0 | model | GPT2LMHeadModel | 124 M 
------------------------------------------
124 M     Trainable params
0         Non-trainable params
124 M     Total params
497.765   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")


In [None]:
# Now we check the loss on the test set

trainer.test(model, dm)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss          5.507322202902287e-05
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 5.507322202902287e-05}]

## Test your new shiny model

In a sequence generation task, the loss alone is not a reliable indicator of model performance. This is because the loss is computed with teacher forcing: the prediction of each token is conditioned on the gold tokens that precede it (i.e., those provided by your teacher), no matter what the model itself would have generated at the previous steps.

Remember that the test loss above is therefore not indicative of the real performance of the agent: in this setup, the agent still receives gold tokens when generating!
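
The gap between the two settings can be seen on a toy example. The "model" below is a hand-written lookup table, not DialoGPT: it predicts the next token from the previous one and makes a single mistake. All names and tokens are illustrative.

```python
# A toy "model": the next token depends only on the previous one.
# It makes one mistake: after "french" it predicts "paris" instead of "london".
NEXT_TOKEN = {
    "<bos>": "api_call",
    "api_call": "french",
    "french": "paris",
    "london": "four",
    "four": "cheap",
    "paris": "two",
    "two": "expensive",
}
gold = ["api_call", "french", "london", "four", "cheap"]

# Teacher forcing: every prediction is conditioned on the gold previous token
teacher_forced = [NEXT_TOKEN[prev] for prev in ["<bos>"] + gold[:-1]]

# Free-running generation: every prediction is conditioned on the model's own output
generated, prev = [], "<bos>"
for _ in gold:
    prev = NEXT_TOKEN[prev]
    generated.append(prev)


def token_accuracy(predicted, reference):
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


print(token_accuracy(teacher_forced, gold))  # 0.8
print(token_accuracy(generated, gold))       # 0.4
print(generated == gold)                     # False
```

With teacher forcing the single mistake costs one token, because the gold token puts the model back on track. In free-running generation the same mistake derails everything that follows, and the resulting API call is wrong.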

In [None]:
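# Looks like our model has a reasonably low cross-entropy loss on the test set.
# We can test the real skills of our model by using the `generate` function

new_dialogue = tokenize_history(dm.test_dataset[5]["history"], dm.tokenizer, max_history_length=60)

# `generate` expects tensors of shape (batch_size, sequence_length),
# so we wrap the token lists to add a batch dimension of size 1
input_ids = torch.tensor([new_dialogue["input_ids"]], device=model.device)
attention_mask = torch.tensor([new_dialogue["attention_mask"]], device=model.device)

outputs = model.model.generate(input_ids, attention_mask=attention_mask, max_length=100)

dm.tokenizer.batch_decode(outputs)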

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'history': "[user] good morning [system] hello what can i help you with today [user] can you book a table in a cheap price range in london [system] i'm on it [user] <SILENCE> [system] any preference on a type of cuisine [user] with french food [system] how many people would be in your party [user] for four please [system] ok let me look into some options for you [user] <SILENCE>", 'response': '[system] api_call french london four cheap'}


["[user] good morning [system] hello what can i help you with today [user] can you book a table in a cheap price range in london [system] i'm on it [user] <SILENCE> [system] any preference on a type of cuisine [user] with french food [system] how many people would be in your party [user] for four please [system] ok let me look into some options for you [user] <SILENCE> api_call french london four cheap<|endoftext|>"]
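
A stricter check than the loss is to compare the generated API call with the gold one. The sketch below works on plain strings and assumes that the decoded text starts with the exact history string, as in the output above. The helper `extract_response` is hypothetical, and the history is shortened for readability.

```python
EOS_TOKEN = "<|endoftext|>"
SYS_DELIMITER = "[system]"


def extract_response(decoded, history):
    """Return the text generated after `history`, without special tokens."""
    if not decoded.startswith(history):
        raise ValueError("decoded text does not start with the history")
    continuation = decoded[len(history):]
    # keep the first non-empty segment delimited by end-of-sequence tokens
    segments = [segment.strip() for segment in continuation.split(EOS_TOKEN)]
    response = next((segment for segment in segments if segment), "")
    # drop the turn delimiter, if the model generated one
    return response.replace(SYS_DELIMITER, "").strip()


history = "[user] for four please [system] ok let me look into some options for you [user] <SILENCE>"
decoded = history + " api_call french london four cheap<|endoftext|>"
gold = "[system] api_call french london four cheap"

predicted = extract_response(decoded, history)
print(predicted)
print(predicted == gold.replace(SYS_DELIMITER, "").strip())
```

Averaging this exact-match score over the test examples whose gold response is an API call gives a measure of task success that does not rely on teacher forcing.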