# <u>Chapter 9</u>: Generating Text in Chatbots

In [1]:
import sys
import subprocess
import pkg_resources

# Find out which packages are missing.
installed_packages = {dist.key for dist in pkg_resources.working_set}
required_packages = {'torch', 'transformers', 'trl'}
missing_packages = required_packages - installed_packages

# If there are missing packages install them.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)
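
The `pkg_resources` module used above is deprecated in recent setuptools releases. As a hedged alternative, the following sketch performs the same check with the standard-library `importlib.metadata` module (Python 3.8 or later). The helper name `find_missing` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def find_missing(required):
    """Return the subset of `required` distributions that are not installed."""
    missing = set()
    for name in required:
        try:
            # Raises PackageNotFoundError if the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.add(name)
    return missing


# Same package list as in the cell above.
print(find_missing({'torch', 'transformers', 'trl'}))
```

The returned set can be passed to the same `pip install` call as before.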

## Fine-tuning the pre-trained model using reinforcement learning

We use the `Transformer Reinforcement Learning` (trl) library, which trains transformer language models with `Proximal Policy Optimization` (PPO).
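
As a rough illustration of what PPO optimizes, the sketch below computes the clipped surrogate objective for a single action. This is a simplified stand-alone example, not the trl implementation: the function name `clipped_surrogate` and the value `epsilon=0.2` are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective for one action.

    ratio:     probability of the action under the new policy divided by
               its probability under the old policy.
    advantage: estimate of how much better the action was than expected.
    epsilon:   clipping range (assumed value, for illustration only).
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy change with a positive advantage is capped by the clip.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

The clipping removes the incentive to move the policy far from the previous one in a single update, which is what keeps fine-tuning stable.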

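We use the small version of the _DialoGPT_ model. Two copies are loaded: `gpt2_model` is fine-tuned, while `gpt2_model_ref` stays frozen as the reference against which the trainer measures how far the fine-tuned model has drifted.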

In [2]:
import torch
from transformers import GPT2Tokenizer
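# Note: the trl.gpt2 and trl.ppo modules belong to early trl releases;
# later releases reorganized the API, so an older trl version may need to be pinned.
from trl.gpt2 import GPT2HeadWithValueModel, respond_to_batch
from trl.ppo import PPOTrainer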

# Load the models.
gpt2_model = GPT2HeadWithValueModel.from_pretrained('microsoft/DialoGPT-small')
gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained('microsoft/DialoGPT-small')
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('microsoft/DialoGPT-small')

Some weights of GPT2HeadWithValueModel were not initialized from the model checkpoint at microsoft/DialoGPT-small and are newly initialized: ['transformer.h.4.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'v_head.summary.bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'v_head.summary.weight', 'transformer.h.2.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.3.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Next, we define the _chat_ function.

In [3]:
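# Chat with the bot using a new input and the previous history.
def chat(user_input, history=None, gen_kwargs=None):

    # Avoid mutable default arguments; fall back to an empty keyword dict.
    gen_kwargs = gen_kwargs or {}

    # Tokenize the input and append the end-of-sequence token.
    new_user_input_ids = gpt2_tokenizer.encode(user_input + gpt2_tokenizer.eos_token, return_tensors='pt')

    # Prepend the dialogue history (a nested list of token ids), if there is any.
    if history:
        bot_input_ids = torch.cat([torch.LongTensor(history), new_user_input_ids], dim=-1)
    else:
        bot_input_ids = new_user_input_ids

    # Generate the response of the bot; the output holds the whole dialogue so far.
    new_history = gpt2_model.generate(bot_input_ids, **gen_kwargs).tolist()

    # Split the decoded text at the end-of-sequence token into (user, bot) pairs.
    output = gpt2_tokenizer.decode(new_history[0]).split(gpt2_tokenizer.eos_token)
    output = [(output[i], output[i+1]) for i in range(0, len(output)-1, 2)]
    return output, new_history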
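
To make the last step of `chat` concrete, here is a small stand-alone sketch of the pairing logic applied to a plain string. It needs no model; the helper name `to_turn_pairs` and the sample dialogue are made up for illustration.

```python
EOS = "<|endoftext|>"


def to_turn_pairs(decoded, eos=EOS):
    """Split a decoded dialogue at the end-of-sequence token into (user, bot) pairs."""
    parts = decoded.split(eos)
    # Step through the parts two at a time: even index = user, odd index = bot.
    return [(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]


dialogue = "Hi" + EOS + "Hello!" + EOS + "How are you?" + EOS + "Fine." + EOS
for user_turn, bot_turn in to_turn_pairs(dialogue):
    print("user:", user_turn, "| bot:", bot_turn)
```

Because every turn ends with the end-of-sequence token, the split leaves a trailing empty string, which the `len(parts) - 1` bound skips when the number of turns is even.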

We can then define the generation parameters for the model and initialize the trainer.

In [4]:
# Parameters for the model.
gen_kwargs = {
    "max_length": 1000,
    "min_length": -1,
    "top_k": 0,      # 0 disables top-k filtering
    "top_p": 1.0,    # 1.0 disables nucleus (top-p) filtering
    "do_sample": True,
    "pad_token_id": gpt2_tokenizer.eos_token_id
}

# Initialize the trainer.
ppo_config = {'batch_size': 1, 'forward_batch_size': 1}
ppo_trainer = PPOTrainer(gpt2_model, gpt2_model_ref, gpt2_tokenizer, **ppo_config)
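
With `top_k` set to 0 and `top_p` set to 1.0, neither filter removes any token, so responses are sampled from the model's full distribution. The sketch below illustrates both filters on a toy probability table. It is a simplification under stated assumptions: the helper names and the toy distribution are hypothetical, and the transformers library applies these filters to logits rather than to probabilities.

```python
def top_k_filter(probs, top_k):
    """Keep the top_k most likely tokens and renormalize; top_k <= 0 disables the filter."""
    if top_k <= 0:
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        mass += p
        if mass >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution (made up for illustration).
probs = {"happy": 0.5, "money": 0.3, "robot": 0.2}
print(top_k_filter(probs, 0))    # nothing is filtered out
print(top_p_filter(probs, 1.0))  # nothing is filtered out
print(top_p_filter(probs, 0.7))  # the least likely token is dropped
```

Unfiltered sampling keeps the responses diverse, which gives the reward signal a wider range of behavior to reinforce or discourage.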

The query is one of the three ingredients of each reinforcement learning step, alongside the response and the reward.

In [5]:
# Encode a query.
query_txt = "Does money buy happiness?"
query_tensor = gpt2_tokenizer.encode(query_txt+gpt2_tokenizer.eos_token, return_tensors="pt")

Let's perform ten interactions to help tune the language model. In practice, many more interactions are needed.

In [6]:
# Repeat the training for 10 interactions.
for _ in range(10):

    # Get a response from the chatbot, starting from an empty history.
    result, history = chat(query_txt, [], gen_kwargs)
    response_txt = result[0][1]
    response_tensor = gpt2_tokenizer.encode(response_txt+gpt2_tokenizer.eos_token, return_tensors="pt")

    # Positive reward if the response contains one of the target words.
    if any(word in response_txt for word in ('happy', 'happiness', 'fun')):
        print("+ reward: " + response_txt)
        reward = [torch.tensor(1.0)]
    # Negative reward otherwise.
    else:
        print("- reward: " + response_txt)
        reward = [torch.tensor(-1.0)]

    # Train the model with the PPO algorithm.
    train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)

- reward: If you keep barney top 10, idk if it's even cheaper
+ reward: Yes, but before there was always spend money how can you get happiness. Get happiness, get your money for it.
- reward: Megan convinces Barbel that the advice she was giving was good. In retrospect, they were the right choices.
+ reward: Money buys happiness.
+ reward: Money buy happiness?
+ reward: . can i buy happiness from your family?
+ reward: Money buys happiness. Money buys happiness. Money doesn't buy happiness. What trends should we look to bring with our little robot uprising? Money Is Happiness.
+ reward: money buy happiness religion wage happiness
+ reward: Money buy happiness. Any number guys... anyone? Please?
- reward: Why would you come live with someone else when you can have this mother?
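
The reward rule in the loop is a plain substring test, so a response containing "unhappy" or "funeral" would also earn a positive reward. As a hedged variation, not the notebook's method, the sketch below factors the rule into a function that matches whole words only. The name `keyword_reward` and the regular expression are assumptions for illustration.

```python
import re

# Match the target words only as whole words, ignoring case.
TARGET_WORDS = re.compile(r"\b(happy|happiness|fun)\b", re.IGNORECASE)


def keyword_reward(response_txt):
    """Return +1.0 if the response contains a target word, otherwise -1.0."""
    return 1.0 if TARGET_WORDS.search(response_txt) else -1.0


for text in ("Money buys happiness.", "I am unhappy.", "See you at the funeral."):
    print(keyword_reward(text), text)
```

To use it in the loop, the returned value would be wrapped as `[torch.tensor(keyword_reward(response_txt))]`.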


## What we have learned …

| |
| --- |
| **ML concepts** <ul><li>Fine-tuning</li><li>Reinforcement learning</li></ul> |
| |