In this notebook we fine-tune a GPT-Neo model (`EleutherAI/gpt-neo-125M`) on the Wizard of Wikipedia (WoW) dataset.

# GPT fine-tuning

We load the train, validation, and test splits.

In [1]:
import json
import torch

# Base directory
base_dir = './wizard_of_wikipedia/'

# Load the data
with open(base_dir + 'train.json') as f:
    train_data = json.load(f)
with open(base_dir + 'valid_random_split.json') as f:
    valid_data = json.load(f)
with open(base_dir + 'test_random_split.json') as f:
    test_data = json.load(f)

In [None]:
train_data[0]

We start from a GPT-Neo model from Hugging Face. The tokenizer is fetched by model id, while the weights are loaded from the local `./model` directory.

In [None]:
from transformers import GPTNeoForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"

# Prefer CUDA, then Apple-silicon MPS, and fall back to CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = GPTNeoForCausalLM.from_pretrained('./model').to(device)

In [8]:
tokenizer.pad_token = tokenizer.eos_token

In [3]:
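def extract_checked_sentence(utterance):
    # Only some turns carry a 'checked_sentence' dict; its first value is used as the passage
    try:
        checked_sentence = list(utterance['checked_sentence'].values())[0]
        return 'PASSAGE: ' + checked_sentence + '\n'
    except (KeyError, IndexError, AttributeError, TypeError):
        # Field missing, empty, or not a dict of strings: emit no PASSAGE line
        return ''

def parse_dialog(dialog):
    # One SPEAKER / (optional PASSAGE) / TEXT block per turn, separated by blank lines
    return '\n'.join([
        f'SPEAKER: {utterance["speaker"]}\n' + \
        extract_checked_sentence(utterance) + \
        f'TEXT: {utterance["text"]}\n'
    for utterance in dialog])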

def parse_data(dataset):
    return [
        f'CHOSEN_TOPIC: {sample["chosen_topic"]}\n' + \
        f'PERSONA: {sample["persona"]}\n' + \
        parse_dialog(sample['dialog'])
    for sample in dataset]
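
To make the resulting prompt layout concrete, here is a minimal standalone sketch that applies the same formatting to a hand-written record. The record, its `checked_sentence` key, and the helper names `format_turn` and `format_sample` are hypothetical; the helpers only restate the logic of `parse_dialog` and `parse_data` so the example runs on its own.

```python
# Hypothetical record that mimics the fields read by the parsing functions.
toy_sample = {
    "chosen_topic": "Chess",
    "persona": "i like board games.",
    "dialog": [
        {
            "speaker": "0_Wizard",
            "text": "Chess is played on an 8x8 board.",
            # The key name is made up; only the first value is used.
            "checked_sentence": {"chosen_Chess_0": "Chess is a two-player strategy board game."},
        },
        {"speaker": "1_Apprentice", "text": "I play it every week!"},
    ],
}

def format_turn(utterance):
    # The PASSAGE line appears only when a checked sentence exists.
    passages = list(utterance.get("checked_sentence", {}).values())
    passage_line = f"PASSAGE: {passages[0]}\n" if passages else ""
    return f"SPEAKER: {utterance['speaker']}\n{passage_line}TEXT: {utterance['text']}\n"

def format_sample(sample):
    header = f"CHOSEN_TOPIC: {sample['chosen_topic']}\nPERSONA: {sample['persona']}\n"
    # Each turn ends with a newline, so joining with "\n" leaves a blank line between turns.
    return header + "\n".join(format_turn(u) for u in sample["dialog"])

toy_text = format_sample(toy_sample)
print(toy_text)
```

The Wizard turn contributes three lines (speaker, passage, text) and the Apprentice turn two, which is the same pattern visible in the printed training sample further down.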

In [4]:
train_parsed = parse_data(train_data)
valid_parsed = parse_data(valid_data)
test_parsed = parse_data(test_data)

In [5]:
from datasets import Dataset

train_parsed = Dataset.from_dict({'text': train_parsed})
valid_parsed = Dataset.from_dict({'text': valid_parsed})
test_parsed = Dataset.from_dict({'text': test_parsed})

In [6]:
from datasets import DatasetDict

data = DatasetDict()
data['train'] = train_parsed
data['validation'] = valid_parsed
data['test'] = test_parsed

In [9]:
def tokenize_function(examples):
    input_encodings = tokenizer(examples["text"], padding=True, truncation=True)
    sample = {
        'input_ids': input_encodings.input_ids
    }
    return sample

tokenized_data = data.map(tokenize_function, batched=True)

100%|██████████| 19/19 [00:12<00:00,  1.48ba/s]
100%|██████████| 1/1 [00:00<00:00,  1.70ba/s]
100%|██████████| 1/1 [00:00<00:00,  2.22ba/s]


In [10]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
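
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of `input_ids`, and the model shifts them internally for next-token prediction. As far as we understand, label positions equal to the pad token id are set to -100 so the loss ignores them. Since `pad_token` is `eos_token` here, genuine end-of-text tokens would be ignored as well. The pure-Python sketch below only illustrates that masking and is not the library's implementation; `make_causal_lm_labels` and every token id except 50256 are made up for the example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing padding positions with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]

# Hypothetical left-padded sequence; 50256 is the EOS id, reused here as padding.
pad_id = 50256
example_ids = [pad_id, pad_id, 464, 3290, 318]
example_labels = make_causal_lm_labels(example_ids, pad_id)
print(example_labels)  # [-100, -100, 464, 3290, 318]
```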

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "cooler_trainer_name", 
    evaluation_strategy="steps",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=6.25e-5,
    lr_scheduler_type="linear",
    per_device_eval_batch_size=1,
    use_mps_device=True
)
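
Two consequences of these arguments are easy to check by hand: the effective batch size, and how the learning rate evolves under the linear schedule. The sketch below assumes a single device and assumes the schedule is the usual linear warmup followed by linear decay to zero (with no warmup steps by default); `linear_lr` and the total step count are hypothetical and serve only as an illustration.

```python
def linear_lr(step, total_steps, base_lr=6.25e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Values taken from the TrainingArguments above (single device assumed).
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print("effective batch size:", effective_batch_size)

# Hypothetical number of optimizer steps, chosen only for illustration.
total_steps = 1000
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps))
```

Gradient accumulation keeps memory usage at one sample per forward pass while the optimizer still updates on four samples at a time.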

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_data['train'], 
    eval_dataset=tokenized_data['validation'],
    data_collator=data_collator
)

In [None]:
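# Resume from the latest checkpoint in the output directory ("cooler_trainer_name");
# call trainer.train() without the argument if no checkpoint has been saved yet
trainer.train(resume_from_checkpoint=True)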

In [42]:
print(train_parsed[0]['text'])

CHOSEN_TOPIC: Science fiction
PERSONA: i enjoy movies about aliens invading the earth.
SPEAKER: 0_Wizard
PASSAGE: Science fiction (often shortened to SF or sci-fi) is a genre of speculative fiction, typically dealing with imaginative concepts such as futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life.
TEXT: I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts.

SPEAKER: 1_Apprentice
TEXT: I'm a huge fan of science fiction myself! 

SPEAKER: 0_Wizard
PASSAGE: Science fiction films have often been used to focus on political or social issues, and to explore philosophical issues like the human condition.
TEXT: Awesome! I really love how sci-fi storytellers focus on political/social/philosophical issues that would still be around even in the future. Makes them relatable.

SPEAKER: 1_Apprentice
TEXT: I agree. One of

Let's test the model on a few samples. Note that the code below draws its prompts from the training set (`train_parsed`), not from the test set.

In [None]:
GENERATION_LENGTH = 200

test_index = [0, 5, 6, 12, 13, 19, 50]

outputs = []

for i in test_index:
    sample_text = train_parsed[i]['text']
    # Prompt with the first five lines (for a Wizard-first dialog:
    # topic, persona, speaker, passage, first utterance)
    prompt = '\n'.join(sample_text.split('\n')[:5])
    # Inputs must live on the same device as the model
    encoded_input = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    encoded_output = model.generate(encoded_input, do_sample=True, max_length=GENERATION_LENGTH, top_p=0.95, temperature=0.85)
    decoded_output = tokenizer.decode(encoded_output[0], skip_special_tokens=True)
    output = decoded_output.split('\n')
    # Keep the topic, the first speaker, the first utterance,
    # and the generated reply (speaker and text lines, kept as a list)
    topic_output = [output[0], output[2], output[4], output[6:8]]
    outputs.append(topic_output)

In [41]:
for output in outputs:
    for elem in output:
        print(elem)
    print('\n')

CHOSEN_TOPIC: Science fiction
SPEAKER: 0_Wizard
TEXT: I think science fiction is an amazing genre for anything. Future science, technology, time travel, FTL travel, they're all such interesting concepts. I don't want to be a total science person.
['SPEAKER: 1_Apprentice', 'TEXT: I agree, I love science fiction. Have you ever watched it?']


CHOSEN_TOPIC: Romance (love)
SPEAKER: 0_Wizard
TEXT: I don't know how to be romantic. I have trouble expressing emotional attraction. Do you enjoy romance?
['SPEAKER: 1_Apprentice', 'TEXT: I love romance. It is the easiest way to get into love']


CHOSEN_TOPIC: Krav Maga
SPEAKER: 0_Wizard
TEXT: Hello. I hope you might enjoy or know something about Krav Maga? It's a sport in which people try to keep their balance. Do you like it?
['SPEAKER: 1_Apprentice', 'TEXT: I enjoy it too! I love krav maga!']


CHOSEN_TOPIC: The Hershey Company
SPEAKER: 0_Wizard
TEXT: Hi there, I love chocolate, my favorite brand of chocolate is Hershey coming from my local city

In [101]:
topic = 'AlphaZero'
passage = """
AlphaZero is a computer program developed by artificial intelligence research company DeepMind to master the games of chess, shogi and go. This algorithm uses an approach similar to AlphaGo Zero.

On December 5, 2017, the DeepMind team released a preprint paper introducing AlphaZero, which within 24 hours of training achieved a superhuman level of play in these three games by defeating world-champion programs Stockfish, Elmo, and the three-day version of AlphaGo Zero. In each case it made use of custom tensor processing units (TPUs) that the Google programs were optimized to use.[1] AlphaZero was trained solely via self-play using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks, all in parallel, with no access to opening books or endgame tables. After four hours of training, DeepMind estimated AlphaZero was playing chess at a higher Elo rating than Stockfish 8; after nine hours of training, the algorithm defeated Stockfish 8 in a time-controlled 100-game tournament (28 wins, 0 losses, and 72 draws).[1][2][3] The trained algorithm played on a single machine with four TPUs.

DeepMind's paper on AlphaZero was published in the journal Science on 7 December 2018;[4] however, the AlphaZero program itself has not been made available to the public.[5] In 2019, DeepMind published a new paper detailing MuZero, a new algorithm able to generalise AlphaZero's work, playing both Atari and board games without knowledge of the rules or representations of the game.[6]
"""
text = 'AlphaZero is so impressive! When I first read it was far better at playing than the world champion I was astonished!\n'
# Named `prompt` to avoid shadowing the built-in `input`
prompt = f'CHOSEN_TOPIC: {topic}\n' \
    'PERSONA: I am a chess enthusiast.\n' \
    'SPEAKER: 0_Wizard\n' \
    f'PASSAGE: {passage}\n' \
    f'TEXT: {text}'

# Inputs must live on the same device as the model
encoded_input = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
encoded_output = model.generate(encoded_input, do_sample=True, max_length=GENERATION_LENGTH*3, top_p=0.95, temperature=0.80)
decoded_output = tokenizer.decode(encoded_output[0], skip_special_tokens=True)
output = decoded_output.split('\n')
print(output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['CHOSEN_TOPIC: AlphaZero', 'PERSONA: I am a chess enthusiast.', 'SPEAKER: 0_Wizard', 'PASSAGE: ', 'AlphaZero is a computer program developed by artificial intelligence research company DeepMind to master the games of chess, shogi and go. This algorithm uses an approach similar to AlphaGo Zero.', '', 'On December 5, 2017, the DeepMind team released a preprint paper introducing AlphaZero, which within 24 hours of training achieved a superhuman level of play in these three games by defeating world-champion programs Stockfish, Elmo, and the three-day version of AlphaGo Zero. In each case it made use of custom tensor processing units (TPUs) that the Google programs were optimized to use.[1] AlphaZero was trained solely via self-play using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks, all in parallel, with no access to opening books or endgame tables. After four hours of training, DeepMind estimated AlphaZero was playing chess 
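
The warning above appears because `tokenizer.encode` returns only token ids, so `generate` receives no attention mask and has to pick a pad token id itself. A possible way to avoid it is sketched below: call the tokenizer directly to obtain the mask, and pass `pad_token_id` explicitly. The helper name `generate_continuation` and its default length are our own choices, not part of the notebook. It uses `max_new_tokens` instead of `max_length`, because `max_length` also counts the prompt tokens, which matters for a long passage like this one.

```python
def generate_continuation(model, tokenizer, prompt, max_new_tokens=100):
    """Generate text after `prompt`, passing the attention mask and pad id explicitly."""
    # Calling the tokenizer (rather than .encode) returns input_ids and attention_mask.
    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        max_new_tokens=max_new_tokens,
    )
    # Drop the prompt tokens so that only the newly generated part is decoded.
    new_tokens = output_ids[0][encoded["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

In this notebook it would be called with the `model` and `tokenizer` loaded earlier and the prompt string built in the cell above.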

In [102]:
print(output[10])
print(output[14])

TEXT: AlphaZero is so impressive! When I first read it was far better at playing than the world champion I was astonished!
TEXT: It was so good at playing against a large opponent that it made the world champions the most powerful of the players.
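
Picking lines by fixed position (`output[10]`, `output[14]`) only works while the generated text has exactly the expected number of lines. A more tolerant alternative is to group lines by their prefix. The sketch below assumes the `SPEAKER:` / `PASSAGE:` / `TEXT:` layout used for the prompts in this notebook; `parse_turns` is a hypothetical helper, and the sample string is shortened from one of the generations shown earlier.

```python
def parse_turns(generated_text):
    """Group SPEAKER / PASSAGE / TEXT lines of the prompt format into turns."""
    turns, current = [], None
    for line in generated_text.split("\n"):
        if line.startswith("SPEAKER: "):
            # A SPEAKER line opens a new turn.
            current = {"speaker": line[len("SPEAKER: "):]}
            turns.append(current)
        elif current is not None and line.startswith("TEXT: "):
            current["text"] = line[len("TEXT: "):]
        elif current is not None and line.startswith("PASSAGE: "):
            # Only the first line of a multi-line passage is kept.
            current["passage"] = line[len("PASSAGE: "):]
    return turns

sample_generation = (
    "CHOSEN_TOPIC: Krav Maga\n"
    "SPEAKER: 0_Wizard\n"
    "TEXT: Hello.\n"
    "\n"
    "SPEAKER: 1_Apprentice\n"
    "TEXT: I enjoy it too! I love krav maga!\n"
)
turns = parse_turns(sample_generation)
for turn in turns:
    print(turn["speaker"], "->", turn.get("text", ""))
```

Lines that match no prefix, such as the topic line or blank lines, are skipped, so a missing or extra line in the generation does not raise an `IndexError`.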
