# Description
#### This notebook presents a step-by-step example of training and testing *GPT2*. The user can choose the model type and the character to train on. By default, GPT2-Small is trained for Ross in two stages: first on monologues, then on replics (dialogue lines).

#### The notebook is structured as follows:
- Pre-setting
- Training on monologues
- Additional training on replics
- Testing the model


# Pre-setting
#### This section covers the routine setup:
- the specific packages are installed
- the training data is cloned
- the working directories are set

#### NB: the following settings can be changed here:
- the working directory
- _character_ and _type of model_
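
The character choice made in the cells below boils down to a lookup from an English name to the Russian key used in the data file names. A minimal non-interactive sketch of that lookup, assuming the same name mapping and the same default (`РОСС`) as the notebook; `resolve_character` and `CHARS_RU2EN` are hypothetical names that do not appear in the notebook itself:

```python
# Hypothetical helper, not part of the original notebook.
CHARS_RU2EN = {
    'ДЖОУИ': 'Joey',
    'МОНИКА': 'Monica',
    'РЕЙЧЕЛ': 'Rachel',
    'РОСС': 'Ross',
    'ФИБИ': 'Phoebe',
    'ЧЕНДЛЕР': 'Chandler',
}


def resolve_character(en_name: str = '', default: str = 'РОСС') -> str:
    """Return the Russian key for an English character name.

    An empty string selects the default character.
    """
    if en_name == '':
        return default
    en2ru = {en: ru for ru, en in CHARS_RU2EN.items()}
    if en_name not in en2ru:
        raise ValueError(f"Unknown character: {en_name!r}")
    return en2ru[en_name]


print(resolve_character('Monica'))  # МОНИКА
print(resolve_character())          # РОСС
```

Raising an error on an unknown name, rather than asking again, is a design choice that suits scripted runs where no one is present to answer `input()`.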

In [1]:
import os
import torch
import pathlib
import pandas as pd
from IPython.display import clear_output

In [2]:
# Set the project directory here.
# The trained model binaries will be saved in it,
# so make sure there is enough free space.
# By default, it is the current working directory.

ROOT_DIR = os.path.abspath(os.getcwd())

In [3]:
chars_ru2en = {
    'ДЖОУИ'  : 'Joey',
    'МОНИКА' : 'Monica',
    'РЕЙЧЕЛ' : 'Rachel',
    'РОСС'   : 'Ross',
    'ФИБИ'   : 'Phoebe',
    'ЧЕНДЛЕР': 'Chandler',
}
en_names = list(chars_ru2en.values())
print("Please choose one of the following characters, or leave the field blank to use the default:")
print(", ".join(en_names))

target = input()
en_names += ['']
while target not in (en_names):
    print('Choose another name, please')
    target = input()
if target == '':
    CHARACTER = 'РОСС'
else:
    CHARACTER = {j: i for i, j in chars_ru2en.items()}[target]
clear_output(wait=True)

the_models = ['gpt2', 'gpt2-medium', 'gpt2-large', '']
print("Please, choose one of the models:")
print(", ".join(the_models))

target = input()
while target not in (the_models):
    print('Choose another model, please')
    target = input()
if target == '':
    model_type = "gpt2"
else:
    model_type = target
clear_output()

print(f"Character: {CHARACTER}")
print(f"Model type: {model_type}")

Character: РОСС
Model type: gpt2


In [4]:
!pip install transformers==4.2.2

Collecting transformers==4.2.2
  Downloading transformers-4.2.2-py3-none-any.whl (1.8 MB)

In [5]:
## Uncomment it to check Cuda availability 
# !nvidia-smi

In [6]:
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, AutoModelWithLMHead

In [7]:
# Clone the git repository with the data

!git clone https://github.com/Alenush/style_transfer_sirius2021summer.git

Cloning into 'style_transfer_sirius2021summer'...
remote: Enumerating objects: 1830, done.
remote: Counting objects: 100% (1830/1830), done.
remote: Compressing objects: 100% (1152/1152), done.
remote: Total 1830 (delta 833), reused 1641 (delta 650), pack-reused 0
Receiving objects: 100% (1830/1830), 54.74 MiB | 14.55 MiB/s, done.
Resolving deltas: 100% (833/833), done.
Checking out files: 100% (663/663), done.


In [8]:
%%bash
# ... get the latest updates (the cell magic must be the first line of the cell).
cd style_transfer_sirius2021summer
git checkout master
git pull

cd -

Your branch is up to date with 'origin/master'.
Already up to date.
/content


Already on 'master'


# Training on monologues

#### Here the chosen `gpt2-...` model is downloaded from [huggingface](https://huggingface.co/gpt2-medium). Since we use a two-stage training scheme, gpt2 is first trained on the monologues, which contain all the replics (lines) spoken by a character throughout the series. The monologues are loaded into the notebook already preprocessed and are tokenized into the final datasets. After that, a `Trainer` object is initialized with `TrainingArguments` and the model. Finally, the model is trained and saved.

In [9]:
path_to_data_1 = 'style_transfer_sirius2021summer/data/train_data/en/mono/'
train_path = path_to_data_1 + f'{CHARACTER}_mono_train_9to1_en.txt'
test_path = path_to_data_1 + f'{CHARACTER}_mono_valid_9to1_en.txt'

tokenizer = AutoTokenizer.from_pretrained(model_type)

block_size_ = 128

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=block_size_)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=block_size_)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (255810 > 1024). Running this sequence through the model will result in indexing errors
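
The warning above appears because the whole file is tokenized in one pass. It seems harmless here, since the token stream is then cut into blocks of `block_size_` tokens before anything reaches the model. A rough pure-Python sketch of that chunking, assuming (to the best of our understanding of `TextDataset` in this library version) that a trailing incomplete block is dropped and that GPT-2 adds no special tokens per block; `chunk_tokens` is a hypothetical name:

```python
from typing import List


def chunk_tokens(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split a flat token sequence into consecutive full blocks.

    A trailing remainder shorter than block_size is discarded.
    """
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


blocks = chunk_tokens(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]; tokens 8 and 9 are dropped
```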


In [11]:
model = AutoModelWithLMHead.from_pretrained(model_type)

output_folder = f"./models/en_{model_type}_{chars_ru2en[CHARACTER]}_mono"
training_args = TrainingArguments(
    output_dir = output_folder,
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=5, # number of training epochs
    per_device_train_batch_size=2, # batch size for training
    per_device_eval_batch_size=2,  # batch size for evaluation
    eval_steps = 400, # number of update steps between two evaluations (only used when evaluation_strategy="steps")
    save_steps=800, # a checkpoint is saved every 800 update steps
    dataloader_drop_last=True # avoid an error with an incomplete batch
    )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



Downloading:   0%|          | 0.00/548M [00:00<?, ?B/s]

In [12]:
trainer.train()

Step,Training Loss
500,1.7643


TrainOutput(global_step=999, training_loss=1.7124591021685749, metrics={'train_runtime': 261.2955, 'train_samples_per_second': 3.823, 'total_flos': 190948405542912, 'epoch': 1.0})
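
The logged `global_step=999` at `'epoch': 1.0` can be cross-checked against the token count printed in the tokenizer warning (255810). A small sketch of the arithmetic, assuming a single device, no gradient accumulation, blocks of 128 tokens with the remainder dropped, and `dataloader_drop_last=True`; `steps_per_epoch` is a hypothetical helper:

```python
def steps_per_epoch(num_tokens: int, block_size: int, batch_size: int) -> int:
    """Estimate optimizer steps per epoch for a block-chunked text dataset."""
    num_blocks = num_tokens // block_size  # incomplete last block is dropped
    return num_blocks // batch_size        # incomplete last batch is dropped


# Monologues: 255810 tokens, block size 128, batch size 2
print(steps_per_epoch(255810, 128, 2))  # 999

# Cleaned replics (second stage): 299417 tokens
print(steps_per_epoch(299417, 128, 2))  # 1169
```

Both values match the `global_step` reported by the two training runs in this notebook, each of which logs a single epoch.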

In [13]:
trainer.save_model()

In [14]:
# The name of the task to train.
TASK_NAME = f'{model_type}_mono_en_{chars_ru2en[CHARACTER].lower()}'

# The output directory where the fine-tuned model and checkpoints will be written.
OUTPUT_DIR = f'{ROOT_DIR}/outputs/{TASK_NAME}/'
pathlib.Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

CONFIG_NAME = "config.json"
WEIGHTS_NAME = f"{TASK_NAME}_pytorch_model.bin"

In [15]:
def saver(model, OUTPUT_DIR, WEIGHTS_NAME):
    model_to_save = model.module if hasattr(model, 'module') else model  # Unwrap a parallel wrapper, if any, so only the model itself is saved

    # If we save using the predefined names, we can load using `from_pretrained`
    output_model_file = os.path.join(OUTPUT_DIR, WEIGHTS_NAME)
    output_config_file = os.path.join(OUTPUT_DIR, CONFIG_NAME)

    torch.save(model_to_save.state_dict(), output_model_file)
    model_to_save.config.to_json_file(output_config_file)

saver(model, OUTPUT_DIR, WEIGHTS_NAME)

# Additional training on replics
#### The model trained on the monologues is then further trained on the replics. Our hypothesis is that this lets the model generate responses to cues (lines from another speaker). The training process is the same, except that the already fine-tuned gpt2 is used instead of the original pretrained one. In the example below, the datasets are switched from the monologues to the cleaned replics.

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_type)

path = 'style_transfer_sirius2021summer/data/train_data/en/cleaned_replics/'
train_path = path + f'{CHARACTER}_train_9to1_cleaned_en.txt'
test_path = path + f'{CHARACTER}_valid_9to1_cleaned_en.txt'
train_dataset,test_dataset,data_collator = load_dataset(train_path, test_path, tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (299417 > 1024). Running this sequence through the model will result in indexing errors


In [17]:
model = AutoModelWithLMHead.from_pretrained(output_folder)

final_output_folder = f"./models/en_{model_type}_{chars_ru2en[CHARACTER]}_replics"

training_args = TrainingArguments(
    output_dir=final_output_folder, # the output directory for the second stage (keeps the monologue model intact)
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=5, # number of training epochs
    per_device_train_batch_size=2, # batch size for training
    per_device_eval_batch_size=2,  # batch size for evaluation
    eval_steps = 400, # number of update steps between two evaluations (only used when evaluation_strategy="steps")
    save_steps=800, # a checkpoint is saved every 800 update steps
    dataloader_drop_last=True # avoid an error with an incomplete batch
    )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



In [18]:
trainer.train()

Step,Training Loss
500,2.3045
1000,2.2565


TrainOutput(global_step=1169, training_loss=2.2760331126948112, metrics={'train_runtime': 300.1539, 'train_samples_per_second': 3.895, 'total_flos': 223442128207872, 'epoch': 1.0})
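
A common way to read a language-modelling loss is to convert it into perplexity, the exponential of the mean cross-entropy (in nats). A small sketch using the two training losses logged in this notebook; these are averages over training steps on two different datasets, not validation losses, so the numbers are only indicative and are not directly comparable between the stages:

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (natural log) into perplexity."""
    return math.exp(mean_cross_entropy)


mono_ppl = perplexity(1.7124591021685749)     # first stage, monologues
replics_ppl = perplexity(2.2760331126948112)  # second stage, replics
print(f"monologues: {mono_ppl:.2f}, replics: {replics_ppl:.2f}")
```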

In [19]:
trainer.save_model()

In [20]:
# The name of the task to train.
TASK_NAME = f'{model_type}_mono_replics_en_{chars_ru2en[CHARACTER].lower()}'

# The output directory where the fine-tuned model and checkpoints will be written.
OUTPUT_DIR = f'{ROOT_DIR}/outputs/{TASK_NAME}/'
pathlib.Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

CONFIG_NAME = "config.json"
WEIGHTS_NAME = f"{TASK_NAME}_pytorch_model.bin"

saver(model, OUTPUT_DIR, WEIGHTS_NAME)
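
In this version of the library, loading a model from a directory expects the weights file to carry the standard name `pytorch_model.bin`, whereas `saver` writes `{TASK_NAME}_pytorch_model.bin`. This is why the testing section renames the file before loading. A self-contained sketch of that step, using a temporary directory and an empty placeholder file instead of real weights; `canonicalize_weights` is a hypothetical helper:

```python
import os
import tempfile


def canonicalize_weights(output_dir: str, task_name: str) -> str:
    """Rename '<task_name>_pytorch_model.bin' to 'pytorch_model.bin' if needed.

    Returns the path of the canonical weights file.
    """
    src = os.path.join(output_dir, f"{task_name}_pytorch_model.bin")
    dst = os.path.join(output_dir, "pytorch_model.bin")
    if os.path.exists(src) and not os.path.exists(dst):
        os.rename(src, dst)
    return dst


with tempfile.TemporaryDirectory() as tmp_dir:
    task_name = "gpt2_mono_replics_en_ross"
    # Empty placeholder standing in for the real weights file
    open(os.path.join(tmp_dir, f"{task_name}_pytorch_model.bin"), "wb").close()
    weights = canonicalize_weights(tmp_dir, task_name)
    renamed_ok = os.path.exists(weights)
    print(renamed_ok)  # True
```

Checking for the file explicitly, instead of swallowing every exception around `os.rename`, keeps unrelated errors such as permission problems visible.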

# Testing the model

To test the model we are going to use another [feature of the transformers library](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) called `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, including `text-generation`.

In [21]:
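path_to_questions = 'style_transfer_sirius2021summer/data/questions/english_questions.txt'
# One question per line. The lines are read directly because newer pandas versions reject sep="\n".
with open(path_to_questions, encoding='utf-8') as f:
    questions = pd.DataFrame({'Question': [line.strip() for line in f if line.strip()]})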

In [None]:
from tqdm.auto import tqdm  # progress bar used in the loop below

for ru_name, en_name in chars_ru2en.items():
    try:
        path_to_model = f'./outputs/{model_type}_mono_replics_en_{en_name.lower()}'

        try:
            os.rename(path_to_model + f'/{model_type}_mono_replics_en_{en_name.lower()}_pytorch_model.bin', path_to_model + '/pytorch_model.bin')
        except Exception: 
            pass

        chef = pipeline('text-generation', model=path_to_model, tokenizer=model_type)
        res = []
        EN_NAME = en_name.upper()
        for line in tqdm(questions.values.flatten().tolist(), desc='\t\t'):
            tmp = chef(f"<s>NOTFRIEND: {line}\n{EN_NAME}:")[0]['generated_text']
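            start = tmp.find(f"{EN_NAME}: ") + len(EN_NAME) + 2
            end = tmp.find('</s>')
            # find() returns -1 when '</s>' is absent; slice to the end of the text in that case
            tmp = tmp[start : end if end != -1 else None]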
            res.append(tmp)
        questions[en_name] = res
    except Exception: 
        print(f"\nNo model found for {en_name}!")

questions
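
The prompt format and the reply extraction used in the loop above can be isolated into two small functions, which makes them easy to check without loading a model. A sketch under the assumption that the generated text starts with the prompt and may or may not contain a closing `</s>`; `build_prompt` and `extract_reply` are hypothetical names, and the sample output string is made up for illustration:

```python
def build_prompt(question: str, en_name: str) -> str:
    """Format a question as a cue from NOTFRIEND followed by the character tag."""
    return f"<s>NOTFRIEND: {question}\n{en_name.upper()}:"


def extract_reply(generated: str, en_name: str) -> str:
    """Return the text after '<NAME>:' up to the first '</s>' (or to the end)."""
    marker = f"{en_name.upper()}:"
    start = generated.find(marker)
    if start == -1:
        # Character tag not found: fall back to the whole text
        return generated.strip()
    reply = generated[start + len(marker):]
    end = reply.find("</s>")
    if end != -1:
        reply = reply[:end]
    return reply.strip()


prompt = build_prompt("How are you?", "Ross")
# Made-up continuation standing in for a real pipeline output
fake_output = prompt + " I'm fine.</s><s>NOTFRIEND: ..."
print(extract_reply(fake_output, "Ross"))  # I'm fine.
```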