## Information
This notebook fine-tunes a pre-trained GPT-2 model on the dialogue data in the data.tsv file produced by the *data-collection.ipynb* notebook.
The first half of the notebook introduces all the necessary processing methods. In the second half, you can choose which plays to use for training.

<b> Note </b>: Training is only feasible on a GPU, either a local one or one provided by Colab.

In [150]:
import torch
print(torch.cuda.is_available())

True


## Processing the data

In [6]:
import pandas as pd

# Fetching the dialogue lines as a pandas dataframe
dialogue_data = pd.read_table("data.tsv")

# Removing redundant whitespace from the text data
# (the text is cast to str first, so missing values become the string "nan")
dialogue_data["text"] = dialogue_data["text"].astype(str).str.split().str.join(" ")
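
The whitespace clean-up boils down to one idiom: `split()` without arguments splits on any run of whitespace (spaces, tabs, newlines), and joining the pieces with a single space collapses each run. A minimal sketch on plain strings, where `normalize_whitespace` is a hypothetical helper name and the sample strings are made up for illustration:

```python
def normalize_whitespace(text):
    # split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace; join() re-inserts single spaces.
    return " ".join(str(text).split())


# Made-up sample strings, not taken from data.tsv
samples = ["  Guten   Morgen! ", "Wer\tist\n da?"]
cleaned = [normalize_whitespace(s) for s in samples]
print(cleaned)
```

Because the value is passed through `str()` first, a missing value (NaN) ends up as the literal string "nan" rather than raising an error.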

In [144]:
def get_lines_for_play_id(play_id=0, feature="character_gender"):
    '''
    :param play_id: The play_id of the play whose dialogue lines we want
    :param feature: The feature that is placed in front of each dialogue line (e.g. character_name, character_gender)
    :return: List with all the lines of the play with play_id in the following format: "<s>{feature}: {text}"
    '''
    line_list = []
    dialogues_play_id = dialogue_data[dialogue_data['play_id'] == play_id][[feature, "text"]]
    # index=False yields plain (feature, text) tuples without the row index
    for feature_value, text in dialogues_play_id.itertuples(index=False):
        line = "<s>" + str(feature_value) + ": " + text
        line_list.append(line)
    return line_list
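
To make the resulting format concrete, here is a small sketch that applies the same `"<s>{feature}: {text}"` pattern to a few hand-written records instead of `data.tsv`. The rows, the gender labels and the helper name `format_lines` are invented for illustration; only the column names `play_id`, `character_gender` and `text` come from the notebook.

```python
# Hand-written stand-in for a few rows of the dialogue table (hypothetical values)
rows = [
    {"play_id": 0, "character_gender": "FEMALE", "text": "Guten Morgen!"},
    {"play_id": 0, "character_gender": "MALE", "text": "Wer ist da?"},
    {"play_id": 1, "character_gender": "MALE", "text": "Niemand."},
]


def format_lines(rows, play_id=0, feature="character_gender"):
    # Keep only the rows of the requested play and prefix each line
    # with "<s>" and the value of the chosen feature.
    return [
        "<s>" + str(row[feature]) + ": " + row["text"]
        for row in rows
        if row["play_id"] == play_id
    ]


lines = format_lines(rows, play_id=0)
for line in lines:
    print(line)
```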


In [67]:
from transformers import AutoTokenizer
# We need the tokenizer in the next method
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [99]:
import math

def input_to_chunks(play_id, feature="character_gender"):
    '''
    :param play_id: The play_id of the play that is split into training examples
    :param feature: The feature that is placed in front of each dialogue line
    :return: Two lists (input ids, attention masks) with one entry per training example of at most 512 tokens; each entry is wrapped in a one-element list
    '''
    i = 0

    final_ids = []
    final_mask = []

    new_lines = get_lines_for_play_id(play_id, feature)
    # Iterate over the whole list of lines
    while i < (len(new_lines)-1):
        print(f"i: {i}")  # debug output
        ids_i = []
        mask_i = []
        # Iterate from current i up to (potentially) the end of the lines
        for j in range(i, len(new_lines)):
            # Fetch the input_ids and the mask values for line j
            encoding = tokenizer(new_lines[j])
            ids_j, mask_j = encoding["input_ids"], encoding["attention_mask"]

            # Only append the ids of line j if the resulting training example is at most 512 tokens long
            if len(ids_i) + len(ids_j) <= 512:
                ids_i += ids_j
                mask_i += mask_j
                # If j is the last line, we want to break out of the j loop and also the i loop (by setting i=j) since we reached the end of our lines
                if j == (len(new_lines)-1):
                    i = j
                    print("j == len(list) (before else)")  # debug output

                    # Append the ids and mask to the final list
                    final_ids.append([ids_i])
                    final_mask.append([mask_i])
                    break

            else:
                # This branch may be unnecessary. If j is the last line and does not fit,
                # we leave both loops (by setting i=j); the ids collected so far are not appended.
                if j == (len(new_lines)-1):
                    i = j
                    print("j == len(list)")  # debug output
                    break
                else:
                    # Otherwise, start the next example about 3/4 of the way through the current one, so that consecutive examples can overlap
                    print("----")
                    print(f"Length at the end of step i = {i}: {len(ids_i)}")
                    print(f"incrementing i: {i} -> {i + math.floor(3*(j-i)/4) + 1} \t j: {j}, floor term:{math.floor(3*(j-i)/4)}")  # debug output
                    i = i + math.floor(3*(j-i)/4) + 1

                    # Append the ids and mask to the final list
                    final_ids.append([ids_i])
                    final_mask.append([mask_i])

                    break
    return final_ids, final_mask
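
The window logic is easier to follow without the tokenizer. The sketch below replays the same control flow on a list of per-line token counts and records which lines end up in each training example. The function name `chunk_windows` and the token counts are made up for illustration; the 512-token limit and the step `floor(3*(j-i)/4) + 1` are taken from `input_to_chunks`.

```python
import math


def chunk_windows(line_lengths, max_tokens=512):
    """Return (start, end) line-index pairs (end exclusive), one per training example."""
    windows = []
    n = len(line_lengths)
    i = 0
    while i < n - 1:
        total = 0
        for j in range(i, n):
            if total + line_lengths[j] <= max_tokens:
                total += line_lengths[j]
                if j == n - 1:
                    # Reached the last line: emit the final window and stop
                    windows.append((i, j + 1))
                    i = j
                    break
            else:
                if j == n - 1:
                    # Mirrors the original: stop without emitting this window
                    i = j
                    break
                # Line j does not fit: emit lines i..j-1, then restart
                # roughly 3/4 of the way through that window
                windows.append((i, j))
                i = i + math.floor(3 * (j - i) / 4) + 1
                break
    return windows


# Twelve hypothetical lines of 100 tokens each
print(chunk_windows([100] * 12))
```

With twelve lines of 100 tokens each, every example holds at most five lines, and each new example restarts on the last line of the previous one, which is how neighbouring training examples come to share some context.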


In [152]:

# Testing out the methods:

# input_ids, input_mask = input_to_chunks(play_id=187)
#
# print("----------------")
# for i in range(len(input_ids)):
#     print(f"i: {i}\t length: {len(input_ids[i][0])}")
#     print(input_ids[i][0][:10])
#     print(input_mask[i][0][:10])
#     print(tokenizer.decode(input_ids[i][0]), "\n")

In [154]:
from datasets import Dataset

def get_datasets(train_play_ids, test_play_ids):
    '''
    :param train_play_ids: A list of the play_ids on which the model should be trained
    :param test_play_ids: A list of the play_ids on which the model should be validated
    :return: Two datasets, one for training and one for validating
    '''

    def build_dict(play_ids):
        # Collect the chunks of all given plays into one dict of columns
        data = {'input_ids': [], 'attention_mask': []}
        for play_id in play_ids:
            input_ids, input_mask = input_to_chunks(play_id=play_id)
            # Each chunk is wrapped in a one-element list, hence the [0]
            for ids, mask in zip(input_ids, input_mask):
                data["input_ids"].append(ids[0])
                data["attention_mask"].append(mask[0])
        return data

    train_dataset = Dataset.from_dict(build_dict(train_play_ids))
    test_dataset = Dataset.from_dict(build_dict(test_play_ids))
    return train_dataset, test_dataset


## Training

In [155]:
# The plays we want to use for training (ids 0 to 6) and for validation (id 7)
dataset_train, dataset_test = get_datasets([0, 1, 2, 3, 4, 5, 6], [7])

# Checking what our datasets look like
print("\n\n-----------Datasets-------------")
print(dataset_train)
print(dataset_test)

i: 0
----
Length at the end of step i = 0: 498
incrementing i: 0 -> 7 	 j: 9, floor term:6
i: 7
----
Length at the end of step i = 7: 486
incrementing i: 7 -> 17 	 j: 20, floor term:9
i: 17
----
Length at the end of step i = 17: 468
incrementing i: 17 -> 23 	 j: 24, floor term:5
i: 23
----
Length at the end of step i = 23: 294
incrementing i: 23 -> 30 	 j: 31, floor term:6
i: 30
----
Length at the end of step i = 30: 490
incrementing i: 30 -> 34 	 j: 35, floor term:3
i: 34
----
Length at the end of step i = 34: 490
incrementing i: 34 -> 42 	 j: 44, floor term:7
i: 42
----
Length at the end of step i = 42: 499
incrementing i: 42 -> 58 	 j: 63, floor term:15
i: 58
----
Length at the end of step i = 58: 502
incrementing i: 58 -> 73 	 j: 77, floor term:14
i: 73
----
Length at the end of step i = 73: 488
incrementing i: 73 -> 89 	 j: 93, floor term:15
i: 89
----
Length at the end of step i = 89: 512
incrementing i: 89 -> 102 	 j: 106, floor 

In [133]:
from transformers import AutoModelForCausalLM

# Setting the tokenizer and the model
# (AutoModelForCausalLM loads the GPT2LMHeadModel architecture named in the model config)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
tokenizer.pad_token = '<pad>'

model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at C:\Users\User/.cache\huggingface\hub\models--dbmdz--german-gpt2\snapshots\f0edef6d975b1338bae533502e1dae74974cb2d2\config.json
Model config GPT2Config {
  "_name_or_path": "dbmdz/german-gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index

In [134]:
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer

# Setting up the training procedure

training_args = TrainingArguments(output_dir="Theater_Model",
                                  report_to="none",
                                  overwrite_output_dir=True,
                                  per_device_train_batch_size=4,
                                  per_device_eval_batch_size=4,
                                  gradient_accumulation_steps=2,
                                  evaluation_strategy="steps",
                                  eval_steps=30,
                                  num_train_epochs=5,
                                  save_steps=90,
                                  logging_steps=30)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    data_collator=data_collator
)

PyTorch: setting up devices
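
With `mlm=False`, `DataCollatorForLanguageModeling` prepares batches for causal language modelling: it pads every example in a batch to the same length and copies `input_ids` into `labels`, replacing pad positions with -100 so that the loss ignores them. The plain-Python sketch below mimics that behaviour on toy lists. It is an illustration under the assumption of right padding, not the library's implementation; `collate_causal_lm`, the toy token ids and the pad id 0 are hypothetical.

```python
def collate_causal_lm(examples, pad_token_id):
    # Pad all examples to the length of the longest one in the batch
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        ids = ex["input_ids"] + [pad_token_id] * n_pad
        batch["input_ids"].append(ids)
        # Padded positions get attention mask 0
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
        # Labels are a copy of the inputs; pad tokens are set to -100
        batch["labels"].append([-100 if t == pad_token_id else t for t in ids])
    return batch


toy_examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "attention_mask": [1]},
]
batch = collate_causal_lm(toy_examples, pad_token_id=0)
print(batch)
```

Because the labels are just the inputs, the shift by one position needed for next-token prediction is left to the model's forward pass.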


In [135]:
# Run the training loop
trainer.train()

***** Running training *****
  Num examples = 368
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 230
  Number of trainable parameters = 124445952
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


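| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 30   | 4.3006        | 4.568961        |
| 60   | 3.9538        | 4.452933        |
| 90   | 3.7688        | 4.406631        |
| 120  | 3.6596        | 4.390954        |
| 150  | 3.5947        | 4.39018         |
| 180  | 3.5312        | 4.384194        |
| 210  | 3.491         | 4.387879        |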
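
One way to read these numbers: the training loss keeps falling, while the validation loss flattens out at roughly 4.39 from step 120 onward. A short sketch that locates the step with the lowest validation loss and the final gap between the two losses, using the values logged above (the variable names are arbitrary):

```python
# (step, training loss, validation loss) as logged during training
history = [
    (30, 4.3006, 4.568961),
    (60, 3.9538, 4.452933),
    (90, 3.7688, 4.406631),
    (120, 3.6596, 4.390954),
    (150, 3.5947, 4.39018),
    (180, 3.5312, 4.384194),
    (210, 3.491, 4.387879),
]

# Step with the lowest validation loss
best_step, _, best_val = min(history, key=lambda row: row[2])

# Gap between validation and training loss at the last logged step
last_step, last_train, last_val = history[-1]
gap = last_val - last_train

print(f"lowest validation loss {best_val} at step {best_step}")
print(f"validation - training loss at step {last_step}: {gap:.4f}")
```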


***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
Saving model checkpoint to Theater_Model\checkpoint-90
Configuration saved in Theater_Model\checkpoint-90\config.json
Model weights saved in Theater_Model\checkpoint-90\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4
Saving model checkpoint to Theater_Model\checkpoint-180
Configuration saved in Theater_Model\checkpoint-180\config.json
Model weights saved in Theater_Model\checkpoint-180\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 79
  Batch size = 4


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=230, training_loss=3.7321149743121604, metrics={'train_runtime': 320.9332, 'train_samples_per_second': 5.733, 'train_steps_per_second': 0.717, 'total_flos': 470307285504000.0, 'train_loss': 3.7321149743121604, 'epoch': 5.0})
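
The step counts in the log follow directly from the training arguments. A quick check, assuming a single device (which matches the logged total train batch size of 8 = 4 × 2):

```python
import math

num_examples = 368               # "Num examples" in the training log
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 5
eval_steps = 30
save_steps = 90

# Batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_train_epochs

# Steps at which evaluation and checkpointing are triggered
eval_at = list(range(eval_steps, total_steps + 1, eval_steps))
save_at = list(range(save_steps, total_steps + 1, save_steps))

print(batches_per_epoch, updates_per_epoch, total_steps)
print(eval_at)
print(save_at)
```

This gives 230 optimization steps, seven evaluations (steps 30 to 210) and checkpoints at steps 90 and 180, in line with the log above.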

In [136]:
# Save the final version of the model
trainer.save_model("Theater-Model-final")

Saving model checkpoint to Theater-Model-final
Configuration saved in Theater-Model-final\config.json
Model weights saved in Theater-Model-final\pytorch_model.bin
