**1. Imports**

Let's start with imports and installing Hugging Face datasets.

In [1]:
%%capture
!pip install datasets

In [2]:
import torch
import numpy as np
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoModelForCausalLM
from datasets import load_dataset, concatenate_datasets, Dataset, DatasetDict
from peft import LoraConfig, get_peft_model, TaskType

**2. Tokenization and Data Blocks**

Let's access our .txt files from Google Drive and prepare them with the tokenizer.

*   Step 1. Read each file as one long string and tokenize it in full.
*   Step 2. Split each long token sequence into fixed-size blocks.
*   Step 3. Build a Hugging Face dataset from the blocks.
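
Step 2 can be illustrated on toy data. The sketch below is a minimal stand-in, assuming a made-up list of ten token ids and a block size of 4 (both hypothetical, chosen only for readability). It shows that the final block comes out shorter whenever the token count is not a multiple of the block size.

```python
def split_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]

toy_ids = list(range(10))  # hypothetical token ids 0..9
toy_blocks = split_into_blocks(toy_ids, block_size=4)
print(toy_blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The short tail block is not dropped; it is left for the data collator to pad at batch time.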



In [3]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# set up tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token

Mounted at /content/drive


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
# ---------Saber's path--------------

# read and tokenize each file
def load_and_tokenize(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    tokenized = tokenizer(text)
    return tokenized['input_ids']

# specify file path in Google Drive
data_path = "/content/drive/MyDrive/SEP775NLP_Final_Project/Friends_script" # no trailing slash: the f-strings below add the separator

# run the function above to tokenize the data
train_ids = load_and_tokenize(f"{data_path}/cleaned_train_data.txt") # used latest version of processed data here
val_ids = load_and_tokenize(f"{data_path}/cleaned_val_data.txt")
test_ids = load_and_tokenize(f"{data_path}/cleaned_test_data.txt")
general_test_ids = load_and_tokenize(f"{data_path}/wikitext103_test.txt")

Token indices sequence length is longer than the specified maximum sequence length for this model (1062284 > 1024). Running this sequence through the model will result in indexing errors


In [5]:
# create blocks of sequences
block_size=512

def create_blocks(all_tokens, block_size):
    blocks = []
    # step through the tokens in strides of block_size
    for i in range(0, len(all_tokens), block_size):
        tokens_in_this_block = all_tokens[i:i+block_size] # takes [0:512], then [512:1024]...
        blocks.append(tokens_in_this_block)
    return blocks # a list of blocks, each a list of token ids; the last block may be shorter than block_size

# run the function above to create the data blocks
train_blocks = create_blocks(train_ids, block_size)
val_blocks = create_blocks(val_ids, block_size)
test_blocks = create_blocks(test_ids, block_size)
general_test_blocks = create_blocks(general_test_ids, block_size)

# verify that one block contains multiple lines of the script
print(train_blocks[0])

[464, 1881, 6350, 23240, 29620, 257, 968, 5564, 2002, 378, 357, 464, 21697, 12, 464, 791, 8968, 10628, 8, 198, 25354, 416, 25, 3981, 64, 28148, 487, 805, 1222, 3271, 38376, 628, 198, 58, 36542, 25, 5694, 2448, 74, 11, 28346, 11, 26154, 11, 1380, 2577, 1350, 11, 290, 23240, 389, 612, 8183, 198, 198, 9069, 3970, 25, 1318, 338, 2147, 284, 1560, 0, 679, 338, 655, 617, 3516, 314, 670, 351, 0, 198, 198, 19585, 88, 25, 327, 1101, 261, 11, 345, 821, 1016, 503, 351, 262, 3516, 0, 1318, 338, 17753, 307, 1223, 2642, 351, 683, 0, 198, 198, 1925, 392, 1754, 25, 1439, 826, 26154, 11, 307, 3621, 13, 220, 1406, 857, 339, 423, 257, 49779, 30, 317, 49779, 290, 257, 4190, 12239, 30, 198, 198, 2725, 2577, 1350, 25, 16314, 11, 857, 339, 4483, 30860, 30, 198, 198, 7, 2990, 477, 24170, 11, 307, 76, 1484, 2014, 198, 198, 2725, 2577, 1350, 25, 2329, 11, 705, 25587, 11, 314, 836, 470, 765, 607, 284, 467, 832, 644, 314, 1816, 832, 351, 8124, 12, 11752, 0, 198, 198, 9069, 3970, 25, 16805, 11, 7288, 8960, 13, 770,

In [6]:
# make dictionaries and then turn them into hugging face datasets
train_dataset = Dataset.from_dict({"input_ids": train_blocks})
val_dataset = Dataset.from_dict({"input_ids": val_blocks})
test_dataset = Dataset.from_dict({"input_ids": test_blocks})
general_test_dataset = Dataset.from_dict({"input_ids": general_test_blocks})

# create the Hugging Face dataset dict
dataset_dict = DatasetDict({"train": train_dataset, "validation": val_dataset, "test": test_dataset,
                            "general_test": general_test_dataset})

In [7]:
# set up data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # For causal LM
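
With `mlm=False`, the collator pads each batch to a common length and builds `labels` as a copy of `input_ids`, with padded positions set to -100 so the loss ignores them. The sketch below mimics that behaviour in plain Python on made-up ids. `collate_for_causal_lm` is a hypothetical helper written for illustration, not the library's code, and 50256 is GPT-2's end-of-text id, which this notebook reuses as the pad token.

```python
PAD_ID = 50256          # GPT-2 end-of-text id, reused as the pad token in this notebook
IGNORE_INDEX = -100     # label value skipped by the cross-entropy loss

def collate_for_causal_lm(blocks, pad_id=PAD_ID):
    """Hypothetical stand-in: right-pad blocks and derive labels from input_ids."""
    longest = max(len(block) for block in blocks)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for block in blocks:
        n_pad = longest - len(block)
        input_ids = block + [pad_id] * n_pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(block) + [0] * n_pad)
        # every position equal to the pad id is ignored by the loss
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in input_ids])
    return batch

toy_batch = collate_for_causal_lm([[11, 12, 13, 14], [21, 22]])
print(toy_batch["labels"])  # [[11, 12, 13, 14], [21, 22, -100, -100]]
```

One side effect of reusing the end-of-text token as padding is that any genuine end-of-text token in the data is also excluded from the loss.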

**3. Model Training**

Let's set up the training parameters here. Before fine-tuning, we also benchmark the pre-trained model on a general Wiki language dataset (the WikiText-103 test set). We evaluate on the same dataset again after fine-tuning to check whether the model retains its general language capability.

In [17]:
# access the model from Hugging Face
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
model.config.pad_token_id = model.config.eos_token_id

# --------------------------------------------------------------------------------(comment this section out if running without LoRA)
# freeze all the weights
#for param in model.parameters():
    #param.requires_grad = False
# -------------------------------------------------------------------------------- (this code might not be needed even when running LoRA, but just in case)

In [9]:
# set up dummy trainer to benchmark the pre-trained model's capability on the general wiki test set
dummy_training_args = TrainingArguments(output_dir="./eval_output", do_train=False, per_device_eval_batch_size=4, report_to="none")
dummy_trainer = Trainer(model=model, args=dummy_training_args, eval_dataset=dataset_dict["general_test"], data_collator=data_collator)

# test the pre-trained model
benchmark_general_eval_results = dummy_trainer.evaluate(dataset_dict["general_test"])
print(f"General Wiki Data before Fine-Tuning: eval_loss = {benchmark_general_eval_results['eval_loss']:.2f}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


General Wiki Data before Fine-Tuning: eval_loss = 3.05


In [9]:
# take a look at how many steps will be one epoch
num_of_train_blocks=(len(train_blocks))
step_per_epoch = num_of_train_blocks // 4 # 4 is the batch size; floor division, so a final partial batch is not counted
print(step_per_epoch)

518


In [18]:
# --------------------------------------------------------------------------------(comment this section out if running without LoRA)
# setup LoRA configuration and wrap the model with it
#lora_config = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM) # increase r for a more powerful adapter
#model = get_peft_model(model, lora_config)
#model.print_trainable_parameters()# verify we're only training LoRA weights
# --------------------------------------------------------------------------------

# set up training_args
training_args = TrainingArguments(output_dir="/content/drive/MyDrive/SEP775NLP_Final_Project/Models/gpt2-large", # Rose: "/content/drive/MyDrive/datasets"
                                  evaluation_strategy="steps",
                                  save_strategy="steps",
                                  max_steps=600,
                                  eval_steps=100,
                                  save_steps=100,
                                  logging_steps=100,
                                  #num_train_epochs=3,
                                  per_device_train_batch_size=4,
                                  per_device_eval_batch_size=4,
                                  learning_rate=2e-5,
                                  #save_total_limit=1,
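                                  resume_from_checkpoint=True, # note: the Trainer does not read this field; to actually resume, pass resume_from_checkpoint to trainer.train()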
                                  load_best_model_at_end=True,
                                  metric_for_best_model="eval_loss",
                                  weight_decay=0.01, # added regularization
                                  warmup_steps=5, # added lr scheduler and warmup
                                  lr_scheduler_type="linear",
                                  report_to="none" # disable wandb
                                  ) # default optimizer is adamw

# setup trainer
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=dataset_dict["train"],
                  eval_dataset=dataset_dict["validation"],
                  data_collator=data_collator)



In [19]:
# start training
trainer.train()

Step,Training Loss,Validation Loss
100,2.3103,2.131631
200,2.2494,2.102783
300,2.1963,2.085449
400,2.185,2.072713
500,2.1762,2.065934
600,2.0991,2.064348


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=600, training_loss=2.2027021789550782, metrics={'train_runtime': 696.412, 'train_samples_per_second': 3.446, 'train_steps_per_second': 0.862, 'total_flos': 5220644565811200.0, 'train_loss': 2.2027021789550782, 'epoch': 1.1560693641618498})
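
The `epoch` value in the output above can be reproduced by hand. This is a sketch under two assumptions: that the 1,062,284 tokens reported in the tokenizer warning refer to the training file (which gives 2,075 blocks at `block_size=512`; the notebook does not print this count), and that a final partial batch still counts as one step.

```python
import math

train_tokens = 1_062_284   # from the tokenizer warning, assumed to be the training file
block_size = 512
batch_size = 4
max_steps = 600

num_blocks = math.ceil(train_tokens / block_size)      # short last block included
steps_per_epoch = math.ceil(num_blocks / batch_size)   # partial last batch still takes a step
epochs_covered = max_steps / steps_per_epoch

print(num_blocks, steps_per_epoch, round(epochs_covered, 3))  # 2075 519 1.156
```

So 600 steps amounts to a little over one pass through the training data.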

**4. Evaluation**

Take the trained model and evaluate its performance on two different test sets.

*   Test 1. Unseen FRIENDS scripts - to test the model's ability to pick up the specific style of the show.
*   Test 2. General Wiki English data - to test the model's general language capabilities (compare to the benchmark from before fine-tuning).



In [20]:
# evaluation process
eval_results = trainer.evaluate(dataset_dict["test"])
general_eval_results = trainer.evaluate(dataset_dict["general_test"])

print(f"Unseen FRIENDS scripts: eval_loss = {eval_results['eval_loss']:.2f}")
print(f"General Wiki Data after Fine-Tuning: eval_loss = {general_eval_results['eval_loss']:.2f}")

Unseen FRIENDS scripts: eval_loss = 2.33
General Wiki Data after Fine-Tuning: eval_loss = 3.13
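
Cross-entropy losses are often easier to compare as perplexities. A minimal sketch, assuming `eval_loss` is the mean per-token cross-entropy in nats, so that perplexity is `exp(eval_loss)`. The three loss values are the ones printed in this notebook; the dictionary keys are made up for illustration.

```python
import math

eval_losses = {
    "wiki_before_finetuning": 3.05,
    "wiki_after_finetuning": 3.13,
    "friends_test": 2.33,
}

# perplexity = exp(mean cross-entropy)
perplexities = {name: math.exp(loss) for name, loss in eval_losses.items()}
for name, ppl in perplexities.items():
    print(f"{name}: perplexity = {ppl:.1f}")
```

Under that assumption, the Wiki perplexity moves from about 21.1 to about 22.9, a small loss of general capability, while the FRIENDS test set comes out at about 10.3.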


**5. Generation**

Finally, we'll use the fine-tuned model to generate a new episode and judge its output for ourselves. We tried the three approaches below and got the best results with the last one, so this version of the code keeps only that approach.

*   Approach 1. Greedy search (using argmax)
*   Approach 2. Multinomial sampling
*   Approach 3. Multinomial sampling with top_k and top_p filters
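
The three approaches can be compared on a toy next-token distribution. The sketch below is plain Python rather than the library's implementation. `top_k_top_p_filter` and the five probabilities are hypothetical, and it assumes the top-k filter is applied before the top-p filter, with renormalization after each.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    # rank token indices from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep the k most probable tokens and renormalize
    kept = ranked[:top_k]
    total = sum(probs[i] for i in kept)
    kept_probs = [(i, probs[i] / total) for i in kept]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i, p in kept_probs:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {i: p / total for i, p in nucleus}

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # hypothetical next-token distribution
token_indices = range(len(toy_probs))

# Approach 1: greedy search always picks the most probable token
greedy_choice = max(token_indices, key=lambda i: toy_probs[i])

# Approach 2: multinomial sampling draws from the full distribution
rng = random.Random(0)
multinomial_choice = rng.choices(token_indices, weights=toy_probs)[0]

# Approach 3: multinomial sampling restricted to the filtered tokens
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.75)
filtered_choice = rng.choices(list(filtered), weights=list(filtered.values()))[0]

print(greedy_choice, sorted(filtered))  # 0 [0, 1, 2]
```

Greedy search is deterministic and tends to repeat itself, unfiltered sampling can pick very unlikely tokens, and the filtered variant samples only among the plausible candidates.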





In [21]:
# create initial prompt and tokenize it
prompt = """
The One with students from McMaster University
Written by: GPT2-Base

[Scene: Monica's Apartment. Monica, Ross, Rachel, and Chandler are there.]

Monica: Hey guys! Big news—I heard there's going to be a party at McMaster University.

Ross: A party?

Rachel: Oh yeah! The students are celebrating their SEP775 class with the professors!

Chandler: We should join them! I mean, where is the fun in a party without your favorite FRIENDS characters?

(They all leave for the party.)

[Scene: SEP775 classroom at McMaster University. Rachel, Chandler, Ross, Monica, and Phoebe are hanging out with the students.]

Rachel: Wow, this place kicks ass! Look, isn’t that Professor Mahyar who teaches Natural Language Processing?

"""

model_inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=True).to(model.device)
model.generation_config.pad_token_id = tokenizer.pad_token_id # suppress padding warning

# create a list that holds the prompt, future tokens will be appended into this list later on
all_tokens = model_inputs["input_ids"][0].tolist()

# to stay within GPT-2's 1024-token context (and avoid hitting memory limits), generate in chunks: start from the initial prompt, then feed the last few hundred tokens back in as context for the next chunk.
context_window = 700 # this is the sliding window - take this many tokens as context for the next chunk (700 + step_size stays below 1024)
max_token = 2500
step_size = 128 # generate this many tokens at a time
num_of_loops = max_token // step_size

# end episode management
min_tokens_before_end = 2000 # generate this many tokens before being allowed to generate the word "end"
end_id = tokenizer.encode("end", add_special_tokens=False) # list of token ids for the word "end" (checked beforehand: it is a single token id, meaning it is the standalone word "end")
end_of_episode_marker = "### END OF EPISODE ###" # this marker was added during data pre-processing. it should come right after the "end" word

# generation loop
for i in range(num_of_loops):

  # suppresses the word "end" before reaching certain number of tokens
  if len(all_tokens) < min_tokens_before_end:
    bad_words = [end_id]
  else:
    bad_words = None

  output = model.generate(**model_inputs,
                          max_new_tokens=step_size,
                          do_sample=True,
                          top_k=40, # keep only the k most probable tokens
                          top_p=0.75, # keep the smallest set of tokens whose cumulative probability reaches p, filter out the remaining less probable options
                          temperature=1, # 1 leaves the distribution unchanged; raise it above 1 for more varied output
                          repetition_penalty=1.2,
                          bad_words_ids=bad_words # tokens that may not be generated before min_tokens_before_end is reached
                          )

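  output_tokens = output[0].tolist()
  prompt_length = model_inputs["input_ids"].shape[1] # number of context tokens fed into this call
  new_tokens = output_tokens[prompt_length:] # drop the context and keep only the new tokens (may be fewer than step_size if generation stops early at EOS)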

  new_text = tokenizer.decode(new_tokens, skip_special_tokens=True) # turn tokens to text to check if end marker has been generated
  start_index_of_end_marker = new_text.find(end_of_episode_marker) # returns -1 if the marker is not found, otherwise returns the start index

  if start_index_of_end_marker != -1: # if end marker is present
    truncated_new_text = new_text[: start_index_of_end_marker + len(end_of_episode_marker)] # get rid of what is after the end marker
    truncated_new_tokens = tokenizer.encode(truncated_new_text, add_special_tokens=False) # turn them into tokens
    all_tokens.extend(truncated_new_tokens) # and put them in the list
    break

  else:
    all_tokens.extend(new_tokens) # otherwise place the whole thing into the list without truncating

  previous_context = output_tokens[-context_window:] # set new previous_context for next loop
  previous_context_tensor = torch.tensor([previous_context]).to(model.device) # prepare the inputs and attention mask for next loop
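  attention_mask = torch.ones_like(previous_context_tensor) # single unpadded sequence, so attend to every token (comparing against pad_token_id would also mask real EOS tokens, since pad == eos here)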
  model_inputs = {"input_ids": previous_context_tensor, "attention_mask":attention_mask}

# decode from the list and print results
generated_script = tokenizer.decode(all_tokens, skip_special_tokens=True)
print(generated_script)


The One with students from McMaster University
Written by: GPT2-Base

[Scene: Monica's Apartment. Monica, Ross, Rachel, and Chandler are there.]

Monica: Hey guys! Big news—I heard there's going to be a party at McMaster University.

Ross: A party?

Rachel: Oh yeah! The students are celebrating their SEP775 class with the professors!

Chandler: We should join them! I mean, where is the fun in a party without your favorite FRIENDS characters?

(They all leave for the party.)

[Scene: SEP775 classroom at McMaster University. Rachel, Chandler, Ross, Monica, and Phoebe are hanging out with the students.]

Rachel: Wow, this place kicks ass! Look, isn’t that Professor Mahyar who teaches Natural Language Processing?


Phoebe: Yeah. (Points to her professor) He's great. Y'know he's got so much fun!

Monica: So he taught you all those Chinese words?

Phoebe: Yep! But y'know what? No matter how many times I hear 'chop chop chop', I don't think it ever leaves my head!

Ross: You just keep saying