# Project: TV Show Script Generation

In this project, we fine-tune a pre-trained language model to generate a script for a new episode of Friends.

Our project has 5 main parts:

1. Data Collection
2. Data Preprocessing
3. Load and Fine-Tune the Model
4. Model Evaluation
5. Generate Scripts



## Preparation

In [1]:
!python /content/drive/MyDrive/NLP_Project/operations/setup_env.py

Installing datasets...
✅ datasets installed.
Installing bitsandbytes -U...
✅ bitsandbytes -U installed.


In [2]:
import pandas as pd
import numpy as np
import torch
import sys
import os

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig

sys.path.append('/content/drive/MyDrive/NLP_Project/operations')
import utils

# Data Collection
We use the "Friends" episodes dataset from Kaggle and extract the `script` column.

In [3]:
df = pd.read_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_episodes.csv")

In [4]:
df.head()

Unnamed: 0,episode_title,script
0,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...
1,THE ONE WITH THE SONOGRAM AT THE END\nW,THE ONE WITH THE SONOGRAM AT THE END\nWritten ...
2,THE ONE WITH THE THUMB\nW,THE ONE WITH THE THUMB\nWritten by: Jeffrey As...
3,THE ONE WITH GEORGE STEPHANOPOULOS\nW,THE ONE WITH GEORGE STEPHANOPOULOS\nWritten by...
4,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT\nW,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT...


In [5]:
df.shape

(223, 2)

# Data Preprocessing
In this part, we focus on the following steps:

1. Clean the data and divide it by episodes.
2. Tokenize the scripts.
3. Train-Validation-Test Split.
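
The notebook does not show how step 3 is carried out (the `Trainer` later receives the whole `scripts_dataset`), so the following is only a minimal sketch of one way to split at the episode level. The 80/10/10 ratio, the seed, and the name `split_episodes` are assumptions made for illustration; none of them come from `utils`.

```python
import random

def split_episodes(episodes, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle whole episodes and cut them into train/validation/test lists.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    shuffled = list(episodes)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# 223 placeholder episodes, matching the row count of the dataset
episodes = [f"episode {i}" for i in range(223)]
train, val, test = split_episodes(episodes)
print(len(train), len(val), len(test))
```

Splitting by episode before chunking keeps pieces of the same episode from ending up in both the training and the test set.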



## Data Cleaning

In [6]:
scripts_df = df['script'].apply(utils.clean_script)
scripts_df.to_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_scripts_by_episode.csv")

print("Results after data cleaning: \n")
scripts_df.head()

Results after data cleaning: 



Unnamed: 0,script
0,Monica: There's nothing to tell! He's just som...
1,"Monica: What you guys don't understand is, for..."
2,"Phoebe: (entering) Hi guys!\nAll: Hey, Pheebs!..."
3,"Monica: Alright. Phoebe?\nPhoebe: Okay, okay. ..."
4,Monica: Would you let it go? It's not that big...


After the data cleaning step above, each element of `scripts_df` holds the full script of one episode.
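
The source of `utils.clean_script` is not shown. Judging from the output above, it seems to drop the title and credit lines and keep the `Speaker: dialogue` lines. A minimal sketch of that kind of cleaning could look as follows; the regular expression, the credit prefixes, and the name `clean_script_sketch` are all assumptions, and the real function may behave differently (for example, in how it treats scene descriptions).

```python
import re

# Assumed pattern: a capitalised speaker name (one or two words), a colon, then text.
SPEAKER_LINE = re.compile(r"^[A-Z][A-Za-z.]*(?: [A-Z][A-Za-z.]*)?: \S")
# Assumed credit prefixes that should never be treated as dialogue.
CREDIT_PREFIXES = ("Written by", "Directed by", "Transcribed by")

def clean_script_sketch(raw):
    """Keep only 'Speaker: dialogue' lines from one raw episode script."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(CREDIT_PREFIXES):
            continue  # skip credits such as "Written by: ..."
        if SPEAKER_LINE.match(line):
            kept.append(line)  # titles, blank lines and [Scene: ...] lines are dropped
    return "\n".join(kept)

# Toy input; the writer name is a placeholder, not taken from the dataset.
raw = (
    "THE ONE WITH THE THUMB\n"
    "Written by: Some Writer\n"
    "\n"
    "[Scene: Central Perk]\n"
    "Phoebe: (entering) Hi guys!\n"
    "All: Hey, Pheebs!\n"
)
cleaned = clean_script_sketch(raw)
print(cleaned)
```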

## Tokenize the scripts data

In [7]:
scripts = scripts_df.tolist()
len(scripts)

223

Save the Hugging Face token to Colab Secrets so we don't need to enter it every time we log in.

In [8]:
utils.huggingface_login()

Hugging Face Successfully Login!


In [9]:
# tokens = tokenizer(scripts, return_tensors="pt", padding=True, truncation=False)

In [10]:
# scripts_dataset = chunk_scripts(scripts)

In [11]:
tokenizer, tokens, scripts_dataset = utils.tokenize_scripts(scripts)


Select GPT2-7b Version for Tokenization...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...



Token indices sequence length is longer than the specified maximum sequence length for this model (6432 > 1024). Running this sequence through the model will result in indexing errors



Chunking Finished! Ready to return the new scripts dataset...
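
The warning above appears because a full episode (6432 tokens in this case) exceeds the 1024-token context of GPT-2, which is why the scripts are chunked. How `utils.tokenize_scripts` does this is not shown; the sketch below illustrates one simple approach, cutting a token-id sequence into consecutive windows of at most 1024 ids. The function name `chunk_token_ids` and the non-overlapping windows are assumptions.

```python
def chunk_token_ids(token_ids, max_len=1024):
    """Cut one long token-id sequence into consecutive windows of at most max_len ids."""
    chunks = []
    for start in range(0, len(token_ids), max_len):
        window = list(token_ids[start:start + max_len])
        chunks.append({
            "input_ids": window,
            "attention_mask": [1] * len(window),  # every position holds a real token
            "labels": list(window),  # causal LM: labels mirror the inputs, the model shifts them
        })
    return chunks

# Stand-in for a tokenized episode of 6432 tokens (the length from the warning)
fake_episode = list(range(6432))
chunks = chunk_token_ids(fake_episode)
print([len(c["input_ids"]) for c in chunks])
```

With these numbers, the episode yields six full windows and one shorter remainder. A real implementation might prefer to cut at line boundaries so that no line of dialogue is split across two chunks.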



In [12]:
tokenizer

GPT2TokenizerFast(name_or_path='gpt2-xl', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

In [13]:
scripts_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 2010
})

In [14]:
# print(f"{tokens}\n\n")
# print(f"{tokens.keys()}\n\n")
# print(f"{tokens['input_ids'].shape}\n\n")
# print(f"{tokens['input_ids'][0]}\n\n")

### Optional: use `BitsAndBytesConfig` to reduce GPU memory usage (disabled in this run)

In [15]:
# bnb_config = BitsAndBytesConfig(
#     load_in_8bit=True,
#     llm_int8_enable_fp32_cpu_offload=True
# )

In [16]:
model = AutoModelForCausalLM.from_pretrained(
    "gpt2-xl"
    # quantization_config=bnb_config,
    # device_map='auto'
)
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=4800, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=1600)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6400, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=6400)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1600, out_features=50257, bias=False)
)

## Use LoRA to fine-tune the GPT-2 XL model

In [17]:
from peft import get_peft_model, LoraConfig, TaskType

In [18]:
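config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling numerator; updates are scaled by lora_alpha / r = 4
    lora_dropout=0.05,          # dropout applied on the LoRA branch during training
    bias="none",                # leave bias terms frozen
    target_modules=["c_attn"]   # GPT-2's fused query/key/value projection
)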

model = get_peft_model(model, config)
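
To get a feel for how small the trainable part is, the number of LoRA parameters can be worked out by hand from the configuration above and the model printout (48 blocks, `c_attn` mapping 1600 to 4800 features). The sketch below assumes standard LoRA, where each targeted layer gains one matrix of shape `r x in_features` and one of shape `out_features x r`, with no extra bias terms; the helper name `lora_param_count` is made up for this example.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

n_layers, hidden, r, alpha = 48, 1600, 8, 32

per_layer = lora_param_count(hidden, 3 * hidden, r)  # c_attn: 1600 -> 4800
total = n_layers * per_layer
scaling = alpha / r  # factor applied to the low-rank update

print(per_layer, total, scaling)
```

This gives 51,200 parameters per block and 2,457,600 in total. Calling `model.print_trainable_parameters()` on the PEFT model is a quick way to check such a count against what the library actually created.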



In [19]:
training_arguments, trainer = utils.custom_trainer(
    model,
    scripts_dataset,
    tokenizer,
    lr=2e-4,
    warmup=0.03,
    L2=0.05,
    batch=1,
    epochs=3
)



Hyperparams Received! Started to generate Trainer...


Training Arguments Generated: 


Started to generate Trainer...



No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Trainer Generated: 



In [20]:
# training_arguments = TrainingArguments(
#     report_to="none",
#     output_dir="./output",
#     # evaluation_strategy="epoch",
#     save_strategy="epoch",
#     learning_rate=2e-4,
#     lr_scheduler_type="constant",
#     warmup_ratio=0.03,
#     weight_decay=0.05,
#     per_device_train_batch_size=1,
#     gradient_accumulation_steps=8,
#     num_train_epochs=3,
#     # load_best_model_at_end=True,
#     logging_steps=20,
#     logging_strategy="steps",
#     fp16=True,
#     # optim="paged_adamw_8bit",
#     # save_total_limit=3,
# )

# training_arguments

In [21]:
# trainer = Trainer(
#     model=model,
#     args=training_arguments,
#     train_dataset=scripts_dataset,
#     tokenizer=tokenizer
# )

# trainer

In [22]:
# import gc
# import torch
# gc.collect()
# torch.cuda.empty_cache()

### Model Training

In [23]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
20,2.5397
40,2.5273
60,2.54
80,2.4766
100,2.454
120,2.442
140,2.4294
160,2.4576
180,2.435
200,2.4357


TrainOutput(global_step=1509, training_loss=2.385868302554465, metrics={'train_runtime': 701.4879, 'train_samples_per_second': 8.596, 'train_steps_per_second': 2.151, 'total_flos': 5.4758128287744e+16, 'train_loss': 2.385868302554465, 'epoch': 3.0})
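
Two quick sanity checks can be read off the `TrainOutput` above. The sketch below derives the number of samples consumed per optimizer step and converts the mean training loss into a perplexity. The inferred step size is only an inference from the logged numbers, since the internals of `utils.custom_trainer` are not shown, and the perplexity is computed on training data, so it is not a held-out evaluation.

```python
import math

# Values copied from the notebook output
num_rows = 2010            # chunks in scripts_dataset
epochs = 3
global_step = 1509
train_loss = 2.385868302554465

# Samples consumed per optimizer step (per-device batch x gradient accumulation)
samples_per_step = round(num_rows * epochs / global_step)

# Cross-check: does that step size reproduce the logged number of steps?
steps_check = math.ceil(num_rows / samples_per_step) * epochs

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(train_loss)

print(samples_per_step, steps_check, round(perplexity, 2))
```

The numbers are consistent with 4 samples per optimizer step and a training perplexity of about 10.87.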

### Save the model

In [25]:
output_path = "/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels"
trainer.save_model(output_path)
tokenizer.save_pretrained(output_path)

('/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/tokenizer_config.json',
 '/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/special_tokens_map.json',
 '/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/vocab.json',
 '/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/merges.txt',
 '/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/added_tokens.json',
 '/content/drive/MyDrive/Colab_Notebooks/NLP/models/ProjectModels/tokenizer.json')

### Generate the new text
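
Generation in this section starts from a scene prompt built by `utils.generate_prompt`, whose source is not shown. As a minimal sketch, the function below reproduces the prompt string that the notebook prints for its Monica/Rachel restaurant example; the name `build_prompt` and its internal logic are assumptions reverse-engineered from that single output.

```python
def build_prompt(characters, location, scenario, seed_dialogue, lines):
    """Build a scene header, the seed lines, and empty speaker cues up to `lines` lines."""
    header = f"[Scene: {location}, {', '.join(characters)} are {scenario}.]"
    body = [f"{speaker}: {text}" for speaker, text in seed_dialogue.items()]
    # Fill the remaining lines with alternating empty speaker cues
    for i in range(len(body), lines):
        body.append(f"{characters[i % len(characters)]}:")
    return header + "\n\n" + "\n".join(body)

seed_dialogue = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point.",
}
prompt = build_prompt(
    characters=["Monica", "Rachel"],
    location="Restaurant",
    scenario="having dinner",
    seed_dialogue=seed_dialogue,
    lines=10,
)
print(prompt)
```

A causal language model only appends tokens after the end of the prompt, so empty cues in the middle of a prompt stay empty. This matches the first generated script in this section, where the text continues after the final `Rachel:` cue. Ending the prompt with a single open cue, as the second prompt in this section does, avoids the unfilled lines.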

In [26]:
# def generate_script(prompt, max_new_tokens=500, temperature=0.9, top_k=50, top_p=0.95):
#     inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

#     with torch.no_grad():
#         output = model.generate(
#             **inputs,
#             max_new_tokens=max_new_tokens,
#             do_sample=True,
#             temperature=temperature,
#             top_k=top_k,
#             top_p=top_p,
#             repetition_penalty=1.2,
#             pad_token_id=tokenizer.eos_token_id
#         )

#     generated = tokenizer.decode(output[0], skip_special_tokens=True)
#     return generated

In [27]:
seed_dialogue = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point."
}

prompt = utils.generate_prompt(
    characters=['Monica', 'Rachel'],
    location='Restaurant',
    scenario='having dinner',
    seed_dialogue=seed_dialogue,
    lines=10
)

prompt

"[Scene: Restaurant, Monica, Rachel are having dinner.]\n\nMonica: I can't believe he said that!\nRachel: Well, he does have a point.\nMonica:\nRachel:\nMonica:\nRachel:\nMonica:\nRachel:\nMonica:\nRachel:"

In [28]:
new_script = utils.generate_script(model, tokenizer, prompt)
new_script

"[Scene: Restaurant, Monica, Rachel are having dinner.]\n\nMonica: I can't believe he said that!\nRachel: Well, he does have a point.\nMonica:\nRachel:\nMonica:\nRachel:\nMonica:\nRachel:\nMonica:\nRachel: !!! (they notice the two look at each other.) What?\nJoey: Rach, what did you say about that guy?!\nRoss: He said there was no reason not to wear it.\nRachel: No, he didn't say anything like that! He said it because that's how he thinks of me and Joey!\nJoey: Ross, come on! You're talking about one of our friends! The biggest fan ever! And if this is so important, why wouldn't he want us wearing it? It just doesn't make sense! (To Phoebe) She's right. You know when we meet, I'm wearing a hat made by my friend Joe!\nPhoebe: Wow!\nJoey: (To Monica and Rachel) All right, here goes nothing! Come on!\nChandler: Oh hey!\nMonica: Hey.\nChandler: Have you heard the news? They've got another wedding for us in three months?\nRachel: Yeah!\nMonica: So I guess you guys are going to all of them?\

In [30]:
prompt = """You are generating a new Friends script scene.
Format: Character: Dialogue

[Scene: Central Perk, Monica and Rachel are having coffee.]

Monica: I can't believe he said that!
Rachel:"""

result = utils.generate_script(model, tokenizer, prompt)
print(result)

You are generating a new Friends script scene.
Format: Character: Dialogue

[Scene: Central Perk, Monica and Rachel are having coffee.]

Monica: I can't believe he said that!
Rachel: Oh man, you really should have seen the look on his face! And then I just thought, "Hey, maybe if we tell him what he's done right, this will bring it back to its right place." (She starts to pour herself more coffee) Oh yeah? Well, let me know, because I'm gonna start right now. Okay, so now uhm, in my head there was this thing we were trying to do, I mean they came up with a couple of lines, like you got a new hair cut and you're trying out something new that looks good on you. But uhm, um, and I think they called me the best actress I had ever worked with, okay, but then when it became about being an actor instead of a waitress, oh, well... You know.
Chandler: What did they say? How long ago?
Monica: About five years ago.
Ross: Five year old Chandler, you don't wanna hear anything else from them!
Phoebe