In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Project: TV Show Script Generation

For this project, we fine-tune a pre-trained language model and use it to generate a script for a new episode of *Friends*.

Our project has 5 main parts:

1. Data Collection
2. Data Preprocessing
3. Load and Fine-Tune the Model
4. Model Evaluation
5. Generate Scripts



## Preparation

In [None]:
!python /content/drive/MyDrive/NLP_Project/operations/setup_env.py

Installing datasets...
✅ datasets installed.
Installing evaluate...
✅ evaluate installed.
Installing rouge_score...
✅ rouge_score installed.
Installing bitsandbytes...
✅ bitsandbytes installed.


Save the Hugging Face token to Colab Secrets so that we don't need to enter it at every login.
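A minimal sketch of how the token could be read back, assuming the secret is stored under the name `HF_TOKEN` (a name chosen here for illustration; the internals of `utils.huggingface_login` are not shown in this notebook). Outside Colab, the sketch falls back to an environment variable of the same name.

```python
import os

def resolve_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab Secrets, or from the environment as a fallback."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with this notebook.
        return os.environ.get(secret_name)

token = resolve_hf_token()
# With a token in hand, the login itself could look like:
#   from huggingface_hub import login
#   login(token=token)
```

Keeping the token in a secret store rather than in a cell avoids committing it to Drive or to version control along with the notebook.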

In [None]:
import pandas as pd
import numpy as np
import importlib
import evaluate
import sys
import os

from datasets import Dataset

sys.path.append('/content/drive/MyDrive/NLP_Project/operations')
import utils

gpt2 = "gpt2-xl"
llama_2_7 = "meta-llama/Llama-2-7b-hf"
llama_2_13 = "meta-llama/Llama-2-13b-hf"

mixtral_87 = "mistralai/Mixtral-8x7B-Instruct-v0.1"
nousH_13 = "NousResearch/Nous-Hermes-13b"

output_path = "/content/drive/MyDrive/NLP_Project/model"

In [None]:
utils.huggingface_login()

Hugging Face Successfully Login!


# Data Collection
We use the "friends" dataset from Kaggle and extract the `script` column.

In [None]:
df = pd.read_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_episodes.csv")

In [None]:
df.head()

Unnamed: 0,episode_title,script
0,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...
1,THE ONE WITH THE SONOGRAM AT THE END\nW,THE ONE WITH THE SONOGRAM AT THE END\nWritten ...
2,THE ONE WITH THE THUMB\nW,THE ONE WITH THE THUMB\nWritten by: Jeffrey As...
3,THE ONE WITH GEORGE STEPHANOPOULOS\nW,THE ONE WITH GEORGE STEPHANOPOULOS\nWritten by...
4,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT\nW,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT...


In [None]:
df.shape

(223, 2)

# Data Preprocessing
In this part, we focus on the following steps:

1. Clean the data and divide it by episodes.
2. Split the episodes into training and validation sets.
3. Tokenize the scripts.




In [None]:
data_preprocessor = utils.DataPreprocessor(llama_2_7)

Initialize data propressor...
Select meta-llama/Llama-2-7b-hf for Tokenization...




## Data Cleaning

In [None]:
scripts_df = df['script'].apply(data_preprocessor.clean_script)

scripts_df.to_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_scripts_by_episode.csv")
scripts_df.head()

Unnamed: 0,script
0,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...
1,THE ONE WITH THE SONOGRAM AT THE END\n[Scene C...
2,"THE ONE WITH THE THUMB\n[Scene: Central Perk, ..."
3,THE ONE WITH GEORGE STEPHANOPOULOS\n[Scene: Ce...
4,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT...


After the cleaning step above, each element of `scripts_df` holds the full script of one episode.
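The internals of `clean_script` are not shown here. Comparing the previews before and after cleaning suggests that the credit line following the title is removed. A minimal sketch of that kind of step, where `drop_credit_lines` and its pattern are hypothetical and the real cleaning may do more or something different:

```python
import re

# Assumption: credit lines start with phrases such as "Written by".
CREDIT_LINE = re.compile(r"^\s*(written|transcribed|directed) by\b", re.IGNORECASE)

def drop_credit_lines(script: str) -> str:
    """Remove credit lines and blank lines, keeping title, scenes, and dialogue."""
    kept = [
        line for line in script.split("\n")
        if line.strip() and not CREDIT_LINE.match(line)
    ]
    return "\n".join(kept)

sample = "THE ONE WITH THE THUMB\nWritten by: (writer name)\n\n[Scene: Central Perk]"
print(drop_credit_lines(sample))
```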

## Train Validation Split

In [None]:
train_scripts, val_scripts = data_preprocessor.data_split(scripts_df)
print(train_scripts)
print(len(train_scripts))
print(len(val_scripts))

200
23
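For illustration, a shuffled 90/10 split can be sketched as follows. The function name, the fraction, and the seed are assumptions; this happens to reproduce the 200/23 sizes for 223 episodes, but the actual logic of `data_split` may differ.

```python
import random

def split_episodes(episodes, train_fraction=0.9, seed=42):
    """Shuffle a copy of the episodes and cut it into train and validation parts."""
    shuffled = list(episodes)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

episodes = [f"episode {i}" for i in range(223)]
train, val = split_episodes(episodes)
print(len(train), len(val))
```

Splitting by episode, rather than by chunk, keeps text from a validation episode out of the training set.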


## Tokenize the scripts data

In [None]:
scripts = scripts_df.tolist()
len(scripts)

223

In [None]:
train_tokens, val_tokens, train_dataset, val_dataset = data_preprocessor.tokenize_scripts(train_scripts, val_scripts)

Tokenizing [training] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...

Tokenizing [validation] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...


Tokenizations all done!
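The log above mentions chunking scripts into pieces of bounded length. A minimal sketch of fixed-size chunking over a list of token ids, assuming non-overlapping windows of at most 1024 tokens (the exact boundary handling in `tokenize_scripts` is not shown and may differ):

```python
def chunk_token_ids(token_ids, max_len=1024):
    """Cut one long token sequence into consecutive windows of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# Stand-in for the token ids of one tokenized episode.
fake_ids = list(range(2500))
chunks = chunk_token_ids(fake_ids)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Chunking is needed because a full episode is much longer than the sequence length used for training, so each episode contributes several training examples.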


In [None]:
data_preprocessor.tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

# Create Model and Fine-Tune it
In this part, we set up a pre-trained model and fine-tune it on our TV script data.

1. Define a pre-trained model.
2. Use `LoRA` and `BitsAndBytesConfig` to reduce the number of trainable parameters and the memory footprint of the weights, which saves GPU RAM and speeds up training.
3. Train the model on the script data and tune the hyperparameters.



## Define a custom pre-trained model

In [None]:
custom_model = utils.CustomModel(
    llama_2_7,
    data_preprocessor.tokenizer,
    data_preprocessor.train_set,
    data_preprocessor.val_set,
    lr=3e-5,
    warmup=0.03,
    L2=0.05,
    batch_size=4,
    epochs=10,
    enable_lora=True,
    enable_bitsbytes=True
)

Initialize custom pretrained model...



`low_cpu_mem_usage` was None, now default to True since model is quantized.



Hyperparams Received! Started to generate Trainer...



Started to generate Trainer...

[Training Arguments] and [Trainer] Generated! 



In [None]:
custom_model

<utils.CustomModel at 0x7951d735d090>

In [None]:
custom_model.model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linear4

## Use `BitsAndBytesConfig` to avoid running out of GPU RAM

In [None]:
custom_model.bitsbytes

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
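To see why 4-bit loading helps, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 7 billion parameters, as the model name suggests, and ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights only, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: about 7 billion parameters

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, the weights shrink from about 14 GB in 16-bit precision to about 3.5 GB in 4-bit, which leaves room for activations and the LoRA adapters on a single Colab GPU.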

## Use LoRA to fine-tune the Llama 2 model

In [None]:
custom_model.lora

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='meta-llama/Llama-2-7b-hf', revision=None, inference_mode=False, r=16, target_modules={'q_proj', 'v_proj'}, exclude_modules=None, lora_alpha=64, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)
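LoRA freezes each targeted weight matrix and trains two small matrices of rank `r` beside it. The sketch below counts the trainable parameters this adds, using the rank and target modules from the printed `LoraConfig` and the layer sizes from the model printout (4096 x 4096 projections, 32 decoder layers). It is plain arithmetic, not a call into `peft`.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in the two low-rank factors: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

HIDDEN = 4096          # from the model printout
N_LAYERS = 32          # from the model printout
TARGETS_PER_LAYER = 2  # q_proj and v_proj
RANK = 16              # r in the printed LoraConfig

per_module = lora_param_count(HIDDEN, HIDDEN, RANK)
total_lora = per_module * TARGETS_PER_LAYER * N_LAYERS
full_module = HIDDEN * HIDDEN

print(f"LoRA params per module: {per_module:,}")
print(f"Frozen params per module: {full_module:,}")
print(f"Total trainable LoRA params: {total_lora:,}")
```

Each adapted projection trains 131,072 parameters instead of 16,777,216, so only a small fraction of the model receives gradients.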

## Train and Fine-tune the model

In [None]:
custom_model.trainer.train()

Step,Training Loss,Validation Loss
250,1.8312,1.819827
500,1.7916,1.80547
750,1.77,1.800189
1000,1.7599,1.794654
1250,1.7405,1.792647
1500,1.7162,1.792505
1750,1.7187,1.789736
2000,1.6976,1.794163
2250,1.6852,1.794856


TrainOutput(global_step=2250, training_loss=1.7456441243489584, metrics={'train_runtime': 2535.9888, 'train_samples_per_second': 9.692, 'train_steps_per_second': 2.425, 'total_flos': 3.6557964670559846e+17, 'train_loss': 1.7456441243489584, 'epoch': 3.658536585365854})

## Save the model

In [None]:
custom_model.trainer.save_model(output_path)
custom_model.tokenizer.save_pretrained(output_path)

('/content/drive/MyDrive/NLP_Project/model/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_Project/model/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_Project/model/tokenizer.model',
 '/content/drive/MyDrive/NLP_Project/model/added_tokens.json',
 '/content/drive/MyDrive/NLP_Project/model/tokenizer.json')

# New Script Generation
In this part, we generate a script for a new episode using the model fine-tuned on the training scripts.

1. Customize a prompt.
2. Generate a new script based on the prompt.



In [None]:
scripts_generator = utils.ScriptGenerator()

Initialize TV Show Scripts generator...


## Customize the Prompt

In [None]:
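# Note: a Python dict keeps only one value per key, so the second "Monica"
# entry overwrites the first; the prompt below therefore opens with her last line.
seed_dialogue = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point.",
    "Chandler": "You mean the part where he compared your lasagna to insulation?",
    "Monica": "It was one time! And it was structurally sound!"
}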

prompt = scripts_generator.create_prompt(
    characters=['Monica', 'Rachel'],
    location='Restaurant',
    scenario='having dinner',
    seed_dialogue=seed_dialogue,
    continue_speaker='Rachel'
)

prompt

'You are going to generate a new episode of the show *Friends*.\n        \n        The episode should include multiple scenes, natural conversations, character-specific humor, and a clear ending.\n        \n        Title: THE ONE WITH THE THUMB\n        \n        [Scene: Restaurant, Monica, Rachel are having dinner.]\n\n\n        Monica: It was one time! And it was structurally sound!\nRachel: Well, he does have a point.\nChandler: You mean the part where he compared your lasagna to insulation?\nRachel:'
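The prompt above contains only three seed lines because a dict cannot hold two entries for the same speaker. A minimal sketch of the effect, together with one possible alternative that keeps every turn: a list of `(speaker, line)` pairs. `format_dialogue` is a hypothetical helper, and `create_prompt` may not accept a list as written.

```python
# Duplicate keys collapse: the last value wins and keeps the first key's position.
seed_as_dict = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point.",
    "Monica": "It was one time! And it was structurally sound!",
}
print(len(seed_as_dict))  # 2

# A list of pairs preserves every turn in order.
seed_as_turns = [
    ("Monica", "I can't believe he said that!"),
    ("Rachel", "Well, he does have a point."),
    ("Monica", "It was one time! And it was structurally sound!"),
]

def format_dialogue(turns, continue_speaker):
    """Render the turns as 'Speaker: line' and end with an open cue for the model."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{continue_speaker}:")
    return "\n".join(lines)

print(format_dialogue(seed_as_turns, "Rachel"))
```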

## Generate New Script

In [None]:
# custom_fine_tuned_model.to("cuda")

In [None]:
new_script = scripts_generator.create_new_script(custom_model.model, custom_model.tokenizer)
scripts_generator.pretty_print_script(new_script)

You are going to generate a new episode of the show *Friends*.

The episode should include multiple scenes, natural conversations, character-specific humor, and a clear ending.

Title: THE ONE WITH THE THUMB

[Scene: Restaurant, Monica, Rachel are having dinner.]
------------------------------------------------------


Monica: It was one time! And it was structurally sound!
Rachel: Well, he does have a point.
Chandler: You mean the part where he compared your lasagna to insulation?
Rachel: Oh my god!! I did not know that! How could you be so mean Chandler!!
Monica: Yeah, you're just lucky that Chandler didn't get mad at me when I put garlic butter on his hair  ..... (thinking about it)..... wait...was that in an earlier episode.....


[Scene: Central Perk]
---------------------

   (Joey enters with Kate)
Kate: Hey Joey, how's our date going?
Joeysays to her: Pretty good.. pretty good. But don't worry there is something very strange thing that happened today while we were working out t

In [None]:
print(custom_model.model.config.use_cache)

True


# Evaluation
We evaluate our causal language model with the following metrics:

1. Perplexity
2. ROUGE


### Perplexity

In [None]:
evaluate_results = custom_model.trainer.evaluate()
loss = evaluate_results["eval_loss"]
perplexity = np.exp(loss)
perplexity

np.float64(6.0044728604249284)
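Perplexity is the exponential of the mean per-token negative log-likelihood, which is why `np.exp(loss)` suffices when the evaluation loss is the mean cross-entropy. The sketch below computes it directly from per-token log-probabilities; the input values are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity close to 4:
# it is as uncertain as a uniform choice among four tokens.
uniform_over_four = [math.log(1 / 4)] * 10
print(perplexity(uniform_over_four))
```

Read the same way, a perplexity of about 6 means that the model is, on average, as uncertain as a uniform choice among roughly six tokens.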

### ROUGE

We take the opening lines of a randomly chosen episode as the prompt, then compare the model's continuation of that prompt with the dialogue that actually follows in the original script.
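For intuition, ROUGE-1 measures unigram overlap between a prediction and a reference. The sketch below is a simplified version that splits on whitespace and lowercases; the `rouge` metric from `evaluate` applies its own tokenization, so its numbers can differ.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """F1 over clipped unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat", "the cat ran"))
```

Because free-form dialogue can be good without reusing the reference's words, low ROUGE scores are expected for open-ended generation.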

In [None]:
prompt = """
THE ONE WITH RACHEL'S INADVERTANT KISS
Written by: Andrew Reich & Ted Cohen Transcribed by: Eric Aasen
[Scene: Central Perk, everyone is there as Rachel enters, happily.]
Rachel: Good, you guys are all here!
Ross: Hey! What's up?
Rachel: Well, I have a job interview at Ralph Lauren tomorrow!
All: Congratulations! Ohh, that's great!
Rachel: I know!
Joey: Boy, that guy's underwear sucks!
Rachel: Wh-what?!
Joey: I got this pair marked excess, I gotta tell ya, there was no room for excess anything in there.
Rachel: Anyway, I'm going to be the coordinator of the woman's collection, I'll work right under the director, it's the perfect, perfect job for me!
Phoebe: Wow! Well, if you nail the interview, you'll get it!
Rachel: Yeah.
Phoebe: You wanna work on your interview skills?
Rachel: O-okay!
Phoebe: Okay! All right, let's start with the handshake. Hi.
Rachel: Hi.
(They shake hands.)
Phoebe:
"""

references = """
Phoebe: Very good handshake, good wrist action.
Monica: Let me try. (Gets up to join them.)
Phoebe: Okay. (They shake hands and she pulls away suddenly) Oh my God! What did I ever do to you?! (Rubbing her hand.)
Monica: Did I squeeze it too hard?
Phoebe: Let's just say, I'm glad I'm not Chandler.
(Chandler tries to comprehend that remark.)
Opening Credits
[Scene: Monica and Rachel's, Joey is standing at the window waving at Ross.]
Joey: That's right Ross, I can see you in your new apartment! And you can see me! Same as yesterday, (To Monica) same as the day before.
Monica: Is he doing his shark attack bit yet?
Joey: Nope. Op, wait! There he goes.
(We see Ross through the window and he acts like a swimmer that gets attacked by a shark, picture one of the many, many, many Jaws movies they made and you get the idea.)
Joey: (waving) Very funny Ross! Very life-like and funny. Okay. (Notices that a woman is waving back.) Oh no-no-no, I wasn't waving at you lady. (She just stares at him.) (Joey sees how beautiful she is.) Whoa, maybe I was! Hey, Monica, this totally hot girl in Ross's building is flirting with me.
Monica: Get in there man! Flirt back, mix it up!
"""

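# With return_tensors="pt", input_ids has shape (1, seq_len), so len() on it returns 1;
# read the sequence dimension to get the number of reference tokens.
references_tokens_len = data_preprocessor.tokenizer(references, return_tensors="pt")["input_ids"].shape[1]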
scripts_generator.prompt = prompt
predictions = scripts_generator.create_new_script(custom_model.model, custom_model.tokenizer, max_new_tokens=references_tokens_len)

In [None]:
rouge_metric = evaluate.load("rouge")
pred_lines = predictions.strip().split("\n")
ref_lines = references.strip().split("\n")

for pred, ref in zip(pred_lines, ref_lines):
    rouge_metric.add(prediction=pred, reference=ref)

rouge_score = rouge_metric.compute()
print(rouge_score)

{'rouge1': np.float64(0.04937085945489307), 'rouge2': np.float64(0.00234192037470726), 'rougeL': np.float64(0.05060024009603842), 'rougeLsum': np.float64(0.050744742341381)}


In [None]:
print(f"Rouge-1 Score is :{rouge_score['rouge1']}\n")
print(f"Rouge-2 Score is :{rouge_score['rouge2']}\n")
print(f"Rouge-L Score is :{rouge_score['rougeL']}\n")
print(f"Rouge-Lsum Score is :{rouge_score['rougeLsum']}\n")

Rouge-1 Score is :0.04937085945489307

Rouge-2 Score is :0.00234192037470726

Rouge-L Score is :0.05060024009603842

Rouge-Lsum Score is :0.050744742341381



# Some Experiments

In [None]:
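# Note: a Python dict keeps only one value per key, so the second "Monica"
# entry overwrites the first; the prompt below therefore opens with her last line.
seed_dialogue = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point.",
    "Chandler": "You mean the part where he compared your lasagna to insulation?",
    "Monica": "It was one time! And it was structurally sound!"
}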

prompt = scripts_generator.create_prompt(
    characters=['Monica', 'Rachel'],
    location='Restaurant',
    scenario='having dinner',
    seed_dialogue=seed_dialogue,
    continue_speaker='Rachel'
)

prompt

'You are going to generate a new episode of the show *Friends*.\n        \n        The episode should include multiple scenes, natural conversations, character-specific humor, and a clear ending.\n        \n        [Scene: Restaurant, Monica, Rachel are having dinner.]\n\n\n        Monica: It was one time! And it was structurally sound!\nRachel: Well, he does have a point.\nChandler: You mean the part where he compared your lasagna to insulation?\nRachel:'

In [None]:
import torch

scripts_generator.prompt = prompt
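# NOTE: custom_fine_tuned_model is not defined in the cells shown in this notebook;
# it must be loaded before this cell is run.
custom_fine_tuned_model.to("cuda")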
new_script = scripts_generator.create_new_script(custom_fine_tuned_model, custom_model.tokenizer)
scripts_generator.pretty_print_script(new_script)

You are going to generate a new episode of the show *Friends*.

The episode should include multiple scenes, natural conversations, character-specific humor, and a clear ending.

[Scene: Restaurant, Monica, Rachel are having dinner.]
------------------------------------------------------


Monica: It was one time! And it was structurally sound!
Rachel: Well, he does have a point.
Chandler: You mean the part where he compared your lasagna to insulation?
Rachel: Okay, I'm sorry about all that, but you guys gotta understand something...
Monica: No we don't get anything, because this is not our wedding!!
[Pause]   Chandler: So there's some stuff in this guy's life you probably didn't know about right??
---------------------------------------------------------------------------------------------------
Joey: Yeah man I was kinda scared when I found out about his sex obsession with anatomical illustration books from the turn of the century.
Monica: Yes it took me almost six months to build my 

## Try generating with Llama-2-7b without fine-tuning

In [None]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

data_preprocessor_llama = utils.DataPreprocessor(llama_2_7)

scripts_df = df['script'].apply(data_preprocessor_llama.clean_script)

scripts_df.to_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_scripts_by_episode.csv")
scripts_df.head()

train_scripts, val_scripts = data_preprocessor_llama.data_split(scripts_df)

train_tokens, val_tokens, train_dataset, val_dataset = data_preprocessor_llama.tokenize_scripts(train_scripts, val_scripts)

custom_llama2 = utils.CustomModel(
    llama_2_7,
    data_preprocessor_llama.tokenizer,
    train_dataset,
    val_dataset,
    lr=3e-5,
    warmup=0.03,
    L2=0.05,
    batch_size=4,
    epochs=10,
    enable_lora=True,
    enable_bitsbytes=True
)

# custom_llama2.trainer.train()  # left commented out: this experiment uses the model without fine-tuning

scripts_generator_llama2 = utils.ScriptGenerator()

prompt_llama2 = """
Monica: I can't believe he said that!
Rachel: I know, but I don't want to hear it!
Monica:
"""

scripts_generator_llama2.prompt = prompt_llama2

new_script = scripts_generator_llama2.create_new_script(custom_llama2.model, custom_llama2.tokenizer)
scripts_generator_llama2.pretty_print_script(new_script)

Initialize data propressor...
Select meta-llama/Llama-2-7b-hf for Tokenization...




Tokenizing [training] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...

Tokenizing [validation] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...


Tokenizations all done!
Initialize custom pretrained model...



`low_cpu_mem_usage` was None, now default to True since model is quantized.



Hyperparams Received! Started to generate Trainer...



Started to generate Trainer...

[Training Arguments] and [Trainer] Generated! 

Initialize TV Show Scripts generator...
Monica: I can't believe he said that!
Rachel: I know, but I don't want to hear it!
Monica: 
> _[sarcastically]_ Oh...so let me get this straight. You think you made out with Mr Cool? Big deal!...Oh please Rach - just give him a break!! [to herself] Oi oi oi - how much is there of her left in those clothes??
Ross and Mona arrive home from the airport looking extremely tired. Ross: How was it? Mona: It wasn't too bad, except for an hour on standby we got some sleep at least. She'd like to say thanks again for letting us stay here whilst she looks around, and now if you could just point us in the direction of our room...
Monica (still angry): And so long as they don't take off all their clothing before bedtime......I wish my brother hadn't slept with his best friend's wife. But then why didn't you stop them? Why did