In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Project: TV Show Script Generation

In this project, we fine-tune a pre-trained language model on scripts from the TV show Friends and use it to generate a script for a new episode.

Our project has 5 main parts:

1. Data Collection
2. Data Preprocessing
3. Load and Fine-Tune the Model
4. Model Evaluation
5. Generate Scripts



## Preparation

In [2]:
!python /content/drive/MyDrive/NLP_Project/operations/setup_env.py

Installing datasets...
✅ datasets installed.
Installing evaluate...
✅ evaluate installed.
Installing rouge_score...
✅ rouge_score installed.
Installing bitsandbytes...
✅ bitsandbytes installed.


Save the Hugging Face token to Colab Secrets so that we don't need to enter it every time we log in.

In [3]:
import pandas as pd
import numpy as np
import importlib
import evaluate
import sys
import os

from datasets import Dataset

sys.path.append('/content/drive/MyDrive/NLP_Project/operations')
import utils

gpt2 = "gpt2-xl"
llama_2_7 = "meta-llama/Llama-2-7b-hf"
llama_2_13 = "meta-llama/Llama-2-13b-hf"

mixtral_87 = "mistralai/Mixtral-8x7B-Instruct-v0.1"
nousH_13 = "NousResearch/Nous-Hermes-13b"

output_path = "/content/drive/MyDrive/NLP_Project/model"

In [4]:
utils.huggingface_login()

Hugging Face Successfully Login!
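
The body of `utils.huggingface_login()` is not shown in this notebook. As a minimal sketch, a helper of this kind might read the token from Colab Secrets and fall back to an environment variable when it runs outside Colab. The function name `get_hf_token` and the secret name `HF_TOKEN` are assumptions made for illustration.

```python
import os


def get_hf_token(secret_name: str = "HF_TOKEN"):
    """Return the token from Colab Secrets, or from the environment outside Colab.

    `secret_name` is a hypothetical name; use whatever name the secret was saved under.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(secret_name)
    return userdata.get(secret_name)
```

The returned string could then be passed to `huggingface_hub.login(token=...)`.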


# Data Collection
We use the Friends scripts dataset from Kaggle and keep the `script` column, which holds the full text of each episode.

In [5]:
df = pd.read_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_episodes.csv")

In [6]:
df.head()

Unnamed: 0,episode_title,script
0,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...
1,THE ONE WITH THE SONOGRAM AT THE END\nW,THE ONE WITH THE SONOGRAM AT THE END\nWritten ...
2,THE ONE WITH THE THUMB\nW,THE ONE WITH THE THUMB\nWritten by: Jeffrey As...
3,THE ONE WITH GEORGE STEPHANOPOULOS\nW,THE ONE WITH GEORGE STEPHANOPOULOS\nWritten by...
4,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT\nW,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT...


In [7]:
df.shape

(223, 2)

# Data Preprocessing
In this part, we focus on the following steps:

1. Clean the data and divide it by episodes.
2. Train-Validation Split.
3. Tokenize the scripts.




In [8]:
data_preprocessor = utils.DataPreprocessor(llama_2_7)

Initialize data propressor...
Select meta-llama/Llama-2-7b-hf for Tokenization...



tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

## Data Cleaning

In [9]:
scripts_df = df['script'].apply(data_preprocessor.clean_script)

scripts_df.to_csv("/content/drive/MyDrive/NLP_Project/dataset/friends_scripts_by_episode.csv")
scripts_df.head()

Unnamed: 0,script
0,THE ONE WHERE MONICA GETS A NEW ROOMATE (THE P...
1,THE ONE WITH THE SONOGRAM AT THE END\n[Scene C...
2,"THE ONE WITH THE THUMB\n[Scene: Central Perk, ..."
3,THE ONE WITH GEORGE STEPHANOPOULOS\n[Scene: Ce...
4,THE ONE WITH THE EAST GERMAN LAUNDRY DETERGENT...


After the cleaning step above, each element of `scripts_df` holds the full script of one episode.
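
The implementation of `clean_script` lives in `utils.py` and is not shown here. Comparing the two `head()` outputs suggests that it at least drops the credit lines ("Written by: ..."). The sketch below illustrates that one step under this assumption; the name `clean_script_sketch` and the list of credit prefixes are hypothetical.

```python
import re

# Assumed credit-line prefixes; the real cleaner may handle more cases.
CREDIT_LINE = re.compile(r"^\s*(written|transcribed|story|teleplay)\s+by\s*:", re.IGNORECASE)


def clean_script_sketch(raw: str) -> str:
    """Drop empty lines and credit lines, keep the title, scene headers and dialogue."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or CREDIT_LINE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


toy_script = "THE ONE WITH THE THUMB\nWritten by: Someone\n\n[Scene: Central Perk]\nRoss: Hi."
print(clean_script_sketch(toy_script))
```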

## Train-Validation Split

In [10]:
train_scripts, val_scripts = data_preprocessor.data_split(scripts_df)
print(train_scripts)
print(len(train_scripts))
print(len(val_scripts))

Output hidden; open in https://colab.research.google.com to view.
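
Since the output of this cell is hidden, the split sizes are not visible here. A minimal sketch of an episode-level split is shown below. Splitting whole episodes keeps chunks of the same episode out of both sets. The function name, the 10% validation fraction, and the seed are assumptions, not the values used by `data_split`.

```python
import random


def split_episodes(episodes, val_fraction=0.1, seed=42):
    """Shuffle episode indices with a fixed seed and hold out a fraction for validation."""
    indices = list(range(len(episodes)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(episodes) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [ep for i, ep in enumerate(episodes) if i not in val_idx]
    val = [ep for i, ep in enumerate(episodes) if i in val_idx]
    return train, val


# 223 placeholder episodes, matching the dataset size reported by df.shape.
toy_episodes = [f"episode {i}" for i in range(223)]
toy_train, toy_val = split_episodes(toy_episodes)
print(len(toy_train), len(toy_val))
```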

## Tokenize the scripts data

In [11]:
scripts = scripts_df.tolist()
len(scripts)

223

In [12]:
train_tokens, val_tokens, train_dataset, val_dataset = data_preprocessor.tokenize_scripts(train_scripts, val_scripts)

Tokenizing [training] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...

Tokenizing [validation] scripts...


Scripts Received! 

Begin to chunk scripts into pieces length less than 1024...


Chunking Finished! Ready to return the new scripts dataset...


Tokenizations all done!
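
The log above mentions chunking each script into pieces bounded by 1024 tokens. How `tokenize_scripts` does this is not shown; one simple way, sketched here, is to slice the token id list into consecutive blocks. The function name and the choice of non-overlapping blocks are assumptions.

```python
def chunk_token_ids(token_ids, block_size=1024):
    """Split one long list of token ids into consecutive blocks of at most block_size ids."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]


# Stand-in for a tokenized episode of 2500 tokens.
toy_ids = list(range(2500))
toy_chunks = chunk_token_ids(toy_ids)
print([len(chunk) for chunk in toy_chunks])
```

Non-overlapping blocks are the simplest option; an overlapping (strided) window would give each chunk some preceding context at the cost of more training samples.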


In [13]:
data_preprocessor.tokenizer

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

# Create Model and Fine-Tune it
In this part, we wrap a pre-trained model in a custom class and fine-tune it on our TV script data.

1. Define a pre-trained model.
2. Use `LoRA` and `BitsAndBytesConfig` to reduce the memory footprint of the model and the number of trainable parameters, which saves GPU RAM and speeds up training.
3. Train the model on the scripts data and tune the hyperparameters.



## Define a custom pre-trained model

In [14]:
custom_model = utils.CustomModel(
    llama_2_7,
    data_preprocessor.tokenizer,
    data_preprocessor.train_set,
    data_preprocessor.val_set,
    lr=2e-4,
    warmup=0.03,
    L2=0.05,
    batch_size=4,
    epochs=3,
    enable_lora=True,
    enable_bitsbytes=True
)

Initialize custom pretrained model...


config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Hyperparams Received! Started to generate Trainer...



Started to generate Trainer...

[Training Arguments] and [Trainer] Generated! 



In [15]:
custom_model

<utils.CustomModel at 0x7b65209a14d0>

## Use `BitsAndBytesConfig` to avoid running out of GPU RAM

In [16]:
custom_model.bitsbytes

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
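
To see why 4-bit loading helps, the rough arithmetic below estimates the weight memory from the shard sizes in the download log above (9.98 GB and 3.50 GB). It assumes the checkpoint stores 16-bit weights (2 bytes each) and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so the numbers are only a lower bound on real usage.

```python
GB = 1e9

# Shard sizes taken from the download progress lines above.
fp16_weight_bytes = (9.98 + 3.50) * GB

# Assumption: 2 bytes per weight in the stored checkpoint.
n_params = fp16_weight_bytes / 2

# NF4 stores 4 bits (0.5 bytes) per quantized weight.
nf4_weight_bytes = n_params * 0.5

fp16_gb = fp16_weight_bytes / GB
nf4_gb = nf4_weight_bytes / GB
print(f"parameters: {n_params / 1e9:.2f}B")
print(f"16-bit weights: {fp16_gb:.2f} GB, 4-bit weights: {nf4_gb:.2f} GB")
```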

## Use LoRA to fine-tune the Llama 2 model

In [17]:
custom_model.lora

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='meta-llama/Llama-2-7b-hf', revision=None, inference_mode=False, r=8, target_modules={'q_proj', 'v_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, eva_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)
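
The configuration above (`r=8`, `lora_alpha=32`, adapters on `q_proj` and `v_proj`) determines how many parameters are actually trained. The sketch below works this out by hand, assuming the Llama-2-7B dimensions of 32 decoder layers and a hidden size of 4096, with both projections mapping 4096 to 4096.

```python
hidden_size = 4096   # assumed Llama-2-7B hidden size
num_layers = 32      # assumed number of decoder layers
r = 8
lora_alpha = 32
target_modules = ("q_proj", "v_proj")

# Each adapted weight W (d_out x d_in) gets two small matrices:
# A with shape (r, d_in) and B with shape (d_out, r).
params_per_module = r * hidden_size + hidden_size * r
trainable_params = num_layers * len(target_modules) * params_per_module

# The low-rank update B @ A is scaled by lora_alpha / r before being added to W.
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {trainable_params:,}")
print(f"update scaling: {scaling}")
```

Under these assumptions, roughly 4.2 million parameters are trained, a small fraction of the roughly 6.7 billion weights in the frozen base model.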

## Train and Fine-tune the model

In [18]:
custom_model.trainer.train()

Step,Training Loss,Validation Loss
500,1.779,1.79617
1000,1.6937,1.799219
1500,1.5861,1.833773


TrainOutput(global_step=1740, training_loss=1.6867751373641793, metrics={'train_runtime': 1830.3696, 'train_samples_per_second': 3.799, 'train_steps_per_second': 0.951, 'total_flos': 2.8248044357025792e+17, 'train_loss': 1.6867751373641793, 'epoch': 3.0})
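
A quick reading of the log above: the training loss keeps falling while the validation loss is lowest at step 500 and rises afterwards, which hints at overfitting in the later epochs. The sketch below extracts this from the logged rows and also estimates the number of training chunks, assuming no gradient accumulation and a single device.

```python
# (step, training_loss, validation_loss) rows copied from the log above.
log = [
    (500, 1.779, 1.79617),
    (1000, 1.6937, 1.799219),
    (1500, 1.5861, 1.833773),
]

# Step with the lowest validation loss.
best_step, _, best_val_loss = min(log, key=lambda row: row[2])

global_step, epochs, batch_size = 1740, 3, 4
steps_per_epoch = global_step // epochs
max_train_chunks = steps_per_epoch * batch_size  # upper bound; the last batch may be partial

# Cross-check using the reported runtime and throughput.
train_runtime, samples_per_second = 1830.3696, 3.799
chunks_from_throughput = train_runtime * samples_per_second / epochs

print(f"lowest validation loss {best_val_loss} at step {best_step}")
print(f"steps per epoch: {steps_per_epoch}, at most {max_train_chunks} training chunks")
print(f"chunks estimated from throughput: {chunks_from_throughput:.0f}")
```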

## Save the model

In [19]:
custom_model.trainer.save_model(output_path)
custom_model.tokenizer.save_pretrained(output_path)

('/content/drive/MyDrive/NLP_Project/model/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_Project/model/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_Project/model/tokenizer.model',
 '/content/drive/MyDrive/NLP_Project/model/added_tokens.json',
 '/content/drive/MyDrive/NLP_Project/model/tokenizer.json')

# New Script Generation
In this part, we generate the script of a new episode with our fine-tuned model.

1. Customize a prompt.
2. Generate a new script based on the prompt.



In [20]:
scripts_generator = utils.ScriptGenerator()

Initialize TV Show Scripts generator...


## Customize the Prompt

In [21]:
importlib.reload(utils)
seed_dialogue = {
    "Monica": "I can't believe he said that!",
    "Rachel": "Well, he does have a point."
}

prompt = scripts_generator.create_prompt(
    characters=['Monica', 'Rachel'],
    location='Restaurant',
    scenario='having dinner',
    seed_dialogue=seed_dialogue
)

prompt

"[Scene: Restaurant, Monica, Rachel are having dinner.]\n\nMonica: I can't believe he said that!\nRachel: Well, he does have a point."

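The source of `create_prompt` is not shown, but the output above suggests a simple template: a scene header followed by the seed dialogue, one line per speaker. The sketch below reproduces that string; the function name `create_prompt_sketch` is hypothetical and the real method may differ (for example, it appears to store the prompt on the generator object as well).

```python
def create_prompt_sketch(characters, location, scenario, seed_dialogue):
    """Build a scene header plus seed dialogue in the style of the training scripts."""
    scene = f"[Scene: {location}, {', '.join(characters)} are {scenario}.]"
    lines = [f"{speaker}: {line}" for speaker, line in seed_dialogue.items()]
    return scene + "\n\n" + "\n".join(lines)


sketch_prompt = create_prompt_sketch(
    characters=["Monica", "Rachel"],
    location="Restaurant",
    scenario="having dinner",
    seed_dialogue={
        "Monica": "I can't believe he said that!",
        "Rachel": "Well, he does have a point.",
    },
)
print(sketch_prompt)
```

Because a dict holds one entry per speaker, this format allows only one seed line per character; a list of `(speaker, line)` pairs would lift that limit.
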
## Generate New Script

In [22]:
new_script = scripts_generator.create_new_script(custom_model.model, custom_model.tokenizer)

In [23]:
new_script

"[Scene: Restaurant, Monica, Rachel are having dinner.]\n\nMonica: I can't believe he said that!\nRachel: Well, he does have a point. You don't really know how you two would handle having sex if everything was right and there were no obstacles in the way. I mean what happens? Does it just happen once and then the next day you feel guilty about it or do you not care at all because nothing changes between the two of you?\nMonica: (with a sad expression) What happened to me?  Everything stayed exactly the same...\n[Scene: Hallway outside Ross's room]   Joey: Listen pal, I didn't tell Chandler to cheat on his wife ok? It wasn't part of my plan! And now he is, we both are paying the price for this and...  Ross: Hey Jackass, you better explain yourself before I slap some sense into ya!!  Joey: You can't, your hand still stinks from when he smacked you across the face last weekend!!!"

# Evaluation
We evaluate the fine-tuned causal language model with the following metrics:

1. Perplexity
2. ROUGE


### Perplexity

In [24]:
evaluate_results = custom_model.trainer.evaluate()
loss = evaluate_results["eval_loss"]
perplexity = np.exp(loss)
perplexity

np.float64(6.298087912338648)
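
Perplexity is the exponential of the mean negative log-likelihood per token, which is what `np.exp(eval_loss)` computes above. The sketch below shows the same relation on toy numbers and inverts it to recover the evaluation loss implied by the reported perplexity. The helper name `perplexity_from_nlls` is illustrative.

```python
import math


def perplexity_from_nlls(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as uncertain as a uniform choice among 4 tokens.
uniform_ppl = perplexity_from_nlls([math.log(4)] * 10)

# Inverting the relation gives the evaluation loss behind the value above.
implied_eval_loss = math.log(6.298087912338648)

print(uniform_ppl, round(implied_eval_loss, 3))
```

A perplexity of about 6.3 therefore means that, on average, the model is as uncertain as a uniform choice among roughly 6 tokens at each position of the validation scripts.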

### ROUGE

In [25]:
prompt = """
THE ONE WITH RACHEL'S INADVERTANT KISS
Written by: Andrew Reich & Ted Cohen Transcribed by: Eric Aasen
[Scene: Central Perk, everyone is there as Rachel enters, happily.]
Rachel: Good, you guys are all here!
Ross: Hey! What's up?
Rachel: Well, I have a job interview at Ralph Lauren tomorrow!
All: Congratulations! Ohh, that's great!
Rachel: I know!
Joey: Boy, that guy's underwear sucks!
Rachel: Wh-what?!
Joey: I got this pair marked excess, I gotta tell ya, there was no room for excess anything in there.
Rachel: Anyway, I'm going to be the coordinator of the woman's collection, I'll work right under the director, it's the perfect, perfect job for me!
Phoebe: Wow! Well, if you nail the interview, you'll get it!
Rachel: Yeah.
Phoebe: You wanna work on your interview skills?
Rachel: O-okay!
Phoebe: Okay! All right, let's start with the handshake. Hi.
Rachel: Hi.
(They shake hands.)
Phoebe:
"""

references = """
Phoebe: Very good handshake, good wrist action.
Monica: Let me try. (Gets up to join them.)
Phoebe: Okay. (They shake hands and she pulls away suddenly) Oh my God! What did I ever do to you?! (Rubbing her hand.)
Monica: Did I squeeze it too hard?
Phoebe: Let's just say, I'm glad I'm not Chandler.
(Chandler tries to comprehend that remark.)
Opening Credits
[Scene: Monica and Rachel's, Joey is standing at the window waving at Ross.]
Joey: That's right Ross, I can see you in your new apartment! And you can see me! Same as yesterday, (To Monica) same as the day before.
Monica: Is he doing his shark attack bit yet?
Joey: Nope. Op, wait! There he goes.
(We see Ross through the window and he acts like a swimmer that gets attacked by a shark, picture one of the many, many, many Jaws movies they made and you get the idea.)
Joey: (waving) Very funny Ross! Very life-like and funny. Okay. (Notices that a woman is waving back.) Oh no-no-no, I wasn't waving at you lady. (She just stares at him.) (Joey sees how beautiful she is.) Whoa, maybe I was! Hey, Monica, this totally hot girl in Ross's building is flirting with me.
Monica: Get in there man! Flirt back, mix it up!
"""

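# input_ids has shape (1, seq_len), so len() would return the batch size (1); shape[1] is the token count.
references_tokens_len = data_preprocessor.tokenizer(references, return_tensors="pt")["input_ids"].shape[1]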
scripts_generator.prompt = prompt
predictions = scripts_generator.create_new_script(custom_model.model, custom_model.tokenizer, max_new_tokens=references_tokens_len)

In [26]:
rouge_metric = evaluate.load("rouge")
pred_lines = predictions.strip().split("\n")
ref_lines = references.strip().split("\n")

for pred, ref in zip(pred_lines, ref_lines):
    rouge_metric.add(prediction=pred, reference=ref)

rouge_score = rouge_metric.compute()
print(rouge_score)

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

{'rouge1': np.float64(0.04937085945489307), 'rouge2': np.float64(0.00234192037470726), 'rougeL': np.float64(0.05060024009603842), 'rougeLsum': np.float64(0.050744742341381)}
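
To make the metric concrete, the sketch below computes a simplified ROUGE-1 F1 score by hand: whitespace tokens, lowercasing, no stemming. The `rouge` metric loaded above tokenizes differently, so its values will not match exactly. The sketch also includes a hypothetical `strip_prompt` helper. The earlier generation output shows that the returned text starts with the prompt, so scoring only the continuation may give a fairer comparison against the reference.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping unigrams (clipped counts)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def strip_prompt(generated: str, prompt: str) -> str:
    """Return only the continuation when the generated text begins with the prompt."""
    return generated[len(prompt):] if generated.startswith(prompt) else generated


toy_score = rouge1_f1("the cat sat", "the cat ran")
toy_continuation = strip_prompt("Phoebe: Hi. Rachel: Hi.", "Phoebe: Hi. ")
print(toy_score, toy_continuation)
```

Scoring the whole continuation against the whole reference, rather than pairing lines by position, also avoids penalizing the model when its line breaks fall in different places.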


In [27]:
importlib.reload(utils)

<module 'utils' from '/content/drive/MyDrive/NLP_Project/operations/utils.py'>