This is our data script. Run the cells below, download the resulting .txt files, and store them in our Google Drive for model training.

**Part A. FRIENDS scripts**
  1. Import FRIENDS scripts from Kaggle.
  2. Sort the episodes by file name so they stay in season order and the data remains coherent.
  3. Add a marker for End of Episode at the end of each episode.
  4. Split the data into train, val, test sets (ratio is approximately 70/10/20).
  5. Remove any website addresses that might be embedded in the data.

**Part B. WikiText-103**
  1. Import the WikiText-103 dataset from Hugging Face, take only the test set, and write it to wikitext103_test.txt for download.

**Part A. FRIENDS scripts**

In [None]:
import os
import re
from collections import defaultdict

In [None]:
import kagglehub
path = kagglehub.dataset_download("blessondensil294/friends-tv-series-screenplay-script")
print("Path to dataset files:", path)

In [None]:
# let's count how many episodes we have
input_dir = "/root/.cache/kagglehub/datasets/blessondensil294/friends-tv-series-screenplay-script/versions/1"
episode_files = [f for f in os.listdir(input_dir) if f.endswith(".txt")]
file_count = len(episode_files)
print(file_count)

The function below sorts the files before reading them, which keeps all the episodes in order (Season 1 to Season 10).

In [None]:
# read and write episodes
def create_concatenated_data(input_dir, output_file, max_files, start_index):

    # list all files in the directory that end with ".txt",
    # sorted by name to ensure episode coherence
    all_files = sorted([f for f in os.listdir(input_dir) if f.endswith(".txt")])

    # slice the list from start_index to start_index+max_files
    sliced_files = all_files[start_index:start_index + max_files]

    # open up output_file in write mode
    with open(output_file, "w", encoding="utf-8") as out_f:
        for fname in sliced_files: # loop through every episode
            with open(os.path.join(input_dir, fname), "r", encoding="utf-8") as ep_f: # read every episode
                text = ep_f.read().strip() # take its text
            out_f.write(text + "\n\n### END OF EPISODE ###\n\n") # at the end of one episode, mark it for the model clearly that this is the end.

    # print confirmation
    print(f"Taking file index {start_index} to {start_index + max_files}, a total of {len(sliced_files)} episodes, and saving them into {output_file}.")
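
Because `sorted` compares file names as plain strings, the season order relies on the season and episode numbers being zero-padded in the names. The sketch below illustrates the difference; the file names in it are hypothetical and are not taken from the Kaggle dataset.

```python
# Hypothetical file names, used only to illustrate string sorting.
padded = ["S10E01.txt", "S02E01.txt", "S01E10.txt", "S01E02.txt"]
unpadded = ["S1E2.txt", "S1E10.txt", "S10E1.txt", "S2E1.txt"]

# Zero-padded names sort into the intended season/episode order.
padded_sorted = sorted(padded)
# Unpadded names do not: "S10E1" lands before "S1E2".
unpadded_sorted = sorted(unpadded)

print(padded_sorted)
print(unpadded_sorted)
```

If the downloaded files turn out not to be zero-padded, the sort key would need to parse the numbers instead of comparing raw strings.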

In [None]:
# manually generate the train/val/test data files with a 70/10/20 split
create_concatenated_data(input_dir=input_dir, output_file="train_data.txt", max_files=160, start_index=0)
create_concatenated_data(input_dir=input_dir, output_file="val_data.txt", max_files=22, start_index=160)
create_concatenated_data(input_dir=input_dir, output_file="test_data.txt", max_files=46, start_index=182)
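
As a quick sanity check on the split, the ratios implied by the three `max_files` values can be computed directly. This is a minimal sketch; the `split_sizes` dictionary is just an illustrative container for the numbers used in the calls above.

```python
# File counts copied from the three create_concatenated_data calls.
split_sizes = {"train": 160, "val": 22, "test": 46}

total = sum(split_sizes.values())
# Percentage of episodes in each split, rounded to one decimal place.
ratios = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)   # 228
print(ratios)  # {'train': 70.2, 'val': 9.6, 'test': 20.2}
```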

During model training we observed that the fine-tuned model sometimes generates website addresses out of nowhere. We suspected that some website addresses are embedded in our training data, so we use the code below to remove them.

Note: We explored a few other preprocessing techniques, such as manually specifying rules to label character dialogue, actions, stage directions and time lapses. We also tried the autocorrector library to fix spelling issues and typos. All of them created more problems than they solved: rules labeled the wrong context, and autocorrection changed short forms of character names into other words (for example, "Rach", meaning "Rachel", was corrected to "each"). In the end we kept only the website cleaning function.

In [None]:
# clean websites
def clean_websites(input_file, output_file):
    # URL patterns to get rid of
    url_patterns = [r'https?://\S+',
                    r'www\.\S+',
                    # \b after the suffix keeps text like "Okay.Come" from matching
                    r'\b\w+\.(?:com|org|net|edu|gov|io|co)\b\S*']
    # combine patterns
    pattern = re.compile('|'.join(url_patterns), re.IGNORECASE)
    try:
        # read input file
        with open(input_file, 'r', encoding='utf-8') as infile:
            text = infile.read()
        # remove URLs
        cleaned_text = pattern.sub('', text).strip()
        # write cleaned text to output file
        with open(output_file, 'w', encoding='utf-8') as outfile:
            outfile.write(cleaned_text)
        print(f"Cleaned text has been written to {output_file}")
    except Exception as e:
        print(f"An error occurred: {e}")

# run the function for all the data splits
clean_websites('train_data.txt', 'cleaned_train_data.txt')
clean_websites('val_data.txt', 'cleaned_val_data.txt')
clean_websites('test_data.txt', 'cleaned_test_data.txt')
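
To see what this kind of cleaning does to individual lines, here is a small self-contained sketch. The pattern list is modeled on the one in `clean_websites`, with a word boundary after the domain suffix (our assumption about the desired behavior) so that a missing space after a period does not look like a domain. The sample strings are made up for illustration.

```python
import re

url_patterns = [
    r"https?://\S+",                               # explicit http/https links
    r"www\.\S+",                                   # links starting with www.
    r"\b\w+\.(?:com|org|net|edu|gov|io|co)\b\S*",  # bare domains like example.com
]
pattern = re.compile("|".join(url_patterns), re.IGNORECASE)

# Made-up sample lines, not taken from the dataset.
samples = [
    "Transcribed by fan, see http://example.com/friends for more.",
    "Visit www.example.org today",
    "Ross: Okay.Come on, Rach!",
]
cleaned = [pattern.sub("", s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The first two lines lose their URLs (leaving a double space where the URL was), while the third line is untouched because "Come" is not followed by a word boundary after "com" or "co".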

**Part B. Download WikiText-103**

In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import load_dataset

# take only the test set
ds = load_dataset("Salesforce/wikitext", "wikitext-103-v1", split="test")

# write it into a .txt file for download
with open("wikitext103_test.txt", "w", encoding="utf-8") as f:
    for example in ds:
        text = example["text"]
        f.write(text.strip() + "\n")

After running this notebook we should have 7 .txt files on Colab. We save the last 4 (the three cleaned FRIENDS splits and wikitext103_test.txt) to our Google Drive and load them in the Training Script.
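
Before downloading, it can help to confirm that all seven files exist and are non-empty. The helper below is a hypothetical sketch (`find_missing` and `EXPECTED_FILES` are our own names, not part of the notebook above); the demo runs on a temporary directory so it does not depend on the notebook's outputs.

```python
import os
import tempfile

# File names written by the cells in this notebook.
EXPECTED_FILES = [
    "train_data.txt", "val_data.txt", "test_data.txt",
    "cleaned_train_data.txt", "cleaned_val_data.txt", "cleaned_test_data.txt",
    "wikitext103_test.txt",
]

def find_missing(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in `directory`."""
    missing = []
    for name in expected:
        full = os.path.join(directory, name)
        if not os.path.isfile(full) or os.path.getsize(full) == 0:
            missing.append(name)
    return missing

# Demo: create only the first three files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:3]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("placeholder\n")
    demo_missing = find_missing(tmp)

print(demo_missing)  # the four files that were not created in the demo
```

In the notebook itself, assuming the working directory is where the files were written, `find_missing(".")` should return an empty list once every cell has run.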