# Download and convert publicly available transcripts to OA format
- Friends dialogue (git repository)
- The Office dialogue (Kaggle dataset)
- Marvel Cinematic Universe dialogue (Kaggle dataset)
- Doctor Who dialogue (Kaggle dataset)
- Star Trek dialogue (git repository)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data/datasets/tv_dialogue/public.ipynb)

In [2]:
# uncomment and run below lines to set up if running in colab
# !git clone https://github.com/LAION-AI/Open-Assistant.git
# %cd Open-Assistant/data/datasets/tv_dialogue
# !pip install -r requirements.txt

In [16]:
# download data, you can get your kaggle.json file from your account page https://www.kaggle.com/me/account
import kaggle

In [4]:
# import required packages
import os
import io
import re
import requests
import json
import time
import warnings

try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
from tqdm import tqdm

import numpy as np
import pandas as pd

from typing import Tuple, Optional, Any

# Friends

In [3]:
# friends via https://github.com/emorynlp/character-mining
friends = pd.read_csv(
    "https://raw.githubusercontent.com/emorynlp/character-mining/master/tsv/friends_transcripts.tsv", sep="\t"
)
friends

Unnamed: 0,season_id,episode_id,scene_id,utterance_id,speaker,tokens,transcript
0,s01,e01,c01,u001,Monica Geller,"[['There', ""'s"", 'nothing', 'to', 'tell', '!']...",There's nothing to tell! He's just some guy I ...
1,s01,e01,c01,u002,Joey Tribbiani,"[[""C'mon"", ',', 'you', ""'re"", 'going', 'out', ...","C'mon, you're going out with the guy! There's ..."
2,s01,e01,c01,u003,Chandler Bing,"[['All', 'right', 'Joey', ',', 'be', 'nice', '...","All right Joey, be nice. So does he have a hum..."
3,s01,e01,c01,u004,Phoebe Buffay,"[['Wait', ',', 'does', 'he', 'eat', 'chalk', '...","Wait, does he eat chalk?"
4,s01,e01,c01,u005,unknown,[],
...,...,...,...,...,...,...,...
67368,s10,e18,c11,u017,Chandler Bing,"[['Oh', ',', 'it', ""'s"", 'gon', 'na', 'be', 'o...","Oh, it's gonna be okay."
67369,s10,e18,c11,u018,Rachel Green,"[['Do', 'you', 'guys', 'have', 'to', 'go', 'to...",Do you guys have to go to the new house right ...
67370,s10,e18,c11,u019,Monica Geller,"[['We', 'got', 'some', 'time', '.']]",We got some time.
67371,s10,e18,c11,u020,Rachel Green,"[['Okay', ',', 'should', 'we', 'get', 'some', ...","Okay, should we get some coffee?"


In [4]:
friends["group"] = friends[["season_id", "episode_id"]].apply(lambda x: f"{x[0]}_{x[1]}", axis=1)
friends

Unnamed: 0,season_id,episode_id,scene_id,utterance_id,speaker,tokens,transcript,group
0,s01,e01,c01,u001,Monica Geller,"[['There', ""'s"", 'nothing', 'to', 'tell', '!']...",There's nothing to tell! He's just some guy I ...,s01_e01
1,s01,e01,c01,u002,Joey Tribbiani,"[[""C'mon"", ',', 'you', ""'re"", 'going', 'out', ...","C'mon, you're going out with the guy! There's ...",s01_e01
2,s01,e01,c01,u003,Chandler Bing,"[['All', 'right', 'Joey', ',', 'be', 'nice', '...","All right Joey, be nice. So does he have a hum...",s01_e01
3,s01,e01,c01,u004,Phoebe Buffay,"[['Wait', ',', 'does', 'he', 'eat', 'chalk', '...","Wait, does he eat chalk?",s01_e01
4,s01,e01,c01,u005,unknown,[],,s01_e01
...,...,...,...,...,...,...,...,...
67368,s10,e18,c11,u017,Chandler Bing,"[['Oh', ',', 'it', ""'s"", 'gon', 'na', 'be', 'o...","Oh, it's gonna be okay.",s10_e18
67369,s10,e18,c11,u018,Rachel Green,"[['Do', 'you', 'guys', 'have', 'to', 'go', 'to...",Do you guys have to go to the new house right ...,s10_e18
67370,s10,e18,c11,u019,Monica Geller,"[['We', 'got', 'some', 'time', '.']]",We got some time.,s10_e18
67371,s10,e18,c11,u020,Rachel Green,"[['Okay', ',', 'should', 'we', 'get', 'some', ...","Okay, should we get some coffee?",s10_e18


In [6]:
episode_list = """
Season 1
The Pilot
The One With the Sonogram at the End
The One With the Thumb
The One With George Stephanopoulos
The One With the East German Laundry Detergent
The One With the Butt
The One With the Blackout
The One Where Nana Dies Twice
The One Where Underdog Gets Away
The One With the Monkey
The One With Mrs Bing
The One With the Dozen Lasagnas
The One With the Boobies
The One With the Candy Hearts
The One With the Stoned Guy
The One With Two Parts, Part 1
The One With Two Parts, Part 2
The One With All the Poker
The One Where the Monkey Gets Away
The One with the Evil Orthodontist
The One with Fake Monica
The One with the Ick Factor
The One with the Birth
The One where Rachel Finds Out

Season 2
The One With Ross' New Girlfriend
The One With the Breast Milk
The One Where Heckles Dies
The One With Phoebe's Husband
The One With Five Steaks and an Eggplant
The One With the Baby on the Bus
The One Where Ross Finds Out
The One With the List
The One With Phoebe's Dad
The One With Russ
The One With The Lesbian Wedding
The One After the Superbowl, Part 1
The One After the Superbowl, Part 2
The One With The Prom Video
The One Where Ross and RachelYou Know
The One Where Joey Moves Out
The One Where Eddie Moves In
The One Where Dr Ramoray Dies
The One Where Eddie Won't Go
The One Where Old Yeller Dies
The One With The Bullies
The One With Two Parties
The One With The Chicken Pox
The One With Barry & Mindy's Wedding

Season 3
The One With The Princess Leia Fantasy
The One Where No One's Ready
The One With The Jam
The One With The Metaphorical Tunnel
The One With Frank, Jr
The One With The Flashback
The One With The Racecar Bed
The One With The Giant Poking Device
The One With The Football
The One Where Rachel Quits
The One Where Chandler Can't Remember Which Sister
The One With All The Jealousy
The One Where Monica and Richard Are Just Friends
The One With Phoebe's Ex-Partner
The One Where Ross And Rachel Take A Break
The One The Morning After
The One Without The Ski Trip
The One With The Hypnosis Tape
The One With The Tiny T-Shirt
The One With The Dollhouse
The One With a Chick And a Duck
The One With The Screamer
The One With Ross's Thing
The One With The Ultimate Fighting Champion
The One At The Beach

Season 4
The One With The Jellyfish
The One With The Cat
The One With The 'Cuffs
The One With The Ballroom Dancing
The One With Joey's New Girlfriend
The One With The Dirty Girl
The One Where Chandler Crosses The Line
The One With Chandler In A Box
The One Where They're Going To PARTY!
The One With The Girl From Poughkeepsie
The One With Phoebe's Uterus
The One With The Embryos
The One With Rachel's Crush
The One With Joey's Dirty Day
The One With All The Rugby
The One With The Fake Party
The One With The Free Porn
The One With Rachel's New Dress
The One With All The Haste
The One With All The Wedding Dresses
The One With The Invitation
The One With The Worst Best Man Ever
The One With Ross's Wedding, Part 1
The One With Ross's Wedding, Part 2

Season 5
The One After Ross Says Rachel
The One With All The Kissing
The One With The Triplets
The One Where Phoebe Hates PBS
The One With The Kips
The One With The Yeti
The One Where Ross Moves In
The One With All The Thanksgivings
The One With Ross's Sandwich
The One With The Inappropriate Sister
The One With All The Resolutions
The One With Chandler's Work Laugh
The One With Joey's Bag
The One Where Everybody Finds Out
The One With The Girl Who Hits Joey
The One With The Cop
The One With Rachel's Inadvertent Kiss
The One Where Rachel Smokes
The One Where Ross Can't Flirt
The One With The Ride-Along
The One With The Ball
The One With Joey's Big Break
The One In Vegas, Part 1
The One In Vegas, Part 2

Season 6
The One After Vegas
The One Where Ross Hugs Rachel
The One With Ross's Denial
The One Where Joey Loses His Insurance
The One With Joey's Porsche
The One On The Last Night
The One Where Phoebe Runs
The One With Ross's Teeth
The One Where Ross Got High
The One With The Routine
The One With The Apothecary Table
The One With The Joke
The One With Rachel's Sister
The One Where Chandler Can't Cry
The One That Could Have Been, Part 1
The One That Could Have Been, Part 2
The One With Unagi
The One Where Ross Dates a Student
The One With Joey's Fridge
The One With Mac & C.H.E.E.S.E.
The One Where Ross Meets Elizabeth's Dad
The One Where Paul's The Man
The One With The Ring
The One With The Proposal, Part 1
The One With The Proposal, Part 2

Season 7
The One With Monica's Thunder
The One With Rachel's Book
The One With Phoebe's Cookies
The One With Rachel's Assistant
The One With The Engagement Picture
The One With The Nap Partners
The One With Ross's Library Book
The One Where Chandler Doesn't Like Dogs
The One With All The Candy
The One With the Holiday Armadillo
The One With All The Cheesecakes
The One Where They're Up All Night
The One Where Rosita Dies
The One Where They All Turn Thirty
The One With Joey's New Brain
The One With The Truth About London
The One With The Cheap Wedding Dress
The One With Joey's Award
The One With Ross and Monica's Cousin
The One With Rachel's Big Kiss
The One With The Vows
The One With Chandler's Dad
The One With Monica and Chandler's Wedding, Part 1
The One With Monica and Chandler's Wedding, Part 2

Season 8
The One After "I Do"
The One With The Red Sweater
The One Where Rachel Tells
The One With The Video Tape
The One With Rachel's Date
The One With The Halloween Party
The One With The Stain
The One With The Stripper
The One With The Rumor
The One With Monica's Boots
The One With Ross' Step Forward
The One Where Joey Dates Rachel
The One Where Chandler Takes a Bath
The One With The Secret Closet
The One With The Birthing Video
The One Where Joey Tells Rachel
The One With The Tea Leaves
The One In Massapequa
The One With Joey's Interview
The One With The Baby Shower
The One With The Cooking Class
The One Where Rachel is Late
The One Where Rachel Has a Baby, Part 1
The One Where Rachel Has a Baby, Part 2

Season 9
The One Where No One Proposes
The One Where Emma Cries
The One With the Pediatrician
The One With the Sharks
The One With Phoebe's Birthday Dinner
The One With the Male Nanny
The One With Ross's Inappropriate Song
The One With Rachel's Other Sister
The One With Rachel's Phone Number
The One With Christmas in Tulsa
The One Where Rachel Goes Back To Work
The One With Phoebe's Rats
The One Where Monica Sings
The One With The Blind Dates
The One With The Mugging
The One With The Boob Job
The One With The Memorial Service
The One With The Lottery
The One With Rachel's Dream
The One With The Soap Opera Party
The One With The Fertility Test
The One With The Donor
The One In Barbados, Part 1
The One In Barbados, Part 2

Season 10
The One After Joey and Rachel Kiss
The One Where Ross is Fine
The One With Ross's Tan
The One With the Cake
The One Where Rachel's Sister Baby-sits
The One With Ross's Grant
The One With the Home Study
The One With the Late Thanksgiving
The One With The Birth Mother
The One Where Chandler Gets Caught
The One Where The Stripper Cries
The One With Phoebe's Wedding
The One Where Joey Speaks French
The One With Princess Consuela
The One Where Estelle Dies
The One With Rachel's Going Away Party
The Last One, Part 1
The Last One, Part 2
"""

In [7]:
episodes, season, cnt = {}, "", 0
for line in episode_list.split("\n"):
    if not line:
        continue
    if line.startswith("Season "):
        season = f"s{line.split('Season ', 1)[1].zfill(2)}"
        cnt = 1
        if season not in episodes:
            episodes[season] = {}
    else:
        episodes[season][f"e{str(cnt).zfill(2)}"] = line.strip()
        cnt += 1
episodes

{'s01': {'e01': 'The Pilot',
  'e02': 'The One With the Sonogram at the End',
  'e03': 'The One With the Thumb',
  'e04': 'The One With George Stephanopoulos',
  'e05': 'The One With the East German Laundry Detergent',
  'e06': 'The One With the Butt',
  'e07': 'The One With the Blackout',
  'e08': 'The One Where Nana Dies Twice',
  'e09': 'The One Where Underdog Gets Away',
  'e10': 'The One With the Monkey',
  'e11': 'The One With Mrs Bing',
  'e12': 'The One With the Dozen Lasagnas',
  'e13': 'The One With the Boobies',
  'e14': 'The One With the Candy Hearts',
  'e15': 'The One With the Stoned Guy',
  'e16': 'The One With Two Parts, Part 1',
  'e17': 'The One With Two Parts, Part 2',
  'e18': 'The One With All the Poker',
  'e19': 'The One Where the Monkey Gets Away',
  'e20': 'The One with the Evil Orthodontist',
  'e21': 'The One with Fake Monica',
  'e22': 'The One with the Ick Factor',
  'e23': 'The One with the Birth',
  'e24': 'The One where Rachel Finds Out'},
 's02': {'e01'

In [8]:
friends["utterance"] = friends["utterance_id"].apply(lambda x: int(x[1:]))
friends["scene"] = friends["scene_id"].apply(lambda x: int(x[1:]))
friends

Unnamed: 0,season_id,episode_id,scene_id,utterance_id,speaker,tokens,transcript,group,utterance,scene
0,s01,e01,c01,u001,Monica Geller,"[['There', ""'s"", 'nothing', 'to', 'tell', '!']...",There's nothing to tell! He's just some guy I ...,s01_e01,1,1
1,s01,e01,c01,u002,Joey Tribbiani,"[[""C'mon"", ',', 'you', ""'re"", 'going', 'out', ...","C'mon, you're going out with the guy! There's ...",s01_e01,2,1
2,s01,e01,c01,u003,Chandler Bing,"[['All', 'right', 'Joey', ',', 'be', 'nice', '...","All right Joey, be nice. So does he have a hum...",s01_e01,3,1
3,s01,e01,c01,u004,Phoebe Buffay,"[['Wait', ',', 'does', 'he', 'eat', 'chalk', '...","Wait, does he eat chalk?",s01_e01,4,1
4,s01,e01,c01,u005,unknown,[],,s01_e01,5,1
...,...,...,...,...,...,...,...,...,...,...
67368,s10,e18,c11,u017,Chandler Bing,"[['Oh', ',', 'it', ""'s"", 'gon', 'na', 'be', 'o...","Oh, it's gonna be okay.",s10_e18,17,11
67369,s10,e18,c11,u018,Rachel Green,"[['Do', 'you', 'guys', 'have', 'to', 'go', 'to...",Do you guys have to go to the new house right ...,s10_e18,18,11
67370,s10,e18,c11,u019,Monica Geller,"[['We', 'got', 'some', 'time', '.']]",We got some time.,s10_e18,19,11
67371,s10,e18,c11,u020,Rachel Green,"[['Okay', ',', 'should', 'we', 'get', 'some', ...","Okay, should we get some coffee?",s10_e18,20,11


In [9]:
data = {"TEXT": [], "METADATA": [], "SOURCE": []}
for name, group in tqdm(friends.groupby("group")):
    metadata = {
        "show": "Friends",
        "season": group["season_id"].values[0],
        "episode": group["episode_id"].values[0],
        "title": episodes[group["season_id"].values[0]][group["episode_id"].values[0]],
    }
    text, last_scene = f"Friends - {metadata['title']}\r\n\r\n", None
    group.sort_values(by=["scene_id", "utterance"], ascending=True, inplace=True)
    for index, row in group.iterrows():
        if last_scene is None:
            last_scene = row["scene_id"]
        elif last_scene != row["scene_id"]:
            last_scene = row["scene_id"]
            text += "\r\n---------------------------------------\r\n\r\n"
        if row["speaker"] == "unknown" or row["tokens"] == "[]" or pd.isna(row["transcript"]):
            continue
        text += f"[{row['speaker'].strip()}] {row['transcript'].strip()}\r\n"
    data["TEXT"].append(text)
    data["METADATA"].append(json.dumps(metadata))
    data["SOURCE"].append("friends/emorynlp")
data = pd.DataFrame(data)
data

100%|████████████████████████████████████████████████████████████████████████████████| 236/236 [00:07<00:00, 32.37it/s]


Unnamed: 0,TEXT,METADATA,SOURCE
0,Friends - The Pilot\r\n\r\n[Monica Geller] The...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
1,Friends - The One With the Sonogram at the End...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
2,Friends - The One With the Thumb\r\n\r\n[Phoeb...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
3,Friends - The One With George Stephanopoulos\r...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
4,Friends - The One With the East German Laundry...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
...,...,...,...
231,Friends - The One With Princess Consuela\r\n\r...,"{""show"": ""Friends"", ""season"": ""s10"", ""episode""...",friends/emorynlp
232,Friends - The One Where Estelle Dies\r\n\r\n[C...,"{""show"": ""Friends"", ""season"": ""s10"", ""episode""...",friends/emorynlp
233,Friends - The One With Rachel's Going Away Par...,"{""show"": ""Friends"", ""season"": ""s10"", ""episode""...",friends/emorynlp
234,"Friends - The Last One, Part 1\r\n\r\n[Jennife...","{""show"": ""Friends"", ""season"": ""s10"", ""episode""...",friends/emorynlp


In [10]:
print(data["TEXT"].values[0])

Friends - The Pilot

[Monica Geller] There's nothing to tell! He's just some guy I work with!
[Joey Tribbiani] C'mon, you're going out with the guy! There's gotta be something wrong with him!
[Chandler Bing] All right Joey, be nice. So does he have a hump? A hump and a hairpiece?
[Phoebe Buffay] Wait, does he eat chalk?
[Phoebe Buffay] Just, 'cause, I don't want her to go through what I went through with Carl- oh!
[Monica Geller] Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.
[Chandler Bing] Sounds like a date to me.
[Chandler Bing] Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.
[#ALL#] Oh, yeah. Had that dream.
[Chandler Bing] Then I look down, and I realize there's a phone... there.
[Joey Tribbiani] Instead of...?
[Chandler Bing] That's right.
[Joey Tribbiani] Never had that dream.
[Phoebe Buffay] No.
[Chandler Bing] All of a sudden, the 

In [11]:
data.to_parquet("friends.pq", row_group_size=100, engine="pyarrow", index=False)
data.head()  # https://github.com/emorynlp/character-mining

Unnamed: 0,TEXT,METADATA,SOURCE
0,Friends - The Pilot\r\n\r\n[Monica Geller] The...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
1,Friends - The One With the Sonogram at the End...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
2,Friends - The One With the Thumb\r\n\r\n[Phoeb...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
3,Friends - The One With George Stephanopoulos\r...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp
4,Friends - The One With the East German Laundry...,"{""show"": ""Friends"", ""season"": ""s01"", ""episode""...",friends/emorynlp


In [12]:
len(data)

236

# The Office

In [19]:
# office via https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript
kaggle.api.dataset_download_files("nasirkhalid24/the-office-us-complete-dialoguetranscript", "office", unzip=True)

In [8]:
office = pd.read_csv("office/The-Office-Lines-V4.csv", sep=",")
office.drop(columns=["Unnamed: 6"], inplace=True)
office["group"] = office[["season", "episode"]].apply(lambda x: f"{x[0]}_{x[1]}", axis=1)
office

Unnamed: 0,season,episode,title,scene,speaker,line,group
0,1,1,Pilot,1,Michael,All right Jim. Your quarterlies look very good...,1_1
1,1,1,Pilot,1,Jim,"Oh, I told you. I couldn't close it. So...",1_1
2,1,1,Pilot,1,Michael,So you've come to the master for guidance? Is ...,1_1
3,1,1,Pilot,1,Jim,"Actually, you called me in here, but yeah.",1_1
4,1,1,Pilot,1,Michael,"All right. Well, let me show you how it's done.",1_1
...,...,...,...,...,...,...,...
54621,9,24,Finale,8153,Creed,It all seems so very arbitrary. I applied for ...,9_24
54622,9,24,Finale,8154,Meredith,I just feel lucky that I got a chance to share...,9_24
54623,9,24,Finale,8155,Phyllis,I'm happy that this was all filmed so I can re...,9_24
54624,9,24,Finale,8156,Jim,I sold paper at this company for 12 years. My ...,9_24


In [9]:
data = {"TEXT": [], "METADATA": [], "SOURCE": []}
for name, group in tqdm(office.groupby("group")):
    metadata = {
        "show": "The Office",
        "season": f"s{str(group['season'].values[0]).zfill(2)}",
        "episode": f"e{str(group['episode'].values[0]).zfill(2)}",
        "title": group["title"].values[0],
    }
    text, last_scene = f"The Office - {metadata['title']}\r\n\r\n", None
    for index, row in group.iterrows():
        if last_scene is None:
            last_scene = row["scene"]
        elif last_scene != row["scene"]:
            last_scene = row["scene"]
            text += "\r\n---------------------------------------\r\n\r\n"
        if pd.isna(row["speaker"]) or pd.isna(row["line"]):
            continue
        text += f"[{row['speaker'].strip()}] {row['line'].strip()}\r\n"
    data["TEXT"].append(text)
    data["METADATA"].append(json.dumps(metadata))
    data["SOURCE"].append("office/nasirkhalid24")
data = pd.DataFrame(data)
data

100%|████████████████████████████████████████████████████████████████████████████████| 186/186 [00:06<00:00, 29.75it/s]


Unnamed: 0,TEXT,METADATA,SOURCE
0,The Office - Pilot\r\n\r\n[Michael] All right ...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
1,The Office - Diversity Day\r\n\r\n[Michael] He...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
2,The Office - Health Care\r\n\r\n[Michael] Pam....,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
3,The Office - The Alliance\r\n\r\n[Dwight] Mich...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
4,"The Office - Basketball\r\n\r\n[Michael] Hey, ...","{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
...,...,...,...
181,The Office - Here Comes Treble\r\n\r\n[Dwight]...,"{""show"": ""The Office"", ""season"": ""s09"", ""episo...",office/nasirkhalid24
182,The Office - The Boat\r\n\r\n[Oscar] Can you g...,"{""show"": ""The Office"", ""season"": ""s09"", ""episo...",office/nasirkhalid24
183,"The Office - The Whale\r\n\r\n[Andy] Ah, what ...","{""show"": ""The Office"", ""season"": ""s09"", ""episo...",office/nasirkhalid24
184,The Office - The Target\r\n\r\n[Oscar] Yesterd...,"{""show"": ""The Office"", ""season"": ""s09"", ""episo...",office/nasirkhalid24


In [10]:
print(data["TEXT"].values[0])

The Office - Pilot

[Michael] All right Jim. Your quarterlies look very good. How are things at the library?
[Jim] Oh, I told you. I couldn't close it. So...
[Michael] So you've come to the master for guidance? Is this what you're saying, grasshopper?
[Jim] Actually, you called me in here, but yeah.
[Michael] All right. Well, let me show you how it's done.

---------------------------------------

[Michael] Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger.  All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake.  That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so...  So that's the way it's done.

---------------------------------------

[Michael] I've, uh, I've been at Dunder Mifflin for 12 years, the last four as Regional Ma

In [16]:
data.to_parquet("office.pq", row_group_size=100, engine="pyarrow", index=False)
data.head()  # https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript

Unnamed: 0,TEXT,METADATA,SOURCE
0,The Office - Pilot\r\n\r\n[Michael] All right ...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
1,The Office - Diversity Day\r\n\r\n[Michael] He...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
2,The Office - Health Care\r\n\r\n[Michael] Pam....,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
3,The Office - The Alliance\r\n\r\n[Dwight] Mich...,"{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24
4,"The Office - Basketball\r\n\r\n[Michael] Hey, ...","{""show"": ""The Office"", ""season"": ""s01"", ""episo...",office/nasirkhalid24


In [28]:
len(data)

186

# Marvel Cinematic Universe

In [21]:
# marvel via https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue
kaggle.api.dataset_download_files("pdunton/marvel-cinematic-universe-dialogue", "marvel", unzip=True)

In [23]:
names = {
    "iron-man-script-slug.txt": "Iron Man",
    "iron_man_2.txt": "Iron Man 2",
    "thor-script-slug.txt": "Thor",
    "captain_america.txt": "Captain America: The First Avenger",
    "avengers-script-slug.txt": "The Avengers",
    "iron_man_3.txt": "Iron Man 3",
    "thor_dark_world.txt": "Thor: The Dark World",
    "winter_soldier.txt": "Captain America: The Winter Soldier",
    "ant_man.txt": "Ant-Man",
    "age_of_ultron.txt": "Avengers: Age of Ultron",
    "civil_war.txt": "Captain America: Civil War",
    "thor-ragnarok-script-slug.txt": "Thor: Ragnarok",
    "guardians_2.txt": "Guardians of the Galaxy Vol. 2",
    "spider_man_homecoming.txt": "Spider-Man: Homecoming",
    "black-panther-script-slug.txt": "Black Panther",
    "infinity_war.txt": "Avengers: Infinity War",
    "captain_marvel.txt": "Captain Marvel",
    "avengers-endgame-script-slug.txt": "Avengers: Endgame",
}

for txt in os.listdir("marvel/script txts"):
    assert txt in names, txt

In [24]:
marvel = {"TEXT": [], "METADATA": [], "SOURCE": []}
for txt in tqdm(os.listdir("marvel/script txts")):
    with open(os.path.join("marvel/script txts", txt), "r", encoding="utf-8") as f:
        data = f.read()
    data = data.replace("[", "(").replace("]", ")")
    text = ""
    for line in data.splitlines():
        match = re.findall(r"^(.{2,}?)\:\s+(.+?)$", line)
        if match and match[0][0][0] not in (")", "("):
            text += f"[{match[0][0]}] {match[0][1]}\r\n"
        else:
            text += f"{line}\r\n"
    marvel["TEXT"].append(text)
    marvel["METADATA"].append(
        json.dumps(
            {
                "title": names[txt],
            }
        )
    )
    marvel["SOURCE"].append("marvel/pdunton")

marvel = pd.DataFrame(marvel)
marvel

100%|██████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 42.64it/s]


Unnamed: 0,TEXT,METADATA,SOURCE
0,[Announcer] (first lines; announcement over sp...,"{""title"": ""Avengers: Age of Ultron""}",marvel/pdunton
1,(1989 – Hank Pym enters a SHIELD facility and ...,"{""title"": ""Ant-Man""}",marvel/pdunton
2,F O R Y O U R C O N S I D E R AT I O N\r\n\r...,"{""title"": ""Avengers: Endgame""}",marvel/pdunton
3,Marvel’s THE AVENGERS\r\n\r\nWritten By\r\n\r\...,"{""title"": ""The Avengers""}",marvel/pdunton
4,BLACK PANTHER \r\n\r\nAdapted \r\nScreenplay \...,"{""title"": ""Black Panther""}",marvel/pdunton
5,(first lines; in the Arctic)\r\n[Search Team L...,"{""title"": ""Captain America: The First Avenger""}",marvel/pdunton
6,(Marvel Studios Opening Sequence begins but in...,"{""title"": ""Captain Marvel""}",marvel/pdunton
7,"(1991, a HYDRA base in a snowy landscape. A ma...","{""title"": ""Captain America: Civil War""}",marvel/pdunton
8,GUARDIANS OF THE GALAXY VOL. 2\r\n\r\nWritten ...,"{""title"": ""Guardians of the Galaxy Vol. 2""}",marvel/pdunton
9,(Marvel Opening Credits)\r\n\r\n(Radio transmi...,"{""title"": ""Avengers: Infinity War""}",marvel/pdunton


In [15]:
print(marvel["TEXT"].values[0])

[Announcer] (first lines; announcement over speaker) Report to your stations immediately. This is not a drill. We are under attack. We are under attack.
(the Avengers are seen attacking an unknown base, and Iron Man bounces off of the base's force field)

[Tony Stark] Shit!
[Steve Rogers] Language! JARVIS, what's the view from upstairs?
[JARVIS] The central building is protected by some kind of energy shield. Strucker's technology is well beyond any other Hydra base we've taken.
[Thor] Loki's scepter must be here. Strucker couldn't mount this defense without it. At long last.
(Natasha knocks out some soldiers)
[Natasha Romanoff] At long last is lasting a little long, boys.
(As some soldiers shoot at him)
[Clint Barton] Yeah. I think we lost the element of surprise.
[Tony Stark] Wait a second. No one else is going to deal with the fact that Cap just said "language?"
[Steve Rogers] I know.
(Steve throws his bike at some soldiers driving up in their truck)
[Steve Rogers] It 

In [47]:
marvel.to_parquet("marvel.pq", row_group_size=100, engine="pyarrow", index=False)
marvel.head()  # https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue

Unnamed: 0,TEXT,METADATA,SOURCE
0,[Announcer] (first lines; announcement over sp...,"{""title"": ""Avengers: Age of Ultron""}",marvel/pdunton
1,(1989 – Hank Pym enters a SHIELD facility and ...,"{""title"": ""Ant-Man""}",marvel/pdunton
2,F O R Y O U R C O N S I D E R AT I O N\r\n\r...,"{""title"": ""Avengers: Endgame""}",marvel/pdunton
3,Marvel’s THE AVENGERS\r\n\r\nWritten By\r\n\r\...,"{""title"": ""The Avengers""}",marvel/pdunton
4,BLACK PANTHER \r\n\r\nAdapted \r\nScreenplay \...,"{""title"": ""Black Panther""}",marvel/pdunton


In [48]:
len(marvel)

18

# Doctor Who

In [25]:
# doctor who via # https://www.kaggle.com/datasets/jeanmidev/doctor-who?select=all-scripts.csv
kaggle.api.dataset_download_files("jeanmidev/doctor-who", "drwho", unzip=True)

In [27]:
episodes, diffusion = {}, {}
for index, row in pd.read_csv("drwho/all-detailsepisodes.csv").iterrows():
    assert row["episodeid"] not in episodes, row["episodeid"]
    episodes[row["episodeid"]] = row["title"]
    diffusion[row["episodeid"]] = row["first_diffusion"]
doctors = {
    1: "First Doctor",
    2: "Second Doctor",
    3: "Third Doctor",
    4: "Fourth Doctor",
    5: "Fifth Doctor",
    6: "Sixth Doctor",
    7: "Seventh Doctor",
    8: "Eighth Doctor",
    9: "Ninth Doctor",
    10: "Tenth Doctor",
    11: "Eleventh Doctor",
    12: "Twelfth Doctor",
    13: "Thirteenth Doctor",
    14: "Fourteenth Doctor",
    15: "Fifteenth Doctor",
}

In [28]:
who = pd.read_csv("drwho/all-scripts.csv")
who["episode"] = who["episodeid"].apply(lambda x: int(x.split("-")[1]) if len(x.split("-")) > 1 else -1)
who["season"] = who["episodeid"].apply(lambda x: int(x.split("-")[0]) if len(x.split("-")) > 1 else -1)
who.head(20)

Unnamed: 0,idx,text,type,details,episodeid,doctorid,episode,season
0,0,Sylvest home,location,,21-7,6,7,21
1,1,Twin boys are playing a cross between chess an...,context,,21-7,6,7,21
2,2,Where's mother?,talk,REMUS,21-7,6,7,21
3,3,She's busy.,talk,SYLVEST,21-7,6,7,21
4,4,Does that mean she isn't talking to us?,talk,ROMULUS,21-7,6,7,21
5,5,"No, she's just busy.",talk,SYLVEST,21-7,6,7,21
6,6,We would like to see her.,talk,BOTH,21-7,6,7,21
7,7,She isn't here.,talk,SYLVEST,21-7,6,7,21
8,8,She's gone out without saying goodbye?,talk,REMUS,21-7,6,7,21
9,9,"Well, yes.",talk,SYLVEST,21-7,6,7,21


In [29]:
doctor = {"TEXT": [], "METADATA": [], "SOURCE": []}
for name, group in tqdm(who.groupby("episodeid")):
    metadata = {
        "show": "Doctor Who",
        "season": f"s{str(group['season'].values[0]).zfill(2)}" if group["season"].values[0] > 0 else "",
        "episode": f"e{str(group['episode'].values[0]).zfill(2)}" if group["episode"].values[0] > 0 else "",
        "title": episodes[group["episodeid"].values[0]],
    }
    text, talk = (
        f"Doctor Who ({diffusion[group['episodeid'].values[0]]}; {doctors[group['doctorid'].values[0]]}) - {metadata['title']}\r\n\r\n",
        False,
    )
    for index, row in group.iterrows():
        if row["type"] == "location":
            if talk:
                text += "\r\n---------------------------------------\r\n\r\n"
                talk = False
            text += f"({str(row['text']).strip()})\r\n"
        elif row["type"] == ("context", "unknown"):
            if talk:
                text += "\r\n"
                talk = False
            text += f"{str(row['text']).strip()}\r\n\r\n"
        elif pd.notna(row["details"]):
            text += f"[{row['details']}] {str(row['text']).strip()}\r\n"
            talk = True
    doctor["TEXT"].append(text)
    doctor["METADATA"].append(json.dumps(metadata))
    doctor["SOURCE"].append("drwho/jeanmidev")
doctor = pd.DataFrame(doctor)
doctor

100%|████████████████████████████████████████████████████████████████████████████████| 306/306 [00:41<00:00,  7.32it/s]


Unnamed: 0,TEXT,METADATA,SOURCE
0,"Doctor Who (23 Nov, 1963; First Doctor) - An U...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
1,"Doctor Who (21 Dec, 1963; First Doctor) - The ...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
2,"Doctor Who (8 Feb, 1964; First Doctor) - The E...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
3,"Doctor Who (22 Feb, 1964; First Doctor) - Marc...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
4,"Doctor Who (11 Apr, 1964; First Doctor) - The ...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
...,...,...,...
301,"Doctor Who (17 Nov, 2007; Tenth Doctor) - Time...","{""show"": ""Doctor Who"", ""season"": """", ""episode""...",drwho/jeanmidev
302,"Doctor Who (16 Nov, 2012; Eleventh Doctor) - T...","{""show"": ""Doctor Who"", ""season"": """", ""episode""...",drwho/jeanmidev
303,"Doctor Who (21 Nov, 2009; Tenth Doctor) - Drea...","{""show"": ""Doctor Who"", ""season"": """", ""episode""...",drwho/jeanmidev
304,"Doctor Who (12 May, 1996 (Canada); Eighth Doct...","{""show"": ""Doctor Who"", ""season"": """", ""episode""...",drwho/jeanmidev


In [30]:
print(doctor["TEXT"].values[0])

Doctor Who (23 Nov, 1963; First Doctor) - An Unearthly Child

(Coal Hill School corridor)
[GIRL] Night, Miss Wright.
[BARBARA] Wait in here, please, Susan. I won't be long.
[BOY] Goodnight, Miss Wright.

---------------------------------------

(Laboratory)
[IAN] Oh? Not gone yet?
[BARBARA] Obviously not.
[IAN] Right, ask a silly question.
[BARBARA] I'm sorry.
[IAN] That's all right. I'll forgive you this time.
[BARBARA] Oh, I had a terrible day. I don't know what to make of it.
[IAN] Oh, what's the trouble? Can I help?
[BARBARA] Oh, it's one of the girls, Susan Foreman.
[IAN] Susan Foreman? She your problem too?
[BARBARA] Yes.
[IAN] You don't know what to make of her?
[BARBARA] No.
[IAN] How old is she, Barbara?
[BARBARA] Fifteen.
[IAN] Fifteen. She lets her knowledge out a bit at a time so as not toembarrass me. That's what I feel about her. She knows more science thanI'll ever know. She's a genius. Is that what she's doing with history?
[BARBARA] Something l

In [29]:
doctor.to_parquet("drwho.pq", row_group_size=100, engine="pyarrow", index=False)
doctor.head()  # https://www.kaggle.com/datasets/jeanmidev/doctor-who?select=all-scripts.csv

Unnamed: 0,TEXT,METADATA,SOURCE
0,"Doctor Who (23 Nov, 1963; First Doctor) - An U...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
1,"Doctor Who (21 Dec, 1963; First Doctor) - The ...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
2,"Doctor Who (8 Feb, 1964; First Doctor) - The E...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
3,"Doctor Who (22 Feb, 1964; First Doctor) - Marc...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev
4,"Doctor Who (11 Apr, 1964; First Doctor) - The ...","{""show"": ""Doctor Who"", ""season"": ""s01"", ""episo...",drwho/jeanmidev


In [30]:
len(doctor)

306

# Star Trek

In [5]:
# Star Trek via http://www.chakoteya.net/StarTrek/index.html and also https://github.com/GJBroughton/Star_Trek_Scripts/
r = requests.get("https://github.com/GJBroughton/Star_Trek_Scripts/raw/master/data/all_scripts_raw.json")
trek = r.json()
trek.keys()

dict_keys(['DS9', 'TOS', 'TAS', 'TNG', 'VOY', 'ENT'])

In [41]:
picard = {"TEXT": [], "METADATA": [], "SOURCE": []}
for series in tqdm(trek.keys()):
    for episode in trek[series].keys():
        script = trek[series][episode].replace("[", "(").replace("]", ")").strip()
        try:
            title = " ".join(re.findall(r"^.+?\-\s*(.+?)\r?\n", script)[0].splitlines()).strip()
        except IndexError:
            title = " ".join(script.split("Stardate:")[0].splitlines()).strip()
        metadata = {
            "show": "Star Trek",
            "season": series,
            "episode": f"e{episode.split()[1].zfill(2)}",
            "title": title,
        }
        text = ""
        for i, line in enumerate(script.splitlines()):
            if i == 0:
                text += re.sub(r"(?i)((?:trans?)scripts?)\s*", "", line.strip()) + "\r\n"
                continue
            if line == "<Back":
                break
            match = re.findall(r"(?i)\s*([\w\d\s\.]+(?:\([\w\d\s\.]+\))?)\s*\:\s*(.+?)$", line)
            if match:
                speaker, voice = match[0]
                if speaker not in ("Stardate", "Original Airdate"):
                    text += f"[{speaker}] {voice}\r\n"
                else:
                    text += f"{line.strip()}\r\n"
            else:
                text += f"{line.strip()}\r\n"

        text = text.strip().replace("&amp;", "&")
        text = "\r\n".join(text.splitlines())
        text = re.sub(r"(\r*\n)", "\n", text)
        text = re.sub(r"\n{2,}", "\n\n", text).strip()

        picard["TEXT"].append(text)
        picard["METADATA"].append(json.dumps(metadata))
        picard["SOURCE"].append("startrek/chakoteya")

picard = pd.DataFrame(picard)
picard

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:22<00:00,  3.68s/it]


Unnamed: 0,TEXT,METADATA,SOURCE
0,The Deep Space Nine - Emissary\n\nEmissary\nSt...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
1,The Deep Space Nine - Past Prologue\n\nPast\nP...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
2,The Deep Space Nine - A Man Alone\n\nA\nMan Al...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
3,The Deep Space Nine - Babel\n\nBabel\nStardate...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
4,The Deep Space Nine - Captive Pursuit\n\nCapti...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
...,...,...,...
703,"The Enterprise - In A Mirror, Darkly - part 1\...","{""show"": ""Star Trek"", ""season"": ""ENT"", ""episod...",startrek/chakoteya
704,"The Enterprise - In A Mirror, Darkly - part 2\...","{""show"": ""Star Trek"", ""season"": ""ENT"", ""episod...",startrek/chakoteya
705,The Enterprise - Demons\n\nDemons\n[Mission Da...,"{""show"": ""Star Trek"", ""season"": ""ENT"", ""episod...",startrek/chakoteya
706,The Enterprise - Terra Prime\n\nTerra\nPrime\n...,"{""show"": ""Star Trek"", ""season"": ""ENT"", ""episod...",startrek/chakoteya


In [42]:
print(picard["TEXT"].values[400])

The Next Generation - Time's Arrow part two

Time's
Arrow, part 2
Stardate:
46001.3
Original Airdate: 21 Sep, 1992

Last
[the Next Generation. LAFORGE] They found Data's head a mile beneath San Francisco. Been down
there about five centuries.
[DATA] At some future date I will be transported back to nineteenth
century Earth, where I will die. It has occurred. It will occur.
[GUINAN] Do I know you, Mister? 
[DATA] Data. Yes. We were on a ship together. The Enterprise. 
[GUINAN] Is that a clipper ship? 
[DATA] It is a starship. 
[CLEMENS] Starship? 
[RIKER] My God. They're delivering more of them for the others to
ingest.
[GUINAN] Did my father send you here? Because if he did, you must go
back and tell him I'm not done listening to
[DATA] I was not sent by your father. Our ship encountered a species who
appears to be threatening nineteenth century Earth.
[RIKER] I'm not willing to accept that he's dead and just leave it at
that.
[PICARD] We cannot make Mister Data our priority. 
[RIKER] 

In [43]:
picard.to_parquet("picard.pq", row_group_size=100, engine="pyarrow", index=False)
picard.head()  # http://www.chakoteya.net/StarTrek/index.html and also https://github.com/GJBroughton/Star_Trek_Scripts/

Unnamed: 0,TEXT,METADATA,SOURCE
0,The Deep Space Nine - Emissary\n\nEmissary\nSt...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
1,The Deep Space Nine - Past Prologue\n\nPast\nP...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
2,The Deep Space Nine - A Man Alone\n\nA\nMan Al...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
3,The Deep Space Nine - Babel\n\nBabel\nStardate...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya
4,The Deep Space Nine - Captive Pursuit\n\nCapti...,"{""show"": ""Star Trek"", ""season"": ""DS9"", ""episod...",startrek/chakoteya


In [44]:
len(picard)

708