# subtitle_dataframes
We can populate two dataframes containing various information on subtitles. One will focus on the actual subtitles and features we can extract from them. The other will focus on actual sentences, for NLP analysis. This second dataframe is necessary because certain pieces of dialogue, like very long sentences, may span multiple subtitle objects.

We start with an example from *Before Sunrise* (1995), a romance consisting entirely of naturalistic dialogue between two characters falling in love.

In [1]:
import pysrt
import spacy
import pandas as pd
import datetime
from subtitle_cleaning_io import *
from subtitle_dataframes_io import *
from subtitle_auxiliary_io import *
from phrases_io import *

pd.set_option('display.max_colwidth', None)
nlp = spacy.load('en')

In [2]:
subs = pysrt.open('../subtitles/before_sunrise.srt')
subs.insert(0, subs[0])

# Raw DataFrame
As an example before we get to the main DataFrames, we'll create a raw DataFrame with basic subtitle information.

For each subtitle object, we extract the index number, the text, and the start and end times. Sometimes subtitle text has two separate lines to indicate two characters speaking; we can detect this and break each line out into its own column. We also detect subtitle text that spans both lines but is part of the same sentence. In that case the two lines are separated by a newline character, which we remove.
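
As a rough illustration of the two cases described above, here is a minimal sketch of a line splitter. The name `split_subtitle_lines` is hypothetical, and the leading-dash rule and the `0` sentinel are assumptions inferred from the sample data and the `!= 0` check in the loop; the real `concat_sep_lines` in `subtitle_cleaning_io` may work differently.

```python
def split_subtitle_lines(text):
    """Hypothetical stand-in for concat_sep_lines: returns (top, bottom).

    bottom is 0 (assumed sentinel) when the subtitle holds one speaker's text.
    """
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # Assumption: two lines that both start with a dash mean two speakers.
    if len(lines) == 2 and all(line.startswith("-") for line in lines):
        top, bottom = (line.lstrip("-").strip() for line in lines)
        return top, bottom
    # Otherwise the newline is only a display break, so join into one string.
    return " ".join(lines), 0


print(split_subtitle_lines("- JESSE: That's a river, right?\n- (chuckling) Yeah."))
print(split_subtitle_lines("This is gorgeous."))
```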

In [3]:
indices = []
texts = []
start_times = []
end_times = []
top_lines = []
bottom_lines = []

for sub in subs[1:]:
    indices.append(sub.index)
    texts.append(sub.text)
    start_times.append(sub.start.to_time())
    end_times.append(sub.end.to_time())
    top_line, bottom_line = concat_sep_lines(sub.text)
    if bottom_line != 0:
        top_lines.append(top_line)
        bottom_lines.append(bottom_line)
    else:
        top_lines.append(top_line)
        bottom_lines.append('')

We zip all the lists together to get the raw DataFrame.

In [4]:
raw_df = pd.DataFrame(list(zip(indices, texts, start_times, end_times, top_lines, bottom_lines)), columns = ['index', 'text', 'start_time', 'end_time', 'top_line', 'bottom_line']) 
raw_df = raw_df.set_index('index')

In [5]:
raw_df[452:456]

index,text,start_time,end_time,top_line,bottom_line
453,CELINE: That's the Danube over there.,00:31:01.319000,00:31:03.923000,CELINE: That's the Danube over there.,
454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,"JESSE: That's a river, right?",(chuckling) Yeah.
455,This is gorgeous.,00:31:14.832000,00:31:18.294000,This is gorgeous.,
456,"Yeah, this is beautiful.",00:31:19.045000,00:31:21.214000,"Yeah, this is beautiful.",


Subtitle 454 is an example of a single subtitle that contains dialogue from two characters. We'll want to treat these separately.

# Subtitle DataFrame
Next, we can create a DataFrame which has one row per subtitle line of dialogue. Subtitle 454 from the example above will be broken into two rows in this DataFrame, one per speaker.

In [6]:
indices = []
original_texts = []
start_times = []
end_times = []
concat_sep_texts = []
separated_flag = []

for sub in subs:
    indices.append(sub.index)
    original_texts.append(sub.text)
    start_times.append(sub.start.to_time())
    end_times.append(sub.end.to_time())
    top_line, bottom_line = concat_sep_lines(sub.text)
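    if bottom_line != 0:
        # Two speakers: the row started above gets the top line...
        concat_sep_texts.append(top_line)
        separated_flag.append(1)

        # ...and a second row repeats the subtitle's metadata with the bottom line.
        indices.append(sub.index)
        original_texts.append(sub.text)
        start_times.append(sub.start.to_time())
        end_times.append(sub.end.to_time())
        concat_sep_texts.append(bottom_line)
        separated_flag.append(1)
    else:
        # Single speaker: one row, not separated.
        concat_sep_texts.append(top_line)
        separated_flag.append(0)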

In [7]:
subtitle_df = pd.DataFrame(list(zip(indices, original_texts, start_times, end_times, concat_sep_texts, separated_flag)), columns = ['subtitle_index', 'original_text', 'start_time', 'end_time', 'concat_sep_text', 'separated_flag']) 

In [8]:
subtitle_df[540:545]

Unnamed: 0,subtitle_index,original_text,start_time,end_time,concat_sep_text,separated_flag
540,453,CELINE: That's the Danube over there.,00:31:01.319000,00:31:03.923000,CELINE: That's the Danube over there.,0
541,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,"JESSE: That's a river, right?",1
542,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,(chuckling) Yeah.,1
543,455,This is gorgeous.,00:31:14.832000,00:31:18.294000,This is gorgeous.,0
544,456,"Yeah, this is beautiful.",00:31:19.045000,00:31:21.214000,"Yeah, this is beautiful.",0


Since we're breaking out two-character subtitles into separate rows, the DataFrame index no longer matches the original subtitle index, so we keep the original value in the `subtitle_index` column.

## Feature Extraction
Next, we can go over each piece of text and extract features. We can look for laughter, music, and parentheticals (which indicate actions or non-dialogue sounds). In certain cases, we can even identify the speaker, which will help with dialogue attribution and character tracking.

In [9]:
subtitle_df['laugh'] = subtitle_df['concat_sep_text'].map(find_laugh)
subtitle_df['speaker'] = subtitle_df['concat_sep_text'].map(find_speaker)
subtitle_df['music'] = subtitle_df['concat_sep_text'].map(find_music)
subtitle_df['parenthetical'] = subtitle_df['concat_sep_text'].map(find_parenthetical)
subtitle_df['el_parenthetical'] = subtitle_df['concat_sep_text'].map(find_el_parenthetical) # entire-line paren
subtitle_df['el_italic'] = subtitle_df['concat_sep_text'].map(find_el_italic) # entire-line italic

In [10]:
subtitle_df[540:545]

Unnamed: 0,subtitle_index,original_text,start_time,end_time,concat_sep_text,separated_flag,laugh,speaker,music,parenthetical,el_parenthetical,el_italic
540,453,CELINE: That's the Danube over there.,00:31:01.319000,00:31:03.923000,CELINE: That's the Danube over there.,0,0,CELINE,0,,0,0
541,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,"JESSE: That's a river, right?",1,0,JESSE,0,,0,0
542,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,(chuckling) Yeah.,1,1,,0,chuckling,0,0
543,455,This is gorgeous.,00:31:14.832000,00:31:18.294000,This is gorgeous.,0,0,,0,,0,0
544,456,"Yeah, this is beautiful.",00:31:19.045000,00:31:21.214000,"Yeah, this is beautiful.",0,0,,0,,0,0
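
To make the feature columns above more concrete, here is a minimal regex-based sketch of three of the detectors. The `sketch_*` names and the patterns are assumptions written to reproduce the sample rows shown; they are not the notebook's own `find_speaker`, `find_parenthetical`, or `find_laugh`, which live in the imported modules and may use different rules.

```python
import re


def sketch_find_speaker(text):
    """Return a leading all-caps speaker label such as 'CELINE', or None."""
    match = re.match(r"^([A-Z][A-Z .'-]*):\s", text)
    return match.group(1) if match else None


def sketch_find_parenthetical(text):
    """Return the contents of the first (...) or [...] group, or None."""
    match = re.search(r"[(\[]([^)\]]+)[)\]]", text)
    return match.group(1) if match else None


def sketch_find_laugh(text):
    """Return 1 if a parenthetical mentions laughter, else 0."""
    paren = sketch_find_parenthetical(text)
    if paren is None:
        return 0
    # Assumed keyword stems; a real list would likely be longer.
    return int(bool(re.search(r"laugh|chuckl|giggl", paren, re.IGNORECASE)))


for line in ["CELINE: That's the Danube over there.", "(chuckling) Yeah."]:
    print(sketch_find_speaker(line), sketch_find_parenthetical(line), sketch_find_laugh(line))
```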


## Dialogue Cleaning
We have a few functions that clean our dialogue to prepare it for the next DataFrame, which is focused on NLP analysis. We only want dialogue: no speaker names, no actions or descriptions, no music lyrics. We also remove mid-sentence interjections such as "um", "uh", or "you know", which make sentences more complex. The `clean_line` function applies these steps, so we use it to populate a new column in the DataFrame.

In [11]:
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)

In [12]:
subtitle_df[540:545]

Unnamed: 0,subtitle_index,original_text,start_time,end_time,concat_sep_text,separated_flag,laugh,speaker,music,parenthetical,el_parenthetical,el_italic,cleaned_text
540,453,CELINE: That's the Danube over there.,00:31:01.319000,00:31:03.923000,CELINE: That's the Danube over there.,0,0,CELINE,0,,0,0,That's the Danube over there.
541,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,"JESSE: That's a river, right?",1,0,JESSE,0,,0,0,"That's a river, right?"
542,454,"- JESSE: That's a river, right?\n- (chuckling) Yeah.",00:31:03.947000,00:31:07.158000,(chuckling) Yeah.,1,1,,0,chuckling,0,0,Yeah.
543,455,This is gorgeous.,00:31:14.832000,00:31:18.294000,This is gorgeous.,0,0,,0,,0,0,This is gorgeous.
544,456,"Yeah, this is beautiful.",00:31:19.045000,00:31:21.214000,"Yeah, this is beautiful.",0,0,,0,,0,0,"Yeah, this is beautiful."
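
The cleaning steps can be approximated with a few regular expressions. The sketch below is an assumption-laden stand-in, not the notebook's `clean_line`: the name `sketch_clean_line`, the patterns, and the interjection list are all hypothetical, and it does not attempt to detect music lyrics.

```python
import re


def sketch_clean_line(text):
    """Strip non-dialogue elements from one subtitle line (rough approximation)."""
    # Drop parenthesized or bracketed descriptions and speaker tags.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", text)
    # Drop a leading all-caps speaker label such as "JESSE: ".
    text = re.sub(r"^\s*[A-Z][A-Z .'-]*:\s+", "", text)
    # Drop interjections set off by commas in the middle of a sentence.
    text = re.sub(r",\s*(?:um|uh|you know)\s*,", ",", text, flags=re.IGNORECASE)
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(sketch_clean_line("CELINE: That's the Danube over there."))
print(sketch_clean_line("(chuckling) Yeah."))
```

Only interjections enclosed by commas are removed here, so a sentence-initial "Uh," is left in place.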


# Sentence DataFrame
Here, we populate the dialogue-only, one-row-per-sentence DataFrame. We've cleaned the text so that only dialogue remains, and we'll use `spaCy` for sentence boundary detection. We'll populate one row per individual sentence — useful because sometimes long pieces of dialogue will span multiple subtitles.

In [13]:
nlp = spacy.load('en')

In [14]:
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)
sentence_df = pd.DataFrame(sentences, columns=['sentence'])

The DataFrame has been populated; below is an example of a long sentence that was combined from multiple subtitles.

In [15]:
sentence_df[575:580]

Unnamed: 0,sentence
575,Then my father went on to become this successful architect and we began to travel all around the world while he built bridges and towers and stuff.
576,"I mean, I really can't complain about anything."
577,"You know, they love me more than anything in the world."
578,I've been raised with all the freedom they had fought for.
579,"And yet, for me now, it's another type of fight."
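
To show why sentences have to be rebuilt across subtitle boundaries, here is a small standard-library sketch. It swaps spaCy's sentence boundary detection for a plain punctuation split, so it is only a rough stand-in for `partition_sentences`; the function name and the sample fragments are made up for illustration.

```python
import re


def sketch_partition_sentences(fragments):
    """Join cleaned subtitle fragments, then split on sentence-final punctuation."""
    text = " ".join(fragment.strip() for fragment in fragments if fragment.strip())
    # Split after ., ! or ? when followed by whitespace (much cruder than spaCy).
    return [sentence for sentence in re.split(r"(?<=[.!?])\s+", text) if sentence]


# Illustrative fragments: the first sentence spans two subtitles.
fragments = [
    "This sentence starts in one subtitle",
    "and ends in the next one.",
    "",
    "This one is short.",
]
for sentence in sketch_partition_sentences(fragments):
    print(sentence)
```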


Let's continue exploring the `sentence_df` dataframe with *Plus One* (2019). We have a few functions to create these dataframes.

In [16]:
subs = pysrt.open('../subtitles/plus_one.srt')
subs.insert(0, subs[0])
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)
sentence_df = pd.DataFrame(sentences, columns=['sentence'])
sentence_df[60:62]

Unnamed: 0,sentence
60,I'm happy for Matt.
61,Clearly.


### Introductions
As a romantic comedy featuring tons of weddings, *Plus One* has plenty of instances where characters introduce themselves.

We've previously defined some functions to detect when characters introduce themselves ("My name is Mike.") or introduce others ("This is Mike."). These are NLP-based, so they're applied here, in the sentence DataFrame.

In [17]:
sentence_df['self_intro'] = sentence_df['sentence'].apply(self_intro, args=[nlp])
sentence_df['other_intro'] = sentence_df['sentence'].apply(other_intro, args=[nlp])

In [18]:
sentence_df[sentence_df.self_intro.notnull()][6:10]

Unnamed: 0,sentence,self_intro,other_intro
2324,"Uh, I'm Ben.",Ben,
2327,I'm Alice.,Alice,
2331,I'm Jackie.,Jackie,
2961,"I'm Ben, and most of you here probably know me as Chuck's son.",Ben,


In [19]:
sentence_df[sentence_df.other_intro.notnull()][0:3]

Unnamed: 0,sentence,self_intro,other_intro
750,This is Maggie.,,Maggie
828,"Ben, this is Maggie.",,Maggie
1223,This is Alice.,,Alice
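
For intuition, the two kinds of introduction can be approximated with patterns that look for a capitalized word after an introductory phrase. This is a hypothetical regex stand-in, not the notebook's `self_intro` and `other_intro`, which take the `nlp` pipeline as an argument. A pattern like this would wrongly treat "I'm American." as an introduction, which is one reason to prefer an NLP-based approach.

```python
import re

# Assumed trigger phrases; the capitalized word that follows is taken as the name.
SELF_INTRO = re.compile(r"\b(?:I'm|I am|[Mm]y name is)\s+([A-Z][a-z]+)")
OTHER_INTRO = re.compile(r"\b[Tt]his is\s+([A-Z][a-z]+)")


def sketch_self_intro(sentence):
    """Return the name in 'I'm <Name>' style sentences, or None."""
    match = SELF_INTRO.search(sentence)
    return match.group(1) if match else None


def sketch_other_intro(sentence):
    """Return the name in 'This is <Name>' style sentences, or None."""
    match = OTHER_INTRO.search(sentence)
    return match.group(1) if match else None


for sentence in ["Uh, I'm Ben.", "Ben, this is Maggie.", "I'm happy for Matt."]:
    print(sentence, sketch_self_intro(sentence), sketch_other_intro(sentence))
```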


### Direct Address
We also have a function to identify when someone is being addressed directly by name, as in "Hello, Mike." or "Mike, are you coming?". This is a hint that Mike may be speaking the next line, and it makes it almost certain that Mike is present in the scene.

In [20]:
sentence_df['direct_address'] = sentence_df['sentence'].apply(direct_address, args=[nlp])

In [21]:
sentence_df[sentence_df.direct_address.notnull()][135:139]

Unnamed: 0,sentence,self_intro,other_intro,direct_address
2987,"Gina, Dad, congratulations.",,,Dad
3008,"You crawled, Benjamin.",,,Benjamin
3014,"Alice, are you actually gonna help me with this speech?",,,Alice
3016,"Okay, Ben.",,,Ben
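
One simple way to approximate direct address is to look for a known name set off by a comma at the start or end of a sentence. The sketch below assumes a list of character names is already available and uses a hypothetical `sketch_direct_address` helper; the notebook's `direct_address` works from the `nlp` pipeline instead and may behave differently, for example on sentences that address several people.

```python
import re


def sketch_direct_address(sentence, known_names):
    """Return the first known name that is set off by commas, or None."""
    for name in known_names:
        # Name at the start or after a comma, followed by a comma or final punctuation.
        pattern = rf"(?:^|,\s*){re.escape(name)}\s*(?:,|[.!?]+$)"
        if re.search(pattern, sentence):
            return name
    return None


names = ["Ben", "Benjamin", "Alice"]
print(sketch_direct_address("Alice, are you actually gonna help me with this speech?", names))
print(sketch_direct_address("Okay, Ben.", names))
```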


# Additional Speaker Identification
Subtitle files have various formats for labeling offscreen speakers. Our original `subtitle_df` DataFrame identifies offscreen speakers in this form: `ADAM: No way` but not this form: `[Adam] No way`. So the `speaker` column in `subtitle_df` is only populated correctly when the speaker is labeled in the first format.

The second format is harder to identify, because brackets can also hold parentheticals that aren't character names, such as `[chuckling] No way`. Luckily, we've defined a function in `subtitle_auxiliary_io` called `character_subtitle_mentions()` that can generate a list of characters by looking at every sentence in the film and identifying character names.

The function below uses this logic to fill in the `speaker` column in `subtitle_df`, which so far has only been populated for offscreen speakers in the first format. (Subtitle files shouldn't mix both formats, but as a precaution, the function checks the `speaker` column and won't overwrite values that are already populated.)

In [22]:
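def add_paren_offscreen_speaker(subtitle_df, sentence_df, nlp):
    """Fill `speaker` from bracketed labels like `[Adam]` that name a known or generic character.

    Existing `speaker` values are kept. `subtitle_df` is modified in place and also returned.
    """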
    speaker_amended = []
    sentences = sentence_df.sentence.tolist()
    mentioned_characters = character_subtitle_mentions(sentences, nlp)
    generic_characters = ['man', 'woman', 'boy', 'girl', 'both', 'all']
    
    paren_list = subtitle_df.parenthetical.tolist()
    speaker_list = subtitle_df.speaker.tolist()
    
    for paren, speaker in zip(paren_list, speaker_list):
        if speaker:
            speaker_amended.append(speaker)
        elif paren:
            if paren in mentioned_characters or paren in generic_characters:
                speaker_amended.append(paren)
            else:
                speaker_amended.append(None)
        else:
            speaker_amended.append(None)
    
    subtitle_df['speaker'] = speaker_amended
    
    return subtitle_df
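
The per-row decision inside the function can be isolated and tried on plain values. The helper name `resolve_speaker` and the sample inputs are hypothetical; the rule itself mirrors the `if`/`elif` chain above.

```python
def resolve_speaker(speaker, paren, known_characters,
                    generic=("man", "woman", "boy", "girl", "both", "all")):
    """Keep an existing speaker; otherwise accept the parenthetical only if it names a character."""
    if speaker:
        return speaker
    if paren and (paren in known_characters or paren in generic):
        return paren
    return None


characters = ["Lady Bird", "Julie", "Danny"]
print(resolve_speaker(None, "Lady Bird", characters))   # parenthetical is a known character
print(resolve_speaker(None, "laughter", characters))    # parenthetical is a sound, not a speaker
print(resolve_speaker("JESSE", "chuckling", characters))  # existing label is kept
```

One caveat with this kind of truthiness check: if missing values are stored as `NaN` rather than `None` or an empty string, `if speaker:` evaluates to true, so it may be worth confirming how the `speaker` and `parenthetical` columns represent missing data.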

In [23]:
subs = pysrt.open('../subtitles/lady_bird_2017.srt')
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)
subtitle_indices = tie_sentence_subtitle_indices(sentences, subtitle_df)
sentence_df = pd.DataFrame(list(zip(sentences, subtitle_indices)), columns=['sentence', 'subtitle_indices'])

In [24]:
character_subtitle_mentions(sentences, nlp)

['Lady Bird',
 'Shelly',
 'Miguel',
 'Danny',
 'Julie',
 'Davis',
 'Christine',
 'Dad',
 'Kyle',
 'Larry']

In [25]:
new_subtitle_df = add_paren_offscreen_speaker(subtitle_df, sentence_df, nlp)

In [26]:
new_subtitle_df[new_subtitle_df.parenthetical.notnull()][56:61]

Unnamed: 0,srt_index,original_text,start_time,end_time,concat_sep_text,separated_flag,laugh,hesitation,speaker,music,parenthetical,el_parenthetical,el_italic,cleaned_text
279,227,[Lady Bird]\nUgh! That car should be illegal.,00:11:13.590000,00:11:16.260000,[Lady Bird] Ugh! That car should be illegal.,0,0,0,Lady Bird,0,Lady Bird,0,0,Ugh! That car should be illegal.
283,231,[Lady Bird] She is so pretty.,00:11:23.684000,00:11:25.686000,[Lady Bird] She is so pretty.,0,0,0,Lady Bird,0,Lady Bird,0,0,She is so pretty.
284,232,[Julie] Her skin is luminous.,00:11:25.769000,00:11:28.188000,[Julie] Her skin is luminous.,0,0,0,Julie,0,Julie,0,0,Her skin is luminous.
287,235,- [laughter]\n- [Lady Bird]...in the tub...,00:11:33.610000,00:11:36.363000,[laughter],1,1,0,,0,laughter,1,0,
288,235,- [laughter]\n- [Lady Bird]...in the tub...,00:11:33.610000,00:11:36.363000,[Lady Bird]...in the tub...,1,0,0,Lady Bird,0,Lady Bird,0,0,...in the tub...
