# subtitle_dataframes
We can populate two DataFrames from a subtitle file. The first has one row per subtitle line and holds the features we can extract from the subtitle text. The second has one row per sentence, for NLP analysis. The second DataFrame is necessary because some dialogue, such as a very long sentence, may span multiple subtitle objects.

We start with an example from *Before Sunrise* (1995), a romance consisting entirely of naturalistic dialogue between two characters falling in love.

In [1]:
import pysrt
import spacy
import pandas as pd
import datetime
from subtitle_cleaning_io import *
from phrases_io import *

pd.set_option('display.max_colwidth', None)

In [2]:
subs = pysrt.open('../subtitles/before_sunrise.srt')
# Insert a copy of the first subtitle at position 0; the raw loop below skips it with subs[1:].
subs.insert(0, subs[0])

# Raw DataFrame
As an example before we get to the main DataFrames, we'll create a raw DataFrame with basic subtitle information.

For each subtitle object, we extract the index number, the text, and the start and end times. Sometimes the subtitle text has two separate lines to indicate two characters speaking; we detect this and break the two lines out into their own columns. We also detect subtitle text that spans two lines but is part of the same sentence. These lines are separated by a newline character, which we remove.
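The splitting and joining logic lives in the imported `concat_sep_lines` helper, whose source is not shown here. As a rough illustration only, the sketch below uses a hypothetical function, `split_or_join_lines`, and assumes that two-speaker subtitles mark each line with a leading dash. It returns `0` as the second value when there is no second speaker, mirroring the `bottom_line != 0` check used in the cells that follow.

```python
def split_or_join_lines(text):
    """Hypothetical stand-in for concat_sep_lines; not the notebook's actual helper."""
    lines = text.split('\n')
    # Assumption: two lines that both start with a dash mean two speakers.
    if len(lines) == 2 and all(line.lstrip().startswith('-') for line in lines):
        top, bottom = (line.lstrip().lstrip('-').strip() for line in lines)
        return top, bottom
    # Otherwise the newline is only line wrapping: join everything into one line.
    return ' '.join(line.strip() for line in lines), 0


two_speakers = split_or_join_lines("- JESSE: That's a river, right?\n- (chuckling) Yeah.")
wrapped = split_or_join_lines("I was thinking\nabout that.")
```

Real subtitle files vary in how they mark speaker changes, so the dash rule is only one plausible convention.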

In [3]:
indices = []
texts = []
start_times = []
end_times = []
top_lines = []
bottom_lines = []

for sub in subs[1:]:
    indices.append(sub.index)
    texts.append(sub.text)
    start_times.append(sub.start.to_time())
    end_times.append(sub.end.to_time())
    # bottom_line is 0 when the subtitle does not contain a second speaker's line.
    top_line, bottom_line = concat_sep_lines(sub.text)
    top_lines.append(top_line)
    bottom_lines.append(bottom_line if bottom_line != 0 else '')

We zip all the lists together to get the raw DataFrame.

In [4]:
raw_df = pd.DataFrame(
    list(zip(indices, texts, start_times, end_times, top_lines, bottom_lines)),
    columns=['index', 'text', 'start_time', 'end_time', 'top_line', 'bottom_line'],
)
raw_df = raw_df.set_index('index')

In [5]:
raw_df[452:456]

| index | text | start_time | end_time | top_line | bottom_line |
|---|---|---|---|---|---|
| 453 | CELINE: That's the Danube over there. | 00:31:01.319000 | 00:31:03.923000 | CELINE: That's the Danube over there. | |
| 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | JESSE: That's a river, right? | (chuckling) Yeah. |
| 455 | This is gorgeous. | 00:31:14.832000 | 00:31:18.294000 | This is gorgeous. | |
| 456 | Yeah, this is beautiful. | 00:31:19.045000 | 00:31:21.214000 | Yeah, this is beautiful. | |


Subtitle 454 is an example of a single subtitle that contains dialogue from two characters. We'll want to treat these separately.

# Subtitle DataFrame
Next, we can create a DataFrame that has one row per line of dialogue. Subtitle 454 from the example above will be split into two rows, one for each speaker.

In [6]:
indices = []
original_texts = []
start_times = []
end_times = []
concat_sep_texts = []
separated_flag = []

for sub in subs:
    indices.append(sub.index)
    original_texts.append(sub.text)
    start_times.append(sub.start.to_time())
    end_times.append(sub.end.to_time())
    top_line, bottom_line = concat_sep_lines(sub.text)
    if bottom_line != 0:
        # Two speakers: the top line completes the row started above.
        concat_sep_texts.append(top_line)
        separated_flag.append(1)

        # The bottom line gets a second row that repeats the subtitle's metadata.
        indices.append(sub.index)
        original_texts.append(sub.text)
        start_times.append(sub.start.to_time())
        end_times.append(sub.end.to_time())
        concat_sep_texts.append(bottom_line)
        separated_flag.append(1)
    else:
        concat_sep_texts.append(top_line)
        separated_flag.append(0)

In [7]:
subtitle_df = pd.DataFrame(
    list(zip(indices, original_texts, start_times, end_times, concat_sep_texts, separated_flag)),
    columns=['subtitle_index', 'original_text', 'start_time', 'end_time', 'concat_sep_text', 'separated_flag'],
)

In [8]:
subtitle_df[540:545]

| | subtitle_index | original_text | start_time | end_time | concat_sep_text | separated_flag |
|---|---|---|---|---|---|---|
| 540 | 453 | CELINE: That's the Danube over there. | 00:31:01.319000 | 00:31:03.923000 | CELINE: That's the Danube over there. | 0 |
| 541 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | JESSE: That's a river, right? | 1 |
| 542 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | (chuckling) Yeah. | 1 |
| 543 | 455 | This is gorgeous. | 00:31:14.832000 | 00:31:18.294000 | This is gorgeous. | 0 |
| 544 | 456 | Yeah, this is beautiful. | 00:31:19.045000 | 00:31:21.214000 | Yeah, this is beautiful. | 0 |


Since we're breaking out two-character subtitles into separate rows, the DataFrame's positional index no longer matches the original subtitle numbering, so the subtitle_index column keeps track of it.
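To illustrate why keeping the original subtitle number is useful, here is a small, self-contained sketch. It builds a toy DataFrame (named `rows` here purely for illustration) from four of the rows shown above, then uses `subtitle_index` to find every row that came from the same subtitle.

```python
import pandas as pd

# Toy data copied from the rows displayed above; `rows` is an illustrative name.
rows = pd.DataFrame({
    'subtitle_index': [453, 454, 454, 455],
    'concat_sep_text': [
        "CELINE: That's the Danube over there.",
        "JESSE: That's a river, right?",
        "(chuckling) Yeah.",
        "This is gorgeous.",
    ],
})

# Count rows per original subtitle; a count of 2 marks a split subtitle.
rows_per_subtitle = rows.groupby('subtitle_index').size()

# Select every row that came from subtitle 454, whatever its position.
from_454 = rows[rows['subtitle_index'] == 454]
```

The same grouping would work on the full `subtitle_df`, since it carries the same `subtitle_index` column.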

## Feature Extraction
Next, we can go over each piece of text and extract features. We can look for laughter, music, and parentheticals (which indicate actions or non-dialogue sounds). In certain cases, we can even identify the speaker, which will help with dialogue attribution and character tracking.
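The detectors themselves come from the imported helper modules and are not shown in this notebook. As a hedged illustration of how such detectors might work, the sketch below uses regular expressions. The `sketch_*` function names, the regex patterns, and the list of laughter keywords are all assumptions made for this example, not the actual implementations of `find_speaker`, `find_parenthetical`, or `find_laugh`.

```python
import re

# Assumption: a speaker label is an all-caps name at the start of the line, followed by a colon.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'-]*):")
# Assumption: a parenthetical is any text enclosed in round brackets.
PAREN_RE = re.compile(r"\(([^)]*)\)")
# Assumption: laughter is signalled by one of these word stems inside a parenthetical.
LAUGH_STEMS = ("laugh", "chuckl", "giggl")


def sketch_find_speaker(text):
    match = SPEAKER_RE.match(text)
    return match.group(1) if match else ''


def sketch_find_parenthetical(text):
    match = PAREN_RE.search(text)
    return match.group(1) if match else ''


def sketch_find_laugh(text):
    inner = sketch_find_parenthetical(text).lower()
    return int(any(stem in inner for stem in LAUGH_STEMS))


speaker = sketch_find_speaker("CELINE: That's the Danube over there.")
parenthetical = sketch_find_parenthetical("(chuckling) Yeah.")
laugh = sketch_find_laugh("(chuckling) Yeah.")
```

Returning `0`/`1` flags and empty strings keeps each result easy to store as a DataFrame column, though the real helpers may use different conventions for missing values.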

In [9]:
subtitle_df['laugh'] = subtitle_df['concat_sep_text'].map(find_laugh)
subtitle_df['speaker'] = subtitle_df['concat_sep_text'].map(find_speaker)
subtitle_df['music'] = subtitle_df['concat_sep_text'].map(find_music)
subtitle_df['parenthetical'] = subtitle_df['concat_sep_text'].map(find_parenthetical)
subtitle_df['el_parenthetical'] = subtitle_df['concat_sep_text'].map(find_el_parenthetical) # entire-line paren
subtitle_df['el_italic'] = subtitle_df['concat_sep_text'].map(find_el_italic) # entire-line italic

In [10]:
subtitle_df[540:545]

| | subtitle_index | original_text | start_time | end_time | concat_sep_text | separated_flag | laugh | speaker | music | parenthetical | el_parenthetical | el_italic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 540 | 453 | CELINE: That's the Danube over there. | 00:31:01.319000 | 00:31:03.923000 | CELINE: That's the Danube over there. | 0 | 0 | CELINE | 0 | | 0 | |
| 541 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | JESSE: That's a river, right? | 1 | 0 | JESSE | 0 | | 0 | |
| 542 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | (chuckling) Yeah. | 1 | 1 | | 0 | chuckling | 0 | |
| 543 | 455 | This is gorgeous. | 00:31:14.832000 | 00:31:18.294000 | This is gorgeous. | 0 | 0 | | 0 | | 0 | |
| 544 | 456 | Yeah, this is beautiful. | 00:31:19.045000 | 00:31:21.214000 | Yeah, this is beautiful. | 0 | 0 | | 0 | | 0 | |


## Dialogue Cleaning
We have a few functions that clean our dialogue to prepare it for the next DataFrame, which focuses on NLP analysis. We want only dialogue: no speaker names, no actions or descriptions, no music lyrics. We can also remove mid-sentence interjections such as "um", "uh", or "you know", which make the sentence more complex. The clean_line function applies this cleaning, so we use it to populate a new column in the DataFrame.
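The cleaning function is imported from a helper module, so its source is not shown here. The sketch below is a simplified, hypothetical version named `sketch_clean_line`. It assumes that speaker labels are all-caps names followed by a colon, that actions appear in round brackets, and that interjections are set off by commas. It does not attempt to handle music lyrics.

```python
import re


def sketch_clean_line(text):
    """Hypothetical, simplified cleaner; not the notebook's clean_line."""
    # Drop parentheticals such as "(chuckling)".
    text = re.sub(r"\([^)]*\)", "", text)
    # Drop a leading all-caps speaker label such as "JESSE:".
    text = re.sub(r"^\s*[A-Z][A-Z .'-]*:\s*", "", text)
    # Drop a few mid-sentence interjections when they are set off by commas.
    text = re.sub(r",\s*(?:um|uh|you know),", ",", text, flags=re.IGNORECASE)
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


cleaned_speaker = sketch_clean_line("CELINE: That's the Danube over there.")
cleaned_paren = sketch_clean_line("(chuckling) Yeah.")
cleaned_filler = sketch_clean_line("I was, um, thinking.")
```

Removing parentheticals before speaker labels means a line that opens with an action and then names a speaker is still stripped of both.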

In [11]:
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)

In [12]:
subtitle_df[540:545]

| | subtitle_index | original_text | start_time | end_time | concat_sep_text | separated_flag | laugh | speaker | music | parenthetical | el_parenthetical | el_italic | cleaned_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 540 | 453 | CELINE: That's the Danube over there. | 00:31:01.319000 | 00:31:03.923000 | CELINE: That's the Danube over there. | 0 | 0 | CELINE | 0 | | 0 | | That's the Danube over there. |
| 541 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | JESSE: That's a river, right? | 1 | 0 | JESSE | 0 | | 0 | | That's a river, right? |
| 542 | 454 | - JESSE: That's a river, right?\n- (chuckling) Yeah. | 00:31:03.947000 | 00:31:07.158000 | (chuckling) Yeah. | 1 | 1 | | 0 | chuckling | 0 | | Yeah. |
| 543 | 455 | This is gorgeous. | 00:31:14.832000 | 00:31:18.294000 | This is gorgeous. | 0 | 0 | | 0 | | 0 | | This is gorgeous. |
| 544 | 456 | Yeah, this is beautiful. | 00:31:19.045000 | 00:31:21.214000 | Yeah, this is beautiful. | 0 | 0 | | 0 | | 0 | | Yeah, this is beautiful. |
