# subtitle_analysis
We can extract a film's English-language subtitle track to get the ground-truth dialogue, and from it glean clues about a scene's location, characters, and context. While we're not yet ready to use subtitles to analyze the film's entire plot, we can start small and see what localized information we can learn.

We'll be using the `pysrt` library to parse .srt subtitle files and the `spaCy` library for NLP analysis.

In [1]:
import pysrt
import spacy
from collections import Counter

In [2]:
subs = pysrt.open('../subtitles/booksmart.srt')

In [3]:
len(subs)

2373

Since each subtitle entry in an .srt file is explicitly numbered, starting at 1, there's an off-by-one discrepancy with the list object, which starts at 0. We can align the two by inserting a duplicate of the first subtitle item at position 0.

In [4]:
# subtitle files (.srt) are explicitly numbered, and start at 1
subs.insert(0, subs[0])
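
As a quick sanity check on this padding trick, here is a minimal sketch that uses a stand-in list rather than the real `subs` object. The `Item` tuple, its `number` field, and the `is_aligned` helper are hypothetical names for illustration; nothing here calls `pysrt`.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed subtitle: its printed number and its text.
Item = namedtuple('Item', ['number', 'text'])

items = [Item(1, 'first'), Item(2, 'second'), Item(3, 'third')]

# Duplicate the first item at position 0, mirroring the cell above.
items.insert(0, items[0])

def is_aligned(padded):
    """Return True if every position from 1 onward matches the printed number."""
    return all(item.number == pos for pos, item in enumerate(padded) if pos > 0)

aligned = is_aligned(items)
print(aligned)
```

One side effect to keep in mind: the padded list is one item longer than the subtitle count, and position 0 holds a duplicate that is probably best skipped when looping over every subtitle.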

Each SubRipItem contains the subtitle text as well as the start and end time.

In [5]:
print(subs[4].text)
print(subs[4].start)
print(subs[4].end)

Take a deep breath.
00:00:09,177
00:00:11,011
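
The start and end values can be turned into an on-screen duration. The sketch below parses the printed `HH:MM:SS,mmm` form using only the standard library; `parse_timestamp` is a hypothetical helper, not part of `pysrt`, which has its own time type.

```python
from datetime import timedelta

def parse_timestamp(stamp):
    """Convert an SRT timestamp such as '00:00:09,177' into a timedelta."""
    clock, millis = stamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=int(millis))

# Values copied from the output above.
start = parse_timestamp('00:00:09,177')
end = parse_timestamp('00:00:11,011')

duration = end - start
print(duration.total_seconds())  # 1.834
```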


# spaCy
### Initial Analysis
We can use the `spaCy` library for natural-language processing (NLP). We'll eventually look at all of the lines as a whole, but for now, we'll see what we can do for a single line of dialogue.

In [6]:
nlp = spacy.load('en_core_web_sm')  # assumes the small English model; the 'en' shortcut was removed in spaCy v3

In [7]:
subs[1883].text

'No one in this entire school\nknows me at all.'

In [8]:
line = subs[1883].text

In [9]:
doc = nlp(line)

We can separate the line into individual tokens, filter out the stop words, or look at each token's part of speech. Note that the line break is treated as its own whitespace token (tagged `SPACE` in the part-of-speech output), but we'll deal with this later.

In [10]:
tokens = [token for token in doc]
tokens

[No, one, in, this, entire, school, , knows, me, at, all, .]

In [11]:
non_stop = []
for word in doc:
    if not word.is_stop:
        non_stop.append(word)

non_stop

[entire, school, , knows, .]

In [12]:
for token in doc:
    print(token.text, token.pos_)

No DET
one NOUN
in ADP
this DET
entire ADJ
school NOUN

 SPACE
knows VERB
me PRON
at ADV
all ADV
. PUNCT
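
The `SPACE` token above comes from the newline embedded in the subtitle text. One possible way to avoid it, sketched here with plain string handling before the text ever reaches spaCy, is to collapse all whitespace into single spaces. `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return ' '.join(text.split())

line = 'No one in this entire school\nknows me at all.'
flat_line = normalize_whitespace(line)
print(flat_line)  # No one in this entire school knows me at all.
```

This only suits text spoken by a single character; subtitles that hold two speakers need to be split rather than joined.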


## Named Entity Recognition
### Character Names 
It's easy to conduct Named Entity Recognition (NER) with spaCy. The most prominent use of NER would be to discover character names. In film, the audience usually learns character names when their names are spoken aloud. We can get a count of all spoken names and see if it lines up with the actual character names.

In [13]:
all_dialogue = []
for sub_object in subs:
    all_dialogue.append(sub_object.text)

In [14]:
nlp = spacy.load('en_core_web_sm')  # assumes the small English model; the 'en' shortcut was removed in spaCy v3
doc = nlp('\n'.join(all_dialogue))

In [15]:
people = []

for ent in doc.ents:
    if ent.label_ == 'PERSON':
        people.append(ent.text)

In [16]:
count = Counter(people)

In [17]:
count.most_common(10)

[('Amy', 44),
 ('AMY', 24),
 ('Molly', 16),
 ('Nick', 12),
 ('Ryan', 12),
 ('Malala', 8),
 ('Fine', 7),
 ('Jesus Christ', 6),
 ('Alan', 6),
 ('Jesus', 6)]

Six of the ten most common names are actual character names. But we can improve on this: we know "Jesus" is usually used as an exclamation, and the all-caps AMY is most likely a subtitle indication that the character of Amy is speaking lines (as opposed to someone saying "Amy").
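
The two improvements mentioned above could be sketched as a filter over the counts. This is a rough illustration under two assumptions: that an all-caps entity is always a speaker label, and that a small hand-picked set (`EXCLAMATIONS`, a hypothetical name) covers the exclamations. `filter_names` is likewise a hypothetical helper.

```python
from collections import Counter

# Counts copied from the most_common(10) output above.
raw_counts = Counter({
    'Amy': 44, 'AMY': 24, 'Molly': 16, 'Nick': 12, 'Ryan': 12,
    'Malala': 8, 'Fine': 7, 'Jesus Christ': 6, 'Alan': 6, 'Jesus': 6,
})

# Assumption: a hand-picked set of entities that are usually exclamations.
EXCLAMATIONS = {'Jesus', 'Jesus Christ'}

def filter_names(counts, exclamations=EXCLAMATIONS):
    """Drop all-caps speaker labels and known exclamations from the counts."""
    cleaned = Counter()
    for name, total in counts.items():
        if name.isupper():        # e.g. 'AMY': treated as a speaker label
            continue
        if name in exclamations:  # e.g. 'Jesus': treated as an exclamation
            continue
        cleaned[name] = total
    return cleaned

cleaned = filter_names(raw_counts)
print(cleaned.most_common())
```

Dropping the all-caps entries, rather than merging them into the mixed-case counts, keeps the tally limited to names that are actually spoken aloud.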

# Subtitle Cleanup
Subtitle files are already formatted very neatly. It shouldn't be too hard to clean and shape this data into a format we can use.
### Individual Line Parsing
Subtitle text spans either one or two lines. Text that spans two lines may contain dialogue from either one character or two separate characters.

This is a one-liner, which is just one character speaking.

`29
00:01:19,747 --> 00:01:21,081
I missed you.`

Here, a single character spoke enough dialogue to span two lines.

`69
00:02:43,331 --> 00:02:45,248
I mean, he's you know,
he's the vice president.`

And this is a two-liner with two characters speaking, where each line starts with a dash. (Molly's name is printed because she's speaking from offscreen.)

`30
00:01:21,165 --> 00:01:22,832
-I missed you so much.
-MOLLY: Been one night.`

In [18]:
subs[29].text # one-liner

'I missed you.'

In [19]:
subs[69].text # two-liner from one character

"I mean, he's you know,\nhe's the vice president."

In [20]:
subs[30].text # two-liner spoken by two characters

'-I missed you so much.\n-MOLLY: Been one night.'
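
The `MOLLY:` prefix in this example could be separated from the dialogue with a regular expression. The sketch below assumes that a speaker label is always an all-caps name followed by a colon, which may not hold for every subtitle file; `SPEAKER_LABEL` and `split_speaker` are hypothetical names.

```python
import re

# Assumed label format: an all-caps name, then a colon, then the dialogue.
SPEAKER_LABEL = re.compile(r"([A-Z][A-Z .']*):\s*(.*)", re.DOTALL)

def split_speaker(line):
    """Return (speaker, dialogue); speaker is None when no label is found."""
    # Drop the leading dash that marks one of two speakers, if present.
    text = line[1:] if line.startswith('-') else line
    text = text.strip()
    match = SPEAKER_LABEL.fullmatch(text)
    if match is None:
        return None, text
    return match.group(1).strip(), match.group(2)

text = '-I missed you so much.\n-MOLLY: Been one night.'
parts = [split_speaker(part) for part in text.split('\n')]
print(parts)
# [(None, 'I missed you so much.'), ('MOLLY', 'Been one night.')]
```

Any other all-caps text followed by a colon would also be read as a speaker, so the results are worth spot-checking.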

For best results during NLP processing, we'll want to split two-line, two-character text into two separate lines, and combine two-line, one-character text into a single line. The key to this is searching for the newline escape sequence.

If there's no newline, it's a one-liner. If there is a newline and both the top and bottom lines start with a dash, two characters are speaking, so we break the text into two separate lines and discard both dashes. And if there is a newline but no dashes, a single character is speaking across two lines, so we concatenate the two with a space.

In [21]:
def clean_line(text):
    newline = text.find('\n')
    if newline == -1:
        # one-liner: the second element is a 0 placeholder
        return text, 0
    elif text[0] == '-' and text[newline + 1] == '-':
        # two-liner spoken by two characters: split and drop both dashes
        top_line = text[1:newline]
        bottom_line = text[newline + 2:]
        return top_line, bottom_line
    else:
        # two-liner from one character: join the halves with a space
        concat_line = text[:newline] + ' ' + text[newline + 1:]
        return concat_line, 0

In [22]:
clean_line(subs[29].text) # one-liner

('I missed you.', 0)

In [23]:
clean_line(subs[30].text) # two-liner spoken by two characters

('I missed you so much.', 'MOLLY: Been one night.')

In [24]:
clean_line(subs[69].text) # two-liner from one character

("I mean, he's you know, he's the vice president.", 0)
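
Because `clean_line` pads its result with `0`, callers have to check the second element before using it. As an alternative sketch (not the function used above), a variant that returns a list makes it simple to flatten many subtitles into one sequence of dialogue lines. `split_lines` is a hypothetical name, and the sketch assumes at most one newline per subtitle, as in the examples shown.

```python
def split_lines(text):
    """Variant of clean_line that returns a list of one or two dialogue lines."""
    newline = text.find('\n')
    if newline == -1:
        return [text]                      # one-liner
    top, bottom = text[:newline], text[newline + 1:]
    if top.startswith('-') and bottom.startswith('-'):
        return [top[1:], bottom[1:]]       # two characters: drop both dashes
    return [top + ' ' + bottom]            # one character across two lines

# The three example subtitles from above.
sample_texts = [
    'I missed you.',
    '-I missed you so much.\n-MOLLY: Been one night.',
    "I mean, he's you know,\nhe's the vice president.",
]

dialogue_lines = [line for text in sample_texts for line in split_lines(text)]
for line in dialogue_lines:
    print(line)
```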