# subtitle_analysis
We can extract a film's English-language subtitle track to get its ground-truth dialogue. From that dialogue we can glean clues about a scene's location, characters, and context. We're not yet ready to use subtitles to analyze the film's entire plot, but we can start small and see what localized information we can learn.

We'll be using the `pysrt` library to parse .srt subtitle files and the `spaCy` library for NLP analysis.

In [1]:
import pysrt
import spacy

In [2]:
subs = pysrt.open('../subtitles/booksmart.srt')

In [3]:
len(subs)

2373

Since each subtitle entry in an .srt file is explicitly numbered, starting at 1, there's an off-by-one discrepancy with the list object, which is indexed from 0. We can offset the list by duplicating the first subtitle item, so that list positions line up with subtitle numbers.

In [4]:
# .srt entries are numbered from 1, but list indices start at 0;
# duplicating the first item shifts every entry right by one position
subs.insert(0, subs[0])
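
The effect of this offset can be sketched with a plain Python list standing in for the subtitle object. The `cues` list and its strings are hypothetical placeholders, not lines from the film:

```python
# Hypothetical stand-in for the parsed subtitles: position 0 holds cue number 1.
cues = ['cue 1', 'cue 2', 'cue 3']

# Before the offset, list position n holds cue n + 1.
assert cues[1] == 'cue 2'

# Duplicating the first element shifts everything right by one,
# so list position n now holds cue n (position 0 is a throwaway copy).
cues.insert(0, cues[0])

print(cues[1])    # cue 1
print(cues[3])    # cue 3
print(len(cues))  # 4, one more than the real number of cues
```

One consequence of this approach is that the length of the list now overcounts by one, and iterating over the whole list visits the first subtitle twice.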

Each `SubRipItem` contains the subtitle text as well as the start and end time.

In [10]:
print(subs[4].text)
print(subs[4].start)
print(subs[4].end)

Take a deep breath.
00:00:09,177
00:00:11,011
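
The printed times follow the hours:minutes:seconds,milliseconds layout used in .srt files. If a plain number is easier to work with, those strings can be converted to seconds. This is a minimal sketch that assumes the input is the string form printed above (for example `str(subs[4].start)`); `srt_time_to_seconds` is a hypothetical helper, not part of `pysrt`:

```python
def srt_time_to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp string to seconds (hypothetical helper)."""
    clock, millis = stamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000

# The start and end times printed for subs[4] above.
start = srt_time_to_seconds('00:00:09,177')
end = srt_time_to_seconds('00:00:11,011')

# How long the line stays on screen, in seconds.
duration = end - start
print(round(duration, 3))  # 1.834
```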


# spaCy
### Initial Analysis
We can use the `spaCy` library for natural-language processing (NLP). We'll eventually look at all of the lines as a whole, but for now, we'll see what we can do for a single line of dialogue.

In [12]:
# 'en' is a spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

In [16]:
subs[1883].text

'No one in this entire school\nknows me at all.'

In [13]:
line = subs[1883].text

In [14]:
doc = nlp(line)

We can separate the line into individual tokens. We can also remove all the stop words or see parts of speech. Note that the line break inside the subtitle is treated as a token of its own, which shows up as an empty-looking entry in the results; we'll deal with this later.

In [15]:
tokens = list(doc)
tokens

[No, one, in, this, entire, school, , knows, me, at, all, .]

In [19]:
# keep only the tokens that spaCy does not flag as stop words
non_stop = [word for word in doc if not word.is_stop]

non_stop

[entire, school, , knows, .]

In [22]:
for token in doc:
    print(token.text, token.pos_)

No DET
one NOUN
in ADP
this DET
entire ADJ
school NOUN

 SPACE
knows VERB
me PRON
at ADV
all ADV
. PUNCT
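
The SPACE token comes from the `\n` inside the subtitle text. One possible way to avoid it is to collapse the line break before the text is passed to `nlp`. This is only a sketch under that assumption, not necessarily the approach taken later, and `flat_line` is a hypothetical name:

```python
line = 'No one in this entire school\nknows me at all.'

# str.split() with no arguments splits on any run of whitespace, including '\n',
# so joining the pieces with single spaces removes the line break.
flat_line = ' '.join(line.split())

print(flat_line)  # No one in this entire school knows me at all.
```

The flattened string could then be passed to `nlp` in place of `line`.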
