# subtitle_nlp
With the subtitles' lines of dialogue cleaned, we can begin our language analyses using the `spaCy` NLP library. We've previously defined a few functions to load the subtitle file, clean the subtitle text, and collect the lines into a single list. We'll look at the subtitle file for *Booksmart* (2019).

In [1]:
import pysrt
import spacy
from collections import Counter
from subtitle_cleaning_io import *

In [2]:
subs = pysrt.open('../subtitles/booksmart.srt')
subs.insert(0, subs[0])
cleaned_lines = clean_subs(subs)
input_lines = remove_blanks(cleaned_lines)

In [3]:
len(input_lines)

2270

With a cleaned list of subtitle lines, we can load it into `spaCy`. First we load a `spaCy` English model, and then pass it the subtitle text joined into a single long string.

In [4]:
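# spaCy 3 removed shortcut names such as 'en'; load the model by its package name
# (assumed here to be the small English model that 'en' usually pointed to)
nlp = spacy.load('en_core_web_sm')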
doc = nlp(' '.join(input_lines))

## Named Entity Recognition
`spaCy` makes named entity recognition (NER) easy: it identifies named entities, which are typically proper nouns, and groups them into categories like "PERSON" or "ORG". We'll look at these two.
### Characters
The audience learns the names of characters by listening to the dialogue (except for the cases where character names are displayed onscreen, most often in documentaries or docu-dramatizations). By counting the most common "PERSON" entities, we may be able to get an idea of who the characters are.

In [5]:
people = []

for ent in doc.ents:
    if ent.label_ == 'PERSON':
        people.append(ent.text)

In [6]:
count = Counter(people)

In [7]:
count.most_common(15)

[('Amy', 49),
 ('Molly', 20),
 ('Nick', 17),
 ('Ryan', 13),
 ('Dude', 9),
 ('Gigi', 9),
 ('Fine', 8),
 ('Mm-hmm', 8),
 ('Malala', 8),
 ('Jesus Christ', 7),
 ('Whoo', 6),
 ('Alan', 6),
 ('Jesus', 6),
 ('Principal Brown', 5),
 ('Gege', 5)]

This worked out pretty well: Amy and Molly are the main characters, and many of the secondary characters' names appear in this list. Epithets and interjections like "Dude" and "Jesus Christ" can be filtered out with a hard-coded exception list.
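
One way such an exception list might look is sketched below. The set `NOT_NAMES` and the helper `filter_names` are hypothetical names introduced for illustration, and the sample counts are copied from the output above rather than recomputed.

```python
from collections import Counter

# Hypothetical exception list (an assumption, not part of the original notebook);
# extend it per film as needed. Entries are lowercase for case-insensitive matching.
NOT_NAMES = {"dude", "fine", "mm-hmm", "whoo", "jesus", "jesus christ"}

def filter_names(counts, exclude=NOT_NAMES):
    """Return a new Counter without the excluded strings."""
    return Counter({name: n for name, n in counts.items()
                    if name.lower() not in exclude})

# A few entries taken from the PERSON counts above.
sample = Counter({"Amy": 49, "Molly": 20, "Nick": 17, "Dude": 9, "Jesus Christ": 7})
characters = filter_names(sample)
print(characters.most_common())
```

With the real data, the same call would be `filter_names(count)`. A per-film list like this will likely need manual tuning, since the false positives depend on the dialogue.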

### Organization
This entity category has a wider reach than its name implies. It will not only catch organizations, companies, and institutions, but also do its best to identify made-up groups or organizations through context. Below, it has identified two colleges, as well as "Nick's". The entire film revolves around Amy and Molly trying to get to a party at Nick's house.

In [8]:
entities = []

for ent in doc.ents:
    if ent.label_ == 'ORG':
        entities.append(ent.text)
count = Counter(entities)
count.most_common(5)

[("Nick's", 19), ('Yale', 7), ('Nick', 3), ('WOMAN', 2), ('Columbia', 2)]
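
Since the "PERSON" and "ORG" loops differ only in the label, they could be folded into one helper. The sketch below assumes a hypothetical function `count_entities`, and uses simple stand-in objects with `text` and `label_` attributes in place of real `spaCy` entities so it runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def count_entities(ents, label):
    """Count entity texts whose label_ matches the given label."""
    return Counter(ent.text for ent in ents if ent.label_ == label)

# Stand-ins for spaCy entity spans (illustrative only).
fake_ents = [
    SimpleNamespace(text="Amy", label_="PERSON"),
    SimpleNamespace(text="Yale", label_="ORG"),
    SimpleNamespace(text="Amy", label_="PERSON"),
]

person_counts = count_entities(fake_ents, "PERSON")
org_counts = count_entities(fake_ents, "ORG")
print(person_counts.most_common(5))
print(org_counts.most_common(5))
```

In the notebook itself, the equivalent call would be `count_entities(doc.ents, 'ORG').most_common(5)`.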

## Sentence Boundary Detection
A short sentence may need only one subtitle, but a longer sentence may span two or more, displayed one after another.

`516
00:18:17,722 --> 00:18:20,683
I'm going to experience
a seminal, fun anecdote,`

`517
00:18:21,518 --> 00:18:23,394
and we are gonna change our stories.`

Even after cleaning with `clean_subs()`, this text is still stored as two separate items. Our cleaning code has no way of knowing that they belong to a single sentence, but `spaCy` does. Because we joined all the lines into one string, its Sentence Boundary Detection functionality can split the text back into sentences.

In [9]:
sentences = []
for sent in doc.sents:
    sentences.append(sent.text)
len(sentences)

2534

In [10]:
for sent in sentences[58:63]:
    print(sent)

You can just say Yale, please.
Well, our class's official policy is to not discuss where anyone is attending next year.
We don't want them to feel insecure.
Very thoughtful.
Anyway, I need to go over the end-of-the-year budget numbers we have.


The longer sentences originally spanned two separate subtitles, but were combined into a single sentence. We can analyze text more accurately when it's properly divided into sentences.
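
To see why joining the fragments first matters, here is a toy illustration. It uses a naive punctuation-based regex rather than `spaCy`'s sentence segmentation, and the helper `naive_sentences` is a hypothetical name. The fragments are the two subtitles shown earlier (in an assumed cleaned form with internal line breaks removed) plus one short line from the output above.

```python
import re

def naive_sentences(fragments):
    """Join subtitle fragments, then split after ., ! or ? followed by whitespace.

    This is a crude stand-in for real sentence boundary detection: it will
    break on abbreviations like "Dr." and cannot handle missing punctuation.
    """
    text = " ".join(fragment.strip() for fragment in fragments)
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

fragments = [
    "I'm going to experience a seminal, fun anecdote,",
    "and we are gonna change our stories.",
    "Very thoughtful.",
]

result = naive_sentences(fragments)
for sentence in result:
    print(sentence)
```

The three fragments come back as two sentences, with the first two subtitles merged. `spaCy` handles far more cases than this rule, which is why the notebook relies on `doc.sents` instead.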