# subtitle_auxiliary
We can define a few other subtitle functions to help us interpret films that aren't related to the NLP-style analyses defined in other notebooks. These functions help with other interpretations, such as determining the film's characters or identifying scene boundaries.

In [1]:
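import spacy
import pysrt
from collections import Counter
from subtitle_dataframes_io import *

# The 'en' shortcut only resolves on spaCy 2.x; on 3.x, load a named model such as 'en_core_web_sm'
nlp = spacy.load('en')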

In [2]:
subs = pysrt.open('../subtitles/plus_one.srt')
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)

## Character Identification
### Dialogue Mentions
Given (cleaned) text consisting solely of dialogue, `spaCy`'s named-entity recognizer tags each spoken name as a `PERSON` entity, and we can count how many times each name appears. This is a rough way of determining the film's characters, since the main characters are usually mentioned the most. We also define a blacklist of strings that tend to be tagged as people but aren't character names, such as the exclamations "Jesus" and "God".

In [3]:
doc = nlp(' '.join(sentences))

In [4]:
people_blacklist = ['Jesus', 'Jesus Christ', 'Whoo', 'God', 'Mm', 'Dude', 'Mm-hmm', 'Huh']

people = []

for ent in doc.ents:
    if ent.label_ == 'PERSON' and ent.text not in people_blacklist:
        people.append(ent.text)
count = Counter(people)
count.most_common(10)

[('Ben', 95),
 ('Alice', 33),
 ('Dad', 21),
 ('Gina', 11),
 ('Brett', 7),
 ('Nick', 6),
 ('Matt', 5),
 ('Jess Ramsey', 5),
 ('Amanda', 4),
 ('Ben King', 4)]
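The output above counts "Ben" and "Ben King" as separate entries. As a minimal sketch of one possible refinement, the counts for a multi-word name could be folded into its first word when that word was also counted on its own. The helper `merge_name_variants` is hypothetical (it is not part of `subtitle_dataframes_io`), and it assumes the first token of a full name is the character's given name, which will not hold for every film.

```python
from collections import Counter

def merge_name_variants(counts):
    """Fold multi-word names into their first word when that word was also counted alone.

    Hypothetical helper for illustration; assumes the first token is the given name.
    """
    merged = Counter()
    for name, n in counts.items():
        parts = name.split()
        if not parts:
            continue  # skip empty or whitespace-only entity text
        first = parts[0]
        # Only fold "Ben King" into "Ben" if "Ben" appears as its own entry
        key = first if first in counts else name
        merged[key] += n
    return merged

# A subset of the counts shown above
counts = Counter({'Ben': 95, 'Alice': 33, 'Jess Ramsey': 5, 'Ben King': 4})
merged = merge_name_variants(counts)
print(merged.most_common(3))
```

"Jess Ramsey" is left untouched here because "Jess" never appears on its own in the counts, so there is no evidence that the two strings refer to the same character.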

### Offscreen Speaker Names
In many subtitle formats, especially SDH (subtitles for the deaf and hard-of-hearing), any character speaking from offscreen has their name in the subtitles for clarity, like *Anna: Open the door!*. We can count the number of times each name appears as a speaker label; as before, the main characters bubble up to the top. Again, we define a small blacklist, this time of generic labels like "MAN" or "ALL" that don't identify a specific character.

In [5]:
speaker_counts = subtitle_df.speaker.value_counts()
# Generic labels that don't identify a specific character
speaker_blacklist = ['MAN', 'WOMAN', 'BOY', 'GIRL', 'BOTH', 'ALL']

# value_counts() is already sorted by frequency, so filtering preserves the ranking.
# Iterating over items() avoids positional lookups on a string-indexed Series.
speakers = [(name, n) for name, n in speaker_counts.items() if name not in speaker_blacklist]
speakers[0:10]

[('ALICE', 51),
 ('BEN', 47),
 ('CHUCK', 14),
 ('ANGELA', 5),
 ('DAVIS', 3),
 ('NICK', 3),
 ('MATT', 3),
 ('DEEJAY', 2),
 ('BRETT', 2),
 ('RECEPTIONIST', 1)]
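The `speaker` column used above comes from the imported helper functions, whose internals aren't shown in this notebook. As a rough, self-contained sketch of how such a label might be pulled out of a raw subtitle line, the snippet below uses a regular expression. The pattern `SPEAKER_RE` and the function `split_speaker` are assumptions for illustration only, and they assume labels are written in upper case, as in the output above.

```python
import re
from collections import Counter

# Hypothetical pattern: an optional leading dash, an upper-case label, then a colon.
SPEAKER_RE = re.compile(r"^\s*-?\s*([A-Z][A-Z .'-]*):\s*(.*)$")

def split_speaker(line):
    """Return (speaker, dialogue); speaker is None when no label is found."""
    m = SPEAKER_RE.match(line)
    if m:
        return m.group(1).strip(), m.group(2)
    return None, line

# Made-up example lines, not taken from the film's subtitles
lines = ["ALICE: Open the door!", "- BEN: Hey.", "Open the door!", "MAN: Excuse me."]
blacklist = {'MAN', 'WOMAN', 'BOY', 'GIRL', 'BOTH', 'ALL'}

# Count labelled lines, skipping unlabelled lines and generic labels
counts = Counter(
    speaker for speaker, _ in map(split_speaker, lines)
    if speaker is not None and speaker not in blacklist
)
print(counts)
```

This is only a heuristic: an upper-case phrase followed by a colon inside ordinary dialogue would be mistaken for a speaker label, so a blacklist like the one above is still needed.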