# subtitle_auxiliary
We can define a few other subtitle functions to help us interpret films that aren't related to the NLP-style analyses defined in other notebooks. These functions help with other interpretations, such as identifying the film's characters or detecting scene boundaries.

In [1]:
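import pysrt
import spacy
from subtitle_dataframes_io import *
from collections import Counter
from datetime import datetime, date, timedelta

# 'en' is a shortcut that only older spaCy releases accept;
# newer releases need a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')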

In [2]:
subs = pysrt.open('../subtitles/plus_one.srt')
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)

## Character Identification
### Dialogue Mentions
By passing in (cleaned) text consisting solely of dialogue, `spaCy` can tag the names of people, and we can count the number of times each name was spoken. This is a rough way of identifying the film's characters, since the main characters are usually mentioned the most. We also define a blacklist of words that `spaCy` tends to tag as people but that aren't character names, such as the exclamations "Jesus" and "God".

In [3]:
doc = nlp(' '.join(sentences))

In [4]:
people_blacklist = ['Jesus', 'Jesus Christ', 'Whoo', 'God', 'Mm', 'Dude', 'Mm-hmm', 'Huh']

people = []

for ent in doc.ents:
    if ent.label_ == 'PERSON' and ent.text not in people_blacklist:
        people.append(ent.text)
count = Counter(people)
count.most_common(10)

[('Ben', 95),
 ('Alice', 33),
 ('Dad', 21),
 ('Gina', 11),
 ('Brett', 7),
 ('Nick', 6),
 ('Matt', 5),
 ('Jess Ramsey', 5),
 ('Amanda', 4),
 ('Ben King', 4)]

### Offscreen Speaker Names
In many subtitle formats, especially SDH (subtitles for the deaf and hard-of-hearing), any character speaking from offscreen has their name in the subtitles for clarity, like *Anna: Open the door!*. We can count up the number of times this happens; like before, the main characters bubble up to the top. Again, we define a small blacklist.

In [5]:
speaker_counts = subtitle_df.speaker.value_counts()
speaker_blacklist = ['MAN', 'WOMAN', 'BOY', 'GIRL', 'BOTH', 'ALL']

# value_counts() is already sorted by frequency, so we only need to drop the generic labels
speakers = [(name, n) for name, n in speaker_counts.items() if name not in speaker_blacklist]
speakers[0:10]

[('ALICE', 51),
 ('BEN', 47),
 ('CHUCK', 14),
 ('ANGELA', 5),
 ('NICK', 3),
 ('MATT', 3),
 ('DAVIS', 3),
 ('DEEJAY', 2),
 ('BRETT', 2),
 ('ELLIE', 1)]
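
The `speaker` column used above is populated upstream by the imported helper functions, whose implementation isn't shown in this notebook. As a rough, self-contained illustration of the idea, the sketch below pulls an all-caps label off the front of a subtitle line with a regular expression and tallies the results. The `extract_speaker` function, the `SPEAKER_RE` pattern, and the sample lines are all hypothetical and are not the project's actual parsing code.

```python
import re
from collections import Counter

# Assumed convention: an all-caps label, then a colon, at the start of the line,
# optionally preceded by a dash. Real subtitle files may use other conventions.
SPEAKER_RE = re.compile(r"^\s*-?\s*([A-Z][A-Z .'\-]*?):\s*(.*)$")

def extract_speaker(line):
    """Return (speaker, dialogue); speaker is None when no label is found."""
    match = SPEAKER_RE.match(line)
    if match:
        return match.group(1), match.group(2)
    return None, line.strip()

# Made-up sample lines, for illustration only
sample_lines = [
    "ALICE: Open the door!",
    "I'm coming.",
    "- BEN: One second.",
    "ALICE: Hurry up!",
    "MAN: Hey.",
]
generic_labels = {"MAN", "WOMAN"}

# Count labelled lines, skipping unlabelled lines and generic labels
speaker_tally = Counter(
    speaker
    for speaker, _ in map(extract_speaker, sample_lines)
    if speaker and speaker not in generic_labels
)
print(speaker_tally.most_common())
```

Requiring the label to be all caps keeps ordinary sentences that happen to contain a colon, such as a line mentioning a time of day, from being mistaken for a speaker label.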

## Scene Boundary Detection
### Break in Dialogue
We can use the subtitles to help determine where scenes begin and end. There's usually some "breathing room" between the end of one scene and the start of the next. We can look at the cleaned, dialogue-only text and flag any gap longer than 10 seconds in which there isn't any spoken dialogue (or laughter).
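
At its core, this check is a subtraction between one line's end time and the next line's start time. The sketch below shows that step in isolation on made-up cue times, without the DataFrame. The `gap_between` and `find_gaps` helpers are hypothetical names introduced for illustration, and the sketch assumes every cue is dialogue, so it skips the step of looking back past non-dialogue lines.

```python
from datetime import date, datetime, time, timedelta

def gap_between(end_time, start_time):
    """Length of the silence between one cue's end and the next cue's start."""
    # datetime.time objects can't be subtracted directly,
    # so attach both to the same arbitrary date first
    anchor = date(2000, 1, 1)
    return datetime.combine(anchor, start_time) - datetime.combine(anchor, end_time)

def find_gaps(cues, threshold=timedelta(seconds=10)):
    """Return (index, start_time, gap) for each cue preceded by a gap longer than threshold."""
    gaps = []
    for i in range(1, len(cues)):
        gap = gap_between(cues[i - 1][1], cues[i][0])
        if gap > threshold:
            gaps.append((i, cues[i][0], gap))
    return gaps

# Made-up (start_time, end_time) pairs, for illustration only
cues = [
    (time(0, 0, 1), time(0, 0, 3)),
    (time(0, 0, 4), time(0, 0, 6)),
    (time(0, 0, 30), time(0, 0, 32)),  # 24 seconds after the previous cue ends
    (time(0, 0, 40), time(0, 0, 41)),  # 8 seconds after the previous cue ends
]
example_gaps = find_gaps(cues)
print(example_gaps)
```

Only the third cue is flagged, since its 24-second gap exceeds the threshold while the 8-second gap does not.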

In [6]:
delay_threshold = timedelta(seconds=10)
today = date.today()  # any fixed date works; time objects can't be subtracted directly

print('subtitle_df row,', 'srt_index,', 'start_time,', 'delay length')

x = 1
while x < len(subtitle_df):
    row = subtitle_df.iloc[x]
    if row.cleaned_text or row.laugh == 1:
        # Walk back to the previous line containing dialogue or laughter.
        # The y <= x guard keeps x - y from going negative and wrapping to the end of the frame.
        y = 1
        while y <= x and not subtitle_df.iloc[x - y].cleaned_text and subtitle_df.iloc[x - y].laugh == 0:
            y += 1

        # y > x means nothing was spoken earlier, so there is no gap to measure
        if y <= x:
            delay = datetime.combine(today, row.start_time) - datetime.combine(today, subtitle_df.iloc[x - y].end_time)
            if delay > delay_threshold:
                print(x, row.srt_index, row.start_time, delay)
    x += 1

subtitle_df row, srt_index, start_time, delay length
345 302 00:11:50.044000 0:00:33.826000
365 319 00:12:43.264000 0:00:36.245000
383 337 00:13:31.770000 0:00:12.429000
389 343 00:14:04.678000 0:00:24.983000
485 429 00:17:15.160000 0:00:11.970000
664 588 00:23:37.918000 0:00:13.097000
792 700 00:27:40.160000 0:00:23.524000
817 722 00:28:43.473000 0:00:19.895000
905 797 00:31:44.362000 0:00:22.398000
1035 901 00:35:39.555000 0:00:27.361000
1141 985 00:39:01.298000 0:00:20.604000
1183 1020 00:40:46.695000 0:00:41.500000
1351 1164 00:46:52.644000 0:00:27.611000
1398 1198 00:48:27.823000 0:00:32.408000
1427 1224 00:49:51.323000 0:00:12.096000
1438 1232 00:50:28.777000 0:00:20.813000
1695 1434 00:57:55.723000 0:00:12.137000
1719 1454 00:59:08.004000 0:00:57.267000
2131 1797 01:11:47.805000 0:00:10.428000
2163 1825 01:12:55.163000 0:00:11.136000
2254 1905 01:16:26.124000 0:00:14.055000
2376 2018 01:21:25.006000 0:00:22.397000
2414 2052 01:23:02.312000 0:00:15.266000
2494 2128 01:2