# attribution_subtitles
This notebook explores the use of the `pysrt` library to analyze subtitle files. These .srt files are simply text files with specific formatting that allows them to be read by media players. The goal is to be able to assemble a scene's entire conversation and attribute individual lines to characters.

In [1]:
import pandas as pd
import pysrt
import pyAudioAnalysis.audioSegmentation
import datetime

# Loading Subtitles

.srt subtitle files have a very specific format. Each entry has an index number, a start and end time (both in HH:MM:SS,mmm format), and either one or two lines of dialogue. This formatting makes it very easy for the `pysrt` library to convert an entire .srt file into a list of subtitle objects, `SubRipItem`.

We'll look at an example from *Hobbs and Shaw*. Below is an example of formatting from an .srt file.

266

00:12:49,102 --> 00:12:51,103

No wonder we left

the family business.
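
To make the layout concrete, here is a minimal sketch that parses the entry above using only the standard library. It is not how `pysrt` is implemented; `parse_entry` and `TIMESTAMP` are hypothetical names used only for illustration, and the sketch assumes a well-formed entry.

```python
import re

# Timestamp layout used by .srt files: HH:MM:SS,mmm
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(raw):
    """Convert an 'HH:MM:SS,mmm' string to milliseconds since the film's start."""
    h, m, s, ms = (int(g) for g in TIMESTAMP.fullmatch(raw.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_entry(block):
    """Split one .srt entry into (index, start_ms, end_ms, text).

    Illustrative only: assumes the index on line 1, the time range on
    line 2, and the dialogue on the remaining lines.
    """
    lines = block.strip().splitlines()
    index = int(lines[0])
    start_raw, end_raw = lines[1].split(" --> ")
    return index, to_ms(start_raw), to_ms(end_raw), "\n".join(lines[2:])


entry = """266
00:12:49,102 --> 00:12:51,103
No wonder we left
the family business."""

index, start_ms, end_ms, text = parse_entry(entry)
print(index, start_ms, end_ms)
print(text)
```

In practice `pysrt.open` does this work for the whole file, so the sketch is only meant to show which piece of the entry ends up where.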



In [2]:
subs = pysrt.open('../subtitles/hobbs_shaw.srt')

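The .srt file index starts at 1, while the `subs` list starts at 0. As a quick way to align the two, we insert the first subtitle object (`SubRipItem`) a second time at position 0. Assuming the file's numbering is contiguous, `subs[n]` is then the subtitle numbered `n` in the file, and `len(subs)` is one greater than the number of subtitles.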

In [3]:
# subtitle files (.srt) are explicitly numbered, and start at 1
subs.insert(0, subs[0])

In [4]:
len(subs)

2704
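
An alternative way to align the two numbering schemes, sketched here under assumptions, is to build a dictionary keyed on each subtitle's own `index` attribute. The stand-in objects below only mimic the `index` and `text` attributes of `SubRipItem`, and `by_index` is a hypothetical name; the rest of this notebook keeps the duplicate-entry approach.

```python
from types import SimpleNamespace

# Stand-ins for SubRipItem objects (only the attributes used here).
items = [
    SimpleNamespace(index=1, text="first line"),
    SimpleNamespace(index=2, text="second line"),
    SimpleNamespace(index=3, text="third line"),
]

# Map each subtitle's own index number to the object.
by_index = {item.index: item for item in items}

# Lookup uses the .srt index rather than the list position,
# and the collection's length is unchanged because nothing is duplicated.
print(by_index[2].text)
print(len(by_index))
```

A dictionary also keeps working if the file's numbering has gaps, which the insert-at-zero trick does not handle.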

## Reading Subtitle Objects
`SubRipItem` objects contain the `text` attribute, which is the one or two lines of dialogue. They also contain `start` and `end`, which can be broken down into hours, minutes, seconds, and milliseconds.

In [5]:
subs[1].text

'("Time in a Bottle"\nby Yungblud playing)'

In [6]:
print(subs[62].text)

I want her on the run
with no place to turn.


In [7]:
print(subs[62].start.hours)
print(subs[62].start.minutes)
print(subs[62].start.seconds)
print(subs[62].start.milliseconds)

0
4
32
522


## Time
One of the challenges is working with separate timestamp systems. The subtitle file (and the extracted subtitle list) starts from the beginning of the film, while (for this exploratory example) the audio file starts at a manually-designated start time. Later, we'll have the issue of using the frame number to identify a time.
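
As a rough sketch of bridging the two timestamp systems, a film timestamp can be turned into an offset from the start of the audio clip by subtracting the clip's start time. The function name `film_to_clip_seconds` and the clip start time below are assumptions made for this example only.

```python
import datetime


def as_delta(t):
    """Represent a datetime.time as the time elapsed since 00:00:00."""
    return datetime.timedelta(
        hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond
    )


def film_to_clip_seconds(film_time, clip_start):
    """Seconds from the start of the audio clip to a film timestamp.

    Negative values mean the film timestamp falls before the clip begins.
    """
    return (as_delta(film_time) - as_delta(clip_start)).total_seconds()


# Hypothetical clip start, chosen only for this example.
clip_start = datetime.time(0, 12, 46)
subtitle_start = datetime.time(0, 13, 0, 613000)

offset = film_to_clip_seconds(subtitle_start, clip_start)
print(offset)
```

`datetime.time` objects cannot be subtracted directly, which is why the sketch converts both values to `datetime.timedelta` first.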

`pysrt` gives us three methods we can use for time:

`.slice()` (called on the subtitle list) returns all subtitles that start or end before/after the given times

`.at()` (called on the subtitle list) returns a list of the subtitles onscreen at a specific time; this usually holds a single subtitle

`.to_time()` (called on a subtitle's `start` or `end`) returns a `datetime.time` object containing hours, minutes, seconds, and microseconds, that can be used universally

In [8]:
part = subs.slice(starts_after={'minutes': 12, 'seconds': 47, 'milliseconds': 400}, ends_before={'minutes': 12, 'seconds': 49, 'milliseconds': 0})

In [9]:
part = subs.at(minutes=12, seconds=47, milliseconds=400)

In [10]:
part

[<pysrt.srtitem.SubRipItem object at 0x7f699bd3bdd0>]

In [11]:
part[0].index

264

In [12]:
time_obj = subs[271].start.to_time()

In [13]:
time_obj

datetime.time(0, 13, 0, 613000)

As a reminder, we'll eventually be working with frames, the visual images generated once per second. The frames' filenames will contain a frame number, which corresponds cleanly to the number of seconds elapsed. So `hobbs_shaw_frame48.jpg` shows what happens 48 seconds into the film.

The function below converts that number into a `datetime.time` object.

In [14]:
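def frame_to_time(seconds):
    """Convert a frame number (seconds elapsed in the film) to a datetime.time."""
    # wrap at 24 hours so that hours stays within datetime.time's 0-23 range
    seconds = seconds % (24 * 3600)
    hours = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60

    timestamp = datetime.time(hours, minutes, seconds)

    return timestamp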
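
Since the frame number lives in the filename, a small helper can pull it out before the conversion. This is a sketch that assumes every filename ends in `frame<number>.jpg`, as in `hobbs_shaw_frame48.jpg`; `frame_number` and `seconds_to_time` are hypothetical names, and `seconds_to_time` is the same conversion as above rewritten with `divmod` so the block runs on its own.

```python
import datetime
import re


def frame_number(filename):
    """Pull the trailing frame number out of a name like 'hobbs_shaw_frame48.jpg'."""
    match = re.search(r"frame(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"no frame number in {filename!r}")
    return int(match.group(1))


def seconds_to_time(seconds):
    """Convert elapsed seconds to a datetime.time, wrapping at 24 hours."""
    minutes, secs = divmod(seconds % (24 * 3600), 60)
    hours, minutes = divmod(minutes, 60)
    return datetime.time(hours, minutes, secs)


n = frame_number("hobbs_shaw_frame48.jpg")
print(n, seconds_to_time(n))
```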

# Subtitles Onscreen Flag
To assist with dialogue attribution, we can create a flag that identifies whether a subtitle is onscreen during a given frame. This requires converting a frame number (taken from its filename) to a subtitle timestamp. The conversion gives us an HH:MM:SS time object, but remember that subtitles also work with milliseconds. For each time, we should check whether there's a subtitle at HH:MM:SS,000 or at HH:MM:SS,999. Note that `datetime.time` takes microseconds, so 999 milliseconds is written as `999000`.

In [15]:
# the fourth argument of datetime.time is microseconds, so 999 ms is 999000
if subs.at(datetime.time(0, 12, 47, 0)) or subs.at(datetime.time(0, 12, 47, 999000)):
    print('subtitle found')

subtitle found


In [16]:
first_frame = frame_to_time(766)

In [17]:
if subs.at(first_frame) or subs.at(first_frame.replace(microsecond=999000)):
    print('subtitle found')

subtitle found
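
Checking only the two ends of each second can miss a short subtitle that starts and ends inside the same second. One possible alternative, sketched here with plain millisecond values rather than `pysrt` objects, is an interval overlap test. The name `overlaps_frame` is hypothetical, and the times are taken from the *Hobbs and Shaw* entry shown earlier (00:12:49,102 --> 00:12:51,103).

```python
def overlaps_frame(start_ms, end_ms, frame):
    """True if a subtitle shown from start_ms to end_ms is visible at any
    point during the one-second window that begins at `frame` seconds."""
    window_start = frame * 1000
    window_end = window_start + 1000
    return start_ms < window_end and end_ms > window_start


# Subtitle 266: 00:12:49,102 --> 00:12:51,103, expressed in milliseconds
start_ms, end_ms = 769102, 771103

visible_frames = [f for f in range(766, 774) if overlaps_frame(start_ms, end_ms, f)]
print(visible_frames)

# A made-up subtitle lasting from 100.2 s to 100.8 s still counts for frame 100,
# even though it is not onscreen at 100.000 s or 100.999 s.
print(overlaps_frame(100200, 100800, 100))
```

The notebook continues with the two-endpoint check, which is simpler and works directly with `subs.at()`.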


There's one type of subtitle that we don't want to trigger the `subtitle_onscreen` flag: parentheticals. These communicate scene audio, such as in-scene sound effects or non-dialogue sounds made by characters (laughter, for example). Because they are wrapped in parentheses, we can detect them and exclude them from our `subtitle_onscreen` check.

In [18]:
laugh_frame = frame_to_time(766)
print(subs.at(laugh_frame).text)

(laughs)


In [19]:
scene_music_frame = frame_to_time(824)
print(subs.at(scene_music_frame).text)

(music playing quietly
over speakers)


In [20]:
if subs.at(laugh_frame).text[0] == '(' and subs.at(laugh_frame).text[-1] == ')':
    print('Parenthetical subtitle: no spoken dialogue')

Parenthetical subtitle: no spoken dialogue
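
The same test can be wrapped in a small helper. This is a minimal sketch that works on the subtitle text alone; `is_parenthetical` is a hypothetical name. Unlike the indexing check above, it strips surrounding whitespace and does not raise an error on empty text.

```python
def is_parenthetical(text):
    """True when the whole subtitle is wrapped in parentheses."""
    stripped = text.strip()
    return stripped.startswith("(") and stripped.endswith(")")


print(is_parenthetical("(laughs)"))
print(is_parenthetical("(music playing quietly\nover speakers)"))
print(is_parenthetical("I want her on the run\nwith no place to turn."))
print(is_parenthetical(""))
```

Like the original check, this only looks at the first and last characters, so a subtitle that mixes a sound cue with speech and happens to begin and end with parentheses would still be treated as a parenthetical.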


In [21]:
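# inverse of above: the first branch runs only when the text is NOT wrapped in parentheses
if subs.at(laugh_frame).text[0] != '(' or subs.at(laugh_frame).text[-1] != ')':
    print('Subtitle with spoken dialogue')
else:
    print('Parenthetical subtitle: no spoken dialogue')

Parenthetical subtitle: no spoken dialogue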


The code block below iterates through each frame and checks whether a subtitle is onscreen at the beginning or end of the frame's one-second duration. If a subtitle is found, it also checks whether it's a parenthetical, which doesn't count. The result is a 58-element list that we'll use later in the frame DataFrame for dialogue attribution.

In [22]:
frame_choice = list(range(766, 824))

def has_dialogue(found):
    # True if a subtitle is onscreen and it is not wrapped in parentheses
    return bool(found) and (found.text[0] != '(' or found.text[-1] != ')')

subtitle_onscreen = []
for frame in frame_choice:
    time = frame_to_time(frame)

    # check the start (,000) and the end (,999) of the frame's one-second window
    if has_dialogue(subs.at(time)) or has_dialogue(subs.at(time.replace(microsecond=999000))):
        subtitle_onscreen.append(1)
    else:
        subtitle_onscreen.append(0)

In [23]:
len(subtitle_onscreen)

58

In [24]:
subtitle_onscreen[0:5]

[0, 1, 1, 1, 1]