# Episode 12x01 data preparation

This is the first version of the software, which was created to align the files that were provided at the beginning of the project: 

 
1.   An XLSX file containing the data for episode 12x01.
2.   An SRT file with the subtitles of episode 12x01.

Here, the alignment was only performed at the episode level. The aim was to identify a method that could be used to match subtitles with the corresponding segments. 

# Libraries

The main libraries used in this notebook are pandas and re. Pandas makes it easy to import and export Excel files and provides additional tools for working with tables. Google Colab also includes an extension that makes pandas DataFrames interactive, so the resulting tables can be previewed at any point. Lastly, Python's built-in re module (regular expressions) was used to parse the subtitles based on the specifics of the SRT format.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
pd.options.mode.chained_assignment = None # Suppresses pandas' SettingWithCopyWarning

# Importing the XLSX file

In [3]:
# Opening .xlsx file 

excel_path = input('Enter .xlsx file path: ')
df_excel = pd.read_excel(excel_path)
df_excel.rename(columns={'Start,time':'Start', 'End,time':'End'}, inplace=True)

Enter .xlsx file path: /content/GAS12E01.xlsx


In [4]:
# Structure of the .xlsx file

df_excel

Unnamed: 0,Series,Season,Code,Start,End,Time,PP,SP,MP
0,GA,GAS12,GAS12E01,00:00:00,00:00:07,00:00:07,,,
1,GA,GAS12,GAS12E01,00:00:07,00:00:28,00:00:21,0.0,6.0,0.0
2,GA,GAS12,GAS12E01,00:00:28,00:00:30,00:00:02,,,
3,GA,GAS12,GAS12E01,00:00:30,00:00:56,00:00:26,0.0,6.0,0.0
4,GA,GAS12,GAS12E01,00:00:56,00:01:10,00:00:14,0.0,6.0,0.0
...,...,...,...,...,...,...,...,...,...
63,GA,GAS12,GAS12E01,00:40:53,00:40:54,00:00:01,0.0,6.0,0.0
64,GA,GAS12,GAS12E01,00:40:54,00:40:55,00:00:01,6.0,0.0,0.0
65,GA,GAS12,GAS12E01,00:40:55,00:40:58,00:00:03,0.0,6.0,0.0
66,GA,GAS12,GAS12E01,00:40:58,00:41:00,00:00:02,6.0,0.0,0.0


# Importing and parsing the SRT file

In [5]:
# Opening .srt file 

srt_path = input('Enter .srt file path: ')
with open(srt_path, 'r', encoding='utf-8-sig') as f: # Assumes a UTF-8 file; 'utf-8-sig' also drops a leading BOM if present
  subs = f.read().splitlines()

Enter .srt file path: /content/Grey's Anatomy - 12x01 - Sledgehammer.REPACK-KILLERS.English.HI.C.updated.Addic7ed.com.srt


Each subtitle in an SRT file has four parts: a counter indicating the number of the subtitle; a line with the start and end timestamps; one or more lines of text; and an empty line marking the end of the subtitle[¹](https://docs.fileformat.com/video/srt/).
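
As a minimal sketch of this four-part structure, the snippet below splits a short SRT excerpt into blocks and unpacks each one. The excerpt reuses subtitles 2 and 3 from the file; the names `sample`, `timestamp_line` and `parsed` are hypothetical and only serve the illustration. It assumes well-formed blocks separated by blank lines, which is not necessarily how the notebook itself parses the file.

```python
import re

# Excerpt with the same layout as the .srt file: counter, timestamps, text, blank line
sample = """2
00:00:02,702 --> 00:00:04,619
MIRANDA: I have five rules.
Memorize them.

3
00:00:04,620 --> 00:00:07,288
Can anybody name...
"""

# Captures the start and end timestamps of a timestamp line
timestamp_line = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})')

parsed = []
for block in re.split(r'\n\s*\n', sample.strip()):  # Blank lines separate subtitles
    lines = block.splitlines()
    counter = int(lines[0])                          # Part 1: subtitle counter
    start, end = timestamp_line.match(lines[1]).groups()  # Part 2: timestamps
    text = ' '.join(lines[2:])                       # Part 3: one or more text lines
    parsed.append((counter, start, end, text))

print(parsed)
```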

In [6]:
# Structure of the .srt file

print(subs[5:10])
print(subs[10:14])
print(subs[14:19])

['2', '00:00:02,702 --> 00:00:04,619', 'MIRANDA: I have five rules.', 'Memorize them.', '']
['3', '00:00:04,620 --> 00:00:07,288', 'Can anybody name...', '']
['4', '00:00:07,289 --> 00:00:08,456', 'MEREDITH:', '<i>So, you might be thinking...</i>', '']


Based on these features, the timestamps and the text were extracted from each subtitle. 


In [7]:
# Parsing .srt file

re_pattern = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} -->' # Regex that matches lines containing timestamps
regex = re.compile(re_pattern) 

start_times = list(filter(regex.search, subs)) # Stores the lines containing a timespan in a list (HH:MM:SS,mmm --> HH:MM:SS,mmm)
start_times = [time.split(' ')[0] for time in start_times] # Retrieves the timestamp to the left of the arrow (start)
start_times_datetime1 = pd.to_datetime(start_times, format='%H:%M:%S,%f') # Converts to datetime.datetime 
start_times_datetime2 = pd.Series(start_times_datetime1, name='Start').dt.time # Converts to datetime.time

end_times = list(filter(regex.search, subs))
end_times = [time.split(' ')[2] for time in end_times] # Retrieves all timestamps to the right (end)
end_times_datetime1 = pd.to_datetime(end_times, format='%H:%M:%S,%f')
end_times_datetime2 = pd.Series(end_times_datetime1, name='End').dt.time

subtitles = [[]] # Start with an empty list of lists

for sub in subs:
    if re.match(re_pattern, sub): # If the regex matches the timestamps
        subtitles[-1].pop() # Removes the subtitle counter, i.e. the last line stored before this timestamp line
        subtitles.append([]) # Appends an empty list when it finds a timestamp
    else:
        subtitles[-1].append(sub) # If there's no timestamp in the line, append the text in the new empty list

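subtitles = subtitles[1:] # Drops the initial list, which is empty once the first counter has been popped
subtitles = [' '.join(x).strip() for x in subtitles] # Joins the lines that are part of the same subtitle; strip() removes the trailing space left by the empty separator line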

In [8]:
# Preview of the parsed subtitle data

df_subs = pd.DataFrame()
df_subs['Sub start'] = start_times_datetime2
df_subs['Sub end'] = end_times_datetime2
df_subs['Sub text'] = subtitles
df_subs

Unnamed: 0,Sub start,Sub end,Sub text
0,00:00:00.804000,00:00:02.701000,RICHARD: Each of you comes here today hopeful...
1,00:00:02.702000,00:00:04.619000,MIRANDA: I have five rules. Memorize them.
2,00:00:04.620000,00:00:07.288000,Can anybody name...
3,00:00:07.289000,00:00:08.456000,"MEREDITH: <i>So, you might be thinking...</i>"
4,00:00:08.457000,00:00:10.291000,"Rule number five... When I move, you move."
...,...,...,...
964,00:40:55.128000,00:40:56.379000,<i>Press firmly.
965,00:40:56.380000,00:40:58.497000,"♪ That do that, do that, I-G-G-Y ♪"
966,00:40:58.498000,00:41:02.168000,<i>No regrets. And let's begin.
967,00:41:02.169000,00:41:05.037000,"♪ Oh, oh, oh, oh, oh, oh ♪"


# Computing the average of each subtitle's timespan

When the timespan of a subtitle fits entirely within a segment, the alignment is straightforward. However, some subtitles overlap two different segments. To disambiguate those cases, the midpoint of each subtitle's timespan (the mean of its start and end timestamps) is used: the subtitle is assigned to the segment that contains this midpoint, which is the segment it belongs to "the most" on average (see also 3. Approach). 

It should be noted that this operation cannot be performed on datetime.time objects (the ones without the date), which is why start_times_datetime1 and end_times_datetime1 are used instead. The date is removed at the end.
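
The point about datetime.time can be checked on a single subtitle. The sketch below uses the timestamps of the subtitle "Can anybody name..." and hypothetical variable names (`starts`, `ends`, `midpoints`); it illustrates the arithmetic rather than replacing the cell that follows.

```python
import datetime
import pandas as pd

# Start and end of the subtitle "Can anybody name..." as full datetimes
starts = pd.to_datetime(['00:00:04,620'], format='%H:%M:%S,%f')
ends = pd.to_datetime(['00:00:07,288'], format='%H:%M:%S,%f')

# datetime.time objects do not support subtraction, so this raises a TypeError
try:
    datetime.time(0, 0, 7) - datetime.time(0, 0, 4)
    time_subtraction_works = True
except TypeError:
    time_subtraction_works = False

# Full datetimes do support it: midpoint = start + (end - start) / 2
midpoints = starts + (ends - starts) / 2

# The date part is dropped only after the arithmetic is done
midpoint_times = pd.Series(midpoints).dt.time
print(midpoint_times[0])  # 00:00:05.954000
```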

In [9]:
# Computing the mean of each subtitle's timespan, which is used later to align the subs with the Excel segments

average_times = []

for x, y in zip(start_times_datetime1, end_times_datetime1):
  ts1 = x
  ts2 = y
  average_time = ts1+(ts2-ts1)/2 
  average_times.append(average_time)

average_times = pd.Series(average_times).dt.time 

In [10]:
# Adding the averages to the existing DataFrame

df_subs['Sub average'] = average_times
df_subs = df_subs[['Sub start', 'Sub end', 'Sub average', 'Sub text']]
df_subs

Unnamed: 0,Sub start,Sub end,Sub average,Sub text
0,00:00:00.804000,00:00:02.701000,00:00:01.752500,RICHARD: Each of you comes here today hopeful...
1,00:00:02.702000,00:00:04.619000,00:00:03.660500,MIRANDA: I have five rules. Memorize them.
2,00:00:04.620000,00:00:07.288000,00:00:05.954000,Can anybody name...
3,00:00:07.289000,00:00:08.456000,00:00:07.872500,"MEREDITH: <i>So, you might be thinking...</i>"
4,00:00:08.457000,00:00:10.291000,00:00:09.374000,"Rule number five... When I move, you move."
...,...,...,...,...
964,00:40:55.128000,00:40:56.379000,00:40:55.753500,<i>Press firmly.
965,00:40:56.380000,00:40:58.497000,00:40:57.438500,"♪ That do that, do that, I-G-G-Y ♪"
966,00:40:58.498000,00:41:02.168000,00:41:00.333000,<i>No regrets. And let's begin.
967,00:41:02.169000,00:41:05.037000,00:41:03.603000,"♪ Oh, oh, oh, oh, oh, oh ♪"


# Aligning subtitles to the corresponding segments

The following function is called once for every segment in the Excel file. A Boolean mask filters the subtitles that belong to a segment based on the averages from the previous section: a subtitle is part of a segment if its average is later than the segment's start and no later than its end. The filtered DataFrame is then appended to dfs_list, where all of the matched subtitles and segments are stored. 
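
The effect of the mask can be seen on a toy example. The sketch below takes two subtitle averages and the first two segments from the previews above; `toy_subs`, `segments` and `matches` are hypothetical names used only here. The first subtitle ends at 00:00:07.288, after the first segment has finished, but its average still places it in that segment.

```python
import datetime
import pandas as pd

# Averages of two consecutive subtitles, as datetime.time objects
toy_subs = pd.DataFrame({
    'Sub average': [datetime.time(0, 0, 5, 954000), datetime.time(0, 0, 7, 872500)],
    'Sub text': ['Can anybody name...', 'MEREDITH: <i>So, you might be thinking...</i>'],
})

# First two segments of the Excel file as (start, end) pairs
segments = [
    (datetime.time(0, 0, 0), datetime.time(0, 0, 7)),
    (datetime.time(0, 0, 7), datetime.time(0, 0, 28)),
]

matches = {}
for start, end in segments:
    # Same condition as the mask: start < average <= end
    mask = (toy_subs['Sub average'] > start) & (toy_subs['Sub average'] <= end)
    matches[(start, end)] = toy_subs.loc[mask, 'Sub text'].tolist()

for segment, texts in matches.items():
    print(segment, texts)
```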

In [11]:
dfs_list = []

def filter_subs(start, end, pp, sp, mp):

    mask = (df_subs['Sub average'] > start) & (df_subs['Sub average'] <= end)   
    mask_df = df_subs.loc[mask].copy() # Works on a copy, so the column assignments below do not target a slice of df_subs

    #if len(mask_df.index.values) == 0:    # Optional -> Uncomment this block to display empty segments as well

      #empty_df_template = {
      #'Start' : start,
      #'End' : end,
      #'Sub start' : ['NaN'],
      #'Sub end' : ['NaN'],
      #'Sub text' : ['NaN'],
      #'PP' : pp,
      #'SP' : sp,
      #'MP' : mp,
      #}

      #empty_df = pd.DataFrame(empty_df_template)
      #empty_df = empty_df.replace('NaN', np.nan) # replace() returns a new DataFrame; np.NaN no longer exists in NumPy 2.0
      #dfs_list.append(empty_df)

    #else:

    mask_df['Start'] = start
    mask_df['End'] = end
    mask_df['PP'] = pp
    mask_df['SP'] = sp
    mask_df['MP'] = mp
    dfs_list.append(mask_df)

# Filtering all of the subtitles

for a, b, c, d, e in zip(df_excel.Start, df_excel.End, df_excel.PP, df_excel.SP, df_excel.MP):
  filter_subs(a, b, c, d, e)

# Preparing the output dataset

Lastly, the DataFrames contained in dfs_list are combined into a single DataFrame with pd.concat(). The following changes were made to display the data more clearly: 
*   Excel segments were merged into a single column
*   Extra 0s were removed from the subtitles' timestamps
*   Columns were reordered and renamed
*   NaNs were replaced with 0s
*   Floats were converted to integers
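
The removal of the extra 0s can be tried on a few sample strings first. This is a small sketch with hypothetical names (`stamps`, `trimmed`). Passing `regex=True` explicitly is a precaution: from pandas 2.0 onwards, `Series.str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# String versions of datetime.time values; a time with no microseconds has no fractional part
stamps = pd.Series(['00:00:00.804000', '00:00:02.701000', '00:00:07'])

# Removes three trailing zeros, turning microseconds into milliseconds
trimmed = stamps.str.replace(r'000$', '', regex=True)

print(trimmed.tolist())  # ['00:00:00.804', '00:00:02.701', '00:00:07']
```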







In [12]:
combined_data = pd.concat(dfs_list)
combined_data = combined_data.reset_index(drop=True)
combined_data['Excel segment'] = combined_data['Start'].astype(str) + '-' + combined_data['End'].astype(str)
combined_data = combined_data.drop(['Start', 'End'], axis=1)
combined_data = combined_data[['Excel segment', 'Sub start', 'Sub end', 'PP', 'SP', 'MP', 'Sub text']]
combined_data = combined_data.rename(columns={'Sub start':'Subtitle start', 'Sub end':'Subtitle end', 'Sub text':'Subtitle text'})
combined_data = combined_data.fillna(0)

In [13]:
combined_data['PP'] = combined_data['PP'].astype(int)
combined_data['SP'] = combined_data['SP'].astype(int)
combined_data['MP'] = combined_data['MP'].astype(int)

In [14]:
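# Converts the timestamps to strings and trims the three trailing zeros (microseconds -> milliseconds)
# regex=True is required in pandas >= 2.0, where the pattern is otherwise treated as a literal string
combined_data['Subtitle start'] = combined_data['Subtitle start'].astype(str)
combined_data['Subtitle start'] = combined_data['Subtitle start'].str.replace(r'000$', '', regex=True)
combined_data['Subtitle end']   = combined_data['Subtitle end'].astype(str)
combined_data['Subtitle end']   = combined_data['Subtitle end'].str.replace(r'000$', '', regex=True)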

In [15]:
combined_data

Unnamed: 0,Excel segment,Subtitle start,Subtitle end,PP,SP,MP,Subtitle text
0,00:00:00-00:00:07,00:00:00.804,00:00:02.701,0,0,0,RICHARD: Each of you comes here today hopeful...
1,00:00:00-00:00:07,00:00:02.702,00:00:04.619,0,0,0,MIRANDA: I have five rules. Memorize them.
2,00:00:00-00:00:07,00:00:04.620,00:00:07.288,0,0,0,Can anybody name...
3,00:00:07-00:00:28,00:00:07.289,00:00:08.456,0,6,0,"MEREDITH: <i>So, you might be thinking...</i>"
4,00:00:07-00:00:28,00:00:08.457,00:00:10.291,0,6,0,"Rule number five... When I move, you move."
...,...,...,...,...,...,...,...
962,00:40:52-00:40:53,00:40:51.792,00:40:53.459,6,0,0,<i>Place them below the xiphoid process.
963,00:40:54-00:40:55,00:40:53.460,00:40:55.127,6,0,0,"♪ Who that, who that? I-G-G-Y ♪"
964,00:40:55-00:40:58,00:40:55.128,00:40:56.379,0,6,0,<i>Press firmly.
965,00:40:55-00:40:58,00:40:56.380,00:40:58.497,0,6,0,"♪ That do that, do that, I-G-G-Y ♪"


# Exporting to Excel

In [16]:
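def yes_or_no(question):
    # Asks for confirmation and exports combined_data only if the reply starts with 'y'
    reply = input(question + ' (y/n): ').lower().strip()
    if reply.startswith('y'): # startswith() does not fail on an empty reply, unlike reply[0]
        combined_data.to_excel('episode_12x01_with_subtitles.xlsx')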

In [17]:
yes_or_no('Export to Excel?') 

Export to Excel? (y/n): y
