In [1]:
import os
from datetime import datetime
import xml.etree.ElementTree as ET
import numpy as np

## P-tag timestamp matching to speaker ids for full videos

### Issues

1. The timestamps in the p-tags do not match the time at which the line is actually spoken in the video. The subtitle (p-tag) timestamps lag the speech by roughly 0.5-2.0 seconds.
2. In many p-tags, every diarization sample has a different speaker ID, so it is impossible to assign a single speaker to the line. For clarification: a four-second p-tag contains five diarization samples, because each sample is two seconds long and the sample window moves forward 0.5 seconds at a time.
3. The results look random no matter how we parametrise the speaker diarization (parameters: the range for the number of speakers we estimate to be in the video, and the number of times we refine the clustering). For example, the first five p-tags printed in the output of cell 4 contain seven different speaker IDs (1, 4, 5, 7, 8, 10 and 18), even though only two people should be talking (the beginning of video 175).
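
The window arithmetic in issue 2 and the count in issue 3 can be checked with a short sketch. It assumes that diarization window `i` starts at `i * 0.5` seconds, which is how cell 4 indexes `speaker_ids`; the helper name `window_indices` is hypothetical and not part of the notebook.

```python
WINDOW_SEC = 2.0  # length of one diarization sample (from issue 2)
HOP_SEC = 0.5     # step between consecutive samples (from issue 2)

def window_indices(begin_sec, end_sec, window=WINDOW_SEC, hop=HOP_SEC):
    """Indices of the windows lying fully inside [begin_sec, end_sec].

    Assumption: window i covers [i * hop, i * hop + window].
    """
    first = int(round(begin_sec / hop))
    last = int(round((end_sec - window) / hop))
    return list(range(first, last + 1))  # empty if the p-tag is shorter than one window

# A four-second p-tag contains five windows.
four_second_tag = window_indices(10.0, 14.0)
print(len(four_second_tag), four_second_tag)

# Distinct speaker IDs in the first five p-tags printed by cell 4.
first_five = [[18, 8, 5], [18], [8, 8, 1, 8, 4], [7, 18, 10], [18, 10, 4]]
distinct = sorted({label for row in first_five for label in row})
print(len(distinct), distinct)
```

A p-tag shorter than two seconds yields an empty list under this scheme, which is one possible reason for the empty arrays in the output of cell 4.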

In [2]:
def round_to_half_second(microsecond):
    # Convert microseconds to seconds (1e-6), then round to the nearest 0.5 s
    return round(microsecond * 1e-6 * 2) / 2
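
As a quick check of the rounding helper, the sketch below repeats the function, with the factor written as `1e-6` (the same value as `10e-7`), and evaluates it for a few microsecond values. Values that fall exactly halfway between two half-second steps are left out, because Python's `round` rounds ties to the nearest even integer and floating-point error can tip them either way.

```python
def round_to_half_second(microsecond):
    # microseconds -> seconds, then snap to the nearest multiple of 0.5
    return round(microsecond * 1e-6 * 2) / 2

# Print the rounded fraction of a second for a few sample values.
for us in (100_000, 400_000, 600_000, 900_000):
    print(us, round_to_half_second(us))
```

The result can be 1.0, in which case the fractional part carries over into the next full second when it is added to the whole seconds in cell 4.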

Any ampersands (&) and pound signs (£) in the XML file must be replaced with their XML escape sequences before the file can be parsed.

ampersand:   `&amp;`     (vim substitute: `%s/&/\&amp;/gc`)

pound sign:  `&#163;`    (vim substitute: `%s/£/\&#163;/gc`)
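
The same clean-up could also be done in Python instead of vim. The following is a minimal sketch, not the procedure used for this notebook: the helper name `escape_for_xml` and the sample string are hypothetical, and the regular expression leaves ampersands alone when they already start an entity such as `&amp;` or `&#163;`.

```python
import re
import xml.etree.ElementTree as ET

# Matches '&' that does not already start an entity like &amp; or &#163;
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_for_xml(text):
    """Escape bare ampersands and pound signs so the text can be parsed as XML."""
    text = BARE_AMPERSAND.sub("&amp;", text)
    return text.replace("£", "&#163;")

# Hypothetical sample line, not taken from the subtitle file.
sample = "<p>Fish & chips cost £5 &amp; up</p>"
cleaned = escape_for_xml(sample)
element = ET.fromstring(cleaned)
print(cleaned)
print(element.text)
```

Because existing entities are skipped, running the helper twice on the same text gives the same result as running it once.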

In [3]:
speaker_ids = np.load('./embeds/175_speaker_labels.npy')
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse('./data/175.xml', parser=parser)

namespace = "{http://www.w3.org/2006/10/ttaf1}"
root = tree.getroot()
ps = root.findall(f"./{namespace}body/{namespace}div/{namespace}p")

In [4]:
for p in ps:
    t_begin = datetime.strptime(p.get('begin'), "%H:%M:%S.%f")
    t_end = datetime.strptime(p.get('end'), "%H:%M:%S.%f")
    # Convert to half second accuracy
    sec_begin = t_begin.hour*3600.0 + t_begin.minute*60.0 + t_begin.second + round_to_half_second(t_begin.microsecond)
    sec_end = t_end.hour*3600.0 + t_end.minute*60.0 + t_end.second + round_to_half_second(t_end.microsecond)
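    # Compute indices into speaker_ids (one label per 2 s window, windows 0.5 s apart)
    ibeg = int(sec_begin*2.0)
    # Last window that still ends inside the p-tag
    iend = int((sec_end-2.0)*2.0)
    # For p-tags shorter than one window, start one window earlier
    if ibeg > iend:
        ibeg = ibeg - 1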
    print(speaker_ids[ibeg:iend+1])

[18  8  5]
[18]
[8 8 1 8 4]
[ 7 18 10]
[18 10  4]
[4]
[16  4  4]
[18  4]
[18]
[ 7 18 18  1  1  7]
[18  7]
[18 10  8  4 18]
[4]
[7]
[7 7]
[18]
[4 4]
[18]
[18 18 18 18 18  8  8  1  4]
[16 16  5  4]
[1]
[ 7  4  4  4  4  4 10]
[18 18]
[7]
[14  7  1 17  1 19]
[19  1]
[8 8 1 1 8 1]
[ 5 16 19  1  5]
[ 1 18  1  7  4]
[4 8]
[19  3 18  1  1]
[ 1 19]
[ 8 10  1]
[8]
[1 8 7 1 1]
[5]
[1]
[5]
[1]
[7 1 7]
[ 7  4 18  8]
[1 1]
[]
[]
[18  8  4  4  4]
[18]
[4 4]
[20  6  1]
[ 4 15  8  4  1 18]
[ 5 20  1  4]
[18  1  8]
[1]
[18  4 18]
[18  8  1]
[ 4 18]
[ 7  8 18  8 14  5]
[13]
[]
[ 7 18]
[16  1]
[4 4 4 4 4 4 5]
[1 1 4]
[ 4 18  4 18  4  4 16 16  4]
[16]
[16 16 16 16  4]
[1 4]
[16]
[16 16 19  1]
[ 8  7 18  8 16 16 16]
[ 5 19  4]
[1 4 4 4 4]
[8 4 4 4]
[ 4  8 10 10 10]
[14 14  6]
[17 17 17]
[17 20  3  2 17 17  2]
[19 14 19  3]
[14  6 20 17 20 20]
[17  2  2]
[ 7 19 15  1  7  7  2  2]
[17 17]
[ 7 20]
[14  1 17 14 14]
[17]
[17]
[19  1]
[16 13 16]
[ 7  4  4  4 15]
[2]
[11]
[16 16 16  5]
[16]
[ 5 16]
[20  4  4  8  4