## Generate corpus and ground-truth references of released videos

### Corpus file contents
0. train_data: captions and indices of the training videos, in the format [corpus_widxs, vidxs, corpus_pidxs], where:
    - corpus_widxs is a list of lists with the vocabulary index of each word in a caption
    - vidxs is a list with the index of each video's features in the features file
    - corpus_pidxs is a list of lists with the index of each POS tag in the POS-tagging vocabulary
1. val_data: same format as train_data.
2. test_data: same format as train_data.
3. vocabulary: in the format {'word': count}.
4. idx2word: the vocabulary in the format {idx: 'word'}.
5. word_embeddings: the word vectors. The i-th row is the vector of the i-th word in idx2word.
6. idx2pos: the POS-tagging vocabulary in the format {idx: 'POSTAG'}.
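
As a rough illustration of the layout listed above, the sketch below packs toy stand-ins for the seven entries into a list, pickles them, and unpacks them again. All values (`video_a`, the tiny vocabulary, the 4-dimensional zero vectors) are made up for the example, and an in-memory buffer is used in place of the real corpus file.

```python
import io
import pickle

# Toy stand-ins for the seven entries, in the order listed above (all values are invented).
train_data = [[[2, 3, 0]], ["video_a"], [[2, 3, 0]]]  # [corpus_widxs, vidxs, corpus_pidxs]
val_data = [[[3, 0]], ["video_b"], [[3, 0]]]
test_data = [None, ["video_c"], None]                  # no captions or POS tags for test videos
vocabulary = {"someone": 1, "sits": 1}
idx2word = {0: "<eos>", 1: "<unk>", 2: "someone", 3: "sits"}
word_embeddings = [[0.0] * 4 for _ in idx2word]        # one row per index in idx2word
idx2pos = {0: "eos", 1: "unk", 2: "NN", 3: "VBZ"}

# Round-trip through pickle using an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump([train_data, val_data, test_data, vocabulary,
             idx2word, word_embeddings, idx2pos], buffer)
buffer.seek(0)

train, val, test, vocab, i2w, embeddings, i2p = pickle.load(buffer)
corpus_widxs, vidxs, corpus_pidxs = train[:3]

# Map the first training caption back to words.
decoded = [i2w[i] for i in corpus_widxs[0]]
print(decoded)
```

Slicing with `train[:3]` keeps the unpacking valid even if extra items are stored after the three documented ones.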

### Generate split for training and validation

In [6]:
import pandas as pd
# data = pd.read_csv('../../../data/MPII-MD/annotations-original.csv', sep='\t', usecols=[0,1], names=['video-id', 'sentence'], engine='python')
data = pd.read_csv('../../../data/MPII-MD/annotations-someone.csv', sep='\t', usecols=[0,1], names=['video-id', 'sentence'], engine='python')
data

Unnamed: 0,video-id,sentence
0,0001_American_Beauty_00.00.51.926-00.00.54.129,Her mind wanders for a beat.
1,0001_American_Beauty_00.00.56.224-00.01.03.394,Someone looks at us and sits up.
2,0001_American_Beauty_00.01.14.635-00.01.36.380,"We are FLYING above suburban America, DESCENDI..."
3,0001_American_Beauty_00.01.37.227-00.01.38.586,We are looking down at a king-sized BED from O...
4,0001_American_Beauty_00.01.38.586-00.01.40.722,An irritating ALARM CLOCK RINGS.
...,...,...
68370,2054_Harry_Potter_and_the_prisoner_of_azkaban_...,Someone is gone.
68371,2054_Harry_Potter_and_the_prisoner_of_azkaban_...,Someone stands amid a circle of excited Gryffi...
68372,2054_Harry_Potter_and_the_prisoner_of_azkaban_...,"As Someone arrives, he glances at Someone, who..."
68373,2054_Harry_Potter_and_the_prisoner_of_azkaban_...,"The others turn, begin all speaking at once."


In [7]:
# list(data['sentence'])

train_vidxs, train_corpus = list(data['video-id']), list(data['sentence'])
# valid_vidxs, valid_corpus = zip(*[(int(d['id']), d['label']) for d in valid_data])
# test_vidxs = [(int(d['id'])) for d in test_data]
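
The cell above assigns every annotation to the training lists and leaves the validation and test lines commented out. One possible way to obtain a validation split, sketched here as an assumption rather than the split the authors used, is to hold out whole movies so that clips of one movie never appear on both sides. The helpers `movie_of` and `split_by_movie`, the `valid_fraction` and `seed` parameters, and the `0002_Toy_Movie` id are hypothetical.

```python
import random

def movie_of(video_id):
    # "0001_American_Beauty_00.00.51.926-00.00.54.129" -> "0001_American_Beauty"
    return video_id.rsplit("_", 1)[0]

def split_by_movie(video_ids, sentences, valid_fraction=0.1, seed=0):
    """Hold out a fraction of the movies (not of the clips) for validation."""
    movies = sorted({movie_of(v) for v in video_ids})
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(movies)
    n_valid = max(1, int(len(movies) * valid_fraction))
    held_out = set(movies[:n_valid])

    train, valid = [], []
    for vid, sent in zip(video_ids, sentences):
        target = valid if movie_of(vid) in held_out else train
        target.append((vid, sent))
    return train, valid

# Toy input: two ids taken from the table above plus one invented id.
ids = [
    "0001_American_Beauty_00.00.51.926-00.00.54.129",
    "0001_American_Beauty_00.00.56.224-00.01.03.394",
    "0002_Toy_Movie_00.00.01.000-00.00.02.000",  # hypothetical
]
sents = ["Her mind wanders for a beat.",
         "Someone looks at us and sits up.",
         "Someone is gone."]

train_pairs, valid_pairs = split_by_movie(ids, sents, valid_fraction=0.5)
train_movies = {movie_of(v) for v, _ in train_pairs}
valid_movies = {movie_of(v) for v, _ in valid_pairs}
```

Each list of pairs can be unzipped with `zip(*pairs)` into an id list and a caption list, matching the shape of `train_vidxs` and `train_corpus`.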

### Get pretrained embeddings

In [8]:
import os
import numpy as np

wordvectors = {}
# with open('./glove.42B.300d.txt') as f:
with open('./glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        s = line.strip().split(' ')
        # keep only well-formed lines: one token followed by 300 values
        if len(s) == 301:
            wordvectors[s[0]] = np.array(s[1:], dtype=float)
print(len(wordvectors))

400000


### Determine the vocabulary from train split

In [9]:
import nltk
nltk.download('punkt')

vocab, total_len = {}, 0
for cap in train_corpus:
    tokens = nltk.word_tokenize(cap.lower())
    total_len += len(tokens)
    for w in tokens:
        vocab[w] = vocab.get(w, 0) + 1

print('Avg. count of words per caption:', total_len/len(train_corpus))
print('Count of unique words: ', len(vocab))

# words without a pretrained GloVe vector are removed from the vocabulary
to_del = []
for w in vocab:
    if w not in wordvectors:
        to_del.append(w)
        print('missing word: {}'.format(w))

print('count of missing words: ', len(to_del))
        
for w in to_del:
    del vocab[w]
        
idx2word = {idx: word for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
word2idx = {word: idx for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
EOS, UNK = 0, 1

print(len(vocab), len(idx2word), len(word2idx))

word_embeddings = np.zeros((len(idx2word), 300))
for idx, word in idx2word.items():
    if idx == EOS:
        word_embeddings[idx] = wordvectors['eos']
    elif idx == UNK:
        word_embeddings[idx] = wordvectors['unk']
    else:
        word_embeddings[idx] = wordvectors[word]

[nltk_data] Downloading package punkt to /home/jeperez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Avg. count of words per caption: 11.049681901279708
Count of unique words:  20936
missing word: king-sized
missing word: fogged-up
missing word: well-put
missing word: color-coordinated
missing word: digicam
missing word: zen-like
missing word: much-younger
missing word: cheap-looking
missing word: well-rehearsed
missing word: spartanettes
missing word: cleaver-ish
missing word: d'eouvres
missing word: state-of-the-
missing word: cup-sized
missing word: zip-loc
missing word: post-sex
missing word: half-smoked
missing word: self-
missing word: glassy-eyed
missing word: fresh-cut
missing word: spinning-teacup
missing word: street-freak
missing word: de-humidifying
missing word: throw-up
missing word: slack-jawed
missing word: sweet-faced
missing word: o.s
missing word: half-laugh
missing word: almost-crowded
missing word: washclothes
missing word: tearstained
missing word: dress/bathrobe
missing word: half-whisper
missing word: begowned
missing word: salmon-colored
missing word: kepis
...

missing word: grimmauld
missing word: polyjuice
missing word: nervous-looking
missing word: up-key
missing word: long-boned
missing word: imagistically
missing word: roof-less
missing word: disapparate
missing word: frost-covered
missing word: crypt-like
missing word: wand-maker
missing word: delimunator
missing word: apparates
missing word: un-carpeted
missing word: proclaming
missing word: durmstrang
missing word: owlery
missing word: horntail
missing word: hogwards
missing word: dreagon
missing word: griffindor
missing word: beauxbatons
missing word: fur-trimmed
missing word: gillyweed
missing word: rocks-strewn
missing word: merperson
missing word: merpeople
missing word: grindylows
missing word: heaped-in
missing word: tadpole-like
missing word: pensieve
missing word: riving
missing word: black-robed
missing word: dementor
missing word: -eaten
missing word: crookshanks
missing word: sugar-pink
missing word: half-horse
missing word: thwacks
missing word: o.w.l
missing word: shoulde

### Determine POS-tagging vocabulary from train split

In [10]:
import nltk

pos_vocab = {}         # tag -> number of occurrences
pos_unique_words = {}  # tag -> {word: number of occurrences with that tag}
for cap in train_corpus:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(cap.lower())):
        pos_vocab[tag] = pos_vocab.get(tag, 0) + 1
        words = pos_unique_words.setdefault(tag, {})
        words[word] = words.get(word, 0) + 1

print('Unique words per tag:')
print('\n'.join([f' {k}:\t{len(words)}' for k, words in pos_unique_words.items()]))
            
idx2pos = {idx: tag for idx, tag in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
pos2idx = {tag: idx for idx, tag in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
EOS, UNK = 0, 1
print(len(idx2pos))

Unique words per tag:
 PRP$:	7
 NN:	9390
 NNS:	3628
 IN:	160
 DT:	26
 .:	3
 VBZ:	1846
 PRP:	27
 CC:	25
 RP:	29
 VBP:	1190
 VBG:	1703
 ,:	1
 RB:	1344
 JJ:	5475
 ::	4
 TO:	2
 VB:	1389
 POS:	7
 VBN:	1387
 VBD:	1093
 JJR:	82
 (:	1
 ):	1
 PDT:	11
 WP:	8
 MD:	19
 WRB:	8
 WDT:	7
 RBR:	36
 CD:	164
 JJS:	45
 RBS:	3
 EX:	1
 FW:	21
 WP$:	1
 NNP:	19
 '':	2
 UH:	4
 SYM:	6
 ``:	1
 $:	1
44


### Determine Universal POS-tagging from train split

In [11]:
import nltk
nltk.download('universal_tagset')

upos_vocab = {}         # universal tag -> number of occurrences
upos_unique_words = {}  # universal tag -> {word: number of occurrences with that tag}
for cap in train_corpus:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(cap.lower()), tagset='universal'):
        upos_vocab[tag] = upos_vocab.get(tag, 0) + 1
        words = upos_unique_words.setdefault(tag, {})
        words[word] = words.get(word, 0) + 1

print('Unique words per universal tag:')
print('\n'.join([f' {k}:\t{len(words)}' for k, words in upos_unique_words.items()]))
            
idx2upos = {idx: tag for idx, tag in enumerate(['eos', 'unk'] + list(upos_vocab.keys()))}
upos2idx = {tag: idx for idx, tag in enumerate(['eos', 'unk'] + list(upos_vocab.keys()))}
print(len(idx2upos))

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/jeperez/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Unique words per universal tag:
 PRON:	42
 NOUN:	12781
 ADP:	160
 DET:	41
 .:	14
 VERB:	6842
 CONJ:	25
 PRT:	38
 ADV:	1377
 ADJ:	5586
 NUM:	164
 X:	31
14


### Generate ground-truth references files

In [20]:
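# valid_vidxs and valid_corpus must be defined before running this cell
# (their assignment is commented out in the split cell above)
with open('../results/20B-SS-v2_val_references.txt', 'w') as f:
    for vidx, cap in zip(valid_vidxs, valid_corpus):
        f.write('{}\t{}\n'.format(vidx, cap.lower()))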

### Generate corpus.pkl file

In [21]:
import pickle

train_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in train_corpus]
valid_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in valid_corpus]

train_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in train_corpus]
valid_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in valid_corpus]

assert len(train_corpus_widxs) == len(train_vidxs) == len(train_corpus_pidxs) == len(train_corpus)
assert len(valid_corpus_widxs) == len(valid_vidxs) == len(valid_corpus_pidxs) == len(valid_corpus)

train_data = [train_corpus_widxs, train_vidxs, train_corpus_pidxs, train_corpus]
valid_data = [valid_corpus_widxs, valid_vidxs, valid_corpus_pidxs, valid_corpus]
test_data = [None, test_vidxs, None]

with open('../../../data/Something-Something-v2/20b-ss-v2_corpus_pos.pkl', 'wb') as outfile:
    pickle.dump([train_data, valid_data, test_data, vocab, idx2word, word_embeddings, idx2pos], outfile)
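
The word-index encoding used for `corpus_widxs` above can be illustrated on a toy vocabulary: known words map to their index, unknown words fall back to `UNK`, and `EOS` closes every caption. This is a minimal sketch. The four-word vocabulary and the `encode` and `decode` helpers are invented for the example, and `str.split` stands in for `nltk.word_tokenize` so the snippet has no external dependencies.

```python
EOS, UNK = 0, 1

# Toy vocabulary (invented); indices 0 and 1 are reserved for the special tokens.
vocab_words = ["someone", "looks", "at", "us"]
idx2word = dict(enumerate(["<eos>", "<unk>"] + vocab_words))
word2idx = {word: idx for idx, word in idx2word.items()}

def encode(caption, tokenize=str.split):
    """Lowercase, tokenize, map to indices (UNK for unknown words), append EOS."""
    return [word2idx.get(w, UNK) for w in tokenize(caption.lower())] + [EOS]

def decode(idxs):
    """Map indices back to their tokens, including the special ones."""
    return [idx2word[i] for i in idxs]

encoded = encode("Someone looks at digicam")
decoded = decode(encoded)
print(encoded)  # [2, 3, 4, 1, 0]
print(decoded)  # ['someone', 'looks', 'at', '<unk>', '<eos>']
```

Because out-of-vocabulary words collapse to `<unk>`, decoding does not recover the original caption exactly, which is presumably why the raw captions are kept next to the index lists in `train_data`.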