# Build vocabulary for all tests

Our main code builds the vocab on the fly, which is a problem because the vocab then depends on whichever captions file happens to be loaded.

If we initialise the COCO object with the full train2017 file, the vocab is built from the first 5 captions of every image.

If we initialise it with a subset of the train2017 file, the vocab is built from the first 5 captions of only that subset of images, so it is smaller.

We want to set up the vocab independent of the size and selection of train / val datasets.

Here we build the vocab once and save it as JSON files, which we can then load into our model for training.
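
To make the problem concrete, here is a minimal sketch on toy captions. `build_toy_vocab` is a hypothetical stand-in, not the `Vocabulary` class imported below: it assumes lower-casing, whitespace tokenisation, and that a word is kept when it appears at least `freq_threshold` times. The real class may differ on all three points.

```python
from collections import Counter

def build_toy_vocab(captions, freq_threshold):
    """Hypothetical stand-in for a frequency-threshold vocab builder:
    keep lower-cased, whitespace-split words seen at least freq_threshold times."""
    counts = Counter(word for c in captions for word in c.lower().split())
    return sorted(w for w, n in counts.items() if n >= freq_threshold)

all_captions = [
    "a dog runs on the beach",
    "a dog sits on a sofa",
    "a cat sits on the sofa",
    "a cat runs",
]
subset = all_captions[:2]  # pretend only part of the dataset was loaded

full_vocab = build_toy_vocab(all_captions, freq_threshold=2)
subset_vocab = build_toy_vocab(subset, freq_threshold=2)

print(len(full_vocab), full_vocab)
print(len(subset_vocab), subset_vocab)
```

With the same threshold, the subset produces a strictly smaller vocab, so word indices learned on one selection of images would not line up with a vocab built from another.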

In [1]:
from Vocabulary import Vocabulary
import json
from pathlib import Path

In [19]:
# for Vocabulary object:
FREQ_THRESHOLD = 2
SEQUENCE_LENGTH = 60

# choose captions file to build vocab. Default is that we use the full training dataset
CAPTIONS_FILE = 'captions_train2017.json'
# directories for the COCO annotations and for the saved vocab files
anns_path = Path('Datasets/coco/annotations/')
vocab_path = Path('vocabulary/')

with open(anns_path/CAPTIONS_FILE, 'r') as f:
    annotations = json.load(f)
    
print(f"There are {len(annotations['annotations'])} captions in the data set")

There are 591753 captions in the data set


In [33]:
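vocab = Vocabulary(FREQ_THRESHOLD, SEQUENCE_LENGTH)
captions = []
# word_counts[n] = number of captions that contain exactly n words
word_counts = [0] * SEQUENCE_LENGTH
for d in annotations['annotations']:
    captions.append(d['caption'])
    n_words = len(d['caption'].split())
    if n_words >= SEQUENCE_LENGTH:
        # too long for the histogram: print it for inspection
        print(d['caption'])
    else:
        word_counts[n_words] += 1
vocab.build_vocabulary(captions)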

print("With FREQ_THRESHOLD = {}, vocab size is {}"
      .format(FREQ_THRESHOLD, len(vocab.idx_to_string)))

With FREQ_THRESHOLD = 2, vocab size is 16232


In [43]:
from matplotlib import pyplot as plt


# word_counts[n] counts captions with n words, so x and height must use the same indices
plt.bar(x=list(range(1, 30 + 1)), height=word_counts[1:30 + 1])

In [21]:
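# create the output directory if it does not exist yet
vocab_path.mkdir(parents=True, exist_ok=True)
with open(vocab_path/'idx_to_string.json', 'w') as f:
    json.dump(vocab.idx_to_string, f)
with open(vocab_path/'string_to_index.json', 'w') as f:
    json.dump(vocab.string_to_index, f)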
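
One thing to watch when loading these files back: JSON object keys are always strings. If `idx_to_string` is a dict keyed by integers (an assumption about the `Vocabulary` class, which is not shown here), the keys come back as `"0"`, `"1"`, ... and need converting. The sketch below uses a toy mapping and a temporary directory rather than the real vocab files.

```python
import json
import tempfile
from pathlib import Path

# Toy mapping standing in for vocab.idx_to_string (assumed to have int keys).
idx_to_string = {0: "a", 1: "dog", 2: "cat"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "idx_to_string.json"
    with open(path, "w") as f:
        json.dump(idx_to_string, f)
    with open(path, "r") as f:
        raw = json.load(f)

# JSON turned the integer keys into strings, so convert them back.
restored = {int(k): v for k, v in raw.items()}

print(list(raw.keys()))
print(restored == idx_to_string)
```

`string_to_index` does not need this step, since its keys are already strings and its integer values survive the round trip unchanged.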

### How many captions do the images have anyway?

Almost all have 5. A few have 6 or 7.

In [8]:
from collections import Counter
im_ids = []
for d in annotations['annotations']:
    im_ids.append(d['image_id'])
capt_counts = Counter(im_ids)
capt_counts = Counter(list(capt_counts.values()))
capt_counts

Counter({5: 117972, 6: 312, 7: 3})
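
Since a few images have more than 5 captions, one option is to cap the number of captions per image before building the vocab, so that every image contributes equally. The sketch below is one possible way to do that; `limit_captions` and `max_per_image` are hypothetical names, and the toy list only mimics the `image_id` and `caption` fields used above.

```python
from collections import defaultdict

def limit_captions(annotations, max_per_image=5):
    """Keep at most max_per_image captions per image_id, in file order."""
    seen = defaultdict(int)
    kept = []
    for d in annotations:
        if seen[d["image_id"]] < max_per_image:
            kept.append(d["caption"])
            seen[d["image_id"]] += 1
    return kept

# Toy data: image 1 has 7 captions, image 2 has 5.
toy_annotations = (
    [{"image_id": 1, "caption": f"caption {i}"} for i in range(7)]
    + [{"image_id": 2, "caption": f"caption {i}"} for i in range(5)]
)

capped = limit_captions(toy_annotations, max_per_image=5)
print(len(capped))
```

On the real data this would be called as `limit_captions(annotations['annotations'])`, and the returned list passed to `vocab.build_vocabulary`.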

### What is the vocab size for different frequency thresholds?


In [14]:
# the captions do not depend on the threshold, so collect them once
captions = [d['caption'] for d in annotations['annotations']]
for freq_threshold in range(1, 7):
    vocab = Vocabulary(freq_threshold, SEQUENCE_LENGTH)
    vocab.build_vocabulary(captions)
    print("With FREQ_THRESHOLD = {}, vocab size is {}"
          .format(freq_threshold, len(vocab.idx_to_string)))

With FREQ_THRESHOLD = 1, vocab size is 26852
With FREQ_THRESHOLD = 2, vocab size is 16232
With FREQ_THRESHOLD = 3, vocab size is 13139
With FREQ_THRESHOLD = 4, vocab size is 11360
With FREQ_THRESHOLD = 5, vocab size is 10192
With FREQ_THRESHOLD = 6, vocab size is 9338
