# Training Notebook

All our training is done using 10k / 2k / 2k sized data for train / val / test.

This notebook narrows the images to only those containing sports categories, and we train only on these. We then randomly sample them to get 10k / 2k / 2k. This gives us a slightly smaller vocab, but the same size dataset.

This is an alternative version of the v2 sports training. We reduce the embed size from 1024 to 512, and increase the size of the training set (`max_train=15000` in the cells below) and of the vocab. Finally, because the training set is ~10% of the full train2017 training set, we reduce the frequency threshold parameter for the vocabulary generation to 4. This increases the vocab size from 2576 with threshold = 5 to 2921 with threshold = 4.
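
The effect of the frequency threshold can be sketched as follows. This is a minimal illustration, not the project's `build_vocab`: the function name `build_vocab_sketch`, the whitespace tokenisation, the special tokens other than `<PAD>`, and the rule of keeping words whose count is at least the threshold are all assumptions.

```python
from collections import Counter

# Hypothetical special tokens; only <PAD> is confirmed by this notebook.
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def build_vocab_sketch(captions, freq_threshold):
    """Map every sufficiently frequent word to an integer index."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    word2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for word, n in sorted(counts.items()):
        # Assumed rule: keep a word if it occurs at least freq_threshold times.
        if n >= freq_threshold:
            word2idx[word] = len(word2idx)
    return word2idx


captions = ["a man riding a wave", "a man on a surfboard", "a dog on a beach"]
print(len(build_vocab_sketch(captions, freq_threshold=2)))
print(len(build_vocab_sketch(captions, freq_threshold=1)))
```

Lowering the threshold can only add words, which is why the vocab grows when the threshold drops from 5 to 4.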

## Import libraries

In [1]:
from get_loader import get_loader
from models import Encoder, Decoder
import torch
import torch.nn as nn
import torch.optim as optim
from utils import *
from data_prep_utils import *
from pathlib import Path
import json

## Load train and validation loaders

In [None]:
#image_path = '../../CW/Data/train2017'
#captions_path = '../../CW/Data/annotations_trainval2017/annotations/captions_train2017.json'
IMAGE_PATH = '../Datasets/coco/images/train2017'
CAPTIONS_PATH = '../Datasets/coco/annotations/' #captions_train2017.json'
FREQ_THRESHOLD = 4
CAPS_PER_IMAGE = 5
BATCH_SIZE = 128
SHUFFLE = True

# root of the name to save or load captions files
CAPTIONS_NAME = 'sports_v2'
SUPER_CATEGORIES = ['sports'] # should be list of eligible coco super categories, or None to include all images

# for encoder and decoder
EMBED_SIZE = 512  # dimension of vocab embedding vector
HIDDEN_SIZE = 512
NUM_LAYERS = 3  # hidden layers in LSTM

# training parameters
PRINT_EVERY = 100
TOTAL_EPOCH = 50
CHECKPOINT = '../model/model_sport_v3' # there is no v1 for sports:
# v2 is consistent with previous tests as v2 parameters are shared across data sets

In [4]:
# build the train / val / test caption files, optionally restricting the super
# categories of images or reducing the size of the data
# this will write files to 'Datasets/coco/annotations' as 
#     [save_name]_captions_train.json
#     [save_name]_captions_val.json
#     [save_name]_captions_test.json

prepare_datasets(train_percent = 0.87, super_categories=['sports'],
                 max_train=15000, max_val=2000, max_test=2000,
                 save_name=CAPTIONS_NAME, random_seed=42)

# we explicitly build the vocab here. We use frequency threshold, and we build
# vocab from the specified captions file: we're using the training data
# we save the vocab to a name consistent with our training captions data so that 
# we can load a vocab consistent with the specific training run we've used.
build_vocab(freq_threshold = FREQ_THRESHOLD, 
            captions_file=f'{CAPTIONS_NAME}_captions_train.json',
            vocab_save_name=CAPTIONS_NAME)

train dataset has 15000 images
 val dataset has 2000 images
 test dataset has 938 images
There are 75039 captions in the data set
With FREQ_THRESHOLD = 4, vocab size is 2921


In [7]:
with open(f'../vocabulary/{CAPTIONS_NAME}word2idx.json', 'r') as f:
    word2idx = json.load(f)
vocab_size = len(word2idx)

In [8]:
train_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_train.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 5,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'train',
    # 'idx2word': None,
    'word2idx': word2idx
}

train_loader, train_dataset = get_loader(**train_loader_params)

val_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_val.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 3,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'validation',
    # 'idx2word': train_dataset.vocab.idx2word,
    'word2idx': word2idx
}

val_loader, val_dataset = get_loader(**val_loader_params)

print(f"Length of training dataloader: {len(train_loader)}, Length of validation dataloader: {len(val_loader)}")
print(f"Length of vocabulary: {len(train_dataset.vocab.idx2word)}")

Length of training dataloader: 586, Length of validation dataloader: 47
Length of vocabulary: 2921
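
The dataloader lengths can be checked by hand. The sketch below assumes that every image contributes `caps_per_image` (image, caption) samples and that the last, partially filled batch is kept; `num_batches` is a hypothetical helper, not part of `get_loader`.

```python
import math


def num_batches(num_images, caps_per_image, batch_size):
    """Batches per epoch when each (image, caption) pair is one sample."""
    return math.ceil(num_images * caps_per_image / batch_size)


print(num_batches(15000, 5, 128))  # training: 15000 images, 5 captions each
print(num_batches(2000, 3, 128))   # validation: 2000 images, 3 captions each
```

Under these assumptions the two values agree with the 586 and 47 reported for this run.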


## Load the model

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"We are using {device}.")

We are using cuda.


In [None]:
encoder = Encoder(embed_size=EMBED_SIZE, pretrained=True)
decoder = Decoder(embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE, vocab_size=vocab_size, num_layers=NUM_LAYERS)

In [None]:
# cross-entropy loss; <PAD> targets are ignored so that padding does not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=train_dataset.vocab.word2idx["<PAD>"]).to(device)

# optimise all decoder parameters, but only the embed layer of the encoder
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Adam optimizer
opt_pars = {'lr':1e-3, 'weight_decay':1e-3, 'betas':(0.9, 0.999), 'eps':1e-08}
optimizer = optim.Adam(params, **opt_pars)
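
What `ignore_index` does can be illustrated without PyTorch. The sketch below is plain Python with made-up logits and assumes `<PAD>` has index 0 (the real index comes from `word2idx`). It averages the per-token negative log-likelihood over the non-padding targets only, which is how `nn.CrossEntropyLoss` behaves with `ignore_index` and the default mean reduction.

```python
import math

PAD_IDX = 0  # assumed for illustration only


def masked_cross_entropy(logits, targets, ignore_index=PAD_IDX):
    """Mean cross-entropy over targets that are not ignore_index."""
    total, kept = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding position: contributes nothing
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]
        kept += 1
    # With no non-padding targets there is nothing to average.
    return total / kept if kept else float("nan")


logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0], [5.0, 0.0, 0.0]]
targets = [1, 2, PAD_IDX]  # the last position is padding
print(masked_cross_entropy(logits, targets))
```

Changing the logits of the padded position leaves the result unchanged, so padding length does not affect the reported loss.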

In [None]:
model_params = {
    'save_path': CHECKPOINT,
    'batch_size': BATCH_SIZE,
    'embed_size': EMBED_SIZE,
    'hidden_size': HIDDEN_SIZE,
    'num_layers': NUM_LAYERS,
    'vocab_size': len(train_dataset.vocab.idx2word)
}

save_params(**model_params)

## Training

In [None]:
train_params = {
    'encoder': encoder,
    'decoder': decoder,
    'criterion': criterion,
    'optimizer': optimizer,
    'train_loader': train_loader,
    'val_loader': val_loader,
    'total_epoch': TOTAL_EPOCH,
    'device': device,
    'checkpoint_path': CHECKPOINT,
    'print_every': PRINT_EVERY,
    'load_checkpoint': False
}

training_loss, validation_loss = train(**train_params) 

Epoch: [0/50]          || Step: [0/586]         || Average Training Loss: 7.9903
Epoch: [0/50]          || Step: [100/586]       || Average Training Loss: 4.6392
Epoch: [0/50]          || Step: [200/586]       || Average Training Loss: 4.1221
Epoch: [0/50]          || Step: [300/586]       || Average Training Loss: 3.8262
Epoch: [0/50]          || Step: [400/586]       || Average Training Loss: 3.6221
Epoch: [0/50]          || Step: [500/586]       || Average Training Loss: 3.4742
Epoch: [0/50]          || Step: [0/47]          || Average Validation Loss: 2.6269
****************************************************************************************************
Epoch: [0/50] || Training Loss = 3.37 || Validation Loss: 2.67 || Time: 20.881170
****************************************************************************************************
Epoch: [1/50]          || Step: [0/586]         || Average Training Loss: 2.9051
Epoch: [1/50]          || Step: [100/586]       || Average Trainin
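
A quick sanity check on the log above: a model that assigns equal probability to every word in the vocab has a cross-entropy of ln(V). The sketch below computes this for the vocab size of this run; `uniform_guess_loss` is a hypothetical helper introduced for illustration.

```python
import math


def uniform_guess_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over the vocab."""
    return math.log(vocab_size)


print(f"{uniform_guess_loss(2921):.4f}")
```

This gives about 7.98, close to the step-0 training loss of 7.9903, which suggests the untrained decoder starts out roughly uniform over the vocab.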

## Modify the parameters

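Use the entire train2017 captions file (`captions_train2017.json`) to build the vocab, instead of the sports training captions. The frequency threshold goes back to 5 and the batch size is halved to 64.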

In [4]:
#image_path = '../../CW/Data/train2017'
#captions_path = '../../CW/Data/annotations_trainval2017/annotations/captions_train2017.json'
IMAGE_PATH = '../Datasets/coco/images/train2017'
CAPTIONS_PATH = '../Datasets/coco/annotations/' #captions_train2017.json'
FREQ_THRESHOLD = 5
CAPS_PER_IMAGE = 5
BATCH_SIZE = 64
SHUFFLE = True

# root of the name to save or load captions files
CAPTIONS_NAME = 'sports_v4'
SUPER_CATEGORIES = ['sports'] # should be list of eligible coco super categories, or None to include all images

# for encoder and decoder
EMBED_SIZE = 512  # dimension of vocab embedding vector
HIDDEN_SIZE = 512
NUM_LAYERS = 3  # hidden layers in LSTM

# training parameters
PRINT_EVERY = 100
TOTAL_EPOCH = 20
CHECKPOINT = '../model/model_sport_v4' 

In [5]:
# build the train / val / test caption files, optionally restricting the super
# categories of images or reducing the size of the data
# this will write files to 'Datasets/coco/annotations' as 
#     [save_name]_captions_train.json
#     [save_name]_captions_val.json
#     [save_name]_captions_test.json

prepare_datasets(train_percent = 0.87, super_categories=['sports'],
                 max_train=15000, max_val=2000, max_test=2000,
                 save_name=CAPTIONS_NAME, random_seed=42)

# we explicitly build the vocab here. We use frequency threshold, but this time we
# build the vocab from the full captions_train2017.json rather than from the sports
# training captions. We still save the vocab under CAPTIONS_NAME so that we can
# load a vocab consistent with the specific training run we've used.
build_vocab(freq_threshold = FREQ_THRESHOLD, 
            captions_file='captions_train2017.json',
            vocab_save_name=CAPTIONS_NAME)

train dataset has 15000 images
 val dataset has 2000 images
 test dataset has 938 images
There are 591753 captions in the data set
With FREQ_THRESHOLD = 5, vocab size is 10192


In [6]:
with open(f'../vocabulary/{CAPTIONS_NAME}word2idx.json', 'r') as f:
    word2idx = json.load(f)
vocab_size = len(word2idx)

In [7]:
train_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_train.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 5,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'train',
    # 'idx2word': None,
    'word2idx': word2idx
}

train_loader, train_dataset = get_loader(**train_loader_params)

val_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_val.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 3,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'validation',
    # 'idx2word': train_dataset.vocab.idx2word,
    'word2idx': word2idx
}

val_loader, val_dataset = get_loader(**val_loader_params)

print(f"Length of training dataloader: {len(train_loader)}, Length of validation dataloader: {len(val_loader)}")
print(f"Length of vocabulary: {len(train_dataset.vocab.idx2word)}")

Length of training dataloader: 1172, Length of validation dataloader: 94
Length of vocabulary: 10192
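
The larger vocab also makes the model larger. The sketch below is a rough estimate that assumes the decoder has one embedding table of shape (vocab, embed) and one output linear layer of shape (hidden, vocab) with a bias. The internals of `Decoder` are not shown in this notebook, so this is an assumption, and `vocab_dependent_params` is a hypothetical helper.

```python
def vocab_dependent_params(vocab_size, embed_size=512, hidden_size=512):
    """Parameters that scale with the vocab under the assumed decoder layout."""
    embedding = vocab_size * embed_size                    # one vector per word
    output_layer = hidden_size * vocab_size + vocab_size   # weights + bias
    return embedding + output_layer


small = vocab_dependent_params(2921)    # sports-only vocab, threshold 4
large = vocab_dependent_params(10192)   # full train2017 vocab, threshold 5
print(small, large, round(large / small, 2))
```

Under these assumptions the vocab-dependent part of the decoder grows by roughly 3.5x, in line with the growth of the vocab itself.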


## Load the model

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"We are using {device}.")

We are using cuda.


In [9]:
encoder = Encoder(embed_size=EMBED_SIZE, pretrained=True)
decoder = Decoder(embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE, vocab_size=vocab_size, num_layers=NUM_LAYERS)

In [10]:
# cross-entropy loss; <PAD> targets are ignored so that padding does not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=train_dataset.vocab.word2idx["<PAD>"]).to(device)

# optimise all decoder parameters, but only the embed layer of the encoder
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Adam optimizer
opt_pars = {'lr':1e-3, 'weight_decay':1e-3, 'betas':(0.9, 0.999), 'eps':1e-08}
optimizer = optim.Adam(params, **opt_pars)

In [11]:
model_params = {
    'save_path': CHECKPOINT,
    'batch_size': BATCH_SIZE,
    'embed_size': EMBED_SIZE,
    'hidden_size': HIDDEN_SIZE,
    'num_layers': NUM_LAYERS,
    'vocab_size': len(train_dataset.vocab.idx2word)
}

save_params(**model_params)

## Training

In [12]:
train_params = {
    'encoder': encoder,
    'decoder': decoder,
    'criterion': criterion,
    'optimizer': optimizer,
    'train_loader': train_loader,
    'val_loader': val_loader,
    'total_epoch': TOTAL_EPOCH,
    'device': device,
    'checkpoint_path': CHECKPOINT,
    'print_every': PRINT_EVERY,
    'load_checkpoint': False
}

training_loss, validation_loss = train(**train_params) 

Epoch: [0/20]          || Step: [0/1172]        || Average Training Loss: 9.2194
Epoch: [0/20]          || Step: [100/1172]      || Average Training Loss: 4.9198
Epoch: [0/20]          || Step: [200/1172]      || Average Training Loss: 4.4146
Epoch: [0/20]          || Step: [300/1172]      || Average Training Loss: 4.1013
Epoch: [0/20]          || Step: [400/1172]      || Average Training Loss: 3.8984
Epoch: [0/20]          || Step: [500/1172]      || Average Training Loss: 3.7452
Epoch: [0/20]          || Step: [600/1172]      || Average Training Loss: 3.6234
Epoch: [0/20]          || Step: [700/1172]      || Average Training Loss: 3.5237
Epoch: [0/20]          || Step: [800/1172]      || Average Training Loss: 3.4392
Epoch: [0/20]          || Step: [900/1172]      || Average Training Loss: 3.3676
Epoch: [0/20]          || Step: [1000/1172]     || Average Training Loss: 3.3101
Epoch: [0/20]          || Step: [1100/1172]     || Average Training Loss: 3.2595
Epoch: [0/20]          || St

## Modify the parameters

Use the entire train2017 captions file (`captions_train2017.json`) to build the vocab.

And use entire dataset to train. Why not? What the hell!

In [2]:
#image_path = '../../CW/Data/train2017'
#captions_path = '../../CW/Data/annotations_trainval2017/annotations/captions_train2017.json'
IMAGE_PATH = '../Datasets/coco/images/train2017'
CAPTIONS_PATH = '../Datasets/coco/annotations/' #captions_train2017.json'
FREQ_THRESHOLD = 5
CAPS_PER_IMAGE = 5
BATCH_SIZE = 64
SHUFFLE = True

# root of the name to save or load captions files
CAPTIONS_NAME = 'everything'
SUPER_CATEGORIES = None # should be list of eligible coco super categories, or None to include all images

# for encoder and decoder
EMBED_SIZE = 512  # dimension of vocab embedding vector
HIDDEN_SIZE = 512
NUM_LAYERS = 3  # hidden layers in LSTM

# training parameters
PRINT_EVERY = 100
TOTAL_EPOCH = 20
CHECKPOINT = '../model/model_everything' 

In [3]:
# build the train / val / test caption files, optionally restricting the super
# categories of images or reducing the size of the data
# this will write files to 'Datasets/coco/annotations' as 
#     [save_name]_captions_train.json
#     [save_name]_captions_val.json
#     [save_name]_captions_test.json

prepare_datasets(train_percent = 0.87, super_categories=SUPER_CATEGORIES,
                 max_train=150000, max_val=50000, max_test=50000,
                 save_name=CAPTIONS_NAME, random_seed=42)

# we explicitly build the vocab here. We use frequency threshold, and we build
# the vocab from the full captions_train2017.json.
# We save the vocab under CAPTIONS_NAME so that we can load a vocab consistent
# with the specific training run we've used.
build_vocab(freq_threshold = FREQ_THRESHOLD, 
            captions_file='captions_train2017.json',
            vocab_save_name=CAPTIONS_NAME)

train dataset has 102021 images
 val dataset has 15245 images
 test dataset has 4952 images
There are 591753 captions in the data set
With FREQ_THRESHOLD = 5, vocab size is 10192


In [4]:
with open(f'../vocabulary/{CAPTIONS_NAME}word2idx.json', 'r') as f:
    word2idx = json.load(f)
vocab_size = len(word2idx)

In [5]:
train_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_train.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 5,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'train',
    # 'idx2word': None,
    'word2idx': word2idx
}

train_loader, train_dataset = get_loader(**train_loader_params)

val_loader_params = {
    'images_path': IMAGE_PATH,
    'captions_path': CAPTIONS_PATH + f'{CAPTIONS_NAME}_captions_val.json',
    'freq_threshold': FREQ_THRESHOLD,
    'caps_per_image': 3,
    'batch_size': BATCH_SIZE,
    'shuffle': SHUFFLE,
    'mode': 'validation',
    # 'idx2word': train_dataset.vocab.idx2word,
    'word2idx': word2idx
}

val_loader, val_dataset = get_loader(**val_loader_params)

print(f"Length of training dataloader: {len(train_loader)}, Length of validation dataloader: {len(val_loader)}")
print(f"Length of vocabulary: {len(train_dataset.vocab.idx2word)}")

Length of training dataloader: 7971, Length of validation dataloader: 715
Length of vocabulary: 10192


## Load the model

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"We are using {device}.")

We are using cuda.


In [7]:
encoder = Encoder(embed_size=EMBED_SIZE, pretrained=True)
decoder = Decoder(embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE, vocab_size=vocab_size, num_layers=NUM_LAYERS)

In [8]:
# cross-entropy loss; <PAD> targets are ignored so that padding does not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=train_dataset.vocab.word2idx["<PAD>"]).to(device)

# optimise all decoder parameters, but only the embed layer of the encoder
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Adam optimizer
opt_pars = {'lr':1e-3, 'weight_decay':1e-3, 'betas':(0.9, 0.999), 'eps':1e-08}
optimizer = optim.Adam(params, **opt_pars)

In [9]:
model_params = {
    'save_path': CHECKPOINT,
    'batch_size': BATCH_SIZE,
    'embed_size': EMBED_SIZE,
    'hidden_size': HIDDEN_SIZE,
    'num_layers': NUM_LAYERS,
    'vocab_size': len(train_dataset.vocab.idx2word)
}

save_params(**model_params)

## Training

In [None]:
train_params = {
    'encoder': encoder,
    'decoder': decoder,
    'criterion': criterion,
    'optimizer': optimizer,
    'train_loader': train_loader,
    'val_loader': val_loader,
    'total_epoch': TOTAL_EPOCH,
    'device': device,
    'checkpoint_path': CHECKPOINT,
    'print_every': PRINT_EVERY,
    'load_checkpoint': False
}

training_loss, validation_loss = train(**train_params) 

Epoch: [0/20]          || Step: [0/7971]        || Average Training Loss: 9.2285
Epoch: [0/20]          || Step: [100/7971]      || Average Training Loss: 5.4096
Epoch: [0/20]          || Step: [200/7971]      || Average Training Loss: 4.9976
Epoch: [0/20]          || Step: [300/7971]      || Average Training Loss: 4.7313
Epoch: [0/20]          || Step: [400/7971]      || Average Training Loss: 4.5499
Epoch: [0/20]          || Step: [500/7971]      || Average Training Loss: 4.4200
Epoch: [0/20]          || Step: [600/7971]      || Average Training Loss: 4.3241
Epoch: [0/20]          || Step: [700/7971]      || Average Training Loss: 4.2454
Epoch: [0/20]          || Step: [800/7971]      || Average Training Loss: 4.1795
Epoch: [0/20]          || Step: [900/7971]      || Average Training Loss: 4.1217
Epoch: [0/20]          || Step: [1000/7971]     || Average Training Loss: 4.0718
Epoch: [0/20]          || Step: [1100/7971]     || Average Training Loss: 4.0272
Epoch: [0/20]          || St