@authors
* Arseniy Ashuha, you can text me ```ars.ashuha@gmail.com```,
* Based on https://github.com/ebenolson/pydata2015

<h1 align="center"> Part II: Attention mechanism @ Image Captioning </h1> 

<img src="https://s2.postimg.org/pq18f5t7t/deepbb.png" width=480>

In this seminar you'll go through the image captioning pipeline.

To begin with, download the dataset of image features from a pre-trained GoogleNet (see instructions in chat).

### Data preprocessing

In [3]:
# Load dataset
import numpy as np

captions = np.load("./data/train-data-captions.npy")
img_codes = np.load("./data/train-data-svdfeatures.npy").astype('float32')

In [4]:
print ("each image code is a 6x6 feature matrix from GoogleNet:", img_codes.shape)
print (img_codes[0,:10,0,0])
print ('\n\n')
print ("for each image there are 5-7 descriptions, e.g.:\n")
print ('\n'.join(captions[0]))

each image code is a 6x6 feature matrix from GoogleNet: (82783, 128, 6, 6)
[-19.53911972   3.23637891   2.08816719   0.66493636  -2.9185071
   1.82758021  -1.3254329   -0.53197509  -3.15473676  -1.8739953 ]



for each image there are 5-7 descriptions, e.g.:

People shopping in an open market for vegetables.
An open market full of people and piles of vegetables.
People are shopping at an open air produce market.
Large piles of carrots and potatoes at a crowded outdoor market.
People shop for vegetables like carrots and potatoes at an open air market.


In [None]:
#split descriptions into tokens
for img_i in range(len(captions)):
    for caption_i in range(len(captions[img_i])):
        sentence = captions[img_i][caption_i] 
        captions[img_i][caption_i] = ["#START#"]+sentence.split(' ')+["#END#"]

In [None]:
# Build a Vocabulary
from collections import Counter
word_counts = Counter()
for img_captions in captions:
    for caption in img_captions:
        word_counts.update(caption)

In [None]:
vocab  = ['#UNK#', '#START#', '#END#']
vocab += [k for k, v in word_counts.items() if v >= 5]
vocab = list(set(vocab))
n_tokens = len(vocab)

assert 12000 <= n_tokens <= 15000

word_to_index = {w: i for i, w in enumerate(vocab)}

We'll use this function to convert sentences into a network-readable matrix of token indices.

When given several sentences of different length, it pads them with -1.
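
As a quick illustration of the padding behaviour, here is a minimal standalone sketch. The three-word vocabulary and the names `toy_vocab`, `toy_word_to_index` and `toy_as_matrix` are made up for this example and are not part of the seminar code; unknown words map to index 0 (`#UNK#`).

```python
import numpy as np

PAD = -1  # same padding value as used in this seminar

# hypothetical toy vocabulary, for illustration only
toy_vocab = ['#UNK#', 'a', 'cat']
toy_word_to_index = {w: i for i, w in enumerate(toy_vocab)}

def toy_as_matrix(sequences, max_len=None):
    """Convert token lists into an int matrix, padding short rows with PAD."""
    max_len = max_len or max(map(len, sequences))
    matrix = np.full((len(sequences), max_len), PAD, dtype='int32')
    for i, seq in enumerate(sequences):
        row = [toy_word_to_index.get(w, 0) for w in seq[:max_len]]
        matrix[i, :len(row)] = row
    return matrix

toy_matrix = toy_as_matrix([['a', 'cat'], ['a', 'big', 'cat', 'a']])
print(toy_matrix)
```

The shorter sentence is padded up to the length of the longest one, and the out-of-vocabulary word `big` is replaced by the `#UNK#` index.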

In [None]:
PAD_ix = -1
UNK_ix = vocab.index('#UNK#')
START_ix = vocab.index("#START#")
END_ix = vocab.index("#END#")

#good old as_matrix for the third time
def as_matrix(sequences,max_len=None):
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences),max_len),dtype='int32')+PAD_ix
    for i,seq in enumerate(sequences):
        row_ix = [word_to_index.get(word,UNK_ix) for word in seq[:max_len]]
        matrix[i,:len(row_ix)] = row_ix
    
    return matrix

def to_string(tokens_ix):
    assert len(np.shape(tokens_ix))==1,"to_string works on one sequence at a time"
    tokens_ix = list(tokens_ix)[1:]
    if END_ix in tokens_ix:
        tokens_ix = tokens_ix[:tokens_ix.index(END_ix)]
    return " ".join([vocab[i] for i in tokens_ix])

In [None]:
#try it out on several descriptions of a random image
as_matrix(captions[1337])

In [None]:
to_string(as_matrix(captions[1337])[0])

### The neural network

Since the image encoder CNN is already applied, the only remaining part is to write a sentence decoder.


In [None]:
import theano, theano.tensor as T
import lasagne
from lasagne.layers import *

# network shapes. 
EMBEDDING_SIZE = 128    #Change at your will
LSTM_SIZE  = 256        #Change at your will
ATTN_SIZE  = 256        #Change at your will
FEATURES,HEIGHT,WIDTH = img_codes.shape[1:]


We will define a single LSTM step here. An LSTM step should
* take previous cell/out and input
* compute next cell/out and next token probabilities
* use attention to work with image features
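
To make the attention step more concrete, below is a minimal NumPy sketch of one common choice, additive attention over the 6x6 = 36 image locations. It is not necessarily the intended solution: the weight names (`W_feat`, `W_out`, `v`) and the random inputs are assumptions for illustration, and in the seminar itself the same computation would be expressed with Lasagne layers.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_locations, n_features = 2, 36, 128   # 36 = 6x6 spatial positions
lstm_size, attn_size = 256, 256

features = rng.randn(batch, n_locations, n_features)  # image as a sequence of vectors
prev_out = rng.randn(batch, lstm_size)                # previous LSTM output

# hypothetical attention parameters, randomly initialised for the sketch
W_feat = rng.randn(n_features, attn_size) * 0.01
W_out = rng.randn(lstm_size, attn_size) * 0.01
v = rng.randn(attn_size) * 0.01

# score every image location against the current decoder state
hidden = np.tanh(features.dot(W_feat) + prev_out.dot(W_out)[:, None, :])
scores = hidden.dot(v)                                 # [batch, n_locations]

# softmax over locations turns scores into attention probabilities
scores -= scores.max(axis=1, keepdims=True)            # numerical stability
attn_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# attention result: probability-weighted average of the image vectors
attn = (attn_probs[:, :, None] * features).sum(axis=1)  # [batch, n_features]
print(attn_probs.shape, attn.shape)
```

The resulting `attn` vector has one entry per feature channel and can be concatenated with the word embedding before it enters the LSTM.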

In [None]:
#<Your attention layers>

In [None]:
from agentnet.resolver import ProbabilisticResolver
from agentnet.memory import LSTMCell

temperature = theano.shared(1.)
class decoder:
    prev_word = InputLayer((None,),name='index of previous word')
    image_features = InputLayer((None,FEATURES,HEIGHT,WIDTH),name='img features')

    prev_cell = InputLayer((None,LSTM_SIZE),name='previous LSTM cell goes here')
    prev_out = InputLayer((None,LSTM_SIZE),name='previous LSTM output goes here')
    
    prev_word_emb = EmbeddingLayer(prev_word,len(vocab),EMBEDDING_SIZE)
    
    ###Attention part:
    # Please implement the attention part of the RNN architecture
    
    #First we reshape image into a sequence of image vectors
    image_features_seq = reshape(dimshuffle(image_features,[0,2,3,1]),[[0],-1,[3]])
    
    #Then we apply attention just as usual
    attn_probs = <Compute attention probabilities>
    attn = <Compute attention result given probabilities>

    lstm_input = concat([attn,prev_word_emb],axis=-1)

    new_cell,new_out = LSTMCell(prev_cell,prev_out,lstm_input)
    
    
    output_probs = DenseLayer(new_out,len(vocab),nonlinearity=T.nnet.softmax)

    
    output_probs_scaled = ExpressionLayer(output_probs,lambda p: p**temperature)
    output_tokens = ProbabilisticResolver(output_probs_scaled,assume_normalized=False)
    
    
    # recurrent state transition dict
    # on next step, {key} becomes {value}
    transition = {
        new_cell:prev_cell,
        new_out:prev_out
    }

### Training

During training, we feed our decoder RNN with reference captions from the dataset. Training then comes down to a simple likelihood maximization problem.

Deep learning people also know this as minimizing cross-entropy.
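
The link between the two views can be sketched in a few lines of NumPy: the loss is the negative log-probability of each reference token, averaged over the positions that are not padding. The probabilities and targets below are made-up toy values, and the padding index -1 matches the one used earlier in this seminar.

```python
import numpy as np

PAD = -1
# toy predicted next-token probabilities: 4 positions, 3-word vocabulary
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.25, 0.25]])
targets = np.array([0, 1, PAD, PAD])   # the last two positions are padding

mask = targets != PAD
# index -1 silently selects the last column, so padded positions must be masked out
per_token = -np.log(probs[np.arange(len(targets)), targets])
masked_loss = (per_token * mask).sum() / mask.sum()
print(masked_loss)
```

Only the first two positions contribute, so the result equals the mean of `-log(0.7)` and `-log(0.8)`.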

In [None]:
# Inputs for sentences
sentences = T.imatrix("[batch_size x time] of word ids")
l_sentences = InputLayer((None,None),sentences)

# Input layer for image features
image_vectors = T.tensor4("image features [batch,channels,h,w]")
l_image_features = InputLayer((None,FEATURES,HEIGHT,WIDTH),image_vectors)


In [None]:
from agentnet import Recurrence

decoder_trainer = Recurrence(
    input_sequences={decoder.prev_word:l_sentences},
    input_nonsequences={decoder.image_features:l_image_features},
    state_variables=decoder.transition,
    tracked_outputs=[decoder.output_probs],
    unroll_scan = False,
)

In [None]:
#get predictions and define loss
next_token_probs = get_output(decoder_trainer[decoder.output_probs])

next_token_probs = next_token_probs[:,:-1].reshape([-1,len(vocab)])
next_tokens = sentences[:,1:].ravel()

loss = T.nnet.categorical_crossentropy(next_token_probs,next_tokens)

#apply mask
mask = T.neq(next_tokens,PAD_ix)
loss = T.sum(loss*mask)/T.sum(mask)

In [None]:
#trainable NN weights
weights = get_all_params(decoder_trainer,trainable=True)
updates = lasagne.updates.adam(loss,weights)

In [None]:
#compile functions for training and evaluation
#please note that your functions must accept image features as the FIRST param and sentences as the second one
train_step = theano.function([image_vectors,sentences],loss,updates=updates,allow_input_downcast=True)
val_step   = theano.function([image_vectors,sentences],loss,allow_input_downcast=True)
#for val_step use deterministic=True if you have any dropout/noise

# Training

* First, you have to implement a batch generator
* Then the network is trained the usual way

In [None]:
from random import choice

def generate_batch(images,captions,batch_size,max_caption_len=None):
    
    #sample random numbers for image/caption indices
    random_image_ix = np.random.randint(0,len(images),size=batch_size)
    
    #get images
    batch_images = images[random_image_ix]
    
    #5-7 captions for each image
    captions_for_batch_images = captions[random_image_ix]
    
    #pick 1 from 5-7 captions for each image
    batch_captions = list(map(choice,captions_for_batch_images))
    
    #convert to matrix
    batch_captions_ix = as_matrix(batch_captions,max_len=max_caption_len)
    
    return batch_images, batch_captions_ix

In [None]:
bx,by = generate_batch(img_codes,captions,3)
bx[0,:10,0,0],by

### Main loop
* We recommend periodically evaluating the network using the model application block below
 *  It's safe to interrupt training, run a few examples and start training again

In [None]:
batch_size=50 #adjust me
n_epochs=100 #adjust me
n_batches_per_epoch = 50 #adjust me
n_validation_batches = 5 #how many batches are used for validation after each epoch


In [None]:
from tqdm import tqdm

for epoch in range(n_epochs):
    
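    train_loss=0
    for _ in tqdm(range(n_batches_per_epoch)):
        train_loss += train_step(*generate_batch(img_codes,captions,batch_size))
    train_loss /= n_batches_per_epoch
    
    #no held-out split is defined in this notebook, so validation batches are drawn from the same arrays
    val_loss=0
    for _ in range(n_validation_batches):
        val_loss += val_step(*generate_batch(img_codes,captions,batch_size))
    val_loss /= n_validation_batches
    
    print('Epoch: {}, train loss: {}, val loss: {}'.format(epoch, train_loss, val_loss))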

print("Finish :)")

### Apply trained model

In [None]:
#use a separate name so that the integer batch_size used by the training loop is not overwritten
generation_batch_size = theano.shared(np.int32(1))
MAX_LENGTH = 20         #Change at your will

In [None]:
#set up recurrent network that generates tokens and feeds them back to itself
unroll_dict = dict(decoder.transition)
unroll_dict[decoder.output_tokens] = decoder.prev_word #on next iter, output goes to input

first_output = T.repeat(T.constant(START_ix,dtype='int32'),generation_batch_size)
init_dict = {
    decoder.output_tokens:InputLayer([None],first_output)
}

decoder_applier = Recurrence(
    input_nonsequences={decoder.image_features:l_image_features},
    state_variables=unroll_dict,
    state_init = init_dict,
    tracked_outputs=[decoder.output_probs,decoder.output_tokens],
    n_steps = MAX_LENGTH,
)

In [None]:
generated_tokens = get_output(decoder_applier[decoder.output_tokens])

generate = theano.function([image_vectors],generated_tokens,allow_input_downcast=True)

In [None]:
from pretrained_lenet import image_to_features
import matplotlib.pyplot as plt
%matplotlib inline

img = plt.imread("./data/Dog-and-Cat.jpg")
plt.imshow(img)

In [None]:
img_features = image_to_features(img)

temperature.set_value(10.)
for _ in range(100):
    #sample a fresh caption on every iteration
    output_ix = generate([img_features])[0]
    print(to_string(output_ix))

### Some tricks (for further research)

* Initialize LSTM with some function of image features.

* Try other attention functions

* If you train a large network, it is usually a good idea to make a 2-stage prediction
    1. (large recurrent state) -> (bottleneck e.g. 256)
    2. (bottleneck) -> (vocabulary size)
    * this way you won't need to store/train (large_recurrent_state x vocabulary size) matrix
    
* Use [hierarchical softmax](https://gist.github.com/justheuristic/581853c6d6b87eae9669297c2fb1052d) or [byte pair encodings](https://github.com/rsennrich/subword-nmt)
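
To see why the 2-stage prediction trick saves memory, here is a small back-of-the-envelope calculation. The state size of 1024 and the vocabulary size of 13000 are assumed values chosen for illustration (the vocabulary size lies within the 12000 to 15000 range asserted earlier); biases are ignored.

```python
large_state = 1024   # hypothetical recurrent state size
bottleneck = 256     # bottleneck size from the list above
vocab_size = 13000   # hypothetical vocabulary size

# one big matrix: state -> vocabulary
direct_params = large_state * vocab_size

# two smaller matrices: state -> bottleneck -> vocabulary
two_stage_params = large_state * bottleneck + bottleneck * vocab_size

print(direct_params, two_stage_params)
print(direct_params / float(two_stage_params))
```

Under these assumptions the two-stage output layer needs roughly a quarter of the weights of the direct one.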


