# Image Captioning with Pretrained WordVectors using RNN API

In this notebook, we will explore how to generate image captions with pretrained word vectors. We will also be using the TensorFlow RNN API to implement the LSTM. We will not repeat the CNN related operations (loading weights/ inference from CNN) but use the persisted image features.

## Restoration Points

[RESTORE POINT: Load Numerical Captions from Disk](#RESTORE-POINT:-Load-Numerical-Captions-from-Disk)

In [24]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
from tensorflow.contrib import rnn
import nltk
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
import tensorflow as tf
from PIL import Image
import json

import json
from pprint import pprint
import pickle

import correct_spellings

from nltk.translate import bleu_score

## Reading the Annotations

Here we read the annotations of the images which are found at `image_caption_data/train_valid/annotations`. This will result with the following entities.

* Dictionary mapping of the image IDs to filenames (`id_to_fname_map`)
* Dictionary mapping of the filenames to captions (`train_fname_to_caption_map` and `valid_fname_to_caption_map`).    
  * For example, `train_fname_to_caption_map` will be of the format `{fname1: [[cap1_w1, cap1_w2, ...], [cap2_w1, ...], [cap3_w1, ...]], fname2: [[cap1_w1, cap1_w2, ...], [cap2_w1, ...], [cap3_w1, ...]], ...}`

In [2]:
train_file_dir = os.path.join('image_caption_data', *('train_valid','images'))
image_filenames = [os.path.join(train_file_dir,f) for f in os.listdir(train_file_dir) if os.path.isfile(os.path.join(train_file_dir, f)) and f.endswith('.jpg')]
train_image_filenames = image_filenames[1000:] 
test_image_filenames = image_filenames[:1000] 

train_image_file = os.path.join('image_caption_data',*('train_valid', 'image_encodings', 'train_image_encodings.json'))
valid_image_file = os.path.join('image_caption_data',*('train_valid', 'image_encodings', 'valid_image_encodings.json'))
train_image_fnames = (json.load(open(train_image_file)).keys())
valid_image_fnames = (json.load(open(valid_image_file)).keys())

def preprocess_caption(capt):
    capt = capt.replace('-',' ')
    capt = capt.replace(',','')
    capt = capt.replace('.','')
    capt = capt.replace('"','')
    capt = capt.replace('!','')
    capt = capt.replace(':','')
    capt = capt.replace('/','')
    capt = capt.replace('?','')
    capt = capt.replace(';','')
    capt = capt.replace('\' ',' ')
    capt = capt.replace('\n',' ') 
    
    return capt.lower()

train_filename_tails = [os.path.split(path_str)[1] for path_str in train_image_filenames]
print(train_filename_tails[:10])
test_filename_tails = [os.path.split(path_str)[1] for path_str in test_image_filenames]

annotations_fname = os.path.join('image_caption_data', *('train_valid', 'annotations', 'captions_val2014.json'))
data = json.load(open(annotations_fname))

all_caption_string = ''

# Dictionary: image-id -> filename
id_to_fname_map = {}
for item in data['images']:
    id_to_fname_map[item['id']] = item['file_name']

# Dictionary: filename -> caption
train_fname_to_caption_map, valid_fname_to_caption_map = {},{}
max_caption_length = 0
all_cap_lengths = []
for item in data['annotations']:
    
    pre_caption = preprocess_caption(item['caption'])
    capt_len = len(pre_caption.split(' '))
    all_cap_lengths.append(capt_len)
    if capt_len > max_caption_length:
        max_caption_length = capt_len

    if id_to_fname_map[item['image_id']]  in train_filename_tails and id_to_fname_map[item['image_id']] in train_image_fnames:
        if id_to_fname_map[item['image_id']] not in train_fname_to_caption_map:
            train_fname_to_caption_map[id_to_fname_map[item['image_id']]] = [pre_caption]
        else:
            train_fname_to_caption_map[id_to_fname_map[item['image_id']]].append(pre_caption)

        
    elif id_to_fname_map[item['image_id']] in test_filename_tails and id_to_fname_map[item['image_id']] in valid_image_fnames:
        
        if id_to_fname_map[item['image_id']] not in valid_fname_to_caption_map:
            valid_fname_to_caption_map[id_to_fname_map[item['image_id']]] = [pre_caption]
        else:
            valid_fname_to_caption_map[id_to_fname_map[item['image_id']]].append(pre_caption)

    all_caption_string += pre_caption + ' '

    
print('Max caption length: ',max_caption_length)
print('Mean caption length: ',np.mean(all_cap_lengths))
print('Stddev captions: ', np.std(all_cap_lengths))
print('\nSample caption data')
print(list(train_fname_to_caption_map.items())[:10])

word_list = all_caption_string.split(' ')
unique_words = list(set(word_list))

dictionary = {'SOS':0, 'EOS': 1}
for tg in unique_words:
    dictionary[tg] = len(dictionary)

reverse_dictionary = dict([(v,k) for k,v in dictionary.items()]) 
print('\nSample words')
print(word_list[:10])
print('\nSample Dictionary Items')
print(list(dictionary.items())[:10])
print('\nSample Reverse Dictionary Items')
print(list(reverse_dictionary.items())[:10])
print('\nVocabulary size: ',len(dictionary))

assert 'horse' == reverse_dictionary[dictionary['horse']]

vocabulary_size = len(dictionary)

del all_caption_string, word_list, train_image_fnames, valid_image_fnames

['COCO_val2014_000000014248.jpg', 'COCO_val2014_000000014257.jpg', 'COCO_val2014_000000014265.jpg', 'COCO_val2014_000000014271.jpg', 'COCO_val2014_000000014276.jpg', 'COCO_val2014_000000014278.jpg', 'COCO_val2014_000000014282.jpg', 'COCO_val2014_000000014285.jpg', 'COCO_val2014_000000014297.jpg', 'COCO_val2014_000000014306.jpg']
Max caption length:  56
Mean caption length:  10.622163885242827
Stddev captions:  2.420075651000928

Sample caption data
[('COCO_val2014_000000514249.jpg', ['a white fire hydrant sitting next to news paper dispensers at night', 'the corner of a city street at night with a fire hydrant  ', 'a white fire hydrant at dusk sitting on a street corner', 'a white fire hydrant is next to some newspaper boxes', 'a fire hydrant that is sitting on the sidewalk']), ('COCO_val2014_000000562054.jpg', ['a close up of two zebras in a field with a blue background', 'there are two statues of zebras at a exhibit', 'there are two zebras that are standing by each other ', 'two dead

## Creating the Vocabulary
Here, using the unique words we saved, we now create the `dictionary`, `reverse_dictionary` and `vocabulary_size`.

In [3]:
# Add two special tokens to the dictionary
dictionary = {'SOS':0, 'EOS': 1}

# Create an ID for each word in the unique_words
for tg in unique_words:
    dictionary[tg] = len(dictionary)

# Create the reverse dictionary
reverse_dictionary = dict([(v,k) for k,v in dictionary.items()]) 

print('\nSample Dictionary Items')
print(list(dictionary.items())[:10])
print('\nSample Reverse Dictionary Items')
print(list(reverse_dictionary.items())[:10])
print('\nVocabulary size: ',len(dictionary))

# Just checking if the dictionary and reverse dictionary
# are correct
assert 'horse' == reverse_dictionary[dictionary['horse']]

vocabulary_size = len(dictionary)



Sample Dictionary Items
[('', 2), ('escalators', 3), ('preparation', 4), ('squatting', 5), ('wort', 7), ('simple', 4386), ('bodyboard', 8), ('horse', 9), ('hyrdant', 15468), ('aliens', 17550)]

Sample Reverse Dictionary Items
[(0, 'SOS'), (1, 'EOS'), (2, ''), (3, 'escalators'), (4, 'preparation'), (5, 'squatting'), (6, 'campstove'), (7, 'wort'), (8, 'bodyboard'), (9, 'horse')]

Vocabulary size:  17954


# Loading Pre-Trained GloVe Word Vectors
You need to download the GloVe word embeddings from the official [download page](https://nlp.stanford.edu/projects/glove/). We are going to use the `glove.6B.zip file` in there. So go ahead and download that to your project home directory (that is, `ch10` folder).

In [4]:


pret_embeddings = np.empty(shape=(vocabulary_size,50),dtype=np.float32)

# For storing words that found in the Glove that matches the words in our dataset 
# This will be used for correcting spellings of misspelled words
words_in_glove = [] 

words_found = 0 # Number of words in GloVe that matched our corpus
found_word_ids = [] # The IDs of the found words

# We read the downloaded zip file containing GloVe vectors
with zipfile.ZipFile('glove.6B.zip') as glovezip:
    with glovezip.open('glove.6B.50d.txt') as glovefile:
        # Each line of GloVe represents a vector separated by spaces
        # where first element is the word followed by the word vector
        for li, line in enumerate(glovefile):
            # Print progress
            if (li+1)%10000==0: print('.',end='')
            
            # Decode the line to get rid of any
            # unparsable symbols
            line_tokens = line.decode('utf-8').split(' ')
            
            # Get the word
            word = line_tokens[0]
            
            # Get the vector
            vector = [float(v) for v in line_tokens[1:]]
            
            # Make sure the vector is length 50
            assert len(vector)==50
            
            # Check if the GloVe word present in the dictionary
            if word in dictionary.keys():
                
                # We get the correct spelling words for misspelled words in caption dataset
                words_in_glove.append(word)
                
                # Add the vector to the correct position of the 
                # placeholder array
                pret_embeddings[dictionary[word],:] = vector
                
                # Update found word ids
                words_found += 1
                found_word_ids.append(dictionary[word])
                
                # we initialize the words like person's with the embeddings of person
                # because pretrained embeddings dont have words like person's
                word_with_s = word + '\'s'
                if word_with_s in dictionary.keys():
                    pret_embeddings[dictionary[word_with_s],:] = vector
                    words_found += 1
                    found_word_ids.append(dictionary[word_with_s])

# Print some stats
print('\n%d Words matched from pretrained embeddings (After matching words with \'s)'%words_found)
print('\nSample of words not found in the embeddings')
notfound_word_ids = list(set(list(range(0,vocabulary_size))) - set(found_word_ids))
for wid in notfound_word_ids[:100]:
    print(reverse_dictionary[wid],end=' ')


........................................
15666 Words matched from pretrained embeddings (After matching words with 's)

Sample of words not found in the embeddings
SOS EOS  vigrin outlooking kitche campstove thath parcoring shapped warching fooda fiilled sparsly olvie teniis snowboader oldertraditional sculture broncat enscription diswasher watiting platfrom croissaninsider rididing grassin tolit ooking witb timestamped railjet yachtsboats iside trussle chirch dimmely bowlfuls truck) twoo parasailers makeing weall kiteboards bookscds snowstore spraypaint hydrgon umbrells doorwall streetsign extinguiser frotn bannas 20mph marcp portugeuse giraaffes hyrandts excersising habiat sandwhich businesswear bozes buriesd danishes pecial conforter santuary watersurfing buldings cheerywine sittting grillmaster wheee graffiti'd polythenes packes recquet underneat pictcher handknitted graphicssign sevoflurane coffeeand drining tatter handwasher motrocycle windsail tomatogreen paraskiers groffetti un

## Cleanining Up the Captions 
This dataset has a surprising number of spelling mistakes in the caoptions. Therefore, we correct spelling mistakes by matching the incorrect words with words already found in the GloVe file. **Note that this can take few hours (around 2-3) to run**.

### Correcting Spellings for Training Data

In [5]:
word_list = []

# fname_to_caption_map is a filename -> [caption1, caption2, caption3]
# each caption is words seperated by spaces

completed_i = 0 # Used to print progress

# Iterate through caption lists for each image
for fn, cap_list in train_fname_to_caption_map.items():
    
    # Progress
    completed_i += 1
    if completed_i%1000==0:
        print('\n\tCompleted %d/%d\n'%(completed_i, len(train_fname_to_caption_map)))
    
    # Process each caption in list for a given image
    for cap_idx, cap in enumerate(cap_list):
        cap_corrected = [] # Stores the corrected caption
        cap_words = cap.split(' ') # Break caption to words
        
        # For each word in a caption
        for wi, cw in enumerate(cap_words):
            
            # We are going to truncate all the long sentences 
            # to 12 words. So let's save some computational time
            # by only correcting first 12 words of each caption
            if wi>12:
                break
                
            wid = dictionary[cw]
            
            # If word is in the notfound words
            if wid in notfound_word_ids:
                
                # For each word not found in pretrained embeddings we find most similar spellings
                for gw in words_in_glove:

                    cor, found_sim = correct_spellings.correct_wrong_word(cw,gw,cap)
                    cap_corrected.append(cor)
                    # We stop search as soon as we find 
                    # some mathing word
                    if found_sim:
                        break
            else:
                cap_corrected.append(cw)
        
        # String with correct spellings
        cap_corrected_str = ' '.join(cap_corrected)
        
        # Update the word list: contains unique words
        word_list.extend(cap_corrected_str.split(' '))
        word_list = list(set(word_list))
        cap_list[cap_idx] = cap_corrected_str
     
    # Update the caption holder with the correct caption (in place)
    train_fname_to_caption_map[fn] = cap_list

peope   peopel   0.9090909090909091  ( some peopel standing on a strange little car on the road  )
different   differet   0.9411764705882353  ( urinals at differet levels in a bathroom with a brick floor )
watching   wawtching   0.9411764705882353  ( a family wawtching a kite hign in the air )
doesnt   doesn't   0.9230769230769231  ( whatever is in the cooking pot doesn't look recognizable )
across   acoss   0.9090909090909091  ( a plane glides acoss thesurface of the water )
computer   compouter   0.9411764705882353  ( a man that is standing around next to a compouter )
doesnt   doesn't   0.9230769230769231  ( it doesn't appear to be a beautiful day outside )
peole   pepole   0.9090909090909091  ( there are a lot of pepole that are walking in the street )
hanging   haning   0.9230769230769231  ( a group of people haning around a walking area )
varying   varrying   0.9333333333333333  ( a man holding three frisbees of varrying colors  )
giraffe   girafe   0.9230769230769231  ( a girafe

cabinets   cabinents   0.9411764705882353  ( a line of unfinished cabinents in a warehouse )
equestrian   esquestrian   0.9523809523809523  ( an esquestrian obsticle designed to look like a zebra )
grazing   grazzing   0.9333333333333333  ( a picture of a field with multiple animals grazzing )
direct   'direct   0.9230769230769231  ( several rows of track with a 'direct rail services train on the furthest one  )
suitcases   suitecases   0.9473684210526315  ( two images of open suitecases full of toiletries  )
display   displayn   0.9333333333333333  ( lovely  clean dark set kitchen displayn newly remodeled )
giraffe   girraffe   0.9333333333333333  ( a giraffe with a man sitting in front of the girraffe )
giraffe   giraff   0.9230769230769231  ( a man sitting in a chair next to a giraff )
shredding   shreddin   0.9411764705882353  ( a kite surfer killin it while shreddin the gnar )
feild   feilds   0.9090909090909091  ( a train moving on train tracks next to a grass feilds )
pieces   p

owner   ownder   0.9090909090909091  ( an ownder of a dog takes a picture of the dog in the back seat )
frisbee   frisbe   0.9230769230769231  ( a person holding a water bpttle and throwing a frisbe  )
stairs   stairs]   0.9230769230769231  ( a dog sitting on top of stairs] with a red bow tie )
sandwich   sandwhich   0.9411764705882353  ( a long sub sandwhich with salomi and cheese )
raquet   raqcuet   0.9230769230769231  ( a girl swings a raqcuet at a tennis ball )
fuschia   fushia   0.9230769230769231  ( a humming bird above a fushia flower among green plants  )
labrador   laborador   0.9411764705882353  ( a yellow laborador laying in the grass holding a blue frisbee in its mouth )
smiles   smies   0.9090909090909091  ( a man playing guitar smies into the camera )
doesnt   doesn't   0.9230769230769231  ( the little girl doesn't want to keep sitting in her seat )
fluorescent   florescents   0.9090909090909091  ( careful bicycle riders add florescents to their clothes for safety in the

frisbee   frisbe   0.9230769230769231  ( a montage of a man throwing a frisbe  )
broccoli   broccolli   0.9411764705882353  ( a close up of a bowl with vegetables with broccolli )
google   googley   0.9230769230769231  ( a pink monster made out of a toilet seat with googley eyes )
lasagna   lasangna   0.9333333333333333  ( this to me looks like a giant piece of lasangna pizza )
written   writtien   0.9333333333333333  ( a freeway traffic sign writtien in an asian language )
spraypainted   spraypaint   0.9090909090909091  ( spraypaint tag on the side of a commuter train car at a station )
racquets   raquets   0.9333333333333333  ( a few people with tennis raquets in front of the net )
hotel   ahotel   0.9090909090909091  ( a woman in a congested kitchen in ahotel )
ceiling   celing   0.9230769230769231  ( people walking in a museum with a airplane hanging from the celing )
umbrellas   umberellas   0.9473684210526315  ( two people with umberellas standing on the river path )
bathroom   b

blueberry   bluebery   0.9411764705882353  ( a bluebery cake is on a plate and is topped with butter )

	Completed 14000/39500

valentines   valentine's   0.9523809523809523  ( a cake is pictured that is a combined friday the 13th and valentine's day chocolate layer cake )
serve   szerve   0.9090909090909091  ( the tennis player is attempting to return the szerve )
restaurant   resturant   0.9473684210526315  ( two couples having pizza and soda at a small resturant )
restaurant   resturant   0.9473684210526315  ( a picture of a table of food in a resturant )
tennis   teennis   0.9230769230769231  ( a man is playing teennis on the court  )
parked   parkeed   0.9230769230769231  ( a couple of boats parkeed near a boating dock  )
people   peoople   0.9230769230769231  ( this is peoople cutting up a birthday cake )
hamster   hampster   0.9333333333333333  ( a hampster sitting on a lap next to a brush )
conversation   conversating   0.9166666666666666  ( a group of women eating at a dinner 

portraits   potraits   0.9411764705882353  ( a clock in between animals potraits hanged in a wall )
separate   sepaerate   0.9411764705882353  ( a bathroom with two sepaerate sinks cabinets  and mirrors )
covers   coversa   0.9230769230769231  ( a bedroom with white bed coversa grey pillow and two lamps )
numerous   numberous   0.9411764705882353  ( two vehicles parked on a street with numberous stop signs around )
pepperoni   peppeoni   0.9411764705882353  ( a plate with cut slices of pizza and peppeoni )
monitor   monior   0.9230769230769231  ( two laptops and a computer monior sit next to each other on a desk  )

	Completed 20000/39500

either   ither   0.9090909090909091  ( there are two zebras that are standing by each ither )
background   bakground   0.9473684210526315  ( a clock on top of a building with a sky in the bakground )
board   fboard   0.9090909090909091  ( a young man is holding a blue sur fboard )
doesnt   doesn't   0.9230769230769231  ( he is so tall that his wet su

valentines   valentine's   0.9523809523809523  ( i man is holding up a toothbrush with a valentine's day note attached to it )
motorcyle   motorcylce   0.9473684210526315  ( an old model green motorcylce sits on the ground )
buildings   buidings   0.9411764705882353  ( i love the way the sun is creeping behind those two buidings )
desert   deseart   0.9230769230769231  ( a beautiful deseart  platter of cheesecake and strawberries )
controllers   controlers   0.9523809523809523  ( a cocker spaniel puppy laying on top of controlers )
street   sreet   0.9090909090909091  ( a truck with graffiti written on it on a city sreet  )
standing   standng   0.9333333333333333  ( there are two men that are standng by each other  )
multicolored   multitcolored   0.96  ( a small yellow bathroom with two multitcolored towels )
bathroom   bathtroom   0.9411764705882353  ( there is a white toilet in the bathtroom with the tall silver pole  )
bicycles   bicyclers   0.9411764705882353  ( a city street with

motorcycle   motorcyclers   0.9090909090909091  ( two speed motorcyclers turning a corner very sharply )
wrapped   wraped   0.9230769230769231  ( a bacon wraped hotdog with onion on it )
people   pepople   0.9230769230769231  ( there are pepople that are playing in the field  )
shirt   tshirt   0.9090909090909091  ( a child in a green tshirt is playing a wii game )
project   prohject   0.9333333333333333  ( the man is working on a important prohject )

	Completed 25000/39500

living   lving   0.9090909090909091  ( a tv set is in the lving room with a fire place )
refrigerator   refridgerator   0.96  ( a refridgerator with popular gaming characters on it )
competitor   competior   0.9473684210526315  ( competior preparing to return volley during tennis match )
laptop   lapto   0.9090909090909091  ( this is a lapto sitting on a white bed )
couch   courch   0.9090909090909091  ( a very black dog lying on a courch )
strafe   strafze   0.9230769230769231  ( a street sign that reads  greta g

sandwich   sandwhich   0.9411764705882353  ( a sandwhich sitting on a plate next to a glass of tea bowl of soup )
siting   siiting   0.9230769230769231  ( birds siiting on a boat that is in the water )
growing   growings   0.9333333333333333  ( an adult driving a speed boat in the water near some rock growings )
motorcycle   motorcyce   0.9473684210526315  ( a cross country motorcyce on a dirt road )
shirts   tshirts   0.9230769230769231  ( some teddy bears wearing tshirts sitting together  )
closed   eclosed   0.9230769230769231  ( a zebra under a tree in an eclosed area in a zoo )
motorcycle   (motorcycle)   0.9090909090909091  ( a dirt bike (motorcycle) parked behind a white van on the grass )
hangar   hangara   0.9230769230769231  ( inside a airplane hangara stealth plane hanging  up )

	Completed 28000/39500

directions   directons   0.9473684210526315  ( two giraffes standing beside each other but in opposite directons  )
motorcyle   motorcylce   0.9473684210526315  ( a woman wea

crullers   cruller   0.9333333333333333  ( the cruller is on the  paper served  with dunkin donut coffee )
floating   floting   0.9333333333333333  ( the boats are floting on top of the water )
motorcycles   motorcylces   0.9090909090909091  ( an image of a group of policeman on motorcylces )
coffee   coffe   0.9090909090909091  ( coffe cups a cell phone a fork and a meal are scattered about the table )
clydesdale   clydsdale   0.9473684210526315  ( two clydsdale horses being trained by a man and a woman )
miniature   miniture   0.9411764705882353  ( a miniture train hauling a group of people on a ride )
counting   countring   0.9411764705882353  ( a man cross countring sking on a set of trails )
bordered   boardered   0.9411764705882353  ( person in a grassy green field boardered by mountains )
computer   compuiter   0.9411764705882353  ( the worker is looking for information on the compuiter )
holding   holdinga   0.9333333333333333  ( a man is holdinga baseball bat by a chain fence 

baggage   bagage   0.9230769230769231  ( a bagage track takes bags at the airports )
extinguishers   extinguiser   0.9166666666666666  ( a fire department put a fire extinguiser in a park )
lichen   litchen   0.9230769230769231  ( a litchen with black cabinet doors and chairs  )
laying   layling   0.9230769230769231  ( a cat layling on a red blanket and looking relaxed )
snowboard   sno0wboard   0.9473684210526315  ( a person in blue grabbing a sno0wboard in the air )
shirt   tshirt   0.9090909090909091  ( a man dressed in a long sleeve tshirt and tie with a hat  to the side )
dimly   dimmly   0.9090909090909091  ( three people posing for the camera in a dimmly lit room )
displaying   diplaying   0.9473684210526315  ( a woman is diplaying a pizza that fits in her hands )
refrigerator   refridgerator   0.96  ( a refridgerator with a blender and basket sitting on top  )
raquet   raquett   0.9230769230769231  ( a man holding a raquett on a court )
waiting   waitng   0.9230769230769231  ( 

grabbing   grabing   0.9333333333333333  ( a fire hydrgon that a man is grabing )
sandwich   sandwhich   0.9411764705882353  ( a sandwhich cut in half and sitting on top of wrapping paper )
dinghy   dinghys   0.9230769230769231  ( a row of dinghys are pulled up on a beach with people in the background )


### Correcting Spellings for Validation/Testing Data

In [7]:
completed_i = 0 # Used to print progress

# Iterate through caption lists for each image
for fn, cap_list in valid_fname_to_caption_map.items():
    
    # Progress
    completed_i += 1
    if completed_i%1000==0:
        print('Completed %d/%d'%(completed_i, len(valid_fname_to_caption_map)))
        
    # Process each caption in list for a given image
    for cap_idx, cap in enumerate(cap_list):
        cap_corrected = []
        cap_words = cap.split(' ') # break caption to words
        
        # For each word in a caption
        for cw in cap_words:
            
            if cw in dictionary:
                wid = dictionary[cw]
                
                # If word is in the notfound words
                if wid in notfound_word_ids:

                    # for each word not found in pretrained embeddings we find most similar spellings
                    for gw in words_in_glove:
                        cor, found_sim = correct_spellings.correct_wrong_word(cw,gw,cap)
                        cap_corrected.append(cor)

                        # We stop search as soon as we find 
                        # some mathing word
                        if found_sim:
                            break
                else:
                    cap_corrected.append(cw)
            else:
                cap_corrected.append(cw)
        
        # String with correct spellings
        cap_corrected_str = ' '.join(cap_corrected)
        
        # Update the word list: contains unique words
        word_list.extend(cap_corrected_str.split(' '))
        word_list = list(set(word_list))
        cap_list[cap_idx] = cap_corrected_str
        
    valid_fname_to_caption_map[fn] = cap_list

parasail   parasails   0.9411764705882353  ( parasails glide above the blue water of the lake )
necklace   knecklace   0.9411764705882353  ( someone wearing a knecklace with a charm of a pair of scissors on it )
skateboarder   skaterboader   0.9166666666666666  ( a skaterboader flipping his skateboard in the air )
grinding   igrinding   0.9411764705882353  ( view of a skateboarder igrinding a rail through a lens )
porcelain   porcelin   0.9411764705882353  ( a squat style toilet made of white porcelin )
chocolate   choclate   0.9411764705882353  ( a child is eating a choclate doughnout  )
doughnut   doughnout   0.9411764705882353  ( a child is eating a choclate doughnout  )
bright   beright   0.9230769230769231  ( children on beright sunny day playing soccer who appear to be about 5 years old  )
shaped   shapped   0.9230769230769231  ( a bathroom with a blue tiled floor and a odd shapped toilet )
mirror   miror   0.9090909090909091  ( a close up of a motorcycle rear view miror )
conven

## Building the Dictionary with Clean Data

Now we again create `dictionary` and `reverse_dictionary` objects with the correctly spelled data.

In [8]:
# Add two special tokens to the dictionary
dictionary = {'SOS':0, 'EOS': 1}

# Create an ID for each word in the unique_words
for tg in unique_words:
    dictionary[tg] = len(dictionary)

reverse_dictionary = dict([(v,k) for k,v in dictionary.items()]) 

# Print some data
print('\nSample words')
print(word_list[:10])
print('\nSample Dictionary Items')
print(list(dictionary.items())[:10])
print('\nSample Reverse Dictionary Items')
print(list(reverse_dictionary.items())[:10])
print('\nVocabulary size: ',len(dictionary))

# Just checking if the dictionary and reverse dictionary
# are correct
assert 'horse' == reverse_dictionary[dictionary['horse']]

vocabulary_size = len(dictionary)

del word_list


Sample words
['', 'escalators', 'preparation', 'squatting', 'wort', 'bodyboard', 'horse', 'setups', 'parker', 'ears']

Sample Dictionary Items
[('', 2), ('escalators', 3), ('preparation', 4), ('squatting', 5), ('wort', 7), ('simple', 4386), ('bodyboard', 8), ('horse', 9), ('hyrdant', 15468), ('aliens', 17550)]

Sample Reverse Dictionary Items
[(0, 'SOS'), (1, 'EOS'), (2, ''), (3, 'escalators'), (4, 'preparation'), (5, 'squatting'), (6, 'campstove'), (7, 'wort'), (8, 'bodyboard'), (9, 'horse')]

Vocabulary size:  17954


In [9]:
train_fname_caption_numeric_tuples = []
valid_fname_caption_numeric_tuples = []

# Add SOS to beginning of sentence, Add EOS s.t. all captions size = max_caption_length
# we need to make the max_cap_length even (12) s.t. when we add the image vector the full length is odd (13.
# Then we can generate an unroll a single batch in two steps
# We do the 6 unrollings per a single batch of images (defined later), 
# therefore needing 2 iterations to process a full batch
# If you have a large GPU, you can set this to 12 (image feature vec + caption), 
# which will increase computational efficiency
max_caption_length = 12

# Traverse through each and every training caption
for k,v in train_fname_to_caption_map.items():
    # Each image has several captions
    for cap in v:
        # Split the caption to words
        cap_tokens = cap.split(' ')
        # Insert a SOS at the beginning
        cap_tokens.insert(0,'SOS')
        # If the sentence is short, append EOS until the caption is max_caption_length long
        if len(cap_tokens)<max_caption_length:
            cap_tokens.extend(['EOS' for _ in range(max_caption_length-len(cap_tokens))])
        # If the sentence is long, truncate the sentence so the caption is max_caption_length long
        if len(cap_tokens) > max_caption_length:
            del cap_tokens[max_caption_length:]
        
        # Make sure the processed caption is max_caption_length long
        assert len(cap_tokens)==max_caption_length
        
        # Replace each word in the caption with word ID
        num_cap = []
        for word in cap_tokens:
            num_cap.append(dictionary[word])
        
        # Add the numerical caption to the list
        train_fname_caption_numeric_tuples.append([k,num_cap])

# Process all the captions in the validation set similarl
# to the training set
for k,v in valid_fname_to_caption_map.items():
    for cap in v:
        cap_tokens = cap.split(' ')
        cap_tokens.insert(0,'SOS')
        if len(cap_tokens)<max_caption_length:
            cap_tokens.extend(['EOS' for _ in range(max_caption_length-len(cap_tokens))])
        if len(cap_tokens) > max_caption_length:
            del cap_tokens[max_caption_length:]
        
        assert len(cap_tokens)==max_caption_length
        num_cap = []
        for word in cap_tokens:
            num_cap.append(dictionary[word])
        
        valid_fname_caption_numeric_tuples.append([k,num_cap])

print(train_fname_caption_numeric_tuples[:10])

# Persist the numerical captions so that we can run the code from the next cell onwards
# if needed
with open(os.path.join('tmp','tmp_rnn_api_train_fname_caption_numeric_tuples.pkl'),'wb') as f:
    pickle.dump(train_fname_caption_numeric_tuples,f)
with open(os.path.join('tmp','tmp_rnn_api_valid_fname_caption_numeric_tuples.pkl'),'wb') as f:
    pickle.dump(valid_fname_caption_numeric_tuples,f)
with open(os.path.join('tmp','tmp_rnn_api_reverse_dictionary.pkl'),'wb') as f:
    pickle.dump(reverse_dictionary,f)


[['COCO_val2014_000000514249.jpg', [0, 5284, 9574, 12357, 14855, 14824, 5893, 11312, 16582, 10174, 13026, 8564]], ['COCO_val2014_000000514249.jpg', [0, 17393, 5713, 15423, 5284, 2718, 16239, 8564, 6222, 4006, 5284, 12357]], ['COCO_val2014_000000514249.jpg', [0, 5284, 9574, 12357, 14855, 8564, 12303, 14824, 2784, 5284, 16239, 5713]], ['COCO_val2014_000000514249.jpg', [0, 5284, 9574, 12357, 14855, 4996, 5893, 11312, 1900, 15954, 4575, 1]], ['COCO_val2014_000000514249.jpg', [0, 5284, 12357, 14855, 7140, 4996, 14824, 2784, 17393, 4621, 1, 1]], ['COCO_val2014_000000562054.jpg', [0, 5284, 11252, 2271, 15423, 9237, 7420, 10550, 5284, 14696, 4006, 5284]], ['COCO_val2014_000000562054.jpg', [0, 17453, 13260, 9237, 11010, 15423, 7420, 8564, 5284, 7621, 1, 1]], ['COCO_val2014_000000562054.jpg', [0, 17453, 13260, 9237, 7420, 7140, 13260, 11451, 16861, 8640, 15472, 2]], ['COCO_val2014_000000562054.jpg', [0, 9237, 4055, 2164, 12519, 11451, 5893, 11312, 8640, 15472, 1, 1]], ['COCO_val2014_000000562054

## Creating the Pretrained Embeddings with the Clean Data

Here we run the code we ran initially to initialize the pretrained vector numpy array but this time with the cleaned data. We have to rerun this because the dictionary changed after cleaning data.

In [None]:
pret_embeddings = np.empty(shape=(vocabulary_size,50),dtype=np.float32)

words_in_glove = []
words_found = 0
found_word_ids = []

# We read the downloaded zip file containing GloVe vectors
with zipfile.ZipFile('glove.6B.zip') as glovezip:
    with glovezip.open('glove.6B.50d.txt') as glovefile:
        
        # Each line of GloVe represents a vector separated by commas
        # where first element is the word followed by the word vector
        for li, line in enumerate(glovefile):
            # Progress
            if (li+1)%10000==0: print('.',end='')
                
            # Reading in the line
            line_tokens = line.decode('utf-8').split(' ')
            
            # Get the word
            word = line_tokens[0]
            
            # Get the word vector
            vector = [float(v) for v in line_tokens[1:]]
                    
            # For each word in the dictionary
            if word in dictionary.keys():
                # Update the pret_embeddings array
                pret_embeddings[dictionary[word],:] = vector
                
                # Update inforation about matched words between the files
                words_found += 1
                found_word_ids.append(dictionary[word])
                
                # We initialize the words like person's with the embeddings of person
                # because pretrained embeddings dont have words like person's
                word_with_s = word + '\'s'
                if word_with_s in dictionary.keys():
                    pret_embeddings[dictionary[word_with_s],:] = vector
                    words_found += 1
                    found_word_ids.append(dictionary[word_with_s])

# Print some statistics 
print('\n%d Words matched from pretrained embeddings (After matching words with \'s)'%words_found)
print('Words not found in the embeddings')
notfound_word_ids = list(set(list(range(0,vocabulary_size))) - set(found_word_ids))
for wid in notfound_word_ids[:100]:
    print(reverse_dictionary[wid],end=' ')

# We will add vectors for SOS and EOS manually (random)
pret_embeddings[0,:]= np.random.uniform(size=(50),low=-1.0, high=1.0)
pret_embeddings[1,:]= np.random.uniform(size=(50),low=-1.0, high=1.0)

#Saving the pretrained_glove_embeddings
work_dir = 'image_caption_data'
np.save(work_dir + os.sep + 'pretrained-glove-embeddings-tmp',pret_embeddings)

## RESTORE POINT: Load Numerical Captions from Disk
Only run this code if you closed the notebook after running the code above and starting fresh. *You do not need to run this if you have run the above code successfully*.

In [None]:
# Set the image file directory
image_file_dir = os.path.join('image_caption_data', *('train_valid','images'))
# Get the filenames in the directory
image_filenames = [os.path.join(image_file_dir,f) \
                   for f in os.listdir(image_file_dir) \
                   if os.path.isfile(os.path.join(image_file_dir, f)) and f.endswith('.jpg')]

# Splitting training and testing data
train_image_filenames = image_filenames[1000:] 
test_image_filenames = image_filenames[:1000]

if os.path.exists(os.path.join('tmp','tmp_rnn_api_train_fname_caption_numeric_tuples.pkl')):
    with open(os.path.join('tmp','tmp_rnn_api_train_fname_caption_numeric_tuples.pkl'),'rb') as f:
        train_fname_caption_numeric_tuples = pickle.load(f)
else:
    print('You cannot use this restoration point as '+
          'the file tmp/tmp_rnn_api_train_fname_caption_numeric_tuples.pkl does not exist')
    
if os.path.exists(os.path.join('tmp','tmp_rnn_api_valid_fname_caption_numeric_tuples.pkl')):
    with open(os.path.join('tmp','tmp_rnn_api_valid_fname_caption_numeric_tuples.pkl'),'rb') as f:
        valid_fname_caption_numeric_tuples = pickle.load(f)
else:
    print('You cannot use this restoration point as '+
          'the file tmp/tmp_rnn_api_valid_fname_caption_numeric_tuples.pkl does not exist')
        
if os.path.exists(os.path.join('tmp','tmp_rnn_api_reverse_dictionary.pkl')):
    with open(os.path.join('tmp','tmp_rnn_api_reverse_dictionary.pkl'),'rb') as f:
        reverse_dictionary = pickle.load(f)
else:
    print('You cannot use this restoration point as '+
          'the file tmp/tmp_rnn_api_reverse_dictionary.pkl does not exist')

## Creating a Validation Caption Dictionary

Here we create a dictionary `valid_fname_caption_map` which has the format `{fname_1:[[caption 1 word list],[caption 2 word list],...,[caption 3 word list]]}`

In [11]:
valid_fname_caption_map = {}
for fn in test_image_filenames:
    valid_fname_caption_map[os.path.split(fn)[1]] = [] 
    
for fn,c in valid_fname_caption_numeric_tuples:
    caption_w_list = []
    for w_id in c:
        if not (reverse_dictionary[w_id]=='SOS' or reverse_dictionary[w_id]=='EOS'):
            caption_w_list.append(reverse_dictionary[w_id])
    valid_fname_caption_map[fn].append(caption_w_list)

print(valid_fname_caption_map)



## Generating Batches of Data
The following object generates a batch of data which will be used to train the LSTM. More specifically the generator breaks the batch generation to two tasks. It will either output image feature vector followed by a sequence of words, or it can output just a sequence of words. First it decomposes a given sequence of words into `batch_size` segments. We also maintain a cursor for each segment. So whenever we create a batch of data, we sample one item from each segment and update the cursor of each segment. Here:

* `batch_data`: A `[batch_size,input_size]` array that contains image feature vector or words. If it contains words, word IDs will be in the first column and will be padded zeros to make each row `input_size` long.
* `batch_labels`: One hot encoded representation of words

In [12]:
input_size = 1000
tot_captions = len(train_fname_caption_numeric_tuples)

class DataGeneratorSeq(object):
    
    def __init__(self,batch_size,num_unroll,cap_length,image_vector_file,is_train):
        # The size of a single batch of data
        self._batch_size = batch_size
        # Number of unrolling steps to perform
        # To make sure we process all the words in the sequence
        # We make this a factorial of the max_caption_length
        self._num_unroll = num_unroll
        # Caption length (= max_caption_length)
        self._cap_length = cap_length
        # Cursors for each segment
        self._cursor = [0 for offset in range(self._batch_size)]
        
        # Load the image feature vectors
        self._image_data = json.load(open(image_vector_file))
        
        # Batch of cap IDs being processed at a given time
        self._cap_ids = None
        # Check if we are processing training data or validation data
        self._is_train = is_train
        
    def next_batch(self,cap_ids,first_sample):
        '''
        Produces a single batch of data
        '''
        # Holds the inputs and outputs to the network
        batch_data = np.zeros((self._batch_size,input_size),dtype=np.float32) 
        batch_labels = np.zeros((self._batch_size),dtype=np.int32)
        
        # Populate each index withing the batch
        for b in range(self._batch_size):
            cap_id = cap_ids[b] # Caption id corresponding to that position
            
            # If it's training data, get data from training data related variables
            if self._is_train:
                cap_image_vec = self._image_data[train_fname_caption_numeric_tuples[cap_id][0]]
                cap_text = train_fname_caption_numeric_tuples[cap_id][1]
            # If it's testing data, get data from testing data related variables
            else:
                cap_image_vec = self._image_data[valid_fname_caption_numeric_tuples[cap_id][0]]
                cap_text = valid_fname_caption_numeric_tuples[cap_id][1]
            
            # If the cursor exceeds the caption length, reset it
            if self._cursor[b]+1>=self._cap_length:
                self._cursor[b] = 0
            
            # If first sample the first input should be the image feature vector
            if first_sample:
                batch_data[b] = cap_image_vec
                batch_labels[b] = cap_text[0]
            # Else it should be just current word as input and the following word as the output
            else:
                # We are going to append 999 zeros to the end of the caption word index
                # this way we can have all inputs same size
                # and when processing through tensorflow we use tf.reduce_sum to get the indices
                batch_data[b] = np.array([cap_text[self._cursor[b]]]+[0 for _ in range(input_size-1)])
                batch_labels[b] = cap_text[self._cursor[b]+1]
                
                # Increment the cursor
                self._cursor[b] = (self._cursor[b]+1)%self._cap_length

        return batch_data,batch_labels
        
    def unroll_batches(self,first_set):
        '''
        Unroll a set of batches over time
        '''
        
        # This is to select a random set of captions at the beginning of unrolling
        # if first_set variable is True
        if first_set:
            self._cap_ids = np.random.randint(0,tot_captions,self._batch_size)
            self._cursor = [0 for _ in range(self._batch_size)]
            
        unroll_data,unroll_labels = [],[]
        for ui in range(self._num_unroll):
            # The first batch in any batch of captions is different
            if first_set and ui==0:
                data, labels = self.next_batch(self._cap_ids,True)            
            else:
                data, labels = self.next_batch(self._cap_ids,False)  

            unroll_data.append(data)
            unroll_labels.append(labels)
        
        return unroll_data, unroll_labels
    
    def reset_indices(self):
        self._cursor = [0 for offset in range(self._batch_size)]
        
# Running a tiny set to see if the implementation correct
dg = DataGeneratorSeq(
    batch_size=5,num_unroll=5,cap_length=12,
    image_vector_file=os.path.join('image_caption_data',*('train_valid','image_encodings','train_image_encodings.json')),
    is_train=True
                     )
u_data, u_labels = dg.unroll_batches(True)

for ui,(dat,lbl) in enumerate(zip(u_data,u_labels)):   
    print('\n\nUnrolled index %d'%ui)
    dat_ind = dat
    lbl_ind = lbl
    print('\tInputs:')
    for single_dat in dat_ind:
        print('\t%s'%(single_dat[:5]),end=", ")
    print('\n\tOutput:')
    for single_lbl in lbl_ind:        
        print('\t%s'%(np.argmax(single_lbl)),end=", ")



Unrolled index 0
	Inputs:
	[-5.25316   -1.9451786 -2.8510392 -2.1619089 -0.7653859], 	[-0.5736368  -3.0677907   6.0366945   0.01593766  2.8579597 ], 	[-3.739019   -0.8608974  -1.9272035  -0.32741022 -0.19843721], 	[-2.7862813  3.6620226 -3.2841046 -5.858438  -5.741728 ], 	[-3.5580664 -3.5335622 -1.8530213 -3.2802074 -0.6190071], 
	Output:
	0, 	0, 	0, 	0, 	0, 

Unrolled index 1
	Inputs:
	[0. 0. 0. 0. 0.], 	[0. 0. 0. 0. 0.], 	[0. 0. 0. 0. 0.], 	[0. 0. 0. 0. 0.], 	[0. 0. 0. 0. 0.], 
	Output:
	0, 	0, 	0, 	0, 	0, 

Unrolled index 2
	Inputs:
	[5284.    0.    0.    0.    0.], 	[5284.    0.    0.    0.    0.], 	[17393.     0.     0.     0.     0.], 	[5284.    0.    0.    0.    0.], 	[17453.     0.     0.     0.     0.], 
	Output:
	0, 	0, 	0, 	0, 	0, 

Unrolled index 3
	Inputs:
	[12369.     0.     0.     0.     0.], 	[12742.     0.     0.     0.     0.], 	[1029.    0.    0.    0.    0.], 	[2791.    0.    0.    0.    0.], 	[13260.     0.     0.     0.     0.], 
	Output:
	0, 	0, 	0, 	0, 	0, 

U

### Defining hyperparameters

Here we define several hyperparameters such as `batch_size`, `input_size`, ...

In [13]:
input_size = 1000

# Length of the caption, as well as,
# Maximum number of input output tuples 
# that can be created from a single sequence of data
sequence_length = 12

# Number of neurons in the LSTM Cell
num_nodes = 256
# Batch size
batch_size = 50
# Unrolling steps
num_unrollings = 6

### Defining Inputs and Outputs

In the code we define two different types of inputs. 
* Training inputs (batch_size > 1 with unrolling) 
* Validation/Test inputs (An unseen validation dataset) (bach_size > 1, no unrolling)


In [17]:
tf.reset_default_graph()

# Training data
is_train_text, train_inputs, train_labels = [],[],[]

for ui in range(num_unrollings):
    is_train_text.append(tf.placeholder(tf.bool, shape=None, name='is_train_text_data_%d'%ui))
    train_inputs.append(tf.placeholder(tf.float32, shape=[batch_size,input_size],name='train_inputs_%d'%ui))
    train_labels.append(tf.placeholder(tf.int32, shape=[batch_size], name = 'train_labels_%d'%ui))
    
# Testing: Given an image generate the text.
test_is_train_text = tf.placeholder(tf.bool, shape=None,name='is_test_text_data')
test_input = tf.placeholder(tf.float32, shape=[batch_size,input_size],name='test_input')


### Defining Model Parameters

Now we define model parameters. Compared to RNNs, LSTMs have a large number of parameters. We define an LSTM cell using the TensorFlow RNN API here. Specifically we define the following parameters:

* The pretrained embeddings as a variable
* An linear layer to get embeddings that matches the size of the feature vectors
* A softmax layer to get the predictions out

In [18]:
# We are loading the pretrained embeddings
with tf.variable_scope('embeddings'):
    embeddings = tf.get_variable(
        'glove_embeddings',shape=[vocabulary_size, 50], initializer=tf.constant_initializer(pret_embeddings,dtype=tf.float32)
    )

    # We need to match the size of the input to the LSTM to be same as input_size always
    # For that we use a dense layer that will take the input of size 50 and produce inputs of size 1000 (input size)
    embedding_dense = tf.get_variable('embedding_dense',shape=[50,1000],
                                      dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer())
    embedding_bias = tf.get_variable('embedding_bias',
                                     dtype=tf.float32,
                                     initializer=tf.random_uniform(shape=[1000],minval=-0.1,maxval=0.1))

# LSTM cell and Dropout Cell
with tf.variable_scope('rnn'):
    lstm = tf.nn.rnn_cell.LSTMCell(num_nodes)
    # We use dropout to improve the performance
    dropout_lstm = rnn.DropoutWrapper(
        cell=lstm, input_keep_prob=0.8,
        output_keep_prob=0.8, state_keep_prob=1.0,
        dtype=tf.float32
    )

# Defining the softmax weights and biases
with tf.variable_scope('rnn'):    
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], stddev=0.01),name='softmax_weights',trainable=True)
    b = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01),name='softmax_bias',trainable=True)

# Unlike the training state, we would like more control over the testing
# state, so we externally define that
saved_test_output = tf.Variable(tf.zeros([batch_size, num_nodes]))
saved_test_state = tf.Variable(tf.zeros([batch_size, num_nodes]))

### Defining LSTM computations
Here we define inference logic for the parameters we defined above. Compared to the other exercise, we also have an automatic way of processing inputs differently depending on whether we are processing images or text at the moment.

In [19]:
# This is where we check if we are processing images or text
# Because depending on which we're processing we have to do different things
# For images: We just process them as they are
# For text: For the given word ids, we need to look up the embeddings
# and send it through the linear layer
train_inputs_processed = []
for ui in range(num_unrollings):
    
    train_inputs_processed.append(
        tf.cond(is_train_text[ui],
                lambda: tf.add(
                    tf.matmul(tf.nn.embedding_lookup(
                        embeddings, tf.reduce_sum(tf.cast(train_inputs[ui],tf.int32),axis=1)
                    ),embedding_dense),embedding_bias),                
                lambda: tf.identity(train_inputs[ui]))
    )

# Define Initial State
initial_state = lstm.zero_state(batch_size, dtype=tf.float32)

# Setting shape for the processed inputs because the shape information is lost 
# when passing through tf.cond
# But is essential for tf.nn.dynamic_rnn
[t_in.set_shape([batch_size,input_size]) for t_in in train_inputs_processed]

# Gives a [num_unrolling, batch_size, num_nodes] size output
train_outputs, initial_state = tf.nn.dynamic_rnn(
    dropout_lstm, tf.concat([tf.expand_dims(t_in,axis=0) for t_in in train_inputs_processed],axis=0), 
    time_major=True, initial_state=initial_state
)

# Calculate the final output logits for all unrolled steps
final_output = tf.reshape(train_outputs,[-1,num_nodes])
logits = tf.matmul(final_output, w) + b

# Predictions.
train_prediction = tf.nn.softmax(logits)

# Get time majorly reshaped logits (to define loss)
time_major_train_logits = tf.reshape(logits,[num_unrollings,batch_size,vocabulary_size])

# Get time majoryly reshaped labels (to define loss)
time_major_train_labels = tf.reshape(tf.concat(train_labels,axis=0),[num_unrollings,batch_size])

# ===========================================================
# Test inference logic

# Process test inputs
test_inputs_processed = \
        tf.cond(test_is_train_text,
                lambda: tf.add(
                    tf.matmul(tf.nn.embedding_lookup(
                        embeddings, tf.reduce_sum(tf.cast(test_input,tf.int32),axis=1)
                    ),embedding_dense),embedding_bias),                
                lambda: tf.identity(test_input))

# Computing the LSTM output
test_initial_state = (saved_test_output,saved_test_state)
test_output, test_initial_state = lstm(test_inputs_processed, test_initial_state)

# Making sure the states are saved before making predictions
with tf.control_dependencies([saved_test_output.assign(test_initial_state[0]),
                                saved_test_state.assign(test_initial_state[1])]):
    # Test prediction
    test_prediction = tf.nn.softmax(tf.nn.xw_plus_b(test_output, w, b))


### Defining the Loss
We define the loss of the model as a sequence-to-sequence loss. TensorFlow provides a builtin function to compute this loss as shown below.

In [20]:
# We compute [num_unrollings] size loss vector denoting
# the mean loss for each of the batch in the num_unrollings steps
loss = tf.contrib.seq2seq.sequence_loss(
    logits = tf.transpose(time_major_train_logits,[1,0,2]),
    targets = tf.transpose(time_major_train_labels),
    weights= tf.ones([batch_size, num_unrollings], dtype=tf.float32),
    average_across_timesteps=False,
    average_across_batch=True
)

# Here we compute the loss over the time axis and sum each component
# We sum the num_unrollings losses across all the unrolled time steps
loss = tf.reduce_sum(loss)

### Defining Learning Rate and the Optimizer with Gradient Clipping
Here we define the learning rate and the optimizer we're going to use. We will be using the Adam optimizer as it is one of the best optimizers out there. Furthermore we use gradient clipping to prevent any gradient explosions.

In [21]:
# This variable and operation are used to decay the learning rate
# as we saw in chapter 8
global_step = tf.Variable(0, trainable=False)
inc_gstep = tf.assign(global_step,global_step + 1)

# We define a decaying learning rate
learning_rate = tf.train.exponential_decay(
    0.001, global_step, decay_steps=1, decay_rate=0.75, staircase=True)
# We define Adam Optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)

# Gradient clipping
gradients, v = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer = optimizer.apply_gradients(
    zip(gradients, v))

### Resetting Operations for Resetting Hidden States
Sometimes the state variable needs to be reset (e.g. when starting predictions at a beginning of a new epoch)

In [22]:

reset_test_state = tf.group(
    saved_test_output.assign(tf.zeros([batch_size, num_nodes])),
    saved_test_state.assign(tf.zeros([batch_size, num_nodes]))
)

### Running Image Caption Generation

We now run the image caption generation using the TensorFlow functions we defined above. First we create data generators to generate training and test data. Then we train the algorithm for `num_steps` number of steps. Each step consists of two iterations. First iteration processes the first half of the full sequence of the inputs in a batch where the second iteration processes the second half. We also calculate the BLEU score over all the data in the validation set as well as display captions generated for a set of hand-picked examples.

In [25]:
# Training data generator
train_data_generator = DataGeneratorSeq(batch_size=batch_size,num_unroll=num_unrollings,cap_length=sequence_length,
                     image_vector_file='image_caption_data' + os.sep +'train_valid' + \
                      os.sep + 'image_encodings' + os.sep + 'train_image_encodings.json',is_train=True)

# Validation data generator
valid_data_generator = DataGeneratorSeq(batch_size=batch_size,num_unroll=1,cap_length=sequence_length,
                     image_vector_file='image_caption_data' + os.sep +'train_valid' + \
                      os.sep + 'image_encodings' + os.sep + 'valid_image_encodings.json',is_train=False)

# These image will be used to visually analyze the generated text
selected_images_to_view = ['COCO_val2014_000000000757.jpg','COCO_val2014_000000001029.jpg',
                           'COCO_val2014_000000001296.jpg','COCO_val2014_000000001369.jpg','COCO_val2014_000000001584.jpg',
                          'COCO_val2014_000000000885.jpg','COCO_val2014_000000003690.jpg','COCO_val2014_000000003832.jpg',
                           'COCO_val2014_000000004286.jpg','COCO_val2014_000000007444.jpg']

# Used for various testing/validation phase calculations
valid_image_data = json.load(open('image_caption_data' + os.sep +'train_valid' + \
                                          os.sep + 'image_encodings' + os.sep + 'valid_image_encodings.json'))
valid_fnames, valid_image_vectors = valid_image_data.keys(), valid_image_data.values()
        
# Create a session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.8
config.allow_soft_placement=True
sess = tf.InteractiveSession(config=config)

# Initialize all variables
tf.global_variables_initializer().run()
print('Initialized variables')
bleu_over_time = [0]

# Used to decay the learning rate
bleu_drop = 0
bleu_drop_threshold = 5

num_steps = 20001
print('Started Training')
for step in range(num_steps):
    
    # Each step has two training iterations
    # First iteration: Process the first half of the data sequence (image feature vector and following captions)
    # Second iteration: Processing the rest of the words in the caption
    
    # =================================================================
    # First step done starting from the image vector
    u_data, u_labels = train_data_generator.unroll_batches(first_set=True)
    
    # Populate the feed_dict dictionary
    # Feeding in inputs to all num_unrollings placeholders
    feed_dict = {}
    for ui,(dat,lbl) in enumerate(zip(u_data,u_labels)):            
        feed_dict[train_inputs[ui]] = dat
        feed_dict[train_labels[ui]] = lbl
        if ui==0:
            feed_dict[is_train_text[ui]] = False
        else:
            feed_dict[is_train_text[ui]] = True
        
    # Running TensorFlow operations
    _, l = sess.run([optimizer, loss], feed_dict=feed_dict)

    # =================================================================
    # Second iteration: Processing the rest of the words in the caption
    u_data, u_labels = train_data_generator.unroll_batches(first_set=False)
    
    feed_dict = {}
    for ui,(dat,lbl) in enumerate(zip(u_data,u_labels)):            
        feed_dict[train_inputs[ui]] = dat
        feed_dict[train_labels[ui]] = lbl
        feed_dict[is_train_text[ui]] = True
        #print(['( %s; %s ) '%(reverse_dictionary[tid],reverse_dictionary[til]) for tid,til in zip(np.argmax(dat,axis=1),np.argmax(lbl,axis=1))])

    # Running TensorFlow operations
    _, l = sess.run([optimizer, loss], feed_dict=feed_dict)

    
    if (step+1)%100==0:
        # =======================================================================================
        # Validation/Testing phase
        print('\n============================ Step ',step+1,' ==================================')
        
        # Make sure the filenames and image vectors lists
        valid_fnames = list(valid_fnames)
        valid_image_vectors = list(valid_image_vectors)

        # This will be populated with all the predictions we make,
        # For each of the 1000 images in the validation set
        all_test_predictions = ['' for _ in range(len(valid_fnames))]
        
        # Calculates the BLEU score for the validation set
        bleu_for_dataset = []
        
        # Process validation/test data as batches
        for test_batch_index in range(0, len(valid_fnames)-batch_size,batch_size):
            
            # The first input would be the batch of image vectors
            curr_inputs_batch = valid_image_vectors[test_batch_index: test_batch_index + batch_size]
            
            # Get the predictions out
            pred_batch = sess.run(
                test_prediction,
                feed_dict={test_input:np.asarray(curr_inputs_batch),test_is_train_text:False}
            )
            
            # Using the word ID, get the word string using the reverse dictionary for each item in the batch

            # This variable is used for display purposes alone
            pred_sentence_batch = [reverse_dictionary[np.asscalar(np.argmax(pred_batch[ri,:]))] for ri in range(batch_size)]
            
            # Get the word list predicted for each item in the batch
            # for max_caption_length+2 steps
            # We don't put the first word in to pred_word_list_batch as it is 'SOS' and 
            # we don't need that to calculate the BLEU
            # We iteratively feed the previous prediction's word embedding
            # as the input of the next time step
            pred_word_list_batch = [[] for ri in range(batch_size)] 
            for valid_step in range(max_caption_length+2):
                curr_inputs_batch = np.concatenate([np.argmax(pred_batch,axis=1).reshape(-1,1), np.zeros(shape=(batch_size,input_size-1))],axis=1)
                pred_batch = sess.run(test_prediction,feed_dict={test_input:curr_inputs_batch,test_is_train_text:True})
                for bi1 in range(batch_size):
                    max_pred = np.asscalar(np.argmax(pred_batch[bi1]))
                    pred_sentence_batch[bi1] = pred_sentence_batch[bi1] + ' ' + reverse_dictionary[max_pred] 
                    # If the output is EOS we do not add that to calculate BLEU
                    if reverse_dictionary[max_pred]  != 'EOS':
                        pred_word_list_batch[bi1] = pred_word_list_batch[bi1] + [reverse_dictionary[max_pred]]
            
            # Calculating BLEU for batch
            # A list of BLEU values for each prediction
            bleu_for_batch = []
            for bi in range(batch_size):
                all_test_predictions[test_batch_index+bi] = pred_sentence_batch[bi]
                bleu_for_batch.append(bleu_score.sentence_bleu(
                    valid_fname_caption_map[valid_fnames[test_batch_index+bi]], pred_word_list_batch[bi],
                    smoothing_function=bleu_score.SmoothingFunction().method4)
                                     )
                if valid_fnames[test_batch_index + bi] in selected_images_to_view:
                    print(valid_fnames[test_batch_index + bi],': ',pred_sentence_batch[bi])
            
            # Get the mean of the BLEUs for this batch and
            # append it to the global list
            bleu_for_dataset.append(np.mean(bleu_for_batch))
            
            sess.run(reset_test_state)
        
        
        print('BLEU-4 Score')
        current_bscore = np.mean(bleu_for_dataset)
        print('\t',current_bscore)
        
        # Decaying learning rate
        # If the bleu score has not improved in
        # bleu_drop_threshold steps
        
        if current_bscore < max(bleu_over_time):
            bleu_drop += 1
        else:
            bleu_drop = 0
        
        if bleu_drop >= bleu_drop_threshold:
            sess.run(inc_gstep)
            print('Dropping learning rate')
            bleu_drop = 0
            
        bleu_over_time.append(current_bscore)    
        

Initialized variables
Started Training

COCO_val2014_000000003832.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a a a a a a a EOS EOS EOS EOS EOS EOS EOS
BLEU-4 Score
	 0.1936103206174203

COCO_val2014_000000003832.jpg :  SOS a man is is is is is is is is is on a table
COCO_val2014_000000001029.jpg :  SOS a man is is is is is is is is is is on a
COCO_val2014_000000001584.jpg :  SOS a man is is is is is is is is on a table EOS
COCO_val2014_000000000885.jpg :  SOS a man is is is 

COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man in a white shirt and a tie EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate with a sandwich on a plate EOS EOS EOS EOS EOS EOS
BLEU-4 Score
	 0.24801203748414813

COCO_val2014_000000003832.jpg :  SOS a group of people standing on a street EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a large airplane flying a kite on a skateboard EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a train is parked on a street EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man in a tennis court EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a man is standing on a skateboard on a skateboard EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a giraffe EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a tennis racket EOS EOS EOS EOS


COCO_val2014_000000003832.jpg :  SOS a group of people on a beach with a surfboard EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a bird is flying a kite in the water EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus driving down a street with a street sign EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is playing tennis on a tennis court EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a herd of elephants standing in a field EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe is standing in a field EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man is holding a cell phone in a kitchen EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sandwich and a sandwich on a table EOS
BLEU-4 Score
	 0.2570581571336324

COCO_val2014_000000003832.jpg :  SOS a boat is on a beach EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the water EOS EOS EOS EOS EO

COCO_val2014_000000001029.jpg :  SOS a plane flying over a blue sky EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus parked on a city street EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is playing tennis on a tennis court EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a elephant standing in a field EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a slice of pizza on a plate EOS EOS
BLEU-4 Score
	 0.2706622760551148

COCO_val2014_000000003832.jpg :  SOS a large boat is on a beach EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air on a surfboard EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus is parked on a street EOS EOS EOS EOS EOS EOS EO

COCO_val2014_000000000885.jpg :  SOS a man is playing tennis on a tennis court EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a man standing in a field with a horse EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone in front of a laptop EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sandwich and a sandwich on a plate EOS
BLEU-4 Score
	 0.2756740153938318

COCO_val2014_000000003832.jpg :  SOS a boat is docked on the beach EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying over a body of water EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus is parked on a street with a street sign EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man in a blue shirt is playing tennis on a tennis court EOS
COCO_val2014_000000000757.jpg :  SOS a man standing in a field EOS EOS EOS EOS EOS EOS E

COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone on a cell phone EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sandwich and a fork and a fork EOS
BLEU-4 Score
	 0.26682077207012495
Dropping learning rate

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying over a bridge in the air EOS EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus driving down a street with a street sign EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a couple of elephants standing in a field EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a dog EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man is holding a cell phone EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sand


COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a sky background EOS EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus is parked on a street with a bus EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man in a tennis court holding a tennis racket EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a baby elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man in a suit and tie is holding a cell phone EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sandwich on a plate with a sandwich on
BLEU-4 Score
	 0.27971893930678127

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the sky in the sky

COCO_val2014_000000001029.jpg :  SOS a large airplane flying through the air with a kite in the sky EOS
COCO_val2014_000000001584.jpg :  SOS a bus is parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man in a tennis court holding a tennis racket EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a baby elephant is standing in a field EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a dog EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man is holding a baby elephant EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a plate of food with a sandwich and a fork and a fork and
BLEU-4 Score
	 0.2809518051339536
Dropping learning rate

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a large airplane flying through the air with a sky background EOS EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus is pa

COCO_val2014_000000001584.jpg :  SOS a bus is parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is playing tennis on a tennis court EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a baby elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a large elephant EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28274329471850074

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the sky with a kite in the sky EOS EOS
COCO_val2014_000000001584.jpg :  SOS a bus is parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is playing tennis on a tennis court EOS EOS EOS EO

COCO_val2014_000000000757.jpg :  SOS a baby elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2859244155524397

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red double decker bus is parked on a street next to a building
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a baby elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a fris

COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28252908820554223

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus is parked on a street EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS



COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street next to a building EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2843790766229874

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the a

COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man in a tennis court holding a tennis racket EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2841904461112723

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS E

COCO_val2014_000000000885.jpg :  SOS a man holding a tennis racket EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in the grass EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2830206839781429

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in the grass EOS EOS EOS 

COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28327799548271576

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS E


COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2826327966285938

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air E

COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28271917857116674
Dropping learning rate

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street

COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28217965620363367

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EO

COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2823401503902784

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EO


COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.28288347713518414

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air 

COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2825402238013779

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS 

COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS EOS EOS EOS
COCO_val2014_000000004286.jpg :  SOS a giraffe standing in a field with a frisbee EOS EOS EOS EOS EOS
COCO_val2014_000000001296.jpg :  SOS a man holding a cell phone EOS EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000003690.jpg :  SOS a pizza with a fork and a fork on a plate EOS EOS EOS
BLEU-4 Score
	 0.2828001334645546

COCO_val2014_000000003832.jpg :  SOS a boat is docked in the water EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000001029.jpg :  SOS a plane flying through the air with a kite in the air EOS EOS
COCO_val2014_000000001584.jpg :  SOS a red bus parked on a street with a bus EOS EOS EOS EOS
COCO_val2014_000000000885.jpg :  SOS a man is holding a tennis racket EOS EOS EOS EOS EOS EOS EOS
COCO_val2014_000000000757.jpg :  SOS a large elephant standing in a field of grass EOS EOS