# Building an image caption generator with deep learning in TensorFlow
 
 
In this tutorial, you’ll learn how a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network can be combined to build an image caption generator, and how to generate captions for your own images.
## Overview

*    Introduction to Image Captioning Model Architecture
*    Captions as a Search Problem
*    Creating Captions in TensorFlow

## Prerequisites

*    Basic understanding of Convolutional Neural Networks
*    Basic understanding of LSTM
*    Basic understanding of TensorFlow

# Introduction to image captioning model architecture
## Combining a CNN and LSTM

In 2014, researchers from Google released a paper, [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555.pdf). At the time, this architecture was state-of-the-art on the MSCOCO dataset. It used a CNN and an LSTM to take an image as input and output a caption.
A CNN-LSTM Image Caption Architecture:
![](img/1.png)


## Using a CNN for image embedding

A convolutional neural network can be used to create a dense feature vector. This dense vector, also called an embedding, can be used as feature input into other algorithms or networks.

For an image caption model, this embedding becomes a dense representation of the image and will be used as the initial state of the LSTM.
Mapping input to embedding:
![](img/2.png)
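
As a rough illustration of what mapping an input to an embedding means, the sketch below pools a CNN-style feature map into a fixed-length vector and projects it to an embedding. The shapes (an 8×8×2048 feature map, a 512-dimensional embedding) and the random projection weights are assumptions chosen for illustration; a real model would use trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final convolutional feature map: height x width x channels.
feature_map = rng.standard_normal((8, 8, 2048))

# Global average pooling collapses the spatial grid into one value per channel.
pooled = feature_map.mean(axis=(0, 1))          # shape: (2048,)

# A linear projection (random here, learned in practice) maps the pooled
# features to the embedding size expected by the language model.
embedding_size = 512
projection = rng.standard_normal((2048, embedding_size)) * 0.01
image_embedding = pooled @ projection           # shape: (512,)

print(image_embedding.shape)
```

Whatever the image content, the result has the same fixed length, which is what lets it be handed to another network.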

## LSTM

An LSTM is a recurrent neural network architecture commonly used for problems with temporal dependencies. Through its memory cell state, it captures information about previous states to better inform the current prediction.

An LSTM consists of three main components: a forget gate, input gate, and output gate. Each of these gates is responsible for altering updates to the cell’s memory state.
An unrolled LSTM:
![](img/3.png)

For a deeper understanding of LSTMs, visit [Chris Olah’s post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).
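
To make the role of the three gates concrete, here is a minimal NumPy sketch of a single LSTM time-step. It follows the standard textbook formulation; the weight layout (one matrix producing all four gate pre-activations), the sizes, and the random weights are assumptions for illustration, not the pre-trained model used later.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time-step; W maps [h_prev, x] to four stacked gate pre-activations."""
    z = np.concatenate([h_prev, x]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    g = np.tanh(g)                                # candidate cell values
    c = f * c_prev + i * g                        # forget old memory, add new
    h = o * np.tanh(c)                            # expose part of the memory
    return h, c


hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden_size + input_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

h, c = lstm_step(rng.standard_normal(input_size),
                 np.zeros(hidden_size), np.zeros(hidden_size), W, b)
print(h.shape, c.shape)
```

The forget gate scales the previous cell state, the input gate scales the new candidate values, and the output gate decides how much of the updated cell state is exposed as the hidden state.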
## Prediction with image as initial state

In a sentence language model, an LSTM is predicting the next word in a sentence. Similarly, in a character language model, an LSTM is trying to predict the next character, given the context of previously seen characters.


Sentence and character model predictions:
![](img/4.png)

In an image caption model, you will create an embedding of the image. This embedding will then be fed as initial state into an LSTM. This becomes the first previous state to the language model, influencing the next predicted words.

At each time-step, the LSTM considers the previous cell state and outputs a prediction for the most probable next value in the sequence. This process is repeated until the end token is sampled, signaling the end of the caption.

Sampling characters from an LSTM:
![](img/5.png)
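
The sampling loop described above can be sketched with a toy stand-in for the LSTM. The probability table below is hypothetical and conditions only on the previous word, whereas a real LSTM also carries its cell state from step to step.

```python
# Toy next-word probabilities keyed by the previous word (hypothetical values).
NEXT_WORD_PROBS = {
    "<S>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "</S>": 0.2},
    "the": {"dog": 0.9, "</S>": 0.1},
    "dog": {"runs": 0.7, "</S>": 0.3},
    "cat": {"</S>": 1.0},
    "runs": {"</S>": 1.0},
}


def greedy_caption(max_length=10):
    """Pick the most probable next word until the end token appears."""
    words = ["<S>"]
    for _ in range(max_length):
        probs = NEXT_WORD_PROBS[words[-1]]
        next_word = max(probs, key=probs.get)
        words.append(next_word)
        if next_word == "</S>":
            break
    # Strip the start token, and the end token if it was produced.
    return words[1:-1] if words[-1] == "</S>" else words[1:]


print(" ".join(greedy_caption()))
```

With this table the loop produces "a dog runs" and stops as soon as `</S>` is selected.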
## Captions as a search problem

Generating a caption can be viewed as a graph search problem. Here, the nodes are words. The edges are the probability of moving from one node to another. Finding the optimal path involves maximizing the total probability of a sentence.

Sampling and choosing the most probable next value is a greedy approach to generating a caption. It is computationally efficient, but can lead to a sub-optimal result.

With a realistic vocabulary, enumerating every possible sentence to find the optimal one is infeasible in both time and memory. This rules out exhaustive search algorithms such as depth-first search or breadth-first search for finding the optimal path.
![](img/6.png)
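
A small sketch can show both points: that the greedy choice may miss the most probable sentence, and that exhaustive enumeration does not scale. The probability table is hypothetical, and the 12,000-word vocabulary and 20-word caption length are assumed figures used only to illustrate the size of the search space.

```python
import math

# Toy next-word probabilities keyed by the previous word (hypothetical values).
NEXT_WORD_PROBS = {
    "<S>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "</S>": 0.2},
    "the": {"dog": 0.9, "</S>": 0.1},
    "dog": {"runs": 0.7, "</S>": 0.3},
    "cat": {"</S>": 1.0},
    "runs": {"</S>": 1.0},
}


def greedy_path():
    """Follow the single most probable edge at every node."""
    words, logprob = ["<S>"], 0.0
    while words[-1] != "</S>":
        probs = NEXT_WORD_PROBS[words[-1]]
        word = max(probs, key=probs.get)
        words.append(word)
        logprob += math.log(probs[word])
    return tuple(words[1:-1]), logprob


def all_paths(prefix=("<S>",), logprob=0.0):
    """Depth-first enumeration of every complete sentence with its log-probability."""
    if prefix[-1] == "</S>":
        yield prefix[1:-1], logprob
        return
    for word, p in NEXT_WORD_PROBS[prefix[-1]].items():
        yield from all_paths(prefix + (word,), logprob + math.log(p))


greedy_words, greedy_logprob = greedy_path()
best_words, best_logprob = max(all_paths(), key=lambda item: item[1])
print("greedy:", " ".join(greedy_words), round(math.exp(greedy_logprob), 3))
print("best:  ", " ".join(best_words), round(math.exp(best_logprob), 3))

# Number of 20-word sequences over an assumed 12,000-word vocabulary.
print(12000 ** 20)
```

Here the greedy path is "a dog runs" (probability 0.21), while the exhaustive search finds "the dog runs" (probability 0.252). Enumeration works for six words, but the final line shows why it is not an option at realistic scale.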
## Beam Search

Beam search is a heuristic, breadth-first-style search that expands only the most promising nodes. At each iteration, it generates all possible next steps from the current candidates and keeps only the top N.

Because the number of nodes expanded at each step is fixed, this algorithm is space-efficient, yet it considers more candidates than the greedy approach.
Beam search for building a sentence:
![](img/7.png)
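
The following is a simplified sketch of beam search over a toy word graph, not the implementation used later in this tutorial. The probability table is hypothetical, and unlike the full generator this version keeps every completed sentence rather than only the top N.

```python
import math

# Toy next-word probabilities keyed by the previous word (hypothetical values).
NEXT_WORD_PROBS = {
    "<S>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "</S>": 0.2},
    "the": {"dog": 0.9, "</S>": 0.1},
    "dog": {"runs": 0.7, "</S>": 0.3},
    "cat": {"</S>": 1.0},
    "runs": {"</S>": 1.0},
}


def beam_search(beam_size=2, max_length=10):
    beams = [(0.0, ["<S>"])]  # (log-probability, words so far)
    complete = []
    for _ in range(max_length):
        candidates = []
        for logprob, words in beams:
            for word, p in NEXT_WORD_PROBS[words[-1]].items():
                candidate = (logprob + math.log(p), words + [word])
                if word == "</S>":
                    complete.append(candidate)
                else:
                    candidates.append(candidate)
        # Keep only the beam_size most probable partial sentences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if not beams:
            break
    complete.sort(key=lambda c: c[0], reverse=True)
    return [(" ".join(words[1:-1]), math.exp(lp)) for lp, words in complete]


for sentence, prob in beam_search(beam_size=2)[:3]:
    print(sentence, round(prob, 3))
```

With `beam_size=2` the top result is "the dog runs"; with `beam_size=1` the search degenerates to the greedy choice and returns "a dog runs" first.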
## Review

Up to this point, you’ve learned about creating a model architecture to generate a sentence, given an image. This is done by utilizing a CNN to create a dense embedding and feeding this as initial state to an LSTM. Additionally, you’ve learned how to generate better sentences with beam search.

In the next section, you’ll learn to generate captions from a pre-trained model in TensorFlow.
## Creating captions in TensorFlow



### Project Structure

    ├── Dockerfile
    ├── bin
    │ └── download_model.py
    ├── etc
    │ ├── show-and-tell-2M.zip
    │ ├── show-and-tell.pb
    │ └── word_counts.txt
    ├── imgs
    │ └── trading_floor.jpg
    ├── medium_show_and_tell_caption_generator
    │ ├── __init__.py
    │ ├── caption_generator.py
    │ ├── inference.py
    │ ├── model.py
    │ └── vocabulary.py
    └── requirements.txt

## Environment setup

Here, you’ll use Docker to install TensorFlow.

Docker is a container platform that simplifies deployment. It solves the problem of installing software dependencies onto different server environments.  To install Docker, run:
```
curl https://get.docker.com | sh
```
After installing Docker, you’ll create two files: a `requirements.txt` for the Python dependencies and a `Dockerfile` to create your Docker environment.

**requirements.txt:**
```
tensorflow==1.6.0
requests==2.18.4
```


**Dockerfile:**
```
FROM ubuntu:16.04

RUN apt-get update -y --fix-missing
RUN apt-get install -y \
    build-essential \
    wget \
    python3 \
    python3-dev \
    python3-numpy \
    python3-pip

ADD $PWD/requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

CMD ["/bin/bash"]
```

To build this image, run:
```
$ docker build -t colemurray/medium-show-and-tell-caption-generator -f Dockerfile .

# On MBP, ~ 3mins
# Image can be pulled from dockerhub below
```
If you would like to avoid building from source, the image can be pulled from Docker Hub using:

```
docker pull colemurray/medium-show-and-tell-caption-generator # Recommended
```

## Download the model

![](img/9.png)
Show and Tell Inference Architecture source

Below, you’ll download the model graph and pre-trained weights. These weights come from a training session on the [MSCOCO](http://cocodataset.org/#home) dataset for 2 million iterations.

```Python
import argparse
import logging
import os
import zipfile

import requests

model_dict = {
    'show-and-tell-2M': '15Juh0gaYR0qv8GjRL1EvsigErdQXTmnt'
}


def download_and_extract_model(model_name, data_dir):
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    file_id = model_dict[model_name]
    destination = os.path.join(data_dir, model_name + '.zip')
    if not os.path.exists(destination):
        print('Downloading model to %s' % destination)
        download_file_from_google_drive(file_id, destination)
        with zipfile.ZipFile(destination, 'r') as zip_ref:
            print('Extracting model to %s' % data_dir)
            zip_ref.extractall(data_dir)


def download_file_from_google_drive(file_id, destination):
    URL = "https://drive.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)


def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None


def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    parser = argparse.ArgumentParser(add_help=True)
    parser.add_argument('--model-dir', type=str, action='store', dest='model_dir',
                        help='Path to model protobuf graph')

    args = parser.parse_args()

    download_and_extract_model('show-and-tell-2M', args.model_dir)
```



To download, run:

```
docker run -e PYTHONPATH=$PYTHONPATH:/opt/app -v $PWD:/opt/app \
-it colemurray/medium-show-and-tell-caption-generator \
python3 /opt/app/bin/download_model.py \
--model-dir /opt/app/etc
```
Next, create a model class. This class is responsible for loading the graph, creating image embeddings, and running an inference step on the model.


```Python
import logging
import os

import tensorflow as tf


class ShowAndTellModel(object):
    def __init__(self, model_path):
        self._model_path = model_path
        self.logger = logging.getLogger(__name__)

        self._load_model(model_path)
        self._sess = tf.Session(graph=tf.get_default_graph())

    def _load_model(self, frozen_graph_path):
        """
        Loads a frozen graph
        :param frozen_graph_path: path to .pb graph
        :type frozen_graph_path: str
        """

        model_exp = os.path.expanduser(frozen_graph_path)
        if os.path.isfile(model_exp):
            self.logger.info('Loading model filename: %s' % model_exp)
            with tf.gfile.FastGFile(model_exp, 'rb') as f:
                graph_def = tf.GraphDef()
                graph_def.ParseFromString(f.read())
                tf.import_graph_def(graph_def, name='')
        else:
            raise RuntimeError("Missing model file at path: {}".format(frozen_graph_path))

    def feed_image(self, encoded_image):
        initial_state = self._sess.run(fetches="lstm/initial_state:0",
                                       feed_dict={"image_feed:0": encoded_image})
        return initial_state

    def inference_step(self, input_feed, state_feed):
        softmax_output, state_output = self._sess.run(
            fetches=["softmax:0", "lstm/state:0"],
            feed_dict={
                "input_feed:0": input_feed,
                "lstm/state_feed:0": state_feed,
            })
        return softmax_output, state_output, None
```
## Download the vocabulary

When training an LSTM, it is standard practice to tokenize the input. For a sentence model, this means mapping each unique word to a unique numeric id. This allows the model to utilize a softmax classifier for prediction.

Below, you’ll download the vocabulary used for the pre-trained model and create a class to load it into memory. Here, the line number represents the numeric id of the token.

```
# File structure
# token num_of_occurrences

# on 213612
# of 202290
# the 196219
# in 182598

curl -o etc/word_counts.txt https://raw.githubusercontent.com/ColeMurray/medium-show-and-tell-caption-generator/master/etc/word_counts.txt
```
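
The mapping from line position to token id can be illustrated in a few lines. The sample lines reuse the excerpt shown above; because they are not necessarily the first lines of the real file, the ids printed here are illustrative only.

```python
# A few lines in the "token count" layout shown above.
sample_lines = [
    "on 213612",
    "of 202290",
    "the 196219",
    "in 182598",
]

# The token is the first whitespace-separated field; its id is the line index.
id_to_token = [line.split()[0] for line in sample_lines]
token_to_id = {token: idx for idx, token in enumerate(id_to_token)}

print(token_to_id["the"])
print(id_to_token[3])
```

Ids start at zero because they come from `enumerate`, which matches how the vocabulary class in the next listing builds its lookup.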
To store this vocabulary in memory, you’ll create a class responsible for mapping words to ids.
```Python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import logging
import os


class Vocabulary(object):
    """Vocabulary class for mapping words to ids"""

    def __init__(self,
                 vocab_file_path,
                 start_token="<S>",
                 end_token="</S>",
                 unk_token="<UNK>"):
        """Initializes the vocabulary.
    
        Args:
          vocab_file_path: File containing the vocabulary, where the tokens are the first
            whitespace-separated token on each line (other tokens are ignored) and
            the token ids are the corresponding line numbers.
          start_token: Special token denoting sequence start.
          end_token: Special token denoting sequence end.
          unk_token: Special token denoting unknown tokens.
        """
        self.logger = logging.getLogger(__name__)
        if not os.path.exists(vocab_file_path):
            self.logger.error("Vocab file %s not found.", vocab_file_path)
            raise RuntimeError("Vocab file %s not found." % vocab_file_path)
        self.logger.info("Initializing vocabulary from file: %s", vocab_file_path)

        with open(vocab_file_path, mode="r") as f:
            reverse_vocab = list(f.readlines())
        reverse_vocab = [line.split()[0] for line in reverse_vocab]
        assert start_token in reverse_vocab
        assert end_token in reverse_vocab
        if unk_token not in reverse_vocab:
            reverse_vocab.append(unk_token)
        vocab = dict([(x, y) for (y, x) in enumerate(reverse_vocab)])

        self.logger.info("Created vocabulary with %d words" % len(vocab))

        self.vocab = vocab
        self.reverse_vocab = reverse_vocab

        self.start_id = vocab[start_token]
        self.end_id = vocab[end_token]
        self.unk_id = vocab[unk_token]

    def token_to_id(self, token):
        if token in self.vocab:
            return self.vocab[token]
        else:
            return self.unk_id

    def id_to_token(self, token_id):
        if token_id >= len(self.reverse_vocab):
            return self.reverse_vocab[self.unk_id]
        else:
            return self.reverse_vocab[token_id]
```



## Creating a caption generator

To generate captions, first you’ll create a caption generator. This caption generator utilizes beam search to improve the quality of sentences generated.

At each iteration, the generator passes the previous LSTM state (initially derived from the image embedding) and the last word of each partial caption to the model, which returns the next softmax vector.

The top N most probable candidates are kept and utilized in the next inference step. This process continues until either the max sentence length is reached or all sentences have generated the end-of-sentence token.
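
The "keep the top N" bookkeeping can be sketched in isolation with Python's `heapq` module. The helper name `top_n` and the sample scores are hypothetical; the idea is a min-heap of size N whose smallest element is evicted whenever a new score arrives.

```python
import heapq


def top_n(scores, n):
    """Keep the n largest scores seen so far using a min-heap of size n."""
    heap = []
    for score in scores:
        if len(heap) < n:
            heapq.heappush(heap, score)
        else:
            # Push the new score, then drop the smallest of the n + 1 items.
            heapq.heappushpop(heap, score)
    return sorted(heap, reverse=True)


# Log-probabilities of five candidate captions (hypothetical values).
print(top_n([-2.3, -0.7, -5.1, -1.2, -0.9], n=3))
```

The `TopN` class in the listing that follows wraps the same heap logic, storing `Caption` objects that compare by score instead of bare numbers.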

```Python

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Class for generating captions from an image-to-text model."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import heapq
import math

import numpy as np


class TopN(object):
    """Maintains the top n elements of an incrementally provided set."""

    def __init__(self, n):
        self._n = n
        self._data = []

    def size(self):
        assert self._data is not None
        return len(self._data)

    def push(self, x):
        """Pushes a new element."""
        assert self._data is not None
        if len(self._data) < self._n:
            heapq.heappush(self._data, x)
        else:
            heapq.heappushpop(self._data, x)

    def extract(self, sort=False):
        """Extracts all elements from the TopN. This is a destructive operation.
    
        The only method that can be called immediately after extract() is reset().
    
        Args:
          sort: Whether to return the elements in descending sorted order.
    
        Returns:
          A list of data; the top n elements provided to the set.
        """
        assert self._data is not None
        data = self._data
        self._data = None
        if sort:
            data.sort(reverse=True)
        return data

    def reset(self):
        """Returns the TopN to an empty state."""
        self._data = []


class Caption(object):
    """Represents a complete or partial caption."""

    def __init__(self, sentence, state, logprob, score, metadata=None):
        """Initializes the Caption.
        Args:
          sentence: List of word ids in the caption.
          state: Model state after generating the previous word.
          logprob: Log-probability of the caption.
          score: Score of the caption.
          metadata: Optional metadata associated with the partial sentence. If not
            None, a list of strings with the same length as 'sentence'.
        """
        self.sentence = sentence
        self.state = state
        self.logprob = logprob
        self.score = score
        self.metadata = metadata

    def __cmp__(self, other):
        """Compares Captions by score."""
        assert isinstance(other, Caption)
        if self.score == other.score:
            return 0
        elif self.score < other.score:
            return -1
        else:
            return 1

    # For Python 3 compatibility (__cmp__ is deprecated).
    def __lt__(self, other):
        assert isinstance(other, Caption)
        return self.score < other.score

    # Also for Python 3 compatibility.
    def __eq__(self, other):
        assert isinstance(other, Caption)
        return self.score == other.score


class CaptionGenerator(object):
    """Class to generate captions from an image-to-text model.
    This code is a modification of https://github.com/tensorflow/models/blob/master/research/im2txt/im2txt/inference_utils/caption_generator.py
    """

    def __init__(self,
                 model,
                 vocab,
                 beam_size=3,
                 max_caption_length=20,
                 length_normalization_factor=0.0):

        self.vocab = vocab
        self.model = model

        self.beam_size = beam_size
        self.max_caption_length = max_caption_length
        self.length_normalization_factor = length_normalization_factor

    def beam_search(self, encoded_image):
        # Feed in the image to get the initial state.
        partial_caption_beam = TopN(self.beam_size)
        complete_captions = TopN(self.beam_size)
        initial_state = self.model.feed_image(encoded_image)

        initial_beam = Caption(
            sentence=[self.vocab.start_id],
            state=initial_state[0],
            logprob=0.0,
            score=0.0,
            metadata=[""])

        partial_caption_beam.push(initial_beam)

        # Run beam search.
        for _ in range(self.max_caption_length - 1):
            partial_captions_list = partial_caption_beam.extract()
            partial_caption_beam.reset()
            input_feed = np.array([c.sentence[-1] for c in partial_captions_list])
            state_feed = np.array([c.state for c in partial_captions_list])

            softmax, new_states, metadata = self.model.inference_step(input_feed,
                                                                      state_feed)

            for i, partial_caption in enumerate(partial_captions_list):
                word_probabilities = softmax[i]
                state = new_states[i]
                # For this partial caption, get the beam_size most probable next words.
                words_and_probs = list(enumerate(word_probabilities))
                words_and_probs.sort(key=lambda x: -x[1])
                words_and_probs = words_and_probs[0:self.beam_size]
                # Each next word gives a new partial caption.
                for w, p in words_and_probs:
                    if p < 1e-12:
                        continue  # Avoid log(0).
                    sentence = partial_caption.sentence + [w]
                    logprob = partial_caption.logprob + math.log(p)
                    score = logprob
                    if metadata:
                        metadata_list = partial_caption.metadata + [metadata[i]]
                    else:
                        metadata_list = None
                    if w == self.vocab.end_id:
                        if self.length_normalization_factor > 0:
                            score /= len(sentence) ** self.length_normalization_factor
                        beam = Caption(sentence, state, logprob, score, metadata_list)
                        complete_captions.push(beam)
                    else:
                        beam = Caption(sentence, state, logprob, score, metadata_list)
                        partial_caption_beam.push(beam)
            if partial_caption_beam.size() == 0:
                # We have run out of partial candidates; happens when beam_size = 1.
                break

        # If we have no complete captions then fall back to the partial captions.
        # But never output a mixture of complete and partial captions because a
        # partial caption could have a higher score than all the complete captions.
        if complete_captions.size() == 0:
            complete_captions = partial_caption_beam

        return complete_captions.extract(sort=True)


```
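
The `length_normalization_factor` parameter in the listing above is off by default (`0.0`). As a rough sketch of what it changes, the example below scores two hypothetical captions with and without normalization. For simplicity it divides by the number of word probabilities, whereas the generator divides by the length of the sentence list, which also counts the start token.

```python
import math


def caption_score(word_probs, length_normalization_factor=0.0):
    """Sum of log-probabilities, optionally divided by length ** factor."""
    logprob = sum(math.log(p) for p in word_probs)
    if length_normalization_factor > 0:
        return logprob / len(word_probs) ** length_normalization_factor
    return logprob


short_caption = [0.5, 0.5]           # two words
long_caption = [0.6, 0.6, 0.6, 0.6]  # four words, each individually more likely

for factor in (0.0, 1.0):
    print(factor,
          round(caption_score(short_caption, factor), 3),
          round(caption_score(long_caption, factor), 3))
```

Without normalization the shorter caption scores higher simply because it multiplies fewer probabilities. With a factor of 1.0 the score becomes an average per word, and the longer caption wins.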






Next, you’ll load the show and tell model and use it with the above caption generator to create candidate sentences. These sentences will be printed along with their log probability.
```Python

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import logging
import math
import os

import tensorflow as tf

from medium_show_and_tell_caption_generator.caption_generator import CaptionGenerator
from medium_show_and_tell_caption_generator.model import ShowAndTellModel
from medium_show_and_tell_caption_generator.vocabulary import Vocabulary

FLAGS = tf.flags.FLAGS

tf.flags.DEFINE_string("model_path", "", "Model graph def path")
tf.flags.DEFINE_string("vocab_file", "", "Text file containing the vocabulary.")
tf.flags.DEFINE_string("input_files", "",
                       "File pattern or comma-separated list of file patterns "
                       "of image files.")

logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)


def main(_):
    model = ShowAndTellModel(FLAGS.model_path)
    vocab = Vocabulary(FLAGS.vocab_file)
    filenames = _load_filenames()

    generator = CaptionGenerator(model, vocab)

    for filename in filenames:
        with tf.gfile.GFile(filename, "rb") as f:
            image = f.read()
        captions = generator.beam_search(image)
        print("Captions for image %s:" % os.path.basename(filename))
        for i, caption in enumerate(captions):
            # Ignore begin and end tokens <S> and </S>.
            sentence = [vocab.id_to_token(w) for w in caption.sentence[1:-1]]
            sentence = " ".join(sentence)
            print("  %d) %s (p=%f)" % (i, sentence, math.exp(caption.logprob)))


def _load_filenames():
    filenames = []
    for file_pattern in FLAGS.input_files.split(","):
        filenames.extend(tf.gfile.Glob(file_pattern))
    logger.info("Running caption generation on %d files matching %s",
                len(filenames), FLAGS.input_files)
    return filenames


if __name__ == "__main__":
    tf.app.run()

```


## Results

To generate captions, you’ll need to pass in one or more images to the script.

```
docker run -v $PWD:/opt/app \
-e PYTHONPATH=$PYTHONPATH:/opt/app \
-it colemurray/medium-show-and-tell-caption-generator  \
python3 /opt/app/medium_show_and_tell_caption_generator/inference.py \
--model_path /opt/app/etc/show-and-tell.pb \
--input_files /opt/app/imgs/trading_floor.jpg \
--vocab_file /opt/app/etc/word_counts.txt
```
You should see output similar to:

    Captions for image trading_floor.jpg:
      0) a group of people sitting at tables in a room . (p=0.000306)
      1) a group of people sitting around a table with laptops . (p=0.000140)
      2) a group of people sitting at a table with laptops . (p=0.000069)

Generated Caption: a group of people sitting around a table with laptops:

![](img/10.png)

## Conclusion

In this tutorial, you learned:

*    How a convolutional neural network and an LSTM can be combined to generate captions for an image
*    How to use the beam search algorithm to consider multiple captions and select the most probable sentence

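Complete code is available in the [GitHub repository](https://github.com/ColeMurray/medium-show-and-tell-caption-generator).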

Next Steps:

*    Try with your own images
*    Read the Show and Tell paper
*    Create an API to serve captions