# Image Caption Generator using CNN and LSTM



## Introduction

When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers worked on this problem for years, and it was long considered out of reach. With advances in deep learning, the availability of large datasets, and more computing power, we can now build models that generate captions for images.


That is what we implement in this Python-based project, combining two deep learning techniques: a Convolutional Neural Network **(CNN)** and a type of Recurrent Neural Network **(LSTM)**.

## Goal 

The objective of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

## What is an Image Caption Generator?


Image caption generation is a task that combines computer vision and natural language processing: the model recognizes the context of an image and describes it in a natural language such as English. In short, it is the task of predicting a caption for a given image.


## What is CNN?

A CNN (convolutional neural network) is a specialized deep neural network used to recognize and classify images. It processes data represented as 2D matrices, such as images. It analyzes an image by scanning it from left to right and top to bottom with small filters and extracting relevant features. Because the same filters are applied everywhere, CNNs tolerate translated imagery well and, with suitable training data, can also cope with scaled and rotated imagery. Finally, the network combines the extracted features to classify the image.
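
To make the scanning idea concrete, here is a minimal sketch of a single 2D convolution (strictly speaking a cross-correlation, which is what deep learning libraries compute) in plain Python. The function name `conv2d_valid`, the toy image, and the hand-picked kernel are illustrative assumptions and not part of this project's code; a real CNN such as Xception stacks many kernels whose values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):           # move top to bottom
        row = []
        for left in range(out_w):      # move left to right
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked kernel that responds to brightness increasing from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only where the window straddles the edge, which is the sense in which a filter "extracts" a feature.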


## What is LSTM?

LSTM (long short-term memory) is a type of RNN (recurrent neural network) that is well suited to sequence prediction problems. A common use is next-word prediction, as when Google Search suggests the next word based on the text typed so far. As it processes its inputs, an LSTM carries relevant information forward and discards irrelevant information.
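
The "carry forward or discard" behaviour comes from gates inside the LSTM cell. The following is a minimal sketch of one time step of a single-unit LSTM using the standard gate equations; the function names and the hand-set weights are hypothetical and chosen only to show a cell that keeps its memory and ignores its input. A trained LSTM learns these weights, and Keras' `LSTM(256)` layer works on vectors rather than scalars.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM; `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical hand-set weights (not learned values).
weights = {
    "forget": (0.0, 0.0, 10.0),      # large bias: forget gate close to 1, keep memory
    "input": (0.0, 0.0, -10.0),      # negative bias: input gate close to 0, ignore input
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
h, c = 0.0, 0.8
for x in [5.0, -3.0, 2.0]:
    h, c = lstm_step(x, h, c, weights)
print(round(c, 3))
```

With these weights the cell state stays close to its initial value of 0.8 regardless of the inputs; flipping the two biases would make the cell overwrite its memory at every step instead.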

## Our Architecture


![title](Architecture.jpeg)


Image Caption Generator Model (CNN-RNN model) = CNN + LSTM

- **CNN** : To extract features from the image. A pre-trained model called Xception is used for this.
- **LSTM** : To generate a description from the extracted information of the image.

## Dataset for Image Caption Generator

For the image caption generator, we will use the Flickr8k dataset. Larger datasets such as Flickr30k and MSCOCO exist, but training on them can take weeks, so we use the smaller Flickr8k dataset. The advantage of a larger dataset is that it lets us build better models.

The most important file is **Flickr8k.token.txt**, which stores every image name together with its captions. The 8091 images are stored in the Flicker8k_Dataset folder, and the text files with the captions are stored in the Flickr8k_text folder.
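
As the parsing code later in this notebook implies, each line of the token file holds an image name followed by `#` and a caption index, then a tab, then the caption. The sketch below parses one such line; the helper name `parse_token_line` and the sample line (including its file name) are illustrative assumptions rather than content copied from the dataset.

```python
def parse_token_line(line):
    """Split 'image#n<TAB>caption' into (image name, caption index, caption)."""
    image_id, caption = line.split("\t", 1)
    image_name, caption_index = image_id.rsplit("#", 1)
    return image_name, int(caption_index), caption


# Hypothetical sample line in the token-file layout.
sample = "example_image.jpg#0\tA child in a pink dress is climbing up a set of stairs ."
print(parse_token_line(sample))
```

Grouping the parsed captions by image name gives the dictionary that the cleaning steps below operate on.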

## Image caption generator in code

Install the required libraries:

In [1]:
# !pip install tensorflow
# !pip install keras
# !pip install pillow
# !pip install numpy
# !pip install tqdm
# !pip install matplotlib
# !pip install opencv-python
# !pip install jupyterlab

In [5]:
import os
import string
from pickle import dump, load

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm  # to check loop progress

# Import everything from tensorflow.keras; mixing it with the standalone
# keras package can produce incompatible layer objects
from tensorflow.keras.applications.xception import Xception, preprocess_input  # pre-trained Xception model
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add  # to build our CNN and LSTM
from tensorflow.keras.models import Model, load_model


## Data Cleaning


In [6]:

# Load a text file into memory
def load_fp(filename):
    # The context manager closes the file even if reading fails
    with open(filename, 'r') as file:
        return file.read()


In [7]:
# Map each image name to the list of its captions
def img_capt(filename):
    text = load_fp(filename)
    descriptions = {}
    for line in text.split('\n'):
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        img, caption = line.split('\t')
        # Drop the "#0".."#4" caption index appended to the image name
        descriptions.setdefault(img[:-2], []).append(caption)
    return descriptions

In [8]:
import string


# Data cleaning: lowercase every word, remove punctuation, and drop
# single-character tokens and words containing numbers
def txt_clean(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            # str.replace returns a new string, so the result must be assigned
            img_caption = img_caption.replace("-", " ")
            descp = img_caption.split()
            descp = [wrd.lower() for wrd in descp]
            # remove punctuation from each token
            descp = [wrd.translate(table) for wrd in descp]
            # remove hanging 's and a
            descp = [wrd for wrd in descp if (len(wrd) > 1)]
            # remove words containing numbers
            descp = [wrd for wrd in descp if (wrd.isalpha())]
            # converting back to string
            img_caption = ' '.join(descp)
            captions[img][i] = img_caption
    return captions
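
To see what the cleaning steps do, here is a small sketch that applies the steps `txt_clean` is meant to perform to a single caption. The helper `clean_caption` and the example sentence are made up for illustration and are not part of the notebook.

```python
import string


def clean_caption(caption):
    """Apply the cleaning steps to a single caption string."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.replace("-", " ").split()
    words = [w.lower().translate(table) for w in words]
    # Keep only alphabetic tokens longer than one character.
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)


cleaned = clean_caption("A black-and-white dog's 2 puppies run .")
print(cleaned)
```

The article "A", the digit, and the trailing full stop disappear, and the hyphenated word is split into separate tokens before punctuation is removed.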


In [9]:
def txt_vocab(descriptions):
    # Build the set of all unique words across every caption
    vocab = set()
    for caps in descriptions.values():
        for cap in caps:
            vocab.update(cap.split())
    return vocab

In [10]:
# To save all descriptions in one file, one "image<TAB>caption" pair per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)

In [11]:
# Set these paths to wherever you extracted the dataset on your system
dataset_text = "C:/Users/ADMIN/Downloads/Flickr8k_text"
dataset_images = "C:/Users/ADMIN/Downloads/Flickr8k_Dataset/Flicker8k_Dataset"

# To prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"

# loading the file that contains all data
# map them into descriptions dictionary
descriptions = img_capt(filename)
print("Length of descriptions =", len(descriptions))

# cleaning the descriptions
clean_descriptions = txt_clean(descriptions)

# to build vocabulary
vocabulary = txt_vocab(clean_descriptions)
print("Length of vocabulary =", len(vocabulary))

# saving all descriptions in one file
save_descriptions(clean_descriptions, "descriptions.txt")


In [10]:
def extract_features(directory):
    # Xception without its classification head; average pooling yields a 2048-d vector
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for pic in tqdm(os.listdir(directory)):
        file = os.path.join(directory, pic)
        # Force 3 channels so grayscale or RGBA files do not break the model input
        image = Image.open(file).convert('RGB')
        image = image.resize((299, 299))
        image = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
        # Scale pixel values from [0, 255] to [-1, 1], the range Xception expects
        image = image / 127.5 - 1.0
        feature = model.predict(image, verbose=0)
        features[pic] = feature  # shape (1, 2048), keyed by file name
    return features
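
The division by 127.5 followed by subtracting 1.0 in `extract_features` rescales 8-bit pixel values into the range [-1, 1]. A minimal sketch of that arithmetic on single values (the helper name `scale_pixel` is hypothetical):

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


for v in (0, 127.5, 255):
    print(v, "->", scale_pixel(v))
```

Black maps to -1, mid-gray to 0, and white to 1. The `preprocess_input` function imported from the Xception module applies the same scaling, so it could be used in place of the two manual steps.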


In [None]:
#2048 feature vector
# Extract features from images and save them to a pickle file
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))

# To directly load the features from the pickle file.
features = load(open("features.p","rb"))



In [1]:
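# Load the list of image file names for a dataset split
def load_photos(filename):
    file = load_fp(filename)
    # One file name per line; ignore blank lines such as a trailing newline
    photos = [line for line in file.split("\n") if line.strip()]
    return photos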

In [2]:
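def load_clean_descriptions(filename, photos):
    # Load the cleaned captions for the requested photos only
    file = load_fp(filename)
    photos = set(photos)  # set membership is much faster than scanning a list
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            # Wrap each caption in start/end markers; the Keras Tokenizer strips
            # the angle brackets by default, leaving the tokens 'start' and 'end'
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions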


In [3]:
def load_features(photos):
    # Loading all features
    all_features = load(open("features.p", "rb"))
    # Selecting only needed features
    features = {k: all_features[k] for k in photos}
    return features

In [4]:
# File paths
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"

# Load train data
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)



In [None]:
# Flatten the dictionary into a single list of descriptions
def dict_to_list(descriptions):
    all_desc = []
    for desc_list in descriptions.values():
        all_desc.extend(desc_list)
    return all_desc

In [None]:
# Creating tokenizer class
# This will vectorize text corpus
# Each integer will represent a token in the dictionary
def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer
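
Conceptually, fitting the tokenizer builds a word-to-integer table in which more frequent words get smaller ids and id 0 is left free for padding. The sketch below imitates that idea in plain Python; it is an approximation for illustration, not the Keras implementation, and the alphabetical tie-breaking, the helper names, and the two toy captions are assumptions.

```python
from collections import Counter


def build_word_index(captions):
    """Assign 1-based integer ids to words, most frequent first (0 is left for padding)."""
    counts = Counter(word for caption in captions for word in caption.split())
    # Sort by descending count, then alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def encode(caption, word_index):
    """Turn a caption into a list of ids, skipping words that were never seen."""
    return [word_index[w] for w in caption.split() if w in word_index]


captions = ["start dog runs end", "start dog jumps end"]
word_index = build_word_index(captions)
print(word_index)
print(encode("start dog swims end", word_index))
```

The unseen word "swims" is simply dropped when encoding, and the vocabulary size used by the model is `len(word_index) + 1` to account for the padding id.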

In [None]:
# Give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))

vocab_size = len(tokenizer.word_index) + 1  # The size of our vocabulary is 7577 words.


In [None]:
# Calculate maximum length of descriptions to decide the model structure parameters.
def calc_max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

# The function has its own name so that re-running this cell does not try to call an integer
max_length = calc_max_length(train_descriptions)  # Max length of description is 32


In [None]:
# Data generator, consumed by the training loop; yields one image's samples at a time
def data_generator(descriptions, features, tokenizer, max_length):
    while True:
        for key, description_list in descriptions.items():
            # Retrieve photo features
            feature = features[key][0]
            inp_image, inp_seq, op_word = create_sequences(tokenizer, max_length, description_list, feature)
            # Yield (inputs, targets) as tuples, the structure Keras expects from a generator
            yield (inp_image, inp_seq), op_word

In [None]:
def create_sequences(tokenizer, max_length, desc_list, feature):
    x_1, x_2, y = [], [], []
    # Move through each description for the image
    for desc in desc_list:
        # Encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # Divide one sequence into various X, y pairs
        for i in range(1, len(seq)):
            # Divide into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # Pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # Encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Store
            x_1.append(feature)
            x_2.append(in_seq)
            y.append(out_seq)
    return np.array(x_1), np.array(x_2), np.array(y)
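
The core of `create_sequences` is turning one encoded caption into several (prefix, next word) training pairs, with each prefix left-padded to a fixed length. Here is a plain-Python sketch of just that step, leaving out the image feature and the one-hot encoding; the helper names and the word ids are hypothetical.

```python
def pad_left(seq, length):
    """Left-pad `seq` with zeros to `length`, keeping the last `length` items if it is longer."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq


def expand_caption(encoded, max_length):
    """Turn one encoded caption into (padded prefix, next word id) training pairs."""
    pairs = []
    for i in range(1, len(encoded)):
        pairs.append((pad_left(encoded[:i], max_length), encoded[i]))
    return pairs


# Hypothetical ids for the caption "start dog runs end".
encoded = [3, 1, 5, 2]
pairs = expand_caption(encoded, max_length=5)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption with n tokens yields n - 1 pairs, so the batch produced for one image has as many rows as the summed pair counts of its captions.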



In [None]:
# To check the shape of the input and output for your model
[a, b], c = next(data_generator(train_descriptions, train_features, tokenizer, max_length))
print(a.shape, b.shape, c.shape)  # Output: ((47, 2048), (47, 32), (47, 7577))

In [None]:
from tensorflow.keras.utils import plot_model

# Define the captioning model
def define_model(vocab_size, max_length):
    # Features from the CNN model compressed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    
    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    
    # Merge it [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
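    # Summarize model (summary() prints directly and returns None)
    model.summary()
    # plot_model needs the pydot package and Graphviz to be installed
    plot_model(model, to_file='model.png', show_shapes=True)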
    
    return model
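
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the layer sizes. The sketch below uses the standard parameter-count formulas for dense, embedding, and LSTM layers and the vocabulary size of 7577 quoted earlier; the helper names are illustrative, and the `Dropout` and `add` layers are omitted because they have no weights.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out            # weights + biases


def embedding_params(vocab_size, dim):
    return vocab_size * dim                # one vector per token id


def lstm_params(n_in, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (n_in + units) + units)


vocab_size = 7577  # value quoted in the tokenizer cell
layers = {
    "image Dense(256)": dense_params(2048, 256),
    "Embedding": embedding_params(vocab_size, 256),
    "LSTM(256)": lstm_params(256, 256),
    "decoder Dense(256)": dense_params(256, 256),
    "output Dense(vocab)": dense_params(256, vocab_size),
}
for name, count in layers.items():
    print(f"{name:>20}: {count:,}")
total_params = sum(layers.values())
print(f"{'total':>20}: {total_params:,}")
```

The embedding and the output layer dominate the count because both scale with the vocabulary size, which is one reason the captions are cleaned before the vocabulary is built.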


In [None]:
# Train our model
print('Dataset:', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length:', max_length)

# Define the model
model = define_model(vocab_size, max_length)

epochs = 10
steps = len(train_descriptions)

# Create a directory named "models" to save our models (no error if it already exists)
os.makedirs("models", exist_ok=True)

for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit_generator is deprecated in TensorFlow 2.x; fit accepts generators directly
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")


### Test our model


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.models import load_model
from pickle import load
import numpy as np
import cv2
import matplotlib.pyplot as plt
import argparse


def extract_features(filename, model):
    print("File Path: \"" + filename + "\"")

    # cv2.imread returns None instead of raising when the file cannot be read;
    # IMREAD_COLOR always gives a 3-channel image, even for grayscale or RGBA files
    image = cv2.imread(filename, cv2.IMREAD_COLOR)
    if image is None:
        raise FileNotFoundError("Couldn't open image! Make sure the image path and extension are correct.")

    # OpenCV loads BGR; convert to RGB to match the PIL-based training pipeline
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (299, 299))
    image = np.expand_dims(image.astype(np.float32), axis=0)
    # Scale pixel values from [0, 255] to [-1, 1], as during training
    image = image / 127.5 - 1.0
    feature = model.predict(image)
    return feature


def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None


def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo,sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text


def main():
    # Example image: 'flickr8k-dataset/111537222_07e56d5a30.jpg'
    # Command: python testing_caption_generator.py -i ./flickr8k-dataset/111537222_07e56d5a30.jpg
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--image', required=True, help="Image Path")
    args = vars(parser.parse_args())
    img_path = args['image']

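    max_length = 32  # must match the maximum description length found during training
    with open("tokenizer.p", "rb") as f:
        tokenizer = load(f)
    model = load_model('models/model_9.h5')
    xception_model = Xception(include_top=False, pooling="avg")

    photo = extract_features(img_path, xception_model)
    # Read in color and convert OpenCV's BGR order to RGB for matplotlib
    img = cv2.cvtColor(cv2.imread(img_path), cv2.COLOR_BGR2RGB)

    description = generate_desc(model, tokenizer, photo, max_length)

    # Strip the 'start' and 'end' markers from the generated text
    words = description.split()
    if words and words[0] == 'start':
        words = words[1:]
    if words and words[-1] == 'end':
        words = words[:-1]
    description = ' '.join(words)

    print("\n")
    print("Caption: " + description)
    plt.imshow(img)
    plt.show()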

if __name__ == '__main__':
    main()
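
The loop in `generate_desc` is a greedy decoder: at every step it appends the single most probable next word and stops at `end` or at the length limit. The sketch below reproduces that control flow with a toy lookup table standing in for `model.predict`; `toy_model` and its probabilities are invented for illustration, and unlike the real model it looks only at the last word and ignores the image.

```python
def greedy_decode(next_word_probs, max_length):
    """Repeatedly append the most probable next word until 'end' or max_length."""
    words = ["start"]
    for _ in range(max_length):
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)
        words.append(best)
        if best == "end":
            break
    return " ".join(words)


def toy_model(words):
    """Hypothetical stand-in for model.predict: probabilities keyed on the last word."""
    table = {
        "start": {"dog": 0.7, "cat": 0.3},
        "dog": {"runs": 0.6, "end": 0.4},
        "runs": {"end": 0.9, "dog": 0.1},
    }
    return table[words[-1]]


caption = greedy_decode(toy_model, max_length=10)
print(caption)
```

Greedy decoding is simple and fast, but it commits to the best word at each step and never revisits earlier choices; beam search is a common alternative when caption quality matters more than speed.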