Deep Learning Personal Project: Guide Bot

Goal of this project: build an image-captioning application, "Guide Bot", that can be connected to a camera and describe the scene aloud using text-to-speech conversion. It could possibly serve as an aid for blind people.

< About the original dataset "Flickr 30k Data" >

One folder of 30k images and a csv file of corresponding captions of the images (5 captions per image)

Acknowledgement -- this dataset is taken from University of Illinois at Urbana-Champaign Department of Computer Science (https://forms.illinois.edu/sec/229675)

< Table of Contents >

1) Create directories to store train / validation / test data

2) From 30k images, randomly extract 12k for train set, 2k for validation set, and 2k for test set
    - not all 30k, because training will run on CPU only

3) Extract 80k captions from csv file which contains all captions of images (5 captions per image) and save it as txt file
    - 5 captions * 16k images (12k train + 2k validation + 2k test)

4) Create a dictionary "imgs_and_captions" where
    - key : name_of_images
    - value : a list of corresponding_captions
    
5) Clean each caption in the dictionary "imgs_and_captions"
    - remove punctuation
    - remove non-alphabetic tokens
    - remove trailing whitespace
    - convert all characters to lower-case
    - English stopwords are kept (not removed)
    
6) Save cleaned captions to "cleaned_captions.txt"

7) Create a dictionary "training_captions" where
    - key : name of image (without .jpg extension)
    - value : a list of corresponding 5 captions

8) Load Inception V3 model and remove its last layer
    - remove the last layer because the model is not used to classify images but to convert images to fixed-length informative vectors
    
9) Automated Feature Engineering: convert images to (2048,) sized informative vectors and create a dictionary "train_imgs_encoded" where
    - key : name of image (train set)
    - value : feature vector of size (2048,)
    - saved as : "train_imgs_encoded.pkl"

10) Automated Feature Engineering: convert images to (2048,) sized informative vectors and create a dictionary "validation_imgs_encoded" where
    - key : name of image (validation set)
    - value : feature vector of size (2048,)
    - saved as : "validation_imgs_encoded.pkl"
    
11) Create a list "training_vocabulary" with only words that occur at least 10 times

12) Create two dictionaries for easy conversion from index to word and from word to index
    - dictionary "word_to_index"
    - dictionary "index_to_word"
    
13) Calculate max length (max number of words) of training set captions : "max_len_of_all_cap"
    - to make sure each sequence is of equal length when batch processing

14) Create a data_generator

15) Read glove.txt and create a dictionary "word_to_embedding_vectors" where
    - key : word in glove.txt
    - value : corresponding embedding vector from pretrained Glove vectors
    
16) Create a matrix "embedding_matrix" where
    - row index : index of a word in the training vocabulary
    - row value : embedding vector from Glove if the word exists (zeros otherwise)

17) Build a functional model with Dense layers and Conv1D layers
    - Freeze embedding layer

18) Build a functional model with Dense layers and LSTM layer
    - Freeze embedding layer

19) Train both models
    - Trained using CPU only, which was painfully slow
    
20) Using greedy search, predict captions of images in validation set

21) Conclusion

22) Acknowledgement for Pretrained Model Used

23) Sources of Reference

In [None]:
# 1) Create directories to store train / validation / test data
# Make necessary folders for the data
import os

data_dir = './data_folder'
os.mkdir(data_dir)
train_dir = os.path.join(data_dir, 'train')
os.mkdir(train_dir)
validation_dir = os.path.join(data_dir, 'validation')
os.mkdir(validation_dir)
test_dir = os.path.join(data_dir, 'test')
os.mkdir(test_dir)

In [None]:
# 2) From 30k images, randomly extract 12k for train set, 2k for validation set, and 2k for test set
# Note: os.listdir returns files in arbitrary order, so this takes the first 16k files listed
#       rather than a shuffled sample
import shutil
original_data_dir = './flickr30k_images/flickr30k_images'
all_imgs = set()
train_imgs = set()
val_imgs = set()
test_imgs = set()
for count, img in enumerate(os.listdir(original_data_dir)):
    if count == 16000:
        break
    # Pick the destination folder and the bookkeeping set from the running count
    if count < 12000:
        dst_dir, split_imgs = train_dir, train_imgs
    elif count < 14000:
        dst_dir, split_imgs = validation_dir, val_imgs
    else:
        dst_dir, split_imgs = test_dir, test_imgs
    shutil.copyfile(os.path.join(original_data_dir, img), os.path.join(dst_dir, img))
    all_imgs.add(img)
    split_imgs.add(img)
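
The loop above takes the first 16,000 files in directory-listing order. If a shuffled split is wanted instead, it could be sketched roughly as below. This is only an illustration: `split_image_names`, its parameters, and the toy file names are hypothetical and not part of the original notebook.

```python
import random

def split_image_names(image_names, n_train, n_val, n_test, seed=0):
    """Return three disjoint lists drawn at random from image_names."""
    total = n_train + n_val + n_test
    if total > len(image_names):
        raise ValueError("not enough images for the requested split sizes")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    # Sort first so the result does not depend on directory-listing order
    chosen = rng.sample(sorted(image_names), total)
    return (chosen[:n_train],
            chosen[n_train:n_train + n_val],
            chosen[n_train + n_val:])

# Toy example with made-up file names; the notebook's sizes would be 12000, 2000, 2000
names = ["img_%02d.jpg" % i for i in range(20)]
train, val, test = split_image_names(names, 12, 4, 4)
print(len(train), len(val), len(test))
```

In the notebook, `image_names` would be the listing of `original_data_dir`, and each returned list would then be copied into its folder.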

In [None]:
print('Trainset images : ', len(os.listdir(train_dir)))
print('Validationset images : ', len(os.listdir(validation_dir)))
print('Testset images : ', len(os.listdir(test_dir)))
# ".DS_Store" was also counted in all of the folders (therefore got 12000/2000/2000 images)
print(len(all_imgs))
print(len(train_imgs))
print(len(val_imgs))
print(len(test_imgs))

In [None]:
# 3) Extract 80k captions from csv file which contains all captions of images (5 captions per image)
#    and save it as txt file

csv_path = './flickr30k_images/results.csv'
with open(csv_path, "r") as csv_file:
    lines = [line.split("|") for line in csv_file.readlines()]

In [None]:
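# 3) Extract 80k captions from csv file which contains all captions of images (5 captions per image)
#    and save it as txt file

# Use a context manager so the file is flushed and closed before it is read back
with open('./data_folder/all_captions.txt', "w") as txt_file:
    for line in lines:
        img_name = line[0].strip()
        # Skip rows that do not have all three '|'-separated fields
        if img_name in all_imgs and len(line) >= 3:
            txt_file.write(img_name + "#" + str(line[1]).strip() + " " + line[2])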

In [None]:
with open('./data_folder/all_captions.txt', "r") as captions_file:
    count = sum(1 for _ in captions_file)
print(count)
# 286 of the expected 80,000 captions are missing; ignoring them and proceeding to the next steps
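
One way to see which images are short of captions is to count captions per image. The sketch below assumes each line looks like `<image>.jpg#<n> <caption>`, as written by the cell above; the helper name and the toy lines are hypothetical. Images with no captions at all never appear in the file, so they would have to be found by comparing against `all_imgs` separately.

```python
from collections import Counter

def images_with_unexpected_caption_count(caption_lines, expected=5):
    """Map image name -> caption count for images that do not have `expected` captions."""
    counts = Counter()
    for line in caption_lines:
        line = line.strip()
        if not line:
            continue
        # Each line is assumed to look like "<image>.jpg#<n> <caption>"
        image_name = line.split('#', 1)[0]
        counts[image_name] += 1
    return {name: n for name, n in counts.items() if n != expected}

# Toy input: "b.jpg" has only one caption while two are expected
sample = ["a.jpg#0 a dog runs", "a.jpg#1 a dog plays", "b.jpg#0 a man walks"]
print(images_with_unexpected_caption_count(sample, expected=2))
```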

In [None]:
# 4) Create a dictionary "imgs_and_captions" where key : name_of_images
#    and value : a list of corresponding_captions

with open('./data_folder/all_captions.txt', "r") as captions_file:
    all_captions = captions_file.read()
imgs_and_captions = dict()
for line in all_captions.split('\n'):
    if not line.strip():
        continue  # skip blank lines (e.g. the one after the final newline)
    l = line.split(' ')
    img_name = l[0].split('.')[0]  # "123.jpg#0" -> "123"
    corresponding_caption = ' '.join(l[1:])
    if img_name not in imgs_and_captions:
        imgs_and_captions[img_name] = list()
    imgs_and_captions[img_name].append(corresponding_caption)

In [None]:
print(len(imgs_and_captions))

In [None]:
print(list(imgs_and_captions.keys())[12345])  # list() makes the keys indexable on Python 3 too

In [None]:
print(imgs_and_captions['2814037463'])
print("number of captions: " + str(len(imgs_and_captions['2814037463'])))

In [None]:
print(list(imgs_and_captions.keys())[36])  # list() makes the keys indexable on Python 3 too

In [None]:
print(imgs_and_captions['2860314714'])
print("number of captions: " + str(len(imgs_and_captions['2860314714'])))

In [None]:
# 5) Clean each caption in the dictionary "imgs_and_captions"
#    - remove punctuations
#    - remove non-alphabets
#    - remove trailing whitespaces
#    - convert all characters to lower-case
#    - did not remove english stopwords

import string
# import nltk
# from nltk.corpus import stopwords
#nltk.download('stopwords')
for key, captions in imgs_and_captions.items():
    for i in range(len(captions)):
        tokens = captions[i].split() #split into words
        # remove punctuation character by character (works on Python 2 and 3)
        tokens = [''.join(ch for ch in word if ch not in string.punctuation) for word in tokens]
        tokens = [word for word in tokens if word.isalpha()] #remove non-alphabetic tokens
        tokens = [word.strip() for word in tokens] #remove surrounding whitespace
        tokens = [word.lower() for word in tokens] #convert to lower-case 
#         tokens = [word for word in tokens if word not in set (stopwords.words('english'))] #remove stopwords        
        captions[i] = ' '.join(tokens)
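
The cleaning steps above can also be written as one small function, which makes them easy to try on a single caption. This is a minimal sketch; `clean_caption` is a hypothetical helper name, and the sample sentence is only an illustration.

```python
import string

def clean_caption(caption):
    """Strip punctuation, drop non-alphabetic tokens, and lower-case the rest."""
    cleaned = []
    for token in caption.split():
        # Remove punctuation characters one by one (works on Python 2 and 3)
        token = ''.join(ch for ch in token if ch not in string.punctuation)
        if token.isalpha():
            cleaned.append(token.lower())
    return ' '.join(cleaned)

print(clean_caption("Two young guys, with shaggy hair , look at their hands ."))
```

Note that tokens containing digits are dropped entirely, and hyphenated words such as "T-shirt" collapse to "tshirt".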

In [None]:
print(imgs_and_captions['2814037463'])
print("number of captions: " + str(len(imgs_and_captions['2814037463'])))

In [None]:
print(imgs_and_captions['2860314714'])
print("number of captions: " + str(len(imgs_and_captions['2860314714'])))

In [None]:
print(imgs_and_captions['2514612680'])
print("number of captions: " + str(len(imgs_and_captions['2514612680'])))

In [None]:
# 6) Save cleaned captions to "cleaned_captions.txt"

txt_file = open('./data_folder/cleaned_captions.txt',"w")
for img, captions in imgs_and_captions.items():
    for caption in captions:
        txt_file.write(img + " " + caption + '\n')
txt_file.close()

In [None]:
print(len(all_imgs))
print(len(train_imgs))
print(len(val_imgs))
print(len(test_imgs))

In [None]:
# A list of names of imgs in the trainset (without .jpg)
train_img_names = []
for img in train_imgs:
    train_img_names.append(img.split('.')[0])
print(len(train_img_names))

In [None]:
# 7) Create a dictionary "training_captions" where key : name of image (without .jpg extension)
#    and value : a list of corresponding 5 captions

with open('./data_folder/cleaned_captions.txt', "r") as captions_file:
    cleaned_captions = captions_file.read()
train_img_name_set = set(train_img_names)  # set lookup is much faster than scanning the list
training_captions = dict()
for line in cleaned_captions.split('\n'):
    tokens = line.split()
    if (len(tokens) < 2):
        continue
    img, caption = tokens[0], tokens[1:]
    if img in train_img_name_set:
        if img not in training_captions:
            training_captions[img] = list()
        words = 'startseq ' + ' '.join(caption) + ' endseq'
        training_captions[img].append(words)

In [None]:
print(len(training_captions))
print(list(training_captions.keys())[36])
print(training_captions['3178005751'])

In [None]:
print(list(training_captions.keys())[6789])
print(training_captions['12974441'])

In [None]:
# 8) Load Inception V3 model and remove its last layer

from keras.applications.inception_v3 import InceptionV3
model = InceptionV3(weights='imagenet')

In [None]:
# 8) Load Inception V3 model and remove its last layer

from keras.models import Model
model_v3_without_output_layer = Model(model.input, model.layers[-2].output)

In [None]:
# Reference from: https://github.com/hlamba28/Automatic-Image-Captioning

from keras.applications.inception_v3 import preprocess_input
from keras.preprocessing import image
import numpy as np

def preprocess(image_path):
    # Convert all the images to size 299x299 as expected by the inception v3 model
    img = image.load_img(image_path, target_size=(299, 299))
    # Convert PIL image to numpy array of 3-dimensions
    x = image.img_to_array(img)
    # Add one more dimension
    x = np.expand_dims(x, axis=0)
    # preprocess the images using preprocess_input() from inception module
    x = preprocess_input(x)
    return x

# Function to encode a given image into a vector of size (2048, )
def encode(image_path):
    # The parameter is named image_path so it does not shadow the keras `image` module
    x = preprocess(image_path) # preprocess the image
    fea_vec = model_v3_without_output_layer.predict(x) # Get the encoding vector for the image
    fea_vec = np.reshape(fea_vec, fea_vec.shape[1]) # reshape from (1, 2048) to (2048, )
    return fea_vec

In [None]:
# 9) Automated Feature Engineering: convert images to (2048,) sized informative vectors
#    and create a dictionary "train_imgs_encoded" where key : name of image (train set)
#    and value : feature vector of size (2048,), saved as : "train_imgs_encoded.pkl"

train_imgs_encoded = dict()
for img in os.listdir(train_dir):
    if img == '.DS_Store':
        continue
    img_name = img.split('.')[0]
    img_path = os.path.join(train_dir, img)
    train_imgs_encoded[img_name] = encode(img_path)

In [None]:
# 9) Automated Feature Engineering: convert images to (2048,) sized informative vectors
#    and create a dictionary "train_imgs_encoded" where key : name of image (train set)
#    and value : feature vector of size (2048,), saved as : "train_imgs_encoded.pkl"

import pickle
path_for_training = os.path.join(train_dir, 'Pickle')
os.mkdir(path_for_training)
path_for_training = os.path.join(path_for_training, 'train_imgs_encoded.pkl')
with open(path_for_training,'wb') as encoded_pickle:
    pickle.dump(train_imgs_encoded, encoded_pickle)

In [None]:
# 10) Automated Feature Engineering: convert images to (2048,) sized informative vectors
#     and create a dictionary "validation_imgs_encoded" where key : name of image (validation set)
#     and value : feature vector of size (2048,), saved as : "validation_imgs_encoded.pkl"

validation_imgs_encoded = dict()
count = 1
for img in os.listdir(validation_dir):
    if img == '.DS_Store':
        continue
    img_name = img.split('.')[0]
    img_path = os.path.join(validation_dir, img)
    print("" + str(count) + " / 2000 encoding...")
    validation_imgs_encoded[img_name] = encode(img_path)
    count += 1

In [None]:
# 10) Automated Feature Engineering: convert images to (2048,) sized informative vectors
#     and create a dictionary "validation_imgs_encoded" where key : name of image (validation set)
#     and value : feature vector of size (2048,), saved as : "validation_imgs_encoded.pkl"

path_for_validation = os.path.join(validation_dir, 'Pickle')
os.mkdir(path_for_validation)
path_for_validation = os.path.join(path_for_validation, 'validation_imgs_encoded.pkl')
with open(path_for_validation,'wb') as encoded_pickle:
    pickle.dump(validation_imgs_encoded, encoded_pickle)

In [None]:
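# Point both paths at the pickle files themselves, not at the 'Pickle' directories
path_for_training = os.path.join(train_dir, 'Pickle', 'train_imgs_encoded.pkl')
path_for_validation = os.path.join(validation_dir, 'Pickle', 'validation_imgs_encoded.pkl')
with open(path_for_training, 'rb') as f:
    train_img_features = pickle.load(f)
with open(path_for_validation, 'rb') as f:
    validation_img_features = pickle.load(f)
print('Train img features = %d' % len(train_img_features))
print('Validation img features = %d' % len(validation_img_features))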

In [None]:
# 11) Create a list "training_vocabulary" with only words that occur at least 10 times

word_freq = dict()
for list_of_cap in training_captions.values():
    for caption in list_of_cap:
        tokens = caption.split()
        for token in tokens:
            if token not in word_freq:
                word_freq[token] = 1
            else :
                count = word_freq[token] + 1
                word_freq[token] = count
training_vocabulary = [word for word in word_freq if word_freq[word] >= 10]
print('total number of words : %d & number of vocabs interested : %d' % (len(word_freq), len(training_vocabulary)))
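
The same frequency filter can be expressed more compactly with `collections.Counter`. The sketch below is an illustration on toy captions; `build_vocabulary` and `toy_captions` are hypothetical names, and the result is sorted only to make the output deterministic.

```python
from collections import Counter

def build_vocabulary(captions, min_count):
    """Return the sorted list of words that appear at least min_count times."""
    word_freq = Counter()
    for caption in captions:
        word_freq.update(caption.split())
    return sorted(word for word, n in word_freq.items() if n >= min_count)

toy_captions = ["startseq a dog runs endseq",
                "startseq a dog sleeps endseq",
                "startseq a cat sleeps endseq"]
print(build_vocabulary(toy_captions, min_count=2))
```

With the notebook's data, `captions` would be every caption in `training_captions.values()` and `min_count` would be 10.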

In [None]:
# 12) Create two dictionaries for easy conversion from index to word and from word to index
#     dictionary "word_to_index"
#     dictionary "index_to_word"

index_to_word = dict()
word_to_index = dict()
index = 1
for word in training_vocabulary:
    index_to_word[index] = word
    word_to_index[word] = index
    index += 1

In [None]:
# Save dictionary word_to_index as text file
# index starts from 1
txt_file = open('./data_folder/word_to_index.txt',"w")
for word, index in word_to_index.items():
    txt_file.write(word + " " + str(index) + '\n')
txt_file.close()

In [None]:
print(len(training_vocabulary) == len(index_to_word))
print(len(training_vocabulary) == len(word_to_index))

In [None]:
# 13) Calculate max length (max number of words) of training set captions : "max_len_of_all_cap"
max_len_of_all_cap = 0
for captions in training_captions.values():
    for i in range(len(captions)):
        max_len_of_all_cap = max(len(captions[i].split()), max_len_of_all_cap)
print("Max length of all captions : %d" % max_len_of_all_cap)

In [None]:
# 14) Create a data_generator
# Reference from: https://github.com/hlamba28/Automatic-Image-Captioning

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def data_generator(train_captions, train_features, wordtoix, max_length, vocab_size, num_imgs_per_batch):
    X1, X2, y = list(), list(), list()
    count = 0
    # loop forever over images
    while True:
        for img, captions in train_captions.items():
            count += 1
            feature = train_features[img]
            for caption in captions:
                # encode the sequence
                seq = [wordtoix[word] for word in caption.split(' ') if word in wordtoix]
                # split one sequence into multiple X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pair
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store
                    X1.append(feature)
                    X2.append(in_seq)
                    y.append(out_seq)
            # yield the batch data
            if count == num_imgs_per_batch:
                yield [[np.asarray(X1), np.asarray(X2)], np.asarray(y)]
                X1, X2, y = list(), list(), list()
                count = 0
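
To make the inner loop of the generator concrete, the sketch below expands a single caption into its (prefix, next word) training pairs using plain Python lists. It assumes left padding with 0, which is the default behaviour of `pad_sequences`; `caption_to_pairs` and `toy_index` are hypothetical names used only for illustration, and the one-hot encoding of the target is left out.

```python
def caption_to_pairs(caption, word_to_index, max_length):
    """Expand one caption into (left-padded prefix, next-word index) pairs."""
    seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        # Left-pad with 0, mirroring the default 'pre' padding of pad_sequences
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

toy_index = {"startseq": 1, "a": 2, "dog": 3, "endseq": 4}  # hypothetical mapping
pairs = caption_to_pairs("startseq a dog endseq", toy_index, max_length=5)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption of n known words therefore contributes n - 1 training samples, which is why even a batch of 3 images produces many rows.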

In [None]:
# 15) Read glove.txt and create a dictionary "word_to_embedding_vectors" where key : word in glove.txt
#     and value : corresponding embedding vector from pretrained Glove vectors

import io
glove_dir = './data_folder/glove'

word_to_embedding_vectors = dict()

glove_file = io.open(os.path.join(glove_dir, 'glove.6B.200d.txt'), mode="r", encoding="utf-8")

for line in glove_file:
    tokens = line.split()
    word = tokens[0]
    embedding_vector = np.asarray(tokens[1:], dtype='float32')
    word_to_embedding_vectors[word] = embedding_vector
    
glove_file.close()

https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html

Higher-dimensional embeddings can more accurately represent the relationships between input values.
However, more dimensions increase the chance of overfitting and lead to slower training.
Empirical rule of thumb (a good starting point, but it should be tuned using the validation data): embedding_dimensions = number_of_categories**0.25

Since the pretrained Glove vectors loaded above have 200 dimensions, this project sticks to 200.
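
As a quick worked example of the rule of thumb, the sketch below evaluates the fourth root for a few category counts. The function name and the sample sizes are hypothetical and are not taken from this project's vocabulary.

```python
def rule_of_thumb_embedding_dim(number_of_categories):
    """Fourth root of the category count, rounded to the nearest integer."""
    return int(round(number_of_categories ** 0.25))

for n in (81, 10000, 160000):  # illustrative category counts only
    print(n, "->", rule_of_thumb_embedding_dim(n))
```

Even for a vocabulary of 160,000 words the rule suggests only about 20 dimensions, far fewer than the 200 used here; the larger size is kept because the pretrained vectors come in fixed dimensions.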

In [None]:
print("dimension of embedding_vector: " + str(len(word_to_embedding_vectors['a'])))

In [None]:
# 16) Create a matrix "embedding_matrix" where row index : index of a word in training vocabulary
#     and row value : embedding vector from Glove if the word exists (zeros otherwise)

embedding_dim = len(next(iter(word_to_embedding_vectors.values())))
vocabulary_size = len(training_vocabulary) + 1  # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((vocabulary_size,embedding_dim))

# For word in our training vocabulary, extract embedding_vector from Glove if exists
for word, index in word_to_index.items():
    if word in word_to_embedding_vectors:
        embedding_matrix[index] = word_to_embedding_vectors[word]
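
Words that have no Glove vector keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below is a minimal illustration with toy dictionaries; `glove_coverage`, `toy_index`, and `toy_glove` are hypothetical names and values.

```python
def glove_coverage(word_to_index, word_to_embedding_vectors):
    """Return (fraction of vocabulary found in GloVe, sorted list of missing words)."""
    missing = sorted(w for w in word_to_index if w not in word_to_embedding_vectors)
    covered = len(word_to_index) - len(missing)
    fraction = covered / float(len(word_to_index)) if word_to_index else 0.0
    return fraction, missing

# Toy stand-ins for the notebook's dictionaries (values are made up)
toy_index = {"startseq": 1, "dog": 2, "endseq": 3, "runs": 4}
toy_glove = {"dog": [0.1, 0.2], "runs": [0.3, 0.4]}
fraction, missing = glove_coverage(toy_index, toy_glove)
print(fraction, missing)
```

In the notebook the arguments would be `word_to_index` and `word_to_embedding_vectors`; artificial markers such as startseq and endseq are likely candidates for the missing list.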

I assumed that the order of words in a caption is not important for interpreting its meaning, so the first model uses Conv1D instead of RNN/LSTM/GRU. If order were important, as it usually is in problems involving time-series data, RNN/LSTM would be the natural choice. For text data such as this, Conv1D is lighter and can reach almost the same performance.

In [None]:
# 17) Build a functional model with Dense layers and Conv1D layers
#     - Freeze embedding layer

from keras import Input, layers, Model

input1 = Input(shape=(2048,))
x1 = layers.Dense(256, activation='relu')(input1)
x1 = layers.Dropout(0.5)(x1)

input2 = Input(shape=(max_len_of_all_cap,))
x2 = layers.Embedding(vocabulary_size, embedding_dim)(input2)
x2 = layers.Conv1D(128, 7, activation='relu')(x2)
x2 = layers.MaxPooling1D(5)(x2)
x2 = layers.Conv1D(256, 7, activation='relu')(x2)
x2 = layers.GlobalMaxPooling1D()(x2)

input_added = layers.add([x1, x2])
x3 = layers.Dense(256, activation='relu')(input_added)
output = layers.Dense(vocabulary_size, activation='softmax')(x3)

model = Model(inputs=[input1, input2], outputs=output)
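
The Conv1D and MaxPooling1D layers above shrink the sequence at each step, so the text branch only works if the padded caption length is large enough. The sketch below works through that arithmetic, assuming the Keras defaults used in the cell ('valid' padding, convolution stride 1, pooling stride equal to the pool size); the function names and sample lengths are hypothetical.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def max_pool1d_length(length, pool_size):
    """Output length of MaxPooling1D with stride equal to pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1

def text_branch_length(max_len):
    """Sequence length left just before GlobalMaxPooling1D in the model above."""
    n = conv1d_valid_length(max_len, 7)   # Conv1D(128, 7)
    n = max_pool1d_length(n, 5)           # MaxPooling1D(5)
    return conv1d_valid_length(n, 7)      # Conv1D(256, 7)

for max_len in (40, 41, 80):  # illustrative values; the real one is max_len_of_all_cap
    print(max_len, "->", text_branch_length(max_len))
```

Under these assumptions a padded length of at least 41 is needed for the second convolution to produce any output at all.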

In [None]:
model.summary()

In [None]:
model.layers[1]

In [None]:
# Freeze embedding layer
model.layers[1].set_weights([embedding_matrix])
model.layers[1].trainable = False

In [None]:
from keras import optimizers
# Use the imported `optimizers` module; the bare name `keras` is never imported in this notebook
model.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam(lr=0.0001))

In [None]:
train_imgs_encoded = pickle.load(open('./data_folder/train/Pickle/train_imgs_encoded.pkl','rb'))
validation_imgs_encoded = pickle.load(open('./data_folder/validation/Pickle/validation_imgs_encoded.pkl','rb'))

In [None]:
epochs = 10
batch_size = 3
steps = len(training_captions) // batch_size

In [None]:
for i in range(epochs):
    generator = data_generator(training_captions,
                               train_imgs_encoded,
                               word_to_index,
                               max_len_of_all_cap,
                               vocabulary_size,
                               batch_size)
    history = model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('./models/model_' + str(i) + '.h5')

In [None]:
from keras import models
model = models.load_model('./models/model_9.h5')

In [None]:
#make some changes
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4))

In [None]:
epochs = 10
batch_size = 3
steps = len(training_captions) // batch_size

In [None]:
for i in range(epochs):
    generator = data_generator(training_captions,
                               train_imgs_encoded,
                               word_to_index,
                               max_len_of_all_cap,
                               vocabulary_size,
                               batch_size)
    history = model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('./models/model_1' + str(i) + '.h5')

In [None]:
from keras import models
model = models.load_model('./models/model_19.h5')

In [None]:
#make some changes
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5))

In [None]:
epochs = 10
batch_size = 3
steps = len(training_captions) // batch_size

In [None]:
for i in range(epochs):
    generator = data_generator(training_captions,
                               train_imgs_encoded,
                               word_to_index,
                               max_len_of_all_cap,
                               vocabulary_size,
                               batch_size)
    history = model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('./models/model_2' + str(i) + '.h5')

In [None]:
# 18) Build a functional model with Dense layers and LSTM layer
#     - Freeze embedding layer

input1 = Input(shape=(2048,))
x1 = layers.Dropout(0.5)(input1)
x1 = layers.Dense(256, activation='relu')(x1)

input2 = Input(shape=(max_len_of_all_cap,))
x2 = layers.Embedding(vocabulary_size, embedding_dim, mask_zero=True)(input2)
x2 = layers.Dropout(0.5)(x2)
x2 = layers.LSTM(256)(x2)

input_added = layers.add([x1, x2])
x3 = layers.Dense(256, activation='relu')(input_added)
output = layers.Dense(vocabulary_size, activation='softmax')(x3)
model = Model(inputs=[input1, input2], outputs=output)

In [None]:
model.summary()

In [None]:
model.layers[2]

In [None]:
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
epochs = 10
batch_size = 5
steps = len(training_captions) // batch_size

In [None]:
for i in range(epochs):
    generator = data_generator(training_captions,
                               train_imgs_encoded,
                               word_to_index,
                               max_len_of_all_cap,
                               vocabulary_size,
                               batch_size)
    history = model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save('./models/lstm_model_1' + str(i) + '.h5')
    # loss: 2.7302

Load trained model weights and encoded validation image features

In [None]:
model.load_weights('./models/lstm/lstm_model_9.h5')

In [None]:
validation_images_path = './data_folder/validation/'

In [None]:
with open('./data_folder/validation/Pickle/validation_imgs_encoded.pkl', 'rb') as encoded_pickle:
    validation_imgs_encoded = pickle.load(encoded_pickle)

In [None]:
# 20) Using greedy search, predict captions of images in validation set
# Reference from: https://github.com/hlamba28/Automatic-Image-Captioning

def greedy_search(feature):
    in_text = 'startseq'
    for i in range(max_len_of_all_cap):
        inputs = [word_to_index[w] for w in in_text.split() if w in word_to_index]
        inputs = pad_sequences([inputs], maxlen=max_len_of_all_cap)

        y_hat = model.predict([feature, inputs], verbose=0)
        y_hat = int(np.argmax(y_hat))
        # index 0 is the padding slot and has no word; stop if the model predicts it
        word = index_to_word.get(y_hat)
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    # drop the leading 'startseq'; 'endseq' is never appended,
    # so the last real word is kept even when max length is reached
    predicted_caption = ' '.join(in_text.split()[1:])
    return predicted_caption
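
The greedy strategy itself does not depend on Keras: at each step it simply takes the single most likely next word. The sketch below shows the idea with a toy scoring table in place of the trained model; `greedy_decode`, `TOY_TABLE`, and `toy_scores` are hypothetical and the scores are made up.

```python
def greedy_decode(next_word_scores, max_len):
    """Repeatedly append the highest-scoring next word until 'endseq' or max_len."""
    words = ['startseq']
    for _ in range(max_len):
        scores = next_word_scores(words)      # dict: candidate word -> score
        best = max(scores, key=scores.get)    # greedy choice
        if best == 'endseq':
            break
        words.append(best)
    return ' '.join(words[1:])                # drop 'startseq'

# Hypothetical scoring table keyed by the previous word only
TOY_TABLE = {
    'startseq': {'a': 0.9, 'dog': 0.1},
    'a': {'dog': 0.7, 'cat': 0.3},
    'dog': {'runs': 0.6, 'endseq': 0.4},
    'runs': {'endseq': 0.8, 'a': 0.2},
}

def toy_scores(words):
    return TOY_TABLE[words[-1]]

print(greedy_decode(toy_scores, max_len=10))
```

Because only the top word is kept at each step, greedy search can miss a caption whose first word is less likely but whose overall probability is higher.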

In [None]:
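from random import randint
import matplotlib.pyplot as plt

# Pick a random validation image; the upper bound follows the actual number of encoded images
index = randint(0, len(validation_imgs_encoded) - 1)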
img_name = list(validation_imgs_encoded.keys())[index]
feature = validation_imgs_encoded[img_name].reshape((1,2048))

x = plt.imread(validation_images_path + img_name + '.jpg')
plt.imshow(x)
plt.show()

caption = greedy_search(feature)
print(caption)

< Conclusion >

The application works, but the deep learning model seems too weak. Because I trained on CPU only, I could not train the model for enough epochs and had to use a small amount of data. With a GPU, I would collect and use more data (images and captions) and tune the model's hyper-parameters. I also need to find the right evaluation metrics for training.

I built two different models: one using Conv1D and the other using LSTM. I assumed that the order of words in a caption is not important for interpreting its meaning, so Conv1D seemed like a more efficient approach than RNN/LSTM/GRU. Conv1D is known to be lighter and can attain almost the same performance. Due to limited resources, I was not able to fully compare the Conv1D and LSTM models. Each epoch of the Conv1D model was much shorter than an epoch of the LSTM model, which is consistent with Conv1D being lighter. However, the Conv1D model also learned more slowly than the LSTM model, so it required more epochs of training. I could not determine from this project which model is better, and I will have to examine this more closely in future projects.

Image captioning to help the blind seems extremely difficult because it requires a tremendous amount of data (images and captions) for training the model.

< Acknowledgement for Pretrained Model Used >

1) Inception V3 (from keras.applications)
    - utilized for extracting feature vectors from images

2) Glove: Global Vectors for Word Representation
    by Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.
    source from https://nlp.stanford.edu/projects/glove/

< Sources of Reference >

1) Francois Chollet (2017). "Deep Learning with Python". Manning.

2) https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

3) https://towardsdatascience.com/image-captioning-with-keras-teaching-computers-to-describe-pictures-c88a46a311b8?fbclid=IwAR2O1DXkN305efbxVazsbV-rmLKR7fsUvq39jUa5CydHEU3xKeytCx_ycsw

4) https://github.com/hlamba28/Automatic-Image-Captioning

5) https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_gui/py_video_display/py_video_display.html