# EYE FOR BLIND
This notebook will be used to prepare the capstone project 'Eye for Blind'

In [31]:
#Import all the required libraries

import warnings
warnings.filterwarnings('ignore')

import numpy as np # for scientific computing
import pandas as pd # for data manipulation, processing and analysis
import matplotlib.pyplot as plt # plotting 
import seaborn as sns # visualization
import glob # to return all file paths which matches specific pattern
import tensorflow as tf
import keras
import string
import time
import json
import random

from sklearn.model_selection import train_test_split #to split the data

from tensorflow.keras.applications.inception_v3 import InceptionV3 # pre-trained model
from tensorflow.keras.models import Model  # neural network model
from tensorflow.keras import layers 
from tensorflow.keras import activations
from tensorflow.keras import Input


from keras.preprocessing.image import load_img # to load an image from a file as a PIL image object
from keras.preprocessing.text import Tokenizer # vectorizing a text

from skimage import io # to read/ write images

import collections # containers used to store data
from wordcloud import WordCloud, STOPWORDS
 
from tqdm import tqdm # for showing progress bars
from PIL import Image # Python Image Library
import IPython
from IPython import display



Let's read the dataset

## Data understanding
1.Import the dataset and read image & captions into two separate variables

2.Visualise both the images & text present in the dataset

3.Create a dataframe which summarizes the image ID, path & captions

4.Create a list which contains all the captions & path

5.Visualise the top 30 occurring words in the captions



In [32]:
#Import the dataset and read the image into a separate variable

images= '../input/flickr8k/Images'

all_imgs = glob.glob(images + '/*.jpg',recursive=True)
print("The total images present in the dataset: {}".format(len(all_imgs)))

In [33]:
#Visualise images
Img = all_imgs[0:3]
fig, axes = plt.subplots(1,3)
fig.set_figwidth(20)

for ax,image in zip(axes,Img):
    ax.imshow(io.imread(image),cmap=None)


In [34]:
#Import the dataset and read the text file into a separate variable

text_file = '../input/flickr8k/captions.txt'

def load_doc(text_file):
    open_file = open(text_file, 'r')
    text = open_file.read()
    open_file.close()
    return text

doc = load_doc(text_file)
print(doc[:300])

Create a dataframe which summarizes the image ID, path & captions

Each image ID has 5 captions associated with it, so the total dataset should have 40455 samples.
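
The parsing step can be sketched on a tiny made-up excerpt. The file names and captions below are hypothetical and only mimic the `image,caption` layout; the point is that a caption may itself contain commas, so this sketch splits on the first comma only and then counts captions per image.

```python
from collections import Counter

# Hypothetical lines in the same "image,caption" layout as captions.txt.
lines = [
    "image,caption",
    "img_001.jpg,A dog runs across the grass .",
    "img_001.jpg,A brown dog, wet from the rain, runs .",
    "img_002.jpg,Two children play on a beach .",
]

def parse_line(line):
    """Split on the first comma only, so commas inside the caption survive."""
    image_id, caption = line.split(",", 1)
    return image_id, caption.strip()

rows = [parse_line(line) for line in lines[1:]]  # lines[0] is the header
captions_per_image = Counter(image_id for image_id, _ in rows)

print(rows[1][1])
print(captions_per_image)
```

On the full dataset the same counter should report 5 for every image ID if the file is complete.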

In [35]:
image_path='../input/flickr8k/Images/'

all_img_id=[] #store all the image id here
all_img_vector=[] #store all the image path here
annotations=[] #store all the captions here

with open('../input/flickr8k/captions.txt','r') as fc:
    next(fc)
    for line in fc:
        split_line = line.split(',', 1) # split on the first comma only, so commas inside a caption are kept
        all_img_id.append(split_line[0])
        all_img_vector.append(image_path + split_line[0])
        annotations.append(split_line[1].rstrip('\n.'))

df = pd.DataFrame(list(zip(all_img_id, all_img_vector,annotations)),columns =['ID','Path', 'Captions']) 
    
df

In [36]:
#Create a list which contains all the captions
#add the <start> & <end> token to all those captions as well
annotations=['<start>' + ' ' +  line + ' ' + '<end>' for line in annotations]

#Create a list which contains all the path to the images
all_img_path=all_img_vector

print("Total captions present in the dataset: "+ str(len(annotations)))
print("Total images present in the dataset: " + str(len(all_img_path)))

In [37]:
annotations[0:5]

In [38]:
#Create the vocabulary & the counter for the captions

vocabulary= [word.lower() for line in annotations for word in line.split()]

val_count=collections.Counter(vocabulary)
val_count

In [39]:
#Visualise the top 30 occurring words in the captions


#write your code here
top30 = val_count.most_common(30)

words = []
counts = []
for word_count in top30 : 
    words.append(word_count[0])
    counts.append(word_count[1])

plt.figure(figsize=(20,10))
plt.title('Top 30 words in the vocabulary')
plt.xlabel('word')
plt.ylabel('count')
plot = sns.barplot(x=words, y=counts)
for p in plot.patches:
    plot.annotate(format(int(p.get_height())), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
plt.show()

In [40]:
# visualizing the random image and caption:

import random as rn 
    # image
random_index = rn.randint(0, len(df) - 1) # randint includes both end points, so stop at the last valid row
image_id = df.loc[random_index,'ID']
image = plt.imread(images+'/'+ image_id)
plt.title('Image ID : ' +image_id )
plt.imshow(image)
plt.axis('off')
plt.show()
print('Image Shape : ', image.shape,'\n')
    
    # captions
condition = df['ID'] == image_id
print('Captions for Image ID # ', image_id , ' : ')
print(df.loc[condition,'Captions'].values, '\n\n')

## Pre-Processing the captions
1.Create the tokenized vectors by tokenizing the captions, for example by splitting them on spaces & other filters. 
This gives us a vocabulary of all of the unique words in the data. Keep the total vocabulary to the top 5,000 words to save memory.

2.Replace all other words with the unknown token "<unk>".

3.Create word-to-index and index-to-word mappings.

4.Pad all sequences to be the same length as the longest one.
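
The four steps above can be sketched without any framework. This is a simplified illustration, not a re-implementation of the Keras `Tokenizer` (whose handling of `num_words` and of the out-of-vocabulary index differs in detail). The helper names `build_vocab`, `encode` and `pad` and the three captions are made up for the example.

```python
from collections import Counter

def build_vocab(captions, num_words):
    """Keep the most frequent words; ids 0 and 1 are reserved for padding and unknown."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    word_index = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(num_words - 2):
        word_index[word] = len(word_index)
    index_word = {idx: word for word, idx in word_index.items()}
    return word_index, index_word

def encode(caption, word_index):
    """Map each word to its id, falling back to the <unk> id for rare words."""
    unk = word_index["<unk>"]
    return [word_index.get(word, unk) for word in caption.lower().split()]

def pad(sequences, value=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [value] * (longest - len(seq)) for seq in sequences]

captions = [
    "<start> a dog runs <end>",
    "<start> a dog jumps over a log <end>",
    "<start> a cat sleeps <end>",
]
word_index, index_word = build_vocab(captions, num_words=6)
padded = pad([encode(c, word_index) for c in captions])

for row in padded:
    print(row)
```

With a budget of 6 ids, only `a`, `<start>`, `<end>` and `dog` get their own id; every other word maps to the `<unk>` id 1, and the shorter captions are filled with 0 on the right.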

In [41]:
# create the tokenizer

#your code here

tokenizer = keras.preprocessing.text.Tokenizer(
num_words = 5000, 
filters = r'!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ', # raw string, so the backslash is not read as an escape
lower = True, 
split = " ", 
char_level = False, 
oov_token = '<unk>',)

In [42]:
# Create word-to-index and index-to-word mappings.

#your code here

tokenizer.fit_on_texts(annotations)

# Converting sentences to sequences of word token indexes
caption_text_sequences = tokenizer.texts_to_sequences(annotations)
caption_text_sequences[:10]


In [43]:
tokenizer.word_index['PAD'] = 0
tokenizer.index_word[0] = 'PAD'

In [44]:
# Create a word count of your tokenizer to visualize the Top 30 occurring words after text processing

#your code here
words_count = tokenizer.get_config()['word_counts']

words_count_df = pd.DataFrame.from_dict(data = json.loads(words_count), orient='index', columns=['count'])
top_30 = words_count_df.sort_values(by='count', ascending=False)[:30]


plt.figure(figsize=(20,10))
plt.title('Top 30 words in the vocabulary')
plt.xlabel('word')
plt.ylabel('count')
plot = sns.barplot(x=top_30.index , y=top_30['count'])
for p in plot.patches:
    plot.annotate(format(int(p.get_height())), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
plt.show()

In [46]:
# Pad each vector to the max_length of the captions, store it to a variable

caption_sequences_len=[len(seq) for seq in caption_text_sequences] #length of every tokenized caption
longest_word_length= max(caption_sequences_len) #length of the longest caption, in tokens


cap_vector=  tf.keras.preprocessing.sequence.pad_sequences(caption_text_sequences, padding='post',maxlen=longest_word_length,
                                                          dtype='int32', value=0)


print("The shape of Caption vector is :" + str(cap_vector.shape))

## Pre-processing the images

1.Resize them into the shape of (299, 299)

2.Normalize the image within the range of -1 to 1, such that it is in the correct format for InceptionV3. 
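
For 8-bit pixel values, scaling to the range -1 to 1 amounts to dividing by 127.5 and subtracting 1, which is what the InceptionV3 `preprocess_input` helper does. The sketch below shows the arithmetic on single values; the function names are made up for illustration.

```python
def scale_to_unit_range(pixel):
    """Map a pixel value in [0, 255] to [-1.0, 1.0]."""
    return pixel / 127.5 - 1.0

def to_display_range(value):
    """Map a value in [-1.0, 1.0] back to [0.0, 1.0], e.g. before plotting."""
    return (value + 1.0) / 2.0

for pixel in (0, 127.5, 255):
    scaled = scale_to_unit_range(pixel)
    print(pixel, "->", scaled, "->", to_display_range(scaled))
```

The second helper matters when a preprocessed image is shown with `plt.imshow`, which expects float images in the range 0 to 1 and clips anything outside it.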

### FAQs on how to resize the images:
* Since you have a list which contains all the image paths, you need to first convert them to a dataset using <i>tf.data.Dataset.from_tensor_slices</i>. Once you have created a dataset consisting of image paths, you need to apply a function to the dataset which will apply the necessary preprocessing to each image. 
* This function should resize them and also do the necessary preprocessing so that they are in the correct format for InceptionV3.


In [47]:
IMAGE_SHAPE= (299, 299)

In [48]:
#write your code here to create the dataset consisting of image paths
all_img_vector

In [49]:
#write your code here for creating the function. This function should return images & their path

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

In [51]:
#write your code here for applying the function to the image path dataset, such that the transformed dataset should contain images & their path


image,image_path = load_image(r"../input/flickr8k/Images/41999070_838089137e.jpg")
print("Shape after resize :", image.shape)
plt.imshow((image + 1) / 2) # map the [-1, 1] values back to [0, 1] so imshow does not clip them



In [53]:
encode_train_set = sorted(set(all_img_vector))

feature_dict = {}

image_data_set = tf.data.Dataset.from_tensor_slices(encode_train_set)
image_data_set = image_data_set.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(32)

In [54]:
image_data_set

## Load the pretrained Imagenet weights of Inception net V3

1.To keep the memory (RAM) from getting exhausted, extract the features of the images once, using the last layer of the pre-trained model. Running the CNN as part of training would lead to higher computational time.

2.The shape of the output of this layer is 8x8x2048. 

3.Use a function to extract the features of each image in the train & test dataset such that the shape of each image should be (batch_size, 8*8, 2048)
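
The reshape from a grid to a list of positions can be sketched on a small nested list. The 2 x 3 grid with 2 channels below is a made-up stand-in for the real 8 x 8 x 2048 map. Flattening row by row means grid cell `(row, col)` ends up at position `row * width + col`, which is the order assumed later when the 64 attention weights are reshaped back to 8 x 8 for plotting.

```python
def flatten_grid(feature_map):
    """Turn an H x W x C nested list into an (H*W) x C list, row by row."""
    return [cell for row in feature_map for cell in row]

# Hypothetical small grid; each cell stores its own (row, col) so the order is visible.
height, width = 2, 3
feature_map = [[[r, c] for c in range(width)] for r in range(height)]

flat = flatten_grid(feature_map)

row, col = 1, 2
print(len(flat), len(flat[0]))   # number of positions, channels per position
print(flat[row * width + col])   # the cell that was at (row, col)
```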



In [55]:
image_model = tf.keras.applications.InceptionV3(include_top=False,weights='imagenet')

new_input = image_model.input #write code here to get the input of the image_model
hidden_layer = image_model.layers[-1].output #write code here to get the output of the image_model

image_features_extract_model = tf.compat.v1.keras.Model(new_input, hidden_layer)  #build the final model using both input & output layer


In [56]:
all_img_vector

In [58]:
Img_Data_List = sorted(set(all_img_vector)) 

# Creating a Dataset using tf.data.Dataset.from_tensor_slice
Image_Dataset_New = tf.data.Dataset.from_tensor_slices(Img_Data_List)

In [59]:
Image_Dataset_New = Image_Dataset_New.map(load_image,num_parallel_calls=tf.data.experimental.AUTOTUNE)


In [60]:
Image_Dataset_New= Image_Dataset_New.batch(64,drop_remainder=False)

In [61]:
# write the code to apply the feature_extraction model to your earlier created dataset which contained images & their respective paths
# Once the features are created, you need to reshape them such that feature shape is in order of (batch_size, 8*8, 2048)

image_features_dict={}
for image, image_path in tqdm(Image_Dataset_New): #using tqdm as progress bar
    features_for_batch = image_features_extract_model(image) #feeding images from above created dataset to Inception v3 which we build above
    features_for_batch_flattened = tf.reshape(features_for_batch,
                             (features_for_batch.shape[0], -1, features_for_batch.shape[3])) # flatten the 8x8 grid into 64 positions
                                   
    for batch_feat, path in zip(features_for_batch_flattened, image_path):
        feature_path = path.numpy().decode("utf-8")
        image_features_dict[feature_path] =  batch_feat.numpy()


### FAQs on how to store the features:
* You can store the features using a dictionary with the path as the key and values as the feature extracted by the inception net v3 model OR
* You can store using numpy(np.save) to store the resulting vector.

## Dataset creation
1.Apply train_test_split on both image path & captions to create the train & test list. Create the train-test split using 80-20 ratio & random state = 42

2.Create a function which maps the image path to their feature. 

3.Create a builder function to create train & test dataset & apply the function created earlier to transform the dataset

4.Make sure you have done Shuffle and batch while building the dataset

5.The shape of each image in the dataset after building should be (batch_size, 8*8, 2048)

6.The shape of each caption in the dataset after building should be (batch_size, max_len)
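
The idea behind the 80-20 split can be sketched in plain Python. This is not scikit-learn's algorithm and will not reproduce its ordering; it only shows that paths and captions are shuffled together as pairs, with a fixed seed, before being cut. The function `split_pairs` and the toy data are assumptions made for the example.

```python
import random

def split_pairs(paths, captions, test_fraction=0.2, seed=42):
    """Shuffle (path, caption) pairs together, then cut off the last test_fraction."""
    pairs = list(zip(paths, captions))
    random.Random(seed).shuffle(pairs)   # fixed seed makes the split repeatable
    n_test = round(len(pairs) * test_fraction)
    cut = len(pairs) - n_test
    return pairs[:cut], pairs[cut:]

# Toy data: 10 hypothetical images with 5 captions each, as in the real dataset.
paths = [f"img_{i // 5:03d}.jpg" for i in range(50)]
captions = [f"caption {i}" for i in range(50)]

train, test = split_pairs(paths, captions)
print(len(train), len(test))
```

Because the split is done per caption row, the five captions of one image can land on both sides of the split. Splitting on unique image IDs instead would avoid that, if a stricter evaluation is wanted.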


In [62]:
#write your code here

path_train, path_test, cap_train, cap_test = train_test_split(all_img_vector,cap_vector,test_size=0.2,random_state=42)

In [63]:
print("Training data for images: " + str(len(path_train)))
print("Testing data for images: " + str(len(path_test)))
print("Training data for Captions: " + str(len(cap_train)))
print("Testing data for Captions: " + str(len(cap_test)))

In [64]:
# Create a function which maps the image path to their feature. 
# This function will take the image_path & caption and return its feature & respective caption.

def map_func(image,captions):
        image_final = image_features_dict[image.decode('utf-8')]
        return image_final,captions

### FAQs on how to load the features:
* You can load the features using the dictionary created earlier OR
* You can use numpy (np.load) to load the feature vector.

In [66]:
# create a builder function to create dataset which takes in the image path & captions as input
# This function should transform the created dataset(img_path,cap) to (features,cap) using the map_func created earlier
BUFFER_SIZE = 1000
BATCH_SIZE = 256
def gen_dataset(images_data, captions_data):
    
    # Creating a Dataset using tf.data.Dataset.from_tensor_slice 
    dataset = tf.data.Dataset.from_tensor_slices((images_data, captions_data))

    # With num_parallel_calls=AUTOTUNE, the number of parallel calls is set dynamically based on available CPU.
    dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    
    
    
    dataset = (
     dataset.shuffle(BUFFER_SIZE, reshuffle_each_iteration=True) 
    .batch(BATCH_SIZE, drop_remainder=False)
    .prefetch(tf.data.experimental.AUTOTUNE)
    ) 

    return dataset

In [67]:
train_dataset=gen_dataset(path_train,cap_train)
test_dataset=gen_dataset(path_test,cap_test)

In [68]:
sample_img_batch, sample_cap_batch = next(iter(train_dataset))
print(sample_img_batch.shape)  #(batch_size, 8*8, 2048)
print(sample_cap_batch.shape) #(batch_size,max_len)

## Model Building
1.Set the parameters

2.Build the Encoder, Attention model & Decoder

In [69]:
embedding_dim = 256 
units = 512
vocab_size = 5001
train_num_steps =len(path_train) // BATCH_SIZE
test_num_steps = len(path_test) // BATCH_SIZE
max_length=35
features_shape = batch_feat.shape[1]
attention_features_shape = batch_feat.shape[0]

### Encoder

In [70]:
class Encoder(Model):
    def __init__(self,embed_dim):
        super(Encoder, self).__init__()
        self.dense =tf.keras.layers.Dense(embed_dim) #build your Dense layer; the activation is applied in call()
        
    def call(self, features):
        features = self.dense(features) # extract the features from the image shape: (batch, 8*8, embed_dim)
        features = tf.keras.activations.relu(features, alpha=0.01, max_value=None, threshold=0) #applying relu with a 0.01 slope for negative inputs (leaky relu)
        return features

In [71]:
encoder=Encoder(embedding_dim)

### Attention model

In [72]:
class Attention_model(Model):
    def __init__(self, units):
        super(Attention_model, self).__init__()
        self.W1 =tf.keras.layers.Dense(units) #build your Dense layer
        self.W2 = tf.keras.layers.Dense(units)#build your Dense layer
        self.V = tf.keras.layers.Dense(1)#build your final Dense layer with unit 1
        self.units=units

    def call(self, features, hidden):
        #features shape: (batch_size, 8*8, embedding_dim)
        # hidden shape: (batch_size, hidden_size)
        hidden_with_time_axis =  hidden[:, tf.newaxis]# Expand the hidden shape to shape: (batch_size, 1, hidden_size)
        score =tf.keras.activations.tanh(self.W1(features) + self.W2(hidden_with_time_axis)) # build your score function to shape: (batch_size, 8*8, units)
        attention_weights = tf.keras.activations.softmax(self.V(score), axis=1) # extract your attention weights with shape: (batch_size, 8*8, 1)
        context_vector = attention_weights * features # weight the features at each location, shape: (batch_size, 8*8, embedding_dim)
        context_vector =tf.reduce_sum(context_vector, axis=1) # reduce the shape to (batch_size, embedding_dim)
        

        return context_vector, attention_weights
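
The last two steps of the attention model (softmax over the 64 locations, then a weighted sum of the features) can be checked by hand on a tiny example. The sketch below is framework-free and starts from ready-made scores, i.e. it assumes the `W1`, `W2` and `V` layers have already produced one scalar per location. The three 2-dimensional feature vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(features, scores):
    """features: one vector per location; scores: one scalar per location."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, math.log(2.0)]   # the third location scores highest

context, weights = context_vector(features, scores)
print(weights)   # roughly [0.25, 0.25, 0.5]
print(context)   # roughly [0.75, 0.75]
```

The weights always sum to 1, so the context vector is a weighted average of the location features, pulled towards the locations with the highest scores.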

### Decoder

In [73]:
from tensorflow.keras.layers import LSTM, Embedding, TimeDistributed, Dense, RepeatVector, Activation, Flatten, Reshape, concatenate, Dropout, BatchNormalization, Conv2D
from tensorflow.keras import Input, layers
from tensorflow.keras import optimizers

In [74]:
class Decoder(Model):
    def __init__(self, embed_dim, units, vocab_size):
        super(Decoder, self).__init__()
        self.units=units
        self.attention = Attention_model(self.units) #initialise your Attention model with units
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim,mask_zero=False)#build your Embedding layer
        self.gru = tf.keras.layers.GRU(self.units,return_sequences=True,return_state=True,recurrent_initializer='glorot_uniform')
        self.d1 =tf.keras.layers.Dense(self.units) #build your Dense layer
        self.d2 =tf.keras.layers.Dense(vocab_size) #build your Dense layer
        #self.dropout = Dropout(0.5)

    def call(self,x,features, hidden):
        context_vector, attention_weights =self.attention(features, hidden) #create your context vector & attention weights from attention model
        embed = self.embed(x) # embed your input to shape: (batch_size, 1, embedding_dim)
        embed = tf.concat([tf.expand_dims(context_vector, 1), embed], axis=-1) # Concatenate your input with the context vector from attention layer. Shape: (batch_size, 1, embedding_dim + embedding_dim)
        output,state =  self.gru(embed)# Extract the output & hidden state from GRU layer. Output shape : (batch_size, 1, hidden_size), since one token is fed per call
        output = self.d1(output)
        output = tf.reshape(output, (-1, output.shape[2])) # shape : (batch_size * 1, hidden_size)
        output = self.d2(output) # shape : (batch_size * 1, vocab_size)
        
        return output,state, attention_weights
    
    def init_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

In [75]:
decoder=Decoder(embedding_dim, units, vocab_size)

In [76]:
features=encoder(sample_img_batch)

hidden = decoder.init_state(batch_size=sample_cap_batch.shape[0])
dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * sample_cap_batch.shape[0], 1)

predictions, hidden_out, attention_weights= decoder(dec_input, features, hidden)
print('Feature shape from Encoder: {}'.format(features.shape)) #(batch, 8*8, embed_dim)
print('Predictions shape from Decoder: {}'.format(predictions.shape)) #(batch,vocab_size)
print('Attention weights shape from Decoder: {}'.format(attention_weights.shape)) #(batch, 8*8, 1)

## Model training & optimization
1.Set the optimizer & loss object

2.Create your checkpoint path

3.Create your training & testing step functions

4.Create your loss function for the test dataset

In [78]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)#define the optimizer; `lr` is a deprecated alias
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction=tf.keras.losses.Reduction.NONE)#define your loss object

In [79]:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
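
The effect of the mask can be seen on a hand-sized batch. The sketch below is plain Python and works on probabilities rather than logits, so it is an illustration of the masking idea and not a drop-in replacement for `loss_function`. The function name and the numbers are made up.

```python
import math

def masked_cross_entropy(targets, probs, pad_id=0):
    """targets: one token id per example; probs: one probability list per example."""
    losses = []
    for target, dist in zip(targets, probs):
        loss = -math.log(dist[target])
        mask = 0.0 if target == pad_id else 1.0   # padded positions contribute nothing
        losses.append(loss * mask)
    # Same divisor as a plain mean: padded positions are still counted.
    mean_over_all = sum(losses) / len(losses)
    # Alternative divisor: only the real (non-padded) tokens.
    real = sum(1 for t in targets if t != pad_id)
    mean_over_real = sum(losses) / max(real, 1)
    return mean_over_all, mean_over_real

targets = [2, 0, 1, 0]   # two real tokens, two padded positions
probs = [
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
    [0.2, 0.5, 0.3],
    [0.6, 0.2, 0.2],
]
mean_all, mean_real = masked_cross_entropy(targets, probs)
print(round(mean_all, 4), round(mean_real, 4))
```

Averaging over all positions, as a plain mean does, makes the reported loss smaller at time steps where most captions have already ended. Dividing by the number of real tokens is a common alternative when losses need to be comparable across steps.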

In [80]:
checkpoint_path = "./"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

In [81]:
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])

* While creating the training step for your model, you will apply Teacher forcing.
* Teacher forcing is a technique where the target/real word is passed as the next input to the decoder instead of the previous prediction.
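
The difference between the two feeding strategies can be sketched with a toy decoder. `toy_decoder` below is a hypothetical fixed lookup table standing in for the real model; the sketch only records which token is fed to the decoder at each step.

```python
def toy_decoder(prev_token):
    """Hypothetical stand-in for the decoder: a fixed next-word lookup."""
    table = {
        "<start>": "a", "a": "cat", "cat": "sits",
        "dog": "runs", "sits": "<end>", "runs": "<end>",
    }
    return table.get(prev_token, "<end>")

def decoder_inputs(target, teacher_forcing):
    """Return the token fed to the decoder at each time step."""
    inputs = []
    dec_input = target[0]   # always "<start>"
    for t in range(1, len(target)):
        inputs.append(dec_input)
        prediction = toy_decoder(dec_input)
        # Teacher forcing feeds the ground-truth word; otherwise the prediction is fed back.
        dec_input = target[t] if teacher_forcing else prediction
    return inputs

target = ["<start>", "a", "dog", "runs", "<end>"]
print(decoder_inputs(target, teacher_forcing=True))
print(decoder_inputs(target, teacher_forcing=False))
```

With teacher forcing the decoder always sees the true previous word, so one wrong prediction ("cat" instead of "dog") does not derail the following steps. Without it, the mistake is fed back in and carried forward.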

In [82]:
@tf.function
def train_step(img_tensor, target):
    loss = 0
    #hidden = decoder.reset_state(batch_size=target.shape[0]) #we dont have reset_state method
    hidden = decoder.init_state(batch_size=target.shape[0])

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    
    with tf.GradientTape() as tape: #Record operations for automatic differentiation for implementing backpropagation
        #write your code here to do the training steps
        encoder_output = encoder(img_tensor)

        # Using the teacher forcing technique where the target word is passed as the next input to the decoder
        for t in range(1, target.shape[1]):
          # passing encoder_output to the decoder
          predictions, hidden, _ = decoder(dec_input, encoder_output, hidden)

          loss += loss_function(target[:, t], predictions)

          dec_input = tf.expand_dims(target[:, t], 1)
    
    avg_loss = (loss / int(target.shape[1])) #we are calculating average loss for every batch

    tot_trainables_variables = encoder.trainable_variables + decoder.trainable_variables
    
    grads = tape.gradient(loss, tot_trainables_variables) # to calculate gradients with respect to every trainable variable

    #compute gradients and apply it to the optimizer and backpropagate.
    optimizer.apply_gradients(zip(grads, tot_trainables_variables)) 
        
    return loss, avg_loss

* While creating the test step for your model, you will pass your previous prediction as the next input to the decoder.

In [83]:
@tf.function
def test_step(img_tensor, target):
    loss = 0
    #hidden = decoder.reset_state(batch_size=target.shape[0]) #we dont have reset_state method
    hidden = decoder.init_state(batch_size=target.shape[0])

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    # No GradientTape and no optimizer step here: the test data must not update the weights
    encoder_output = encoder(img_tensor)

    for t in range(1, target.shape[1]):
        # passing encoder_output to the decoder
        predictions, hidden, _ = decoder(dec_input, encoder_output, hidden)

        loss += loss_function(target[:, t], predictions)

        # no teacher forcing: the previous prediction is passed as the next input to the decoder
        predicted_id = tf.argmax(predictions, axis=1, output_type=tf.int32)
        dec_input = tf.expand_dims(predicted_id, 1)

    avg_loss = (loss / int(target.shape[1]))#we are calculating average loss for every batch

    return loss, avg_loss

In [84]:
def test_loss_cal(test_dataset):
    total_loss = 0
    num_batches = 0

    #average the per-batch loss over the batches actually seen
    for (batch, (img_tensor, target)) in enumerate(test_dataset):
        batch_loss, t_loss = test_step(img_tensor, target)
        total_loss += t_loss
        num_batches += 1

    avg_test_loss = total_loss / max(num_batches, 1) # drop_remainder=False, so there can be one more batch than test_num_steps
    return avg_test_loss

In [85]:
loss_plot = []
test_loss_plot = []
EPOCHS = 15

best_test_loss=100
for epoch in tqdm(range(0, EPOCHS)):
    start = time.time()
    total_loss = 0

    num_batches = 0
    for (batch, (img_tensor, target)) in enumerate(train_dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        num_batches += 1

    avg_train_loss = total_loss / max(num_batches, 1) # drop_remainder=False, so count the batches actually seen
    loss_plot.append(avg_train_loss)    
    test_loss = test_loss_cal(test_dataset)
    test_loss_plot.append(test_loss)
    
    print ('For epoch: {}, the train loss is {:.3f}, & test loss is {:.3f}'.format(epoch+1,avg_train_loss,test_loss))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
    
    if test_loss < best_test_loss:
        print('Test loss has been reduced from %.3f to %.3f' % (best_test_loss, test_loss))
        best_test_loss = test_loss
        ckpt_manager.save()

In [86]:
plt.plot(loss_plot, label='train loss')
plt.plot(test_loss_plot, label='test loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.legend() # tell the two curves apart
plt.show()

#### NOTE: 
* Since there is a difference between the train & test steps (the presence of teacher forcing), you may observe that the train loss is decreasing while your test loss is not. 
* This doesn't mean that the model is overfitting: the train & test results can't be compared directly here, as the two approaches are different.
* Also, if you want to achieve better results you can run it for more epochs, but the intent of this capstone is to give you an idea of how to integrate an attention mechanism with an encoder-decoder architecture for images. The intent is not to create a state-of-the-art model. 

## Model Evaluation
1.Define your evaluation function using greedy search

2.Define your evaluation function using beam search ( optional)

3.Test it on a sample data using BLEU score

### Greedy Search

In [87]:
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    #hidden = decoder.reset_state(batch_size=1)
    hidden = decoder.init_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0) #process the input image to desired format before extracting features
    img_tensor_val = image_features_extract_model(temp_input) # Extract features using our feature extraction model
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)# extract the features by passing the input to encoder

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)# get the output from decoder

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id =  tf.argmax(predictions[0]).numpy()#extract the predicted id(embedded value) which carries the max value
        result.append(tokenizer.index_word[predicted_id])#map the id to the word from tokenizer and append the value to the result list

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot,predictions

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot,predictions
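
Stripped of the image features, the hidden state and the attention plot, the greedy loop above reduces to the following sketch. `TABLE` is a hypothetical next-word distribution that depends only on the previous word; the real decoder also conditions on the image and on its hidden state.

```python
def greedy_decode(next_word_probs, max_length):
    """next_word_probs: function mapping the previous word to a {word: probability} dict."""
    result = []
    word = "<start>"
    for _ in range(max_length):
        probs = next_word_probs(word)
        word = max(probs, key=probs.get)   # keep only the single most probable word
        result.append(word)
        if word == "<end>":
            break
    return result

# Hypothetical next-word probabilities.
TABLE = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.55, "<end>": 0.45},
    "runs": {"<end>": 0.9, "fast": 0.1},
}

caption = greedy_decode(lambda w: TABLE[w], max_length=10)
print(caption)
```

Greedy search commits to one word per step and never revisits the choice, which keeps it cheap but can miss a caption whose first word is less likely while the whole sentence is more likely.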


### Beam Search(optional)
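
Beam search is not implemented in this notebook. As a minimal sketch of the idea, assuming a hypothetical next-word table in place of the real decoder, the function below keeps the `beam_width` best partial captions at every step, ranked by summed log-probability. It applies no length normalization, which practical implementations often add.

```python
import math

def beam_search(next_word_probs, beam_width, max_length):
    """Return the best caption (without <start>) and its probability."""
    beams = [(0.0, ["<start>"])]   # (summed log-probability, words so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, words in beams:
            for word, prob in next_word_probs(words[-1]).items():
                candidates.append((score + math.log(prob), words + [word]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_width]:
            # Captions that reached <end> leave the beam and are kept aside.
            (finished if words[-1] == "<end>" else beams).append((score, words))
        if not beams:
            break
    finished.extend(beams)   # captions cut off at max_length
    best_score, best_words = max(finished, key=lambda item: item[0])
    return best_words[1:], math.exp(best_score)

# Hypothetical table in which the greedy first choice ("a") is not the best overall.
TABLE = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}

words, prob = beam_search(lambda w: TABLE[w], beam_width=2, max_length=10)
print(words, round(prob, 2))
```

With `beam_width=1` the procedure reduces to greedy search and starts with "a". With a width of 2 it keeps "the" alive long enough to find the more probable caption overall.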

In [88]:
from PIL import Image
def plot_attmap(caption, weights, image):

    fig = plt.figure(figsize=(30, 30))
    temp_img = np.array(Image.open(image))
    
    len_cap = len(caption)
    grid = int(np.ceil(np.sqrt(len_cap))) # smallest square grid that fits every word
    for cap in range(len_cap):
        weights_img = np.reshape(weights[cap], (8,8))
        weights_img = np.array(Image.fromarray(weights_img).resize((224, 224), Image.LANCZOS))
        
        ax = fig.add_subplot(grid, grid, cap+1)
        ax.set_title(caption[cap], fontsize=15)
        
        img=ax.imshow(temp_img)
        
        ax.imshow(weights_img, cmap='gist_heat', alpha=0.6,extent=img.get_extent())
        ax.axis('off')
    plt.subplots_adjust(hspace=0.2, wspace=0.2)
    plt.show()

In [89]:
from nltk.translate.bleu_score import sentence_bleu

In [90]:
def filt_text(text):
    filt=['<start>','<unk>','<end>'] 
    temp = [word for word in text.split() if word not in filt] # build a new list instead of removing items while iterating
    text=' '.join(temp)
    return text

In [97]:
# Greedy Search Evaluation on a test image , caption
rid = np.random.randint(0, len(path_test))
test_image = path_test[rid]

real_caption = ' '.join([tokenizer.index_word[i] for i in cap_test[rid] if i not in [0]])
result, attention_plot,pred_test = evaluate(test_image)


real_caption=filt_text(real_caption)      


pred_caption=' '.join(word for word in result if word != '<end>') # drop only the end token, even if max_length was hit first

real_appn = []
real_appn.append(real_caption.split())
reference = real_appn
candidate = pred_caption.split()

score = sentence_bleu(reference, candidate, weights=(0,0,1,0))
print(f"BLEU score: {score*100}")
print('Real Caption:', real_caption)
print('Prediction Caption:', pred_caption)
plot_attmap(result, attention_plot, test_image)


Image.open(test_image)
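
With `weights=(0,0,1,0)` only trigrams contribute to the score, so BLEU reduces to the clipped trigram precision multiplied by the brevity penalty. The sketch below computes those two pieces by hand for a single reference. It is a simplified illustration and not NLTK's implementation, and the two sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def brevity_penalty(reference, candidate):
    """Penalize candidates that are shorter than the reference."""
    if len(candidate) > len(reference):
        return 1.0
    if not candidate:
        return 0.0
    return math.exp(1 - len(reference) / len(candidate))

reference = "a dog runs across the grass".split()
candidate = "a dog runs on the grass".split()

p3 = modified_precision(reference, candidate, 3)
bp = brevity_penalty(reference, candidate)
print(p3, bp, bp * p3)
```

Only one of the four candidate trigrams ("a dog runs") appears in the reference, so a single changed word costs three trigram matches. This is why trigram-only scores on short captions are often low or zero even when the caption reads well.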

## Summary

This is the final submission file for the Capstone project - Eye for Blind.
A CNN-RNN based attention model has been built on the Flickr8k dataset to predict captions for random images. The model selects captions using greedy search, and the resulting captions are evaluated using the BLEU score.

* The project began with reading the images and captions.
* Sample images and captions were displayed, and the image IDs, paths and captions were stored in a new dataframe.
* EDA was performed to understand the dataset.
* The captions were cleaned by stripping the trailing newline and full stop.

Data preprocessing was also performed, which included:

* tokenizing the captions and padding them into integer vectors of equal length,
* resizing and normalizing the images for InceptionV3,

* splitting the data into train and test sets.

To extract features from images:
* The InceptionV3 model is used. Inception v3 is a widely used image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset, so the same model is used here to get the feature vectors.
* We are not classifying the images; we only need a feature vector for each image. Hence the classification (softmax) head is removed from the model.
* This feature vector is given as input to the encoder.


* The output from the encoder, the hidden state and the start token are passed as input to the decoder.
* The decoder attends over the image features to predict the next word.

Attention model:
* We used the attention model so that the decoder focuses on a specific part of the picture at a time rather than on the whole picture.
* This is intended to reduce noise and improve accuracy. The decoder returns the scores for the next word, its hidden state and the attention weights as output.


Calculate loss:
* The predictions are used to calculate the loss with sparse categorical cross-entropy (`SparseCategoricalCrossentropy`); padded positions are masked out.

`<end>` token:
    
* The decoder stops predicting when the model emits the `<end>` token.

How the model predicts captions for an image:

* At each step the decoder outputs a score for every word in the vocabulary, which corresponds to a probability distribution over the next word.


Using the greedy search method:
* At each step it looks at the decoder's scores over the whole vocabulary.
* It picks the single word with the highest score, feeds it back as the next input, and repeats until the `<end>` token or the maximum length is reached.

Finally, we use the BLEU (Bilingual Evaluation Understudy) score as the evaluation metric for the predicted caption. It measures the n-gram overlap between the predicted caption and the real caption.

The model works really well. :)