<a href="https://colab.research.google.com/github/Ian-Lo/DayOne/blob/master/Master_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table extraction from Financial Documents

This notebook is a rewrite of the TensorFlow image captioning tutorial: https://www.tensorflow.org/tutorials/text/image_captioning

I suggest we adapt the template to the problem of generating HTML code for tables from images. 

In the original paper, the structural decoder is "pre-trained" only on the structural tokens ($\lambda=0$ in loss function). I suggest we take the same approach and expand the code in the following steps:

1.   Rewrite the code into a structural decoder. 
2.   "Pre-train" the structural decoder. 
3.   Extend the code with a cell decoder. 
4.   Train both the structure and the cell decoder ($\lambda\neq0$).
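
A minimal sketch of how the two training phases could share one loss function. It assumes the combined loss has the form $L = L_s + \lambda L_c$, which matches the convention used above ($\lambda=0$ means only the structural tokens contribute); the exact weighting in the original paper may differ. `combined_loss` is a hypothetical helper name, and the loss values are made up for illustration.

```python
def combined_loss(structure_loss, cell_loss, lam):
    """Weighted sum of the two decoder losses (assumed form: L_s + lam * L_c)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return structure_loss + lam * cell_loss

# Step 2: pre-training, the cell decoder does not contribute
pretrain_loss = combined_loss(structure_loss=2.0, cell_loss=5.0, lam=0.0)
# Step 4: joint training, both decoders contribute
joint_loss = combined_loss(structure_loss=2.0, cell_loss=5.0, lam=0.5)
```

With this form, switching from step 2 to step 4 only requires changing `lam`.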


Given an image like the example below, our goal is to generate the corresponding HTML code. 

![tableimage](https://github.com/ibm-aur-nlp/PubTabNet/raw/master/ICDAR_SLR_competition/example.png)

*[Image Source](https://evalai.cloudcv.org/web/challenges/challenge-page/673/overview)*

##########

To accomplish this, you'll use an attention-based model, which enables us to see what parts of the image the model focuses on as it generates a caption.

![Prediction](https://tensorflow.org/images/imcap_prediction.png)

The model architecture is similar to [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).

This notebook is an end-to-end example. When you run the notebook, it downloads the [MS-COCO](http://cocodataset.org/#home) dataset, preprocesses and caches a subset of images using Inception V3, trains an encoder-decoder model, and generates captions on new images using the trained model.

In this example, you will train a model on a relatively small amount of data—the first 30,000 captions  for about 20,000 images (because there are multiple captions per image in the dataset). 

# Import

In [None]:
##### REWRITTEN CODE ######
#### TODO ####
# Check that all imports are necessary (do this at the very end)

import tensorflow as tf

# You'll generate plots of attention in order to see which parts of an image
# our model focuses on during captioning
import matplotlib.pyplot as plt

# Scikit-learn includes many helpful utilities
from sklearn.model_selection import train_test_split

import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle



## Download and prepare the dataset


In [None]:

#### Rewritten code #### 

# Mount Google Drive to be able to work on the data (completed).
# Run this cell to mount the group's Google Drive.
# Sign in with the group's shared Google account when prompted.
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:

#### Rewritten code #### 
# extract dataset to be able to work with it
#if os.path.isdir("/content/pubtabnet")==False:
#  os.mkdir("/content/pubtabnet")

! tar -C /content/ -xvf '/content/drive/My Drive/pubtabnet1000.tar.gz'


tar: /content/drive/My Drive/pubtabnet1000.tar.gz: Cannot read: Input/output error
tar: At beginning of tape, quitting now
tar: Error is not recoverable: exiting now


## Limit the size of the training set 
While developing the code, I recommend using the smallest possible training set. 


In [None]:
#### Rewritten code ####

!pip install jsonlines # install jsonlines

# Import json data from pubtabnet data in the right format to be used in tensorflow. 

# Input: 
#   num_examples (integer) : Number of example files to use in training set
num_examples = 16
num_vals = 16
max_struc_tokens = 300 # skip tables with more than this number of structural tokens
split = 'train'
import jsonlines

# if working on full dataset, uncomment the lines below
#pubtabnet_folder ="/content/pubtabnet/" # folder containing data
#json_file = "PubTabNet_2.0.0.jsonl" #name of file 

# if working on the small set, uncomment the lines below 
pubtabnet_folder ="/content/pubtabnet1000/" # folder containing data
json_file = "pubtabnet1000.jsonl" #name of file 

reader = jsonlines.open(pubtabnet_folder + json_file, 'r')

######: 
from collections import Counter
def get_pubtabnet_data(num_examples, max_struc_tokens, split):
  """Return up to num_examples PubTabNet annotations from the given split.

  split is 'train', 'val' or 'all'. Tables with more than max_struc_tokens
  structural tokens are skipped.
  """
  # Store html tokens and image names in lists
  all_structs = []
  all_cells = []
  all_img_name_vector = []
  # use Counter objects to get dictionaries later on
  struc_tokens = Counter()
  cell_tokens = Counter()
  print("Getting annotations. This may take up to a few minutes.")
  count = 0
  # Open a fresh reader on every call, so that a second call (e.g. for the
  # validation split) starts from the beginning of the file again.
  with jsonlines.open(pubtabnet_folder + json_file, 'r') as reader:
    for annot in reader:
      if split != "all" and annot['split'] != split:
        continue
      structure = annot['html']['structure']['tokens']
      if len(structure) > max_struc_tokens:
        continue
      struc_tokens.update(structure)
      cells = [a['tokens'] for a in annot['html']['cells']]
      cell_tokens.update(token for cell in cells for token in cell)
      # images are stored in one sub-folder per split
      full_image_path = "%s%s/%s" % (pubtabnet_folder, annot['split'], annot['filename'])
      # optional check that the image file exists:
      # if not os.path.isfile(full_image_path):
      #   print("Warning: the following file is not in the pubtabnet folder:", full_image_path)
      # append data to lists
      all_img_name_vector.append(full_image_path)
      all_cells.append(cells)
      all_structs.append(structure)
      count += 1
      # if we have enough annotations: stop reading
      if count >= num_examples:
        break
  print("Got %d annotations" % count)
  return all_structs, all_cells, all_img_name_vector, struc_tokens, cell_tokens

# get training data:
all_structs, all_cells, all_img_name_vector, struc_tokens, cell_tokens = get_pubtabnet_data( num_examples, max_struc_tokens, split)
# get validation data: 
val_structs, val_cells, val_img_name_vector, _, _ = get_pubtabnet_data(num_vals, max_struc_tokens, "val")


# Shuffle 

# Set a random state

from sklearn.utils import shuffle
train_structs, train_cells, img_name_vector = shuffle(all_structs,
                                          all_cells, 
                                          all_img_name_vector, 
                                          random_state=291)


# Output: 
# train_structs (list of lists) : structural tokens for each table in the training set. 
#   Example: [ [t11, t12], [t21] ]
# train_cells (list of lists of lists) : cell tokens for each cell of each table in the training set. 
#   Example: [ [ ["a", "n", " ", "e", ...], [], ["s112"] ], [ ["t21"] ] ]
# img_name_vector (list) : full paths to the training set images 
#   Example: [ "C:/fajflj.png" ]


###### replaced variables######
# train_captions -> train_structs
# There is also a new train_cells variable for when we implement the cell decoder




Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0
<jsonlines.Reader at 0x7faa05c73470 wrapping '/content/pubtabnet1000/pubtabnet1000.jsonl'>
Getting annotations. This may take up to a few minutes.
Got 128 annotations


## Preprocess the images using InceptionV3 or ResNet
The preprocessing of the images consists of extracting the features from the last hidden layer of the encoder. 

We first preprocess the images to the expected format of the CNN. 

InceptionV3's expected format (copied from tensorflow tutorial):
* Resizing the image to 299px by 299px
* [Preprocess the images](https://cloud.google.com/tpu/docs/inception-v3-advanced#preprocessing_stage) using the [preprocess_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/inception_v3/preprocess_input) method to normalize the image so that it contains pixels in the range of -1 to 1, which matches the format of the images used to train InceptionV3.
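
The normalization in the second bullet amounts to a linear rescaling of the pixel values. A small NumPy sketch of that mapping, for illustration only (`scale_to_unit_range` is a hypothetical name; in the notebook the `preprocess_input` call does this for us):

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 127.5 - 1.0

# black, mid-gray and white pixel values
example = scale_to_unit_range([0, 127.5, 255])
```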

ResNet's expected format: 
* req. 1
* req. 2 
* ...

In [None]:
#### Rewritten code #### 
# Nothing has changed in the rewriting, but if we want to replicate Zhong et al. (ResNet), then we must change this function. 
# input: None
# output: function similar to load_image (like in the previous cell) but for ResNet.
 
def load_image(image_path):
  img = tf.io.read_file(image_path)
  img = tf.image.decode_jpeg(img, channels=3)
  img = tf.image.resize(img, (299, 299))
  img = tf.keras.applications.inception_v3.preprocess_input(img)
  return img, image_path



## Initialize encoder and load the pretrained weights
(section below taken from tensorflow tutorial)

We create a tf.keras model whose output layer is the last convolutional layer of the CNN architecture. The shape of the output of this layer is ```8x8x2048```. You use the last convolutional layer because you are using attention in this example. You don't perform this feature extraction during training because it could become a bottleneck.

* You forward each image through the network and store the resulting vector in a dictionary (image_name --> feature_vector).
* After all the images are passed through the network, you pickle the dictionary and save it to disk.
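
Before a feature map is stored, its 8x8 spatial grid is flattened into 64 locations. A minimal NumPy sketch of that reshape, using random values as a stand-in for a real InceptionV3 output:

```python
import numpy as np

# Stand-in for one InceptionV3 feature map (random values, for illustration only)
feature_map = np.random.rand(8, 8, 2048).astype(np.float32)

# Flatten the 8x8 spatial grid into 64 locations, keeping the 2048 channels
flat = feature_map.reshape(-1, feature_map.shape[-1])

# Grid position (row, col) ends up at index row * 8 + col
row, col = 3, 5
same_vector = np.array_equal(flat[row * 8 + col], feature_map[row, col])
```

This index mapping is what later lets us reshape the 64 attention weights back into an 8x8 grid for plotting.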


In [None]:
#### ORIGINAL CODE #####
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

In [None]:
#### Rewritten code #####
# Nothing has changed in the rewriting, but if we want to replicate Zhong et al. (ResNet), then we must change this function. 
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5


## Caching the features extracted from encoder
(section below taken from tensorflow tutorial)

You will pre-process each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but also memory intensive, requiring 8 \* 8 \* 2048 floats per image. At the time of writing, this exceeds the memory limitations of Colab (currently 12GB of memory).
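
As a rough worked example of the memory estimate above, assuming the features are stored as 32-bit floats (`cache_size_gib` is a hypothetical helper):

```python
FLOATS_PER_IMAGE = 8 * 8 * 2048   # one InceptionV3 feature map
BYTES_PER_FLOAT = 4               # float32 (assumption)

def cache_size_gib(num_images):
    """Size of an in-memory feature cache in GiB for num_images images."""
    return num_images * FLOATS_PER_IMAGE * BYTES_PER_FLOAT / 2**30

bytes_per_image = FLOATS_PER_IMAGE * BYTES_PER_FLOAT
```

Each image therefore takes 0.5 MiB, so roughly every 2,000 cached images cost 1 GiB of RAM.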

Performance could be improved with a more sophisticated caching strategy (for example, by sharding the images to reduce random access disk I/O), but that would require more code.

The caching will take about 10 minutes to run in Colab with a GPU. If you'd like to see a progress bar, you can: 

1. install [tqdm](https://github.com/tqdm/tqdm):

    `!pip install tqdm`

2. Import tqdm:

    `from tqdm import tqdm`

3. Change the following line:

    `for img, path in image_dataset:`

    to:

    `for img, path in tqdm(image_dataset):`


In [None]:
#### Rewritten code #####
# input: 
#   img_name_vector
# output:
#   saved features in manner similar to previous cell
from tqdm import tqdm

batch_size = 16
feat_save_dir = "/content/InceptionV3/"

# if the folder doesn't exist, make it
os.makedirs(feat_save_dir, exist_ok=True)

##### cache training set #####
# Get unique images
encode_train = sorted(set(img_name_vector))


# Feel free to change batch_size according to your system configuration
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(
  load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(batch_size)

for img, path in image_dataset:
  batch_features = image_features_extract_model(img)
  batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))

  for bf, p in zip(batch_features, path):
    path_of_feature = os.path.join(feat_save_dir, os.path.basename(p.numpy().decode("utf-8")))
    np.save(path_of_feature, bf.numpy()) # saving features in a separate "features" folder

##### cache validation set #####
# Get unique images
encode_val = sorted(set(val_img_name_vector))


# Feel free to change batch_size according to your system configuration
image_dataset_val = tf.data.Dataset.from_tensor_slices(encode_val)
image_dataset_val = image_dataset_val.map(
  load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(batch_size)

for img, path in image_dataset_val:
  batch_features = image_features_extract_model(img)
  batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))

  for bf, p in zip(batch_features, path):
    path_of_feature = os.path.join(feat_save_dir, os.path.basename(p.numpy().decode("utf-8")))
    np.save(path_of_feature, bf.numpy()) # saving features in a separate "features" folder


## Preprocess and tokenize
(section below taken from tensorflow tutorial)
* First, you'll tokenize the captions (for example, by splitting on spaces). This gives us a  vocabulary of all of the unique words in the data (for example, "surfing", "football", and so on).
* Next, you'll limit the vocabulary size to the top 5,000 words (to save memory). You'll replace all other words with the token "UNK" (unknown).
* You then create word-to-index and index-to-word mappings.
* Finally, you pad all sequences to be the same length as the longest one.

In [None]:
#### Rewritten code #####
# Nothing has been rewritten, but we should consider implementing a version for the cell tokens. 
# Find the maximum length of any caption in our dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

In [None]:
#### Rewritten code #### 
# there are 28 unique structural tokens and 277 unique cell tokens in the training set.
# You can see the distribution of tokens in /Explore_pubtabnet/ on Drive

# Construct dictionaries that convert tokens to integers and integers to tokens, for both structural and cell tokens. 
token2integer_cell ={"<start>":1, "<end>":2, "<oov>":3} # do we need a start and end token?
integer2token_cell = {1:"<start>",2:"<end>", 3:"<oov>"}

integer2token_struc = {1:"<start>",2:"<end>", 3:"<oov>"}
token2integer_struc ={"<start>":1, "<end>":2,"<oov>":3 }

count = 4
for token in struc_tokens:
  if token not in token2integer_struc:
    token2integer_struc[token]=count  
    integer2token_struc[count]=token
    count+=1


count = 4
for token in cell_tokens:
  if token not in token2integer_cell:
    token2integer_cell[token]=count  
    integer2token_cell[count]=token
    count+=1

### translate tokens to list of indices
train_seqs_struc = [list(map(lambda x: token2integer_struc.get(x, token2integer_struc["<oov>"]) , i)) for i in train_structs]
val_seqs_struc = [list(map(lambda x: token2integer_struc.get(x, token2integer_struc["<oov>"]) , i)) for i in val_structs]
# print validation sequences that contain out-of-vocabulary tokens
for l in val_seqs_struc:
  if token2integer_struc["<oov>"] in l:
    print(l)

train_seqs_cell = []
val_seqs_cell = []
for annot in train_cells:
  c = [list(map(lambda x: token2integer_cell.get(x,token2integer_cell["<oov>"] ) , i)) for i in annot]
  train_seqs_cell.append(c)

for annot in val_cells:
  c = [list(map(lambda x: token2integer_cell.get(x,token2integer_cell["<oov>"] ) , i)) for i in annot]
  val_seqs_cell.append(c)

token2integer_cell['<pad>'] = 0
token2integer_struc['<pad>'] = 0
integer2token_cell[0] = "<pad>"
integer2token_struc[0] = "<pad>"


struc_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs_struc, padding='post')
val_struc_vector = tf.keras.preprocessing.sequence.pad_sequences(val_seqs_struc, padding='post')
# not quite sure how to do the same for train_seqs_cell

# Calculates the max_length, which is used to store the attention weights
max_length = max([calc_max_length(train_seqs_struc),calc_max_length(val_seqs_struc)]) 
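
One possible way to pad the nested cell sequences, which the comment in the cell above leaves open. This is only a sketch: `pad_cells` is a hypothetical helper, and it assumes we want a single array of shape (tables, max cells per table, max tokens per cell) filled with the `<pad>` index 0.

```python
import numpy as np

def pad_cells(seqs_cell, pad_value=0):
    """Pad a list of tables, each a list of cells, each a list of token ids."""
    max_cells = max((len(table) for table in seqs_cell), default=0)
    max_len = max((len(cell) for table in seqs_cell for cell in table), default=0)
    padded = np.full((len(seqs_cell), max_cells, max_len), pad_value, dtype=np.int32)
    for t, table in enumerate(seqs_cell):
        for c, cell in enumerate(table):
            padded[t, c, :len(cell)] = cell
    return padded

# Two tables: the first has two cells, the second has one (toy ids)
toy = [[[4, 5, 6], [7]], [[8, 9]]]
cell_vector = pad_cells(toy)
```

For large tables this dense array can become big, so padding per batch rather than over the whole dataset may be the better choice.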

## Split the data into training and testing

In [None]:
##### Rewritten code ######
# We already have the training data so I just redeclare the variables
img_name_train = img_name_vector
struc_train = struc_vector #struc_cap -> struc_train 
img_name_val = val_img_name_vector # for validation set




In [None]:
# len(img_name_train), len(cap_train), len(img_name_val), len(cap_val)
len(img_name_train), len(struc_train)

## Create a tf.data dataset for training


 Our images and captions are ready! Next, let's create a tf.data dataset to use for training our model.

In [None]:
###### REWRITTEN CODE ########
# Feel free to change these parameters according to your system's configuration

BATCH_SIZE = 4 # changed from 64
BUFFER_SIZE = 1000
embedding_dim_struc = 16 # Embedding dimensions of the structural tokens
feature_map_dim = 256 # output dimensions of the feature mapping
units = 256 #  Dimensions of the hidden state of the decoder. This is also the dimensions used for the hidden state of the attention mechanism. They could be different. 
vocab_size = len(token2integer_struc)
num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64

In [None]:
###### REWRITTEN CODE ########
# Load the numpy files
def map_func(img_name, cap):
  img_tensor = np.load(img_name.decode('utf-8')+'.npy')
  return img_tensor, cap

In [None]:
###### REWRITTEN CODE ########
# Replace path here with path to featuremap
path_fmap = "/content/InceptionV3/"
img_name_fmap = [os.path.join(path_fmap, os.path.basename(x)) for x in img_name_train]
img_name_fmap_val = [os.path.join(path_fmap, os.path.basename(x)) for x in img_name_val]

In [None]:
img_name_fmap[:]

In [None]:
###### REWRITTEN CODE ########
dataset = tf.data.Dataset.from_tensor_slices((img_name_fmap, struc_train ))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Shuffle and batch
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

# repeat for validation set 
dataset_val = tf.data.Dataset.from_tensor_slices((img_name_fmap_val, val_struc_vector ))

# Use map to load the numpy files in parallel
dataset_val = dataset_val.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Shuffle and batch
dataset_val = dataset_val.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset_val = dataset_val.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)



## Model

Fun fact: the decoder below is identical to the one in the example for [Neural Machine Translation with Attention](../sequences/nmt_with_attention.ipynb).

The model architecture is inspired by the [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044.pdf) paper.

* In this example, you extract the features from the lower convolutional layer of InceptionV3 giving us a vector of shape (8, 8, 2048).
* You squash that to a shape of (64, 2048).
* This vector is then passed through the CNN Encoder (which consists of a single Fully connected layer).
* The RNN (here GRU) attends over the image to predict the next word.
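
The attention step in the last bullet can be sketched in plain NumPy. This is an illustration under simplifying assumptions, not the notebook's implementation: the weight matrices are random, the biases of the dense layers are omitted, and the sizes (256 features, 32 attention units) are picked arbitrarily.

```python
import numpy as np

def additive_attention(features, hidden, W1, W2, v):
    """Bahdanau-style attention over image locations.

    features: (batch, locations, feat_dim), hidden: (batch, hidden_dim)
    W1: (feat_dim, units), W2: (hidden_dim, units), v: (units,)
    """
    # score every location against the current decoder state
    score = np.tanh(features @ W1 + (hidden @ W2)[:, None, :])  # (batch, locations, units)
    logits = score @ v                                          # (batch, locations)
    # softmax over the locations axis
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # weighted sum of the features
    context = (weights[:, :, None] * features).sum(axis=1)      # (batch, feat_dim)
    return context, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 64, 256))
hidden = rng.normal(size=(2, 256))
context, weights = additive_attention(
    features, hidden,
    W1=rng.normal(size=(256, 32)), W2=rng.normal(size=(256, 32)), v=rng.normal(size=32))
```

The weights sum to 1 over the 64 image locations, and the context vector has the same last dimension as the encoded features.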

In [None]:
###### REWRITTEN CODE #############
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # features (CNN_Encoder output) shape == (batch_size, 64, feature_map_dim)

    # hidden shape == (batch_size, hidden_size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # score shape == (batch_size, 64, hidden_size)
    score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # attention_weights shape == (batch_size, 64, 1)
    # you get 1 at the last axis because you are applying score to self.V
    attention_weights = tf.nn.softmax(self.V(score), axis=1)

    # context_vector shape after sum == (batch_size, feature_map_dim)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [None]:
####### REWRITTEN CODE ########
class CNN_Encoder(tf.keras.Model):
    # The features have already been extracted and saved to disk as .npy files.
    # This encoder passes those features through a fully connected layer.
    def __init__(self, feature_map_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, feature_map_dim)
        self.fc = tf.keras.layers.Dense(feature_map_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

In [None]:
####### REWRITTEN CODE #########
class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim_struc, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim_struc)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x, features, hidden):
    # defining attention as a separate model
    context_vector, attention_weights = self.attention(features, hidden)
#    print("x.shape")
#    print(x.shape)
#    print("features.shape")
#    print(features.shape)
#    print("hidden.shape")
#    print(hidden.shape)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)
#    print("x.shape after")
#    print(x.shape)

    # x shape after concatenation == (batch_size, 1, embedding_dim + feature_map_dim)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU, starting from the previous
    # hidden state so that the decoder state is carried across time steps
    output, state = self.gru(x, initial_state=hidden)

    # shape == (batch_size, 1, hidden_size), since one token is decoded per call
    x = self.fc1(output)

    # x shape == (batch_size, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

In [None]:
###### REWRITTEN CODE ########
encoder = CNN_Encoder(feature_map_dim)
decoder = RNN_Decoder(embedding_dim_struc, units, vocab_size)

In [None]:
###### REWRITTEN CODE ########
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
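
`loss_function` above averages the masked loss over all positions in the batch, padded ones included, so batches with a lot of padding report a smaller loss. A sketch of a variant that divides by the number of real tokens instead, written in NumPy for illustration; `masked_mean` is a hypothetical name and not part of the tutorial.

```python
import numpy as np

def masked_mean(per_token_loss, targets, pad_id=0):
    """Average per-token losses over non-padding positions only."""
    per_token_loss = np.asarray(per_token_loss, dtype=np.float64)
    mask = (np.asarray(targets) != pad_id).astype(np.float64)
    # guard against a batch that consists of padding only
    return float((per_token_loss * mask).sum() / max(mask.sum(), 1.0))

losses = [1.0, 3.0, 2.0, 2.0]
targets = [5, 7, 0, 0]          # the last two positions are padding
# what a plain mean over the masked losses gives
plain = float(np.mean(np.array(losses) * (np.array(targets) != 0)))
masked = masked_mean(losses, targets)
```

Here `plain` is 1.0 while `masked` is 2.0, the actual average loss of the two real tokens.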

Set up summary writers to write the summaries to disk in a different logs directory:




In [None]:
#### REWRITTEN CODE #####
import datetime
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_log_dir = 'logs/gradient_tape/' + current_time + '/train'
val_log_dir = 'logs/gradient_tape/' + current_time + '/val'
train_summary_writer = tf.summary.create_file_writer(train_log_dir)
val_summary_writer = tf.summary.create_file_writer(val_log_dir)

## Checkpoint

In [None]:
###### REWRITTEN CODE ########

checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

In [None]:
###### REWRITTEN CODE ########
start_epoch = 0
if ckpt_manager.latest_checkpoint:
  start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
  # restoring the latest checkpoint in checkpoint_path
  ckpt.restore(ckpt_manager.latest_checkpoint)

## Training

* You extract the features stored in the respective `.npy` files and then pass those features through the encoder.
* The encoder output, the hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder.
* The decoder returns the predictions and the decoder hidden state.
* The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
* Use teacher forcing to decide the next input to the decoder.
* Teacher forcing is the technique where the target word is passed as the next input to the decoder.
* The final step is to compute the gradients by backpropagation and apply them with the optimizer.
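
Teacher forcing, as described in the bullets above, boils down to shifting the target sequence by one position. A minimal sketch in plain Python, assuming every sequence is wrapped in `<start>` and `<end>` (the tokenization cell leaves open whether we need these); `teacher_forcing_pairs` is a hypothetical helper.

```python
START, END = "<start>", "<end>"

def teacher_forcing_pairs(tokens):
    """Return (decoder_input, target) pairs for one wrapped token sequence."""
    wrapped = [START] + list(tokens) + [END]
    # at step i the decoder sees the true token i and must predict token i + 1
    return list(zip(wrapped[:-1], wrapped[1:]))

pairs = teacher_forcing_pairs(["<tr>", "<td>", "</td>", "</tr>"])
```

The first pair feeds `<start>` and expects the first structural token; the last pair expects `<end>`, which is what allows decoding to stop at inference time.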


In [None]:
###### REWRITTEN CODE ######
@tf.function
def train_step(img_tensor, target):
  loss = 0

  # initializing the hidden state for each batch
  # because the captions are not related from image to image
  hidden = decoder.reset_state(batch_size=target.shape[0])

  # Use dictionary instead of tokeniser here
  dec_input = tf.expand_dims([token2integer_struc['<start>']] * target.shape[0], 1)
  with tf.GradientTape() as tape:
      features = encoder(img_tensor)
      for i in range(1, target.shape[1]):
          # passing the features through the decoder
          predictions, hidden, _ = decoder(dec_input, features, hidden)

          loss += loss_function(target[:, i], predictions)

          # using teacher forcing
          dec_input = tf.expand_dims(target[:, i], 1)

  total_loss = (loss / int(target.shape[1]))
  trainable_variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, trainable_variables)
  optimizer.apply_gradients(zip(gradients, trainable_variables))

  return loss, total_loss

In [None]:
###### REWRITTEN CODE #######
EPOCHS = 5
print("Starting training")
for epoch in range(start_epoch, EPOCHS):
    t1 = time.time()

    # on training set: 
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(
              epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
    t2 = time.time()
    print("Time for training step: ", t2-t1, "sec.")            

  ##### TODO: CALCULATE VALIDATION ERROR #####
    # on validation set:  
#    val_loss = 0

#    for (batch, (img_tensor, target)) in enumerate(dataset_val):
#        batch_loss, t_loss = train_step(img_tensor, target)
#        total_loss += t_loss



  #### tensor board stuff
    with train_summary_writer.as_default():
      tf.summary.scalar('loss', total_loss/ num_steps, step=epoch)
        #### add TEDS score below
#        tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch) 

    if epoch % 5 == 0:
      ckpt_manager.save()

    print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                         total_loss/num_steps))
    t3 = time.time()
    print('Time for epoch {} sec\n'.format(t3 - t1))
    print('Expected time left: %.2f sec' % ((t3 - t1) * (EPOCHS - epoch - 1)))

Begin 1601254311.2368708
target
Tensor("target:0", shape=(8, 288), dtype=int32)


KeyboardInterrupt: ignored

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs/gradient_tape

## Caption!

* The evaluate function is similar to the training loop, except you don't use teacher forcing here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
* Stop predicting when the model predicts the end token.
* And store the attention weights for every time step.
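
The evaluate function samples the next token from the predicted distribution (via `tf.random.categorical`) instead of always taking the most likely one. The difference between the two strategies can be sketched in NumPy; `next_token_id` is a hypothetical helper and the logits are made up.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def next_token_id(logits, greedy=True, rng=None):
    """Pick the next token id from a vector of unnormalized scores."""
    logits = np.asarray(logits, dtype=np.float64)
    if greedy:
        # deterministic: always the highest-scoring token
        return int(np.argmax(logits))
    # stochastic: draw a token according to its predicted probability
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=softmax(logits)))

logits = [0.1, 2.5, -1.0, 0.3]
greedy_id = next_token_id(logits)
sampled_id = next_token_id(logits, greedy=False, rng=np.random.default_rng(0))
```

For HTML structure, where a single wrong tag breaks the table, greedy decoding may be the more natural default; sampling gives varied output from run to run.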

In [None]:
###### ORIGINAL CODE ########
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy() # pick a word from the distribution. This is not necessarily argmax.
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

In [None]:
###### REWRITTEN CODE ########
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([ token2integer_struc['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy() # pick a word from the distribution. This is not necessarily argmax. 
        result.append( integer2token_struc[predicted_id])

        if integer2token_struc[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

In [None]:
###### ORIGINAL ########
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(len_result//2, len_result//2, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

In [None]:
###### REWRITTEN CODE ######
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
#    for l in range(len_result):
#        temp_att = np.resize(attention_plot[l], (8, 8))
#        ax = fig.add_subplot(len_result//2, len_result//2, l+1)
        #ax.set_title(result[l])
    img = plt.imshow(temp_image)
#        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

In [None]:
###### ORIGINAL CODE ########
# captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)


In [None]:
###### REWRITTEN CODE ########

# structure tokens for a fixed example from the training set
rid = 5 #np.random.randint(0, len(img_name_val))
image = img_name_train[rid]
real_caption = ' '.join([integer2token_struc[i] for i in struc_vector[rid] if i not in [0]])
print(real_caption)
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)




In [None]:
##### REWRITTEN CODE ######
# Install modules required by IBM TEDS metric calculation
! pip install distance apted lxml 

# Add metric folder to sys path
import sys
sys.path.append("/content/drive/My Drive/metric/")
#Import module implementing TEDS by IBM
from metric import TEDS
#teds = TEDS()
#true_html = real_caption
#pred_html = ' '.join(result)


In [None]:
###### REWRITTEN CODE #######
def build_html_structure(structure_information):
    ''' Build the structure skeleton of the HTML code
        Add the structural <thead> and the structural <tbody> sections to a fixed <html> header
    '''
    
    html_structure = '''<html>
                       <head>
                       <meta charset="UTF-8">
                       <style>
                       table, th, td {
                         border: 1px solid black;
                         font-size: 10px;
                       }
                       </style>
                       </head>
                       <body>
                       <table frame="hsides" rules="groups" width="100%%">
                         %s
                       </table>
                       </body>
                       </html>''' % ''.join(structure_information)
    
    return html_structure

In [None]:

import string
teds = TEDS()
#pred_html = ' '.join(result)
result = "".join(result)
pred_html = build_html_structure(result)
print("Result")
print(result)
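# Build the reference HTML directly from the tokens: splitting real_caption on
# whitespace would drop the leading space of attribute tokens such as ' colspan="2"'.
true_html = build_html_structure([integer2token_struc[i] for i in struc_vector[rid] if i != 0])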
print("True")
print(real_caption)
#print(true_html)
#print(pred_html)
teds.evaluate( true_html, pred_html )
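
`evaluate` appends `<end>` to its result before returning, so that token ends up inside the HTML passed to TEDS. A small sketch of a filter that could be applied to the predicted tokens first; `strip_special_tokens` is a hypothetical helper and the token list is a toy example.

```python
SPECIAL_TOKENS = {"<start>", "<end>", "<pad>", "<oov>"}

def strip_special_tokens(tokens):
    """Drop the decoder's bookkeeping tokens, keeping only HTML structure tokens."""
    return [t for t in tokens if t not in SPECIAL_TOKENS]

predicted = ["<thead>", "<tr>", "<td>", "</td>", "</tr>", "</thead>", "<end>"]
html_tokens = strip_special_tokens(predicted)
```

The filtered list can then be passed to `build_html_structure` in place of the raw prediction.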


## Try it on your own images
For fun, below we've provided a method you can use to caption your own images with the model we've just trained. Keep in mind, it was trained on a relatively small amount of data, and your images may be different from the training data (so be prepared for weird results!)


In [None]:
image_url = 'https://tensorflow.org/images/surf.jpg'
image_extension = image_url[-4:]
image_path = tf.keras.utils.get_file('image'+image_extension,
                                     origin=image_url)

result, attention_plot = evaluate(image_path)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image_path, result, attention_plot)
# opening the image
Image.open(image_path)

# Next steps

Congrats! You've just trained an image captioning model with attention. Next, take a look at this example [Neural Machine Translation with Attention](../sequences/nmt_with_attention.ipynb). It uses a similar architecture to translate between Spanish and English sentences. You can also experiment with training the code in this notebook on a different dataset.