# Image captioning with a recurrent model (_RNN_)

This notebook describes a model that uses recurrent layers for image captioning.

The model is similar to [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).

The implementation is based on [Implementation of Attention Mechanism for Caption Generation on Transformers using TensorFlow](https://www.tensorflow.org/tutorials/text/image_captioning).


***Dataset:***

This notebook uses the [MS-COCO](http://cocodataset.org/#home) dataset to train and test the model.

## 1. Import libraries

In [None]:
import sys

In [None]:
import tensorflow as tf

import matplotlib.pyplot as plt

import collections
import random
import numpy as np
import pandas as pd
import os
import time
import json
from PIL import Image

from tqdm import tqdm

In [None]:
import datetime
import json
import re
from pathlib import Path   

In [None]:
import warnings
warnings.filterwarnings("ignore")

## 2. Prepare the environment and the _MS COCO_ dataset

Before running the notebook, download the _MS COCO_ dataset, create a directory named "ms-coco", and organize the files as follows:

---
```
ms-coco
  annotations
  images
    train2014
    val2014
```
---

The following code checks that the expected contents of the ms-coco directory exist. The ***CUDA_VISIBLE_DEVICES*** environment variable selects which GPUs to use.
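The same layout check can be sketched with `pathlib`. This is a minimal sketch rather than the notebook's own check: the helper name `missing_coco_dirs` is hypothetical, and the expected subdirectories are taken from the tree shown above.

```python
from pathlib import Path

# Subdirectories expected under ms-coco/, taken from the layout described above.
EXPECTED_SUBDIRS = ("annotations", "images/train2014", "images/val2014")


def missing_coco_dirs(coco_root):
    """Return the expected subdirectories that do not exist under coco_root."""
    root = Path(coco_root)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]
```

An empty list means the layout is complete. Otherwise the list names what is missing, which allows a more specific error message than a generic exception.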

In [None]:
# [IMPORTANT]: Set CUDA_VISIBLE_DEVICES
# os.environ["CUDA_VISIBLE_DEVICES"]= "2,3"

In [None]:
root_dir = "/".join(os.getcwd().split("/")[0:-1])+"/"
print("INFO: The project root directory is:", root_dir)

In [None]:
coco_dir="ms-coco/"
annotation_folder = "annotations/"
image_folder = "images/"

if not os.path.exists(root_dir + coco_dir + annotation_folder) or not os.path.exists(root_dir + coco_dir + image_folder):
    raise Exception('ERR: Missing files..')

### Load the _dataset_

In [None]:
with open(root_dir + coco_dir + annotation_folder + 'captions_train2014.json') as f:
    annotations = json.load(f)

image_path_to_caption = collections.defaultdict(list)
for val in annotations['annotations']:
    caption = f"<start> {val['caption']} <end>"
    image_path = root_dir +coco_dir + 'images/train2014/' + 'COCO_train2014_' + '%012d.jpg' % (val['image_id'])
    image_path_to_caption[image_path].append(caption)

In [None]:
with open(root_dir + coco_dir + annotation_folder + 'captions_val2014.json') as f:
    # Replace the training annotations with the validation ones for the loop below
    annotations = json.load(f)

for val in annotations['annotations']:
    caption = f"<start> {val['caption']} <end>"
    image_path = root_dir + coco_dir + 'images/val2014/' + 'COCO_val2014_' + '%012d.jpg' % (val['image_id'])
    image_path_to_caption[image_path].append(caption)
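The file names built above follow the COCO 2014 convention of zero-padding the image id to 12 digits. A small sketch of that mapping follows. The helper name `coco_image_filename` is hypothetical and only mirrors the string formatting used in the cells above.

```python
def coco_image_filename(image_id, split):
    """Build the file name for a COCO 2014 image id; split is 'train' or 'val'."""
    # '%012d' left-pads the id with zeros to a width of 12 digits.
    return 'COCO_%s2014_%012d.jpg' % (split, image_id)


print(coco_image_filename(9, 'train'))  # COCO_train2014_000000000009.jpg
```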

### _Dataset_ size

In [None]:
image_paths = list(image_path_to_caption.keys())
random.shuffle(image_paths)
print('INFO: Size of image_paths:', len(image_paths))

In [None]:
all_captions = []
img_name_vector = []

for image_path in image_paths:
    caption_list = image_path_to_caption[image_path]
    all_captions.extend(caption_list)
    img_name_vector.extend([image_path] * len(caption_list))

In [None]:
print('INFO: Reference caption: ' + ' '.join(all_captions[0].split(' ')[1:-1]))
Image.open(img_name_vector[0])

## 3. Image preprocessing

Feature extraction uses the _InceptionV3_ network (pretrained on _ImageNet_).

This requires:
- Resizing each image to 299 px by 299 px.
- Normalizing the images with [preprocess_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/inception_v3/preprocess_input), which scales pixel values to the range [-1, 1].
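For reference, the Inception-style normalization maps 8-bit pixel values from [0, 255] to [-1, 1]. The sketch below shows that arithmetic on plain Python numbers. `scale_pixel` is a hypothetical helper for illustration and is not part of TensorFlow.

```python
def scale_pixel(value):
    """Map a pixel value in [0, 255] to [-1, 1], as Inception-style preprocessing does."""
    return value / 127.5 - 1.0


for v in (0, 127.5, 255):
    print(v, '->', scale_pixel(v))
```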

In [None]:
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

### Initialize _InceptionV3_ and load the pretrained _ImageNet_ weights

Next, create a tf.keras model whose output layer is the last convolutional layer of _InceptionV3_. The output of this layer has shape 8x8x2048.


In [None]:
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

In [None]:
encode = sorted(set(img_name_vector))

image_dataset = tf.data.Dataset.from_tensor_slices(encode)
image_dataset = image_dataset.map(
  load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(16)

if not os.path.exists(img_name_vector[0]+'.npy'):
    for img, path in tqdm(image_dataset):
        batch_features = image_features_extract_model(img)
        batch_features = tf.reshape(batch_features,
                                  (batch_features.shape[0], -1, 
                                   batch_features.shape[3]))

        for bf, p in zip(batch_features, path):
            path_of_feature = p.numpy().decode("utf-8")
            np.save(path_of_feature, bf.numpy())
else:
    print("INFO: Features in:", root_dir + coco_dir + 'images/[val2014|train2014]/')
    

## 4. Caption preprocessing and tokenization

Procedure:
* The captions are split into tokens.
* The vocabulary is limited to the top 5,000 words; every other word is replaced with the "<unk>" (unknown) token.
* Words are mapped to indices (word-to-index) and indices to words (index-to-word).
* All sequences are padded to the length of the longest one.
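
The four steps above can be sketched in plain Python. This is a simplified illustration, not a reimplementation of the Keras `Tokenizer` (index assignment and character filtering differ in detail). The function names `build_vocab` and `encode_and_pad` are hypothetical.

```python
from collections import Counter


def build_vocab(captions, top_k):
    """Map the top_k most frequent words to indices; 0 is <pad>, 1 is <unk>."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    word_index = {'<pad>': 0, '<unk>': 1}
    for word, _ in counts.most_common(top_k):
        word_index[word] = len(word_index)
    index_word = {i: w for w, i in word_index.items()}
    return word_index, index_word


def encode_and_pad(captions, word_index):
    """Convert captions to index sequences, post-padded with 0 to the longest one."""
    seqs = [[word_index.get(w, word_index['<unk>']) for w in cap.lower().split()]
            for cap in captions]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]


captions = ['<start> a dog runs <end>', '<start> a cat <end>']
word_index, index_word = build_vocab(captions, top_k=3)
print(encode_and_pad(captions, word_index))
```

With `top_k=3`, only the three words that appear in both captions keep their own index; "dog", "runs", and "cat" all fall back to the unknown token.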



In [None]:
# Function to find the maximum length of a caption
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

In [None]:
# Keep the top 5000 words of the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters=r'!"#$%&()*+.,-/:;=?@[\]^_`{|}~')
tokenizer.fit_on_texts(all_captions)

In [None]:
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

In [None]:
# Build the tokenized sequences
all_seqs = tokenizer.texts_to_sequences(all_captions)

In [None]:
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(all_seqs, padding='post')

In [None]:
max_length = calc_max_length(all_seqs)

## 5. Split the _dataset_ and create the tf.data dataset

In [None]:
split_dir=root_dir+"splits/"


In [None]:
def split_file(split):
    return split_dir + f'karpathy_{split}_images.txt'

In [None]:
def read_split_image_ids_and_paths(split):
    split_df = pd.read_csv(split_file(split), sep=' ', header=None)
    dir_aux = root_dir + coco_dir +'images/'+ split_df.iloc[:,0]
    return split_df.iloc[:,1].to_numpy(), dir_aux.to_numpy()
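
`read_split_image_ids_and_paths` expects each split file to hold one image per line, with a relative path in the first column and an id in the second, separated by a space. The contents of the split files are not shown in this notebook, so the sample lines below are an assumption made for illustration. A sketch of the same parsing with the standard library, where `parse_split` is a hypothetical name:

```python
import csv
import io

# Hypothetical sample of a split file: "<relative path> <image id>" per line.
sample = io.StringIO(
    "val2014/COCO_val2014_000000391895.jpg 391895\n"
    "train2014/COCO_train2014_000000000009.jpg 9\n"
)


def parse_split(handle, images_root):
    """Return (ids, full_paths) from a space-separated split file."""
    ids, paths = [], []
    for rel_path, image_id in csv.reader(handle, delimiter=' '):
        ids.append(int(image_id))
        paths.append(images_root + rel_path)
    return ids, paths


ids, paths = parse_split(sample, 'ms-coco/images/')
print(ids)
print(paths[0])
```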

In [None]:
img_to_cap_vector = collections.defaultdict(list)
for img, cap in zip(img_name_vector, cap_vector):
    img_to_cap_vector[img].append(cap)
    
img_name_train_keys = read_split_image_ids_and_paths('train')[1]

img_name_train = []
cap_train = []

for imgt in img_name_train_keys:
    capt_len = len(img_to_cap_vector[imgt])
    
    img_name_train.extend([imgt] * capt_len)
    cap_train.extend(img_to_cap_vector[imgt])

In [None]:
print("INFO: Size of the train dataset:", len(img_name_train))

In [None]:
img_name_val_keys = read_split_image_ids_and_paths('valid')[1] 

img_name_val = []
cap_val = []

for imgv in img_name_val_keys:
    capv_len = len(img_to_cap_vector[imgv])
    
    img_name_val.extend([imgv] * capv_len)
    cap_val.extend(img_to_cap_vector[imgv])

In [None]:
print("INFO: Size of the val dataset:", len(img_name_val))

In [None]:
img_name_test_keys = read_split_image_ids_and_paths('test')[1]

img_name_test = []

for img_test in img_name_test_keys:
    img_name_test.extend([img_test])

In [None]:
print("INFO: Size of the test dataset:", len(img_name_test))

### Create the tf.data dataset


In [None]:
BATCH_SIZE = 500
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = top_k + 1
num_steps = len(img_name_train) // BATCH_SIZE
print("INFO: Number of steps:", num_steps)

features_shape = 2048
attention_features_shape = 64

In [None]:
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.AUTOTUNE)
                     
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

## 6. Model

In this section:

* Features are extracted from the last convolutional layer of _InceptionV3_, giving a tensor of shape (8, 8, 2048) that is reshaped to (64, 2048).
* That tensor is passed through the CNN encoder (which consists of a single _Fully Connected_ layer).
* The recurrent network (GRU) predicts the next word.
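
The decoder defined below uses attention to reduce the 64 location vectors to a single context vector: scores are turned into weights with a softmax over locations, and the features are summed with those weights. A dependency-free sketch of that step on toy data follows. `attend` is a hypothetical helper, and the scores are passed in directly instead of being computed by learned dense layers.

```python
import math


def attend(features, scores):
    """Softmax the per-location scores and return (context_vector, weights)."""
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    # Weighted sum over locations, one output value per feature dimension.
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 locations, 2 feature dims
context, weights = attend(features, scores=[0.0, 0.0, 0.0])
print(context, weights)
```

With equal scores every location gets the same weight, so the context vector is the plain average of the features. A much larger score on one location makes the context vector approach that location's features.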

In [None]:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):

        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        attention_hidden_layer = (tf.nn.tanh(self.W1(features) +
                                             self.W2(hidden_with_time_axis)))

        
        score = self.V(attention_hidden_layer)

        attention_weights = tf.nn.softmax(score, axis=1)

        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights

In [None]:
class CNN_Encoder(tf.keras.Model):
    # This encoder passes the features through a Fully Connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()

        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

In [None]:
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # Attention is defined as a separate model
        context_vector, attention_weights = self.attention(features, hidden)

        # After the embedding, x has shape (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # Pass the concatenated vector to the GRU
        output, state = self.gru(x)

        x = self.fc1(output)

        x = tf.reshape(x, (-1, x.shape[2]))

        x = self.fc2(x)

        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

In [None]:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

In [None]:
optimizer = tf.keras.optimizers.Adam()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
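
`loss_function` masks padded positions (index 0) so they contribute no loss, then averages over every position in the batch, padded ones included. A plain-Python sketch of that arithmetic follows. `masked_mean_loss` is a hypothetical name and the per-token losses are made-up numbers.

```python
def masked_mean_loss(real, per_token_loss):
    """Zero the loss where real == 0 (padding), then average over all positions."""
    masked = [loss if r != 0 else 0.0 for r, loss in zip(real, per_token_loss)]
    return sum(masked) / len(masked)


# Two real tokens and two padded positions.
print(masked_mean_loss([5, 7, 0, 0], [2.0, 4.0, 9.0, 9.0]))  # 1.5
```

Because the mean divides by the full length, heavily padded batches report a smaller value. Dividing by the number of unmasked tokens is a common alternative.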

## 7. Training 

### Checkpoint

In [None]:
!rm -rf checkpoints/train/*

In [None]:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

In [None]:
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # Restore the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

---



In [None]:
loss_plot = []

In [None]:
@tf.function
def train_step(img_tensor, target):
    loss = 0
  
    # Initialize the hidden state for each batch because
    # the captions are not related from one image to the next.
    hidden = decoder.reset_state(batch_size=target.shape[0])
  
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
  
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
  
        for i in range(1, target.shape[1]):
            # Pass the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)
  
            loss += loss_function(target[:, i], predictions)
 
            dec_input = tf.expand_dims(target[:, i], 1)
  
    total_loss = (loss / int(target.shape[1]))
  
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
  
    gradients = tape.gradient(loss, trainable_variables)
  
    optimizer.apply_gradients(zip(gradients, trainable_variables))
  
    return loss, total_loss
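
`train_step` uses teacher forcing: at each step the decoder receives the ground-truth previous token rather than its own prediction. The sketch below lists the (input, target) pairs that the loop visits for one caption. `teacher_forcing_pairs` is a hypothetical helper.

```python
def teacher_forcing_pairs(target):
    """Return (decoder_input, expected_output) pairs for one tokenized caption."""
    # Step i feeds token i-1 and is scored against token i, starting from <start>.
    return [(target[i - 1], target[i]) for i in range(1, len(target))]


print(teacher_forcing_pairs(['<start>', 'a', 'dog', '<end>']))
```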

In [None]:
tf.autograph.set_verbosity(0)
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)

In [None]:
EPOCHS = 20

for epoch in tqdm(range(start_epoch, EPOCHS)):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        
        if batch % 100 == 0:
            average_batch_loss = batch_loss.numpy()/int(target.shape[1])
            print(f'Epoch {epoch+1} Batch {batch} Loss {average_batch_loss:.4f}')
            
    # Store the average loss for this epoch
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
        ckpt_manager.save()
        
    

    print(f'Epoch {epoch+1} Loss {total_loss/num_steps:.6f}')
    print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

In [None]:
plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

## 8. Generate captions


In [None]:
def generate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0],
                                                 -1,
                                                 img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input,
                                                         features,
                                                         hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
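
`generate` picks each word with `tf.random.categorical`, which samples from the predicted distribution, so repeated calls on the same image can produce different captions. A dependency-free sketch contrasting sampling with a greedy choice follows. Both helper names are hypothetical.

```python
import math
import random


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def greedy_from_logits(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.1, 2.0, 0.3]
rng = random.Random(0)
print(sample_from_logits(logits, rng), greedy_from_logits(logits))
```

Sampling gives more varied captions, while the greedy choice is deterministic. Which one suits evaluation better depends on the metric being reported.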

In [None]:
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    for i in range(len_result):
        temp_att = np.resize(attention_plot[i], (8, 8))
        # add_subplot needs integer grid dimensions, so cast the ceil result
        grid_size = max(int(np.ceil(len_result/2)), 2)
        ax = fig.add_subplot(grid_size, grid_size, i+1)
        ax.set_title(result[i])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

### Example images with the generated caption

In [None]:
start_token = tokenizer.word_index['<start>']
end_token = tokenizer.word_index['<end>']
# Pick a random image from the validation set.
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = [tokenizer.index_word[i]
                for i in cap_val[rid] if i not in [0]]
result, attention_plot = generate(image)


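# Remove "<unk>" tokens by filtering into new lists
# (calling remove() while iterating over the same list can skip elements)
result = [word for word in result if word != "<unk>"]
real_caption = [word for word in real_caption if word != "<unk>"]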
        
real_caption = ' '.join(real_caption)
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
        
print('Reference caption:', real_caption)
print('Generated caption:', ' '.join(word for word in result[:-1]))
temp_image = np.array(Image.open(image))
plt.imshow(temp_image)
plt.axis('off')


In [None]:
plot_attention(image, result, attention_plot)

### Generate captions for the test and val images

In [None]:

def f_create_json(img_name, split_val):
    date = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    list_pred = []
    list_true = []

    idx = 0

    # Compile the pattern once; it extracts the numeric id from the COCO file name
    regex_expression = re.compile(r'(?P<prefix>COCO_(train|val)2014_)(?P<number>[0-9]+)')

    for image in tqdm(img_name):
        dict_pred = {}
        dict_true = {}

        img_id = int(regex_expression.match(Path(image).stem).group('number'))
        caption_list, _ = generate(image)

        dict_pred['image_id'] = img_id
        dict_pred['caption'] = ' '.join(word for word in caption_list[:-1]).replace('<unk>', '')

        list_pred.append(dict_pred)

        if split_val:
            dict_true['image_id'] = img_id
            dict_true['caption'] = ' '.join([tokenizer.index_word[i] for i in cap_val[idx] if i not in [0]])
            list_true.append(dict_true)
            idx += 1

    # Make sure the output directory exists before writing
    os.makedirs('output', exist_ok=True)
    full_file_name = 'output/rnn-tf-' + date

    with open(full_file_name + '-predictions.json', 'w') as f:
        json.dump(list_pred, f)

    print('Predictions file:', full_file_name + '-predictions.json')

    if split_val:
        with open(full_file_name + '-true.json', 'w') as f:
            json.dump(list_true, f)
        print('References file:', full_file_name + '-true.json')

    return dict_pred
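
The image id is recovered from the file name with a regular expression. A standalone sketch of that step follows, with an explicit error for names that do not match. `coco_image_id` is a hypothetical helper and the sample path is made up.

```python
import re
from pathlib import Path

COCO_NAME = re.compile(r'COCO_(?:train|val)2014_(?P<number>[0-9]+)')


def coco_image_id(image_path):
    """Extract the integer image id from a COCO 2014 file path."""
    match = COCO_NAME.match(Path(image_path).stem)
    if match is None:
        raise ValueError(f'not a COCO 2014 file name: {image_path}')
    return int(match.group('number'))


print(coco_image_id('ms-coco/images/val2014/COCO_val2014_000000391895.jpg'))  # 391895
```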

### Test Dataset

In [None]:
# Create the JSON file with the captions for the test dataset
dict_pred = f_create_json(img_name_test, split_val = False)

---

In [None]:
aux = ['a', 'building', 'sitting', 'next', 'to', 'a', '<unk>','building', '<end>']
# Filter into a new list; calling remove() while iterating over aux can skip elements
aux = [word for word in aux if word != "<unk>"]
print(aux)
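
Stripping "<unk>" tokens by calling `remove` while iterating over the same list can skip elements when two unknown tokens are adjacent. The sketch below compares that approach with a list comprehension on a made-up token list.

```python
tokens = ['a', '<unk>', '<unk>', 'dog']

# Mutating the list while iterating skips the element after each removal.
buggy = list(tokens)
for t in buggy:
    if t == '<unk>':
        buggy.remove(t)

# Building a new list visits every element exactly once.
clean = [t for t in tokens if t != '<unk>']

print(buggy)  # ['a', '<unk>', 'dog']
print(clean)  # ['a', 'dog']
```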