# Image captioning with visual attention

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/text/image_captioning">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/image_captioning.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/image_captioning.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/image_captioning.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

Given an image like the example below, the goal is to generate a caption such as "a surfer riding on a wave".

![Man Surfing](https://tensorflow.org/images/surf.jpg)

*[Image Source](https://commons.wikimedia.org/wiki/Surfing#/media/File:Surfing_in_Hawaii.jpg); License: Public Domain*

To accomplish this, you'll use an attention-based model, which lets you see which parts of the image the model focuses on as it generates a caption.

![Prediction](https://tensorflow.org/images/imcap_prediction.png)

The model architecture is similar to [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).
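To make the attention step concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) attention over image regions. It is not the notebook's implementation: the function name `additive_attention`, the random weights, and all sizes are assumptions chosen only for illustration.

```python
import numpy as np

def additive_attention(features, hidden, W1, W2, v):
    """Additive attention over image regions (illustrative sketch).

    features: (num_regions, feature_dim) encoder outputs, one row per region
    hidden:   (hidden_dim,) current decoder state
    W1: (feature_dim, units), W2: (hidden_dim, units), v: (units,)
    """
    # Score each region: v^T tanh(W1 f_i + W2 h)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    # Softmax over regions, shifted by the max for numerical stability
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the region features
    context = weights @ features                        # (feature_dim,)
    return context, weights

# Hypothetical sizes and random parameters, for demonstration only
rng = np.random.default_rng(0)
num_regions, feature_dim, hidden_dim, units = 64, 128, 256, 256
features = rng.normal(size=(num_regions, feature_dim))
hidden = rng.normal(size=hidden_dim)
W1 = rng.normal(size=(feature_dim, units)) * 0.1
W2 = rng.normal(size=(hidden_dim, units)) * 0.1
v = rng.normal(size=units)

context, weights = additive_attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)
```

The `weights` vector holds one value per image region and sums to 1, which is what makes it possible to plot where the model "looks" for each generated word.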

This notebook is an end-to-end example. When you run the notebook, it downloads the [MS-COCO](http://cocodataset.org/#home) dataset, preprocesses and caches a subset of images using Inception V3, trains an encoder-decoder model, and generates captions on new images using the trained model.

In this example, you will train a model on a relatively small amount of data: the first 30,000 captions, which cover about 20,000 images because the dataset has multiple captions per image.
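The relationship between the caption count and the image count can be sketched as follows. This is a toy example under stated assumptions: the helper `take_first_captions` and the sample file names and captions are hypothetical, not taken from the dataset.

```python
from collections import defaultdict

def take_first_captions(pairs, limit):
    """Keep the first `limit` (image, caption) pairs and group them by image."""
    by_image = defaultdict(list)
    for image, caption in pairs[:limit]:
        by_image[image].append(caption)
    return dict(by_image)

# Hypothetical sample data: several captions can refer to the same image
pairs = [
    ("img_001.jpg", "a surfer riding on a wave"),
    ("img_001.jpg", "a man surfing in the ocean"),
    ("img_002.jpg", "a dog running on the beach"),
    ("img_002.jpg", "a brown dog plays in the sand"),
    ("img_003.jpg", "two people walking along the shore"),
]

subset = take_first_captions(pairs, limit=3)
num_captions = sum(len(captions) for captions in subset.values())
print(num_captions, "captions for", len(subset), "images")
# prints: 3 captions for 2 images
```

Because captions are grouped per image, limiting by caption count always yields fewer distinct images than captions.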

In [None]:
!git clone https://github.com/OmarAtyqy/attention-object-based-captioning.git

Cloning into 'attention-object-based-captioning'...
remote: Enumerating objects: 530, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 530 (delta 8), reused 26 (delta 5), pack-reused 494
Receiving objects: 100% (530/530), 251.73 MiB | 26.65 MiB/s, done.
Resolving deltas: 100% (260/260), done.
Updating files: 100% (31/31), done.


In [None]:
%cd attention-object-based-captioning/
%pwd

[Errno 2] No such file or directory: 'attention-object-based-captioning/'
/content/attention-object-based-captioning


'/content/attention-object-based-captioning'

In [None]:
import os

# Cache locations for Hugging Face models, datasets, metrics and modules
os.environ["HF_HOME"] = "/root/.cache/huggingface/"
os.environ["TRANSFORMERS_CACHE"] = "/root/.cache/huggingface/transformers"
os.environ["HF_DATASETS_CACHE"] = "/root/.cache/huggingface/datasets"
os.environ["HF_METRICS_CACHE"] = "/root/.cache/huggingface/metrics"
os.environ["HF_MODULES_CACHE"] = "/root/.cache/huggingface/modules"

# Replace YOUR_TOKEN with the actual token you generated.
# Never commit a real token to a shared notebook.
os.environ["HF_TOKEN"] = "YOUR_TOKEN"


In [None]:
%cd /content/image_captioning
%pwd

/content/image_captioning


'/content/image_captioning'

In [None]:
!pip install --upgrade pip
!pip install -r requirements.txt

Collecting ultralytics (from -r requirements.txt (line 4))
  Downloading ultralytics-8.2.2-py3-none-any.whl.metadata (40 kB)
     40.4/40.4 kB 804.4 kB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==

In [None]:
import tensorflow as tf

# path to the folder containing the images
images_folder_path = 'data/images'

# path to the file containing the captions (can either be a csv or a txt file)
# It should be structured as follows: image,caption (include the header)
captions_path = 'data/captions.csv'

# preprocess function to use.
preprocess_function = tf.keras.applications.xception.preprocess_input

# validation split (fraction of the data used for validation)
val_split = 0.1

# batch size
# The batch size must be smaller than the number of samples in both the
# training and validation datasets for the generators to work properly
batch_size = 20

# epochs
epochs = 30

# image dimensions
# The Xception model works best with 299x299 images, but you can try smaller sizes if you run into memory issues.
# Neither dimension should be below 71
image_dimensions = (192, 192)

# embedding dimension (dimension of the Dense layer in the encoder and the Embedding layer in the decoder)
embedding_dim = 128

# number of units in the LSTM, Bahdanau attention and Dense layers
units = 256

/content/image_captioning
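The configuration above expects a captions file with an `image,caption` header and a batch size smaller than both the training and validation sets. The following sketch checks those two constraints before training. It only illustrates the idea: the helpers `load_captions`, `split_sizes` and `check_batch_size` are hypothetical, and the way the repository actually splits the data is not shown in this notebook, so the truncating split used here is an assumption.

```python
import csv
import io

def load_captions(file_obj):
    """Read (image, caption) pairs from a CSV with an `image,caption` header."""
    reader = csv.DictReader(file_obj)
    return [(row["image"], row["caption"]) for row in reader]

def split_sizes(num_samples, val_split):
    """Return (train_size, val_size) for a fractional validation split.

    Assumption: the validation size is truncated to an integer.
    """
    val_size = int(num_samples * val_split)
    return num_samples - val_size, val_size

def check_batch_size(batch_size, train_size, val_size):
    """Raise ValueError unless the batch size is smaller than both splits."""
    if batch_size >= min(train_size, val_size):
        raise ValueError(
            f"batch_size={batch_size} must be smaller than both the training "
            f"set ({train_size}) and the validation set ({val_size})"
        )

# Hypothetical in-memory CSV standing in for data/captions.csv
lines = ["image,caption"]
lines += [f"img_{i:03d}.jpg,caption number {i}" for i in range(250)]
rows = load_captions(io.StringIO("\n".join(lines)))

train_size, val_size = split_sizes(len(rows), val_split=0.1)
check_batch_size(20, train_size, val_size)  # passes: 20 < 25 and 20 < 225
print(train_size, val_size)
```

With only 100 samples and the same settings, the validation set would hold 10 samples, so a batch size of 20 would fail the check; in that case, lower `batch_size` or raise `val_split`.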
