# *Thanks Listed for giving me this opportunity.*


---


# **Generating Captions for Images**

Image captioning is a task that combines computer vision and natural language processing: a model must recognize the content of an image and describe it in a natural language such as English.

This project uses a pre-trained Transformer encoder-decoder model for image captioning: a Vision Transformer (ViT) encodes the image, and a GPT-2 language model decodes that encoding into a caption. The interface is built with Gradio, which provides a user-friendly way to interact with the model.

## **Prerequisites**


*   Python 3.7 or higher
*   Transformers library
*   PyTorch (pre-installed on Google Colab)
*   Pillow
*   Gradio
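
Before installing anything, it can help to see which of these packages are already present in the environment. The snippet below is a minimal sketch of such a check, not part of the original notebook; `REQUIRED_PACKAGES` and `installed_versions` are hypothetical names introduced here, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library, Python 3.8+

# Distribution names (as published on PyPI) for the prerequisites listed above.
REQUIRED_PACKAGES = ["transformers", "torch", "Pillow", "gradio"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED_PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Any package reported as `not installed` is covered by the install command in the next step.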

**Install the required Python packages:**

In [7]:
pip install transformers torch pillow gradio



**Importing the Necessary Libraries**

In [8]:
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image
import gradio as gr

**Loading the Pre-Trained Model and Tokenizer**

In [9]:
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)



VisionEncoderDecoderModel(
  (encoder): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_featur
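
The `Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))` projection in the printout above splits the image into non-overlapping 16x16 patches and maps each one to a 768-dimensional vector. As a rough sketch of what that means for sequence length, the helper below counts the tokens the encoder would see. `vit_sequence_length` is a hypothetical name, and the 224x224 input resolution is an assumption (it is the usual ViT-base setting, but the printout does not state it).

```python
def vit_sequence_length(image_size, patch_size=16, stride=16, extra_tokens=1):
    """Number of tokens a ViT encoder sees for a square image.

    extra_tokens accounts for the [CLS] token that ViT prepends to the patches.
    """
    patches_per_side = (image_size - patch_size) // stride + 1
    return patches_per_side ** 2 + extra_tokens


# Assumed 224x224 input: 14 * 14 patches plus one [CLS] token.
print(vit_sequence_length(224))  # 197
```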

**Defining the Function**

In [4]:
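max_length = 16

def generate_captions(image, num_captions=1):

  try:
    # The Gradio slider may pass a float, so cast before using it as a count.
    num_captions = int(num_captions)

    # The feature extractor expects three-channel RGB input.
    if image.mode != "RGB":
      image = image.convert(mode="RGB")

    pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Calling generate() repeatedly with the same deterministic settings returns
    # the same caption each time. Beam search with num_return_sequences instead
    # returns the top-scoring beams as separate candidate captions.
    output_ids = model.generate(
        pixel_values,
        max_length=max_length,
        num_beams=max(4, num_captions),
        num_return_sequences=num_captions,
    )
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    # Return one caption per line so the Textbox output displays them cleanly.
    return "\n".join(caption.strip() for caption in captions)

  except Exception as e:
    print(f"Error: {e}")
    return ""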
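
Asking for several captions only helps if the decoding strategy can produce more than one candidate: repeating a deterministic decode of the same image yields the same text every time. One common way to get several candidates is beam search, which keeps the best few partial sequences at every step. The block below is a toy, pure-Python sketch of that idea. `TOY_MODEL` and its probabilities are invented for illustration, and `beam_search` is a simplified stand-in rather than the algorithm as implemented in the Transformers library.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
# The numbers are made up purely for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.7, "dog": 0.3},
    "the": {"cat": 0.4, "dog": 0.6},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams, max_length=5):
    """Return up to num_beams finished sequences, best first."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Drop the start and end markers before joining the words.
    return [" ".join(tokens[1:-1]) for _, tokens in finished[:num_beams]]


print(beam_search(TOY_MODEL, num_beams=1))  # ['a cat']
print(beam_search(TOY_MODEL, num_beams=3))  # ['a cat', 'the dog', 'a dog']
```

With a single beam the search is greedy and finds one caption; with three beams it returns three distinct candidates ranked by probability.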

**Creating a Gradio Interface With Slider Option for Multiple Captions**

In [10]:
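# gr.inputs / gr.outputs are deprecated; use the top-level components instead.
inputs = [
    gr.Image(type="pil", label="Upload Image"),
    # step=1 keeps the slider on whole numbers; `value` replaces the old `default`.
    gr.Slider(minimum=1, maximum=5, value=1, step=1, label="Number of Captions")
]
outputs = gr.Textbox(label="Generated Captions")

iface = gr.Interface(fn=generate_captions, inputs=inputs, outputs=outputs, title="Image Caption")

# share=True creates a temporary public URL in addition to the local one.
iface.launch(share=True)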





