# 🧠 Image Captioning with Bidirectional LSTM

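Welcome! This notebook walks you through building an image captioning system: given an image, the model generates a descriptive caption. The classic design pairs a CNN (to extract features from the image) with a Bidirectional LSTM (to model the text). The code cells below take a shortcut and load a pretrained encoder-decoder, `nlpconnect/vit-gpt2-image-captioning`, which pairs a Vision Transformer (ViT) encoder with a GPT-2 decoder. They also translate captions into Hindi and generate images from captions with Stable Diffusion. Let’s dive in!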


In [None]:
# sentencepiece is required by the Marian tokenizer behind the translation model
!pip install transformers diffusers gradio torch datasets accelerate torchvision sentencepiece

## 🖼️ Extracting Features from Images

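To understand the content of each image, we use a pretrained vision encoder to extract meaningful features. A CNN such as ResNet is the classic choice; the captioning model loaded later in this notebook uses ViT, which is a Vision Transformer rather than a CNN. These features form the basis for generating captions.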
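
As a rough illustration of the first step a ViT-style encoder performs, the sketch below cuts an image into non-overlapping square patches and flattens each one. It is a toy in pure Python, not the library's implementation: `split_into_patches` and `toy_image` are hypothetical names, and the 224×224 input with 16×16 patches is an assumption based on the common ViT-Base configuration.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    # Walk the grid patch by patch, left to right and top to bottom
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# Toy single-channel 224x224 "image" whose pixel value encodes its position
toy_image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = split_into_patches(toy_image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 256
```

In a real ViT, each flattened patch is then projected to an embedding vector and the resulting sequence is fed to the transformer, in much the same way a sentence is fed in as a sequence of token embeddings.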


In [None]:
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image

In [None]:
from diffusers import StableDiffusionPipeline
import torch

In [None]:

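# Use half precision on a GPU; fall back to full precision on CPU (much slower)
sd_device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if sd_device == "cuda" else torch.float32
).to(sd_device)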

def generate_image_from_caption(caption):
    with torch.inference_mode():
        image = pipe(caption, num_inference_steps=20, guidance_scale=7.5).images[0]
    return image




## 🧹 Cleaning and Tokenizing Captions

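Before training a captioning model, we need to clean the text (remove punctuation, lowercase everything, etc.) and tokenize it so that the model can work with numbers instead of raw text. The pretrained model used in this notebook ships with its own tokenizer, so the cells below only set up the English → Hindi translation helper.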
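
For a model trained from scratch, the cleaning and tokenizing steps might look roughly like the sketch below. It is a minimal word-level example using only the standard library; the helper names (`clean_caption`, `build_vocab`, `encode`) and the special tokens are assumptions made for illustration, not part of this notebook's pipeline.

```python
import re
from collections import Counter

def clean_caption(caption):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    caption = caption.lower()
    caption = re.sub(r"[^a-z0-9\s]", " ", caption)
    return " ".join(caption.split())

def build_vocab(captions, min_count=1):
    """Map each sufficiently frequent word to an integer id, after the special tokens."""
    counts = Counter(word for caption in captions for word in caption.split())
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Wrap a cleaned caption in start/end markers and convert it to ids."""
    tokens = ["<start>"] + caption.split() + ["<end>"]
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

raw = ["A dog runs, fast!", "Two dogs play in the park."]
cleaned = [clean_caption(c) for c in raw]
vocab = build_vocab(cleaned)
ids = encode(cleaned[0], vocab)
print(cleaned[0])  # a dog runs fast
print(ids)         # [1, 4, 5, 11, 7, 2]
```

Raising `min_count` drops rare words from the vocabulary, and `encode` then maps them to `<unk>`.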


In [None]:
from transformers import pipeline

In [None]:

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")  # English → Hindi

def translate_caption(caption):
    return translator(caption, max_length=40)[0]['translation_text']


In [None]:
import gradio as gr
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer, pipeline


## 🧠 Building the Captioning Model

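A Bidirectional LSTM reads a caption in both directions, so each position sees both past and future context. The cell below does not train one: it loads the pretrained `nlpconnect/vit-gpt2-image-captioning` model, whose GPT-2 decoder generates the caption left to right, one token at a time, and wraps it in a `predict_step` function that also translates the caption into Hindi.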
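
The idea of reading a sequence in both directions can be sketched with a toy recurrence. To keep it short and dependency-free, the example below uses a plain tanh RNN cell with scalar states instead of a real LSTM cell, and it shares one set of weights between the two directions (a real Bidirectional LSTM learns separate weights for each). The function names and weight values are hypothetical.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Simple (non-LSTM) recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    hidden, states = 0.0, []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def run_bidirectional(inputs, w_in=0.5, w_rec=0.8):
    """Pair a left-to-right state with a right-to-left state at every position."""
    forward = run_rnn(inputs, w_in, w_rec)
    # Run over the reversed sequence, then flip back so positions line up
    backward = run_rnn(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))

sequence_a = [1.0, 0.2, -0.4, 0.9]
sequence_b = [1.0, 0.2, -0.4, -0.9]  # differs only at the last step

states_a = run_bidirectional(sequence_a)
states_b = run_bidirectional(sequence_b)

# The forward state at position 0 ignores later inputs; the backward state does not
print(states_a[0][0] == states_b[0][0])  # True
print(states_a[0][1] == states_b[0][1])  # False
```

The backward state at the first position changes when only the last input changes, which is the "future context" a bidirectional model gets access to. That context is only available when the whole sequence is known in advance, so it suits encoding existing captions rather than generating new ones token by token.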


In [None]:

# Load the pre-trained image captioning model and components
from transformers import ViTImageProcessor  # replaces the deprecated ViTFeatureExtractor

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Initialize translation pipeline for English -> Hindi
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)


max_length = 16
num_beams = 1
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

def predict_step(image):
    try:
        if image.mode != "RGB":
            image = image.convert(mode="RGB")

        # Pass the image directly as the first argument
        pixel_values = image_processor(image, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device)

        # Generate caption token ids from the image features
        output_ids = model.generate(pixel_values, **gen_kwargs)

        preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        preds = [pred.strip() for pred in preds]
        caption = preds[0]

        hindi_caption = translator(caption, max_length=40)[0]['translation_text']

        return caption, hindi_caption
    except Exception as e:
        # The Gradio interface expects two text outputs, so return a pair
        return f"An error occurred: {e}", ""

## 🧪 Generating Captions

Let’s test the model on some new images and see how well it generates captions. A second tab goes the other way and generates an image from a caption we provide.


In [None]:
# Create a Gradio interface
iface = gr.Interface(
    fn=predict_step,
    inputs=gr.Image(type="pil"),
    outputs=[gr.Textbox(label="English Caption"), gr.Textbox(label="Hindi Translation")],
    title="Image Captioning + Translation",
    description="Upload an image to get an English caption and its Hindi translation."
)

iface2 = gr.Interface(
    fn=generate_image_from_caption,
    inputs=gr.Textbox(label="Enter Caption"),
    outputs=gr.Image(type="pil"),
    title="Text-to-Image Generator",
    description="Enter a caption to generate an image using Stable Diffusion."
)

gr.TabbedInterface([iface, iface2], ["Image → Caption", "Caption → Image"]).launch(debug=True)

## ✅ Wrapping Up

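This was a simple but solid image captioning demo: a pretrained ViT + GPT-2 model writes the caption, a translation pipeline renders it in Hindi, and Stable Diffusion turns captions back into images. A CNN + Bidirectional LSTM captioner trained from scratch would be a natural next experiment, and there’s plenty of room to improve on what we have here, but this gives us a strong starting point.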

Thanks for checking it out! 🚀
