<a href="https://colab.research.google.com/github/JadeEmm/ai-image-captioning-app/blob/main/Image_Captioning_App_Public_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Image Captioning App
Welcome! This notebook lets you run a professional AI image captioning app directly in your browser.

This notebook builds an image captioning app with a free Hugging Face model, loaded through the `transformers` library, behind a simple Gradio interface. You upload an image and press a button, and the model returns a short, plain-language caption in a text box describing what the image shows.

# How to use this notebook:

1. **Click "Copy to Drive"** (top-left) to get your own copy
2. **Run all cells** below in order (Ctrl+Enter or click the play buttons)
3. **Wait for the app to load** (takes 1-2 minutes the first time)
4. **Click the public link** that appears to access your app
5. **Upload images and get AI captions!**

## What you'll get:
- A working AI app with a temporary public link to share
- Pre-trained image captioning (the `nlpconnect/vit-gpt2-image-captioning` ViT + GPT-2 model)
- No setup required - everything runs in the cloud
- Free to use and share with friends!

## Real-world uses for an app like this:
- Content creators needing captions
- Learning about AI and computer vision
- Accessibility (generating alt text)

# Build

## Install required libraries

### Subtask:
Install the necessary libraries, including `transformers`, `torch`, `gradio`, and `Pillow`.


**Reasoning**:
Install the necessary libraries using pip.



In [None]:
!pip install transformers torch gradio Pillow
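
After the install finishes, it can help to confirm which versions were picked up, since the `pip` command above does not pin any. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from this environment
            versions[name] = None
    return versions


# Report the four libraries this notebook installs
for name, ver in installed_versions(["transformers", "torch", "gradio", "Pillow"]).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

If any entry reports `NOT INSTALLED`, re-running the install cell is the first thing to try.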




## Load the model and processor

### Subtask:
Load a pre-trained image captioning model and its corresponding processor using the `transformers` library.


**Reasoning**:
Import the necessary classes from the `transformers` library and load a pre-trained image captioning model and its corresponding processor.



In [None]:
from transformers import VisionEncoderDecoderModel, AutoFeatureExtractor, AutoTokenizer

# Specify the model name
model_name = "nlpconnect/vit-gpt2-image-captioning"

# Load the pre-trained model
model = VisionEncoderDecoderModel.from_pretrained(model_name)

# Load the corresponding processor
# AutoFeatureExtractor for the image part
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
# AutoTokenizer for the text part
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("✅ Model and processor loaded successfully!")


✅ Model and processor loaded successfully!
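
To get a rough sense of how large the loaded model is, one option is to count its parameters. This is a sketch, not part of the original notebook: the helper `summarize_parameters` is hypothetical, it assumes only that the object exposes a PyTorch-style `parameters()` method (as the `model` loaded above does), and the memory figure assumes 4 bytes per parameter (float32 weights).

```python
def summarize_parameters(model) -> dict:
    """Count total and trainable parameters of a PyTorch-style module."""
    total = 0
    trainable = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        if getattr(param, "requires_grad", False):
            trainable += count
    return {
        "total": total,
        "trainable": trainable,
        # Assumption: weights are stored as float32 (4 bytes each)
        "approx_float32_mb": total * 4 / (1024 ** 2),
    }


# Usage in this notebook, once `model` has been loaded:
# print(summarize_parameters(model))
```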


## Define the caption generation function

### Subtask:
Create a Python function that takes an image as input, preprocesses it using the loaded processor, generates a caption using the loaded model, and returns the caption text.


**Reasoning**:
Define the Python function to generate captions, incorporating image preprocessing, model inference, and caption decoding as per the instructions.



In [None]:
import torch
from PIL import Image

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device) # Move model to the selected device

def generate_caption_from_image(image: Image.Image) -> str:
    """
    Generates a caption for a given image using the loaded model and processor.

    Args:
        image: A PIL Image object.

    Returns:
        A string containing the generated caption.
    """
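    # Gradio passes None when the button is pressed without an image
    if image is None:
        return "Please upload an image first."

    # The model expects 3-channel RGB input
    if image.mode != "RGB":
        image = image.convert(mode="RGB")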

    # Preprocess the image
    # The feature extractor handles resizing and normalization
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device) # Move pixel values to the selected device

    # Generate the caption
    # max_length caps the caption length in tokens; num_beams=4 enables beam search
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

    # Decode the generated token IDs into text
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return caption

print("✅ Caption generation function defined.")

✅ Caption generation function defined.
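
Decoded captions can come back lowercase, without final punctuation, and sometimes with stray whitespace. If tidier output is wanted, a small post-processing step could be applied to the returned string. The helper below, `tidy_caption`, is a hypothetical addition and not part of the original notebook.

```python
def tidy_caption(raw: str) -> str:
    """Collapse whitespace, capitalize the first letter, and end with punctuation."""
    text = " ".join(raw.split())  # trims the ends and collapses inner runs of whitespace
    if not text:
        return ""
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


print(tidy_caption("  a cat laying on  a couch "))  # A cat laying on a couch.
```

In this notebook it could be applied as `tidy_caption(generate_caption_from_image(image))`.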


## Create and launch the Gradio interface

### Subtask:
Design a Gradio interface that allows the user to upload an image, triggers the caption generation function, and displays the generated caption in a text box.


**Reasoning**:
Design a Gradio interface to allow image upload, trigger the caption generation function, and display the result.



In [None]:
import gradio as gr

# Create the Gradio interface
interface = gr.Interface(
    fn=generate_caption_from_image,  # Our caption generation function
    inputs=[
        gr.Image(
            label="Upload an image",
            type="pil",  # Get the input as a PIL Image object
            sources=["upload", "webcam"] # Allow image upload or webcam input
        )
    ],
    outputs=[
        gr.Textbox(
            label="Generated Caption",
            placeholder="The generated caption will appear here...",
            lines=3 # Provide enough space for the caption
        )
    ],
    title="AI Image Captioning App",
    description="Upload an image to get an AI-generated caption describing its content.",
    allow_flagging="never" # Disable flagging button
)

# Launch the interface
print("Launching Gradio interface...")
interface.launch(share=True) # share=True creates a public link



Launching Gradio interface...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d22eeebd6ba170d0bb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Summary:

### Key Findings

* The necessary libraries (`transformers`, `torch`, `gradio`, and `Pillow`) for building the image captioning app were found to be already installed in the environment.
* A pre-trained image captioning model ("nlpconnect/vit-gpt2-image-captioning"), along with its feature extractor and tokenizer, was successfully loaded using the `transformers` library.
* A Python function `generate_caption_from_image` was defined to handle image preprocessing, caption generation using the loaded model, and decoding the output into a human-readable string.
* A Gradio interface was successfully created and launched, allowing users to upload an image (via upload or webcam) and display the generated caption in a text box. The interface is titled "AI Image Captioning App" and includes a brief description.

### Insights or Next Steps

* The core components for an image captioning application using Hugging Face models and Gradio have been successfully set up. The next step would be to test the application thoroughly with various images to evaluate the quality of the generated captions.
* Explore options for improving caption quality by experimenting with different model generation parameters (e.g., `num_beams`, `max_length`, `early_stopping`) or by fine-tuning the model on a specific dataset.
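
One way to run the generation-parameter experiment described above is a small sweep that produces a caption for every combination of settings and lists them side by side. This is only a sketch: `sweep_generation_settings` is a hypothetical helper, and `caption_fn` is assumed to be a variant of `generate_caption_from_image` that accepts `num_beams` and `max_length` as keyword arguments and forwards them to `model.generate` (the function defined in this notebook hard-codes both values).

```python
from itertools import product


def sweep_generation_settings(caption_fn, image, beam_options=(1, 4), length_options=(16, 32)):
    """Call caption_fn once per (num_beams, max_length) pair and collect the results."""
    results = []
    for num_beams, max_length in product(beam_options, length_options):
        settings = {"num_beams": num_beams, "max_length": max_length}
        results.append((settings, caption_fn(image, **settings)))
    return results


# Usage, assuming a hypothetical caption_fn(image, num_beams=..., max_length=...):
# for settings, caption in sweep_generation_settings(caption_fn, image):
#     print(settings, "->", caption)
```

Comparing the captions for the same image makes it easier to judge whether more beams or a longer length limit actually helps for the kind of images you care about.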
