
In [3]:

# --------------------------------------------------
# 📌 Image Captioning Project
# --------------------------------------------------
# What this notebook does:
# 1. Loads a pre-trained AI model (Vision Transformer + GPT-2).
# 2. Lets you upload your own images.
# 3. Generates natural language captions that describe the image.
# --------------------------------------------------

# Step 1: Install the required libraries
!pip install transformers sentencepiece --quiet

# Step 2: Import necessary tools
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image
from google.colab import files

# Step 3: Load the pre-trained model
# (a ViT image encoder paired with a GPT-2 text decoder, already fine-tuned for captioning)
model_name = "nlpconnect/vit-gpt2-image-captioning"

print("📥 Loading model... please wait a moment.")

model = VisionEncoderDecoderModel.from_pretrained(model_name)
feature_extractor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

print("✅ Model loaded successfully!")

# Step 4: Define a function to generate captions
def generate_caption(image_path):
    """
    Open the image at `image_path`, run it through the model,
    and return the generated caption as a string.
    """
    # Open the image
    img = Image.open(image_path).convert("RGB")

    # Convert the image into the pixel tensor the model expects
    pixel_values = feature_extractor(images=img, return_tensors="pt").pixel_values.to(device)

    # Generate the caption with beam search (4 beams), capped at 16 tokens
    output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

    # Decode tokens into text and trim surrounding whitespace
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()

    return caption

# Step 5: Upload your images
print("📂 Please upload one or more images (JPG/PNG).")
uploaded = files.upload()

# Step 6: Generate captions for each uploaded image
for filename in uploaded:
    caption = generate_caption(filename)
    print("--------------------------------------------------")
    print(f"📷 Image: {filename}")
    print(f"📝 Caption: {caption}")

📥 Loading model... please wait a moment.


✅ Model loaded successfully!
📂 Please upload one or more images (JPG/PNG).


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Saving 360_F_666874684_iHm6UxLjRsR0RjeN1IsFp69Rps1QCDuZ.jpg to 360_F_666874684_iHm6UxLjRsR0RjeN1IsFp69Rps1QCDuZ (1).jpg


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


--------------------------------------------------
📷 Image: 360_F_666874684_iHm6UxLjRsR0RjeN1IsFp69Rps1QCDuZ (1).jpg
📝 Caption: a black and white cat and a brown and white dog 
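
The `num_beams=4` argument passed to `model.generate` in `generate_caption` selects beam search. As a rough illustration of the idea only (this is not the transformers implementation), the sketch below runs beam search over a hypothetical toy next-token table. `TOY_MODEL`, its probabilities, and the `beam_search` function are all invented for this example.

```python
import math

# Hypothetical toy "language model": maps the previous token to
# next-token probabilities. The numbers are invented for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, num_beams, max_length, bos="<s>", eos="</s>"):
    """Return the highest-scoring token sequence found with beam search."""
    # Each beam is (cumulative log-probability, token list).
    beams = [(0.0, [bos])]
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:
                # Finished sequences are carried over unchanged.
                candidates.append((score, tokens))
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams[0][1]


# num_beams=1 is greedy decoding: it commits to the best first token.
greedy = beam_search(TOY_MODEL, num_beams=1, max_length=5)
# num_beams=2 keeps a second candidate alive at every step.
wider = beam_search(TOY_MODEL, num_beams=2, max_length=5)
print(greedy)
print(wider)
```

With one beam the search commits to "a" and ends with "a cat" (probability 0.6 × 0.55 = 0.33). With two beams it keeps "the" alive and finds "the cat" (0.4 × 0.9 = 0.36). Wider beams can recover sequences that greedy decoding misses, at the cost of more computation per step.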
