<a href="https://colab.research.google.com/github/Intelligent07/CodeSoft/blob/main/ImageCaptioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

def generate_caption(image_path, model_name="nlpconnect/vit-gpt2-image-captioning"):
    """
    Generates a caption for an image using a pre-trained VisionEncoderDecoderModel.

    Args:
        image_path (str): Path to the image file.
        model_name (str): Name of the pre-trained model to use.

    Returns:
        str: Generated caption.
    """
    try:
        # Load model, processor, and tokenizer
        model = VisionEncoderDecoderModel.from_pretrained(model_name)
        feature_extractor = ViTImageProcessor.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Load and preprocess image
        image = Image.open(image_path).convert("RGB")
        pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values

        # Generate caption
        model.eval()
        with torch.no_grad():
            output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

        # Decode caption
        caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return caption

    except Exception as e:
        return f"An error occurred: {e}"

# Example usage:
image_file = "your_image.jpg"  # Replace with your image file path.
caption = generate_caption(image_file)
print(caption)

# Example with a missing file: generate_caption returns the error message as a string instead of raising.
image_file_not_exist = "not_exist.jpg"
caption_error = generate_caption(image_file_not_exist)
print(caption_error)

Config of the encoder: <class 'transformers.models.vit.modeling_vit.ViTModel'> is overwritten by shared encoder config: ViTConfig {
  "architectures": [
    "ViTModel"
  ],
  "attention_probs_dropout_prob": 0.0,
  "encoder_stride": 16,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "image_size": 224,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "model_type": "vit",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "patch_size": 16,
  "pooler_act": "tanh",
  "pooler_output_size": 768,
  "qkv_bias": true,
  "torch_dtype": "float32",
  "transformers_version": "4.50.2"
}

Config of the decoder: <class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'> is overwritten by shared decoder config: GPT2Config {
  "activation_function": "gelu_new",
  "add_cross_attention": true,
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "decoder_start_to

a large body of water surrounded by lush green hills 


An error occurred: [Errno 2] No such file or directory: 'not_exist.jpg'
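
The encoder config printed in the log above reports `image_size` 224, `patch_size` 16, `num_channels` 3, `hidden_size` 768, and `num_attention_heads` 12. As a rough sketch of what those numbers imply for the ViT encoder, the arithmetic below derives the patch grid and sequence length. The variable names are only illustrative, and the extra token assumes the standard ViT layout, in which one [CLS] token is prepended to the patch sequence.

```python
# Values copied from the ViTConfig shown in the log output.
image_size = 224
patch_size = 16
num_channels = 3
hidden_size = 768
num_attention_heads = 12

# The image is cut into non-overlapping square patches.
patches_per_side = image_size // patch_size               # 14
num_patches = patches_per_side ** 2                       # 196

# Assumption: standard ViT layout with one [CLS] token in front of the patches.
encoder_seq_len = num_patches + 1                         # 197

# Each patch is flattened before being projected to hidden_size.
values_per_patch = patch_size * patch_size * num_channels  # 768

# hidden_size is split evenly across the attention heads.
head_dim = hidden_size // num_attention_heads             # 64

print(f"patch grid: {patches_per_side} x {patches_per_side}")
print(f"encoder sequence length: {encoder_seq_len}")
print(f"values per flattened patch: {values_per_patch}")
print(f"dimension per attention head: {head_dim}")
```

Under these assumptions, the decoder attends over 197 encoder positions for every input image, regardless of the original image resolution, because the image processor resizes to 224 x 224 first.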


In [None]:
%pip install torch torchvision transformers Pillow
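
After the install cell has run, it may be worth confirming that the packages are importable before loading the model. The sketch below is one possible check using only the standard library. The helper name `missing_packages` and the `REQUIRED` mapping are hypothetical, not part of the original notebook. Note that Pillow is installed under the name `Pillow` but imported as `PIL`.

```python
from importlib.util import find_spec

# Pip distribution name -> top-level module name used in `import` statements.
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


print("Missing packages:", missing_packages() or "none")
```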



In [None]:
# Example usage:
image_file = "image1.jpg"  # Replace with your image file path.
caption = generate_caption(image_file)
print(caption)



a young woman smiles as she poses for a picture 
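
Each call to `generate_caption` loads the model, image processor, and tokenizer again, which is why the config log is printed for every call. A minimal sketch of one way to avoid that is shown below: the loaded objects are cached per model name and several images are captioned in one batch. The names `load_captioner`, `caption_images`, and `DEFAULT_MODEL` are hypothetical helpers introduced for illustration. The generation settings are the ones used in the notebook, and the `.strip()` call is an added assumption to drop the trailing space seen in the captions.

```python
from functools import lru_cache

DEFAULT_MODEL = "nlpconnect/vit-gpt2-image-captioning"


@lru_cache(maxsize=2)
def load_captioner(model_name=DEFAULT_MODEL):
    """Load the model, image processor, and tokenizer once per model name."""
    # Imported lazily so the heavy dependency is only needed on first use.
    from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    processor = ViTImageProcessor.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, processor, tokenizer


def caption_images(image_paths, model_name=DEFAULT_MODEL):
    """Caption several images in one batch, reusing the cached model."""
    import torch
    from PIL import Image

    model, processor, tokenizer = load_captioner(model_name)
    images = [Image.open(path).convert("RGB") for path in image_paths]
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

A call such as `caption_images(["your_image.jpg", "image1.jpg"])` would then load the model only once. Unlike `generate_caption`, this sketch lets exceptions such as a missing file propagate to the caller rather than returning them as strings.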
