<a href="https://colab.research.google.com/github/Hari-nk7/Hari-nk7/blob/main/Video_Captioning_Bot_for_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#
#  Video Captioning Bot for Google Colab
#
#  Instructions:
#  1. Open a new notebook in Google Colab (https://colab.research.google.com/).
#  2. Create a new code cell for each step mentioned below.
#  3. Copy the code from each step into its corresponding cell and run them in order.
#

# =============================================================================
# STEP 1: Install Dependencies
#
# This cell installs all the necessary libraries.
# - transformers: For accessing pre-trained models from Hugging Face.
# - torch: The deep learning framework.
# - opencv-python-headless: For video processing without a GUI.
# - pillow: For image manipulation.
# =============================================================================

# !pip install transformers torch opencv-python-headless pillow

print("Step 1 Complete: Dependencies are ready to be installed (run the command by uncommenting it).")


# =============================================================================
# STEP 2: Import Libraries and Load the Pre-trained Model
#
# This cell imports the required modules and loads the image captioning model.
# The model ('nlpconnect/vit-gpt2-image-captioning') is a pre-trained image
# captioning model that pairs a Vision Transformer (ViT) encoder with a GPT-2
# decoder.
#
# IMPORTANT: The first time you run this, it will download the model files
# (about 1 GB of weights). The Hugging Face library caches these files on the
# Colab runtime's disk, so later runs in the same runtime load the model from
# the cache instead of downloading it again. No training takes place: the
# model is used as-is, pre-trained.
# =============================================================================
import cv2
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer
from google.colab import files
import os

Step 1 Complete: Dependencies are ready to be installed (run the command by uncommenting it).
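
The Step 2 comment above relies on the Hugging Face cache to avoid repeated downloads. As a rough sketch, the helper below checks whether a model snapshot is already on disk. It assumes the default cache layout of recent `huggingface_hub` releases (`~/.cache/huggingface/hub/models--<org>--<name>`, relocatable through the `HF_HUB_CACHE` or `HF_HOME` environment variables). The names `hub_cache_dir`, `model_cache_path`, and `is_model_cached` are illustrative and not part of any library.

```python
import os
from pathlib import Path

MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"


def hub_cache_dir() -> Path:
    """Best-effort guess at the hub cache directory (assumed default layout)."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    default_home = os.path.join(Path.home(), ".cache", "huggingface")
    return Path(os.environ.get("HF_HOME", default_home)) / "hub"


def model_cache_path(model_id: str) -> Path:
    """Map 'org/name' to the assumed on-disk folder 'models--org--name'."""
    return hub_cache_dir() / ("models--" + model_id.replace("/", "--"))


def is_model_cached(model_id: str) -> bool:
    """True if a cache folder for the model exists (does not verify its contents)."""
    return model_cache_path(model_id).is_dir()


print(f"{MODEL_ID} cached: {is_model_cached(MODEL_ID)}")
```

Because the cache lives on the runtime's disk, a recycled Colab runtime starts with an empty cache and the files are downloaded again on the next run.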


In [2]:
def load_model():
    """Loads and returns the pre-trained captioning model, feature extractor, and tokenizer."""
    print("Loading the pre-trained captioning model...")
    model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    tokenizer = GPT2Tokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    print("Model loaded successfully.")
    return model, feature_extractor, tokenizer

# Check for GPU availability and move the model to the GPU if possible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model, feature_extractor, tokenizer = load_model()
model.to(device)


print("\nStep 2 Complete: Model is loaded and ready.")

Using device: cpu
Loading the pre-trained captioning model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/982M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/982M [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

preprocessor_config.json:   0%|          | 0.00/228 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/241 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/120 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Model loaded successfully.

Step 2 Complete: Model is loaded and ready.
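
Re-running the Step 2 cell calls `load_model()` again; the files come from the disk cache, but the weights are still read into memory each time. One possible way to avoid that within a session is to memoize the loader. This is a minimal sketch with a stand-in loader: `get_model_bundle` and `LOAD_CALLS` are hypothetical names, and the placeholder tuple takes the place of the real `from_pretrained` calls.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=1)
def get_model_bundle(model_id: str = "nlpconnect/vit-gpt2-image-captioning"):
    """Stand-in loader; in the notebook the body would call load_model()."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Placeholder objects standing in for (model, feature_extractor, tokenizer).
    return ("model", "feature_extractor", "tokenizer")


first = get_model_bundle()
second = get_model_bundle()  # served from the in-memory cache
print(first is second, LOAD_CALLS)
```

The memoization only helps if the cell that defines `get_model_bundle` is not re-run, since redefining the function discards its cache.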


In [3]:
# =============================================================================
# STEP 3: Define the Video Captioning Function
#
# This cell contains the core logic for processing the video and generating a caption.
# =============================================================================

def get_caption_for_video(video_path, model, feature_extractor, tokenizer, device):
    """
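    Extracts the middle frame of a video as its key frame, generates a caption
    for it, and returns both. The file at video_path is deleted before the
    function returns, whether or not captioning succeeds.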

    Args:
        video_path (str): The path to the video file.
        model: The loaded VisionEncoderDecoderModel.
        feature_extractor: The loaded ViTImageProcessor.
        tokenizer: The loaded GPT2Tokenizer.
        device: The torch device (CPU or GPU).

    Returns:
        A tuple containing the generated caption (str) and the key frame (PIL.Image).
        Returns (None, None) if the video cannot be processed.
    """
    try:
        print(f"Processing video: {video_path}")
        video = cv2.VideoCapture(video_path)

        # Check if video opened successfully
        if not video.isOpened():
            print("Error: Could not open video file.")
            return None, None

        # Get total number of frames to find the middle frame.
        # Some containers may report zero or a negative count, so reject both.
        total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
        if total_frames <= 0:
            print("Error: Video has no frames.")
            video.release()
            return None, None

        # Set the video to the middle frame
        middle_frame_index = total_frames // 2
        video.set(cv2.CAP_PROP_POS_FRAMES, middle_frame_index)

        print(f"Extracting frame at index {middle_frame_index} (middle frame)...")
        success, frame = video.read()
        video.release()

        if not success:
            print("Error: Could not read the middle frame.")
            return None, None

        # Convert the frame from OpenCV's BGR format to RGB and then to a PIL Image
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        key_frame = Image.fromarray(frame_rgb)

        print("Generating caption for the key frame...")
        # Prepare the image for the model
        pixel_values = feature_extractor(images=[key_frame], return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device)

        # Generate caption IDs
        output_ids = model.generate(pixel_values, max_length=16, num_beams=4)

        # Decode the IDs to a text caption
        caption = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
        print(f"Generated Caption: '{caption}'")

        return caption.capitalize(), key_frame

    except Exception as e:
        print(f"An error occurred during caption generation: {e}")
        return None, None
    finally:
        # Clean up the uploaded file
        if os.path.exists(video_path):
            os.remove(video_path)
            print(f"Cleaned up video file: {video_path}")


print("\nStep 3 Complete: Captioning function is defined.")


Step 3 Complete: Captioning function is defined.
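
Step 3 captions a single frame, the one at `total_frames // 2`. If that frame turns out to be unrepresentative (a transition or a blurred shot, say), a possible extension is to caption several evenly spaced frames instead. The sketch below only computes the indices. `sample_frame_indices` is a hypothetical helper, not part of the notebook, and with `num_samples=1` it reduces to the middle-frame rule used above.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int = 5) -> List[int]:
    """Return up to num_samples distinct frame indices spread evenly over the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more samples than there are frames.
    num_samples = min(num_samples, total_frames)
    # Split the video into num_samples equal segments and take each segment's centre.
    return [
        (2 * i + 1) * total_frames // (2 * num_samples)
        for i in range(num_samples)
    ]


print(sample_frame_indices(187, 1))  # [93]
print(sample_frame_indices(187, 3))  # [31, 93, 155]
```

Each index could then be passed to `video.set(cv2.CAP_PROP_POS_FRAMES, index)` before `video.read()`, in the same way Step 3 seeks to the middle frame.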


In [4]:
# =============================================================================
# STEP 4: Run the UI to Upload a Video and Get a Caption
#
# This cell provides a simple UI to upload a video file from your computer.
# After uploading, it will process the video and display the one-liner caption.
# =============================================================================

def run_captioning_bot():
    """
    Provides a file upload interface and runs the captioning process.
    """
    print("\nPlease upload a video file from your computer.")
    uploaded = files.upload()

    if not uploaded:
        print("\nNo file uploaded. Please run the cell again to try once more.")
        return

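    # Use the first uploaded file; any additional files in the upload are ignored
    file_name = next(iter(uploaded))
    print(f"\nUser uploaded file '{file_name}' ({len(uploaded[file_name])} bytes)")

    # Write the uploaded bytes to the working directory under the original name.
    # get_caption_for_video() deletes this file once it has finished.
    with open(file_name, 'wb') as f:
        f.write(uploaded[file_name])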

    # Generate the caption
    caption, key_frame = get_caption_for_video(file_name, model, feature_extractor, tokenizer, device)

    if caption and key_frame is not None:
        print("\n===============================================")
        print("          VIDEO CAPTIONING RESULT          ")
        print("-----------------------------------------------")
        print(f" ONE-LINER CAPTION: {caption}")
        print("===============================================")
        # To display the key frame that was captioned:
        # from IPython.display import display
        # print("\nKey Frame Used for Captioning:")
        # display(key_frame.resize((400, int(400 * key_frame.height / key_frame.width))))
    else:
        print("\nCould not generate a caption for the provided video.")


# Run the bot
run_captioning_bot()

print("\nStep 4 Complete: Bot execution finished.")


Please upload a video file from your computer.


Saving WhatsApp Video 2025-09-22 at 11.07.17.mp4 to WhatsApp Video 2025-09-22 at 11.07.17.mp4

User uploaded file 'WhatsApp Video 2025-09-22 at 11.07.17.mp4' (1403641 bytes)
Processing video: WhatsApp Video 2025-09-22 at 11.07.17.mp4


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Extracting frame at index 93 (middle frame)...
Generating caption for the key frame...


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Generated Caption: 'a large group of people standing around a fire'
Cleaned up video file: WhatsApp Video 2025-09-22 at 11.07.17.mp4

===============================================
          VIDEO CAPTIONING RESULT          
-----------------------------------------------
 ONE-LINER CAPTION: A large group of people standing around a fire
===============================================

Step 4 Complete: Bot execution finished.
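
Step 4 writes the upload into the working directory under its original name and leaves the deletion to the captioning function. A sketch of an alternative keeps the bytes in a uniquely named temporary file and guarantees cleanup in one place. `temporary_video_file` is a hypothetical helper, and `fake_upload` stands in for `uploaded[file_name]`; it is not a playable video.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_video_file(data: bytes, suffix: str = ".mp4"):
    """Write data to a uniquely named temp file, yield its path, then delete it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Tolerates the file having been removed already by the caller.
        if os.path.exists(path):
            os.remove(path)


fake_upload = b"\x00\x00\x00\x18ftypmp42"  # placeholder bytes only
with temporary_video_file(fake_upload) as video_path:
    existed_inside = os.path.exists(video_path)
    size_inside = os.path.getsize(video_path)

print(existed_inside, size_inside == len(fake_upload), os.path.exists(video_path))
```

In the notebook, the call to `get_caption_for_video(video_path, ...)` would go inside the `with` block. The function removes the file itself, which the `os.path.exists` check in the helper allows for.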
