<a href="https://colab.research.google.com/github/Sahanave/found_it_using_gemma/blob/main/offline_preprocessing_and_embedding_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Retrieving detected objects using Gemma</h1></center>
<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

Aging parents often struggle with memory, making it hard to find everyday items, which can feel overwhelming. To address this, we propose an assistive AI-powered application leveraging Gemma models and video data to answer queries like "Where did I last see my pen?". The system processes video frames, generates embeddings, and stores them in ChromaDB alongside metadata for efficient retrieval. With a simple, user-friendly interface, users can locate items quickly and independently. Designed for elderly individuals and their families, this solution fosters autonomy, eases caregiving challenges, and offers peace of mind to loved ones living abroad.
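
The retrieval idea described above can be sketched without any external dependencies. The snippet below is only an illustration under stated assumptions: `TinyFrameStore` is a hypothetical in-memory stand-in for ChromaDB (it is not the ChromaDB API), the three-dimensional vectors are placeholders for real model embeddings, and the frame names, captions, and timestamps are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class TinyFrameStore:
    """Hypothetical in-memory stand-in for a vector database such as ChromaDB."""

    def __init__(self):
        self._records = []  # list of (frame_id, embedding, metadata)

    def add(self, frame_id, embedding, metadata):
        self._records.append((frame_id, list(embedding), dict(metadata)))

    def query(self, query_embedding, top_k=1):
        """Return the top_k (score, frame_id, metadata) tuples, best match first."""
        scored = [
            (cosine_similarity(query_embedding, emb), frame_id, meta)
            for frame_id, emb, meta in self._records
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

# Placeholder vectors; a real system would store model-generated embeddings.
store = TinyFrameStore()
store.add("frame_a.jpg", [0.9, 0.1, 0.0],
          {"caption": "pen on a desk", "seen_at": "2024-11-30 15:33:40"})
store.add("frame_b.jpg", [0.0, 0.2, 0.9],
          {"caption": "keys on a shelf", "seen_at": "2024-11-30 15:34:10"})

best_score, best_frame, best_meta = store.query([1.0, 0.0, 0.0], top_k=1)[0]
print(best_frame, best_meta["seen_at"])
```

The design point is that each embedding travels with its metadata, so the nearest match can answer "where and when" and not only "which frame".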

## Using PaliGemma with 🤗 transformers

PaliGemma is a vision-language model released by Google. In this notebook, we use 🤗 transformers to run PaliGemma inference.
First, install the libraries below with the `-U` (upgrade) flag, since we need the latest version of 🤗 transformers and its companion packages.

In [None]:
!pip install -q -U accelerate bitsandbytes git+https://github.com/huggingface/transformers.git

In [2]:
! pip install opencv-python
! pip install hachoir

Collecting hachoir
  Downloading hachoir-3.3.0-py3-none-any.whl.metadata (2.9 kB)
Downloading hachoir-3.3.0-py3-none-any.whl (650 kB)
Installing collected packages: hachoir
Successfully installed hachoir-3.3.0


PaliGemma requires users to accept the Gemma license, so go to [the repository]() and request access. If you have already accepted the Gemma license, you already have access to this model. Once you have access, log in to the Hugging Face Hub by running the cell below; `notebook_login()` will prompt you for your access token.

In [9]:
from huggingface_hub import notebook_login

notebook_login()

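
When the notebook runs somewhere the interactive widget is not available, a non-interactive login may be more convenient. The following is a minimal sketch, assuming the token is stored in an environment variable named `HF_TOKEN` (the variable name and both helper functions are assumptions for illustration, not part of the original notebook). It uses `huggingface_hub.login(token=...)`, imported lazily so the token-reading helper works on its own.

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token found in the given environment mapping, or None."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None

def login_from_env():
    """Log in to the Hugging Face Hub if a token is available; return success."""
    token = read_hf_token()
    if token is None:
        print("HF_TOKEN is not set; fall back to notebook_login().")
        return False
    # Imported lazily so read_hf_token() works without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` at the top of the notebook would then replace the widget whenever the variable is set.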

## Step 1 : Extract Frames from Video along with Metadata

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import cv2
import os
import json
from datetime import datetime
from hachoir.metadata import extractMetadata
from hachoir.parser import createParser

def extract_frames_from_directory(input_dir, output_dir, frame_interval=1):
    """
    Extract frames from all .MOV and .mp4 video files in a directory.
    Saves frames as images and creates metadata for each video.

    Parameters:
        input_dir (str): Directory containing video files.
        output_dir (str): Directory where extracted images will be saved.
        frame_interval (int): Save every `frame_interval`-th frame (default is 1)
    """
    # Check if input directory exists
    if not os.path.exists(input_dir):
        print(f"Error: Input directory '{input_dir}' does not exist.")
        return

    # Check if output directory exists, create if not
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Supported video formats
    supported_formats = ['.mov', '.mp4']

    # Iterate over all files in the input directory
    for filename in os.listdir(input_dir):
        file_path = os.path.join(input_dir, filename)

        # Process only .MOV and .mp4 files
        if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in supported_formats:
            # Extract frames from the video
            extract_frames(file_path, output_dir, frame_interval)

def extract_frames(video_path, output_dir, frame_interval=1):
    """
    Extract frames from a video file and save them as images.
    Also saves metadata about the video and extraction.

    Parameters:
        video_path (str): Path to the input video file.
        output_dir (str): Directory where extracted images will be saved.
        frame_interval (int): Save every `frame_interval`-th frame (default is 1).
    """
    # Capture the video
    cap = cv2.VideoCapture(video_path)

    if not cap.isOpened():
        print(f"Error: Unable to open video file '{video_path}'.")
        return

    # Extract video name without extension for use in frame naming
    video_name = os.path.splitext(os.path.basename(video_path))[0]

    frame_count = 0
    saved_count = 0

    # Use the file's last-modified time as a proxy for the recording date.
    # Note: st_mtime is the modification time, not a true creation timestamp.
    try:
        file_stats = os.stat(video_path)
        creation_time = datetime.fromtimestamp(file_stats.st_mtime).strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(f"Unable to read the timestamp of the video file. Error: {e}")
        creation_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Initialize additional metadata values
    location = {}
    device_details = {}

    # Attempt to extract metadata using hachoir
    try:
        parser = createParser(video_path)
        if not parser:
            print(f"Unable to parse file '{video_path}'.")
        else:
            metadata = extractMetadata(parser)
            if metadata:
                for item in metadata.exportPlaintext():
                    # Skip lines without a "key: value" shape (e.g. a header line)
                    if ": " not in item:
                        continue
                    # Split only on the first ": " so values containing it stay intact
                    value = item.split(": ", 1)[1].strip()
                    # Extract location details
                    if "Latitude" in item:
                        location['latitude'] = value
                    elif "Longitude" in item:
                        location['longitude'] = value
                    # Extract device details
                    if "Make" in item:
                        device_details['make'] = value
                    elif "Model" in item:
                        device_details['model'] = value
    except Exception as e:
        print(f"Unable to extract additional metadata from the video file. Error: {e}")

    # Metadata dictionary to store video information
    video_metadata = {
        "video_path": video_path,
        "frame_interval": frame_interval,
        "frames_extracted": 0,
        "date_created": creation_time,
        "original_frame_count": int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
        "fps": cap.get(cv2.CAP_PROP_FPS),
        "location": location if location else None,
        "device_details": device_details if device_details else None,
        "frame_names": []  # To store names of extracted frames
    }

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Save frame only if it's the nth frame based on frame_interval
        if frame_count % frame_interval == 0:
            frame_filename = f"{video_name}_frame_{frame_count:04d}.jpg"
            frame_filepath = os.path.join(output_dir, frame_filename)
            cv2.imwrite(frame_filepath, frame)
            saved_count += 1
            video_metadata["frame_names"].append(frame_filename)

        frame_count += 1

    cap.release()

    # Update metadata with the number of frames saved
    video_metadata["frames_extracted"] = saved_count

    # Save metadata to JSON file
    metadata_filename = os.path.join(output_dir, f"{video_name}_metadata.json")
    with open(metadata_filename, "w") as metadata_file:
        json.dump(video_metadata, metadata_file, indent=4)

    print(f"Done! Extracted {saved_count} frames from '{video_path}' and saved them in '{output_dir}'.")
    print(f"Metadata saved as '{metadata_filename}'.")
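
The frame filenames written by `extract_frames` embed the frame index, and the metadata JSON stores the video's `fps`, so the two together give an approximate time offset for each saved frame. The helper below is a sketch of that calculation; `frame_offset_seconds` and the example metadata are hypothetical, and the result assumes a constant frame rate and a positive `fps` value reported by OpenCV.

```python
import re

# Matches names of the form "<video_name>_frame_<index>.jpg" used above.
FRAME_NAME_PATTERN = re.compile(r"^(?P<video>.+)_frame_(?P<index>\d+)\.jpg$")

def frame_offset_seconds(frame_filename, fps):
    """Return how many seconds into the video a saved frame was captured."""
    match = FRAME_NAME_PATTERN.match(frame_filename)
    if match is None:
        raise ValueError(f"Unexpected frame name: {frame_filename!r}")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return int(match.group("index")) / fps

# Hypothetical metadata in the same shape as the JSON file written above.
example_metadata = {
    "fps": 30.0,
    "frame_names": ["clip_frame_0000.jpg", "clip_frame_0030.jpg", "clip_frame_0420.jpg"],
}
for name in example_metadata["frame_names"]:
    print(name, frame_offset_seconds(name, example_metadata["fps"]))
```

Adding this offset to the recording date would give a per-frame timestamp that can be stored next to the embedding.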

In [5]:
input_folder = "/content/drive/MyDrive/Home_videos"  # Replace with your directory containing video files
output_folder = "/content/drive/MyDrive/Home_videos/frames"  # Replace with your desired output directory
extract_frames_from_directory(input_folder, output_folder, frame_interval=30)  # Extract every 30th frame from each video

Done! Extracted 9 frames from '/content/drive/MyDrive/Home_videos/PXL_20241130_153431655.mp4' and saved them in '/content/drive/MyDrive/Home_videos/frames'.
Metadata saved as '/content/drive/MyDrive/Home_videos/frames/PXL_20241130_153431655_metadata.json'.
Done! Extracted 23 frames from '/content/drive/MyDrive/Home_videos/PXL_20241130_153332633.mp4' and saved them in '/content/drive/MyDrive/Home_videos/frames'.
Metadata saved as '/content/drive/MyDrive/Home_videos/frames/PXL_20241130_153332633_metadata.json'.
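
To choose a sensible `frame_interval`, it helps to estimate how many frames a setting will produce and how far apart they are in time. The sketch below mirrors the sampling rule in `extract_frames` (keep frame 0 and every `frame_interval`-th frame after it). Both function names are hypothetical, and the 270-frame, 30 fps clip is an assumed example rather than one of the videos above.

```python
import math

def expected_saved_frames(total_frames, frame_interval):
    """Frames saved when indices 0, k, 2k, ... below total_frames are kept."""
    if frame_interval < 1:
        raise ValueError("frame_interval must be at least 1")
    if total_frames <= 0:
        return 0
    return math.ceil(total_frames / frame_interval)

def seconds_between_saved_frames(fps, frame_interval):
    """Time gap between consecutive saved frames, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_interval / fps

# Assumed example: a 270-frame clip recorded at 30 fps, sampled every 30 frames.
print(expected_saved_frames(270, 30))          # 9
print(seconds_between_saved_frames(30.0, 30))  # 1.0
```

The frame count reported by OpenCV can be an estimate for some containers, so the number of frames actually saved may differ slightly from this prediction.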


## Step 2 : Generate Semantic Embedding of images using Gemma models

In [6]:
import torch
import numpy as np
from PIL import Image

You can load PaliGemma model and processor like below.

In [10]:
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = PaliGemmaProcessor.from_pretrained(model_id)


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


The processor preprocesses both the image and text, so we will pass them.

In [13]:
# Testing the model and processor
input_text = "What is in this image?"
input_image = Image.open('/content/drive/MyDrive/Home_videos/frames/PXL_20241130_153332633_frame_0420.jpg')

In [11]:
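# Move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)
# Preprocess the prompt and image, then place the tensors on the same device
inputs = processor(text=input_text, images=input_image,
                  padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
# Cast floating-point inputs (the pixel values) to the model's dtype
inputs = inputs.to(dtype=model.dtype)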


You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text and `<bos>` token after that. For this call, we will infer how many images each text has and add special tokens.
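
The warning above asks for an explicit `<image>` token at the start of the text. A minimal sketch of building such a prompt is shown here; `build_prompt` is a hypothetical helper, and it adds only the `<image>` prefix, leaving `<bos>` handling to the processor, which is an assumption about the installed transformers version.

```python
IMAGE_TOKEN = "<image>"

def build_prompt(text, num_images=1):
    """Prefix the text with one `<image>` placeholder per image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    if text.startswith(IMAGE_TOKEN):
        return text  # already prefixed; leave unchanged
    return IMAGE_TOKEN * num_images + text

print(build_prompt("What is in this image?"))
```

Passing `build_prompt(input_text)` as the `text` argument of the processor is the intended usage; whether this silences the warning depends on the library version.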


We can now pass the preprocessed inputs to `model.generate`.

In [12]:
with torch.no_grad():
  output = model.generate(**inputs, max_length=496)

print(processor.decode(output[0], skip_special_tokens=True))

What is in this image?
A person holds a pink pen in their hand, the pen being pink and the hand holding it also being pink. The floor is made of wood and the table is made of wood. The shelf is made of wood and has a bottle on it. The bottle is white and the cap is on the bottle. The cord is white and the wire is white. The pen is in the hand and the hand is holding the pen.
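
As the output shows, the decoded text begins with the prompt itself. Before storing a caption, it is usually cleaner to drop that echo. The helper below is a small string-level sketch (`strip_prompt` is a hypothetical name, not part of the original notebook).

```python
def strip_prompt(decoded_text, prompt):
    """Remove the echoed prompt from the start of the decoded generation."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):].lstrip()
    return decoded_text.strip()

decoded = "What is in this image?\nA person holds a pink pen in their hand."
print(strip_prompt(decoded, "What is in this image?"))
```

An alternative that avoids string matching is to decode only the newly generated tokens, for example by slicing `output[0]` from `inputs["input_ids"].shape[-1]` onward before calling `processor.decode`; this assumes `generate` returns the prompt tokens followed by the new ones, as it appears to here.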
