## Image Descriptions with Qwen2.5-VL

Generate detailed textual descriptions for extracted page images using Qwen2.5-VL-7B-Instruct, called through the Hugging Face `InferenceClient`.

**Prerequisites:**
- A `rag-data` directory containing the extracted `markdown`, `images`, and `tables` subdirectories
- A Hugging Face access token available to `InferenceClient` (for example, `HF_TOKEN` set in the `.env` file)

**Output:**
- Markdown descriptions saved to `data/rag-data/images_desc/{company}/{document}/page_X.md`
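
The output tree mirrors the image tree: each `page_X.png` under `{company}/{document}` gets a `page_X.md` in the same relative location. A minimal sketch of that mapping, assuming images sit exactly two levels below the images root; `description_path` is a hypothetical helper used only for illustration:

```python
from pathlib import Path


def description_path(image_path: Path, output_root: Path) -> Path:
    """Map an image path to the markdown file that will hold its description."""
    company = image_path.parent.parent.name   # e.g. "meta"
    document = image_path.parent.name         # e.g. "meta 10-k 2024"
    return output_root / company / document / f"{image_path.stem}.md"


example = description_path(
    Path("data/rag-data/images/meta/meta 10-k 2024/page_64.png"),
    Path("data/rag-data/images_desc"),
)
print(example.as_posix())
# data/rag-data/images_desc/meta/meta 10-k 2024/page_64.md
```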

### Setup and Imports

In [None]:
!pip install python-dotenv huggingface_hub Pillow tqdm

In [2]:
from dotenv import load_dotenv
load_dotenv()  # load environment variables (such as API tokens) from .env

import base64
import io
from pathlib import Path

from PIL import Image

### Configuration

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from google.colab import userdata

In [5]:
from huggingface_hub import InferenceClient

In [6]:
# Paths
IMAGES_DIR = "/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images"
OUTPUT_DESC_DIR = "/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images_desc"

# Model configuration
MODEL_NAME = "Qwen/Qwen2.5-VL-7B-Instruct"
client = InferenceClient(model=MODEL_NAME)

### Description Generation Function

In [7]:
describe_image_prompt = """Analyze this financial document page and extract meaningful data in a concise format.

For charts and graphs:
- Identify the metric being measured
- List key data points and values
- Note significant trends (growth, decline, stability)

For tables:
- Extract column headers and key rows
- Note important values and totals

For text:
- Summarize key facts and numbers only
- Skip formatting, headers, and navigation elements

Be direct and factual. Focus on numbers, trends, and insights that would be useful for retrieval."""

In [8]:
def generate_image_description(image_path: Path) -> str:
    """Return a text description of the page image at image_path."""
    # Load image and convert to base64
    image = Image.open(image_path).convert("RGB")
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    image_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

    # Qwen2.5-VL uses OpenAI-style chat with image + text
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant specialized in financial document understanding."
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": describe_image_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_base64}"
                        }
                    }
                ]
            }
        ],
        max_tokens=1200,
        temperature=0.2,
    )

    return response.choices[0].message.content
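
Remote inference calls can fail transiently (timeouts, rate limits), and a single failure would stop a long batch run. One option is to wrap the call in a retry loop with exponential backoff. This is a sketch under stated assumptions, not something the notebook does: `call_with_retries`, `max_attempts`, and `base_delay` are hypothetical names, and retrying on every exception type is a simplification.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 2.0) -> T:
    """Call fn(), retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical usage with the function defined above:
# description = call_with_retries(lambda: generate_image_description(image_path))
```

In practice it is usually better to catch only the error types that indicate a transient failure, so that genuine bugs (a bad path, a malformed request) fail immediately.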

In [9]:
IMAGES_DIR

'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images'

In [10]:
image_path = Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images/meta/meta 10-k 2024/page_64.png')

response = generate_image_description(image_path)

In [11]:
response

'### Trends in Our Revenue by User Geography\n\n**Revenue Worldwide:**\n- **2022:** $32,165 million\n- **2023:** $40,111 million\n- **2024:** $48,385 million\n- **Growth:** 25% from 2022 to 2024\n\n**Revenue US & Canada:**\n- **2022:** $15,005 million\n- **2023:** $17,842 million\n- **2024:** $21,783 million\n- **Growth:** 19% from 2022 to 2024\n\n**Revenue Europe:**\n- **2022:** $7,050 million\n- **2023:** $7,777 million\n- **2024:** $11,503 million\n- **Growth:** 66% from 2022 to 2024\n\n**Revenue Asia-Pacific:**\n- **2022:** $5,968 million\n- **2023:** $7,316 million\n- **2024:** $9,245 million\n- **Growth:** 55% from 2022 to 2024\n\n**Revenue Rest of World:**\n- **2022:** $3,429 million\n- **2023:** $4,251 million\n- **2024:** $5,854 million\n- **Growth:** 69% from 2022 to 2024\n\n### Key Insights:\n- Revenue in the Rest of World grew the most significantly, followed by Europe and Asia-Pacific.\n- The US & Canada and Europe regions saw moderate growth.\n- Revenue in the US & Canada

In [12]:
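def generate_and_save_description(image_path: Path) -> bool:
    """Describe one page image and save the result as markdown.

    Returns False without calling the model if the description file already exists.
    """
    # Images are stored as {company}/{document}/page_X.png
    company_name = image_path.parent.parent.name
    doc_name = image_path.parent.name

    output_dir = Path(OUTPUT_DESC_DIR) / company_name / doc_name
    output_dir.mkdir(parents=True, exist_ok=True)

    desc_file = output_dir / f"{image_path.stem}.md"

    # Skip pages that were already described so the batch can be resumed
    if desc_file.exists():
        return False

    description = generate_image_description(image_path)
    desc_file.write_text(description, encoding="utf-8")

    return True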

In [13]:
image_path = Path(r'/content/drive/MyDrive/Udemy/KGP-TALKIE/Deep_Agent/resources/data/rag-data/images/meta/meta 10-k 2024/page_64.png')

response = generate_and_save_description(image_path)

In [14]:
from tqdm import tqdm

images_path = Path(IMAGES_DIR)
image_files = list(images_path.rglob("page_*.png"))

for image_path in tqdm(image_files):
    # Returns False (and skips the model call) when a description already exists
    generate_and_save_description(image_path)


100%|██████████| 77/77 [10:52<00:00,  8.47s/it]
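
If a batch run is interrupted, it can help to list which pages still lack a description before re-running. A minimal sketch, assuming the description tree mirrors the image tree (same relative path, `.md` instead of `.png`); `find_missing_descriptions` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path
from typing import List


def find_missing_descriptions(images_dir: Path, desc_dir: Path) -> List[Path]:
    """Return page images that have no matching .md description yet."""
    missing = []
    for image_path in sorted(images_dir.rglob("page_*.png")):
        # Same relative location under desc_dir, with a .md suffix
        relative = image_path.relative_to(images_dir).with_suffix(".md")
        if not (desc_dir / relative).exists():
            missing.append(image_path)
    return missing


# Hypothetical usage with the notebook's configuration:
# missing = find_missing_descriptions(Path(IMAGES_DIR), Path(OUTPUT_DESC_DIR))
# print(f"{len(missing)} pages still need descriptions")
```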
