<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Gen AI Experiments](https://img.shields.io/badge/Gen%20AI%20Experiments-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://github.com/buildfastwithai/gen-ai-experiments)
[![Gen AI Experiments GitHub](https://img.shields.io/github/stars/buildfastwithai/gen-ai-experiments?style=for-the-badge&logo=github&color=gold)](http://github.com/buildfastwithai/gen-ai-experiments)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/[NOTEBOOK_ID])

## Master Generative AI in 8 Weeks
**What You'll Learn:**
- Master cutting-edge AI tools & frameworks
- 6 weeks of hands-on, project-based learning
- Weekly live mentorship sessions
- No coding experience required
- Join the Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

---

# Gemini 3 Pro - Multimodal RAG

**Created by:** @BuildFastWithAI  
**Model:** Google Gemini 3 Pro  
**Last Updated:** November 2025

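Build RAG systems that handle text, images, and documents: Gemini describes each image, and those descriptions are embedded and retrieved alongside the document text.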

In [None]:
!pip install -q google-generativeai pillow pypdf2 pdf2image langchain langchain-google-genai langchain-community faiss-cpu

In [None]:
import google.generativeai as genai
from google.colab import userdata
from PIL import Image
import io
import base64

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

## 1. Image Understanding

In [None]:
# Create sample image
from PIL import Image, ImageDraw, ImageFont

img = Image.new('RGB', (400, 200), color='white')
draw = ImageDraw.Draw(img)
draw.text((50, 80), "Sample Chart Data", fill='black')
img.save('sample_chart.png')

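# Analyze image
# NOTE: confirm that 'gemini-3-pro-vision' is an available model ID for your
# API key (genai.list_models() lists the valid names) before running.
model = genai.GenerativeModel('gemini-3-pro-vision')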

img = Image.open('sample_chart.png')
response = model.generate_content([
    "Describe this image in detail.",
    img
])

print(response.text)
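
The cell above passes a `PIL.Image` object to the model. When the image is only available as a file or raw bytes, a minimal sketch of an alternative is to build an inline part yourself. The helper name `image_part` is hypothetical, and the `{'mime_type', 'data'}` dict shape is the one the `google-generativeai` SDK has documented for inline blobs; treat it as an assumption to verify against your installed version.

```python
import mimetypes
from pathlib import Path


def image_part(path: str) -> dict:
    """Build an inline image part ({'mime_type', 'data'}) from a file on disk.

    Hypothetical helper: the dict shape is assumed to match what
    generate_content accepts for inline image data.
    """
    # Infer the MIME type from the file extension (e.g. .png -> image/png).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {path}")
    # Read the raw bytes; no decoding or resizing is done here.
    return {"mime_type": mime_type, "data": Path(path).read_bytes()}
```

Under that assumption, the result would be passed in place of the PIL image, for example `model.generate_content(["Describe this image in detail.", image_part("sample_chart.png")])`.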

## 2. Visual Question Answering

In [None]:
# VQA with Gemini
def visual_qa(image_path: str, question: str) -> str:
    """Answer questions about an image."""
    model = genai.GenerativeModel('gemini-3-pro-vision')
    img = Image.open(image_path)
    
    response = model.generate_content([question, img])
    return response.text

# Test VQA
answer = visual_qa(
    'sample_chart.png',
    'What text is visible in this image?'
)
print(f"Answer: {answer}")

## 3. Document Processing with Images

In [None]:
class MultimodalDocument:
    def __init__(self, text: str = "", images: list = None):
        self.text = text
        self.images = images or []
        self.image_descriptions = []
    
    def process_images(self, model):
        """Generate descriptions for all images."""
        # Reset first so calling this twice does not duplicate descriptions.
        self.image_descriptions = []
        for img_path in self.images:
            img = Image.open(img_path)
            response = model.generate_content([
                "Describe this image concisely.",
                img
            ])
            self.image_descriptions.append(response.text)
    
    def get_full_content(self) -> str:
        """Get combined text and image descriptions."""
        content = self.text
        for i, desc in enumerate(self.image_descriptions, 1):
            content += f"\n\nImage {i}: {desc}"
        return content

# Test
doc = MultimodalDocument(
    text="This document contains analysis.",
    images=['sample_chart.png']
)

vision_model = genai.GenerativeModel('gemini-3-pro-vision')
doc.process_images(vision_model)

print(doc.get_full_content())
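
`process_images` calls the vision model once per image, every time it runs. If the same image appears in several documents, one possible refinement is to cache descriptions by file content. The sketch below is not part of the original notebook: `DescriptionCache` and its `describe` callable are hypothetical names, and `describe` stands in for whatever function wraps the `generate_content` call.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


class DescriptionCache:
    """Memoise image descriptions by the SHA-256 of the file contents."""

    def __init__(self, describe: Callable[[str], str]):
        # Hypothetical callable: takes an image path, returns a description
        # (in the notebook this would wrap the vision-model call).
        self._describe = describe
        self._by_hash: Dict[str, str] = {}

    def get(self, image_path: str) -> str:
        # Hash the bytes rather than the path, so copies of the same image
        # under different names share one cached description.
        digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
        if digest not in self._by_hash:
            self._by_hash[digest] = self._describe(image_path)
        return self._by_hash[digest]
```

Keying on content rather than path means a renamed or duplicated file does not trigger a second model call. The trade-off is that each file is read once per lookup.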

## 4. Multimodal RAG System

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document  # current import path for Document

class MultimodalRAG:
    def __init__(self, api_key: str):
        self.embeddings = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001",
            google_api_key=api_key
        )
        self.text_llm = ChatGoogleGenerativeAI(
            model="gemini-3-pro",
            google_api_key=api_key
        )
        self.vision_model = genai.GenerativeModel('gemini-3-pro-vision')
        self.vectorstore = None
        self.documents = []
    
    def add_document(self, text: str, images: list = None):
        """Add multimodal document."""
        doc = MultimodalDocument(text, images)
        if images:
            doc.process_images(self.vision_model)
        
        self.documents.append(doc)
    
    def build_index(self):
        """Build vector index from documents."""
        docs = [
            Document(page_content=doc.get_full_content())
            for doc in self.documents
        ]
        self.vectorstore = FAISS.from_documents(docs, self.embeddings)
    
    def query(self, question: str, k: int = 3) -> str:
        """Query the multimodal RAG."""
        if not self.vectorstore:
            return "No index built"
        
        # Retrieve
        docs = self.vectorstore.similarity_search(question, k=k)
        context = "\n\n".join([doc.page_content for doc in docs])
        
        # Generate
        prompt = f"""
Answer based on the context (includes text and image descriptions).

Context:
{context}

Question: {question}

Answer:
"""
        
        # predict() is deprecated; invoke() returns a message whose .content holds the reply
        return self.text_llm.invoke(prompt).content

# Test multimodal RAG
rag = MultimodalRAG(GOOGLE_API_KEY)

rag.add_document(
    text="Q4 sales report shows growth.",
    images=['sample_chart.png']
)
rag.add_document(
    text="Revenue increased by 25% year over year."
)

rag.build_index()

answer = rag.query("What does the report show?")
print(f"Answer: {answer}")
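
To make the retrieval step concrete, here is a toy, dependency-free sketch of what `similarity_search` does conceptually: embed the query, then return the nearest stored documents. Everything in it is illustrative. `embed` is a hypothetical bag-of-words stand-in for the Gemini embedding model, and Euclidean distance is used on the assumption that it matches the LangChain FAISS default.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str, vocab: List[str]) -> List[float]:
    """Toy embedding: unit-normalised word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def top_k(query: str, docs: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents closest to the query by Euclidean distance."""
    # Shared vocabulary built from the documents and the query.
    vocab = sorted({w for d in docs + [query] for w in d.lower().split()})
    q = embed(query, vocab)
    scored = [(math.dist(q, embed(d, vocab)), d) for d in docs]
    return sorted(scored)[:k]  # smallest distance first


# The first entry mimics get_full_content(): text followed by an image description.
docs = [
    "q4 sales report shows growth image 1: bar chart of quarterly sales",
    "revenue increased by 25% year over year",
    "office relocation schedule for next spring",
]
for dist, doc in top_k("sales chart", docs):
    print(f"{dist:.3f}  {doc}")
```

Because the image description is appended to the document text before embedding, a query about the chart can match a document whose original text never mentions it. That is the mechanism the notebook's RAG class relies on.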

## Key Takeaways

✅ **Multimodal Capabilities:**
- Image understanding and captioning
- Visual question answering
- Document processing with images
- Cross-modal retrieval (images are retrieved through their text descriptions)

📌 **Use Cases:**
- Document Q&A with charts/diagrams
- Product catalogs with images
- Research papers with figures
- Medical records with scans

🔗 **Resources:**
- Follow [@BuildFastWithAI](https://twitter.com/BuildFastWithAI)