# Multimodal RAG Workflow

This notebook demonstrates the complete multimodal RAG pipeline:
1. **Document Processing**: PDF, Word, Excel, PowerPoint, Images
2. **Chunking**: Splitting text into overlapping chunks
3. **Embedding**: Vector generation with Sentence Transformers
4. **Storage**: OpenSearch vector database
5. **Retrieval**: Semantic and hybrid search
6. **Generation**: LLM-based answer generation

## Setup & Configuration

In [1]:
# !pip install -r requirements.txt

In [14]:
import sys
import os
from pathlib import Path

# Add the project root (parent of the notebooks directory) to sys.path
# so that the `app` package can be imported
project_root = str(Path.cwd().parent)
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("Environment loaded!")
print(f"OpenSearch Host: {os.getenv('OPENSEARCH_HOST', 'http://localhost:9200')}")

Environment loaded!
OpenSearch Host: http://localhost:9200


## 1. Document Processing

Test the document processors for different file types.

In [15]:
from app.processors import get_factory, process_file, Document

# Get processor factory
factory = get_factory()

print("Supported file formats:")
for ext in sorted(factory.supported_extensions):
    print(f"  {ext}")

Supported file formats:
  .bmp
  .csv
  .doc
  .docx
  .gif
  .jpeg
  .jpg
  .markdown
  .md
  .pdf
  .png
  .ppt
  .pptx
  .rst
  .text
  .tsv
  .txt
  .webp
  .xls
  .xlsx
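
The extension listing above suggests a factory that dispatches on the file suffix. The following is a minimal sketch of that idea, not the actual `app.processors` implementation: `ProcessorRegistry`, `Processor`, and `read_plain_text` are hypothetical names introduced only for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical processor signature: takes a path, returns extracted text pieces.
Processor = Callable[[Path], List[str]]


class ProcessorRegistry:
    """Maps lowercase file extensions to processor callables (illustrative only)."""

    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}

    def register(self, extensions: List[str], processor: Processor) -> None:
        # Normalize to lowercase so ".PDF" and ".pdf" resolve to the same processor.
        for ext in extensions:
            self._processors[ext.lower()] = processor

    @property
    def supported_extensions(self) -> List[str]:
        return sorted(self._processors)

    def get(self, path: Path) -> Processor:
        ext = path.suffix.lower()
        if ext not in self._processors:
            raise ValueError(f"Unsupported file type: {ext or '(none)'}")
        return self._processors[ext]


def read_plain_text(path: Path) -> List[str]:
    # Simplest possible processor: the whole file becomes one text piece.
    return [path.read_text(encoding="utf-8")]


registry = ProcessorRegistry()
registry.register([".txt", ".text", ".md", ".markdown", ".rst"], read_plain_text)
```

Keeping the lookup keyed on the normalized suffix means that supporting a new format only requires registering one more callable.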


In [16]:
# Example: Process a PDF file
# Replace with your actual file path
pdf_path = Path("../data/uploads/179124.pdf")

if pdf_path.exists():
    documents = process_file(pdf_path)
    print(f"Extracted {len(documents)} document(s) from PDF")
    
    for i, doc in enumerate(documents[:10]):  # Show up to the first 10 documents
        print(f"\n--- Document {i+1} ---")
        print(f"Page: {doc.metadata.get('page_number')}")
        print(f"Content preview: {doc.content[:300]}...")
else:
    print(f"Sample file not found: {pdf_path}")
    print("Upload a PDF to test processing.")

Extracted 14 document(s) from PDF

--- Document 1 ---
Page: 1
Content preview: # **SAFETY DATA SHEET**

according to Regulation (EC) No. 1907/2006



Version 6.12

Revision Date 30.12.2023



according to Regulation (EC) No. 1907/2006 Print Date 30.10.2025

GENERIC EU MSDS - NO COUNTRY SPECIFIC DATA - NO OEL DATA

**SECTION 1: Identification of the substance/mixture and of the...

--- Document 2 ---
Page: 2
Content preview: **2.2** **Label elements**


**Labelling according Regulation (EC) No 1272/2008**
Pictogram


Signal Word Danger

Hazard Statements
H225 Highly flammable liquid and vapor.
H319 Causes serious eye irritation.
H336 May cause drowsiness or dizziness.

Precautionary Statements
P210 Keep away from heat, ...

--- Document 3 ---
Page: 3
Content preview: **SECTION 3: Composition/information on ingredients**


**3.1** **Substances**

Formula : C3H6O
Molecular weight : 58,08 g/mol
CAS-No. : 67-64-1

EC-No. : 200-662-2

Index-No. : 606-001-00-8










|Component|Classificat

## 2. Text Chunking

Split documents into manageable chunks with overlap.

In [17]:
from app.chunking import TextChunker, ChunkConfig, chunk_documents

# Configure chunking
config = ChunkConfig(
    chunk_size=500,
    chunk_overlap=100,
)

chunker = TextChunker(config)

# Test with sample text
sample_text = """
Vitamin C, also known as ascorbic acid, is a water-soluble vitamin found in citrus fruits and vegetables.
It plays a crucial role in collagen synthesis, immune function, and acts as a powerful antioxidant.

The recommended daily intake varies by age and health status. Adults typically need 65-90 mg per day.
Deficiency can lead to scurvy, characterized by fatigue, gum disease, and skin problems.

Food sources rich in Vitamin C include oranges, strawberries, bell peppers, and broccoli.
Cooking can reduce the vitamin content, so raw or lightly cooked vegetables retain more nutrients.
"""

chunks = chunker.chunk_text(sample_text)
print(f"Original text: {len(sample_text)} characters")
print(f"Created {len(chunks)} chunks")

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    # Prints the full chunk; "..." is appended to chunks longer than 200 chars
    print((chunk + "...") if len(chunk) > 200 else chunk)

Original text: 589 characters
Created 2 chunks

--- Chunk 1 (397 chars) ---
Vitamin C, also known as ascorbic acid, is a water-soluble vitamin found in citrus fruits and vegetables.
It plays a crucial role in collagen synthesis, immune function, and acts as a powerful antioxidant.

The recommended daily intake varies by age and health status. Adults typically need 65-90 mg per day.
Deficiency can lead to scurvy, characterized by fatigue, gum disease, and skin problems....

--- Chunk 2 (286 chars) ---
per day.
Deficiency can lead to scurvy, characterized by fatigue, gum disease, and skin problems. Food sources rich in Vitamin C include oranges, strawberries, bell peppers, and broccoli.
Cooking can reduce the vitamin content, so raw or lightly cooked vegetables retain more nutrients....
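
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal character-based sliding-window sketch. It is an assumption-laden simplification: the output above shows that `TextChunker` splits on word or sentence boundaries, which this version does not attempt, and `sliding_window_chunks` is a hypothetical helper name.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 100
) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=500` and `chunk_overlap=100`, each window starts 400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.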


## 3. Embedding Generation

Generate vector embeddings using Sentence Transformers.

In [18]:
from app.embeddings import embed_async, embed_sync, get_embedding_dimension
import numpy as np

# Check embedding dimension
dim = get_embedding_dimension()
print(f"Embedding dimension: {dim}")

# Generate embeddings for test texts
test_texts = [
    "Vitamin C is essential for immune function.",
    "Antioxidants help prevent cellular damage.",
    "The weather is sunny today.",  # Unrelated text
]

vectors = embed_sync(test_texts)
print(f"\nGenerated {len(vectors)} embeddings")
print(f"Vector shape: {vectors.shape}")

# Compute similarity
from numpy import dot
from numpy.linalg import norm

def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

print("\nCosine similarities:")
print(f"  Text 1 vs Text 2 (related): {cosine_sim(vectors[0], vectors[1]):.4f}")
print(f"  Text 1 vs Text 3 (unrelated): {cosine_sim(vectors[0], vectors[2]):.4f}")

Embedding dimension: 384


Generated 3 embeddings
Vector shape: (3, 384)

Cosine similarities:
  Text 1 vs Text 2 (related): 0.4052
  Text 1 vs Text 3 (unrelated): 0.0076
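
When more than a couple of texts need comparing, the pairwise `cosine_sim` calls can be replaced by one matrix product over row-normalized vectors. The sketch below uses small made-up vectors instead of real embeddings; `cosine_similarity_matrix` is a hypothetical helper, not part of `app.embeddings`.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between the rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms to avoid dividing by zero for all-zero rows.
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 2-D vectors standing in for embeddings (illustrative values only).
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
similarities = cosine_similarity_matrix(toy_vectors)
print(np.round(similarities, 4))
```

Entry `[i, j]` of the result corresponds to `cosine_sim(vectors[i], vectors[j])` from the cell above.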


## 4. OpenSearch Integration

Test connection and operations with OpenSearch.

In [19]:
from app.opensearch import (
    get_client, 
    ensure_index_exists, 
    bulk_insert, 
    knn_query,
    get_document_stats
)

# Test connection
try:
    client = get_client()
    info = client.info()
    print(f"Connected to OpenSearch {info['version']['number']}")
    
    # Check index
    ensure_index_exists()
    stats = get_document_stats()
    print(f"\nIndex stats:")
    print(f"  Total documents: {stats['total_documents']}")
    print(f"  File types: {stats['file_types']}")
    
except Exception as e:
    print(f"OpenSearch connection failed: {e}")
    print("Make sure OpenSearch is running: docker-compose up opensearch")

Connected to OpenSearch 2.9.0

Index stats:
  Total documents: 0
  File types: {}
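
For orientation, an index that stores 384-dimensional embeddings with the OpenSearch k-NN plugin could be defined roughly as follows. This is a sketch under assumptions: the field names (`content`, `source`, `file_type`, `embedding`) and the HNSW settings are illustrative, and the index that `ensure_index_exists()` actually creates may differ.

```python
def build_knn_index_body(dimension: int = 384) -> dict:
    """Build an index body for k-NN vector search (field names are assumptions)."""
    return {
        # Enables the k-NN plugin's approximate search for this index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "source": {"type": "keyword"},
                "file_type": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    # Must match the embedding model's output dimension.
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                    },
                },
            }
        },
    }


index_body = build_knn_index_body(dimension=384)
# With opensearch-py, a body like this could be passed to
# client.indices.create(index="<your-index-name>", body=index_body).
```

The `dimension` value has to agree with `get_embedding_dimension()`; a mismatch causes indexing errors when vectors are inserted.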


## 5. Full Ingestion Pipeline

Test the complete ingestion workflow.

In [20]:
from app.ingestion import ingest_texts, ingest_file

# Ingest sample texts
sample_docs = [
    "Emulsifiers are compounds that help mix oil and water-based ingredients. Common emulsifiers include lecithin, mono and diglycerides, and polysorbates.",
    "The pH of a formulation affects stability, efficacy, and safety. Most skincare products have a pH between 4.5 and 6.5 to match skin's natural acidity.",
    "Antioxidants like Vitamin E and BHT are added to prevent oxidation of oils and fats. This extends product shelf life and maintains quality.",
    "Preservatives such as parabens, phenoxyethanol, and potassium sorbate prevent microbial growth in water-based formulations.",
    "Viscosity modifiers like carbomers, xanthan gum, and cellulose derivatives control product texture and flow properties.",
]

result = await ingest_texts(
    texts=sample_docs,
    source_name="formulation_guide",
)

print("Ingestion result:")
print(f"  Success: {result['success']}")
print(f"  Texts provided: {result['texts_provided']}")
print(f"  Chunks stored: {result['chunks_stored']}")

Ingestion result:
  Success: True
  Texts provided: 5
  Chunks stored: 5
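
The storage step of an ingestion pipeline typically turns each chunk and its vector into one bulk action. The sketch below shows one way to do that with deterministic document IDs, so re-ingesting the same source overwrites rather than duplicates. It is not the `ingest_texts` implementation; `build_bulk_actions`, the index name, and the field names are assumptions.

```python
import hashlib
from typing import List, Sequence


def build_bulk_actions(
    index_name: str,
    source: str,
    chunks: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> List[dict]:
    """Pair each chunk with its vector and emit one bulk action per chunk."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")

    actions: List[dict] = []
    for position, (chunk, vector) in enumerate(zip(chunks, vectors)):
        # Deterministic ID: same source, position, and text give the same document.
        digest = hashlib.sha256(
            f"{source}:{position}:{chunk}".encode("utf-8")
        ).hexdigest()
        actions.append(
            {
                "_index": index_name,
                "_id": digest,
                "_source": {
                    "content": chunk,
                    "source": source,
                    "chunk_index": position,
                    "embedding": list(vector),
                },
            }
        )
    return actions


# Action dicts in this shape are accepted by opensearchpy.helpers.bulk(client, actions).
```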


## 6. Retrieval Testing

Test semantic search and retrieval.

In [21]:
from app.retrieval import retrieve_top_k, retrieve_with_sources

# Test retrieval
query = "What helps prevent oxidation in formulations?"

results = await retrieve_with_sources(query, top_k=3)

print(f"Query: {query}\n")
print(f"Retrieved {len(results)} documents:\n")

for i, r in enumerate(results):
    print(f"--- Result {i+1} (score: {r['score']:.4f}) ---")
    print(f"Source: {r['source']}")
    print(f"Content: {r['content'][:200]}...\n")

Query: What helps prevent oxidation in formulations?

Retrieved 3 documents:

--- Result 1 (score: 0.6859) ---
Source: formulation_guide
Content: Antioxidants like Vitamin E and BHT are added to prevent oxidation of oils and fats. This extends product shelf life and maintains quality....

--- Result 2 (score: 0.6438) ---
Source: formulation_guide
Content: Preservatives such as parabens, phenoxyethanol, and potassium sorbate prevent microbial growth in water-based formulations....

--- Result 3 (score: 0.5840) ---
Source: formulation_guide
Content: Viscosity modifiers like carbomers, xanthan gum, and cellulose derivatives control product texture and flow properties....
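
The results above come from semantic (vector) search only. For hybrid search, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration with made-up document IDs; whether `app.retrieval` combines rankings this way is not shown in this notebook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a vector query and a keyword query.
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

RRF uses only rank positions, so it sidesteps the problem that cosine scores and keyword scores live on different scales.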



In [22]:
# Test with filter
results_filtered = await retrieve_top_k(
    query="emulsifiers",
    top_k=3,
    filters={"file_type": "text"},
)

print(f"Filtered results: {len(results_filtered)}")

Filtered results: 3
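
A filter such as `{"file_type": "text"}` can be expressed in the OpenSearch query DSL by wrapping the k-NN clause in a `bool` query. The sketch below builds such a request body; the `embedding` field name and the helper `build_filtered_knn_query` are assumptions, and `retrieve_top_k` may construct its query differently.

```python
from typing import Dict, Optional, Sequence


def build_filtered_knn_query(
    query_vector: Sequence[float],
    top_k: int = 3,
    filters: Optional[Dict[str, str]] = None,
    vector_field: str = "embedding",
) -> dict:
    """Build a k-NN search body, optionally restricted by exact-match filters."""
    knn_clause = {"knn": {vector_field: {"vector": list(query_vector), "k": top_k}}}
    if not filters:
        return {"size": top_k, "query": knn_clause}

    return {
        "size": top_k,
        "query": {
            "bool": {
                "must": [knn_clause],
                # One term filter per field; all filters must match.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ],
            }
        },
    }
```

In this form the filter is applied to the neighbors the approximate search has already selected, so a restrictive filter can return fewer than `top_k` hits.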


## 7. RAG Answer Generation

Test the complete RAG pipeline with LLM generation.

In [23]:
from app.generation import generate_answer, generate_answer_with_sources

# Full RAG pipeline
question = "What are the main types of preservatives used in formulations and why are they important?"

# Step 1: Retrieve
docs = await retrieve_with_sources(question, top_k=5)
print(f"Retrieved {len(docs)} relevant documents\n")

# Step 2: Generate
if docs:
    response = await generate_answer_with_sources(
        question=question,
        documents=docs,
    )
    
    print("Question:", question)
    print("\n" + "="*50 + "\n")
    print("Answer:")
    print(response["answer"])
    print("\nSources:", response["sources"])
else:
    print("No documents retrieved. Please ingest some documents first.")

Retrieved 5 relevant documents

Question: What are the main types of preservatives used in formulations and why are they important?

==================================================

Answer:
The main types of preservatives used in formulations are parabens, phenoxyethanol, and potassium sorbate. They are important because they prevent microbial growth in water-based formulations, ensuring the safety and longevity of the products [Source: formulation_guide].

Sources: ['formulation_guide']
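
The `[Source: formulation_guide]` citation in the answer implies that the retrieved documents are passed to the LLM together with their source names. A minimal sketch of such a prompt builder follows. The prompt wording and the helper `build_rag_prompt` are assumptions, not the template used by `app.generation`; only the `content` and `source` keys are taken from the retrieval results shown earlier.

```python
from typing import Dict, List


def build_rag_prompt(question: str, documents: List[Dict[str, str]]) -> str:
    """Assemble a grounded prompt from retrieved documents."""
    # Tag every context block with its source so the model can cite it.
    context_blocks = [
        f"[Source: {doc['source']}]\n{doc['content']}" for doc in documents
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources as [Source: name]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


example_prompt = build_rag_prompt(
    "What do preservatives do?",
    [
        {
            "source": "formulation_guide",
            "content": "Preservatives prevent microbial growth in water-based formulations.",
        }
    ],
)
```

Instructing the model to admit when the context is insufficient is a common guard against answers that go beyond the retrieved text.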


## 8. File Upload & Processing Demo

Test uploading and processing different file types.

In [24]:
# Upload and process a file (adjust path as needed)
from pathlib import Path

# Example: Process uploaded files
upload_dir = Path("../data/uploads")
upload_dir.mkdir(parents=True, exist_ok=True)

# List uploaded files
uploaded_files = list(upload_dir.glob("*"))
print(f"Files in upload directory: {len(uploaded_files)}")
for f in uploaded_files[:10]:
    print(f"  - {f.name}")

Files in upload directory: 3
  - .gitkeep
  - 179124.pdf
  - S25255.pdf


In [None]:
# Process a specific file (adjust the file name as needed)
file_to_process = upload_dir / "179124.pdf"

if file_to_process.exists():
    result = await ingest_file(file_to_process)
    print("Ingestion result:")
    for key, value in result.items():
        print(f"  {key}: {value}")

Ingestion result:
  success: True
  file: 179124.pdf
  file_type: pdf
  documents_extracted: 14
  chunks_created: 32
  chunks_stored: 32
  processing_time_seconds: 2.16


## 9. API Testing

Test the FastAPI endpoints.

In [12]:
import httpx

API_URL = "http://localhost:8000"

async def test_api():
    async with httpx.AsyncClient() as client:
        # Health check
        response = await client.get(f"{API_URL}/health")
        print(f"Health: {response.json()}")
        
        # Supported formats
        response = await client.get(f"{API_URL}/supported-formats")
        print(f"\nSupported formats: {response.json()}")
        
        # Stats
        response = await client.get(f"{API_URL}/stats")
        print(f"\nDocument stats: {response.json()}")

try:
    await test_api()
except Exception as e:
    print(f"API not available: {e}")
    print("Start the API with: docker-compose up api")

API not available: Server disconnected without sending a response.
Start the API with: docker-compose up api


In [None]:
# Test RAG query via API
async def query_api(question: str):
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{API_URL}/query",
            json={
                "question": question,
                "top_k": 5,
            }
        )
        return response.json()

try:
    result = await query_api("What are emulsifiers?")
    print("API Response:")
    print(f"Answer: {result.get('answer', 'N/A')}")
    print(f"Sources: {result.get('sources', [])}")
except Exception as e:
    print(f"Query failed: {e}")

## 10. Cleanup

Optional: Clear the index and start fresh.

In [13]:
# WARNING: This deletes all indexed documents!
# Run this cell only when you want to start from an empty index.

from app.opensearch import clear_index
clear_index()
print("Index cleared!")

Index cleared!
