An end-to-end Retrieval-Augmented Generation pipeline that ingests PDFs containing text, tables, and images, summarises every modality, and serves answers via a custom multi-vector retriever.
Built during my AI Engineer internship at Otsuka Corporation, Tokyo (May–Jul 2024). The system was prototyped as the retrieval backbone for an internal knowledge-base assistant where most source documents are PDF reports rich with tables and figures — domains where naive text-only RAG fails badly.
Vanilla RAG tokenises every page as raw text and loses two things that matter most in real-world docs:
- Tables — flattened to gibberish character soup
- Figures / charts / diagrams — completely ignored
This pipeline treats text, tables, and images as first-class retrievable units, each with its own extraction path, summarisation strategy, and embedding.
```
PDF ──> unstructured.io ──> {   text   ,   tables   ,   images   }
                                 │           │            │
                                 ▼           ▼            ▼
                             Summarise   HTML→JSON       VLM
                            (LangChain)  summarise  (MILVLG/imp-v1-3b)
                                 │           │            │
                                 └───────────┼────────────┘
                                             │
                                             ▼
                                  Multi-Vector Retriever
                                       (LlamaIndex)
                                             │
                                             ▼
                                        LLM Answer
```
| Layer | Tooling |
|---|---|
| Extraction | unstructured.io (text + tables + image segmentation) |
| Summarisation | LangChain (text/table batch summariser) |
| Vision-Language Model | MILVLG/imp-v1-3b for image summarisation |
| Indexing | LlamaIndex (multi-vector retriever, custom node graph) |
| Embeddings | sentence-transformers, Azure OpenAI embeddings |
| Evaluation | RAGAS, ROUGE, BERTScore |
| Runtime | Python 3.10, PyTorch 2.3, Jupyter |
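For context, a minimal sketch of how the embedding row of this table can be wired up in LlamaIndex. The model and deployment names are placeholders rather than the notebook's exact choices:

```python
# Sketch: configuring the embedding model for LlamaIndex.
# Model and deployment names below are placeholders, not necessarily
# the ones used in otsukafinal.ipynb.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Local sentence-transformers model for embedding the summaries
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Alternatively, Azure OpenAI embeddings using the keys from .env:
# Settings.embed_model = AzureOpenAIEmbedding(
#     model="text-embedding-ada-002",
#     deployment_name="embedding-deployment",   # hypothetical deployment name
#     api_key=os.environ["AZURE_OPENAI_API_KEY"],
#     azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
#     api_version="2024-02-01",
# )
```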
PDFs are decomposed into typed elements (text blocks, tables as HTML, images as PIL). Of the libraries tested, unstructured consistently produced the cleanest output, despite its dependency-conflict headaches.
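A sketch of this extraction step, assuming unstructured's `hi_res` PDF partitioner as of the 2024 releases (the file path is illustrative and argument names may vary slightly between versions):

```python
# Sketch: decomposing a PDF into typed elements with unstructured.
# Based on unstructured's partition_pdf API circa 2024; argument names
# may differ in newer releases.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="source_docs/example_report.pdf",   # any PDF under source_docs/
    strategy="hi_res",                # layout model needed for tables/images
    infer_table_structure=True,       # keep tables as HTML in element metadata
    extract_images_in_pdf=True,       # crop figure regions to image files
    extract_image_block_output_dir="extracted_images",
)

texts, tables = [], []
for el in elements:
    if el.category == "Table":
        tables.append(el.metadata.text_as_html)   # HTML table string
    elif el.category in ("NarrativeText", "Title", "ListItem"):
        texts.append(el.text)
```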
Tables are converted from HTML to JSON, then both text chunks and tables go through a batch summariser. Storing summaries (not raw chunks) as the retrieval target proved far better for cross-modal recall.
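A sketch of the table conversion and batch summarisation, assuming an Azure OpenAI chat deployment through LangChain. The deployment name and prompt wording are illustrative, and `texts` / `tables` come from the extraction sketch above:

```python
# Sketch: HTML→JSON table conversion plus batch summarisation with LangChain.
# Deployment name and prompt wording are illustrative; AzureChatOpenAI reads
# AZURE_OPENAI_API_KEY / AZURE_OPENAI_ENDPOINT from the .env file.
import io
import json
import pandas as pd                   # pd.read_html needs lxml installed
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import AzureChatOpenAI

def html_table_to_json(html: str) -> str:
    """Parse an HTML table fragment into a list-of-records JSON string."""
    df = pd.read_html(io.StringIO(html))[0]
    return json.dumps(df.to_dict(orient="records"), ensure_ascii=False)

llm = AzureChatOpenAI(azure_deployment="gpt-4o-mini", api_version="2024-02-01")  # hypothetical deployment
prompt = ChatPromptTemplate.from_template(
    "Summarise the following content for retrieval. Be concise and factual:\n\n{element}"
)
summarise = prompt | llm | StrOutputParser()

table_json = [html_table_to_json(t) for t in tables]
text_summaries = summarise.batch([{"element": t} for t in texts], config={"max_concurrency": 5})
table_summaries = summarise.batch([{"element": t} for t in table_json], config={"max_concurrency": 5})
```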
Image regions are passed through MILVLG/imp-v1-3b — a small VLM that gave near-CLIP-Large performance at a fraction of the storage / inference cost.
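A sketch of the image-summarisation step, following the usage documented on the imp-v1-3b model card; helper names such as `image_preprocess` belong to the model's remote code and may change between revisions:

```python
# Sketch: summarising extracted figures with MILVLG/imp-v1-3b.
# Mirrors the model card's documented usage; `image_preprocess` and the
# prompt format come from the model's remote code and may change.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

def summarise_image(path: str) -> str:
    """Return a short, retrieval-oriented description of one extracted figure."""
    prompt = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "USER: <image>\nDescribe this figure for a document search index. ASSISTANT:"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    image_tensor = model.image_preprocess(Image.open(path))
    output_ids = model.generate(
        input_ids, images=image_tensor, max_new_tokens=120, use_cache=True
    )[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
```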
Three retrievers were prototyped and benchmarked:
- LangChain Multi-Vector — each text chunk mapped to its parent doc
- LlamaIndex with metadata — summaries + tables attached as node metadata
- Custom retriever — embeds the summaries as separate nodes connected back to source chunks (best recall for cross-modal queries; see the sketch below)
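A minimal sketch of the third approach, expressed with LlamaIndex's `IndexNode` / `RecursiveRetriever` pattern: summaries are what get embedded, and hits resolve back to their source chunks. Variable names are illustrative, `texts` / `text_summaries` come from the earlier sketches, and the notebook's custom retriever differs in detail:

```python
# Sketch: summary nodes linked back to source chunks via RecursiveRetriever.
# Tables/images are handled the same way; only text chunks are shown here.
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import IndexNode, TextNode
from llama_index.core.retrievers import RecursiveRetriever

source_nodes, summary_nodes = [], []
for i, (chunk, summary) in enumerate(zip(texts, text_summaries)):
    src = TextNode(text=chunk, id_=f"src-{i}")
    # The summary is embedded; index_id points back at the source chunk.
    summary_nodes.append(IndexNode(text=summary, index_id=src.id_))
    source_nodes.append(src)

vector_index = VectorStoreIndex(summary_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=4)

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict={n.id_: n for n in source_nodes},   # resolve summary hits to source chunks
)
results = retriever.retrieve("What does the Q2 revenue table report?")  # hypothetical query
```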
```bash
git clone https://github.com/Arun-Raghav-S/Advanced_RAG.git
cd Advanced_RAG
pip install -r requirements.txt
cp .env.example .env   # fill in Azure OpenAI keys

# Open the notebook
jupyter notebook otsukafinal.ipynb
```

Heads up: `unstructured[all-docs]` pulls in heavy native dependencies (poppler, tesseract). After installing, restart the kernel and clear all outputs before running cells; otherwise import order can cause runtime errors.
Test PDFs and source docs are included under source_docs/ and test_data/ so you can run the full pipeline end-to-end.
- Retrieval accuracy improved ~25% vs. the text-only RAG baseline on the internal Otsuka eval set (RAGAS faithfulness + answer relevancy; scored as in the sketch below).
- Token usage / cost dropped meaningfully because retrieved context is summary-first.
- Limitation: the custom retriever's node-linking strategy was not fully completed — answers occasionally surface the summary instead of the underlying source span. Documented in the notebook for future work.
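For reference, a minimal sketch of how the RAGAS scores can be computed over question/answer/context triples. Column names follow the RAGAS dataset schema; the example row is hypothetical, and the internal Otsuka eval set is not included in this repo:

```python
# Sketch: scoring with RAGAS faithfulness and answer relevancy (2024-era API).
# RAGAS needs a judge LLM configured (e.g. OpenAI/Azure credentials in the env).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_ds = Dataset.from_dict({
    "question": ["What does Figure 2 show about sales by region?"],                 # hypothetical
    "answer":   ["Figure 2 shows Kansai leading H1 sales, followed by Kanto."],     # hypothetical
    "contexts": [["Figure 2 summary: bar chart of H1 sales by region ..."]],        # hypothetical
})

scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(scores)
```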
Also on Kaggle: kaggle.com/code/arunraghavs/otsukafinal — runnable end-to-end with a hosted GPU.
Educational / research use. Built as part of the Otsuka Corporation summer internship programme.




