
Advanced Multi-Modal RAG

An end-to-end Retrieval-Augmented Generation pipeline that ingests PDFs containing text, tables, and images, summarises every modality, and serves answers via a custom multi-vector retriever.

Built during my AI Engineer internship at Otsuka Corporation, Tokyo (May–Jul 2024). The system was prototyped as the retrieval backbone for an internal knowledge-base assistant where most source documents are PDF reports rich with tables and figures — domains where naive text-only RAG fails badly.


Why this exists

Vanilla RAG tokenises every page as raw text and loses two things that matter most in real-world docs:

  1. Tables — flattened to gibberish character soup
  2. Figures / charts / diagrams — completely ignored

This pipeline treats text, tables, and images as first-class retrievable units, each with its own extraction path, summarisation strategy, and embedding.


Architecture

Pipeline Flowchart

PDF ──> unstructured.io ──> { text, tables, images }
                              │       │        │
                              ▼       ▼        ▼
                          Summarise  HTML→JSON  VLM
                          (LangChain) summarise  (MILVLG/imp-v1-3b)
                              │       │        │
                              └───────┴────────┘
                                      │
                              Multi-Vector Retriever
                                  (LlamaIndex)
                                      │
                                      ▼
                                  LLM Answer

Stack

Layer                   Tooling
Extraction              unstructured.io (text + tables + image segmentation)
Summarisation           LangChain (text/table batch summariser)
Vision-Language Model   MILVLG/imp-v1-3b for image summarisation
Indexing                LlamaIndex (multi-vector retriever, custom node graph)
Embeddings              sentence-transformers, Azure OpenAI embeddings
Evaluation              RAGAS, ROUGE, BERTScore
Runtime                 Python 3.10, PyTorch 2.3, Jupyter

Pipeline (in plain English)

1 · Extract — unstructured.io

Data Extraction

PDFs are decomposed into typed elements (text blocks, tables as HTML, images as PIL objects). unstructured consistently produced the cleanest output of the libraries tested, despite its dependency-conflict headaches.
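A minimal sketch of this step, assuming a recent unstructured API (argument names such as extract_images_in_pdf have shifted across versions, and sample.pdf is a placeholder):

from unstructured.partition.pdf import partition_pdf

# Decompose a PDF into typed elements: text blocks, tables (with HTML), image regions
elements = partition_pdf(
    filename="source_docs/sample.pdf",
    strategy="hi_res",               # layout model; needed for table/image detection
    infer_table_structure=True,      # tables then expose .metadata.text_as_html
    extract_images_in_pdf=True,      # write figure crops to disk for the VLM step
)

texts  = [el for el in elements if el.category == "NarrativeText"]
tables = [el for el in elements if el.category == "Table"]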

2 · Summarise text + tables — LangChain

Text Processing

Tables are converted from HTML to JSON, then both text chunks and tables go through a batch summariser. Storing summaries (not raw chunks) as the retrieval target proved far better for cross-modal recall.
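A minimal sketch of this step, assuming pandas for the HTML→JSON conversion and a placeholder Azure deployment name (the actual prompt lives in the notebook):

import pandas as pd
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

# HTML tables -> JSON records, so the LLM sees structure instead of tag soup
table_json = [
    pd.read_html(t.metadata.text_as_html)[0].to_json(orient="records")
    for t in tables
]

llm = AzureChatOpenAI(azure_deployment="gpt-4o")  # placeholder deployment name
prompt = ChatPromptTemplate.from_template(
    "Summarise the following content for retrieval:\n\n{element}"
)
chain = prompt | llm | StrOutputParser()

# Batch both modalities; the summaries become the retrieval targets
text_summaries  = chain.batch([{"element": t.text} for t in texts], {"max_concurrency": 5})
table_summaries = chain.batch([{"element": j} for j in table_json], {"max_concurrency": 5})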

3 · Summarise images — Vision-Language Model

Summary Generation

Image regions are passed through MILVLG/imp-v1-3b — a small VLM that gave near-CLIP-Large performance at a fraction of the storage / inference cost.
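A minimal sketch following the usage pattern on the MILVLG/imp-v1-3b model card; image_preprocess is a helper shipped with the model's remote code (hence trust_remote_code=True), and the image path is a placeholder:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MILVLG/imp-v1-3b"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# <image> marks where the visual tokens are spliced into the prompt
prompt = ("A chat between a user and an assistant. "
          "USER: <image>\nDescribe this figure for a search index. ASSISTANT:")
image = Image.open("extracted_images/figure_01.png")  # placeholder path

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
image_tensor = model.image_preprocess(image)          # helper from the remote code
output_ids = model.generate(
    input_ids, images=image_tensor, max_new_tokens=120, use_cache=True
)[0]
summary = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()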

4 · Custom multi-vector retriever — LlamaIndex

Three retrievers were prototyped and benchmarked:

  1. LangChain Multi-Vector — each text chunk mapped to its parent doc
  2. LlamaIndex with metadata — summaries + tables attached as node metadata
  3. Custom retriever — embeds the summaries as separate nodes connected back to their source chunks (best recall for cross-modal queries; sketched below)

Custom Retriever
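A minimal sketch of option 3 using LlamaIndex IndexNodes resolved through a RecursiveRetriever — one way to realise "summary nodes linked back to source chunks"; raw_chunks and summaries are placeholders for the outputs of steps 1–3:

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import IndexNode, TextNode

# One TextNode per raw chunk (text, table JSON, or image reference)...
source_nodes = [TextNode(text=chunk) for chunk in raw_chunks]

# ...and one IndexNode per summary, pointing back at its source via index_id
summary_nodes = [
    IndexNode(text=summary, index_id=src.node_id)
    for summary, src in zip(summaries, source_nodes)
]

# Only the summaries are embedded; each hit resolves back to its source node
index = VectorStoreIndex(summary_nodes)
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": index.as_retriever(similarity_top_k=4)},
    node_dict={n.node_id: n for n in source_nodes},
)
nodes = retriever.retrieve("What does the revenue table say about Q2?")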


Quickstart

git clone https://github.com/Arun-Raghav-S/Advanced_RAG.git
cd Advanced_RAG
pip install -r requirements.txt
cp .env.example .env   # fill in Azure OpenAI keys

# Open the notebook
jupyter notebook otsukafinal.ipynb

Heads up: unstructured[all-docs] pulls in heavy native dependencies (poppler, tesseract). After installing, restart the kernel and clear all outputs before running cells; otherwise the import order can cause runtime errors.

Test PDFs and source docs are included under source_docs/ and test_data/ so you can run the full pipeline end-to-end.


Results & honest limitations

  • Retrieval accuracy improved ~25% over the text-only RAG baseline on the internal Otsuka eval set (RAGAS faithfulness + answer relevancy; a minimal eval sketch follows this list).
  • Token usage / cost dropped meaningfully because retrieved context is summary-first.
  • Limitation: the custom retriever's node-linking strategy was never fully implemented — answers occasionally surface the summary instead of the underlying source span. This is documented in the notebook as future work.
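For reference, the RAGAS scores come from an evaluation loop along these lines; questions, answers, and retrieved_contexts are placeholders, since the Otsuka eval set itself is internal and not included in the repo:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# question / answer / contexts triples collected from pipeline runs
eval_ds = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": retrieved_contexts,  # list[list[str]] of retrieved chunks
})

scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(scores)  # dict-like report, one score per metric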

Also on Kaggle

kaggle.com/code/arunraghavs/otsukafinal — runnable end-to-end with a hosted GPU.


License

Educational / research use. Built as part of the Otsuka Corporation summer internship programme.

Arun Raghav S · arunraghavdev.com
