An end-to-end Retrieval-Augmented Generation pipeline that ingests PDFs containing text, tables, and images, summarises every modality, and serves answers via a custom multi-vector retriever.
Built during my AI Engineer internship at Otsuka Corporation, Tokyo (May–Jul 2024). The system was prototyped as the retrieval backbone for an internal knowledge-base assistant where most source documents are PDF reports rich with tables and figures — domains where naive text-only RAG fails badly.
Vanilla RAG tokenises every page as raw text and loses two things that matter most in real-world docs:
- Tables — flattened to gibberish character soup
- Figures / charts / diagrams — completely ignored
This pipeline treats text, tables, and images as first-class retrievable units, each with its own extraction path, summarisation strategy, and embedding.
```
PDF ──> unstructured.io ──> {   text   ,   tables   ,   images   }
                                 │           │            │
                                 ▼           ▼            ▼
                             Summarise   HTML→JSON       VLM
                            (LangChain)  summarise  (MILVLG/imp-v1-3b)
                                 │           │            │
                                 └───────────┼────────────┘
                                             │
                                             ▼
                                  Multi-Vector Retriever
                                       (LlamaIndex)
                                             │
                                             ▼
                                        LLM Answer
```
| Layer | Tooling |
|---|---|
| Extraction | unstructured.io (text + tables + image segmentation) |
| Summarisation | LangChain (text/table batch summariser) |
| Vision-Language Model | MILVLG/imp-v1-3b for image summarisation |
| Indexing | LlamaIndex (multi-vector retriever, custom node graph) |
| Embeddings | sentence-transformers, Azure OpenAI embeddings |
| Evaluation | RAGAS, ROUGE, BERTScore |
| Runtime | Python 3.10, PyTorch 2.3, Jupyter |
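For context, a minimal sketch of how the embedding row of this table can be wired up in LlamaIndex. The model and deployment names are placeholders rather than the notebook's exact choices:

```python
# Sketch: configuring the embedding model for LlamaIndex.
# Model and deployment names below are placeholders, not necessarily
# the ones used in otsukafinal.ipynb.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Local sentence-transformers model for embedding the summaries
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Alternatively, Azure OpenAI embeddings using the keys from .env:
# Settings.embed_model = AzureOpenAIEmbedding(
#     model="text-embedding-ada-002",
#     deployment_name="embedding-deployment",   # hypothetical deployment name
#     api_key=os.environ["AZURE_OPENAI_API_KEY"],
#     azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
#     api_version="2024-02-01",
# )
```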
PDFs are decomposed into typed elements (text blocks, tables as HTML, images as PIL). Of the libraries tested, unstructured consistently produced the cleanest output, despite its dependency-conflict headaches.
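A sketch of this extraction step, assuming unstructured's `hi_res` PDF partitioner as of the 2024 releases (the file path is illustrative and argument names may vary slightly between versions):

```python
# Sketch: decomposing a PDF into typed elements with unstructured.
# Based on unstructured's partition_pdf API circa 2024; argument names
# may differ in newer releases.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="source_docs/example_report.pdf",   # any PDF under source_docs/
    strategy="hi_res",                # layout model needed for tables/images
    infer_table_structure=True,       # keep tables as HTML in element metadata
    extract_images_in_pdf=True,       # crop figure regions to image files
    extract_image_block_output_dir="extracted_images",
)

texts, tables = [], []
for el in elements:
    if el.category == "Table":
        tables.append(el.metadata.text_as_html)   # HTML table string
    elif el.category in ("NarrativeText", "Title", "ListItem"):
        texts.append(el.text)
```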
Tables are converted from HTML to JSON, then both text chunks and tables go through a batch summariser. Storing summaries (not raw chunks) as the retrieval target proved far better for cross-modal recall.
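A sketch of the table conversion and batch summarisation, assuming an Azure OpenAI chat deployment through LangChain. The deployment name and prompt wording are illustrative, and `texts` / `tables` come from the extraction sketch above:

```python
# Sketch: HTML→JSON table conversion plus batch summarisation with LangChain.
# Deployment name and prompt wording are illustrative; AzureChatOpenAI reads
# AZURE_OPENAI_API_KEY / AZURE_OPENAI_ENDPOINT from the .env file.
import io
import json
import pandas as pd                   # pd.read_html needs lxml installed
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import AzureChatOpenAI

def html_table_to_json(html: str) -> str:
    """Parse an HTML table fragment into a list-of-records JSON string."""
    df = pd.read_html(io.StringIO(html))[0]
    return json.dumps(df.to_dict(orient="records"), ensure_ascii=False)

llm = AzureChatOpenAI(azure_deployment="gpt-4o-mini", api_version="2024-02-01")  # hypothetical deployment
prompt = ChatPromptTemplate.from_template(
    "Summarise the following content for retrieval. Be concise and factual:\n\n{element}"
)
summarise = prompt | llm | StrOutputParser()

table_json = [html_table_to_json(t) for t in tables]
text_summaries = summarise.batch([{"element": t} for t in texts], config={"max_concurrency": 5})
table_summaries = summarise.batch([{"element": t} for t in table_json], config={"max_concurrency": 5})
```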
Image regions are passed through MILVLG/imp-v1-3b — a small VLM that gave near-CLIP-Large performance at a fraction of the storage / inference cost.
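A sketch of the image-summarisation step, following the usage documented on the imp-v1-3b model card; helper names such as `image_preprocess` belong to the model's remote code and may change between revisions:

```python
# Sketch: summarising extracted figures with MILVLG/imp-v1-3b.
# Mirrors the model card's documented usage; `image_preprocess` and the
# prompt format come from the model's remote code and may change.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

def summarise_image(path: str) -> str:
    """Return a short, retrieval-oriented description of one extracted figure."""
    prompt = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "USER: <image>\nDescribe this figure for a document search index. ASSISTANT:"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    image_tensor = model.image_preprocess(Image.open(path))
    output_ids = model.generate(
        input_ids, images=image_tensor, max_new_tokens=120, use_cache=True
    )[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
```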
Three retrievers were prototyped and benchmarked:
- LangChain Multi-Vector — each text chunk mapped to its parent doc
- LlamaIndex with metadata — summaries + tables attached as node metadata
- Custom retriever — embeds the summaries as separate nodes connected back to source chunks (best recall for cross-modal queries; see the sketch below)
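A minimal sketch of the third approach, expressed with LlamaIndex's `IndexNode` / `RecursiveRetriever` pattern: summaries are what get embedded, and hits resolve back to their source chunks. Variable names are illustrative, `texts` / `text_summaries` come from the earlier sketches, and the notebook's custom retriever differs in detail:

```python
# Sketch: summary nodes linked back to source chunks via RecursiveRetriever.
# Tables/images are handled the same way; only text chunks are shown here.
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import IndexNode, TextNode
from llama_index.core.retrievers import RecursiveRetriever

source_nodes, summary_nodes = [], []
for i, (chunk, summary) in enumerate(zip(texts, text_summaries)):
    src = TextNode(text=chunk, id_=f"src-{i}")
    # The summary is embedded; index_id points back at the source chunk.
    summary_nodes.append(IndexNode(text=summary, index_id=src.id_))
    source_nodes.append(src)

vector_index = VectorStoreIndex(summary_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=4)

retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict={n.id_: n for n in source_nodes},   # resolve summary hits to source chunks
)
results = retriever.retrieve("What does the Q2 revenue table report?")  # hypothetical query
```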
```bash
git clone https://github.com/Arun-Raghav-S/Advanced_RAG.git
cd Advanced_RAG
pip install -r requirements.txt
cp .env.example .env   # fill in Azure OpenAI keys

# Open the notebook
jupyter notebook otsukafinal.ipynb
```

Heads up: `unstructured[all-docs]` pulls in heavy native dependencies (poppler, tesseract). After installing, restart the kernel and clear all outputs before running cells; otherwise import order can cause runtime errors.
Test PDFs and source docs are included under source_docs/ and test_data/ so you can run the full pipeline end-to-end.
- Retrieval accuracy improved ~25% vs. the text-only RAG baseline on the internal Otsuka eval set (RAGAS faithfulness + answer relevancy; scored as in the sketch below).
- Token usage / cost dropped meaningfully because retrieved context is summary-first.
- Limitation: the custom retriever's node-linking strategy was not fully completed — answers occasionally surface the summary instead of the underlying source span. Documented in the notebook for future work.
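For reference, a minimal sketch of how the RAGAS scores can be computed over question/answer/context triples. Column names follow the RAGAS dataset schema; the example row is hypothetical, and the internal Otsuka eval set is not included in this repo:

```python
# Sketch: scoring with RAGAS faithfulness and answer relevancy (2024-era API).
# RAGAS needs a judge LLM configured (e.g. OpenAI/Azure credentials in the env).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_ds = Dataset.from_dict({
    "question": ["What does Figure 2 show about sales by region?"],                 # hypothetical
    "answer":   ["Figure 2 shows Kansai leading H1 sales, followed by Kanto."],     # hypothetical
    "contexts": [["Figure 2 summary: bar chart of H1 sales by region ..."]],        # hypothetical
})

scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy])
print(scores)
```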
Also on Kaggle: kaggle.com/code/arunraghavs/otsukafinal — runnable end-to-end with a hosted GPU.
Educational / research use. Built as part of the Otsuka Corporation summer internship programme.




