# Semi-Structured RAG

Many documents contain a mixture of content types, including text and tables.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

   - Text splitting may break up tables, corrupting the data in retrieval
   - Embedding tables may pose challenges for semantic similarity search
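The first problem can be illustrated with a small, self-contained sketch. The table contents and the naive fixed-size splitter below are made up for illustration; they are not the splitter used later in this notebook.

```python
# A toy pipe-delimited table embedded in prose (made-up data for illustration only).
document = (
    "Model sizes are listed below.\n"
    "| Model | Params |\n"
    "| A     | 1B     |\n"
    "| B     | 7B     |\n"
    "The table above lists two models."
)


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every `chunk_size` characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def is_table_row(line: str) -> bool:
    return line.startswith("|") and line.endswith("|")


chunks = split_fixed(document, chunk_size=60)

# Table rows that still appear whole inside at least one chunk
original_rows = [line for line in document.split("\n") if is_table_row(line)]
intact = [row for row in original_rows if any(row in chunk for chunk in chunks)]
print(f"{len(intact)} of {len(original_rows)} table rows survived intact")
```

Because the cut points ignore row boundaries, at least one row is split across two chunks, and rows in the later chunk lose their header.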

This notebook shows how to perform RAG on documents with semi-structured data:

- We will use Unstructured to parse both text and tables from documents (PDFs).
- We will use the multi-vector retriever to store raw tables and text along with summaries better suited for retrieval.
- We will use LCEL (LangChain Expression Language) to implement the chains used.


## 1. Packages

In [1]:
# 1. Install required packages
! pip install langchain "unstructured[all-docs]" langchain_community langchain_openai chromadb pydantic lxml langchainhub -q

In [2]:
# Install Tesseract and Poppler-utils using apt
# !sudo apt install tesseract-ocr poppler-utils


In [3]:
!tesseract --version

tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8


In [4]:
# Poppler is a system package (installed with apt above), not a pip dependency
!pip install pytesseract -q
!pip install nltk -q

Defaulting to user installation because normal site-packages is not writeable


The PDF partitioning used by Unstructured will use:

   - tesseract for Optical Character Recognition (OCR)
   - poppler for PDF rendering and processing
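Before partitioning, it can help to confirm that both system tools are visible from the notebook's environment. The following is a minimal sketch, assuming the standard binary names: tesseract for Tesseract, and pdftoppm, one of the command-line tools shipped with poppler-utils. The helper name find_pdf_tools is hypothetical.

```python
import shutil


def find_pdf_tools() -> dict[str, bool]:
    """Report whether the OCR and PDF-rendering binaries are on PATH."""
    # `tesseract` is the Tesseract CLI; `pdftoppm` ships with poppler-utils.
    return {name: shutil.which(name) is not None for name in ("tesseract", "pdftoppm")}


status = find_pdf_tools()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either tool is reported as missing, installing it with the system package manager before running the parsing cell is likely to save a confusing error later.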



## 2. Import Libraries

In [5]:
import os
from typing import Any

import nltk
import pytesseract
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# nltk.download('punkt')




## 3. Define Path and Load PDF

We partition the PDF into tables and text, using the Gemini paper as the example: https://arxiv.org/abs/2312.11805

We use Unstructured's partition_pdf, which segments a PDF document by using a layout model.

This layout model makes it possible to extract elements, such as tables, from PDFs.

We can also use Unstructured chunking, which:

- Tries to identify document sections (e.g., Introduction)
- Then builds text blocks that maintain sections while also honoring user-defined chunk sizes
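The chunking idea can be sketched in plain Python. This is a simplified illustration, not Unstructured's actual algorithm: the Block type, its "title"/"text" labels, and the chunk_by_title function are hypothetical names, and the sketch does not split a single block that is itself larger than the cap.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "title" or "text" (hypothetical labels for this sketch)
    text: str


def chunk_by_title(blocks: list[Block], max_characters: int) -> list[str]:
    """Start a new chunk at every title, and when the size cap would be exceeded."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        starts_section = block.kind == "title"
        too_long = len(current) + len(block.text) + 1 > max_characters
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{block.text}" if current else block.text
    if current:
        chunks.append(current)
    return chunks


blocks = [
    Block("title", "Introduction"),
    Block("text", "Short opening paragraph."),
    Block("title", "Model Architecture"),
    Block("text", "A" * 50),
    Block("text", "B" * 50),
]
chunks = chunk_by_title(blocks, max_characters=80)
for chunk in chunks:
    print(repr(chunk[:30]), len(chunk))
```

Here the second section is split in two because adding the last block would push the chunk past the 80-character cap, while the short first section stays together with its title.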


In [7]:
import os
from unstructured.partition.pdf import partition_pdf

path = "min_gemini.pdf"  # actual path of PDF

# Check if the file exists
if not os.path.isfile(path):
    raise FileNotFoundError(f"The file {path} does not exist.")

# Try to load and parse the PDF
try:
    # Get elements from PDF
    raw_pdf_elements = partition_pdf(
        filename=path,
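        # Images are not needed here, so no image output directory is set
        extract_images_in_pdf=False,
        # Use the layout model to recover table structure
        infer_table_structure=True,
        # Aggregate text under the section titles found in the document
        chunking_strategy="by_title",
        # Chunking parameters: hard cap of 4000 characters per chunk,
        # try to start a new chunk after 3800 characters,
        # and try to merge sections smaller than 2000 characters
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,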
    )
    print("PDF parsed successfully!")
except FileNotFoundError as e:
    print(f"File not found: {e}")
    raise
except PermissionError as e:
    print(f"Permission denied: {e}")
    raise
except Exception as e:
    print(f"An error occurred while parsing the PDF: {e}")
    print("Please check if the PDF file is corrupted or if it's password-protected.")
    raise  # re-raise so later cells do not run without raw_pdf_elements

PDF parsed successfully!


## 4. Examine Extracted Elements



We can examine the elements extracted by partition_pdf.

CompositeElement objects are aggregated chunks of text; tables are returned as separate Table elements.
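Beyond counting element types, it can be useful to look at how a table was captured. The sketch below assumes that, when table structure inference is enabled, Unstructured stores an HTML rendering of each table under metadata.text_as_html; the helper name table_html is hypothetical, and the stand-in objects exist only so the example runs without a parsed PDF.

```python
from types import SimpleNamespace
from typing import Optional


def table_html(element) -> Optional[str]:
    """Return the HTML rendering stored on an element's metadata, if any."""
    metadata = getattr(element, "metadata", None)
    return getattr(metadata, "text_as_html", None)


# Stand-in objects (not real Unstructured elements) so the sketch runs offline
fake_table = SimpleNamespace(
    metadata=SimpleNamespace(text_as_html="<table><tr><td>1</td></tr></table>")
)
fake_text = SimpleNamespace(metadata=SimpleNamespace())

print(table_html(fake_table))
print(table_html(fake_text))
```

Applied to the parsed elements, the same helper could be called on each Table element to check whether rows and columns were recovered sensibly.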


In [8]:
from collections import Counter

# Count how many elements of each type were extracted
category_counts = dict(Counter(str(type(element)) for element in raw_pdf_elements))
print(category_counts)


{"<class 'unstructured.documents.elements.CompositeElement'>": 12, "<class 'unstructured.documents.elements.Table'>": 4}


In [9]:
class Element(BaseModel):
    type: str
    text: Any

categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))


4
12


In [10]:
import os

os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"  # placeholder: replace with your own key

In [11]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

## 5. Summarize Tables and Text

We produce summaries of the tables and, optionally, the text, for use with the *multi-vector-retriever*.

Along with each summary, we also store the raw table or text element.

The summaries are used to improve the quality of retrieval, as explained in the multi-vector retriever docs.

The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer.

### Summaries

In [12]:
from langchain import hub
obj = hub.pull("rlm/multi-vector-retriever-summarization")

In [13]:
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [14]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [15]:
# Apply to texts
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})

## 6. Add to Vector Store

Use the multi-vector retriever with summaries:

- InMemoryStore stores the raw text and tables
- vectorstore stores the embedded summaries
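The two-store idea can be sketched without any external services. In this toy version, word overlap stands in for embedding similarity and a plain dict stands in for the document store; the data, the overlap function, and the retrieve helper are all hypothetical and only illustrate the lookup pattern.

```python
import uuid


def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a crude stand-in for embedding similarity."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Raw parents (what the LLM should see) and summaries (what gets searched)
raw_docs = ["| Model | Params |\n| A | 1B |", "Long prose about training infrastructure"]
summaries = ["table of model parameter counts", "text describing training infrastructure"]

doc_ids = [str(uuid.uuid4()) for _ in raw_docs]
docstore = dict(zip(doc_ids, raw_docs))        # id -> raw content
summary_index = list(zip(doc_ids, summaries))  # (id, summary) pairs to search


def retrieve(query: str) -> str:
    """Search over summaries, then return the raw parent with the matching id."""
    best_id, _ = max(summary_index, key=lambda pair: overlap(query, pair[1]))
    return docstore[best_id]


result = retrieve("parameter counts of each model")
print(result)
```

The query matches the table's summary, but what comes back is the raw table itself, which is the content handed to the LLM.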


In [16]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts
if texts and text_summaries:
    doc_ids = [str(uuid.uuid4()) for _ in texts]
    summary_texts = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(text_summaries)
    ]
    retriever.vectorstore.add_documents(summary_texts)
    retriever.docstore.mset(list(zip(doc_ids, texts)))
else:
    print("No texts or text summaries to add.")

# Add tables
if tables and table_summaries:
    table_ids = [str(uuid.uuid4()) for _ in tables]
    summary_tables = [
        Document(page_content=s, metadata={id_key: table_ids[i]})
        for i, s in enumerate(table_summaries)
    ]
    retriever.vectorstore.add_documents(summary_tables)
    retriever.docstore.mset(list(zip(table_ids, tables)))
else:
    print("No tables or table summaries to add.")



In [17]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
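In the chain above, whatever the retriever returns is inserted into the prompt as-is. A small formatting step can make the context easier for the model to read. The helper below is a hypothetical sketch, assuming the retrieved items are plain strings (as stored in this notebook's docstore) or objects with a page_content attribute.

```python
def format_context(docs) -> str:
    """Join retrieved items into one string, separated by blank lines."""
    parts = []
    for doc in docs:
        # The docstore in this notebook holds plain strings; Document-like
        # objects with a `page_content` attribute are handled as a fallback.
        parts.append(doc if isinstance(doc, str) else getattr(doc, "page_content", str(doc)))
    return "\n\n".join(parts)


retrieved = ["First chunk of text.", "| a | b |\n| 1 | 2 |"]
context = format_context(retrieved)
print(context)
```

Since LCEL coerces plain functions into runnables, such a helper could be composed into the chain as retriever | format_context in place of the bare retriever.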

In [18]:
chain.invoke("Give an overview of the Gemini 1.0 model family")

'The Gemini 1.0 model family consists of three sizes: Ultra, Pro, and Nano. \n\n- Ultra: This model is the most capable in the family, delivering state-of-the-art performance across a wide range of highly complex tasks, including reasoning and multimodal tasks. It is efficiently serveable at scale on TPU accelerators due to the Gemini architecture.\n- Pro: A performance-optimized model that delivers significant performance across a wide range of tasks in terms of cost and latency. It exhibits strong reasoning performance and broad multimodal capabilities.\n- Nano: The most efficient model designed to run on-device, with two versions - Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters, targeting low and high memory devices respectively. It is trained by distilling from larger Gemini models, 4-bit quantized for deployment, and provides best-in-class performance.'

In [19]:
chain.invoke("What is the performance of Gemini Ultra on the MMMU benchmark per discipline as per Table 8?")

'The performance of Gemini Ultra on the MMLU benchmark is an accuracy of 90.04%.'

In [20]:
chain.invoke("What are the results of Automatic speech recognition tasks on YouTube")

'The provided context does not contain any information about the results of Automatic speech recognition tasks on YouTube.'
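When an answer looks off or reports missing context, as in the last two queries, one way to investigate is to inspect what the retriever actually returned for the question. The sketch below assumes the retriever exposes an invoke method, as LangChain retrievers do; preview_retrieved and FakeRetriever are hypothetical names, and the fake retriever exists only so the example runs offline.

```python
def preview_retrieved(retriever, query: str, width: int = 80) -> list[str]:
    """Return the first `width` characters of each item retrieved for `query`."""
    results = retriever.invoke(query)
    return [str(getattr(item, "page_content", item))[:width] for item in results]


class FakeRetriever:
    """Hypothetical stand-in exposing the same `invoke` method."""

    def invoke(self, query: str):
        return ["first retrieved chunk with some longer text", "second retrieved chunk"]


previews = preview_retrieved(FakeRetriever(), "example question", width=20)
for preview in previews:
    print(preview)
```

Passing the notebook's retriever instead of the stand-in would show whether the relevant table or passage was retrieved at all, which helps separate retrieval problems from generation problems.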