### Semi-structured RAG

Many documents contain a mixture of content types, including text and tables.

Semi-structured data can be challenging for conventional RAG for at least two reasons:

- Text splitting may break up tables, corrupting the data at retrieval time.
- Embedding tables may pose challenges for semantic similarity search.
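
The first failure mode is easy to reproduce with a fixed-size character splitter. The sketch below is a toy illustration only, not the splitter used later in this notebook; the `split_fixed` helper and the sample table are made up for the example.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical document: one sentence followed by a small table.
document = (
    "Results are summarized below.\n"
    "| Model | Score |\n"
    "| A     | 10    |\n"
    "| B     | 20    |\n"
)

chunks = split_fixed(document, chunk_size=50)

# A table row is intact only if it appears whole inside a single chunk.
table_rows = [line for line in document.splitlines() if line.startswith("|")]
broken_rows = [
    row for row in table_rows if not any(row in chunk for chunk in chunks)
]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
print("Rows split across chunks:", broken_rows)
```

With this chunk size the row for model `A` is cut in two, so neither chunk contains the complete row.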

We will use `Unstructured` to parse both text and tables from documents (PDFs).

We will use the `multi-vector retriever` to store the raw tables and text together with summaries that are better suited for retrieval.

We will use `LCEL` to implement the chains used.
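
The idea behind composing steps with the `|` operator can be sketched in plain Python. This is not LangChain's implementation: `Step` is a hypothetical class written for this illustration, and `fake_model` is a placeholder for a real LLM call.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


fill_prompt = Step(lambda element: f"Summarize: {element}")
fake_model = Step(lambda prompt: prompt.upper())  # placeholder for an LLM call
parse_output = Step(lambda text: text.strip())

# Steps are chained left to right, mirroring prompt | model | parser.
toy_chain = fill_prompt | fake_model | parse_output
print(toy_chain.invoke("a small table"))
```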

In [None]:
!pip3 install langchain "unstructured[pdf]" pydantic lxml langchainhub langchain-openai chromadb

In [None]:
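# tesseract and poppler are system binaries rather than Python packages;
# on macOS they can be installed with Homebrew
!brew install tesseract poppler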

In [None]:
!pip3 install --upgrade pdfminer.six

In [1]:
import os

# Add Homebrew's bin directory to PATH so Python can find poppler utilities
os.environ["PATH"] = "/opt/homebrew/bin:" + os.environ.get("PATH", "")


In [2]:
path = '/Users/I572648/Library/CloudStorage/OneDrive-SAPSE/Desktop/Git/GenAI/3-rag/3-semi_structured-rag/data/llava.pdf'

In [5]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

# Get elements
raw_pdf_elements = partition_pdf(
    filename=path,
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Attempt to start a new chunk after 3800 chars
    # Attempt to keep chunks > 2000 chars
    max_characters=4000,  # Maximum number of characters per chunk (hard limit)
    new_after_n_chars=3800,  # Start a new chunk after this many characters (soft limit)
    combine_text_under_n_chars=2000,  # Combine text blocks under this number of characters with the previous text block
    image_output_dir_path='/Users/I572648/Library/CloudStorage/OneDrive-SAPSE/Desktop/Git/GenAI/3-rag/3-semi_structured-rag/data',
)

In [7]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element types
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 27}

In [8]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

0
27


### Multi Vector Retriever

In [9]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [1]:
from langchain import hub
multi_vector_prompt1 = hub.pull("rlm/multi-vector-retriever-summarization")
multi_vector_prompt1



PromptTemplate(input_variables=['element'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'multi-vector-retriever-summarization', 'lc_hub_commit_hash': 'd822e5e6d60be1e8e19b1e849a99ab65d384972cfd2414e4281b386e287b122d'}, template='You are an assistant tasked with summarizing tables and text. \\ \nGive a concise summary of the table or text. Table or text chunk: {element}')

In [23]:
# Plain line break instead of a backslash followed by a space, which is an invalid escape sequence
prompt_text = """You are an assistant tasked with summarizing tables and text.
Give a concise summary of the table or text. Table or text chunk: {element}"""
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo", max_retries=3, request_timeout=20)
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [19]:
# Apply to tables
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

In [24]:
# Apply to texts
texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 2})

### Add to vectorstore

Use Multi Vector Retriever with summaries:

- `InMemoryStore` stores the raw text and tables.
- The vectorstore stores the embedded summaries.
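
The split between the two stores can be sketched without any LangChain dependency. The following is a toy model of the idea only: `ToyMultiVectorRetriever`, its word-overlap scoring, and the sample data are assumptions made for illustration, and a real setup would rank summaries by embedding similarity rather than word overlap.

```python
import uuid


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


class ToyMultiVectorRetriever:
    """Search over summaries, but return the raw parent content."""

    def __init__(self) -> None:
        self.summary_index: list[tuple[str, str]] = []  # (doc_id, summary)
        self.docstore: dict[str, str] = {}  # doc_id -> raw content

    def add(self, raw: str, summary: str) -> str:
        # The shared doc_id links a searchable summary to its raw parent.
        doc_id = str(uuid.uuid4())
        self.summary_index.append((doc_id, summary))
        self.docstore[doc_id] = raw
        return doc_id

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_tokens = tokenize(query)
        # Rank summaries by how many words they share with the query.
        ranked = sorted(
            self.summary_index,
            key=lambda item: len(query_tokens & tokenize(item[1])),
            reverse=True,
        )
        return [self.docstore[doc_id] for doc_id, _ in ranked[:k]]


retriever_sketch = ToyMultiVectorRetriever()
retriever_sketch.add(
    raw="| Training data | All |\n| Full data | 85.1 |",
    summary="Table of ablation results for different training data",
)
retriever_sketch.add(
    raw="LLaVA connects a vision encoder to a language model.",
    summary="Overview of the model architecture",
)

result = retriever_sketch.retrieve("ablation with different training data")
print(result[0])
```

The query is matched against the summary, while the raw table is what is handed on as context.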

In [None]:
!pip3 install langchain_community

In [32]:
# Check if we have any data
print(f"Number of texts: {len(texts)}")
print(f"Number of text_summaries: {len(text_summaries)}")
print(f"Number of tables: {len(tables)}")
print(f"Number of table_summaries: {len(table_summaries)}")

if len(texts) == 0 and len(tables) == 0:
    print("\nWARNING: No texts or tables were extracted from the PDF!")
    print("This could mean:")
    print("1. The PDF is empty or has no extractable content")
    print("2. The extraction parameters need adjustment")
    print("3. The model download is still in progress")


Number of texts: 27
Number of text_summaries: 27
Number of tables: 0
Number of table_summaries: 0


In [33]:
import uuid

from langchain_classic.retrievers.multi_vector import MultiVectorRetriever
from langchain_classic.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Add texts (only if we have any)
if len(texts) > 0 and len(text_summaries) > 0:
    doc_ids = [str(uuid.uuid4()) for _ in texts]
    summary_texts = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(text_summaries)
    ]
    retriever.vectorstore.add_documents(summary_texts)
    retriever.docstore.mset(list(zip(doc_ids, texts)))
    print(f"Added {len(texts)} text documents")
else:
    print("No texts to add")

# Add tables (only if we have any)
if len(tables) > 0 and len(table_summaries) > 0:
    table_ids = [str(uuid.uuid4()) for _ in tables]
    summary_tables = [
        Document(page_content=s, metadata={id_key: table_ids[i]})
        for i, s in enumerate(table_summaries)
    ]
    retriever.vectorstore.add_documents(summary_tables)
    retriever.docstore.mset(list(zip(table_ids, tables)))
    print(f"Added {len(tables)} table documents")
else:
    print("No tables to add")

Added 27 text documents
No tables to add


In [34]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-4")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)


In [35]:
chain.invoke("Get Ablation on LLaVA-Bench (COCO) with different training data? ")

'The ablation on LLaVA-Bench (COCO) with different training data is as follows:\n\n- Full data: 83.1 for Conversation, 75.3 for Detail description, 96.5 for Complex reasoning, and 85.1 for All.\n- Detail + Complex: 81.5 for Conversation (-1.6 compared to Full data), 73.3 for Detail description (-2.0), 90.8 for Complex reasoning (-5.7), and 81.9 for All (-3.2).\n- Conv + 5% Detail + 10% Complex: 81.0 for Conversation (-2.1), 68.4 for Detail description (-7.1), 91.5 for Complex reasoning (-5.0), and 80.5 for All (-4.4).\n- Conversation: 76.5 for Conversation (-6.6), 59.8 for Detail description (-16.2), 84.9 for Complex reasoning (-12.4), and 73.8 for All (-11.3).\n- No Instruction Tuning: 22.0 for Conversation (-61.1), 24.0 for Detail description (-51.3), 18.5 for Complex reasoning (-78.0), and 21.5 for All (-63.6).'
