In [1]:
! pip install langchain==0.1.6 unstructured[all-docs]==0.12.0 pydantic lxml langchainhub


In [2]:
!pip install --upgrade --quiet  fastembed

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unstructured-inference 0.7.21 requires onnxruntime<1.16, but you have onnxruntime 1.19.2 which is incompatible.

In [3]:
!apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [4]:
!apt install -y tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [5]:
! pip install pytesseract



In [6]:
!pip install nltk



In [7]:
!pip install chromadb



## Data Loading

### Partition PDF tables and text

We partition `General Information.pdf` from the nutrition dataset.

We use the Unstructured [`partition_pdf`](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf) function, which segments a PDF document using a layout model.

This layout model makes it possible to extract elements such as tables from PDFs.

We can also use `Unstructured` chunking, which:

* Tries to identify document sections (e.g., Introduction)
* Then, builds text blocks that maintain sections while also honoring user-defined chunk sizes
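
To make the interaction between sections and chunk sizes concrete, here is a minimal pure-Python sketch of title-based chunking. It is a simplified illustration and not Unstructured's actual algorithm: the function `chunk_by_title` and the `(kind, text)` input format are hypothetical, and only the three size parameters borrow their names from the `partition_pdf` call used in this notebook.

```python
from typing import List, Tuple


def chunk_by_title(
    elements: List[Tuple[str, str]],
    max_characters: int = 4000,
    new_after_n_chars: int = 3800,
    combine_text_under_n_chars: int = 2000,
) -> List[str]:
    """Toy chunker. `elements` holds (kind, text) pairs, kind is "Title" or "Text"."""
    # 1. A new section starts at every title.
    sections: List[List[str]] = []
    for kind, text in elements:
        if kind == "Title" or not sections:
            sections.append([])
        sections[-1].append(text)

    # 2. Inside a section, start a new chunk once the soft limit would be exceeded.
    chunks: List[str] = []
    for section in sections:
        current = ""
        for text in section:
            candidate = f"{current}\n{text}" if current else text
            if current and len(candidate) > new_after_n_chars:
                chunks.append(current)
                current = text
            else:
                current = candidate
        if current:
            chunks.append(current)

    # 3. Merge a small chunk with the next one if the result respects the hard limit.
    #    (A single element longer than max_characters is not split in this sketch.)
    merged: List[str] = []
    for chunk in chunks:
        if (
            merged
            and len(merged[-1]) < combine_text_under_n_chars
            and len(merged[-1]) + 1 + len(chunk) <= max_characters
        ):
            merged[-1] = f"{merged[-1]}\n{chunk}"
        else:
            merged.append(chunk)
    return merged


elements = [
    ("Title", "Intro"),
    ("Text", "a" * 30),
    ("Title", "Methods"),
    ("Text", "b" * 30),
    ("Text", "c" * 30),
]
small = chunk_by_title(elements, max_characters=100, new_after_n_chars=50, combine_text_under_n_chars=10)
combined = chunk_by_title(elements, max_characters=100, new_after_n_chars=50, combine_text_under_n_chars=40)
print(len(small), len(combined))
```

With a low `combine_text_under_n_chars`, sections stay separate; raising it lets short sections merge with the chunk that follows, as long as the result stays under `max_characters`.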

In [8]:
path = "/kaggle/input/nutrition/General Information.pdf"

In [9]:
import pytesseract
import nltk
import nltk.internals
nltk.download('punkt')
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf
pytesseract.pytesseract.tesseract_cmd = ( r'/usr/bin/tesseract' )
# Get elements
raw_pdf_elements = partition_pdf(
    filename=path,
    # Do not extract embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard limit of 4000 chars per chunk
    # Attempt to start a new chunk after 3800 chars
    # Attempt to keep chunks > 2000 chars
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We can examine the elements extracted by `partition_pdf`.

`CompositeElement` objects are aggregated chunks.

In [10]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 8,
 "<class 'unstructured.documents.elements.Table'>": 4}
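
The same tally can be written more compactly with `collections.Counter`, and elements can be separated with `isinstance` rather than by matching on the class name string. The sketch below uses two hypothetical stand-in classes so that it runs without Unstructured installed; in the notebook you would import the real `CompositeElement` and `Table` from `unstructured.documents.elements`.

```python
from collections import Counter


# Hypothetical stand-ins for the Unstructured element classes.
class CompositeElement:
    pass


class Table:
    pass


elements = [CompositeElement(), Table(), CompositeElement()]

# Count elements by class name in one pass.
category_counts = Counter(type(el).__name__ for el in elements)
print(category_counts)

# Split by type with isinstance instead of string matching.
table_like = [el for el in elements if isinstance(el, Table)]
text_like = [el for el in elements if isinstance(el, CompositeElement)]
```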

In [11]:
class Element(BaseModel):
    type: str
    text: Any


# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))

# Tables
table_elements = [e for e in categorized_elements if e.type == "table"]
print(len(table_elements))

# Text
text_elements = [e for e in categorized_elements if e.type == "text"]
print(len(text_elements))

4
8


## Multi-vector retriever

Use [multi-vector-retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text.

With the summary, we will also store the raw table elements.

The summaries are used to improve the quality of retrieval, [as explained in the multi vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).

The raw tables are passed to the LLM, providing the full table context for the LLM to generate the answer.  
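
The idea can be sketched without any LangChain dependency: summaries are indexed for search, and a shared ID links each summary back to its raw content. Everything below is a toy assumption made for illustration, including the bag-of-words `embed` function, the `retrieve` helper, and the placeholder strings; the real retriever uses a vector store and a learned embedding model.

```python
import uuid
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Placeholder content (made up for this sketch).
raw_docs = [
    "RAW TABLE: factor values for cereals ...",
    "RAW TEXT: notes on sampling ...",
]
summaries = [
    "table of protein conversion factors for cereals",
    "text describing how food samples were collected",
]

docstore = {}  # doc_id -> raw content (what the LLM will see)
index = []     # (summary embedding, doc_id) pairs (what is searched)
for raw, summary in zip(raw_docs, summaries):
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = raw
    index.append((embed(summary), doc_id))


def retrieve(query: str, k: int = 1) -> list:
    """Search over summaries, but return the linked raw documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [docstore[doc_id] for _, doc_id in ranked[:k]]


print(retrieve("protein conversion factors"))
```

The query matches the summary, but the caller receives the raw table, which is the property the multi-vector retriever relies on.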

### Summaries

In [12]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
#from langchain_openai import ChatOpenAI

We create a simple summarize chain for each element.

You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).

```
from langchain import hub
obj = hub.pull("rlm/multi-vector-retriever-summarization")
```

In [13]:
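import os

from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook
# (assumes the token is stored in the HF_TOKEN environment variable)
login(token=os.environ["HF_TOKEN"])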

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [14]:
# Import necessary libraries
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser
import gc

# Define a custom StrOutputParser
class StrOutputParser(BaseOutputParser):
    """
    A simple output parser that returns the generated text as-is after stripping whitespace.
    """
    def parse(self, text: str) -> str:
        return text.strip()

# 1. Initialize the tokenizer and model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Check for available device (use CPU as fallback)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load the model in half precision (float16) to reduce memory use
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # Automatically place model layers across GPU(s)
    torch_dtype=torch.float16,  # Use half precision to save memory
    low_cpu_mem_usage=True
)

# Inference only: put the model in evaluation mode (gradient checkpointing is a training-time technique)
model.eval()

# 2. Set up the HuggingFace pipeline with reduced batch size and token limit
hf_pipeline = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=False,
    max_new_tokens=100,  # Reduce max token output to save memory
    batch_size=1  # Reduce batch size to avoid memory issues
)

# 3. Create the HuggingFacePipeline for LangChain
hf_model = HuggingFacePipeline(pipeline=hf_pipeline)

# 4. Define the prompt template
prompt_text = """You are an assistant tasked with summarizing tables and text. \
Give a concise summary of the table or text. Table or text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# 5. Set up the summarization chain with the custom parser
summarize_chain = LLMChain(
    llm=hf_model,
    prompt=prompt,
    output_parser=StrOutputParser(),  # Use the custom parser
    verbose=True  # Set to False to reduce verbosity
)

# Define your table and text elements
tables = [element.text for element in table_elements]
texts = [element.text for element in text_elements]

try:
    # 6. Generate summaries for tables with memory management
    print("Generating summaries for tables...")
    table_summaries = summarize_chain.batch(
        inputs=tables,
        max_concurrency=5  # Adjust based on system capabilities
    )
    print("\nTable Summaries:")
    for i, summary in enumerate(table_summaries, 1):
        print(f"{i}. {summary}")
        # Clear memory after processing a batch
        torch.cuda.empty_cache()
        gc.collect()

    # 7. Generate summaries for texts with memory management
    print("\nGenerating summaries for texts...")
    text_summaries = summarize_chain.batch(
        inputs=texts,
        max_concurrency=5  # Adjust based on system capabilities
    )
    print("\nText Summaries:")
    for i, summary in enumerate(text_summaries, 1):
        print(f"{i}. {summary}")
        # Clear memory after processing a batch
        torch.cuda.empty_cache()
        gc.collect()

except Exception as e:
    print(f"An error occurred during summarization: {e}")

# Final memory cleanup
torch.cuda.empty_cache()
gc.collect()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generating summaries for tables...


> Entering new LLMChain chain...
Prompt after formatting:
Human: You are an assistant tasked with summarizing tables and text. Give a concise summary of the table or text. Table or text chunk: A B C D E F G H I J K L M N O P Q R S T Cereals and Millets Grain Legumes Green Leafy Vegetables Other Vegetables Fruits Roots and Tubers Condiments and Spices Nuts and Oil Seeds Sugars Mushrooms Miscellaneous Foods Milk and Milk Products Egg and Egg Products Poultry Animal Meat Marine Fish Marine Shellfish Marine Mollusks Fresh Water Fish and Shellfish Edible Oils and Fats 24 25 34 78 68 19 33 21 2 4 2 4 15 19 63 92 8 7 10 9


> Entering new LLMChain chain...
Prompt after formatting:
Human: You are an assistant tasked with summarizing tables and text. Give a concise summary of the table or text. Table or text chunk: Carbohydrate Equivalent after Hydrolysis (g/100g) Conversion to monosaccharide equivalent 1 2 Mono

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


2. {'element': 'Carbohydrate Equivalent after Hydrolysis (g/100g) Conversion to monosaccharide equivalent 1 2 Monosaccharides e.g. glucose Disaccharides e.g. sucrose, lactose, maltose 100 105 No conversion necessary x 1.05 or ÷ 0.95 3 Oligosaccharides a. Raffinose (trisaccharide) 107 x 1.07 or ÷ 0.93 4 b. Stachyose (tetrasaccharide) c. Verbascose (pentasaccharide) Polysaccharides e.g. starch 108 109 110 x 1.08 or ÷ 0.93 x 1.09 or ÷ 0.92 x 1.10 or ÷ 0.90', 'text': 'This table illustrates the conversion factors for different types of carbohydrates from their original form to their monosaccharide equivalent. Monosaccharides, like glucose, require no conversion. Disaccharides, such as sucrose, lactose, and maltose, need a conversion factor of 1.05 or 0.95. Oligosaccharides, including raffinose, stachyose,'}
3. {'element': 'Table 3. Jones factors for conversion of nitrogen to protein. Food Barley and its Flour; Rye and its flour; Oats Rice and its flour Wheat whole Wheat bran Refined wheat 

0

In [15]:
print(table_summaries)

[{'element': 'A B C D E F G H I J K L M N O P Q R S T Cereals and Millets Grain Legumes Green Leafy Vegetables Other Vegetables Fruits Roots and Tubers Condiments and Spices Nuts and Oil Seeds Sugars Mushrooms Miscellaneous Foods Milk and Milk Products Egg and Egg Products Poultry Animal Meat Marine Fish Marine Shellfish Marine Mollusks Fresh Water Fish and Shellfish Edible Oils and Fats 24 25 34 78 68 19 33 21 2 4 2 4 15 19 63 92 8 7 10 9', 'text': '11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 4'}, {'element': 'Carbohydrate Equivalent after Hydrolysis (g/100g) Conversion to monosaccharide equivalent 1 2 Monosaccharides e.g. glucose Disaccharides e.g. sucrose, lactose, maltose 100 105 No conversion necessary x 1.05 or ÷ 0.95 3 Oligosaccharides a. Raffinose (trisaccharide) 107 x 1.07 or ÷ 0.93 4 b. Stachyose (tetrasaccharide) c. Verbascose (pentasaccharide) Polysaccharides e.g. starch 108 109 110 x 1.08 or ÷ 0.93 x 1.09 or ÷ 0.92 x 1
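
The printed output shows that each batch result is a dict with an `element` key (the input) and a `text` key (the generated summary), not a bare string. If only the summary should be embedded later, it can be pulled out first. The helper below is a hypothetical sketch with made-up sample data; `summary_text` is not part of LangChain.

```python
from typing import Any


def summary_text(result: Any) -> str:
    """Return the summary string from a chain result (dict or plain string)."""
    if isinstance(result, dict):
        return str(result.get("text", "")).strip()
    return str(result).strip()


# Made-up results mimicking the two shapes a chain might return.
results = [
    {"element": "raw table text", "text": "  A short summary. "},
    "Already a string",
]
summaries = [summary_text(r) for r in results]
print(summaries)
```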

### Add to vectorstore

Use [Multi Vector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries:

* `InMemoryStore` stores the raw text, tables
* `vectorstore` stores the embedded summaries

In [31]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
#from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=FastEmbedEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
    
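# Add text summaries to the vectorstore and the raw text chunks to the docstore
# (each summary may be a dict with a "text" key, as returned by LLMChain, or a plain string)
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(
        page_content=s["text"] if isinstance(s, dict) else str(s),
        metadata={id_key: doc_ids[i]},
    )
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
raw_text_docs = [Document(page_content=t, metadata={id_key: doc_ids[i]}) for i, t in enumerate(texts)]
retriever.docstore.mset(list(zip(doc_ids, raw_text_docs)))

# Add table summaries to the vectorstore and the raw tables to the docstore
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(
        page_content=s["text"] if isinstance(s, dict) else str(s),
        metadata={id_key: table_ids[i]},
    )
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
raw_table_docs = [Document(page_content=t, metadata={id_key: table_ids[i]}) for i, t in enumerate(tables)]
retriever.docstore.mset(list(zip(table_ids, raw_table_docs)))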


Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

## RAG

Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval).

In [32]:
from langchain_core.runnables import RunnablePassthrough

# Prompt template
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# RAG pipeline (uses the LangChain-wrapped pipeline `hf_model`, not the raw transformers `model`)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | hf_model
    | StrOutputParser()
)

In [21]:
!pip install --upgrade langchain langchain-community transformers torch



In [33]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser

# Define the prompt template for Q&A
qa_prompt = PromptTemplate(
    template="""Answer the question based only on the following context, which can include text and tables:
{context}

Question: {question}
""",
    input_variables=["context", "question"],
)

# Initialize the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=hf_model,
    chain_type="stuff",  # 'stuff' is a simple chain type; other options include 'map_reduce', 'refine', etc.
    retriever=retriever,
    chain_type_kwargs={"prompt": qa_prompt},  # The custom prompt is passed via 'chain_type_kwargs'
    verbose=True
)
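
The `stuff` chain type places all retrieved documents into a single prompt. The pure-Python sketch below shows roughly what that amounts to for the template above; `stuff_prompt` and the sample chunks are hypothetical, and the blank-line separator is an assumption made for illustration.

```python
from typing import List

TEMPLATE = """Answer the question based only on the following context, which can include text and tables:
{context}

Question: {question}
"""


def stuff_prompt(docs: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(docs)
    return TEMPLATE.format(context=context, question=question)


filled = stuff_prompt(["chunk one", "chunk two"], "What is in the chunks?")
print(filled)
```

Because every retrieved chunk is concatenated, the prompt length grows with the number and size of retrieved documents, which matters when the model has a limited context window.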



In [34]:
# Define your question
question = "What are the different food groups in IFCT?"

# Invoke the chain using 'invoke' instead of 'run'
try:
    response = qa_chain.invoke(question)
    print("\nResponse:", response)
except Exception as e:
    print(f"An error occurred during QA: {e}")



> Entering new RetrievalQA chain...

> Finished chain.

Response: {'query': 'What are the different food groups in IFCT?', 'result': '\nAnswer: The Indian Food Composition Tables (IFCT) 2017 have categorized foods into 20 different food groups: Cereals and Millets, Grain Legumes, Green Leafy Vegetables, Other Vegetables, Fruits, Roots and Tubers, Condiments and Spices, Nuts and Oil Seeds, Sugars, Mushrooms, Miscellaneous Foods, Milk and Milk'}


In [35]:
# Define your question
question = "What is the metabolic energy conversion factor for protein in kcal/g?"

# Invoke the chain using 'invoke' instead of 'run'
try:
    response = qa_chain.invoke(question)
    print("\nResponse:", response)
except Exception as e:
    print(f"An error occurred during QA: {e}")



> Entering new RetrievalQA chain...

> Finished chain.

Response: {'query': 'What is the metabolic energy conversion factor for protein in kcal/g?', 'result': '\nAnswer: The metabolic energy conversion factor for protein in kcal/g is 4 kcal/g. This value can be calculated by dividing the value of the Jones factor for protein (0.20) by the protein content in the table (15 g/100 g). Therefore, 0.20 ÷ 15 = 0.0133, and 0.0133 × 100 ×'}
