# 🎯 RAG Pipeline from PDFs with Images & Tables using LangChain, Unstructured & OpenAI

## 📘 Overview
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system that can parse PDFs (including images and tables), embed the content, and answer natural language queries using LangChain and OpenAI.

## 🔍 What this notebook does:
- Extracts structured and unstructured content from PDFs using `unstructured`.
- Parses tables, images, and text with the `hi_res` strategy.
- Embeds the parsed chunks using `OpenAIEmbeddings`.
- Stores and retrieves documents using a `FAISS` vector store.
- Answers questions in context using `ChatOpenAI` and a LangChain RAG chain.

## 💡 Example Query
"Compare Artificial Intelligence and Machine Learning from the document."


### ⚙️ Setup
🔧 Install dependencies

In [1]:
!pip install --quiet faiss-cpu pytesseract unstructured-client "unstructured[all-docs]"
!pip install langchain_openai langchain-community
!apt-get install -y poppler-utils tesseract-ocr libtesseract-dev



### 🔑 Set up API key

In [2]:
import os
from google.colab import userdata
# Option 1: paste the key directly (avoid this in notebooks you share)
# os.environ["OPENAI_API_KEY"] = "Enter Your OPENAI_API_KEY"
# Option 2: read the key from Colab's Secrets panel
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
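
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the key may already be set in the environment and otherwise prompting for it. `load_api_key` is a hypothetical helper name, not part of any library.

```python
import os
from getpass import getpass


def load_api_key(name="OPENAI_API_KEY", env=None, prompt_fn=getpass):
    """Return the key from `env` if present, otherwise prompt for it once."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # Prompt without echoing the key into the notebook output.
        key = prompt_fn(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} is required")
        # Store it so later cells (and libraries) can read it from the environment.
        env[name] = key
    return key
```

Calling `load_api_key()` with no arguments reads from and writes to `os.environ`, which is where `OpenAIEmbeddings` and `ChatOpenAI` look for the key by default.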



### 📄 Step 1: Parse PDF with Unstructured

In [3]:
from unstructured.partition.pdf import partition_pdf

filename = "/content/Ml_sample.pdf" # Path to your PDF file

# Extract elements including images, tables, and structured text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,        # Enable image extraction
    strategy="hi_res",                 # Layout-aware parsing; slower than the default
    hi_res_model_name="yolox",         # YOLOX model for layout detection
    infer_table_structure=True,        # Try to recover table structure
    chunking_strategy="by_title",      # Start a new chunk at each section heading
    max_characters=3000,               # Hard upper limit on chunk length
    combine_text_under_n_chars=200,    # Merge sections shorter than this with adjacent ones
)


🔍 Inspect parsed elements

In [4]:
# Analyze the types of elements extracted
from collections import Counter
category_counts = Counter(str(type(element)) for element in pdf_elements)
category_counts


Counter({"<class 'unstructured.documents.elements.CompositeElement'>": 47})
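
Because `chunking_strategy="by_title"` wraps everything into chunks, the count above shows only `CompositeElement`. To see what each chunk contains, one option is to look at the elements it was built from. The sketch below assumes each chunk exposes them as `metadata.orig_elements` and that every element has a `category` attribute, as recent `unstructured` releases do; stand-in objects are used so it runs without parsing a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_original_categories(chunks):
    """Count element categories inside each chunk, falling back to the chunk's own category."""
    counts = Counter()
    for chunk in chunks:
        # Assumption: chunked elements keep their source elements in metadata.orig_elements.
        originals = getattr(chunk.metadata, "orig_elements", None) or [chunk]
        counts.update(el.category for el in originals)
    return counts


# Stand-in objects so the sketch runs without parsing a PDF.
title = SimpleNamespace(category="Title", metadata=SimpleNamespace())
table = SimpleNamespace(category="Table", metadata=SimpleNamespace())
chunk = SimpleNamespace(
    category="CompositeElement",
    metadata=SimpleNamespace(orig_elements=[title, table]),
)
print(count_original_categories([chunk]))
```

In the notebook, `count_original_categories(pdf_elements)` would show whether any tables or images were actually detected in the PDF.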

### 📚 Step 2: Create LangChain Documents

In [5]:
# Convert each element into a searchable Document
from langchain_core.documents import Document  # current import path; langchain.schema is deprecated
documents = [Document(page_content=el.text, metadata={"source": filename}) for el in pdf_elements]
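
The conversion above keeps only the file path as metadata. Below is a minimal sketch of a richer conversion, assuming each chunk exposes `text` and an optional `metadata.page_number`. `element_to_record` is a hypothetical helper, and stand-in objects replace real `unstructured` elements so the block runs on its own.

```python
from types import SimpleNamespace


def element_to_record(element, source):
    """Build the page_content/metadata pair for one chunk, or None if it has no text."""
    text = (element.text or "").strip()
    if not text:
        return None
    metadata = {"source": source}
    # Assumption: the page number, when known, lives on element.metadata.page_number.
    page = getattr(element.metadata, "page_number", None)
    if page is not None:
        metadata["page_number"] = page
    return {"page_content": text, "metadata": metadata}


# Stand-in chunks; real ones would come from partition_pdf.
chunks = [
    SimpleNamespace(text="Machine learning basics", metadata=SimpleNamespace(page_number=2)),
    SimpleNamespace(text="   ", metadata=SimpleNamespace()),
]
records = [r for r in (element_to_record(c, "/content/Ml_sample.pdf") for c in chunks) if r]
print(records)
```

Each record can then be turned into a LangChain document with `Document(**record)`. Skipping empty chunks avoids embedding blank strings.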


### 🧠 Step 3: Embed with OpenAI + FAISS

In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS  # current import path; langchain.vectorstores is deprecated

# Build a FAISS vector store from the documents
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

# Set up a retriever on top of the vector store
retriever = vectorstore.as_retriever()
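
With default settings, the retriever returns the four chunks whose embeddings are closest to the query embedding, measured by Euclidean distance. The toy sketch below illustrates that nearest-neighbor step in plain Python. It is not how FAISS is implemented internally, and the 2-D vectors are made up for illustration.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy 2-D "embeddings"; real OpenAI embeddings have many more dimensions.
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
print(top_k_by_l2((1.0, 0.0), doc_vecs, k=2))  # [1, 2]
```

To change how many chunks are returned in the notebook, pass `search_kwargs`, for example `vectorstore.as_retriever(search_kwargs={"k": 2})`.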


### 🧩 Step 4: Setup LangChain RAG Chain

In [7]:
# Create the RAG pipeline using LangChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI()

# Define how the question and context are formatted to the model
template = """
You are a helpful assistant that answers questions based on the provided context, which can include text and tables.
Use the provided context to answer the question.
Question: {input}
Context: {context}
Answer:
"""

prompt = ChatPromptTemplate.from_template(template)

# Chain: Retrieve context → Fill prompt → Run LLM → Return response
rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
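
In the chain above, the retriever's output (a list of documents) is placed into the prompt as-is, so the model sees the list's string representation, metadata included. A common pattern is to join only the text first. Here is a minimal sketch, where `format_docs` is a hypothetical helper and stand-in objects replace LangChain `Document` instances:

```python
from types import SimpleNamespace


def format_docs(docs, separator="\n\n---\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-in documents; in the notebook these would be LangChain Document objects.
docs = [
    SimpleNamespace(page_content="AI is a branch of computer science."),
    SimpleNamespace(page_content="ML is a subset of AI."),
]
print(format_docs(docs))
```

In the notebook this could be wired in as `{"context": retriever | format_docs, "input": RunnablePassthrough()}`, which keeps the rest of the chain unchanged.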


### ❓ Run a Query

In [16]:
import pandas as pd

queries = [
    "Compare Artificial Intelligence and Machine Learning from the document.",
    "what is machine learing "
]

# Run each query through the RAG pipeline and get context + response
data = []
for query in queries:
    response = rag_chain.invoke(query)
    context_docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents
    context_text = "\n---\n".join([doc.page_content for doc in context_docs])
    data.append({
        "query": query,
        "context": context_text,
        "output": response
    })

df = pd.DataFrame(data)
print(df)


                                               query  \
0  Compare Artificial Intelligence and Machine Le...   
1                           what is machine learing    

                                             context  \
0  TOPIC-1: Introduction- Artificial Intelligence...   
1  Arthur Samuel\n\n• The term machine learning w...   

                                              output  
0  Artificial Intelligence (AI) is a branch of co...  
1  Machine learning is a technology that enables ...  
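
The loop above retrieves twice per query: once inside `rag_chain.invoke` and once more to record the context. That costs an extra embedding call, and the logged context is only the same as what the model saw if both retrievals return identical results. The sketch below retrieves once and reuses the result; `retrieve` and `generate` are hypothetical callables standing in for the retriever and for a `prompt | llm | StrOutputParser()` chain.

```python
def answer_with_context(query, retrieve, generate):
    """Retrieve once, then reuse the same context for the answer and the log row."""
    docs = retrieve(query)  # here: a list of plain strings
    context = "\n---\n".join(docs)
    output = generate({"input": query, "context": context})
    return {"query": query, "context": context, "output": output}


# Fake components so the sketch runs offline.
calls = []


def fake_retrieve(query):
    calls.append(query)
    return ["chunk A", "chunk B"]


def fake_generate(inputs):
    return f"answer using {len(inputs['context'])} context characters"


row = answer_with_context("what is machine learning", fake_retrieve, fake_generate)
print(row)
```

With real LangChain objects, `retrieve` would need to return the `page_content` of each document rather than the documents themselves.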


## Evaluate the output with the LLUMO AI SDK

---



In [10]:
!pip install llumo


In [17]:
from llumo import LlumoClient

# Never hard-code the API key in the notebook; read it from Colab's Secrets panel instead.
# The secret name LLUMO_API_KEY is an assumption; use whatever name you stored it under.
client = LlumoClient(api_key=userdata.get("LLUMO_API_KEY"))


In [21]:
df.to_csv("output.csv", index=False)  # Save for future use

In [26]:
# Read the key from Colab's Secrets panel (the secret name LLUMO_API_KEY is an assumption)
client = LlumoClient(api_key=userdata.get("LLUMO_API_KEY"))

df = pd.read_csv("output.csv")

res = client.evaluate(
    df,
    evals=["Context Utilization"],
    prompt_template="Give me answer to following: {{query}} based on: {{context}}",
)
# res = client.evaluateCompressor(df, prompt_template="Give me answer to following: {{query}} based on: {{context}}")


Connecting to socket server...
Attempting direct WebSocket connection...
Socket connection established
Engine.IO connection established with SID: 4v4dfmKYvPIoiA7QAAA0
Waiting for server to acknowledge connection with connection-established event...
Server acknowledged connection with 'connection-established' event: {'socketId': 'nWUBZWHBrfLyKQ6UAAA1'}
Received server socket ID: nWUBZWHBrfLyKQ6UAAA1
Connection fully established. Server socket ID: nWUBZWHBrfLyKQ6UAAA1
Connected with socket ID: nWUBZWHBrfLyKQ6UAAA1

Validating API key for Context Utilization...
Making API key validation request to: https://backend-api.llumo.ai/api/v1/workspace-key-details
Request body: {'analytics': ['Context Utilization']}
{"success":true,"message":"Workspace hits details fetched successfully","data":{"analyticsMapping":{"Context Utilization":"Context Utilization: This metric evaluates how well the AI's response uses the provided context to deliver accurate, detailed, and complete information. It measure
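
The `prompt_template` refers to `{{query}}` and `{{context}}`, which match column names in the DataFrame. Assuming the SDK fills each placeholder from the column of the same name (an assumption based on this notebook, not on documented behavior), a quick check before calling `client.evaluate` can catch a renamed or missing column early. `missing_template_columns` is a hypothetical helper.

```python
import re


def missing_template_columns(prompt_template, columns):
    """Return placeholders in {{name}} form that have no matching column."""
    placeholders = re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt_template)
    available = set(columns)
    # dict.fromkeys preserves order while dropping duplicate placeholders.
    return [name for name in dict.fromkeys(placeholders) if name not in available]


template = "Give me answer to following: {{query}} based on: {{context}}"
print(missing_template_columns(template, ["query", "context", "output"]))  # []
print(missing_template_columns(template, ["query", "output"]))  # ['context']
```

In the notebook, the second argument would be `df.columns`.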

In [27]:
res

| | query | context | output | Context Utilization |
|---|---|---|---|---|
| 0 | Compare Artificial Intelligence and Machine Le... | TOPIC-1: Introduction- Artificial Intelligence... | Artificial Intelligence (AI) is a branch of co... | 99 |
| 1 | what is machine learing | Arthur Samuel\n\n• The term machine learning w... | Machine learning is a technology that enables ... | 89 |
