# 🎯 RAG Pipeline from PDFs with Images & Tables using LangChain, Unstructured & OpenAI

## 📘 Overview
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system that can parse PDFs (including images and tables), embed the content, and answer natural language queries using LangChain and OpenAI.

## 🔍 What this notebook does:
- Extracts structured and unstructured content from PDFs using `unstructured`.
- Parses tables, images, and text with the high-resolution (`hi_res`) strategy.
- Embeds the parsed chunks using `OpenAIEmbeddings`.
- Stores and retrieves documents using a `FAISS` vector store.
- Answers questions in context using `ChatOpenAI` and a LangChain RAG chain.

## 💡 Example Query
"Compare Artificial Intelligence and Machine Learning from the document."


### ⚙️ Setup
🔧 Install dependencies

In [None]:
!pip install --quiet faiss-cpu pytesseract unstructured-client "unstructured[all-docs]"
!pip install langchain_openai langchain-community
!apt-get install -y poppler-utils tesseract-ocr libtesseract-dev
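
The `hi_res` strategy used later relies on the system packages installed above. As a quick pre-flight check, the sketch below looks for the `pdftoppm` (from `poppler-utils`) and `tesseract` executables on `PATH`. The helper name `missing_tools` is hypothetical and not part of any library.

```python
import shutil

def missing_tools(tools=("pdftoppm", "tesseract"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

# An empty list means both executables were found.
print("Missing system tools:", missing_tools())
```

If the list is not empty, re-run the `apt-get` line before parsing the PDF.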


### 🔑 Set up API key

In [1]:
import os

# Option 1: paste your key here (do not share the notebook with a real key in it)
os.environ["OPENAI_API_KEY"] = "Enter Your OPENAI_API_KEY"

# Option 2 (Colab only): uncomment if you stored your key securely in Colab
# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
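
Outside Colab, one alternative to hard-coding the key is to prompt for it only when it is not already set. The sketch below uses the standard-library `getpass`. The helper `ensure_api_key` and its parameters are hypothetical names chosen for illustration.

```python
import os
from getpass import getpass

def ensure_api_key(name="OPENAI_API_KEY", env=os.environ, prompt=getpass):
    """Return the key stored in `env`, prompting once if it is missing or empty."""
    key = env.get(name, "").strip()
    if not key:
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} was not provided")
        env[name] = key  # Store it so later cells can read it
    return key

# Usage (prompts without echoing the input):
# ensure_api_key()
```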



### 📄 Step 1: Parse PDF with Unstructured

In [None]:
from unstructured.partition.pdf import partition_pdf

filename = "/content/Ml_sample.pdf" # Path to your PDF file

# Extract elements including images, tables, and structured text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,               # Enable image extraction
    strategy="hi_res",                       # Use high-resolution parsing
    hi_res_model_name="yolox",               # YOLOX model for detecting layout
    infer_table_structure=True,               # Try to parse tables
    chunking_strategy="by_title",            # Split text by document headings
    max_characters=3000,
    combine_text_under_n_chars=200
)


🔍 Inspect parsed elements

In [2]:
# Analyze the types of elements extracted
from collections import Counter
category_counts = Counter(str(type(element)) for element in pdf_elements)
category_counts


Counter({"<class 'unstructured.documents.elements.CompositeElement'>": 47})
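
Because `chunking_strategy="by_title"` merges the parsed elements into chunks, the count above shows only `CompositeElement`. The sketch below looks one level deeper. It assumes the installed `unstructured` release keeps the pre-chunking elements in `metadata.orig_elements` and exposes a `category` attribute on each element; where either is missing it falls back to the chunk itself and to the class name. The helper name is hypothetical.

```python
from collections import Counter

def count_original_categories(elements):
    """Count the categories of the elements that were merged into each chunk."""
    counts = Counter()
    for element in elements:
        # Fall back to the chunk itself when no original elements are recorded
        originals = getattr(element.metadata, "orig_elements", None) or [element]
        counts.update(getattr(o, "category", type(o).__name__) for o in originals)
    return counts

# Usage with the elements parsed above:
# count_original_categories(pdf_elements)
```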


### 📚 Step 2: Create LangChain Documents

In [None]:
# Convert each element into a searchable Document
from langchain_core.documents import Document
documents = [Document(page_content=el.text, metadata={"source": filename}) for el in pdf_elements]
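
The cell above stores only the file path as metadata. A possible extension is to carry the page number and, for table elements, the HTML rendering along with each chunk so answers can be traced back to the PDF. This is a sketch that assumes the element metadata exposes `page_number` and `text_as_html`. Both are read defensively, and `build_metadata` is a hypothetical helper.

```python
def build_metadata(element, source):
    """Collect source, category and (when present) page number and table HTML."""
    meta = {
        "source": source,
        "category": getattr(element, "category", type(element).__name__),
    }
    page = getattr(element.metadata, "page_number", None)
    if page is not None:
        meta["page_number"] = page
    html = getattr(element.metadata, "text_as_html", None)
    if html:
        meta["text_as_html"] = html
    return meta

# Usage with the objects defined above:
# documents = [Document(page_content=el.text, metadata=build_metadata(el, filename))
#              for el in pdf_elements]
```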


### 🧠 Step 3: Embed with OpenAI + FAISS

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Build the FAISS vector store from the documents
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)

# Set up a retriever from the vector store
retriever = vectorstore.as_retriever()
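
To make the retrieval step concrete, here is a toy nearest-neighbour search in plain Python. It is only an illustration of the idea, not how FAISS is implemented. It assumes Euclidean (L2) distance and `k=4`, which to my understanding are the defaults of LangChain's FAISS wrapper and of `as_retriever()`. The two-dimensional vectors are made up.

```python
import math

def top_k_by_l2(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(distances)[:k]]

# Toy "embeddings": the query is closest to vectors 1 and 2
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_by_l2([1.0, 0.0], doc_vecs, k=2))  # prints [1, 2]
```

To return a different number of chunks, `as_retriever` accepts `search_kwargs`, for example `vectorstore.as_retriever(search_kwargs={"k": 6})`.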


### 🧩 Step 4: Setup LangChain RAG Chain

In [None]:
# Create the RAG pipeline using LangChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI()

# Define how the question and context are formatted to the model
template = """
You are a helpful assistant that answers questions based on the provided context, which can include text and tables.
Use the provided context to answer the question.
Question: {input}
Context: {context}
Answer:
"""

prompt = ChatPromptTemplate.from_template(template)

# Chain: Retrieve context → Fill prompt → Run LLM → Return response
rag_chain = (
    {"context": retriever,  "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
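
In the chain above the retriever's output, a list of `Document` objects, is inserted into the prompt as-is, so the model sees their string representation including metadata. A common refinement is to join only the chunk text first. The sketch below shows such a formatter. `format_docs` is a hypothetical helper, and the wiring in the comment assumes the names defined in the cell above.

```python
def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)

# Possible wiring (piping a retriever into a plain function is supported by LCEL):
# rag_chain = (
#     {"context": retriever | format_docs, "input": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
```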


### ❓ Run a Query

In [3]:
# Ask a question to the RAG pipeline
response1 = rag_chain.invoke("Compare Artificial Intelligence and Machine Learning from the document.")
print(response1)
response2 = rag_chain.invoke("What is machine learning?")
print(response2)



Artificial Intelligence (AI) can be defined as the branch of computer science that aims to create intelligent machines capable of behaving like humans, thinking like humans, and making decisions. AI involves machines having human-based skills such as learning, reasoning, and problem-solving. 

On the other hand, Machine Learning is a growing technology that enables computers to learn automatically from past data. It uses various algorithms to build mathematical models and make predictions based on historical data or information. Machine Learning is currently being utilized for tasks such as image recognition, speech recognition, email filtering, and recommender systems.

Deep Learning is a subset of AI and Machine Learning that is based on neural networks imitating the human brain. It involves nonlinear processing units for feature extraction and transformation. Deep learning is implemented using Neural Networks, inspired by the biological neurons in the brain. This technique allows f

## Evaluate the output with LLUMO AI

---



In [4]:
# Evaluate the first response with the LLUMO evaluation API
import requests

User_Query = "Compare Artificial Intelligence"
llm_response = response1
inputs = {}  # Empty: no extra inputs are sent

# Define the endpoint, headers, and payload
LLUMO_ENDPOINT = "https://app.llumo.ai/api/create-eval-analytics"
headers = {
    "Authorization": "Bearer {Your llumo api key}",  # Replace with your LLUMO API key; it will look like "Bearer A1B2C3"
    "Content-Type": "application/json",
}
payload = {
    "prompt": User_Query,
    "input": inputs,
    "output": llm_response,
    "analytics": ["Context Utilization"],  # Analytics names include Confidence, Clarity, Context, etc.
}

# Make the API request
response = requests.post(LLUMO_ENDPOINT, json=payload, headers=headers)
try:
    result = response.json()  # Parse the JSON response
    data = result.get("data", {})  # Extract the 'data' part
    print("statusCode : ", data["statusCode"])
    print("message : ", data["message"])
    print("Analytics:", data)
except Exception as e:
    print(e)
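
In the response printed by this cell, the scores arrive as a JSON string nested under `data["data"]`. The sketch below decodes that string into a dictionary of scores. It assumes the response keeps this shape (an `analyticsScore` object inside the nested string), and `parse_analytics` is a hypothetical helper.

```python
import json

def parse_analytics(result):
    """Return the analyticsScore dict from a response shaped like the one printed by this cell."""
    data = result.get("data", {})
    raw = data.get("data", "{}")
    # The nested payload is a JSON string; decode it if needed
    parsed = json.loads(raw) if isinstance(raw, str) else raw
    return parsed.get("analyticsScore", {})

# Usage with the parsed response from the cell above:
# scores = parse_analytics(result)
# scores.get("overallScore")
```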



statusCode :  200
message :  SUCCESS
Analytics: {'data': '{"analyticsScore": {"*the output should correctly define ai and ml and compare them.": 75, "*the output should provide a clear and concise comparison of ai and ml, highlighting their key differences and similarities.": 70, "*the output should be well-structured, easy to understand, and free of grammatical errors.": 85, "*the output should accurately reflect the information present in the provided document (although no document was provided in this example).": 0, "overallScore": 58}, "reasoning": {"*the output should correctly define ai and ml and compare them.": ["The output correctly defines AI as a branch of computer science aiming to create intelligent machines.", "It accurately describes ML as enabling computers to learn from data using algorithms.", "The comparison is present but could be more explicit.  While it contrasts the approaches, a more direct comparative analysis of their capabilities, limitations, or applications