# Build Your Own Retrieval Augmented Generation (RAG) Bot

### [YouTube video covering this notebook](https://youtu.be/m_3q3XnLlTI?si=rI9mCpNcYpVB5jZF)

## Tech used
- [LangChain](https://www.langchain.com/)
- [Unstructured](https://unstructured.io/)
- [LangSmith](https://smith.langchain.com/)
- [Qdrant Cloud](https://cloud.qdrant.io)
- [Groq API](https://console.groq.com/playground)
- Llama3 via Groq API
- [Fastembed](https://github.com/qdrant/fastembed)

## File types used
- PDF
- Markdown

### Some important links
- https://unstructured.io/
- https://unstructured-io.github.io/unstructured/index.html
- https://docs.unstructured.io/api-reference/api-services/python-sdk
- https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/

## Setup

In [3]:
#%%capture
%pip install "unstructured[all-docs]" unstructured-client watermark langchain-groq langchain langchain-community fastembed qdrant_client python-dotenv

Collecting unstructured[all-docs]
  Using cached unstructured-0.18.15-py3-none-any.whl (1.8 MB)
Collecting watermark
  Using cached watermark-2.5.0-py2.py3-none-any.whl (7.7 kB)
Collecting langchain-groq
  Using cached langchain_groq-1.0.0-py3-none-any.whl (16 kB)
Collecting langchain
  Using cached langchain-1.0.1-py3-none-any.whl (106 kB)
Collecting fastembed
  Using cached fastembed-0.7.3-py3-none-any.whl (105 kB)
Collecting qdrant_client
  Using cached qdrant_client-1.15.1-py3-none-any.whl (337 kB)
Collecting dataclasses-json
  Using cached dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting effdet
  Using cached effdet-0.4.1-py3-none-any.whl (112 kB)
Collecting msoffcrypto-tool
  Using cached msoffcrypto_tool-5.4.2-py3-none-any.whl (48 kB)
Collecting google-cloud-vision
  Using cached google_cloud_vision-3.11.0-py3-none-any.whl (529 kB)
Collecting onnx>=1.17.0
  Using cached onnx-1.19.1-cp311-cp311-win_amd64.whl (16.5 MB)
Collecting unstructured-inference>=1.0.5
  Using cac

ERROR: Could not install packages due to an OSError: [WinError 32] The process cannot access the file because it is being used by another process: 'd:\\MultiModulRag\\venv\\Lib\\site-packages\\transformers\\models\\clip\\tokenization_clip.py'
Check the permissions.


[notice] A new release of pip available: 22.3 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [99]:
%load_ext watermark

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [5]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [None]:
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

In [None]:
import langchain, groq, fastembed, qdrant_client, unstructured

In [9]:
%watermark --iversions

fastembed    : 0.7.3
qdrant_client: 1.15.1
langchain    : 1.0.1
unstructured : 0.18.15
groq         : 0.32.0



In [None]:
import unstructured.partition

help(unstructured.partition)

Help on package unstructured.partition in unstructured:

NAME
    unstructured.partition

PACKAGE CONTENTS
    api
    auto
    common (package)
    csv
    doc
    docx
    email
    epub
    html (package)
    image
    json
    md
    model_init
    msg
    ndjson
    odt
    org
    pdf
    pdf_image (package)
    ppt
    pptx
    rst
    rtf
    strategies
    text
    text_type
    tsv
    utils (package)
    xlsx
    xml

FILE
    d:\multimodulrag\venv\lib\site-packages\unstructured\partition\__init__.py




## Preprocess the PDF

In [11]:
partition_pdf??

Signature:
partition_pdf(
    filename: 'Optional[str]' = None,
    file: 'Optional[IO[bytes]]' = None,
    include_page_breaks: 'bool' = False,
    strategy: 'str' = 'auto',
    infer_table_structure: 'bool' = False,
    ocr_languages: 'Optional[str]' = None,
    languages: 'Optional[list[str]]' = None,
    detect_language_per_element: 'bool' = False,
    metadata_last_modified: 'Optional[str]' = None,
    chunking_strategy: 'Optional[str]' = None,
    hi_res_model_name: 'Optional[str]' = None,
    extract_images_in_pdf: 'bool' = False,
    extract_image_block_types: 'Optional[list[str]]' = None

In [1]:
from unstructured.partition.pdf import partition_pdf

# Specify the path to your PDF file
filename = r"D:\MultiModulRag\docs\NIPS-2017-attention-is-all-you-need-Paper.pdf"
#path = "images"

# Partition the PDF with the hi_res strategy, infer table structure, and chunk by title
# (image extraction is turned off in this first pass)
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=3000,
    #new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    #extract_image_block_output_dir=path,
)



The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


In [2]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element classes seen so far
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 24}

In [3]:
element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

{'CompositeElement'}
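
The two tallies above (by Python class and by the `type` field) can be done in one pass with `collections.Counter`. This is a minimal sketch, assuming plain dictionaries in place of real `el.to_dict()` output; the sample records and the `count_types` helper are hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for `el.to_dict()` results; real elements carry more keys.
sample_element_dicts = [
    {"type": "CompositeElement", "text": "first chunk"},
    {"type": "CompositeElement", "text": "second chunk"},
    {"type": "Table", "text": "a table"},
]


def count_types(element_dicts):
    """Return a Counter mapping each element type to its number of occurrences."""
    return Counter(item["type"] for item in element_dicts)


type_counts = count_types(sample_element_dicts)
print(type_counts)       # per-type totals
print(set(type_counts))  # the distinct types, as in `unique_types`
```

Because the counter is rebuilt on every call, re-running the cell does not accumulate totals from earlier runs.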


In [4]:
# Extract images, tables, and chunk text
path = r"D:\MultiModulRag\Backend\output\images\groq"
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=True,
    strategy = "hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=3000,
    extract_image_block_types=["Image"],
    new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    extract_image_block_output_dir=path,
)

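# Note: category_counts still holds the totals from the earlier cell, so the
# counts printed below accumulate across both runs. Reset it with
# `category_counts = {}` first if you want a fresh count.
for element in pdf_elements: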
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element classes seen so far
unique_categories = set(category_counts.keys())
print(category_counts)

element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

{"<class 'unstructured.documents.elements.CompositeElement'>": 48}
{'CompositeElement'}


In [5]:
pdf_elements[1].to_dict()

{'type': 'CompositeElement',
 'element_id': '3a5e68b361dc8794d9341af4d9b4f022',
 'text': 'Illia Polosukhin∗ ‡\n\nillia.polosukhin@gmail.com\n\nAbstract\n\nThe dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring signiﬁcantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score

In [7]:
tables = [el for el in pdf_elements if el.category == "Table"]
tables

[]

In [None]:
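# text_as_html is an attribute, not a method. With an empty `tables` list
# the [0] lookup still raises IndexError, as shown below.
table_html = tables[0].metadata.text_as_html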

IndexError: list index out of range
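
Indexing `[0]` into an empty list raises `IndexError`, as the cell above shows. A defensive lookup might look like the following sketch. It uses a hypothetical `FakeElement` stand-in rather than real unstructured elements, and `first_of_category` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in exposing only the attributes used below."""
    category: str
    text: str


def first_of_category(elements: Iterable[FakeElement], category: str) -> Optional[FakeElement]:
    """Return the first element with the given category, or None if there is none."""
    return next((el for el in elements if el.category == category), None)


chunks = [FakeElement("CompositeElement", "some chunk text")]
table = first_of_category(chunks, "Table")
if table is None:
    print("No Table elements found; skipping HTML extraction.")
```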

In [85]:
from io import StringIO 
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

<table>
  <thead>
    <tr>
      <th>Layer Type</th>
      <th>Complexity per Layer</th>
      <th>Sequential Operations</th>
      <th>Maximum Path Length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Self-Attention</td>
      <td>O(n? - d)</td>
      <td>O(1)</td>
      <td>O(1)</td>
    </tr>
    <tr>
      <td>Recurrent</td>
      <td>O(n-d?)</td>
      <td>O(n)</td>
      <td>O(n)</td>
    </tr>
    <tr>
      <td>Convolutional</td>
      <td>O(k-n-d?)</td>
      <td>olny</td>
      <td>O(logx(n))</td>
    </tr>
    <tr>
      <td>Self-Attention (restricted)</td>
      <td>O(r-n-d)</td>
      <td>ol)</td>
      <td>O(n/r)</td>
    </tr>
  </tbody>
</table>
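
Once the table HTML is available, it can be flattened into rows for quick inspection. The sketch below uses only the standard library's `html.parser`; the `TableRowParser` class is illustrative, and `sample_html` is a shortened, hand-written table in the same shape as the output above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample_html = (
    "<table><thead><tr><th>Layer Type</th><th>Sequential Operations</th></tr></thead>"
    "<tbody><tr><td>Self-Attention</td><td>O(1)</td></tr>"
    "<tr><td>Recurrent</td><td>O(n)</td></tr></tbody></table>"
)

row_parser = TableRowParser()
row_parser.feed(sample_html)
for row in row_parser.rows:
    print(" | ".join(row))
```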



In [73]:
# Find the element with text "References" and category "Title"
reference_title = [
    el for el in pdf_elements
    if el.text == "References"
    and el.category == "Title"][0]

IndexError: list index out of range

In [74]:
reference_title.to_dict()

{'type': 'Title',
 'element_id': '89beabec90ed5c180a6f9323a7fa5cf4',
 'text': 'References',
 'metadata': {'detection_class_prob': 0.8838503956794739,
  'coordinates': {'points': ((np.float64(299.9165344238281),
     np.float64(201.40676888888876)),
    (np.float64(299.9165344238281), np.float64(234.61565777777764)),
    (np.float64(457.28265380859375), np.float64(234.61565777777764)),
    (np.float64(457.28265380859375), np.float64(201.40676888888876))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-10-20T15:54:36',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 10,
  'file_directory': 'D:\\MultiModulRag\\docs',
  'filename': 'NIPS-2017-attention-is-all-you-need-Paper.pdf'}}

In [75]:
# Get the ID of the reference title element
references_id = reference_title.id

In [76]:
for element in pdf_elements:
    if element.metadata.parent_id == references_id:
        print(element)
        break

In [77]:
# Filter out elements with a parent_id matching references_id
pdf_elements = [el for el in pdf_elements if el.metadata.parent_id != references_id]
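
Filtering on `parent_id` removes the children of a title but leaves the title element itself in the list. The sketch below drops both. It is written against a hypothetical `FakeElement` that flattens `metadata.parent_id` into a plain `parent_id` field; the ids and texts are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in; real elements keep parent_id under `.metadata`."""
    id: str
    text: str
    parent_id: Optional[str] = None


def drop_section(elements: List[FakeElement], title_id: str) -> List[FakeElement]:
    """Remove the title with `title_id` and every element whose parent is that title."""
    return [el for el in elements if el.id != title_id and el.parent_id != title_id]


sample_elements = [
    FakeElement("t1", "Conclusion"),
    FakeElement("p1", "body text", parent_id="t1"),
    FakeElement("t2", "References"),
    FakeElement("r1", "first reference entry", parent_id="t2"),
]
kept = drop_section(sample_elements, "t2")
print([el.text for el in kept])
```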

### Filter out headers

In [78]:
headers = [el for el in pdf_elements if el.category == "Header"]

In [79]:
len(headers)

0

In [80]:
headers[0].to_dict()

IndexError: list index out of range

In [81]:
# Drop every element whose category is "Header"
pdf_elements = [el for el in pdf_elements if el.category != "Header"]

In [82]:
len(pdf_elements)

24

In [98]:
# Let's inspect another element at an arbitrary index
pdf_elements[10].to_dict()

{'type': 'CompositeElement',
 'element_id': '5e6dd3e3a8b20747a9afd1979fe37fff',
 'text': '3.3 Position-wise Feed-Forward Networks\n\nIn addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.\n\nFFN(x) = max(0,xW1 + b1)W2 + b2 (2)\n\nWhile the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048.',
 'metadata': {'filetype': 'application/pdf',
  'languages': ['eng'],
  'last_modified': '2025-10-20T15:54:36',
  'page_number': 5,
  'orig_elements': 'eJy9VU1v4zYQ/SsDnRLUFERK1EeAPewlQIE2CNoUOQSBQYkji1iZMkg6trvof+9QSnaDrbuHAIkvlh755oNv+PTwNcERt

## Preprocess the Markdown file

In [None]:
filename_md = "data/uber_10q_march_2022.md"

In [46]:
md_elements = partition_md(filename=filename_md)

FileNotFoundError: [Errno 2] No such file or directory: 'data/uber_10q_march_2022.md'

In [47]:
# Let's inspect an element at the same index in both lists
md_elements[33].to_dict(), pdf_elements[33].to_dict()

NameError: name 'md_elements' is not defined

#### Some more exploration

In [48]:
len(pdf_elements), len(md_elements)

NameError: name 'md_elements' is not defined

In [34]:
elements = chunk_by_title(pdf_elements + md_elements) # you can play around with the chunk_by_title arguments

In [35]:
len(elements)

731

In [36]:
pdf_elements[0].to_dict()

{'type': 'UncategorizedText',
 'element_id': 'b0c5cfcf93a217591e27d5c97845f59b',
 'text': '3 2 0 2',
 'metadata': {'coordinates': {'points': ((45.388888888888886,
     732.8055555555557),
    (45.388888888888886, 843.9166666666669),
    (100.94444444444446, 843.9166666666669),
    (100.94444444444446, 732.8055555555557)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [37]:
pdf_elements[1].to_dict()

{'type': 'UncategorizedText',
 'element_id': '07edc40df2508eb1259212408427c16f',
 'text': '1 v 1 3 9 4 0 . 1 1 3 2 : v i X r a',
 'metadata': {'coordinates': {'points': ((45.388888888888886,
     1218.8611111111109),
    (45.388888888888886, 1680.25),
    (100.94444444444446, 1680.25),
    (100.94444444444446, 1218.8611111111109)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'parent_id': '4bff1bcde9e4a6e875fb8a8fc7b79e19',
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [38]:
elements[0].to_dict()

{'type': 'CompositeElement',
 'element_id': 'de7552d5-35e4-4f67-9e92-8bb73a59f958',
 'text': '3 2 0 2\n\n1 v 1 3 9 4 0 . 1 1 3 2 : v i X r a\n\nGPT4All: An Ecosystem of Open Source Compressed Language Models\n\nYuvanesh Anand Nomic AI yuvanesh@nomic.ai\n\nZach Nussbaum Nomic AI zach@nomic.ai\n\nAdam Treat Nomic AI adam@nomic.ai\n\nAaron Miller Nomic AI aaron@nomic.ai\n\nRichard Guo Nomic AI richard@nomic.ai\n\nBen Schmidt Nomic AI ben@nomic.ai\n\nGPT4All Community Planet Earth\n\nBrandon Duderstadt∗ Nomic AI brandon@nomic.ai\n\nAndriy Mulyar∗ Nomic AI andriy@nomic.ai',
 'metadata': {'file_directory': 'data',
  'filename': 'gpt4all.pdf',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'last_modified': '2024-05-03T13:31:00',
  'page_number': 1,
  'orig_elements': 'eJzlWNtuGzcQ/RVCz86W94uf6rRBUKBJg8YF2rqBwMvQWmAvwmqVRgny75292FZiubEL6MH1k3QOh1guz+FwZi8+LaCCGpp+WabFKVkEGlXMMTvhOTPKMeAmqeiMlSorFxYnZFFD75PvPcZ/WsS27VLZ+B42I678rt32yxWUl6seGS6Ewzkz/XeZ+hWyTCuJ7Lotm36Yd3EhVYGMEbyw707I

In [39]:
chunk_by_title??

Signature:
chunk_by_title(
    elements: 'Iterable[Element]',
    *,
    combine_text_under_n_chars: 'Optional[int]' = None,
    include_orig_elements: 'Optional[bool]' = None,
    max_characters: 'Optional[int]' = None,
    multipage_sections: 'Optional[bool]' = None,
    new_after_n_chars: 'Optional[int]' = None,
    overlap: 'Optional[int]' = None,

In [40]:
#chunk_elements = chunk_by_title((pdf_elements + md_elements),combine_text_under_n_chars=100,max_characters=3000)
#len(chunk_elements)
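
To build intuition for what `max_characters` and `combine_text_under_n_chars` control, here is a deliberately simplified sketch of title-based chunking. It is not unstructured's implementation: the `chunk_by_title_sketch` function, its `(category, text)` input format, and its rules are assumptions made for illustration only.

```python
def chunk_by_title_sketch(elements, max_characters=500, combine_text_under_n_chars=0):
    """Group (category, text) pairs into chunks that start at titles.

    Simplified rules: a new chunk starts at each "Title" unless the current
    chunk is still shorter than `combine_text_under_n_chars`; a chunk is also
    closed when adding the next text would exceed `max_characters`.
    A single oversized text is kept whole rather than split.
    """
    chunks, current = [], []

    def joined_len(extra=None):
        parts = current + ([extra] if extra is not None else [])
        return len("\n\n".join(parts))

    for category, text in elements:
        starts_section = category == "Title" and joined_len() >= combine_text_under_n_chars
        too_long = bool(current) and joined_len(text) > max_characters
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = [
    ("Title", "1 Intro"),
    ("NarrativeText", "a" * 30),
    ("Title", "2 Background"),
    ("NarrativeText", "b" * 30),
]
split_at_titles = chunk_by_title_sketch(sample, max_characters=100)
merged_small = chunk_by_title_sketch(sample, max_characters=100, combine_text_under_n_chars=200)
print(len(split_at_titles), len(merged_small))
```

With the default settings each title opens its own chunk; raising `combine_text_under_n_chars` lets short sections merge into their neighbour as long as the size limit allows it.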

## Load the Documents into the Vector DB

In [41]:
import os
from langchain_core.documents import Document
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Qdrant

In [42]:
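documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    # Drop the list-valued "languages" field; pop() avoids a KeyError if it is absent
    metadata.pop("languages", None)
    # Record the originating file so answers can cite their source
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))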

In [94]:
len(documents)

731

In [43]:
from dotenv import load_dotenv
load_dotenv()

True

In [44]:
qdrant_url = os.getenv("QDRANT_URL")
qdrant_api_key = os.getenv("QDRANT_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")
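
`os.getenv` returns `None` for a variable that is not set, and the failure then only surfaces later when a client tries to connect. A small check up front can make that easier to diagnose. This is a sketch; `missing_env_vars` is a hypothetical helper, and the variable names are the ones read above.

```python
import os

REQUIRED_VARS = ("QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```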

In [45]:
embeddings = FastEmbedEmbeddings()

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 78251.94it/s]


In [46]:
# This will take some time; patience is key :)
vectorstore = Qdrant.from_documents(
    documents=documents,
    embedding=embeddings,
    url=qdrant_url,
    api_key=qdrant_api_key,
    collection_name="rag",
)

In [47]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
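
With `search_type="similarity"` and `k=3`, the retriever returns the three stored chunks whose embeddings are closest to the query embedding. The idea can be sketched in plain Python with toy 2-D vectors standing in for real embeddings. Cosine similarity is assumed here; the metric actually used depends on how the collection is configured, and `top_k` is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index: (chunk text, made-up 2-D embedding)
toy_index = [
    ("attention mechanisms", [1.0, 0.0]),
    ("feed-forward networks", [0.7, 0.7]),
    ("quarterly revenue", [0.0, 1.0]),
    ("net loss", [-1.0, 0.2]),
]
results = top_k([1.0, 0.1], toy_index, k=3)
for text, score in results:
    print(f"{score:.3f}  {text}")
```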

## Let's create RAG (Qdrant, Groq, LangChain, Llama3)

In [48]:
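from langchain_core.prompts import PromptTemplate
from langchain_groq import ChatGroq
# Legacy chain classes used below. On langchain >= 1.0 they are expected to live in
# the separate `langchain-classic` package (import from `langchain_classic.chains`).
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain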

In [59]:
template = """You are an AI assistant for answering questions about the GPT4All paper and Quarterly Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934 for the quarterly period ended March 31, 2022.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
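
The `{question}` and `{context}` placeholders are filled in before the text reaches the model. The sketch below shows that substitution with plain `str.format` on a shortened copy of the template; joining the retrieved chunks with blank lines is an assumption about how the context gets assembled, and `build_prompt` is a hypothetical helper.

```python
# Shortened copy of the template above, kept under a different name
short_template = """Question: {question}
=========
{context}
=========
Answer in Markdown:"""


def build_prompt(question, retrieved_chunks):
    """Fill the template, joining the retrieved chunks with blank lines."""
    context = "\n\n".join(retrieved_chunks)
    return short_template.format(question=question, context=context)


prompt_text = build_prompt(
    "How was GPT4All-Snoozy developed?",
    ["chunk one text", "chunk two text"],
)
print(prompt_text)
```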

In [60]:
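llm = ChatGroq(temperature=0, model_name="llama3-8b-8192")

doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
# Caveat (based on how ConversationalRetrievalChain is understood to work): the
# question generator is only called when chat_history is non-empty, and it is then
# given `question` and `chat_history`. The prompt above expects `context` instead,
# so treat this setup as working only with an empty chat history.
question_generator_chain = LLMChain(llm=llm, prompt=prompt)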
qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

In [61]:
qa_chain.invoke({
    "question": "What was the net loss including non-controlling interests of Uber in 2021", #line 533
    "chat_history": []
})["answer"]

'The net loss including non-controlling interests of Uber in 2021 was $(122) million.\nSOURCES: uber_10q_march_2022.md'
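
The chain returns the answer and its sources in one string, separated by a `SOURCES:` marker. If they are needed separately, a small parser along these lines could split them. This is a sketch: `split_answer_and_sources` is a hypothetical helper, and it assumes sources are comma-separated after the last marker.

```python
def split_answer_and_sources(raw):
    """Split an 'answer ... SOURCES: a, b' string into (answer, [sources])."""
    answer, marker, tail = raw.rpartition("SOURCES:")
    if not marker:  # no SOURCES marker present
        return raw.strip(), []
    sources = [part.strip() for part in tail.split(",") if part.strip()]
    return answer.strip(), sources


raw_answer = (
    "The net loss including non-controlling interests of Uber in 2021 was $(122) million."
    "\nSOURCES: uber_10q_march_2022.md"
)
answer_text, source_list = split_answer_and_sources(raw_answer)
print(answer_text)
print(source_list)
```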

In [87]:
# Metadata-filtered search in action: restrict retrieval to a single source file
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "filter": {"source": "gpt4all.pdf"}}
)
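
The `filter` entry in `search_kwargs` limits the search to documents whose metadata matches every given key. Conceptually it behaves like the exact-match check below, written over plain dictionaries; `matches_filter` and the sample documents are hypothetical, and the real filtering is done by the vector store.

```python
def matches_filter(metadata, conditions):
    """True when every key in `conditions` is present in `metadata` with an equal value."""
    return all(metadata.get(key) == value for key, value in conditions.items())


sample_docs = [
    {"text": "chunk from the PDF", "metadata": {"source": "gpt4all.pdf"}},
    {"text": "chunk from the report", "metadata": {"source": "uber_10q_march_2022.md"}},
]
filtered = [
    doc for doc in sample_docs
    if matches_filter(doc["metadata"], {"source": "gpt4all.pdf"})
]
print([doc["text"] for doc in filtered])
```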

In [88]:
filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

In [92]:
# The source filter is already configured on filter_retriever, so no extra key is needed here
filter_chain.invoke({
    "question": "How was GPT4All-Snoozy developed?",
    "chat_history": [],
})["answer"]

"I'm happy to help!\n\nFINAL ANSWER: The president did not mention Michael Jackson.\nSOURCES:\n\nFINAL ANSWER: This Agreement is governed by English law.\nSOURCES: 28-pl\n\nFINAL ANSWER: The president did not mention Michael Jackson.\nSOURCES:\n\nFINAL ANSWER: GPT4All-Snoozy was developed using roughly the same procedure as the previous GPT4All models, but with a few key modifications.\nSOURCES: gpt4all.pdf"