# Build Your Own Retrieval Augmented Generation (RAG) Bot

### [YouTube video covering this notebook](https://youtu.be/m_3q3XnLlTI?si=rI9mCpNcYpVB5jZF)

## Tech used
- LangChain [link](https://www.langchain.com/)
- Unstructured [link](https://unstructured.io/)
- LangSmith [link](https://smith.langchain.com/)
- Qdrant Cloud [link](https://cloud.qdrant.io)
- Groq API [link](https://console.groq.com/playground)
- Llama3 via Groq API
- Fastembed [link](https://github.com/qdrant/fastembed)

## File types used
- PDF
- Markdown

### Some important links
- https://unstructured.io/
- https://unstructured-io.github.io/unstructured/index.html
- https://docs.unstructured.io/api-reference/api-services/python-sdk
- https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications/

## Setup

In [10]:
#%%capture
%pip install "unstructured[all-docs]" unstructured-client watermark langchain-groq langchain fastembed qdrant_client python-dotenv

^C
Note: you may need to restart the kernel to use updated packages.


In [7]:
%load_ext watermark

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [8]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [9]:
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements


In [8]:
#import langchain, groq, fastembed, qdrant_client, unstructured

In [9]:
%watermark --iversions

langchain    : 0.1.19
qdrant_client: 1.9.1
unstructured : 0.13.7
fastembed    : 0.2.7
groq         : 0.6.0



In [93]:
import unstructured.partition

help(unstructured.partition)

Help on package unstructured.partition in unstructured:

NAME
    unstructured.partition

PACKAGE CONTENTS
    api
    auto
    common
    csv
    doc
    docx
    email
    epub
    html
    image
    json
    lang
    md
    model_init
    msg
    odt
    org
    pdf
    pdf_image (package)
    ppt
    pptx
    rst
    rtf
    strategies
    text
    text_type
    tsv
    utils (package)
    xlsx
    xml

FILE
    /Users/sudarshan/Documents/yt-code/youtube-stuffs/data-cleaning/.venv/lib/python3.11/site-packages/unstructured/partition/__init__.py




## Preprocess the PDF

In [10]:
partition_pdf??

Signature:
partition_pdf(
    filename: 'str' = '',
    file: 'Optional[IO[bytes]]' = None,
    include_page_breaks: 'bool' = False,
    strategy: 'str' = 'auto',
    infer_table_structure: 'bool' = False,
    ocr_languages: 'Optional[str]' = None,
    languages: 'Optional[list[str]]' = None,
    include_metadata: 'bool'

In [11]:
from unstructured.partition.pdf import partition_pdf

# Specify the path to your PDF file
filename = "data/gpt4all.pdf"
#path = "images"

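# Partition the PDF with the hi_res strategy, infer table structure, and chunk by title
# (image extraction is switched off via extract_images_in_pdf=False)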
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    strategy = "hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=3000,
    #new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    #extract_image_block_output_dir=path,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element class names
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 15,
 "<class 'unstructured.documents.elements.TableChunk'>": 2}

In [13]:
element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

{'Table', 'CompositeElement'}
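
The tally above can be written more compactly with `collections.Counter`. The sketch below is only an illustration: `FakeElement` is a hypothetical stand-in exposing a `category` attribute, so the snippet runs without the PDF or the `unstructured` package.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured element (illustration only)."""
    category: str


def count_categories(elements):
    """Return a Counter mapping each element category to its frequency."""
    return Counter(el.category for el in elements)


sample = [
    FakeElement("CompositeElement"),
    FakeElement("Table"),
    FakeElement("CompositeElement"),
]
counts = count_categories(sample)
print(counts.most_common())
```

With the real list, `count_categories(pdf_elements)` should give the same kind of summary, keyed by category rather than by class name.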


In [14]:
# Partition again without a chunking strategy to see the raw element types
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    strategy = "hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    #chunking_strategy="by_title",
    max_characters=3000,
    #new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    #extract_image_block_output_dir=path,
)

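# category_counts is not reset here, so the CompositeElement and TableChunk
# counts from the chunked run above carry over into the printed totals
for element in pdf_elements: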
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# unique_categories holds the distinct element class names
unique_categories = set(category_counts.keys())
print(category_counts)

element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

{"<class 'unstructured.documents.elements.CompositeElement'>": 15, "<class 'unstructured.documents.elements.TableChunk'>": 2, "<class 'unstructured.documents.elements.Text'>": 11, "<class 'unstructured.documents.elements.Header'>": 1, "<class 'unstructured.documents.elements.Title'>": 22, "<class 'unstructured.documents.elements.NarrativeText'>": 32, "<class 'unstructured.documents.elements.Footer'>": 1, "<class 'unstructured.documents.elements.Image'>": 6, "<class 'unstructured.documents.elements.FigureCaption'>": 2, "<class 'unstructured.documents.elements.Table'>": 1, "<class 'unstructured.documents.elements.ListItem'>": 29}
{'Title', 'UncategorizedText', 'Footer', 'Table', 'FigureCaption', 'NarrativeText', 'ListItem', 'Header', 'Image'}


In [15]:
pdf_elements[0].to_dict()

{'type': 'UncategorizedText',
 'element_id': 'b0c5cfcf93a217591e27d5c97845f59b',
 'text': '3 2 0 2',
 'metadata': {'coordinates': {'points': ((45.388888888888886,
     732.8055555555557),
    (45.388888888888886, 843.9166666666669),
    (100.94444444444446, 843.9166666666669),
    (100.94444444444446, 732.8055555555557)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [16]:
tables = [el for el in pdf_elements if el.category == "Table"]

In [17]:
table_html = tables[0].metadata.text_as_html
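
Once a table is available as HTML, it can be flattened into a header and a list of rows for inspection. This is a minimal sketch using only the standard library and assuming well-formed markup. The `sample_html` string is a short hand-written example of the kind of markup `text_as_html` holds, and `table_to_rows` is a hypothetical helper, not part of `unstructured`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a text_as_html value (assumption for illustration)
sample_html = (
    "<table>"
    "<thead><th>Model</th><th>BoolQ</th><th>PIQA</th></thead>"
    "<tr><td>GPT4All-J 6B v1.0*</td><td>73.4</td><td>74.8</td></tr>"
    "</table>"
)


def table_to_rows(html):
    """Return (header, rows) where rows is a list of lists of cell strings."""
    root = ET.fromstring(html)
    header = [th.text for th in root.iter("th")]
    rows = [[td.text for td in tr.iter("td")] for tr in root.iter("tr")]
    return header, rows


header, rows = table_to_rows(sample_html)
print(header)
print(rows)
```

Real OCR output is not always valid XML, so a production version would probably need a more forgiving parser.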

In [18]:
from io import StringIO 
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())

<table>
  <thead>
    <th>Model</th>
    <th>BoolQ</th>
    <th>PIQA</th>
    <th>HellaSwag</th>
    <th>WinoG.</th>
    <th>ARC-e</th>
    <th>ARC-c</th>
    <th>OBQA</th>
    <th>Avg.</th>
  </thead>
  <tr>
    <td>GPT4AII-J 6B v1.0*</td>
    <td>73.4</td>
    <td>74.8</td>
    <td>63.4</td>
    <td>64.7</td>
    <td>54.9</td>
    <td>36</td>
    <td>40.2</td>
    <td>58.2</td>
  </tr>
  <tr>
    <td>GPT4AIl-J v1.1-breezy*</td>
    <td>74</td>
    <td>75.1</td>
    <td>63.2</td>
    <td>63.6</td>
    <td>55.4</td>
    <td>34.9</td>
    <td>38.4</td>
    <td>57.8</td>
  </tr>
  <tr>
    <td>GPT4AII-J v1.2-jazzy*</td>
    <td>74.8</td>
    <td>74.9</td>
    <td>63.6</td>
    <td>63.8</td>
    <td>56.6</td>
    <td>35.3</td>
    <td>41</td>
    <td>58.6</td>
  </tr>
  <tr>
    <td>GPT4AII-J v1.3-groovy*</td>
    <td>73.6</td>
    <td>74.3</td>
    <td>63.8</td>
    <td>63.5</td>
    <td>57.7</td>
    <td>35</td>
    <td>38.8</td>
    <td>58.1</td>
  </tr>
  <tr>
    <td>GPT4AII-J Lora 6

In [19]:
# Find the element with text "References" and category "Title"
reference_title = [
    el for el in pdf_elements
    if el.text == "References"
    and el.category == "Title"
][0]

In [20]:
reference_title.to_dict()

{'type': 'Title',
 'element_id': 'd3f115969fa159c8ae83287b2de7a62e',
 'text': 'References',
 'metadata': {'detection_class_prob': 0.8571382164955139,
  'coordinates': {'points': ((196.85, 199.78396606445312),
    (196.85, 235.40411376953125),
    (351.13720722222223, 235.40411376953125),
    (351.13720722222223, 199.78396606445312)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 5,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [21]:
# Get the ID of the reference title element
references_id = reference_title.id

In [22]:
for element in pdf_elements:
    if element.metadata.parent_id == references_id:
        print(element)
        break

Nomic AI. 2023. Atlas. https://atlas.nomic.ai/.


In [23]:
# Filter out elements with a parent_id matching references_id
pdf_elements = [el for el in pdf_elements if el.metadata.parent_id != references_id]

### Filter out headers

In [24]:
headers = [el for el in pdf_elements if el.category == "Header"]

In [25]:
len(headers)

1

In [26]:
headers[0].to_dict()

{'type': 'Header',
 'element_id': '4bff1bcde9e4a6e875fb8a8fc7b79e19',
 'text': 'v o N 6 ] L C . s c [',
 'metadata': {'detection_class_prob': 0.5846725702285767,
  'coordinates': {'points': ((45.388888888888886, 816.0743408203125),
    (45.388888888888886, 1539.7510986328125),
    (100.94444444444446, 1539.7510986328125),
    (100.94444444444446, 816.0743408203125)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [27]:
# Drop every element whose category is "Header"
pdf_elements = [el for el in pdf_elements if el.category != "Header"]

In [28]:
len(pdf_elements)

72
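
The two filtering steps in this section (dropping the children of the "References" title, then dropping headers) can be folded into one pass. The sketch below uses `SimpleNamespace` stand-ins that expose only the attributes touched above; `make_element` and `drop_elements` are hypothetical helpers.

```python
from types import SimpleNamespace


def make_element(category, text, parent_id=None, element_id=None):
    """Build a hypothetical stand-in exposing the attributes used above."""
    return SimpleNamespace(
        category=category,
        text=text,
        id=element_id,
        metadata=SimpleNamespace(parent_id=parent_id),
    )


def drop_elements(elements, parent_ids=(), categories=()):
    """Keep elements whose parent and category are not in the excluded sets."""
    parent_ids = set(parent_ids)
    categories = set(categories)
    return [
        el for el in elements
        if el.metadata.parent_id not in parent_ids
        and el.category not in categories
    ]


sample = [
    make_element("Title", "References", element_id="ref"),
    make_element("NarrativeText", "Nomic AI. 2023. Atlas.", parent_id="ref"),
    make_element("Header", "page header"),
    make_element("NarrativeText", "Body paragraph."),
]
kept = drop_elements(sample, parent_ids={"ref"}, categories={"Header"})
print([el.text for el in kept])
```

As in the cells above, the "References" title itself is kept; only its children and the header are removed.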

In [29]:
# Let's look at an arbitrary element again
pdf_elements[33].to_dict()

{'type': 'Title',
 'element_id': '12c1dd0555bedb5ccc2a4d6366af96c7',
 'text': '3 From a Model to an Ecosystem',
 'metadata': {'detection_class_prob': 0.8266856670379639,
  'coordinates': {'points': ((193.63523864746094, 1666.2607421875),
    (193.63523864746094, 1700.904488611111),
    (686.5115356445312, 1700.904488611111),
    (686.5115356445312, 1666.2607421875)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 2,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

## Preprocess the Markdown file

In [30]:
filename_md = "data/uber_10q_march_2022.md"

In [31]:
md_elements = partition_md(filename=filename_md)

In [32]:
# Compare an arbitrary Markdown element with the PDF element at the same index
md_elements[33].to_dict(), pdf_elements[33].to_dict()

({'type': 'Title',
  'element_id': 'd7e7b8d880ae1ff8fd2f389fd1f21329',
  'text': 'PART I - FINANCIAL INFORMATION',
  'metadata': {'last_modified': '2024-03-23T21:26:32',
   'languages': ['eng'],
   'filetype': 'text/markdown',
   'file_directory': 'data',
   'filename': 'uber_10q_march_2022.md'}},
 {'type': 'Title',
  'element_id': '12c1dd0555bedb5ccc2a4d6366af96c7',
  'text': '3 From a Model to an Ecosystem',
  'metadata': {'detection_class_prob': 0.8266856670379639,
   'coordinates': {'points': ((193.63523864746094, 1666.2607421875),
     (193.63523864746094, 1700.904488611111),
     (686.5115356445312, 1700.904488611111),
     (686.5115356445312, 1666.2607421875)),
    'system': 'PixelSpace',
    'layout_width': 1654,
    'layout_height': 2339},
   'last_modified': '2024-05-03T13:31:00',
   'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 2,
   'file_directory': 'data',
   'filename': 'gpt4all.pdf'}})

#### Let's still do some more exploration

In [33]:
len(pdf_elements), len(md_elements)

(72, 1506)

In [34]:
elements = chunk_by_title(pdf_elements + md_elements) # you can play around with the chunk_by_title arguments

In [35]:
len(elements)

731
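
To build intuition for what title-based chunking does, here is a deliberately simplified sketch: a new chunk starts at every `Title`, and also whenever adding the next element would exceed a character budget. This is not the `unstructured` implementation, which handles more cases (overlap and combining small sections, for example). The `(category, text)` tuples and `simple_chunk_by_title` are assumptions made for illustration.

```python
def simple_chunk_by_title(elements, max_characters=500):
    """Group (category, text) pairs into chunks that start at each Title.

    A chunk is also closed early when appending the next text would push it
    past max_characters. A single oversized element is kept whole.
    """
    chunks, current = [], []

    def flush():
        # Emit the texts collected so far as one chunk, then start over
        if current:
            chunks.append("\n\n".join(current))
            current.clear()

    for category, text in elements:
        projected = len("\n\n".join(current + [text]))
        if category == "Title" or (current and projected > max_characters):
            flush()
        current.append(text)
    flush()
    return chunks


sample = [
    ("Title", "1 Introduction"),
    ("NarrativeText", "Large language models are popular."),
    ("Title", "2 Method"),
    ("NarrativeText", "We train a model."),
    ("NarrativeText", "We evaluate it."),
]
chunks = simple_chunk_by_title(sample, max_characters=60)
for chunk in chunks:
    print(repr(chunk))
```

Lowering `max_characters` produces more, smaller chunks, which is the same trade-off the real `max_characters` argument controls.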

In [36]:
pdf_elements[0].to_dict()

{'type': 'UncategorizedText',
 'element_id': 'b0c5cfcf93a217591e27d5c97845f59b',
 'text': '3 2 0 2',
 'metadata': {'coordinates': {'points': ((45.388888888888886,
     732.8055555555557),
    (45.388888888888886, 843.9166666666669),
    (100.94444444444446, 843.9166666666669),
    (100.94444444444446, 732.8055555555557)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [37]:
pdf_elements[1].to_dict()

{'type': 'UncategorizedText',
 'element_id': '07edc40df2508eb1259212408427c16f',
 'text': '1 v 1 3 9 4 0 . 1 1 3 2 : v i X r a',
 'metadata': {'coordinates': {'points': ((45.388888888888886,
     1218.8611111111109),
    (45.388888888888886, 1680.25),
    (100.94444444444446, 1680.25),
    (100.94444444444446, 1218.8611111111109)),
   'system': 'PixelSpace',
   'layout_width': 1654,
   'layout_height': 2339},
  'last_modified': '2024-05-03T13:31:00',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'page_number': 1,
  'parent_id': '4bff1bcde9e4a6e875fb8a8fc7b79e19',
  'file_directory': 'data',
  'filename': 'gpt4all.pdf'}}

In [38]:
elements[0].to_dict()

{'type': 'CompositeElement',
 'element_id': 'de7552d5-35e4-4f67-9e92-8bb73a59f958',
 'text': '3 2 0 2\n\n1 v 1 3 9 4 0 . 1 1 3 2 : v i X r a\n\nGPT4All: An Ecosystem of Open Source Compressed Language Models\n\nYuvanesh Anand Nomic AI yuvanesh@nomic.ai\n\nZach Nussbaum Nomic AI zach@nomic.ai\n\nAdam Treat Nomic AI adam@nomic.ai\n\nAaron Miller Nomic AI aaron@nomic.ai\n\nRichard Guo Nomic AI richard@nomic.ai\n\nBen Schmidt Nomic AI ben@nomic.ai\n\nGPT4All Community Planet Earth\n\nBrandon Duderstadt∗ Nomic AI brandon@nomic.ai\n\nAndriy Mulyar∗ Nomic AI andriy@nomic.ai',
 'metadata': {'file_directory': 'data',
  'filename': 'gpt4all.pdf',
  'filetype': 'application/pdf',
  'languages': ['eng'],
  'last_modified': '2024-05-03T13:31:00',
  'page_number': 1,
  'orig_elements': 'eJzlWNtuGzcQ/RVCz86W94uf6rRBUKBJg8YF2rqBwMvQWmAvwmqVRgny75292FZiubEL6MH1k3QOh1guz+FwZi8+LaCCGpp+WabFKVkEGlXMMTvhOTPKMeAmqeiMlSorFxYnZFFD75PvPcZ/WsS27VLZ+B42I678rt32yxWUl6seGS6Ewzkz/XeZ+hWyTCuJ7Lotm36Yd3EhVYGMEbyw707I

In [39]:
chunk_by_title??

Signature:
chunk_by_title(
    elements: 'Iterable[Element]',
    *,
    combine_text_under_n_chars: 'Optional[int]' = None,
    include_orig_elements: 'Optional[bool]' = None,
    max_characters: 'Optional[int]' = None,
    multipage_sections: 'Optional[bool]' = None,
    new_after_n_chars: 'Optional[int]' = None,
    overlap: 'Optional[int]' = None,

In [40]:
#chunk_elements = chunk_by_title((pdf_elements + md_elements),combine_text_under_n_chars=100,max_characters=3000)
#len(chunk_elements)

## Load the Documents into the Vector DB

In [41]:
import os
from langchain_core.documents import Document
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Qdrant

In [42]:
documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    metadata.pop("languages", None)  # drop the list-valued field; tolerate elements without it
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))

In [94]:
len(documents)

731

In [43]:
from dotenv import load_dotenv
load_dotenv()

True

In [44]:
qdrant_url = os.getenv("QDRANT_URL")
qdrant_api_key = os.getenv("QDRANT_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")
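
`os.getenv` silently returns `None` for a missing variable, and the failure then surfaces later as a confusing connection or authentication error. A small guard like the one below fails early instead. `require_env` is a hypothetical helper, and the `environ` parameter exists only so the example can run against a plain dict.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with a fake environment instead of real secrets
fake_env = {"QDRANT_URL": "https://example.invalid", "QDRANT_API_KEY": "placeholder"}
settings = require_env(["QDRANT_URL", "QDRANT_API_KEY"], environ=fake_env)
print(sorted(settings))
```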

In [45]:
embeddings = FastEmbedEmbeddings()

Fetching 5 files: 100%|██████████| 5/5 [00:00<00:00, 78251.94it/s]


In [46]:
# this will take some time, patience is the key :)
vectorstore = Qdrant.from_documents(
    documents=documents,
    embedding=embeddings,
    url=qdrant_url,
    collection_name="rag",
    api_key=qdrant_api_key,
)

In [47]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
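
Conceptually, a `"similarity"` retriever with `k=3` embeds the query, scores every stored vector against it, and returns the three best matches. The toy sketch below shows that ranking step in plain Python, assuming cosine similarity as the metric. The 2-D vectors and texts are made up, and a real vector database uses an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up 2-D "embeddings" purely for illustration
indexed = [
    ("uber revenue", [1.0, 0.0]),
    ("gpt4all training", [0.0, 1.0]),
    ("uber net loss", [0.9, 0.1]),
]
results = top_k([1.0, 0.05], indexed, k=2)
print([text for _, text in results])
```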

## Let's create RAG (Qdrant, Groq, LangChain, Llama3)

In [48]:
from langchain.prompts.prompt import PromptTemplate
from langchain_groq import ChatGroq
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

In [59]:
template = """You are an AI assistant for answering questions about the GPT4All paper and Quarterly Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934 for the quarterly period ended March 31, 2022.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
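
A prompt template is essentially a string with named placeholders. The sketch below shows the substitution step with plain `str.format`, using a shortened version of the template above. `build_prompt` and the sample chunks are hypothetical, and joining chunks with blank lines is an assumption rather than a description of what LangChain does internally.

```python
TEMPLATE = (
    "Question: {question}\n"
    "=========\n"
    "{context}\n"
    "=========\n"
    "Answer in Markdown:"
)


def build_prompt(question, chunks, template=TEMPLATE):
    """Join retrieved chunk texts and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(question=question, context=context)


filled = build_prompt("What is GPT4All?", ["Chunk one.", "Chunk two."])
print(filled)
```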

In [60]:
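llm = ChatGroq(temperature=0, model_name="llama3-8b-8192")

# doc_chain answers from the retrieved chunks using LangChain's default map_reduce prompts
doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
# question_generator only rewrites follow-up questions when chat_history is non-empty
question_generator_chain = LLMChain(llm=llm, prompt=prompt)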
qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

In [61]:
qa_chain.invoke({
    "question": "What was the net loss including non-controlling interests of Uber in 2021", #line 533
    "chat_history": []
})["answer"]

'The net loss including non-controlling interests of Uber in 2021 was $(122) million.\nSOURCES: uber_10q_march_2022.md'
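
The chain returns the answer and its sources as a single string. If the two parts are needed separately, splitting on the last `SOURCES:` marker is one simple approach. The sketch below assumes the response follows the `answer\nSOURCES: a, b` shape seen above; `split_answer_and_sources` is a hypothetical helper.

```python
def split_answer_and_sources(response):
    """Split 'answer\\nSOURCES: a, b' into (answer, [sources])."""
    answer, marker, sources = response.rpartition("SOURCES:")
    if not marker:
        # No marker found: treat the whole response as the answer
        return response.strip(), []
    return answer.strip(), [s.strip() for s in sources.split(",") if s.strip()]


response = (
    "The net loss including non-controlling interests of Uber in 2021 "
    "was $(122) million.\nSOURCES: uber_10q_march_2022.md"
)
answer, sources = split_answer_and_sources(response)
print(answer)
print(sources)
```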

In [87]:
# metadata filtering in action: restrict retrieval to chunks from gpt4all.pdf
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "filter": {"source": "gpt4all.pdf"}}
)
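
Conceptually, the `filter` entry in `search_kwargs` narrows the candidate set by metadata before the similarity ranking is applied. The plain-Python sketch below mimics that behaviour, assuming each stored point keeps its document metadata under a `metadata` key. It illustrates the idea only and is not Qdrant's API; `matches_filter` and the sample points are hypothetical.

```python
def matches_filter(payload, conditions):
    """True when every key/value in conditions equals the payload's metadata."""
    metadata = payload.get("metadata", {})
    return all(metadata.get(key) == value for key, value in conditions.items())


# Made-up stored points shaped like {"page_content": ..., "metadata": {...}}
points = [
    {"page_content": "GPT4All-Snoozy ...", "metadata": {"source": "gpt4all.pdf"}},
    {"page_content": "Net loss ...", "metadata": {"source": "uber_10q_march_2022.md"}},
]
selected = [p for p in points if matches_filter(p, {"source": "gpt4all.pdf"})]
print([p["page_content"] for p in selected])
```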

In [88]:
filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)

In [92]:
# The source filter is already applied by filter_retriever, so no extra input key is needed
filter_chain.invoke({
    "question": "How was GPT4All-Snoozy developed?",
    "chat_history": [],
})["answer"]

"I'm happy to help!\n\nFINAL ANSWER: The president did not mention Michael Jackson.\nSOURCES:\n\nFINAL ANSWER: This Agreement is governed by English law.\nSOURCES: 28-pl\n\nFINAL ANSWER: The president did not mention Michael Jackson.\nSOURCES:\n\nFINAL ANSWER: GPT4All-Snoozy was developed using roughly the same procedure as the previous GPT4All models, but with a few key modifications.\nSOURCES: gpt4all.pdf"