<a href="https://colab.research.google.com/github/SarathM1/RAG/blob/main/RAG_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

Design a custom RAG pipeline to answer questions from this textbook -
https://openstax.org/details/books/concepts-biology

## Important Pointers:
1. Download the PDF from the link above.
2. To make indexing faster, you can pick any 2 chapters from the PDF and treat them as the source.
3. Use any in-memory vector database if required.
4. Use any open-source HuggingFace model as the LLM.

## Output artifacts
1. Entire codebase in GitHub with links to access the artifacts we need for evaluation:
   a. Please add docstrings wherever necessary.
2. Additional Colab notebook to run the backend logic and evaluations:
   a. Please add text blocks in your Colab to add scenarios/assumptions etc. to make it readable.
3. Any additional artifacts like system design architecture, assumptions, and a list of issues you couldn't solve because of time constraints and how you could fix them in future.

## Additional (bonus):
1. Streamlit/Gradio frontend to interact with your pipeline
2. Wrap the entire application inside a Docker container
3. Draft and implement all the necessary APIs using FastAPI or any other Python web framework of choice
4. Produce an alternative way to do RAG without using any library like LangChain, LlamaIndex or Haystack

# LLM model
Mistral 7B

# Install the Dependencies

In [10]:
%%capture
! pip install -U langchain_community tiktoken chromadb langchain langchainhub sentence_transformers "PyMuPDF>=1.24.0"

In [11]:
!pip show PyMuPDF

Name: PyMuPDF
Version: 1.24.3
Summary: A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Home-page: 
Author: Artifex
Author-email: support@artifex.com
License: GNU AFFERO GPL 3.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: PyMuPDFb
Required-by: 


In [12]:
import fitz

# Install Ollama

Ollama is a framework that allows you to run open-source LLMs locally.

In [2]:
%%capture
!curl https://ollama.ai/install.sh | sh

# Start Ollama server in the background

In [3]:
import subprocess
import time

# Start ollama as a background process
command = "nohup ollama serve&"

# Use subprocess.Popen to start the process in the background.
# Output is discarded: an unread PIPE could fill up and stall the server.
process = subprocess.Popen(command,
                           shell=True,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
print("Process ID:", process.pid)
time.sleep(5)  # Makes Python wait for 5 seconds

Process ID: 1019


In [4]:
# Test if Ollama serve is up
!ollama -v

ollama version is 0.1.36
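
The fixed five-second sleep used when starting the server is a guess at how long startup takes. A minimal sketch of an alternative is to poll the port until it accepts connections. It assumes the server listens on Ollama's default address `127.0.0.1:11434`; `wait_for_port` is a hypothetical helper written for this illustration, not part of Ollama or of the notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not up yet, retry shortly
    return False
```

Calling `wait_for_port("127.0.0.1", 11434)` after `subprocess.Popen` would then replace the fixed `time.sleep(5)`.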


In [5]:
%%capture
# Pull the model
!ollama pull mistral

# Parsing the PDF Document

In [6]:
# !pip install unstructured pdf2image pdfminer.six pillow_heif PyPDF2 pytesseract pikepdf

In [7]:
#from unstructured.partition.pdf import partition_pdf
#from unstructured.chunking.title import chunk_by_title

# Download the file

In [8]:
!mkdir -p data
!wget -O './data/Concepts_of_Biology.pdf' -nc 'https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf?_gl=1*11bd84p*_ga*OTg3MTMyOTg1LjE3MTUyNTI1MjY.*_ga_T746F8B0QC*MTcxNTM0NTM1Ni4yLjAuMTcxNTM0NTM1Ny41OS4wLjA.'

--2024-05-11 12:30:35--  https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf?_gl=1*11bd84p*_ga*OTg3MTMyOTg1LjE3MTUyNTI1MjY.*_ga_T746F8B0QC*MTcxNTM0NTM1Ni4yLjAuMTcxNTM0NTM1Ny41OS4wLjA.
Resolving assets.openstax.org (assets.openstax.org)... 13.35.116.61, 13.35.116.9, 13.35.116.116, ...
Connecting to assets.openstax.org (assets.openstax.org)|13.35.116.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 153179709 (146M) [application/pdf]
Saving to: ‘./data/Concepts_of_Biology.pdf’


2024-05-11 12:30:36 (123 MB/s) - ‘./data/Concepts_of_Biology.pdf’ saved [153179709/153179709]



## Extract two chapters from the PDF for easier processing

In [13]:
import pathlib
import string

import fitz

In [14]:
def to_markdown(doc: fitz.Document, pages: list = None) -> str:
    """Process the document and return the text of its selected pages."""
    if isinstance(doc, str):
        doc = fitz.open(doc)
    SPACES = set(string.whitespace)  # used to check relevance of text pieces
    if not pages:  # use all pages if argument not given
        pages = range(doc.page_count)

    class IdentifyHeaders:
        """Compute data for identifying header text."""

        def __init__(self, doc, pages: list = None, body_limit: float = None):
            """Read all text and make a dictionary of fontsizes.

            Args:
                pages: optional list of pages to consider
                body_limit: consider text with larger font size as some header
            """
            if pages is None:  # use all pages if omitted
                pages = range(doc.page_count)
            fontsizes = {}
            for pno in pages:
                page = doc[pno]
                blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
                for span in [  # look at all non-empty horizontal spans
                    s
                    for b in blocks
                    for l in b["lines"]
                    for s in l["spans"]
                    if not SPACES.issuperset(s["text"])
                ]:
                    fontsz = round(span["size"])
                    count = fontsizes.get(fontsz, 0) + len(span["text"].strip())
                    fontsizes[fontsz] = count

            # maps a fontsize to a string of multiple # header tag characters
            self.header_id = {}

            # If not provided, choose the most frequent font size as body text.
            # If no text at all on all pages, just use 12
            if body_limit is None:
                temp = sorted(
                    [(k, v) for k, v in fontsizes.items()],
                    key=lambda i: i[1],
                    reverse=True,
                )
                if temp:
                    body_limit = temp[0][0]
                else:
                    body_limit = 12

            sizes = sorted(
                [f for f in fontsizes.keys() if f > body_limit], reverse=True
            )

            # make the header tag dictionary
            for i, size in enumerate(sizes):
                self.header_id[size] = "#" * (i + 1) + " "

        def get_header_id(self, span):
            """Return appropriate markdown header prefix.

            Given a text span from a "dict"/"rawdict" extraction, determine the
            markdown header prefix string of 0 to many concatenated '#' characters.
            """
            fontsize = round(span["size"])  # compute fontsize
            hdr_id = self.header_id.get(fontsize, "")
            return hdr_id

    def resolve_links(links, span):
        """Accept a span bbox and return a markdown link string."""
        bbox = fitz.Rect(span["bbox"])  # span bbox
        # a link should overlap at least 70% of the span
        bbox_area = 0.7 * abs(bbox)
        for link in links:
            hot = link["from"]  # the hot area of the link
            if not abs(hot & bbox) >= bbox_area:
                continue  # does not touch the bbox
            text = f'[{span["text"].strip()}]({link["uri"]})'
            return text

    def write_text(page, clip, hdr_prefix):
        """Output the text found inside the given clip.

        This is an alternative for plain text in that it outputs
        text enriched with markdown styling.
        The logic is capable of recognizing headers, body text, code blocks,
        inline code, bold, italic and bold-italic styling.
        There is also some effort for list support (ordered / unordered) in
        that typical characters are replaced by respective markdown characters.
        """
        out_string = ""
        code = False  # mode indicator: outputting code

        # extract URL type links on page
        links = [l for l in page.get_links() if l["kind"] == 2]

        blocks = page.get_text(
            "dict",
            clip=clip,
            flags=fitz.TEXTFLAGS_TEXT,
            sort=True,
        )["blocks"]

        for block in blocks:  # iterate textblocks
            previous_y = 0
            for line in block["lines"]:  # iterate lines in block
                if line["dir"][1] != 0:  # only consider horizontal lines
                    continue
                spans = [s for s in line["spans"]]

                this_y = line["bbox"][3]  # current bottom coord

                # check for still being on same line
                same_line = abs(this_y - previous_y) <= 3 and previous_y > 0

                if same_line and out_string.endswith("\n"):
                    out_string = out_string[:-1]

                # are all spans in line in a mono-spaced font?
                all_mono = all([s["flags"] & 8 for s in spans])

                # compute text of the line
                text = "".join([s["text"] for s in spans])
                if not same_line:
                    previous_y = this_y
                    if not out_string.endswith("\n"):
                        out_string += "\n"

                if all_mono:
                    # compute approx. distance from left - assuming a width
                    # of 0.5*fontsize.
                    delta = int(
                        (spans[0]["bbox"][0] - block["bbox"][0])
                        / (spans[0]["size"] * 0.5)
                    )
                    if not code:  # if not already in code output  mode:
                        out_string += "```"  # switch on "code" mode
                        code = True
                    if not same_line:  # new code line with left indentation
                        out_string += "\n" + " " * delta + text + " "
                        previous_y = this_y
                    else:  # same line, simply append
                        out_string += text + " "
                    continue  # done with this line

                for i, s in enumerate(spans):  # iterate spans of the line
                    # this line is not all-mono, so switch off "code" mode
                    if code:  # still in code output mode?
                        out_string += "```\n"  # switch off code mode
                        code = False
                    # decode font properties
                    mono = s["flags"] & 8
                    bold = s["flags"] & 16
                    italic = s["flags"] & 2

                    if mono:
                        # this is text in some monospaced font
                        out_string += f"`{s['text'].strip()}` "
                    else:  # not a mono text
                        # for first span, get header prefix string if present
                        if i == 0:
                            hdr_string = hdr_prefix.get_header_id(s)
                        else:
                            hdr_string = ""
                        prefix = ""
                        suffix = ""
                        if hdr_string == "":
                            if bold:
                                prefix = "**"
                                suffix += "**"
                            if italic:
                                prefix += "_"
                                suffix = "_" + suffix

                        ltext = resolve_links(links, s)
                        if ltext:
                            text = f"{hdr_string}{prefix}{ltext}{suffix} "
                        else:
                            text = f"{hdr_string}{prefix}{s['text'].strip()}{suffix} "
                        text = (
                            text.replace("<", "&lt;")
                            .replace(">", "&gt;")
                            .replace(chr(0xF0B7), "-")
                            .replace(chr(0xB7), "-")
                            .replace(chr(8226), "-")
                            .replace(chr(9679), "-")
                        )
                        out_string += text
                previous_y = this_y
                if not code:
                    out_string += "\n"
            out_string += "\n"
        if code:
            out_string += "```\n"  # switch off code mode
            code = False
        return out_string.replace(" \n", "\n")

    hdr_prefix = IdentifyHeaders(doc, pages=pages)
    md_string = ""

    for pno in pages:
        page = doc[pno]
        # 1. first locate all tables on page
        tabs = page.find_tables()

        # 2. make a list of table boundary boxes, sort by top-left corner.
        # Must include the header bbox, which may be external.
        tab_rects = sorted(
            [
                (fitz.Rect(t.bbox) | fitz.Rect(t.header.bbox), i)
                for i, t in enumerate(tabs.tables)
            ],
            key=lambda r: (r[0].y0, r[0].x0),
        )

        # 3. final list of all text and table rectangles
        text_rects = []
        # compute rectangles outside tables and fill final rect list
        for i, (r, idx) in enumerate(tab_rects):
            if i == 0:  # compute rect above all tables
                tr = page.rect
                tr.y1 = r.y0
                if not tr.is_empty:
                    text_rects.append(("text", tr, 0))
                text_rects.append(("table", r, idx))
                continue
            # read previous rectangle in final list: always a table!
            _, r0, idx0 = text_rects[-1]

            # check if a non-empty text rect is fitting in between tables
            tr = page.rect
            tr.y0 = r0.y1
            tr.y1 = r.y0
            if not tr.is_empty:  # empty if two tables overlap vertically!
                text_rects.append(("text", tr, 0))

            text_rects.append(("table", r, idx))

            # there may also be text below all tables
            if i == len(tab_rects) - 1:
                tr = page.rect
                tr.y0 = r.y1
                if not tr.is_empty:
                    text_rects.append(("text", tr, 0))

        if not text_rects:  # this will happen for table-free pages
            text_rects.append(("text", page.rect, 0))
        else:
            rtype, r, idx = text_rects[-1]
            if rtype == "table":
                tr = page.rect
                tr.y0 = r.y1
                if not tr.is_empty:
                    text_rects.append(("text", tr, 0))

        # we have all rectangles and can start outputting their contents
        for rtype, r, idx in text_rects:
            if rtype == "text":  # a text rectangle
                md_string += write_text(page, r, hdr_prefix)  # write MD content
                md_string += "\n"
            else:  # a table rect
                md_string += tabs[idx].to_markdown(clean=False)

        md_string += "\n-----\n\n"

    return md_string
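
The header detection in `IdentifyHeaders` can be hard to follow inside the larger function. The sketch below isolates the idea under simplified assumptions: the input is a plain dictionary mapping a rounded font size to the number of characters set in that size, and `header_prefixes` is a hypothetical name used only here. The size covering the most characters is treated as body text, and every larger size gets a markdown header prefix, largest first.

```python
def header_prefixes(char_count_by_size: dict[int, int], default_body: int = 12) -> dict[int, str]:
    """Map each font size larger than the body size to a markdown header prefix."""
    if char_count_by_size:
        # The size that covers the most characters is taken to be body text.
        body_size = max(char_count_by_size, key=char_count_by_size.get)
    else:
        body_size = default_body
    larger = sorted((s for s in char_count_by_size if s > body_size), reverse=True)
    # Largest size becomes "# ", the next "## ", and so on.
    return {size: "#" * (level + 1) + " " for level, size in enumerate(larger)}


print(header_prefixes({10: 5200, 14: 300, 20: 45}))  # {20: '# ', 14: '## '}
```

Sizes at or below the body size get no prefix, which is why captions and footnotes come out as plain text.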

# Convert PDF to Markdown
- The function below extracts the tables and text from the PDF, preserving formatting, and saves the result to disk
- The file is saved in the 'data' directory for later use

In [15]:
def pdf_to_markdown(fname, start_page, end_page):
    """Convert pages start_page..end_page (1-based, inclusive) to markdown and save it next to the PDF."""
    doc = fitz.open(fname)
    pages = []
    pages.extend(range(start_page - 1, end_page))

    # Extract text and tables from the selected pages as markdown
    md_string = to_markdown(doc, pages)

    # Save the markdown file to disk, use the same filename but change extension
    outname = doc.name.replace(".pdf", ".md")
    pathlib.Path(outname).write_bytes(md_string.encode())
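
`pdf_to_markdown` takes page numbers as a reader would see them in a PDF viewer (1-based, inclusive), while PyMuPDF indexes pages from 0. A small sketch of that conversion follows; `page_indices` is a hypothetical helper, and the input validation is an addition that the notebook's function does not perform.

```python
def page_indices(start_page: int, end_page: int) -> list[int]:
    """Convert an inclusive 1-based page range to 0-based page indices."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    return list(range(start_page - 1, end_page))


indices = page_indices(41, 102)
print(indices[0], indices[-1], len(indices))  # 40 101 62
```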

# Use chapters 2 and 3 for processing

- Please note that chapter 3 contains a table, which serves as an example of how tables are parsed
- Please refer to the generated markdown file in the data directory to see the parsed table
- Also note that image and table headings are parsed properly

In [16]:
fname = './data/Concepts_of_Biology.pdf'
start_page = 41
end_page = 102

pdf_to_markdown(fname, start_page, end_page)

# Use unstructured.io library for Chunking the generated Markdown file

In [17]:
%%capture
!pip install "unstructured[md]"

In [18]:
DOC_PATH = './data/Concepts_of_Biology.md'

In [19]:
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
import collections

In [20]:
elements = partition_md(filename=DOC_PATH)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [21]:
print(f"total number of elements:{len(elements)}")

categories = [el.category for el in elements]
print(f"Count by category:{collections.Counter(categories).most_common()}")

total number of elements:961
Count by category:[('NarrativeText', 555), ('Title', 355), ('ListItem', 30), ('UncategorizedText', 19), ('Table', 2)]


# Check if tables are extracted
- For debugging, the tables are saved as a text file, Table.txt
- Each line holds a single table
- Please ignore the formatting of the table. The table is formatted properly in the object, but the formatting is dropped when writing to disk for simplicity

In [22]:
tables = [el.text for el in elements if el.category == "Table"]
with open('./data/Table.txt', 'w') as f:
    for each_table in tables:
        f.write(each_table)
        f.write('\n')

In [23]:
# Check titles
titles = [el.text for el in elements if el.category == "Title"]
with open('./data/titles.txt', 'w') as f:
    for each_title in titles:
        f.write(each_title)
        f.write('\n')

# Chunking by Title using Unstructured.io
- unstructured has parsed the markdown file into Title, Table and NarrativeText elements, so it can now chunk along section boundaries
- With this strategy a chunk never crosses a section title: each chunk stops at the next title
- Sections that fit within the chunk size limit stay intact. `chunk_by_title` is called with its defaults here (`max_characters` defaults to 500), so longer sections are still split, as the sample output further down shows. Within that limit, chunks follow the document structure more closely than a purely size-based splitter such as `RecursiveCharacterTextSplitter` from LangChain
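
To make the chunking idea concrete, here is a simplified sketch in plain Python. It is not the unstructured implementation: elements are reduced to hypothetical `(category, text)` pairs, small sections are not merged, and oversized sections are cut at a fixed character count. It only illustrates the two rules at play: a title starts a new chunk, and a size cap can still split a long section.

```python
def chunk_by_headings(elements, max_characters=500):
    """Group (category, text) pairs into chunks that start at each "Title".

    Text longer than max_characters is cut into fixed-size pieces, which is
    how a section can still end up spread over several chunks.
    """
    sections, current = [], []
    for category, text in elements:
        if category == "Title" and current:
            sections.append(current)  # a new title closes the running section
            current = []
        current.append(text)
    if current:
        sections.append(current)

    chunks = []
    for section in sections:
        joined = "\n\n".join(section)
        # Hard cap on chunk size: slice oversized sections.
        for start in range(0, len(joined), max_characters):
            chunks.append(joined[start:start + max_characters])
    return chunks


sample = [
    ("Title", "2.2 Water"),
    ("NarrativeText", "Water is polar."),
    ("Title", "2.3 Biological Molecules"),
    ("NarrativeText", "x" * 40),
]
print(len(chunk_by_headings(sample, max_characters=30)))  # 4
```

The first section fits in one chunk; the second is 66 characters long and is cut into three pieces, none of which contains text from the first section.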

In [24]:
from unstructured.chunking.title import chunk_by_title

In [25]:
chunks = chunk_by_title(elements)
print(len(chunks))

482


In [26]:
for each_chunk in chunks[:5]:
  print(each_chunk.text)
  metadata = each_chunk.metadata.to_dict()
  del metadata["orig_elements"]
  print("\n # Metadata\n")
  print(metadata)
  print('-'*30)

CHAPTER 2

Chemistry of Life

FIGURE 2.1 Foods such as bread, fruit, and cheese are rich sources of biological macromolecules. (credit:
modification of work by Bengt Nyman)

CHAPTER OUTLINE

2.1 The Building Blocks of Molecules
2.2 Water
2.3 Biological Molecules

INTRODUCTION

The elements carbon, hydrogen, nitrogen, oxygen, sulfur, and phosphorus are

 # Metadata

{'file_directory': './data', 'filename': 'Concepts_of_Biology.md', 'filetype': 'text/markdown', 'languages': ['eng'], 'last_modified': '2024-05-11T12:34:06'}
------------------------------
the key building blocks of the chemicals found in living things. They form the carbohydrates,
nucleic acids, proteins, and lipids (all of which will be defined later in this chapter) that are the
fundamental molecular components of all organisms. In this chapter, we will discuss these
important building blocks and learn how the unique properties of the atoms of different elements
affect their interactions with other atoms to form the molec

# Convert the parsed chunks to 'Document' format
- Before indexing the data into Chroma, it must be converted to a compatible format: a list of `Document` objects
- While indexing, we also attach metadata that can be used later to filter results. For example, the `last_modified` value could be used to retrieve only the most recent content
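
As an illustration of metadata-based filtering, the sketch below post-filters a list of records by their `last_modified` timestamp in plain Python. The record layout and the `latest_only` helper are assumptions made for this example; it does not use the vector store's own filter syntax.

```python
from datetime import datetime


def latest_only(records: list[dict], cutoff: str) -> list[dict]:
    """Keep records whose 'last_modified' ISO timestamp is at or after cutoff."""
    cutoff_dt = datetime.fromisoformat(cutoff)
    return [
        r for r in records
        if datetime.fromisoformat(r["metadata"]["last_modified"]) >= cutoff_dt
    ]


records = [
    {"text": "old chunk", "metadata": {"last_modified": "2023-01-05T09:00:00"}},
    {"text": "new chunk", "metadata": {"last_modified": "2024-05-11T12:34:06"}},
]
print([r["text"] for r in latest_only(records, "2024-01-01T00:00:00")])  # ['new chunk']
```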

In [27]:
from langchain_core.documents import Document

documents = []
for each_chunk in chunks:
    metadata = each_chunk.metadata.to_dict()
    del metadata["languages"]
    del metadata["orig_elements"]
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=each_chunk.text, metadata=metadata))

# Index the data into an in-memory Vector Database
- The next step is to index the parsed data into vector database for later use in information retrieval step of RAG pipeline
- For indexing the data we use **all-MiniLM-L6-v2** from sentence transformers. This embedding model maps sentences & paragraphs to a *384 dimensional dense vector space*
- We use Chroma as our vector **database**
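
Embedding vectors are compared by how close they lie in that 384-dimensional space. A minimal sketch of one common measure, cosine similarity, is shown below on toy 2-dimensional vectors. The function is written from scratch for illustration and is not what the embedding model or the vector store calls internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```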

In [28]:
from langchain_community.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [29]:
from langchain_community.vectorstores import Chroma
vectordb_fname = './data/chroma_db'
vectorstore = Chroma.from_documents(documents, embeddings, persist_directory=vectordb_fname)

# Test if the information retrieval works
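- Use a sample query to retrieve similar documents from the vector database
- *test_retrieval* is a helper function that retrieves the closest chunks from the vectorstore. The collection is created with Chroma's default settings, so the score is an L2 distance (lower means more similar), not a cosine similarity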
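
What a similarity search does can be sketched as a brute-force nearest-neighbour lookup. The example below uses toy 2-dimensional vectors and squared Euclidean distance; the metric used by a real vector store depends on how its collection is configured, and `top_k` is a hypothetical helper, not a Chroma or LangChain function.

```python
def top_k(query_vec, indexed, k=5):
    """Return the k (distance, text) pairs closest to query_vec.

    indexed is a list of (vector, text) pairs; distance is squared Euclidean,
    so smaller means more similar.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = [(sq_dist(query_vec, vec), text) for vec, text in indexed]
    return sorted(scored, key=lambda pair: pair[0])[:k]


indexed = [
    ([0.9, 0.1], "covalent bonds"),
    ([0.1, 0.9], "pH scale"),
    ([0.8, 0.3], "ionic bonds"),
]
for distance, text in top_k([1.0, 0.0], indexed, k=2):
    print(f"{distance:.2f}  {text}")
```

A production vector store replaces the linear scan with an approximate index, but the contract is the same: a query vector goes in, the k closest stored chunks come out.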

In [30]:
def test_retrieval(query):
    # retrieve context - top 5 most relevant (closest) chunks to the query vector
    # (the collection uses Chroma's default L2 distance; lower score = closer)
    docs_chroma = vectorstore.similarity_search_with_score(query, k=5)

    if docs_chroma:
        # join the retrieved chunks into a single context string
        context_text = "\n\n".join([doc.page_content for doc, _score in docs_chroma])

        print(context_text)
    else:
        print("No results found!")

In [31]:
test_retrieval(query="define Covalent Bonds")

Covalent Bonds
Another type of strong chemical bond between two or more atoms is a covalent bond . These bonds form when an
electron is shared between two elements and are the strongest and most common form of chemical bond in living
organisms. Covalent bonds form between the elements that make up the biological molecules in our cells. Unlike

Access for free at openstax.org

2.1 - The Building Blocks of Molecules 33

ionic bonds, covalent bonds do not dissociate in water.

There are two types of covalent bonds: polar and nonpolar. Nonpolar covalent bonds form between two atoms of
the same element or between different elements that share the electrons equally. For example, an oxygen atom can
bond with another oxygen atom to fill their outer shells. This association is nonpolar because the electrons will be
equally distributed between each oxygen atom. Two covalent bonds form between the two oxygen atoms because

atoms, each atom providing one. These elements all share the electrons equ

# RAG Pipeline

In [32]:
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough

In [33]:
retriever = vectorstore.as_retriever()
# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Local LLM
ollama_llm = "mistral:latest"
model_local = ChatOllama(model=ollama_llm)

# Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model_local
    | StrOutputParser()
)
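
In the chain above, whatever the retriever returns is substituted into `{context}` as-is, so the prompt likely carries the retrieved objects' metadata along with their text. The sketch below shows the same prompt assembly in plain Python with the context reduced to chunk text only. `build_prompt` is a hypothetical helper and the chunks are plain strings standing in for retrieved documents.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Join retrieved chunk texts and fill the prompt template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return PROMPT_TEMPLATE.format(context=context, question=question)


print(build_prompt(
    ["Covalent bonds form when electrons are shared.", "  "],
    "define Covalent Bonds",
))
```

Empty or whitespace-only chunks are dropped so they do not add blank paragraphs to the prompt.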

In [34]:
chain.invoke("define Covalent Bonds")

' Covalent bonds are strong chemical bonds between two or more atoms, formed when electrons are shared between the involved elements. They are the most common type of chemical bond in living organisms and play a significant role in forming the biological molecules within our cells. There are two types of covalent bonds: polar and nonpolar. In nonpolar covalent bonds, the atoms share electrons equally, while in polar covalent bonds, the shared electrons spend more time near one nucleus than the other, resulting in a slightly positive or negative charge on each atom.'

In [35]:
!ollama list

NAME          	ID          	SIZE  	MODIFIED      
mistral:latest	61e88e884507	4.1 GB	7 minutes ago	


In [36]:
chain.invoke("How does pH determine whether the solution is acidic or basic")

" A solution's pH value determines its acidity or basicity by measuring the concentration of hydrogen ions (H+) in the solution. An acidic solution has a high number of hydrogen ions and a low pH value, while a basic solution has a high number of hydroxide ions (OH-) and a high pH value. The pH scale ranges from 0 to 14, with a neutral pH being 7. Acids provide hydrogen ions and lower pH, whereas bases provide hydroxide ions and raise pH."

In [39]:
%%capture
! pip install gradio



In [44]:
def gradio_func(message, history):
    """Answer a chat message with the RAG chain; chat history is not used."""
    return chain.invoke(message)
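
The handler passed to a chat UI receives the new message and the chat history. A slightly more defensive variant is sketched below: it rejects empty input and turns backend errors into a chat reply instead of an unhandled exception. `make_chat_handler` is a hypothetical helper, and a lambda stands in for `chain.invoke` so the example runs without a model.

```python
from typing import Callable


def make_chat_handler(answer: Callable[[str], str]) -> Callable[[str, list], str]:
    """Wrap a question-answering callable in a (message, history) handler."""
    def handler(message: str, history: list) -> str:
        question = message.strip()
        if not question:
            return "Please enter a question."
        try:
            # history is ignored: every question is answered independently
            return answer(question)
        except Exception as exc:  # surface backend failures in the chat window
            return f"Sorry, the request failed: {exc}"
    return handler


handler = make_chat_handler(lambda q: f"echo: {q}")
print(handler("  define Covalent Bonds ", []))  # echo: define Covalent Bonds
```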

In [45]:
import gradio as gr
gradio_interface = gr.ChatInterface(
        gradio_func,
        chatbot=gr.Chatbot(),
        textbox=gr.Textbox(placeholder="Example: What is a covalent bond?", container=False, scale=7),
        title="The Ollama test chatbot",
        description="Ask the Mistral chatbot a question!",
        theme='gradio/base', # themes at https://huggingface.co/spaces/gradio/theme-gallery
        retry_btn=None,
        undo_btn="Delete Previous",
        clear_btn="Clear",
)

In [46]:
gradio_interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://390960f7fc02c0fb9f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


