<a href="https://colab.research.google.com/github/SarathM1/RAG/blob/main/RAG_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

Design a custom retrieval-augmented generation (RAG) pipeline that answers questions from this textbook:
https://openstax.org/details/books/concepts-biology

## Important Pointers:
1. Download the PDF from the link above.
2. To make indexing faster, you can pick any 2 chapters from the PDF and treat them as the source.
3. Use any in-memory vector database if required.
4. Use any open-source HuggingFace model as the LLM.

## Output artifacts
1. Entire codebase in GitHub, with links to access the artifacts we need for evaluation:
   a. Please add docstrings wherever necessary.
2. Additional Colab notebook to run the backend logic and evaluations:
   a. Please add text blocks in your Colab to describe scenarios, assumptions, etc., to make it readable.
3. Any additional artifacts, such as the system design architecture, assumptions, and a list of issues you couldn't solve because of time constraints, with how you could fix them in the future.

## Additional (bonus):
1. Streamlit/Gradio frontend to interact with your pipeline
2. Wrap the entire application inside a Docker container
3. Draft and implement all the necessary APIs using FastAPI or any other Python web framework of choice
4. Produce an alternative way to do RAG without using any library like LangChain, LlamaIndex, or Haystack
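
Bonus item 4 asks for RAG without an orchestration library. As a rough illustration of what that could look like, the sketch below uses only the Python standard library: it scores chunks against a question with bag-of-words cosine similarity and assembles a prompt from the best matches. The function names (`tokenize`, `retrieve`, `build_prompt`) and the scoring method are assumptions made for this example, not the approach used in the rest of this notebook; a real pipeline would likely swap the word counts for dense embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(question, context_chunks):
    """Assemble a grounded prompt from the retrieved chunks and the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting prompt string can then be sent to whichever LLM the pipeline uses.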

# LLM model
Gemma 7B

# Install the Dependencies

In [1]:
! pip install -U langchain_community tiktoken chromadb langchain langchainhub sentence_transformers



# Install Ollama

Ollama is a framework for running open-source LLMs locally.

In [2]:
!curl https://ollama.ai/install.sh | sh

>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


# Start Ollama server in the background

In [3]:
import subprocess
import time

# Start Ollama as a background process; the trailing "&" detaches it from the shell
command = "nohup ollama serve &"

# Launch through the shell. Output is discarded so the long-running server
# never blocks on a full pipe that nothing reads.
process = subprocess.Popen(command,
                           shell=True,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
print("Process ID:", process.pid)  # PID of the launching shell, not of the server
# Alternative: use a remote Ollama instance hosted on fly.io
#!OLLAMA_HOST=https://ollama-demo.fly.dev:443
time.sleep(5)  # Give the server a few seconds to start

Process ID: 2161
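
A fixed five-second sleep can be too short on a slow runtime and wasteful on a fast one. As a possible alternative, the sketch below polls the server address reported by the installer (`127.0.0.1:11434`) until it answers. The helper name `wait_for_ollama` and its `timeout` and `interval` parameters are assumptions introduced for this example.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll base_url until it answers with HTTP 200 or `timeout` seconds pass.

    Returns True if the server responded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet
        time.sleep(interval)
    return False
```

Calling `wait_for_ollama()` right after starting the server, and stopping the notebook if it returns `False`, would make a failed start visible before the model pull begins.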


In [4]:
# Test if Ollama serve is up
!ollama -v

ollama version is 0.1.34


In [5]:
# Pull the model
!ollama pull gemma:7b

pulling manifest
pulling ef311de6af9d...   0% ▕▏    0 B/5.0 GB
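
Once the pull finishes, the model can be queried over Ollama's local HTTP API. The sketch below is one minimal way to do that with the standard library, assuming the server is listening on the default address shown by the installer and that `gemma:7b` has been pulled. The helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(prompt, model="gemma:7b"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma:7b", timeout=300):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("What is a cell?")` should return the model's answer as a string. Setting `"stream": False` asks for a single JSON object rather than a stream of partial responses.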

# Parsing the document using the Unstructured library

In [35]:
!pip install unstructured pdf2image pdfminer.six pillow_heif PyPDF2 pytesseract pikepdf

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10


In [33]:
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# Download the file

In [15]:
import requests

url = 'https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf?_gl=1*11bd84p*_ga*OTg3MTMyOTg1LjE3MTUyNTI1MjY.*_ga_T746F8B0QC*MTcxNTM0NTM1Ni4yLjAuMTcxNTM0NTM1Ny41OS4wLjA.'
# Stream the response so the whole PDF is never held in memory at once
r = requests.get(url, stream=True)
r.raise_for_status()  # Fail early instead of saving an HTML error page as a PDF
chunk_size = 2000  # Bytes written per iteration
with open('./Concepts_of_Biology.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
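
Because the download URL carries tracking parameters and may change, it can be worth confirming that the saved file really is a PDF before parsing it. The sketch below checks for the `%PDF-` signature at the start of the file. The helper name `looks_like_pdf` is an assumption for this example, and the check does not validate the rest of the file.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as fd:
        return fd.read(5) == b"%PDF-"
```

For instance, `looks_like_pdf('./Concepts_of_Biology.pdf')` returning `False` would suggest that the server sent an error page or a redirect rather than the book.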

## Extract the first two chapters from the PDF for easier processing

In [31]:
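from PyPDF2 import PdfWriter, PdfReader

def split_pdf(filename, page_number):
    """Write the first `page_number` pages of `filename` to chapter_1_and_2.pdf."""
    pdf_writer = PdfWriter()

    # Keep the source file open until the output is written, then close it
    with open(filename, "rb") as source:
        pdf_reader = PdfReader(source)
        for page in range(page_number):
            pdf_writer.add_page(pdf_reader.pages[page])

        with open("chapter_1_and_2.pdf", 'wb') as output:
            pdf_writer.write(output)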

In [32]:
split_pdf(filename='Concepts_of_Biology.pdf', page_number=68)

In [36]:
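pdf_elements = partition_pdf("chapter_1_and_2.pdf", strategy="fast")
# Drop page furniture (headers, footers) and text the parser could not classify
excluded_categories = {"Header", "Footer", "UncategorizedText"}
pdf_elements = [el for el in pdf_elements if el.category not in excluded_categories]
# Group the remaining elements into chunks that start at each title
elements = chunk_by_title(pdf_elements)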

ERROR:unstructured:Following dependencies are missing: pikepdf. Please install them using `pip install pikepdf`.
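
To make the chunking step easier to picture, the sketch below shows the general idea behind title-based chunking on plain `(category, text)` pairs: a new chunk starts at every `"Title"` element, and a chunk is also closed when it would exceed a size limit. This is a simplified illustration under those assumptions, not the implementation of `chunk_by_title`, which has additional options and behaviour. The function name `chunk_by_heading` and the input format are made up for this example.

```python
def chunk_by_heading(elements, max_characters=500):
    """Group (category, text) pairs into chunks that start at each "Title".

    A chunk is also closed early when adding the next text would push it
    past `max_characters`.
    """
    chunks, current = [], []
    for category, text in elements:
        starts_section = category == "Title"
        too_long = len(" ".join(current + [text])) > max_characters
        if current and (starts_section or too_long):
            chunks.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping each heading together with the text that follows it tends to give the retriever chunks that cover a single topic, which is the motivation for chunking by title rather than by fixed character windows.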
