RAG Pipeline

1. Data ingestion

In [50]:
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

In [60]:
import bs4

# Load the web page, keeping only the post title, header and content
loader = WebBaseLoader(
    web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
text_document = loader.load()
# text_document

In [52]:
# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader("metagpt.pdf")
text_documents = loader.load()

2. Transform data

In [53]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(text_documents)
# documents[-1]
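
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window splitting in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); `split_with_overlap` is a hypothetical helper name used only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers so the overlap is easy to see by eye
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece starts with the last 4 characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.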

3. Convert chunks into vectors with an embedding model and store them in a vector store

In [56]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Use Chroma as the vector database; only the first 20 chunks are indexed here
db = Chroma.from_documents(documents[:20], embedding_model)
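
Conceptually, building a store from documents means embedding each chunk and keeping the vector next to its text. The sketch below shows that idea with the standard library only. `toy_embed` and `ToyVectorStore` are hypothetical names, and the hashed bag-of-words vector is a stand-in assumption, not what `all-MiniLM-L6-v2` or Chroma actually compute.

```python
import math
import zlib


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hashed bag-of-words vector scaled to unit length (illustrative only)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is stable across runs, unlike the built-in hash() for strings
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class ToyVectorStore:
    """Keeps each chunk's text next to its vector, in insertion order."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, text) pairs

    @classmethod
    def from_texts(cls, texts, embed):
        store = cls(embed)
        for text in texts:
            store.rows.append((embed(text), text))
        return store


chunks = ["agents plan tasks", "SOPs define team roles", "agents use tools"]
# Index only a slice, in the same spirit as documents[:20] above
store = ToyVectorStore.from_texts(chunks[:2], toy_embed)
print(len(store.rows), len(store.rows[0][0]))
```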

4. Retrieve data from the vector database using similarity search

In [57]:
query = "What is the main objective of the research paper"
result = db.similarity_search(query)
result[1].page_content

'et al., 2023; Zhou et al., 2023a; Qian et al., 2023; Tang et al., 2023b; Hong et al., 2024).\nThrough extensive collaborative practice, humans have developed widely accepted Standardized\nOperating Procedures (SOPs) across various domains (Belbin, 2012; Manifesto, 2001; DeMarco &\nLister, 2013). These SOPs play a critical role in supporting task decomposition and effective coor-\ndination. Furthermore, SOPs outline the responsibilities of each team member, while establishing\nstandards for intermediate outputs. Well-defined SOPs improve the consistent and accurate exe-\ncution of tasks that align with defined roles and quality standards (Belbin, 2012; Manifesto, 2001;\nDeMarco & Lister, 2013; Wooldridge & Jennings, 1998). For instance, in a software company,\nProduct Managers analyze competition and user needs to create Product Requirements Documents\n(PRDs) using a standardized structure, to guide the developmental process.'
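
`similarity_search` returns the best-matching chunks in ranked order (4 by default), so `result[1]` above is the second-ranked hit, not the top one. The ranking step can be sketched as follows, assuming cosine similarity and hand-written 3-dimensional vectors. The metric a real store uses depends on its configuration, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, indexed, k=4):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for vec, text in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings
indexed = [
    ([1.0, 0.0, 0.0], "chunk about SOPs"),
    ([0.0, 1.0, 0.0], "chunk about agents"),
    ([0.7, 0.7, 0.0], "chunk about SOPs and agents"),
]
hits = top_k([1.0, 0.1, 0.0], indexed, k=2)
for score, text in hits:
    print(round(score, 3), text)
```

This brute-force scan compares the query with every stored vector. Stores such as Chroma and FAISS can use index structures to avoid a full scan on large collections.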

In [58]:
# Use FAISS as the vector database, indexing all chunks
from langchain_community.vectorstores import FAISS
db_faiss = FAISS.from_documents(documents, embedding_model)

In [59]:
query = "What is the main objective of the research paper"
# Query the FAISS index built above rather than the Chroma store
result = db_faiss.similarity_search(query)
result[1].page_content