RAG Pipeline
The first step is data ingestion. Typical sources include:
    - Text documents
    - Web pages (scraped with BeautifulSoup)
    - PDF documents
    - Other sources such as Excel files, directories of files, and README files
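
As a rough illustration of the source types listed above, the sketch below maps a source string to the name of a loader class in langchain_community.document_loaders. The helper pick_loader_name, the suffix table, and the "trailing slash means directory" rule are hypothetical conventions made up for this example; only the loader class names are real.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> loader class name
# (the class names exist in langchain_community.document_loaders).
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".md": "UnstructuredMarkdownLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the name of a loader class that suits the given source."""
    # URLs are handled by the web loader.
    if source.startswith(("http://", "https://")):
        return "WebBaseLoader"
    # Assumption for this sketch: a trailing slash marks a directory.
    if source.endswith("/"):
        return "DirectoryLoader"
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"no loader registered for {suffix!r}")
    return LOADER_BY_SUFFIX[suffix]


print(pick_loader_name("rag/policy.txt"))
print(pick_loader_name("rag/DL_Paper.pdf"))
```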

STEP 1 : Data Ingestion

In [None]:
## TYPE 1 : Data ingestion | Text Document

from langchain_community.document_loaders import TextLoader
# Read the file as UTF-8 so characters such as en dashes are decoded correctly
loader = TextLoader(r"C:/Users/LENOVO/Documents/Saranya/GitHUb/LANGCHAIN/rag/policy.txt", encoding="utf-8")

In [55]:
text_documents  = loader.load()

In [56]:
text_documents

[Document(page_content="Identity-based policies – Attach managed and inline policies to IAM identities (users, groups to which users belong, or roles). Identity-based policies grant permissions to an identity.\n\nResource-based policies – Attach inline policies to resources. The most common examples of resource-based policies are Amazon S3 bucket policies and IAM role trust policies. Resource-based policies grant permissions to the principal that is specified in the policy. Principals can be in the same account as the resource or in other accounts.\n\nPermissions boundaries – Use a managed policy as the permissions boundary for an IAM entity (user or role). That policy defines the maximum permissions that the identity-based policies can grant to an entity, but does not grant permissions. Permissions boundaries do not define the maximum permissions that a resource-based policy can grant to an entity.\n\nOrganizations SCPs – Use an AWS Organizations service control policy (SCP) t

In [57]:
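import os
from dotenv import load_dotenv

# load_dotenv() copies the variables defined in .env into os.environ
load_dotenv()

# Fail early with a clear message if the key is still missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")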

In [58]:
## TYPE 2 : Data ingestion | Web source

from langchain_community.document_loaders import WebBaseLoader
import bs4

# Parse only the elements with the given CSS class instead of the whole page
loader = WebBaseLoader(
    web_paths=("https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="awsui-util-container")),
)
web_documents = loader.load()

In [59]:
web_documents



In [60]:
## TYPE 3 : Data ingestion | PDF document

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"C:/Users/LENOVO/Documents/Saranya/GitHUb/LANGCHAIN/rag/DL_Paper.pdf")

In [61]:
pdf_documents = loader.load()

In [62]:
pdf_documents

[Document(page_content='A Distributional Perspective on Reinforcement Learning\nMarc G. Bellemare* 1Will Dabney* 1R´emi Munos1\nAbstract\nIn this paper we argue for the fundamental impor-\ntance of the value distribution : the distribution\nof the random return received by a reinforcement\nlearning agent. This is in contrast to the com-\nmon approach to reinforcement learning which\nmodels the expectation of this return, or value .\nAlthough there is an established body of liter-\nature studying the value distribution, thus far it\nhas always been used for a speciﬁc purpose such\nas implementing risk-aware behaviour. We begin\nwith theoretical results in both the policy eval-\nuation and control settings, exposing a signiﬁ-\ncant distributional instability in the latter. We\nthen use the distributional perspective to design\na new algorithm which applies Bellman’s equa-\ntion to the learning of approximate value distri-\nbutions. We evaluate our algorithm using the\nsuite of games from

STEP 2 : Data Transformation (split the documents into chunks that fit the model's context window)

In [64]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks hold at most 1000 characters; neighbouring chunks share up to 200 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(pdf_documents)


In [67]:
documents[:5]

[Document(page_content='A Distributional Perspective on Reinforcement Learning\nMarc G. Bellemare* 1Will Dabney* 1R´emi Munos1\nAbstract\nIn this paper we argue for the fundamental impor-\ntance of the value distribution : the distribution\nof the random return received by a reinforcement\nlearning agent. This is in contrast to the com-\nmon approach to reinforcement learning which\nmodels the expectation of this return, or value .\nAlthough there is an established body of liter-\nature studying the value distribution, thus far it\nhas always been used for a speciﬁc purpose such\nas implementing risk-aware behaviour. We begin\nwith theoretical results in both the policy eval-\nuation and control settings, exposing a signiﬁ-\ncant distributional instability in the latter. We\nthen use the distributional perspective to design\na new algorithm which applies Bellman’s equa-\ntion to the learning of approximate value distri-\nbutions. We evaluate our algorithm using the\nsuite of games from
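
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not the algorithm RecursiveCharacterTextSplitter uses (that splitter also tries to break on separators such as paragraphs and newlines); the function split_with_overlap and the sample string are made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.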

STEP 3 : Convert text chunks to vectors
Text -> Chunks -> Vectors (embeddings) -> Vector store (Chroma DB or FAISS via faiss-cpu)

In [72]:
# ChromaDB vector database

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Embed and index only the first five chunks
db = Chroma.from_documents(documents[:5], OpenAIEmbeddings())


In [74]:
## Query the vector store with a plain-text question

query = "what is the paper about?"
result = db.similarity_search(query)
result[0].page_content

'pected outcome of the random transition (x,a)→(X′,A′):\nQ(x,a) =ER(x,a) +γEQ(X′,A′).\nIn this paper, we aim to go beyond the notion of value and\nargue in favour of a distributional perspective on reinforce-\n*Equal contribution1DeepMind, London, UK. Correspon-\ndence to: Marc G. Bellemare <bellemare@google.com >.\nProceedings of the 34thInternational Conference on Machine\nLearning , Sydney, Australia, PMLR 70, 2017. Copyright 2017\nby the author(s).ment learning. Speciﬁcally, the main object of our study is\nthe random return Zwhose expectation is the value Q. This\nrandom return is also described by a recursive equation, but\none of a distributional nature:\nZ(x,a)D=R(x,a) +γZ(X′,A′).\nThedistributional Bellman equation states that the distribu-\ntion ofZis characterized by the interaction of three random\nvariables: the reward R, the next state-action (X′,A′), and\nits random return Z(X′,A′). By analogy with the well-\nknown case, we call this quantity the value distribution .'
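
The idea behind similarity_search can be sketched without any external service: embed the query, score every stored chunk against it, and return the best matches. The example below is a toy version under stated assumptions. The word-count embedding stands in for a real embedding model such as OpenAIEmbeddings, the scoring uses cosine similarity, and the names embed, cosine, and toy_chunks are hypothetical. The default of k=4 mirrors the default number of results in LangChain's similarity_search.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "the value distribution of the random return",
    "bucket policies grant permissions to a principal",
    "chunk overlap keeps context across boundaries",
]
top = similarity_search("what is the value distribution", toy_chunks, k=1)
print(top)
```

A real embedding model captures meaning rather than shared words, which is why the actual query above can match a chunk that never contains the phrase "what is the paper about".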

In [None]:
# Faiss vector database

from langchain_community.vectorstores import FAISS
# Same interface as Chroma: embed and index the first five chunks
db = FAISS.from_documents(documents[:5], OpenAIEmbeddings())
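
As far as I know, the FAISS store created this way ranks results by exact Euclidean (L2) distance by default, where a smaller distance means a closer match. The sketch below shows that ranking as a brute-force search over hand-written 2-D vectors. The vectors and the helper names l2_distance and nearest are made up for illustration; a real index holds embedding vectors with hundreds or thousands of dimensions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec: list[float], vectors: list[list[float]], k: int = 1) -> list[int]:
    """Return the indices of the k stored vectors closest to query_vec."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return order[:k]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest_ids = nearest([0.9, 1.2], toy_vectors, k=2)
print(nearest_ids)
```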