# **Simple RAG**

In this notebook, we consolidate the concepts from our previous sessions into a single project: a simple RAG (Retrieval-Augmented Generation) system built with `LangChain`.

We will use the following components:
1. **Loaders**
2. **Splitters**
3. **Embeddings**
4. **VectorStore**
5. **Messages**
6. **PromptTemplate**
7. **ChatModels**
8. **OutputParser**
9. **Chain**

----

### **Loading the Documents**

In [1]:
# Step 1: load every file in a directory. All of our files are PDFs, so we pair DirectoryLoader with PyPDFLoader.
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
loader = DirectoryLoader('./Statictics PDF', glob='**/*.pdf', show_progress=True, use_multithreading=True, loader_cls=PyPDFLoader)
documents = loader.load()

100%|██████████| 9/9 [00:10<00:00,  1.19s/it]


In [5]:
documents[0].page_content

'Bernoulli distribution is a probability distribution that models a binary outcome, where the \noutcome can be either success (represented by the value 1) or failure (represented by the \nvalue 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, \nwho first introduced it in the late 1600s.\nThe Bernoulli distribution is characterized by a single parameter, which is the probability of \nsuccess, denoted by p. The probability mass function (PMF) of the Bernoulli distribution is:\nThe Bernoulli distribution is commonly used in machine learning for modelling \nbinary outcomes, such as whether a customer will make a purchase or not, \nwhether an email is spam or not, or whether a patient will have a certain disease \nor not.\nBernoulli Distribution\n27 March 2023 16:06\n   Session on Central Limit Theorem Page 1    '

In [7]:
len(documents[0].page_content)

844

In [6]:
len(documents)

119
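
The counts above show that the loader returns one `Document` per PDF page rather than one per file: 9 files produced 119 documents. Below is a minimal sketch of how we could check the page count per file, assuming each document carries a `source` key in its metadata (as the similarity-search output in this notebook does). The `Page` class and the sample data are hypothetical stand-ins so the snippet runs without the PDFs.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Made-up sample data; with the real loader we would pass `documents` instead.
sample = [
    Page("...", {"source": "a.pdf", "page": 0}),
    Page("...", {"source": "a.pdf", "page": 1}),
    Page("...", {"source": "b.pdf", "page": 0}),
]

counts = pages_per_source(sample)
for source, n_pages in sorted(counts.items()):
    print(f"{source}: {n_pages} page(s)")
```

Calling `pages_per_source(documents)` on the real list should give one entry per PDF, with the values summing to `len(documents)`.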

### **Splitting the Documents**

In [8]:
## Step 2: split the documents into overlapping chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=70)

chunks = splitter.split_documents(documents)

In [12]:
len(chunks[0].page_content)

183

In [13]:
len(chunks)

477
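
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch: a fixed sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally. The real splitter prefers to cut at separators such as paragraph and line breaks, which is why our first chunk is 183 characters rather than a full 250. The function name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(text, chunk_size=250, chunk_overlap=70):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 600 characters of dummy text so the window positions are easy to follow.
text = "".join(str(i % 10) for i in range(600))
pieces = sliding_window_chunks(text)
print([len(p) for p in pieces])  # [250, 250, 240]
```

Each chunk repeats the last 70 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.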

### **Initializing the Embedding Model**

In [None]:
## We are using the Google Gemini embedding model
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import tqdm as notebook_tqdm
embadding = GoogleGenerativeAIEmbeddings(google_api_key="Your-API-KEY", model="models/embedding-001")
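
Rather than pasting a real key into the notebook, we can read it from an environment variable. A minimal sketch follows; the helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and exist only for this illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return value


# Demo only: set a dummy value so the snippet runs on its own.
os.environ["DEMO_API_KEY"] = "dummy-value"
api_key = require_env("DEMO_API_KEY")
print(len(api_key))
```

The result can then be passed as `google_api_key=require_env(...)`, using whatever variable name you choose to export in your shell.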

### **Using FAISS VectorStore**

In [16]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(chunks, embadding)
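
Conceptually, the vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. Below is a brute-force sketch of that idea in plain Python, assuming Euclidean (L2) distance as the metric. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and FAISS uses optimized index structures rather than this loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "index": each chunk of text paired with a made-up embedding.
indexed = [
    ("mean and median", [0.9, 0.1, 0.0]),
    ("confidence interval", [0.1, 0.9, 0.0]),
    ("bernoulli distribution", [0.0, 0.2, 0.9]),
]

query = [1.0, 0.0, 0.0]  # pretend this is the embedded question
results = top_k(query, indexed, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

A smaller distance means a closer match, so the first result is the chunk most similar to the query.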

In [17]:
## Let's check the similarity search
qs = 'What is statistics'
retriver = vector_store.similarity_search(qs)

In [18]:
retriver 

[Document(id='6c875e84-01de-4c99-bde8-41abb38daf99', metadata={'source': 'Statictics PDF\\Descriptive_Statistics.pdf', 'page': 0}, page_content='Statistics is a branch of mathematics that involves collecting, \nanalysing, interpreting, and presenting data. It provides tools and \nmethods to understand and make sense of large amounts of data'),
 Document(id='9e812dda-c475-496a-9cf8-fb8cf7f30a90', metadata={'source': 'Statictics PDF\\Descriptive_Statistics.pdf', 'page': 0}, page_content='Environmental Science - Climate research4.\nWhat is Statistics\n09 March 2023 14:56\n   Session 1 on Descriptive Statistics Page 1'),
 Document(id='5fb46fb0-914f-4324-876e-eb70db6b11e3', metadata={'source': 'Statictics PDF\\Confidence__Interval.pdf', 'page': 0}, page_content='estimated based on available sample data.\nStatistic: A statistic is a numerical value that describes a characteristic of a sample, which is a \nsubset of the population. By using statistics calculated from a representative sample,'

In [22]:
retriver = vector_store.as_retriever()

In [24]:
retriver

VectorStoreRetriever(tags=['FAISS', 'GoogleGenerativeAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x0000025E79E0B0B0>, search_kwargs={})

### **System Message**

In [36]:
system_message = (
    """You are a very helpful assistant for solving the queries of users about statistics.

    Your role is to act as a highly professional statistician with expertise in statistical concepts, methods, and analysis. 
    Your primary job is to help students solve their queries and provide accurate, clear, and well-explained answers based on the retrieved information from the knowledge base. 

    Guidelines for responding:
    - Be precise and professional in your explanations.
    - Use simple, clear language when explaining complex concepts, but maintain accuracy.
    - Don't say that according to the retrival or context. Don't use these words. Just start the answer.
    - Provide examples, formulas, or step-by-step solutions where necessary.
    - If the retrieved information is insufficient, ask clarifying questions to better understand the user's needs.
    - Always ensure that your answers are correct and relevant to the query.

    {retrieved_content}
    """
)

### **ChatPromptTemplate**

In [37]:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate([
    ('system', system_message),
    ('human', '{input_message}')
])

In [38]:
prompt

ChatPromptTemplate(input_variables=['input_message', 'retrieved_content'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['retrieved_content'], input_types={}, partial_variables={}, template="You are a very helpful assistant for solving the queries of users about statistics.\n\n    Your role is to act as a highly professional statistician with expertise in statistical concepts, methods, and analysis. \n    Your primary job is to help students solve their queries and provide accurate, clear, and well-explained answers based on the retrieved information from the knowledge base. \n\n    Guidelines for responding:\n    - Be precise and professional in your explanations.\n    - Use simple, clear language when explaining complex concepts, but maintain accuracy.\n    - Don't say that according to the retrival or context. Don't use these words. Just start the answer.\n    - Provide examples, formulas, or step-by-step solutions
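
The template has two input variables, `retrieved_content` and `input_message`, and both must be supplied before the messages go to the model. The sketch below mimics that filling step with plain `str.format`, so it runs without LangChain. The shortened system text and the helper `format_messages` are stand-ins for illustration, not the library's implementation.

```python
# Shortened stand-ins for the two message templates used in this notebook.
SYSTEM_TEMPLATE = (
    "You are a helpful assistant for statistics questions.\n\n{retrieved_content}"
)
HUMAN_TEMPLATE = "{input_message}"


def format_messages(retrieved_content, input_message):
    """Fill both placeholders and return (role, text) pairs."""
    return [
        ("system", SYSTEM_TEMPLATE.format(retrieved_content=retrieved_content)),
        ("human", HUMAN_TEMPLATE.format(input_message=input_message)),
    ]


messages = format_messages(
    retrieved_content="Statistics is a branch of mathematics ...",
    input_message="What is statistics?",
)
for role, text in messages:
    print(f"[{role}] {text}")
```

With the real template, `prompt.format_messages(retrieved_content=..., input_message=...)` plays the same role and returns message objects instead of tuples.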

### **ChatModel**

In [25]:
## For the chat model, we are using a Groq-hosted model
from langchain_groq import ChatGroq

llm = ChatGroq(api_key="Your-API-KEY", model='llama-3.1-8b-instant')

### **OutputParser**

In [28]:
from langchain_core.output_parsers import StrOutputParser

### **Chain**

In [39]:
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"retrieved_content": retriver | format_docs, "input_message": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
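
To see what the `|` pipeline is doing, here is a toy re-creation in plain Python. The first step fans the user's question out to two branches (retrieval plus formatting, and a passthrough), and every later step consumes the previous step's output. The `Step` class, `fan_out`, and the fake retriever and model are all hypothetical stand-ins. This is a sketch of the data flow, not LangChain's actual implementation.

```python
class Step:
    """Wrap a function so steps can be composed left to right with `|`."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run self first, then feed its result into the next step.
        return Step(lambda value: other(self(value)))


def fan_out(branches):
    """Send the same input to every branch and collect the results in a dict."""
    return Step(lambda value: {key: branch(value) for key, branch in branches.items()})


# Stand-ins for the real components.
retrieve = Step(lambda question: [f"note about {question.lower()}"])
format_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
build_prompt = Step(
    lambda d: f"Context: {d['retrieved_content']}\nQuestion: {d['input_message']}"
)
fake_llm = Step(lambda prompt_text: {"content": prompt_text.upper()})
parse = Step(lambda message: message["content"])

chain = (
    fan_out({"retrieved_content": retrieve | format_docs, "input_message": passthrough})
    | build_prompt
    | fake_llm
    | parse
)

answer = chain("Types of Statistics?")
print(answer)
```

The question therefore reaches the prompt twice: once transformed into retrieved context and once unchanged as the human message.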


In [40]:
for chunk in rag_chain.stream("Types of Statistics?"):
    print(chunk, end="", flush=True)

There are two primary types of statistics:

1. **Descriptive Statistics**: Descriptive statistics deals with the collection, organization, analysis, interpretation, and presentation of data. It focuses on summarizing and describing the main features of a set of data, without making inferences or predictions about the population. Examples of descriptive statistics include measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), and data visualization techniques (histograms, box plots).

2. **Inferential Statistics**: Inferential statistics, on the other hand, involves making inferences or predictions about a population based on a sample of data. It uses probability theory to estimate population parameters and make conclusions about the population. Examples of inferential statistics include hypothesis testing, confidence intervals, and regression analysis.

In addition to these two main types, there are also two subcategories withi