<a href="https://colab.research.google.com/github/SunHaoranSkillnet/L1ChatBot/blob/main/rag_L1_ChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a RAG application from scratch


In [1]:
!pip install langchain_community
!pip install sentence_transformers
!pip install PyPDF
!pip install PyPDF2
!pip install langchain_pinecone


In [2]:
# HuggingFaceHub wraps models hosted on the HuggingFace Inference API.
from langchain_community.llms import HuggingFaceHub

Let's start by loading the environment variables we need to use.

## Setting up the model
Let's define the LLM that we'll use as part of the workflow. You can change the model repo, and you need a HuggingFace API token, which you can get from the [HuggingFace](https://huggingface.co/) website.

In [3]:
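import os
from getpass import getpass

# Enter your own HuggingFace token; avoid hard-coding secrets in the notebook.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("HuggingFace API token: ")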

In [None]:
# model = HuggingFaceHub(repo_id = "google/gemma-2-9b-it")
model = HuggingFaceHub(repo_id = "Qwen/Qwen2.5-Coder-32B-Instruct")


We can test the model by asking a simple question. At this point we are invoking the plain chatbot, without any retrieval; adding RAG is the next step. Notice that the hosted model is sometimes busy, in which case the request fails.

In [34]:
model.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'What MLB team won the World Series during the COVID-19 pandemic?\n\nThe **Los Angeles Dodgers** won the World Series in 2020 during the COVID-19 pandemic. \n'

`HuggingFaceHub` is a text-completion model, so the result above is already a plain string; a chat model would instead return an `AIMessage` instance containing the answer. Chaining the model with an [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/) gives us a string in both cases. For this example, we'll use a simple `StrOutputParser`.

Here is what chaining the model with an output parser looks like:

<img src='https://github.com/haoransun/youtube-rag/blob/main/images/chain1.png?raw=1' width="1200">



In [35]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'What MLB team won the World Series during the COVID-19 pandemic?\n\nThe **Los Angeles Dodgers** won the World Series in 2020 during the COVID-19 pandemic. \n'
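To see what the `|` operator does conceptually, here is a minimal pure-Python sketch. The `Step` class, `fake_model`, and `strip_parser` are hypothetical stand-ins invented for this illustration. This is not LangChain's implementation, only the same composition idea.

```python
class Step:
    """Wraps a function so that steps can be composed with `|` (hypothetical helper)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-in for the LLM: returns a canned completion with stray whitespace.
fake_model = Step(lambda question: f"  Answer to: {question}\n")

# Stand-in for an output parser: here it only strips surrounding whitespace.
strip_parser = Step(lambda text: text.strip())

toy_chain = fake_model | strip_parser
result = toy_chain.invoke("Who won?")
print(result)
```

Each step only needs an `invoke` method, which is why prompts, models, parsers, and retrievers can all be mixed in one chain.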

## Introducing prompt templates

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/concepts/prompt_templates/) are a simple way to define and reuse prompts.

In [None]:
# from langchain.prompts import ChatPromptTemplate

# template = """
# Answer the question based on the context below. If you can't
# answer the question, reply "I don't know".

# Context: {context}

# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)
# prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'

In [36]:
from langchain.prompts import ChatPromptTemplate
template = """
Answer the question based on the context below. If you can answer that question,
add addtional comments "I am glad I can help you". If you can't
answer the question, reply "I don't know".

Context:{context}
Question:{question}
"""
prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Chatbot is an AI assistant", question="What is a Chatbot")

'Human: \nAnswer the question based on the context below. If you can answer that question,\nadd addtional comments "I am glad I can help you". If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext:Chatbot is an AI assistant\nQuestion:What is a Chatbot\n'

We can now chain the prompt with the model and the output parser.

<img src='https://github.com/haoransun/youtube-rag/blob/main/images/chain2.png?raw=1' width="1200">
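The cell for this step is not shown here. In LangChain it would presumably be `prompt | model | parser`, invoked with a dictionary holding `context` and `question`. The pure-Python sketch below only illustrates how data flows through those three steps. `fill_template`, `stub_model`, and `parse` are hypothetical names, and the stub returns a canned reply instead of calling a real LLM.

```python
TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "Context:{context}\n"
    "Question:{question}\n"
)


def fill_template(inputs: dict) -> str:
    # Step 1: the prompt turns a dict of variables into one prompt string.
    return TEMPLATE.format(**inputs)


def stub_model(prompt_text: str) -> dict:
    # Step 2: stand-in for the LLM. It returns a message-like dict
    # with a canned reply instead of generating text.
    return {"content": "Susana", "prompt_chars": len(prompt_text)}


def parse(message: dict) -> str:
    # Step 3: the parser extracts the plain string answer.
    return message["content"]


inputs = {
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?",
}
prompt_text = fill_template(inputs)
answer = parse(stub_model(prompt_text))
print(prompt_text)
print(answer)
```

The point to notice is the input type: once the prompt is the first step, the chain takes a dictionary rather than a bare string.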

## Combining chains

We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

Let's start by creating a new prompt template for the translation chain:

In [None]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

<img src='https://github.com/haoransun/youtube-rag/blob/main/images/chain3.png?raw=1' width="1200">

In [None]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Spanish",
    }
)

'Human: Translate Human: \nAnswer the question based on the context below. If you can answer that question,\nadd addtional comments "I am glad I can help you". If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext:Mary\'s sister is Susana. She doesn\'t have any more siblings.\nQuestion:How many sisters does Mary have?\nAnswer:Mary has one sister. \nI am glad I can help you. \n to Spanish: \nAnswer:Mary has one sister. \nI am glad I can help you. \n \n'

## Loading the Xstore PDF manual

The context we want to send the model comes from an Xstore manual in PDF format. Other kinds of text loaders exist as well, but whichever loader we use, the final extracted object should be a string.

In [39]:
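from PyPDF2 import PdfReader

with open("xocs-quick-reference-guide.pdf", "rb") as pdf_file:
    reader = PdfReader(pdf_file)
    # Join the text of every page; assigning inside a loop would keep only the last page.
    manu_xstore = "\n".join((page.extract_text() or "").strip() for page in reader.pages)
print(type(manu_xstore))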


<class 'str'>


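## Using the entire document as context

If we try to invoke the chain using the whole document as context, the model will return an error because the context is too long.

Large Language Models support limited context sizes. The manual we are using is too long for the model to handle, so we need to find a different solution. Note that the chain must include the prompt (`prompt | model | parser`) to accept a dictionary as input; the `model | parser` chain defined earlier accepts only a string, which is the error shown in the output below.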

In [50]:
try:
    chain.invoke({
        "context": manu_xstore,
        "question": "Is reading papers a good idea?"
    })
except Exception as e:
    print(e)

Invalid input type <class 'dict'>. Must be a PromptValue, str, or list of BaseMessages.


## Splitting the Document

Since we can't use the entire document as the context for the model, a potential solution is to split the document into smaller chunks. We can then invoke the model using only the relevant chunks to answer a particular question:

Let's start by loading the document in memory:

In [52]:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("xocs-quick-reference-guide.pdf")
xstore_documents = loader.load()
#xstore_documents

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. Check [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) for more information about different approaches to splitting documents.

For this notebook, we'll split the document into chunks of 1000 characters with an overlap of 20 characters and display the first few chunks:

In [56]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [57]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
# Split once, keep the chunks, and preview the first five.
xstore_documents = text_splitter.split_documents(xstore_documents)
xstore_documents[:5]

[Document(metadata={'source': 'xocs-quick-reference-guide.pdf', 'page': 0}, page_content='Oracle Retail Xstore Office \nCloud Service \n \nQuick Reference Guide \n \nFebruary 2024  |   \nCopyright © 2024, Oracle and/or its affiliates'),
 Document(metadata={'source': 'xocs-quick-reference-guide.pdf', 'page': 1}, page_content='1 Quick Reference Guide  \n Copyright © 2024, Oracle and/or its affiliates \nDisclaimer \nThis document in any form, software or printed matter, contains proprietary information that is the exclusive property \nof Oracle. Your access to and use of this confidential material is subject to the terms and conditions of your Oracle \nsoftware license and service agreement, which has been executed and with which you agree to comply. This \ndocument and information contained herein may not be disclosed, copied, reproduced or distributed to anyone \noutside Oracle without prior written consent of Oracle. This document is not part of your license agreement nor can it \nbe i
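
To make `chunk_size` and `chunk_overlap` concrete, here is a naive sliding-window splitter in plain Python. It is only a sketch of the overlap idea: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` is smarter because it prefers to cut on separators such as paragraph and line breaks.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The last two characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is less likely to be lost entirely.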

## Finding the relevant chunks

Given a particular question, we need to find the relevant chunks from the document to send to the model. Here is where the idea of **embeddings** comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: The projection of related concepts will be close to each other, while concepts with different meanings will lie far away. You can use the [Cohere's Embed Playground](https://dashboard.cohere.com/playground/embed) to visualize embeddings in two dimensions.

To provide the model with the most relevant chunks, we can use the embeddings of the question and of the document chunks to compute the similarity between them. We can then select the chunks with the highest similarity to the question and use them as the context for the model.

Let's generate embeddings for an arbitrary query:

In [58]:

from langchain_community.embeddings import HuggingFaceHubEmbeddings

embeddings = HuggingFaceHubEmbeddings()

embedded_query = embeddings.embed_query("What is xstore?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 768
[0.00874686986207962, -0.10027685016393661, -0.004529685713350773, -0.0246969573199749, -0.017194503918290138, 0.032514624297618866, 0.025161515921354294, 0.02586253732442856, 0.018119338899850845, -0.017787758260965347]




To illustrate how embeddings work, let's first generate the embeddings for four different sentences. The last two are near-duplicates of the query:

In [67]:
sentence1 = embeddings.embed_query("Mary's sister is Susana")
sentence2 = embeddings.embed_query("Pedro's mother is a teacher")
sentence3 = embeddings.embed_query("what is xstore")
sentence4 = embeddings.embed_query("What is xstore?")

We can now compute the similarity between the query and each of the four sentences. The closer the embeddings are, the more similar the sentences will be.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between the query and each of the sentences:

In [68]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]
query_sentence3_similarity = cosine_similarity([embedded_query], [sentence3])[0][0]
query_sentence4_similarity = cosine_similarity([embedded_query], [sentence4])[0][0]
query_sentence1_similarity, query_sentence2_similarity, query_sentence3_similarity, query_sentence4_similarity

(0.04001440613654937,
 -0.007844169000683063,
 0.9504557322801834,
 1.0000000000000009)
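
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following plain-Python sketch computes it by hand on small made-up vectors; the helper name `cosine` and the example vectors are chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # close to 1: same direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # 0: unrelated directions
opposite = cosine([1.0, 0.0], [-1.0, 0.0])       # -1: opposite directions
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, not on vector length, an identical sentence scores 1 up to floating-point rounding, which matches the last value in the output above.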

## Setting up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches.

To understand how a vector store works, let's create one in memory and add a few embeddings to it:

In [69]:
!pip install docarray



In [70]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Mary's sister is Susana",
        "John and Tommy are brothers",
        "Patricia likes white cars",
        "Pedro's mother is a teacher",
        "Lucia drives an Audi",
        "Mary has two siblings",
        "what is xstore",
    ],
    embedding=embeddings,
)



We can now query the vector store to find the most similar embeddings to a given query:

In [72]:

vectorstore1.similarity_search_with_score(query="What is xtore?", k=3)

[(Document(metadata={}, page_content='what is xstore'), 0.7247137486961283),
 (Document(metadata={}, page_content='John and Tommy are brothers'),
  0.04248969508508129),
 (Document(metadata={}, page_content="Mary's sister is Susana"),
  0.008849909090213475)]
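
Conceptually, an in-memory vector store keeps each text next to its vector and ranks the entries against the query vector. The sketch below shows that idea with brute-force search. `TinyVectorStore` is a hypothetical class, and `embed` is a toy word-count "embedding" assumed here so that the example runs without a model; real embeddings are dense vectors, and production stores typically use approximate nearest-neighbor indexes instead of scanning every entry.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. A real model returns dense vectors.
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Keeps (text, vector) pairs and ranks them by brute force."""

    def __init__(self, texts):
        self.items = [(text, embed(text)) for text in texts]

    def similarity_search_with_score(self, query: str, k: int = 3):
        query_vec = embed(query)
        scored = [(text, cosine(query_vec, vec)) for text, vec in self.items]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore([
    "Mary's sister is Susana",
    "Lucia drives an Audi",
    "what is xstore",
])
hits = store.similarity_search_with_score("What is xstore?", k=2)
print(hits)
```

The best match comes first, followed by the sentence that shares only the word "is" with the query.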

## Connecting the vector store to the chain

We can use the vector store to find the most relevant chunks from the document to send to the model. Here is how we can connect the vector store to the chain:

<img src='https://github.com/haoransun/youtube-rag/blob/main/images/chain4.png?raw=1' width="1200">

We need to configure a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/). The retriever will run a similarity search in the vector store and return the most similar documents back to the next step in the chain.

We can get a retriever directly from the vector store we created before:

In [73]:
retriever1 = vectorstore1.as_retriever()

retriever1.invoke("how does xstore work?")

[Document(metadata={}, page_content='what is xstore'),
 Document(metadata={}, page_content='Mary has two siblings'),
 Document(metadata={}, page_content="Mary's sister is Susana"),
 Document(metadata={}, page_content='John and Tommy are brothers')]

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [75]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("how xstore works?")

{'context': [Document(metadata={}, page_content='what is xstore'),
  Document(metadata={}, page_content="Mary's sister is Susana"),
  Document(metadata={}, page_content='Mary has two siblings'),
  Document(metadata={}, page_content='John and Tommy are brothers')],
 'question': 'how xstore works?'}
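
The map step can be pictured as "give the same input to every branch and collect the results in a dictionary". Here is a plain-Python sketch of that idea. `run_parallel`, `toy_retriever`, and `passthrough` are hypothetical stand-ins, and this sketch simply runs the branches one after another.

```python
def run_parallel(steps: dict, value):
    """Feed the same input to every step and collect the results under each key."""
    return {key: step(value) for key, step in steps.items()}


def toy_retriever(question: str) -> list[str]:
    # Stand-in for the retriever: returns hard-coded "documents".
    return ["what is xstore", "Mary has two siblings"]


def passthrough(value):
    # Mirrors the role of RunnablePassthrough: hand the input on unchanged.
    return value


setup_output = run_parallel(
    {"context": toy_retriever, "question": passthrough},
    "how does xstore work?",
)
print(setup_output)
```

The resulting dictionary has exactly the two keys the prompt template expects.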

Let's now add the setup map to the chain and run it:



In [103]:
chain = setup | prompt | model | parser

chain.invoke("What is xstore?")


'Human: \nAnswer the question based on the context below. If you can answer that question,\nadd addtional comments "I am glad I can help you". If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext:[Document(metadata={}, page_content=\'what is xstore\'), Document(metadata={}, page_content="Mary\'s sister is Susana"), Document(metadata={}, page_content=\'Mary has two siblings\'), Document(metadata={}, page_content=\'John and Tommy are brothers\')]\nQuestion:What is xstore?\n\n'

Let's invoke the chain using another example:

In [78]:
chain.invoke("What is xcenter?")

'Human: \nAnswer the question based on the context below. If you can answer that question,\nadd addtional comments "I am glad I can help you". If you can\'t\nanswer the question, reply "I don\'t know".\n\nContext:[Document(metadata={}, page_content=\'what is xstore\'), Document(metadata={}, page_content=\'John and Tommy are brothers\'), Document(metadata={}, page_content=\'Mary has two siblings\'), Document(metadata={}, page_content="Pedro\'s mother is a teacher")]\nQuestion:What is xcenter?\n\n'
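
In the outputs above, the context reaches the prompt as the printed representation of a list of `Document` objects. One common option is to join the `page_content` of the retrieved documents into plain text first. The sketch below shows such a helper; `Doc` is a hypothetical stand-in for a retrieved document, and `format_docs` is an illustrative name, not something defined in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for a retrieved document with a `page_content` attribute.
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("what is xstore"), Doc("Mary has two siblings")]
context_text = format_docs(retrieved)
print(context_text)
```

In a LangChain chain, a function like this is usually piped right after the retriever, for example `retriever1 | format_docs` as the `context` branch, so the model sees clean text rather than object reprs.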

## Loading the document into the vector store

We initialized the vector store with a few arbitrary strings. Let's create a new vector store using the chunks from the Xstore document.

In [88]:
vectorstore_xocDoc = DocArrayInMemorySearch.from_documents(xstore_documents, embeddings)

Let's set up a new chain using the correct vector store. This time we are using a different equivalent syntax to specify the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) portion of the chain:

In [104]:
chain = (
    {"context": vectorstore_xocDoc.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("what is the function of xstore?")

HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/models/google/gemma-2-9b-it (Request ID: ULgqqjk2NO_xwyT5r4O8L)

Model too busy, unable to get response in less than 60 second(s)

## Setting up Pinecone

So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use [Pinecone](https://www.pinecone.io/).

The first step is to create a Pinecone account, set up an index whose dimension matches the embeddings (768 here), get an API key, and set it as the environment variable `PINECONE_API_KEY`.

Then, we can load the xstore documents into Pinecone:

In [91]:
from getpass import getpass

# Enter your own Pinecone API key; avoid hard-coding secrets in the notebook.
os.environ['PINECONE_API_KEY'] = getpass("Pinecone API key: ")



In [94]:
from langchain_pinecone import PineconeVectorStore

index_name = "ragchatbot"
# The dimension is 768 to match the index created on Pinecone
pinecone = PineconeVectorStore.from_documents(
    xstore_documents, embeddings, index_name=index_name
)

In [95]:
index_name

'ragchatbot'

Let's now run a similarity search on Pinecone to make sure everything works:

In [None]:
# Return the 3 documents most similar to the question
pinecone.similarity_search("What is xstore?", k=3)

Let's set up the new chain using Pinecone as the vector store:

In [98]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What is xstore?")

HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/models/google/gemma-2-9b-it (Request ID: n3oXlaW1W_wl9TOHF836f)

Model too busy, unable to get response in less than 60 second(s)

In [102]:
# The documents are already in the index and `chain` already uses Pinecone,
# so we can ask another question without uploading the chunks again.
chain.invoke("What is the xstore to start doing?")

HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/models/google/gemma-2-9b-it (Request ID: fvC9ZN8i6z0ZI0FHzUChb)

Model too busy, unable to get response in less than 60 second(s)
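
Since the hosted model is sometimes busy, one way to cope is to retry the call with an increasing delay. The helper below is a generic sketch: `invoke_with_retry` and `flaky_model` are hypothetical names, the attempt count and delays are arbitrary assumptions, and catching every `Exception` is done only for brevity.

```python
import time


def invoke_with_retry(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on failure, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in that fails twice before succeeding.
calls = {"count": 0}


def flaky_model():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Model too busy")
    return "ok"


delays = []  # record the waits instead of actually sleeping
outcome = invoke_with_retry(flaky_model, attempts=3, base_delay=1.0, sleep=delays.append)
print(outcome, delays)
```

In the notebook, `call` could be something like `lambda: chain.invoke("What is xstore?")`, with the default `sleep` left in place so the waits really happen.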