# RAG From Scratch: Overview

These notebooks walk through building RAG apps from scratch.

They build towards a broader understanding of the RAG landscape, as shown here:

![Screenshot 2024-03-25 at 8.30.33 PM.png](attachment:c566957c-a8ef-41a9-9b78-e089d35cf0b7.png)

## Environment

`(1) Packages`

In [None]:
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain

Collecting langchain_community
  Downloading langchain_community-0.2.0-py3-none-any.whl (2.1 MB)
Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting langchain-openai
  Downloading langchain_openai-0.1.7-py3-none-any.whl (34 kB)
Collecting langchainhub
  Downloading langchainhub-0.1.15-py3-none-any.whl (4.6 kB)
Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl (526 kB)
Collecting langchain
  Downloading langchain-0.2.0-py3-none-any.whl (973 kB)

`(2) LangSmith`

https://docs.smith.langchain.com/

In [None]:
import os
os.environ['LANGCHAIN_TRACING_V2']  = 'true'
os.environ['LANGCHAIN_ENDPOINT']    = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY']     = 'ENTER-YOUR-LANGSMITH-API-KEY-HERE'
os.environ['OPENAI_API_KEY']        = 'ENTER-YOUR-OPENAI-API-KEY-HERE'

`(3) API Keys`

## Part 1: Overview

[RAG quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)

In [None]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

#### INDEXING ####

# Load Documents
# loader = WebBaseLoader(
#     web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
#     bs_kwargs=dict(
#         parse_only=bs4.SoupStrainer(
#             class_=("post-content", "post-title", "post-header")
#         )
#     ),
# )
# docs = loader.load()

In [None]:
#### INDEXING ####

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://imsdb.com/scripts/Avengers-Endgame.html",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_="scrtext"
        )
    ),
)
docs = loader.load()


In [None]:
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

#### RETRIEVAL and GENERATION ####

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

[Document(page_content='LLAMAS and FARM ANIMALS GRAZE A RUSTIC-TECH ECO-COMPOUND.\r\n\r\n\r\nEXT. STARK ECO-COMPOUND, WOODS - DAY\r\n\r\nTONY approaches A WOODED GLADE near the house.\r\n                                                                25\r\n\r\n\r\n                    TONY\r\n          Morgan H. Stark.   Chow time.    Want\r\n          some lunch?\r\n\r\nSilence. Then MORGAN STARK (4) steps out of her play tent,\r\nwearing A PURPLE-BLUE IRON MAN HELMET (RESCUE).\r\n\r\n                    MORGAN STARK\r\n          Define lunch or be disintegrated.', metadata={'source': 'https://imsdb.com/scripts/Avengers-Endgame.html'})]

In [None]:
# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("How does Steve talk about life?")

'Steve talks about life as a series of challenges and uncertainties, but also as a journey that requires taking risks and making difficult decisions. He emphasizes the importance of taking that initial leap, even when the outcome is unknown.'

## Part 2: Indexing

![Screenshot 2024-02-12 at 1.36.56 PM.png](attachment:d1c0f19e-1f5f-4fc6-a860-16337c1910fa.png)

In [None]:
# Documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat."

[Count tokens](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb), keeping in mind the rule of thumb of [~4 characters per token](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

In [None]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base")

8
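As a quick sanity check on the ~4 characters per token rule of thumb, the sketch below estimates a token count from character length alone. The helper name `estimate_tokens` is hypothetical, and the result is only a rough approximation; it is not a substitute for a real tokenizer such as `tiktoken`.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)

question = "What kinds of pets do I like?"
print(len(question))              # number of characters
print(estimate_tokens(question))  # heuristic token estimate
```

For this question the heuristic lands on the same value as the `tiktoken` count shown above, but the two can diverge noticeably on code, non-English text, or unusual punctuation.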

[Text embedding models](https://python.langchain.com/docs/integrations/text_embedding/openai)

In [None]:
from langchain_openai import OpenAIEmbeddings
embd = OpenAIEmbeddings()
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)
len(query_result)

1536

[Cosine similarity](https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions) is recommended for OpenAI embeddings (a value of 1 indicates identical direction).

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.8806521938580575
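To build intuition for the score above, here is a small standard-library sketch that evaluates cosine similarity on toy vectors. The vectors are made up for illustration and are not real embeddings; the function name `cosine` is an assumption chosen to avoid clashing with the notebook's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
same_direction = cosine(a, [2.0, 4.0, 6.0])  # scaled copy of a
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])   # pointing the other way

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not magnitude: a scaled copy of a vector scores approximately 1, perpendicular vectors score 0, and opposite vectors score -1.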


[Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

In [None]:
# #### INDEXING ####

# # Load blog
# import bs4
# from langchain_community.document_loaders import WebBaseLoader
# loader = WebBaseLoader(
#     web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
#     bs_kwargs=dict(
#         parse_only=bs4.SoupStrainer(
#             class_=("post-content", "post-title", "post-header")
#         )
#     ),
# )
# blog_docs = loader.load()

[Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)

> This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50)

# Make splits
# splits = text_splitter.split_documents(blog_docs)
splits = text_splitter.split_documents(docs)
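The splitting strategy quoted above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, it measures chunk size in characters rather than tokens, and it omits chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a piece that is still too large is
    split again with the remaining, finer-grained separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb\n\nccc ddd eee"
result = recursive_split(sample, chunk_size=8)
print(result)
```

With this toy input, the first paragraph fits in one chunk, while the second is too long and falls back to splitting on spaces, which mirrors the "paragraphs, then sentences, then words" preference described in the quote.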

[Vectorstores](https://python.langchain.com/docs/integrations/vectorstores/)

In [None]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

## Part 3: Retrieval

In [None]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())


retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

In [None]:
docs = retriever.get_relevant_documents("How would Thanos say hello?")

In [None]:

docs

[Document(page_content="In all my years of conquest...\r\n\r\nThanos looks out at the tiny human struggling to stand.\r\n\r\n                    THANOS (CONT'D)\r\n          Of violence and slaughter...it was\r\n          never personal.\r\n\r\nThanos gestures behind him as A RUMBLE ECHOES.\r\n\r\nTHOUSANDS OF ALIENS RING THE LIP OF THE CRATER: THE BLACK\r\nORDER LEADS A PLATOON OF CHITAURI, SAKAARANS, AND OUTRIDERS.\r\n\r\n                    THANOS (CONT'D)\r\n          But I'll tell you now, the things\r\n          I'm about to do to your stubborn,\r\n          annoying, little planet...\r\n\r\nQ-SHIPS, DROPSHIPS, NECROCRAFT, AND LEVIATHANS BUZZ ABOVE.", metadata={'source': 'https://imsdb.com/scripts/Avengers-Endgame.html'})]
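Conceptually, the `k` in `search_kwargs={"k": 1}` controls how many of the closest chunks are returned. The sketch below shows a brute-force version of that idea on made-up three-dimensional vectors. It is an illustration only: the names `top_k` and `toy_index` are hypothetical, and Chroma's actual index structure and distance metric are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vec, index, k=1):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()),
        reverse=True,  # highest similarity first
    )
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings" keyed by chunk id (made-up numbers, not real embeddings).
toy_index = {
    "greeting": [0.9, 0.1, 0.0],
    "battle": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]

print(top_k(query, toy_index, k=1))
print(top_k(query, toy_index, k=2))
```

Raising `k` returns more context for the generation step at the cost of a longer prompt; `k=1` keeps only the single closest chunk.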

## Part 4: Generation

![Screenshot 2024-02-12 at 1.37.38 PM.png](attachment:f9b0e284-58e4-4d33-9594-2dad351c569a.png)

In [None]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """Rewrite the prompt in the style of the character mentioned. The context contains sample dialogue from the character, :

Character: {character}
Context: {context}

Promptq: {promptq}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['character', 'context', 'promptq'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['character', 'context', 'promptq'], template='Rewrite the prompt in the style of the character mentioned. The context contains sample dialogue from the character, :\n\nCharacter: {character}\nContext: {context}\n\nPromptq: {promptq}\n'))])
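To see what the model receives once the placeholders are filled, the sketch below reproduces the substitution with plain `str.format`. It is a stand-in, not LangChain's implementation: `Doc` is a hypothetical minimal replacement for a LangChain `Document`, `fill_prompt` is an illustrative helper, the template is shortened, and the sample text is made up. It also joins the `page_content` of each document, which is one way to keep metadata out of the prompt.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text only)."""
    page_content: str

# Shortened version of the notebook's template, with the same placeholders.
TEMPLATE = """Rewrite the prompt in the style of the character mentioned.

Character: {character}
Context: {context}

Promptq: {promptq}
"""

def format_docs(docs):
    """Join document texts with blank lines between them."""
    return "\n\n".join(doc.page_content for doc in docs)

def fill_prompt(character, docs, promptq):
    """Substitute the three placeholders and return the final prompt text."""
    return TEMPLATE.format(
        character=character,
        context=format_docs(docs),
        promptq=promptq,
    )

sample_docs = [Doc("First sample line."), Doc("Second sample line.")]
filled = fill_prompt("Thanos", sample_docs, "I want to sleep.")
print(filled)
```

Passing a list of documents straight into the template, as the cells below do, would likely render the list's full representation (including metadata) into the context; formatting the text first gives a cleaner prompt.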

In [None]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
# Chain
chain = prompt | llm

In [None]:
docs

[Document(page_content="In all my years of conquest...\r\n\r\nThanos looks out at the tiny human struggling to stand.\r\n\r\n                    THANOS (CONT'D)\r\n          Of violence and slaughter...it was\r\n          never personal.\r\n\r\nThanos gestures behind him as A RUMBLE ECHOES.\r\n\r\nTHOUSANDS OF ALIENS RING THE LIP OF THE CRATER: THE BLACK\r\nORDER LEADS A PLATOON OF CHITAURI, SAKAARANS, AND OUTRIDERS.\r\n\r\n                    THANOS (CONT'D)\r\n          But I'll tell you now, the things\r\n          I'm about to do to your stubborn,\r\n          annoying, little planet...\r\n\r\nQ-SHIPS, DROPSHIPS, NECROCRAFT, AND LEVIATHANS BUZZ ABOVE.", metadata={'source': 'https://imsdb.com/scripts/Avengers-Endgame.html'})]

In [None]:
# Run
chain.invoke({"context":docs,"character":"Thanos","promptq":"Thanos orders breakfast from Ya Kun."})

AIMessage(content='"I am inevitable. I demand a hearty breakfast fit for a conqueror. Bring me the finest kaya toast and soft-boiled eggs from Ya Kun. And make it quick, for I have galaxies to conquer."', response_metadata={'token_usage': {'completion_tokens': 43, 'prompt_tokens': 259, 'total_tokens': 302}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-d77c3097-901a-404e-91a4-a8b7be88d3b2-0')

In [None]:
# Run
chain.invoke({"context":docs,"character":"Thanos","promptq":"""
We, the citizens of Singapore, pledge ourselves as one united people, regardless of race, language or religion, to build a democratic society, based on justice and equality, so as to achieve happiness, prosperity and progress for our nation.
"""})

AIMessage(content='In all my years of conquest, I have seen the folly of division among beings. The citizens of Singapore have shown wisdom in their pledge to unite as one people, regardless of race, language, or religion. This democratic society they strive to build, based on justice and equality, is a noble pursuit. May their efforts lead to happiness, prosperity, and progress for their nation. Just as I seek balance in the universe, may Singapore find balance in their unity.', response_metadata={'token_usage': {'completion_tokens': 93, 'prompt_tokens': 299, 'total_tokens': 392}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-10b14d6d-09fe-4390-a70d-daf9b2d0c63c-0')

In [None]:
# Run
chain.invoke({"context":docs,"character":"Thanos","promptq":"""
I want to sleep.
"""})

AIMessage(content='I desire rest.', response_metadata={'token_usage': {'completion_tokens': 4, 'prompt_tokens': 258, 'total_tokens': 262}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-0eba29d9-0217-4a88-871d-79c3165aade3-0')

## Yoda

In [None]:
#### INDEXING ####

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://imsdb.com/scripts/Star-Wars-The-Empire-Strikes-Back.html",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_="scrtext"
        )
    ),
)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
splits = text_splitter.split_documents(docs)

In [None]:
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())


In [None]:


retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

docs = retriever.get_relevant_documents("How would Yoda say goodbye?")

In [None]:
for doc in docs:
    print(doc.page_content)


No, no, there is no why.  Nothing 
		more will I teach you today.
		Clear your mind of questions.  
		Mmm.  Mmmmmmmm.

Artoo beeps in the distance as Luke lets Yoda down to the ground.  
Breathing heavily, he takes his shirt from a nearby tree branch and 
pulls it on.
No, no, there is no why.  Nothing 
		more will I teach you today.
		Clear your mind of questions.  
		Mmm.  Mmmmmmmm.

Artoo beeps in the distance as Luke lets Yoda down to the ground.  
Breathing heavily, he takes his shirt from a nearby tree branch and 
pulls it on.
No, no, there is no why.  Nothing 
		more will I teach you today.
		Clear your mind of questions.  
		Mmm.  Mmmmmmmm.

Artoo beeps in the distance as Luke lets Yoda down to the ground.  
Breathing heavily, he takes his shirt from a nearby tree branch and 
pulls it on.


In [None]:
# Prompt
template = """Rewrite the prompt in the style of the character mentioned. The context contains sample dialogue from the character, :

Character: {character}
Context: {context}

Promptq: {promptq}
"""

prompt = ChatPromptTemplate.from_template(template)

# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Chain
chain = prompt | llm

In [None]:
# Run
chain.invoke({"context":docs,"character":"Yoda","promptq":"""
We, the citizens of Singapore, pledge ourselves as one united people, regardless of race, language or religion, to build a democratic society, based on justice and equality, so as to achieve happiness, prosperity and progress for our nation.
"""})

AIMessage(content='Pledge ourselves as one united people, we do. Regardless of race, language, or religion. Build a democratic society, we must. Justice and equality, the foundation. Happiness, prosperity, and progress for our nation, we seek. Hmm. Mmmmmmmm.', response_metadata={'token_usage': {'completion_tokens': 55, 'prompt_tokens': 466, 'total_tokens': 521}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-79ae7290-a659-462c-84b2-b7bff3c7fc32-0')

## Confucius

In [None]:
#### INDEXING ####

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://www.gutenberg.org/cache/epub/3330/pg3330-images.html",),
    # bs_kwargs=dict(
    #     parse_only=bs4.SoupStrainer(
    #         class_=("scrtext")
    #     )
    # ),
)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
splits = text_splitter.split_documents(docs)

In [None]:
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())


In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

docs = retriever.get_relevant_documents("How does the Master speak")

In [None]:
# Prompt
template = """Rewrite the prompt in the style of the character mentioned. The context contains sample dialogue from the character, :

Character: {character}
Context: {context}

Promptq: {promptq}
"""

prompt = ChatPromptTemplate.from_template(template)

# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Chain
chain = prompt | llm

In [None]:
# Run
chain.invoke({"context":docs,"character":"The Master","promptq":"""
We, the citizens of Singapore, pledge ourselves as one united people, regardless of race, language or religion, to build a democratic society, based on justice and equality, so as to achieve happiness, prosperity and progress for our nation.
"""})

AIMessage(content='We, the people of this land, vow to stand together as a unified force, transcending differences in race, language, and faith. Our goal is to construct a society rooted in fairness and parity, leading us towards joy, success, and advancement for our country.', response_metadata={'token_usage': {'completion_tokens': 54, 'prompt_tokens': 228, 'total_tokens': 282}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-342ee55b-c4ad-4131-bdb8-b4ebc4777722-0')

In [None]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

In [None]:
prompt_hub_rag

[RAG chains](https://python.langchain.com/docs/expression_language/get_started#rag-search-example)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_hub_rag
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")