
# Exercise Notebook: Implementing RAG (Retrieval-Augmented Generation)

In this exercise notebook, you will go through the steps required to implement Retrieval-Augmented Generation (RAG).
The notebook will guide you through each step, providing explanations and asking you to fill in the code.

Please fill in the code cells where prompted to complete the implementation.

**Let's get started!**



## Installing Required Libraries

Before starting, ensure you have all the necessary libraries installed.
Install the following libraries by running the appropriate command below.

- `langchain`
- `langchain_community`
- `unstructured`
- `sentence_transformers`
- `tiktoken`
- `chromadb`
- `langchain_chroma`
- `langchain_groq`

Fill in the installation command in the code cell below:


In [28]:
!pip install langchain langchain_community unstructured sentence_transformers tiktoken chromadb langchain_chroma langchain_groq




## Import Necessary Modules

Now, you need to import the necessary modules to build the RAG system.
Write the import statements for the libraries required in the following code cell.


In [29]:
import os
import pandas as pd
import markdown
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [30]:
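import re
import pandas as pd

# Load the prompts dataset from the Hugging Face Hub.
# Reading an hf:// path with pandas relies on the huggingface_hub package being installed.
df = pd.read_csv("hf://datasets/fka/awesome-chatgpt-prompts/prompts.csv")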

In [31]:
df

| | act | prompt |
|---|---|---|
| 0 | An Ethereum Developer | Imagine you are an experienced Ethereum develo... |
| 1 | SEO Prompt | Using WebPilot, create an outline for an artic... |
| 2 | Linux Terminal | I want you to act as a linux terminal. I will ... |
| 3 | English Translator and Improver | I want you to act as an English translator, sp... |
| 4 | `position` Interviewer | I want you to act as an interviewer. I will be... |
| ... | ... | ... |
| 165 | Cheap Travel Ticket Advisor | You are a cheap travel ticket advisor speciali... |
| 166 | Data Scientist | I want you to act as a data scientist. Imagine... |
| 167 | League of Legends Player | I want you to act as a person who plays a lot ... |
| 168 | Restaurant Owner | I want you to act as a Restaurant Owner. When ... |


## Data Pre-processing and Preparation

In this section, we prepare the dataset for retrieval. We clean the text of each prompt and then write every row to its own Markdown file, so that the files can be loaded, split into chunks, and embedded in the later steps.

In [32]:
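def clean_text(text):
    """Strip HTML-style tags and punctuation, leaving space-separated alphanumeric words."""
    text = re.sub(r"<.*?>", "", text)           # remove HTML-style tags
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # replace each non-alphanumeric run with one space
    text = text.strip()
    return text

df['prompt'] = df['prompt'].apply(clean_text)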
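
To see what the cleaning step does to a prompt, here is a small self-contained sketch that applies the same two regular expressions to a made-up string (the sample text is an assumption for illustration, not a row from the dataset). Note that the second pattern also replaces apostrophes, which is why contractions appear split (for example `isn t`) in the retrieved chunks later in the notebook.

```python
import re

def clean_text(text):
    text = re.sub(r"<.*?>", "", text)           # drop HTML-style tags
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # non-alphanumeric runs become a single space
    return text.strip()

# Hypothetical sample string, used only to illustrate the behaviour.
sample = "<p>Don't stop, it's RAG!</p>"
cleaned = clean_text(sample)
print(cleaned)  # Don t stop it s RAG
```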

In [33]:
directory = 'data/markdown_files'
os.makedirs(directory, exist_ok=True)

In [37]:
for i in range(len(df)):  # one Markdown file per DataFrame row
    title = df['act'].iloc[i]
    content = df['prompt'].iloc[i]

    # The title becomes a level-1 heading, followed by the cleaned prompt text.
    markdown_content = f"# {title}\n\n"
    markdown_content += f"{content}\n\n"

    with open(os.path.join(directory, f'{i}.md'), 'w', encoding='utf-8') as file:
        file.write(markdown_content)

## Read Files from the Directory

In this step, we read the Markdown (`.md`) files written in the previous step back from the directory. Other text-based formats, such as plain text (`.txt`), could be handled the same way by filtering on their extension, but the code below processes only `.md` files.

### Steps to Follow:

1. **Specify the directory**: Define the directory from which to load the files.
2. **Read files by extension**: Filter files based on their extensions; here we keep only `.md` files.
3. **Convert or process content**: For each Markdown file, load the content and convert it into HTML using the `markdown` module. Other text formats could simply be read as plain text.
4. **Store the processed content**: Append the converted content of each file to a list for further use.

In [38]:
markdown_texts = []
for filename in os.listdir(directory):
    if filename.endswith(".md"):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            markdown_content = file.read()
        # Convert the Markdown source to HTML and keep the result for chunking.
        html_content = markdown.markdown(markdown_content)
        markdown_texts.append(html_content)
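
The loop above only handles `.md` files. As a rough sketch of how several extensions might be dispatched, the hypothetical helper below (`load_text_files` is not part of the notebook) reads every matching file as plain text and returns the contents keyed by filename. It uses only the standard library and sorts the filenames, since `os.listdir` returns entries in arbitrary order.

```python
import os
import tempfile

def load_text_files(directory, extensions=(".md", ".txt")):
    """Return {filename: content} for files whose extension is in `extensions`."""
    contents = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(tuple(extensions)):
            path = os.path.join(directory, filename)
            with open(path, "r", encoding="utf-8") as file:
                contents[filename] = file.read()
    return contents

# Demonstration on a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("0.md", "# Title\n\nBody"), ("notes.txt", "plain"), ("skip.csv", "a,b")]
    for name, body in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as file:
            file.write(body)
    loaded = load_text_files(tmp)

print(list(loaded))  # ['0.md', 'notes.txt']
```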

## Split the Text into Chunks

In this step, we will split the text into manageable chunks. This is important for tasks such as document retrieval and text generation, where large bodies of text need to be broken down for efficient processing.

### Why Split Text into Chunks?

- **Memory Efficiency**: Working with smaller pieces of text is more memory efficient.
- **Improved Retrieval**: Splitting long documents into smaller sections can improve the relevance of retrieval tasks.
- **Better Generation**: For text generation, smaller chunks help models focus on a specific context.

### Steps to Follow:

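1. **Specify the chunk size**: Define the maximum number of characters per chunk (`chunk_size`) and how many characters adjacent chunks may share (`chunk_overlap`). `RecursiveCharacterTextSplitter` measures length in characters by default.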
2. **Split the text**: Split each document or file content into chunks based on the defined size.
3. **Handle incomplete chunks**: If a document ends with a chunk that is smaller than the chunk size, include it as a valid chunk.
4. **Store the chunks**: Store all chunks in a list for further processing.

In [39]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = text_splitter.create_documents(markdown_texts)
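
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below implements a plain fixed-size sliding window over characters. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first tries to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one, which is the overlap. A final chunk shorter than `chunk_size` is kept as a valid chunk.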

## Initialize the Embedding Model & Create a Vector Store Using Chroma

In this step, we will initialize an embedding model to convert text chunks into numerical vectors. These embeddings will be used to measure the similarity between different chunks of text. After generating the embeddings, we will store them using Chroma, a vector store designed to efficiently manage and retrieve embeddings.

### Steps to Follow:

1. **Initialize the embedding model**: Choose an embedding model (e.g., Sentence Transformers or OpenAI embeddings) to convert text into vectors.
2. **Generate embeddings**: Convert each text chunk into its corresponding embedding.
3. **Create a vector store**: Use Chroma to store the embeddings and their associated metadata (e.g., the original text chunk).
4. **Verify the store**: Ensure that the embeddings are stored correctly and that you can retrieve them based on similarity.

In [40]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(documents, embedding_function, persist_directory="./chroma_db")
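
Similarity between embeddings is commonly measured with cosine similarity. The sketch below computes it for toy three-dimensional vectors; the vectors are invented for illustration, whereas real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings.
v_query = [1.0, 0.0, 0.0]
v_same_direction = [2.0, 0.0, 0.0]
v_orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(v_query, v_same_direction))  # 1.0
print(cosine_similarity(v_query, v_orthogonal))      # 0.0
```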



## Load the Persistent Directory for Chroma DB

In this step, we will focus on **loading** the persistent storage for Chroma DB. This allows us to access previously stored embeddings and metadata without recomputing them. By setting up persistent storage, we ensure that the vector database can be saved to disk and loaded again when needed.

### Steps to Follow:

1. **Specify the persistent directory**: Identify the directory where the Chroma DB is stored.
2. **Load the vector store**: Use Chroma to load the embeddings and metadata from this directory.
3. **Verify the loaded data**: Ensure that the embeddings and associated data have been correctly loaded and can be queried.

In [41]:
import os
import json
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [42]:
PERSIST_DIRECTORY = "./chroma_db"  # same directory that was used when the store was created
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embedding_function)



## Create & Test the Retrieval with a Sample Query

In this step, we will set up the retrieval process using the embeddings stored in Chroma DB. Retrieval is a key part of the Retrieval-Augmented Generation (RAG) pipeline, allowing us to find relevant documents or text chunks based on a query. After setting up the retrieval system, we will test it with a sample query to ensure that it returns the most relevant chunks.

### Steps to Follow:

1. **Set up the retrieval system**: Using the Chroma DB with the stored embeddings, create a retrieval function that can match a query to relevant text chunks.
2. **Prepare a sample query**: Define a query that you want to search for in the stored text chunks.
3. **Retrieve relevant chunks**: Use the query to search the vector store and retrieve the most similar chunks.
4. **Test the results**: Check that the returned chunks are relevant to the query and adjust the retrieval system if needed.

In [43]:
def query_chroma_db(query, db, top_k=5):
    # Pass top_k through so the caller controls how many chunks are returned.
    docs = db.similarity_search(query, k=top_k)
    results = [doc.page_content for doc in docs]
    return results
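
Conceptually, a similarity search ranks the stored vectors by their distance to the query vector and returns the closest `k`. The sketch below does this by brute force over a tiny in-memory store; the texts and two-dimensional vectors are made up, and a real vector store such as Chroma typically relies on an index rather than a full scan.

```python
def top_k_by_distance(query_vec, store, k=2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = []
    for text, vec in store:
        # Squared Euclidean distance: lower means more similar.
        distance = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((text, distance))
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Hypothetical store of (text, vector) pairs.
store = [
    ("chunk about travel", [0.0, 1.0]),
    ("chunk about coding", [1.0, 0.0]),
    ("chunk about cooking", [1.0, 1.0]),
]
results = top_k_by_distance([1.0, 0.0], store, k=2)
print(results)  # [('chunk about coding', 0.0), ('chunk about cooking', 1.0)]
```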

In [44]:
df

| | act | prompt |
|---|---|---|
| 0 | An Ethereum Developer | Imagine you are an experienced Ethereum develo... |
| 1 | SEO Prompt | Using WebPilot create an outline for an articl... |
| 2 | Linux Terminal | I want you to act as a linux terminal I will t... |
| 3 | English Translator and Improver | I want you to act as an English translator spe... |
| 4 | `position` Interviewer | I want you to act as an interviewer I will be ... |
| ... | ... | ... |
| 165 | Cheap Travel Ticket Advisor | You are a cheap travel ticket advisor speciali... |
| 166 | Data Scientist | I want you to act as a data scientist Imagine ... |
| 167 | League of Legends Player | I want you to act as a person who plays a lot ... |
| 168 | Restaurant Owner | I want you to act as a Restaurant Owner When g... |


In [48]:
query_chroma_db(" i want", db)

['<h1>Magician</h1>\n<p>I want you to act as a magician I will provide you with an audience and some suggestions for tricks that can be performed Your goal is to perform these tricks in the most entertaining way possible using your skills of deception and misdirection to amaze and astound the spectators My first request is I want you to make my watch disappear How can you do that</p>',
 "<h1>Spongebob's Magic Conch Shell</h1>\n<p>I want you to act as Spongebob s Magic Conch Shell For every question that I ask you only answer with one word or either one of these options Maybe someday I don t think so or Try asking again Don t give any explanation for your answer My first question is Shall I go to fish jellyfish today</p>",
 '<p>I want you to act as a spoken English teacher and improver I will speak to you in English and you will reply to me in English to practice my spoken English I want you to keep your reply neat limiting the reply to 100 words I want you to strictly correct my gramma

In [49]:
PROMPT_TEMPLATE="""
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

prompt_template = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)

In [50]:
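# Read the API key from the environment rather than hard-coding a secret in the notebook.
# This assumes the key has been exported beforehand as GROQ_API_KEY.
groq_api_key = os.environ["GROQ_API_KEY"]
llm = ChatGroq(temperature=0, groq_api_key=groq_api_key, model_name="llama3-8b-8192")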

In [51]:
MODEL = LLMChain(llm=llm,
                 prompt=prompt_template,
                 verbose=True)



In [52]:
def query_rag(query: str):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    context_text = "\n\n".join([doc.page_content for doc, _score in similarity_search_results])

    rag_response = MODEL.invoke({"context": context_text, "question": query})

    return rag_response
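
Before the LLM is called, the retrieved chunks are joined into a single context string and substituted into the prompt template. The sketch below reproduces that step with plain `str.format` and made-up chunks, so the final prompt can be inspected without an API call; `build_prompt` is a hypothetical helper, not part of LangChain.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

def build_prompt(chunks, question):
    """Join chunks with blank lines and fill the template placeholders."""
    context_text = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context_text, question=question)

# Made-up chunks standing in for retrieved documents.
chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_prompt(chunks, "What do the chunks say?")
print(prompt)
```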

In [65]:
response = query_rag("most people want")
response



Prompt after formatting:
Answer the question based only on the following context:
Context: it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>

<h1>Salesperson</h1>
<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>

<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea n

{'context': 'it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>\n\n<h1>Salesperson</h1>\n<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>\n\n<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea name a short one liner target user persona user s pain points to solve main value propos

In [67]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:
it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>

<h1>Salesperson</h1>
<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>

<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea name a short one liner target user persona user s pain points to solve main value propositions sa

In [68]:
query = "most people want"
similarity_search_results = db.similarity_search_with_score(query, k=4)

In [69]:
print("First: ", similarity_search_results[0][0].page_content)
print("Second: ", similarity_search_results[1][0].page_content)
print("Third: ", similarity_search_results[2][0].page_content)
print("Fourth: ", similarity_search_results[3][0].page_content)

First:  it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>
Second:  <h1>Salesperson</h1>
<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>
Third:  <p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea name a short one liner target user persona user s pain points to solve main value p

In [70]:
print(similarity_search_results[0][1])
print(similarity_search_results[1][1])
print(similarity_search_results[2][1])
print(similarity_search_results[3][1])

1.5244722366333008
1.5904130935668945
1.591017723083496
1.6067051734501714


In [71]:
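def query_rag_with_threshold(query: str, threshold: float):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    # Note: Chroma returns a distance as the score, so a lower score means a closer match.
    # As written, `score > threshold` keeps chunks whose distance exceeds the threshold;
    # use `score <= threshold` instead to keep only the closest chunks.
    context_text = "\n\n".join([doc.page_content for doc, score in similarity_search_results if score > threshold])
    rag_response = MODEL.invoke({"context": context_text, "question": query})
    return rag_response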
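
Chroma's `similarity_search_with_score` reports a distance, where lower means closer, so a threshold meant to keep only close matches would compare with `<=` rather than `>`. The sketch below applies such a filter to hypothetical `(text, distance)` pairs; the cutoff of 1.59 is an arbitrary value chosen for the demonstration, and whether any particular cutoff suits this dataset is left open.

```python
def filter_by_distance(results, max_distance):
    """Keep only results whose distance is at or below max_distance."""
    return [(text, dist) for text, dist in results if dist <= max_distance]

# Hypothetical (text, distance) pairs, nearest first.
results = [("chunk A", 1.52), ("chunk B", 1.59), ("chunk C", 1.61)]
kept = filter_by_distance(results, max_distance=1.59)
print(kept)  # [('chunk A', 1.52), ('chunk B', 1.59)]
```

If no chunk passes the filter, the context string is empty, so the caller may want to handle that case before invoking the model.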

In [72]:
response = query_rag_with_threshold("most people want", 0.80)
response



Prompt after formatting:
Answer the question based only on the following context:
Context: it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>

<h1>Salesperson</h1>
<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>

<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea n

{'context': 'it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>\n\n<h1>Salesperson</h1>\n<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>\n\n<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea name a short one liner target user persona user s pain points to solve main value propos

In [73]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:
it honestly but do not share much interest in questions outside of League of Legends If someone asks you a question that isn t about League of Legends at the end of your response try and loop the conversation back to the video game You have few desires in life besides playing the video game You play the jungle role and think you are better than everyone else because of it</p>

<h1>Salesperson</h1>
<p>I want you to act as a salesperson Try to market something to me but make what you re trying to market look more valuable than it is and convince me to buy it Now I m going to pretend you re calling me on the phone and ask what you re calling for Hello what did you call for</p>

<p>Generate digital startup ideas based on the wish of the people For example when I say I wish there s a big large mall in my small town you generate a business plan for the digital startup complete with idea name a short one liner target user persona user s pain points to solve main value propositions sa