# Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) combines information retrieval with language models. It first searches for relevant facts in external sources, then feeds those facts to the language model alongside the user's prompt. This helps the model generate more accurate and factual responses, even on topics beyond its initial training data.


---
## 1.&nbsp; Installations and Settings 🛠️
Let's get started by setting up a GPU on your Colab notebook. Head over to `Edit > Notebook Settings` and select `GPU` from the runtime type dropdown. Once you've made that change, click `Save`.

Now, we'll need to install two additional libraries to build our RAG model. These libraries will help us create and store numerical representations of our text, which are essential for this task.

1. **sentence_transformers:** This library will generate embeddings, which are like numerical summaries of our text. These embeddings will allow us to compare different sentences and identify relationships between them.

2. **faiss-gpu:** This library provides a fast and efficient database for storing and retrieving our numerical summaries.

> If you're using a CPU instead of a GPU, install `faiss-cpu` instead. This version will work just fine, but it may be a bit slower.

In [None]:
!pip3 install -qqq langchain --progress-bar off
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -qqq llama-cpp-python --progress-bar off

!pip3 install -qqq sentence_transformers --progress-bar off
!pip3 install -qqq faiss-gpu --progress-bar off

!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

---
## 2.&nbsp; Setting up your LLM 🧠
Here we'll add one extra parameter, `n_ctx`, which controls the size of the context window (measured in tokens). The larger the window, the more memory you're likely to use. The default of 512 is fine for most basic things, but as we are starting to retrieve articles and add them to the context window, we'll double the size to 1024. Feel free to adjust this number to suit your project.
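
To get a feel for how quickly a window fills up, here is a minimal budgeting sketch. It assumes, as is typical for llama.cpp-style models, that the window has to hold both the prompt and the tokens the model generates. The helper names `fits_in_context` and `remaining_tokens` are made up for this example, and the token counts are illustrative rather than measured.

```python
def fits_in_context(prompt_tokens: int, max_new_tokens: int, n_ctx: int) -> bool:
    """Return True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_new_tokens <= n_ctx


def remaining_tokens(prompt_tokens: int, n_ctx: int) -> int:
    """Tokens left for generation once the prompt is loaded (never negative)."""
    return max(n_ctx - prompt_tokens, 0)


# Illustrative numbers only: a 500-token prompt in a 1024-token window.
print(fits_in_context(prompt_tokens=500, max_new_tokens=400, n_ctx=1024))
print(fits_in_context(prompt_tokens=500, max_new_tokens=600, n_ctx=1024))
print(remaining_tokens(prompt_tokens=500, n_ctx=1024))
```

If the retrieved text plus the question regularly comes close to the limit, either raise `n_ctx` or retrieve fewer or smaller chunks.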

In [1]:
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path = "/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
               max_tokens = 2000,
               temperature = 0.1,
               top_p = 1,
               n_gpu_layers = -1,
               n_ctx = 1024)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                


### 2.1.&nbsp;  Test your LLM

In [2]:
answer_1 = llm.invoke("Write a poem about data science.")
print(answer_1)


llama_print_timings:        load time =    6789.94 ms
llama_print_timings:      sample time =      17.75 ms /   165 runs   (    0.11 ms per token,  9294.20 tokens per second)
llama_print_timings: prompt eval time =    6789.65 ms /     8 tokens (  848.71 ms per token,     1.18 tokens per second)
llama_print_timings:        eval time =   10219.85 ms /   164 runs   (   62.32 ms per token,    16.05 tokens per second)
llama_print_timings:       total time =   17278.71 ms /   172 tokens



Data Science is the future, it's where we're all headed,
With algorithms and models, predictions are guaranteed.
We take in information from all around,
And use it to create insights that astound.

From machine learning to deep learning too,
There's nothing that data science can't do.
It helps us make decisions with ease,
And find patterns that were once hard to see.

With data science, we can predict the past,
And even shape the future that's yet to come.
It's a field that's constantly evolving,
And with every new discovery, our world is improving.

So let's embrace this future that's bright,
And use data science to make everything right.


---
## 3.&nbsp; Retrieval Augmented Generation 🔃

### 3.1.&nbsp; Find your data
Our model needs some information to work its magic! In this case, we'll be using a copy of Alice's Adventures in Wonderland, but feel free to swap it out for anything you like: legal documents, school textbooks, websites – the possibilities are endless!

In [None]:
#run in the terminal
pip install wget

In [None]:
!wget -O alice_in_wonderland.txt https://www.gutenberg.org/cache/epub/11/pg11.txt

> If you're working locally, just download a txt book from Project Gutenberg. Here's the link to [Alice's Adventures in Wonderland](https://www.gutenberg.org/cache/epub/11/pg11.txt). Feel free to use any other book though.

### 3.2.&nbsp; Load the data
Now that we have the data, we have to load it in a format LangChain can understand. For this, LangChain has [loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/). There are loaders for CSV, text, PDF, and a host of other formats. You're not restricted to just text here.



In [3]:
from langchain.document_loaders import TextLoader

loader = TextLoader("/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/alice_in_wonderland.txt")
documents = loader.load()

### 3.3.&nbsp; Splitting the document
Obviously, a whole book is a lot to digest. This is made easier by [splitting](https://python.langchain.com/docs/modules/data_connection/document_transformers/) the document into chunks. You can split it by paragraphs, sentences, or even individual words, depending on what you want to analyse. LangChain offers different tools for this, like the `RecursiveCharacterTextSplitter` (say that five times fast!), which understand the structure of text and help you break it down into manageable chunks.

Check out [this website](https://chunkviz.up.railway.app/) to help visualise the splitting process.
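
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-width windows that share some characters with their neighbours. This is not how `RecursiveCharacterTextSplitter` works internally (it tries to break at paragraph, line, and word boundaries first); the function `chunk_with_overlap` is a made-up helper that only illustrates the two parameters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.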


In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800,
                                               chunk_overlap=150)

docs = text_splitter.split_documents(documents)

### 3.4.&nbsp; Creating vectors with embeddings

[Embeddings](https://python.langchain.com/docs/integrations/text_embedding) are a fancy way of saying we turn text into numbers that computers can work with. An embedding model maps each piece of text (a word, a sentence, or a whole chunk) to a list of numbers based on its meaning, so that texts with similar meanings end up with similar numbers. The list of numbers produced is known as a vector. Vectors allow us to compare text and find chunks that contain similar information.
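
One common way to compare two vectors is cosine similarity, which measures how closely they point in the same direction. The sketch below uses tiny hand-made vectors rather than real embeddings, and vector stores may use a different measure (such as Euclidean distance), so treat it as an illustration of the idea only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy two-dimensional vectors, invented for this example.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # somewhere in between
```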

Different embedding models encode words and meanings in different ways, and finding the right one can be tricky. We're using open-source models from Hugging Face, who even have a handy [leaderboard of embeddings](https://huggingface.co/spaces/mteb/leaderboard) on their website. Just browse the options and see which one speaks your language (literally!).
> As we are doing a retrieval project, click on the `Retrieval` tab of the leaderboard to see the best embeddings for retrieval tasks.

In [5]:
from langchain.embeddings import HuggingFaceEmbeddings
import os

embedding_model = "sentence-transformers/all-MiniLM-l6-v2"
current_folder = os.getcwd()
embeddings_folder = current_folder  # Using the current directory

embeddings = HuggingFaceEmbeddings(model_name=embedding_model, cache_folder=embeddings_folder)


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [6]:
current_folder = os.getcwd()
print(current_folder)

/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/class_notebooks


> The first time you use the embeddings model, it'll download all the necessary data. After that it runs locally on your machine. Unfortunately, Colab is a different story – its sessions end, so the model needs to download again each time.

To see how an embedding model transforms a sentence into a vector, let's look at an example:

In [7]:
test_text = "Why do data scientists make great comedians? They're always trying to make ANOVA pun"
query_result = embeddings.embed_query(test_text)
query_result

[0.009409726597368717,
 -0.023806331679224968,
 -0.012127568945288658,
 0.03612378612160683,
 -0.03382447361946106,
 -0.07974186539649963,
 0.07004600018262863,
 0.0746554359793663,
 0.04014184698462486,
 0.04419058933854103,
 -0.007505984045565128,
 -0.060012150555849075,
 -0.10028237104415894,
 0.032309673726558685,
 -0.039546284824609756,
 0.016906695440411568,
 -0.0303138867020607,
 -0.12780103087425232,
 -0.03218212351202965,
 -0.07546596229076385,
 7.660531264264137e-05,
 0.05085941031575203,
 0.12591633200645447,
 -0.04004546254873276,
 0.040401391685009,
 -0.022957686334848404,
 -0.07265669852495193,
 -0.02543501742184162,
 -0.01982485130429268,
 0.011819676496088505,
 -0.035723403096199036,
 0.036577362567186356,
 0.07559998333454132,
 0.03425060212612152,
 -0.05330543592572212,
 -0.030826333910226822,
 0.02147846296429634,
 0.12243172526359558,
 -0.0054456680081784725,
 0.04834012687206268,
 -0.004316539969295263,
 -0.043691281229257584,
 0.009050648659467697,
 0.027110163122

In [8]:
characters = len(test_text)
dimensions = len(query_result)
print(f"The {characters} character sentence was transformed into a {dimensions} dimension vector")

The 84 character sentence was transformed into a 384 dimension vector


Embedding vectors have a fixed length: every vector produced by this particular embedding model has 384 dimensions, no matter how long the input text is. Choosing an embedding size involves a trade-off between accuracy and computational efficiency. Larger embeddings can capture more semantic information but require more memory and processing power. Start with the provided MiniLM embedding as a baseline and experiment with other models to find the right balance for your needs.
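
The memory side of that trade-off is simple arithmetic. The sketch below assumes each value is stored as a 4-byte float and ignores any overhead added by the index itself; the chunk count of 1,000 and the 1024-dimension comparison model are hypothetical.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage needed for the vectors alone (assumes 4-byte floats by default)."""
    return num_vectors * dimensions * bytes_per_value


small = index_size_bytes(num_vectors=1_000, dimensions=384)    # MiniLM-sized vectors
large = index_size_bytes(num_vectors=1_000, dimensions=1_024)  # a hypothetical larger model

print(f"384 dimensions:  {small / 1e6:.2f} MB")
print(f"1024 dimensions: {large / 1e6:.2f} MB")
```

For a single book the difference is negligible, but it grows linearly with the number of chunks you store.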

### 3.5.&nbsp; Creating a vector database
Imagine a library where books aren't just filed alphabetically, but also by their themes, characters, and emotions. That's the magic of vector databases: they unlock information beyond keywords, connecting ideas in unexpected ways.

In [9]:
from langchain.vectorstores import FAISS

vector_db = FAISS.from_documents(docs, embeddings)

Once the database is made, you can save it to use over and over again in the future.

In [11]:
vector_db.save_local("/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/vector")

Here's the code to load it again.

We'll leave it commented out here as we don't need it right now - it's already stored above in the variable `vector_db`.

In [None]:
# new_db = FAISS.load_local("/content/faiss_index", embeddings)

You can also search your database to see which vectors are close to your input.
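
Conceptually, a similarity search ranks the stored vectors by their distance to the query vector and returns the closest few. The brute-force sketch below shows the idea with made-up two-dimensional vectors; a real index such as FAISS uses optimised data structures, and `top_k` is a hypothetical helper, not part of any library.

```python
def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Return the positions of the k stored vectors closest to the query."""

    def squared_distance(a: list[float], b: list[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank every stored vector by distance, then keep the k nearest.
    ranked = sorted(range(len(vectors)), key=lambda i: squared_distance(query, vectors[i]))
    return ranked[:k]


stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
print(top_k([1.0, 1.0], stored, k=2))
```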

In [12]:
vector_db.similarity_search("What does the Mad Hatter drink?")

[Document(page_content='CHAPTER VII.\nA Mad Tea-Party\n\n\nThere was a table set out under a tree in front of the house, and the\nMarch Hare and the Hatter were having tea at it: a Dormouse was sitting\nbetween them, fast asleep, and the other two were using it as a\ncushion, resting their elbows on it, and talking over its head. “Very\nuncomfortable for the Dormouse,” thought Alice; “only, as it’s asleep,\nI suppose it doesn’t mind.”\n\nThe table was a large one, but the three were all crowded together at\none corner of it: “No room! No room!” they cried out when they saw\nAlice coming. “There’s _plenty_ of room!” said Alice indignantly, and\nshe sat down in a large arm-chair at one end of the table.\n\n“Have some wine,” the March Hare said in an encouraging tone.', metadata={'source': '/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/alice_in_wonderland.txt'}),
 Document(page_content='“I’d rather finish my tea,” said the Hatte

### 3.6.&nbsp; Adding a prompt
We can guide our model's behavior with a prompt, similar to how we gave instructions to the chatbot. We'll use specific tags in the prompt to tell the model what to do. These tags, `<s> </s>` and `[INST] [/INST]`, come straight from the [model's documentation on Hugging Face](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF). They can also be found in [Mistral's docs](https://docs.mistral.ai/models/). Different models have different expectations for prompts, so always check the documentation.

In [14]:
from langchain.prompts.prompt import PromptTemplate

input_template = """
<s>
[INST] Answer the question based only on the following context: [/INST]
{context}
</s>
[INST] Question: {question} [/INST]
"""

prompt = PromptTemplate(template=input_template,
                        input_variables=["context", "question"])
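
To see roughly what the model will receive, the placeholders can be filled by hand with plain `str.format`. This is only a sketch of the substitution step: the context sentence is a shortened stand-in for a retrieved chunk, and the chain may assemble its final prompt slightly differently.

```python
input_template = """
<s>
[INST] Answer the question based only on the following context: [/INST]
{context}
</s>
[INST] Question: {question} [/INST]
"""

# Stand-in values: in the real chain the context comes from the vector database.
filled = input_template.format(
    context="The March Hare and the Hatter were having tea.",
    question="What does the Mad Hatter drink?",
)
print(filled)
```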

### 3.7.&nbsp; RAG - chaining it all together
This is the final piece of the puzzle: we now bring everything together in a chain. Our vector database, our prompt, and our LLM join to give us retrieval augmented generation.
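
Before handing the work to a chain, it can help to see the steps spelled out. The sketch below is a library-free approximation of what a retrieval chain does: fetch the top `k` chunks, paste them into the prompt, and pass the prompt to the model. Everything in it is a stand-in: `toy_retrieve` ranks by shared words instead of embeddings, `toy_generate` returns a fixed string instead of calling an LLM, and `answer_with_context` is a hypothetical name.

```python
from typing import Callable

TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"

CORPUS = [
    "The Queen shouted: Off with their heads!",
    "The Hatter and the March Hare were having tea.",
    "Alice fell down the rabbit hole.",
]


def toy_retrieve(question: str, k: int) -> list[str]:
    """Stand-in retriever: rank chunks by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(chunk: str) -> int:
        return len(question_words & set(chunk.lower().split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def toy_generate(prompt: str) -> str:
    """Stand-in for the LLM call: a real model would read the prompt and answer."""
    return "stub answer"


def answer_with_context(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    k: int = 2,
) -> dict:
    chunks = retrieve(question, k)          # 1. retrieve the most relevant chunks
    context = "\n\n".join(chunks)           # 2. combine them into one context block
    prompt = TEMPLATE.format(context=context, question=question)  # 3. fill the prompt
    return {"query": question, "result": generate(prompt), "source_documents": chunks}


toy_answer = answer_with_context("Who likes to chop off heads?", toy_retrieve, toy_generate)
print(toy_answer["source_documents"][0])
```

The returned dictionary deliberately uses the same three keys as the chain's output, so the exploration further down applies to both.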

In [15]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(search_kwargs={"k": 2}), # top 2 results only, speed things up
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

In [16]:
answer = qa_chain.invoke("Who likes to chop off heads?")

answer

Llama.generate: prefix-match hit

llama_print_timings:        load time =    6789.94 ms
llama_print_timings:      sample time =      15.42 ms /    50 runs   (    0.31 ms per token,  3242.75 tokens per second)
llama_print_timings: prompt eval time =   16358.67 ms /   472 tokens (   34.66 ms per token,    28.85 tokens per second)
llama_print_timings:        eval time =    3230.55 ms /    49 runs   (   65.93 ms per token,    15.17 tokens per second)
llama_print_timings:       total time =   19998.08 ms /   521 tokens


{'query': 'Who likes to chop off heads?',
 'result': 'The Queen in the story "Alice\'s Adventures in Wonderland" likes to chop off heads. She orders her guards to execute two gardeners who were tending to a rose-tree by saying, "Off with their heads!"',
 'source_documents': [Document(page_content='“Leave off that!” screamed the Queen. “You make me giddy.” And then,\nturning to the rose-tree, she went on, “What _have_ you been doing\nhere?”\n\n“May it please your Majesty,” said Two, in a very humble tone, going\ndown on one knee as he spoke, “we were trying—”\n\n“_I_ see!” said the Queen, who had meanwhile been examining the roses.\n“Off with their heads!” and the procession moved on, three of the\nsoldiers remaining behind to execute the unfortunate gardeners, who ran\nto Alice for protection.\n\n“You shan’t be beheaded!” said Alice, and she put them into a large\nflower-pot that stood near. The three soldiers wandered about for a\nminute or two, looking for them, and then quietly marc

In [20]:
print(qa_chain.invoke("Whose head does she chop off?")['result'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =    6789.94 ms
llama_print_timings:      sample time =       2.49 ms /    26 runs   (    0.10 ms per token, 10441.77 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1949.50 ms /    26 runs   (   74.98 ms per token,    13.34 tokens per second)
llama_print_timings:       total time =    1990.39 ms /    27 tokens


The Queen orders the executioner to chop off the heads of the two gardeners who were tending to the roses.


#### 3.7.1.&nbsp; Exploring the returned dictionary

In [16]:
answer.keys()

dict_keys(['query', 'result', 'source_documents'])

##### `query`

The question that we asked.

In [17]:
answer['query']

'Who likes to chop off heads?'

##### `result`

The response.

In [18]:
answer['result']

'The Queen in "Alice\'s Adventures in Wonderland" likes to chop off heads. She orders her guards to behead two gardeners who were tending to a rose tree.'

In [19]:
print(answer['result'])

The Queen in "Alice's Adventures in Wonderland" likes to chop off heads. She orders her guards to behead two gardeners who were tending to a rose tree.


##### `source_documents`

The retrieved chunks that were used to form the response.

In [20]:
answer['source_documents']

[Document(page_content='“Leave off that!” screamed the Queen. “You make me giddy.” And then,\nturning to the rose-tree, she went on, “What _have_ you been doing\nhere?”\n\n“May it please your Majesty,” said Two, in a very humble tone, going\ndown on one knee as he spoke, “we were trying—”\n\n“_I_ see!” said the Queen, who had meanwhile been examining the roses.\n“Off with their heads!” and the procession moved on, three of the\nsoldiers remaining behind to execute the unfortunate gardeners, who ran\nto Alice for protection.\n\n“You shan’t be beheaded!” said Alice, and she put them into a large\nflower-pot that stood near. The three soldiers wandered about for a\nminute or two, looking for them, and then quietly marched off after the\nothers.\n\n“Are their heads off?” shouted the Queen.', metadata={'source': '/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/alice_in_wonderland.txt'}),
 Document(page_content='“In my youth,” said the sage, as he shook his grey locks,\n    

In [21]:
answer['source_documents'][0]

Document(page_content='“Leave off that!” screamed the Queen. “You make me giddy.” And then,\nturning to the rose-tree, she went on, “What _have_ you been doing\nhere?”\n\n“May it please your Majesty,” said Two, in a very humble tone, going\ndown on one knee as he spoke, “we were trying—”\n\n“_I_ see!” said the Queen, who had meanwhile been examining the roses.\n“Off with their heads!” and the procession moved on, three of the\nsoldiers remaining behind to execute the unfortunate gardeners, who ran\nto Alice for protection.\n\n“You shan’t be beheaded!” said Alice, and she put them into a large\nflower-pot that stood near. The three soldiers wandered about for a\nminute or two, looking for them, and then quietly marched off after the\nothers.\n\n“Are their heads off?” shouted the Queen.', metadata={'source': '/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/alice_in_wonderland.txt'})

In [22]:
answer['source_documents'][0].page_content

'“Leave off that!” screamed the Queen. “You make me giddy.” And then,\nturning to the rose-tree, she went on, “What _have_ you been doing\nhere?”\n\n“May it please your Majesty,” said Two, in a very humble tone, going\ndown on one knee as he spoke, “we were trying—”\n\n“_I_ see!” said the Queen, who had meanwhile been examining the roses.\n“Off with their heads!” and the procession moved on, three of the\nsoldiers remaining behind to execute the unfortunate gardeners, who ran\nto Alice for protection.\n\n“You shan’t be beheaded!” said Alice, and she put them into a large\nflower-pot that stood near. The three soldiers wandered about for a\nminute or two, looking for them, and then quietly marched off after the\nothers.\n\n“Are their heads off?” shouted the Queen.'

In [23]:
print(answer['source_documents'][0].page_content)

“Leave off that!” screamed the Queen. “You make me giddy.” And then,
turning to the rose-tree, she went on, “What _have_ you been doing
here?”

“May it please your Majesty,” said Two, in a very humble tone, going
down on one knee as he spoke, “we were trying—”

“_I_ see!” said the Queen, who had meanwhile been examining the roses.
“Off with their heads!” and the procession moved on, three of the
soldiers remaining behind to execute the unfortunate gardeners, who ran
to Alice for protection.

“You shan’t be beheaded!” said Alice, and she put them into a large
flower-pot that stood near. The three soldiers wandered about for a
minute or two, looking for them, and then quietly marched off after the
others.

“Are their heads off?” shouted the Queen.


The document's name is also returned, which is useful if you have multiple documents!
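
With several documents in the database, you might want a short list of the files an answer drew on. The sketch below uses a small stand-in class, `Doc`, that mimics only the `page_content` and `metadata` attributes shown above; the second file name is hypothetical, and `unique_sources` is a made-up helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a returned document (not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(documents: list[Doc]) -> list[str]:
    """Collect each source once, in the order it first appears."""
    seen: list[str] = []
    for doc in documents:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen


example_docs = [
    Doc("Off with their heads!", {"source": "alice_in_wonderland.txt"}),
    Doc("In my youth, said the sage", {"source": "alice_in_wonderland.txt"}),
    Doc("A chunk from another book", {"source": "looking_glass.txt"}),
]
print(unique_sources(example_docs))
```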

In [24]:
answer['source_documents'][0].metadata

{'source': '/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/alice_in_wonderland.txt'}

In [25]:
answer['source_documents'][0].metadata["source"]

'/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/alice_in_wonderland.txt'