# Improving Fine-tuned Model using RAG



## Stages within RAG


1. **Loading:-**
* Nodes and Documents: Imagine each of your papers as a piece of data. A “Document” is like a big container where we put all these papers together.
* Connectors: These are like special tools we use to pick up papers from different places and put them into our big box

2. **Indexing and Embedding:-** <br>
We want to make it easy to find the right paper from the Document when we need it.
* Indexes: Think of indexes like a big filing system for our papers. We need a way to organize all the papers so that when we want to find one, we can do it quickly
*  Embeddings: LLMs generate numerical representations of data called embeddings. When filtering your data for relevance, LlamaIndex will convert queries into embeddings, and your vector store will find data that is numerically similar to the embedding of your query.

3. **Storing:-**
* Storing your index
* Other metadata: Besides just the filing system, there might be other important information about each paper that you want to keep track of, for example you might want to remember when you last accessed a paper, who wrote it, or how relevant it is to certain topics. All this extra information is called metadata.
4. **Querying:-** <br> Querying is like asking a question to find the relevant context you need based on the query. You might ask, “Do you have any papers about space exploration?”
* Retrievers: Each retriever knows a different way to look for context. For example, one retriever might quickly find papers based on keywords, while another might focus on finding papers written by specific authors with different efficiency and accuracy
* Routers: A router is like a manager who decides which retriever to assign to help you based on your query. They use a selector to choose the best option based on each candidate's metadata and the query
* Node Postprocessors:  A Node Postprocessor helps to organize the results obtained from retiriever. It might rearrange the context based on relevance, filter out any irrelevant ones, or even add additional information to help you better understand the results.
* Response Synthesizers: Finally, once you have a set of contexts that match your query, a response synthesizer takes all this information and presents it to you in a clear and understandable way. In other words, response synthesizer generates a response from an LLM, using a user query and a given set of retrieved text chunks.
5. Evaluation: Checking how well something works compared to other options base don factors like accuracy, faithfullness and speed.

For more info checkout these resources:
- https://docs.llamaindex.ai/en/stable/getting_started/concepts/
- https://medium.com/@aneesha161994/question-answering-in-rag-using-llama-index-92cfc0b4dae3
- https://medium.com/@aneesha161994/part-2-llama-index-question-answering-in-rag-b174fd05c371

### How does RAG works?
 Your data is loaded and prepared, or “indexed,” so that it can be quickly searched. When a user makes a query, the index filters your data to find the most relevant information. This filtered context, along with the user’s query, is then sent to the LLM along with a prompt. The LLM uses this information to generate a response.

Understanding RAG techniques is important for integrating your data into applications, even if you’re building something like a chatbot or an agent<br>
<center><img src='https://drive.google.com/uc?id=1wOx96tgYfNaWOwY6ut3IhROflPaiEiIq' width="600" height="300"></center>


How do we best augment LLMs with our own private data?<br>
We need a comprehensive toolkit to help perform this data augmentation for LLMs.
That's where **LlamaIndex** comes in.<br>
**LlamaIndex is a "data framework" to help you build LLM apps**. It provides the following tools:

- Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.).
- Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
- Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
- Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).

### imports

In [1]:
!pip install llama-index # A starter Python package that includes core LlamaIndex as well as a selection of integrations.
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes

Collecting llama-index
  Downloading llama_index-0.10.50-py3-none-any.whl (6.8 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.2.7-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.12-py3-none-any.whl (26 kB)
Collecting llama-index-core==0.10.50 (from llama-index)
  Downloading llama_index_core-0.10.50-py3-none-any.whl (15.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.4/15.4 MB[0m [31m36.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.10-py3-none-any.whl (6.2 kB)
Collecting llama-index-indices-managed-llama-cloud>=0.2.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.2.1-py3-none-any.whl (9.1 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_le

* The LlamaIndex Python library is namespaced such that import statements which
include core imply that the core package is being used. <br>
* In contrast, those statements without core imply that an integration package is being used

In [2]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding              #LlamaIndex integration packages from LlamaHub
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex   #rest are LlamaIndex Core Packages
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

### Define Settings

#### Embed Model
The embedding model is used to convert text to numerical representationss, used for calculating similarity and top-k retrieval.

The Settings is a bundle of commonly used resources where local configurations (transformations, LLMs, embedding models) can be passed directly into the interfaces that make use of them. Refer https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/

In [3]:
# import any embedding model on HF hub (https://huggingface.co/spaces/mteb/leaderboard)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Settings.embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large") # alternative model

Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

LLM is explicitly disabled. Using MockLLM.


### Read and Store Docs into Vector DB
The SimpleDirectoryReder is a foundational tool within LlamaIndex for loading data from local files.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!ls /content/drive/MyDrive

 133762266-money-transfer-10082022.pdf		       'Machine Learning Resources.gdoc'
 210070013-1.pdf				       'ML Tech Resume.pdf'
 articles					        Prime_between_1_and_200.ipynb
 Classroom					        Resume
'Colab Notebooks'				       'Resume_new (1).pdf'
 Computer_Vision_Notebooks			       'Resume_new (2).pdf'
 deep_learning					       'Resume_new (3).pdf'
'Document from Ashwin Nagarwal.pdf'		        Resume_new.pdf
'dynamic-selection-main (1)'			       'Schedule (1).gsheet'
'dynamic-selection-main (1)-20240504T185618Z-001.zip'   Schedule.gsheet
'Endsem Schedule.xlsx'				       'Screenshot (356).png'
'Heat Map.png'					       'Shapefiles and base map'
 Homestays_Data.xlsx


In [6]:
# articles available here: {add GitHub repo}
documents = SimpleDirectoryReader("/content/drive/MyDrive/articles").load_data()

Observations:
- All 3 files in the article folder has total 71 pages combined.
- So documents is a list which contain 71 list elements where each element corresponds to one page of text in the article folder files.
- Also document id is different for each page/list element

In [7]:
documents

[Document(id_='9b1ffdfc-44d2-44f6-923e-b5061ef3d9c5', embedding=None, metadata={'page_label': '1', 'file_name': '4 Ways to Quantify Fat Tails with Python _ by Shaw Talebi _ Towards Data Science.pdf', 'file_path': '/content/drive/MyDrive/articles/4 Ways to Quantify Fat Tails with Python _ by Shaw Talebi _ Towards Data Science.pdf', 'file_type': 'application/pdf', 'file_size': 1795379, 'creation_date': '2024-06-28', 'last_modified_date': '2024-06-28'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Member-only story\n4 Ways to Quantify Fat Tails with\nPython\nIntuition and Example Code\nShaw Talebi\nPublished inTowards Data Science·11 min read·Dec 7, 2023\n200 8\nA fat (cat’s) tail. Image from Canva.\nOpen in app\nSearch Write\n', mimetype='text/plain', s

In [8]:
# some ad hoc document refinement
print(len(documents))
for doc in documents:
    if "Member-only story" in doc.text:  #any page or list element with this text is removed from document
        documents.remove(doc)
        continue

    if "The Data Entrepreneurs" in doc.text:
        documents.remove(doc)

    if " min read" in doc.text:
        documents.remove(doc)

print(len(documents))

71
61


### **Indexing**

---


With your data loaded, you now have a list of Document objects (or a list of Nodes). It's time to build an Index over these objects so you can start querying them.

### What is an Index?#
In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM. Your Index is designed to be complementary to your querying strategy.

LlamaIndex offers several different index types.One of these is Vector Store Index
### Vector Store Index#
A VectorStoreIndex is by far the most frequent type of Index you'll encounter. The Vector Store Index takes your Documents and splits them up into Nodes. It then creates vector embeddings of the text of every node, ready to be queried by an LLM.

#### What is an embedding?#
Vector embeddings are central to how LLM applications function.

A vector embedding, often just called an embedding, is a numerical representation of the semantics, or meaning of your text. Two pieces of text with similar meanings will have mathematically similar embeddings, even if the actual text is quite different !!!

In [9]:
# store docs into vector DB
index = VectorStoreIndex.from_documents(documents)

### Set Up Search Function

In [10]:
# set number of docs to retreive
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(                                #retrieves similarity top k
    index=index,
    similarity_top_k=top_k,
)

### What is Query Engine?

A Query Engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query, and returns a response, along with reference context retrieved and passed to the LLM.

In [11]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)

### Retrieve Relevant Docs

In [12]:
# query documents
query = "What is fat-tailedness?"
response = query_engine.query(query)

In [13]:
# reformat response
context = "Context:\n"
for i in range(top_k):
    context = context + response.source_nodes[i].text + "\n\n"

print(context)

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape [2].

Pareto, Power Laws, and Fat Tails
What they don’t teach you in statistics
towardsdata

### Import LLM

In [14]:
# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/4.16G [00:00<?, ?B/s]

Some weights of the model checkpoint at TheBloke/Mistral-7B-Instruct-v0.2-GPTQ were not used when initializing MistralForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/8.40M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

## Use LLM


### Response without RAG

In [18]:
# prompt (with only comment/query,no context)
intstructions_string = f"""ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {intstructions_string} \n{comment} \n[/INST]'''
comment = "What is fat-tailedness?"

prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
What is fat-tailedness? 
[/INST]


In [19]:
model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
What is fat-tailedness? 
[/INST]
Great question!

Fat-tailedness is a statistical property of a distribution. In simple terms, it refers to the presence of extreme outliers or heavy tails in the distribution.

For instance, consider the distribution of heights in a population. A normal distribution would have most people clustered around an average height with a few people deviating slightly from the mean. However, in a fat-tailed distribution, you would observe a larger number of people being

### Response with Context

In [20]:
# prompt (with comment/query and context)
prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""

In [21]:
prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
