<a href="https://colab.research.google.com/github/EdBerg21/AI-Based-Fraud-Detection/blob/main/rag_using_llama_3_langchain_and_chromadb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Use Llama 3, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This allows us to ask questions about our own documents (which were not included in the training data) without fine-tuning the Large Language Model (LLM).
With RAG, a question first triggers a retrieval step that fetches the relevant documents from a vector database where they were indexed; the retrieved text is then passed to the LLM together with the question.

## Definitions

* LLM - Large Language Model  
* Llama 3 - LLM from Meta (this notebook loads `meta-llama/Meta-Llama-3-8B-Instruct`)
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details about RAG)

## Model details

* **Model**: Llama 3  
* **Variation**: 8B-Instruct (8B: 8 billion parameters; Instruct: instruction-tuned), loaded from the Hugging Face Hub as `meta-llama/Meta-Llama-3-8B-Instruct`  
* **Version**: V1  
* **Framework**: PyTorch  

Llama 3 was pretrained on more than 15 trillion tokens and was released in 8 billion and 70 billion parameter sizes, which makes it one of the most capable openly available model families. It is a significant improvement over Llama 2.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have proven their ability to understand context and provide accurate answers on various NLP tasks, including summarization and Q&A, when prompted. While they give very good answers to questions about information they were trained on, they tend to hallucinate when the topic is something they do "not know", i.e. something that was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The two main components of a RAG system are therefore a retriever and a generator.  

The retriever is a system that encodes our data so that the relevant parts can be retrieved easily when we query it. The encoding uses text embeddings, i.e. a model trained to create a vector representation of the information. A common way to implement a retriever is a vector database. There are multiple options, both open source and commercial; a few examples are ChromaDB, Milvus, FAISS, Pinecone and Weaviate. In this Notebook we use a local, persistent instance of ChromaDB.
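
The retrieval idea can be sketched without any vector database. The example below is a minimal illustration, not what ChromaDB does internally: it assumes a toy bag-of-words "embedding" in place of a trained embedding model, and the helper names (`embed`, `cosine_similarity`, `retrieve`) and the mini-corpus are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real system uses a trained model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus, for illustration only.
chunks = [
    "ChromaDB stores embeddings on local disk.",
    "The generator is a large language model.",
    "Text is split into chunks before indexing.",
]
top = retrieve("Where does ChromaDB store embeddings?", chunks, k=1)
print(top[0])
```

A vector database plays the role of `retrieve` here, but with learned dense embeddings and an index that avoids comparing the query against every chunk.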

For the generator part, the obvious option is an LLM. In this Notebook we use a Llama 3 model loaded from the Hugging Face Hub and quantized to 4 bits with `bitsandbytes`.  

The orchestration of the retriever and generator is done with Langchain. A specialized function from Langchain allows us to create the retriever-generator chain in one line of code.

## More about this  

Do you want to learn more? Look into the `References` section for blog posts and in `More work on the same topic` for Notebooks about the technologies used here.

# Installations, imports, utils

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12


In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [3]:
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
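
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is a sketch under stated assumptions: the parameter count is rounded to 8 billion from the "8B" in the model name, and the estimate ignores quantization constants, activations, the KV cache and any layers left unquantized. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: about 8 billion parameters, rounded from the "8B" in the model name.
N_PARAMS = 8_000_000_000

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the 16-bit footprint, which is what lets the model fit on a single mid-range GPU.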

Prepare the model and the tokenizer.

In [4]:
# Read the token from the HF_TOKEN environment variable instead of hard-coding a secret in the notebook.
!huggingface-cli login --token $HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Prepare model, tokenizer: 189.43 sec.


Define the query pipeline.

In [6]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 1.515 sec.


We define a function for testing the pipeline.

In [7]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a prompt through the pipeline, then print the timing and the generated text.
    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

## Test the query pipeline

We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [8]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Test inference: 20.606 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words. The State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, typically held in January or February. The speech reports on the condition of the country, highlighting achievements and challenges, and outlines the administration's legislative priorities and policy initiatives for the upcoming year. It is a key event in the US political calendar, often used to set the tone for the year ahead. Source: Wikipedia. 100 words. Definition. State of the Union address. United States. President. Congress. Annual speech. Reports on the condition of the country. Highlights achievements and challenges. Outlines policy initiatives. Key event in the US political calendar. Sets the tone for the year ahead. Source: Wikipedia. 100 words. Definition. State of the Union address. United States. President. Cong
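
The result above starts with the prompt itself, because the text-generation pipeline returns the full text (prompt plus continuation) by default, and `max_length=200` counts the prompt tokens as well as the generated ones. With `transformers`, passing `return_full_text=False` and `max_new_tokens` in the pipeline call is the usual way to change this. The sketch below shows the same clean-up in plain Python; `strip_prompt` is a hypothetical helper and the sample strings are made up for illustration.

```python
def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from the start of a generation, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


prompt = "Give just a definition."
# Hypothetical pipeline output that echoes the prompt before the answer.
generated = "Give just a definition. The State of the Union address is an annual speech."
answer = strip_prompt(generated, prompt)
print(answer)
```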

# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We wrap the query pipeline in Langchain's `HuggingFacePipeline` and check it with the same query about the meaning of State of the Union (SOTU).

In [9]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


" The State of the Union address is a speech delivered by the President of the United States to a joint session of Congress, typically held in the House of Representatives. The address provides an overview of the state of the nation, highlighting the President's accomplishments and goals for the future. It is a key event in the annual legislative calendar, allowing the President to address the nation and Congress, and to outline their legislative agenda for the upcoming year. It is usually delivered in January or February, following the start of a new congressional session. The speech is typically followed by a response from the Leader of the Opposition. The State of the Union address is a tradition dating back to 1790, and is considered a significant event in American politics. It is often seen as an opportunity for the President to set the tone for the upcoming year and to make a call to action for the nation. The address is typically broadcast live on television and is widely covere

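## Ingestion of data using Text loader

We ingest a local plain-text file, `/content/baha.txt`. As the retrieved chunks further below show, it is a prayer text rather than a State of the Union address.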

In [10]:

loader = TextLoader("/content/baha.txt",
                    encoding="utf8")
documents = loader.load()

In [11]:
#loader = TextLoader("/content/baha1.txt",
#                    encoding="utf8")
#documents = loader.load()

## Split data into chunks

We split the data into chunks using a recursive character text splitter.

In [12]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
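
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to break on separators such as paragraph breaks, newlines and spaces; the helper `split_with_overlap` is hypothetical and works on raw characters only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(pieces)
```

Each piece repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears partly in both neighbours.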

## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [13]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)




Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [14]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [15]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
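
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt together with the question, so their combined length has to fit in the model's context window. The sketch below illustrates the idea only; `build_stuff_prompt` is a hypothetical helper, and the template wording is an assumption that differs from Langchain's built-in prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    # Hypothetical template; Langchain's built-in wording differs.
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, for illustration only.
chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuff_prompt("What is the document about?", chunks)
print(prompt)
```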

## Test the Retrieval-Augmented Generation


We define a test function that runs the query and times it.

In [16]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check a few queries.

In [17]:
query = "What were the main topics in the document? Summarize. Keep it over 500 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: What were the main topics in the document? Summarize. Keep it over 500 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 112.377 sec.

Result:   The main topics in this document are the supplication of the author to God, the majesty and power of God, the rejection of the Promised One, and the importance of following the Revelation.

The author begins by asking God to reveal the things that have been destined for humanity and to preserve them within the stronghold of His guardianship and the tabernacle of His unerring protection. The author implores God to uncover the majesty of His Cause before humanity, lest they doubt His sovereignty and the power of His might.

The author then swears by the glory of God, stating that if humanity had been aware of God's power, they would have refused to utter what God did not ordain for them in the heaven of His will. The author beseeches God to make known unto him what lay hidden in the treasurie

In [21]:
query = "What is mentioned about desire to spread the message of the Cause in the document? Summarize the main ideas. Keep it over 700 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: What is mentioned about desire to spread the message of the Cause in the document? Summarize the main ideas. Keep it over 700 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 28.584 sec.

Result:   The document does not explicitly mention a desire to spread the message of the Cause. Instead, it emphasizes the recognition of the unity of God, the powerlessness of human beings, and the importance of clinging to the hem of God's mercy. It also asks for assistance to attain the rewards ordained for those who have served the Cause, and to be counted among those who have rid themselves of everything except God.

The document is a prayer, seeking God's assistance, guidance, and mercy. It does not contain a clear call to spread the message of the Cause or to gather others around it. The main ideas revolve around the recognition of God's unity, the limitations of human beings, and the need for divine assistance.

The document's language is 

In [22]:

query = "What were the main topics about God in the document? Summarize. Keep it over 500 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: What were the main topics about God in the document? Summarize. Keep it over 500 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 53.149 sec.

Result:   The main topics about God in this document are:

1. God's unity and oneness: The speaker repeatedly emphasizes God's unity and oneness, using phrases such as "Thou seest me, O my God, holding to Thy Name, the Most Holy, the Most Luminous, the Most Mighty, the Most Great, the Most Exalted, the Most Glorious" and "Our limbs, our members, O my Lord, bear witness to Thy unity and oneness."
2. God's power and might: The speaker asks God to send down His strength and power upon them, and to make known unto them what lay hid in the treasuries of His knowledge and concealed within the repositories of His wisdom.
3. God's sovereignty and will: The speaker acknowledges God's sovereignty and will, stating that God's power is to do what He pleases, and that God's decree is irrevocable.
4. God'

In [23]:

query = "What were the main topics about wisdom in the document? Summarize. Keep it over 1000 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: What were the main topics about wisdom in the document? Summarize. Keep it over 1000 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 339.391 sec.

Result:   The main topics about wisdom in the document are the manifold wisdom of God, the wonders of God's wisdom, the preservation of the faithful, and the sanctification of the servants of God. The document also mentions the importance of clinging to God's name and the importance of recognizing God's oneness.

The document begins by expressing the writer's desire to be washed clean of anything that is not of God and to be delivered from God's enemies who have disbelieved in God's signs. The writer then asks God to keep the faithful safe from the tyranny of the oppressor and the wickedness of the evil-doers and to shield them within the stronghold of God's might and power.

The writer then expresses their desire to be sanctified by God's name, which is blessed above all other names, a

In [24]:

query = "How to know the content of the document is true. What were the main topics about true in the document? Summarize abou true. Keep it over 650 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: How to know the content of the document is true. What were the main topics about true in the document? Summarize abou true. Keep it over 650 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 47.848 sec.

Result:   The document appears to be a prayer or a supplication written by a believer to the Supreme Being, addressing the Creator, and seeking guidance and wisdom. The main topics discussed in the document are:

1. The majesty and power of God: The document begins by extolling the glory and majesty of God, describing the various manifestations of His power and wisdom.
2. The limitations of human knowledge: The writer acknowledges the limitations of human understanding, stating that even the learned and wise have admitted their ignorance in the face of God's majesty.
3. The nature of God's revelation: The document mentions the concept of revelation, describing the manifestation of God's will and the appearance of the Promised One.
4

In [25]:

query = "Please! Count how many times each name and atribute of God appears. Keep it over 650 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: Please! Count how many times each name and atribute of God appears. Keep it over 650 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 49.369 sec.

Result:   To answer your request, I have carefully counted the occurrences of each name and attribute of God in the passage provided. Here is the result:

1. God: 9
2. Thee: 11
3. O my God: 4
4. Most Holy: 2
5. Most Luminous: 2
6. Most Mighty: 2
7. Most Great: 2
8. Most Exalted: 2
9. Most Glorious: 2
10. the Most Holy, the Most Luminous, the Most Mighty, the Most Great, the Most Exalted, the Most Glorious: 1
11. the King of Names: 1
12. the Most Great Name: 1
13. the Most Excellent Titles: 1
14. the Most August Attributes: 1
15. the King of all names: 1
16. the Mighty: 1
17. the Wise: 1
18. the God of glory: 1
19. the God of power: 1
20. the God of knowledge and wisdom: 1
21. the Great Giver: 1
22. the All-Bountiful: 1
23. the Almighty: 1
24. the Ever-Forgiving: 1
25. the All-Glorious: 1

In [27]:

query = "Please! Count again how many times each name and atribute of God appears. Keep it over 650 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: Please! Count again how many times each name and atribute of God appears. Keep it over 650 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 36.883 sec.

Result:   I can try to count the occurrences of each name and attribute of God in the passage, but it would be a lengthy and time-consuming task. I can provide you with a list of the names and attributes of God mentioned in the passage, along with their frequencies, but it would be more than 650 words.

Here is the list of names and attributes of God mentioned in the passage, along with their frequencies:

1. God (14)
2. The Most Holy (2)
3. The Most Luminous (1)
4. The Most Mighty (1)
5. The Most Great (1)
6. The Most Exalted (1)
7. The Most Glorious (1)
8. The All-Glorious (1)
9. The All-Compelling (1)
10. The Great Giver (1)
11. The All-Bountiful (1)
12. The Almighty (1)
13. The Ever-Forgiving (1)
14. The Wise (1)
15. The Mighty (1)
16. The King of all names (1)
17. The Sovereig

In [28]:

query = "Please! Search the all document and show how many times each name and atribute of God appears. Keep it over 650 words."
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Query: Please! Search the all document and show how many times each name and atribute of God appears. Keep it over 650 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 53.618 sec.

Result:   The names and attributes of God appear numerous times throughout this document. Here's a breakdown of the frequency of each name and attribute:

1. God (Allah) - 14 times
2. Mighty (Aziz) - 5 times
3. Wise (Hakim) - 4 times
4. Exalted (Al-Rafi') - 3 times
5. Glorious (Al-Mabud) - 2 times
6. Most Holy (Al-Akram) - 2 times
7. Most Luminous (Al-Mu'azzam) - 2 times
8. Most Great (Al-A'zam) - 2 times
9. Most Exalted (Al-Rafi') - 2 times
10. Author of all Manifestations (Mubdi' Kulli al-Muhawwilat) - 1 time
11. Source of all Sources (Mawla' Kulli al-Mawali') - 1 time
12. Fountain-Head of all Revelations (Mawla' Kulli al-Mawali') - 1 time
13. Well-Spring of all Lights (Mawla' Kulli al-Mawali') - 1 time
14. Ordainer of all that may please Thee (Mudabbir Kulli
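
The three counting runs above disagree with each other (for example, "God" is reported 9 times in the first run and 14 times in the other two). This is expected: the chain only passes a handful of retrieved chunks to the LLM, and LLMs are unreliable at exact counting. For exact frequencies, counting directly on the text is more dependable. The sketch below is one way to do it; `count_phrases` is a hypothetical helper and the phrase list is only an example.

```python
import re


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase in text."""
    counts = {}
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Short excerpt taken from the retrieved chunk shown in this notebook's output.
excerpt = (
    "Ocean, every drop of which proclaimeth Thee to be God, besides Whom there "
    "is none other God, the All-Glorious, the All-Wise."
)
# Hypothetical phrase list; extend it with the names and attributes of interest.
result = count_phrases(excerpt, ["God", "All-Glorious", "All-Wise", "Almighty"])
print(result)
```

In the notebook, the same function could be applied to the full text, for instance `documents[0].page_content`, instead of the short excerpt.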

## Document sources

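Let's check which document chunks the vector store retrieves for the current value of `query`. The output shown below comes from an earlier execution (cell 19), so the query it prints differs from the last one run above.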

In [19]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each result is a Langchain Document with `metadata` and `page_content` attributes.
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")

Query: What is mentioned in the document? Summarize. Keep it over 500 words.
Retrieved documents: 4
Source:  /content/baha.txt
Text:  I beseech Thee, O my Lord, by Thy most effulgent Name, to acquaint my
people with the things Thou didst destine for them. Do Thou, then,
preserve them within the stronghold of Thy guardianship and the tabernacle
of Thine unerring protection, lest through them may appear what will
divide Thy servants. Assemble them, O my Lord, on the shores of this
Ocean, every drop of which proclaimeth Thee to be God, besides Whom there
is none other God, the All-Glorious, the All-Wise.

Uncover before them, O my Lord, the majesty of Thy Cause, lest they be led
to doubt Thy sovereignty and the power of Thy might. I swear by Thy glory,
O Thou Who art the Beloved of the worlds! Had they been aware of Thy power
they would of a certainty have refused to utter what Thou didst not ordain
for them in the heaven of Thy will. 

Source:  /content/baha.txt
Text:  Proclaim, therefor

# Conclusions


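We used Langchain, ChromaDB and Llama 3 as the LLM to build a Retrieval Augmented Generation solution. For testing, we used a local text file, `/content/baha.txt`, and queried it for summaries and for counts of the names and attributes it mentions. The summarization queries returned coherent answers, while the counting queries returned different numbers on each run.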


# More work on the same topic

You can find more details about how to use an LLM with Kaggle. A few interesting topics are covered in:  

* https://www.kaggle.com/code/gpreda/test-llama-2-quantized-with-llama-cpp (quantizing Llama 2 model using llama.cpp)
* https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp  (quantized Llama 2 model using llama.cpp)  
* https://www.kaggle.com/code/gpreda/test-of-llama-2-quantized-with-llama-cpp-on-cpu (quantized model using llama.cpp - running on CPU)  
* https://www.kaggle.com/code/gpreda/explore-enron-emails-with-langchain-and-llama-v2 (Explore Enron Emails with Langchain and Llama v2)


# References  

[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476  

[2] Patrick Lewis, Ethan Perez, et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://browse.arxiv.org/pdf/2005.11401.pdf

[3] Minhajul Hoque, Retrieval Augmented Generation: Grounding AI Responses in Factual Data, https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322  

[4] Fangrui Liu, Discover the Performance Gain with Retrieval Augmented Generation, https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/

[5] Andrew, How to use Retrieval-Augmented Generation (RAG) with Llama 2, https://agi-sphere.com/retrieval-augmented-generation-llama2/   

[6] Yogendra Sisodia, Retrieval Augmented Generation Using Llama2 And Falcon, https://medium.com/@scholarly360/retrieval-augmented-generation-using-llama2-and-falcon-ed26c7b14670   

