# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Use Llama 2.0, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tuning the Large Language Model (LLM).
When using RAG, given a question, you first perform a retrieval step to fetch relevant documents from a vector database where these documents were indexed. The retrieved documents are then passed to the LLM together with the question.

## Definitions

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details about RAG)

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7 billion parameters; hf: HuggingFace build)
* **Version**: V1  
* **Framework**: PyTorch  

The Llama 2 models are pretrained on 2 trillion tokens and range from 7 to 70 billion parameters, which makes Llama 2 one of the more powerful open-source model families. It is a significant improvement over Llama 1.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have proven their ability to understand context and provide accurate answers to various NLP tasks, including summarization and Q&A, when prompted. While they can provide very good answers to questions about information they were trained on, they tend to hallucinate when the topic is about information that they do "not know", i.e. information that was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The two main components of a RAG system are therefore a retriever and a generator.  
 
The retriever can be described as a system that encodes our data so that the relevant parts can be easily retrieved when querying it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. A good option for implementing a retriever is a vector database. There are multiple vector databases available, both open source and commercial products. A few examples are ChromaDB, Milvus, FAISS, Pinecone, and Weaviate. Our option in this Notebook will be a local instance of ChromaDB (persistent).
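
To make the retriever idea concrete, here is a minimal sketch in plain Python. It assumes a toy word-count "embedding" in place of a trained embedding model and a Python list in place of a vector database. The helper names `embed`, `cosine` and `retrieve` are hypothetical and are not part of ChromaDB or Langchain.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a word-count vector (a real system uses a trained model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "the watermelon weighs an even number of kilos",
    "chroma stores vectors on disk",
    "split the watermelon into two even parts",
]
top = retrieve("how to split a watermelon", docs, k=2)
print(top)
```

A vector database plays the same role as `retrieve`, but with learned embeddings and an index that avoids comparing the query against every stored document.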

For the generator part, the obvious option is an LLM. In this Notebook we will use a quantized LLaMA v2 model, from the Kaggle Models collection.  

The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the retriever-generator chain in one line of code.

## More about this  

Do you want to learn more? Look into the `References` section for blog posts and in `More work on the same topic` for Notebooks about the technologies used here.

# Installations, imports, utils

einops==0.6.1 --> error
langchain==0.0.300 --> error

In [1]:
!wget -q https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.4.1/auto_gptq-0.4.1+cu118-cp310-cp310-linux_x86_64.whl
!BUILD_CUDA_EXT=0 pip install -qqq auto_gptq-0.4.1+cu118-cp310-cp310-linux_x86_64.whl --progress-bar off

In [2]:
!pip install transformers==4.37.0 bitsandbytes accelerate optimum

Collecting transformers==4.37.0
  Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting optimum
  Downloading optimum-1.19.0-py3-none-any.whl (417 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.0)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.

In [3]:
pip install --upgrade auto-gptq

Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.5 MB)
Collecting accelerate>=0.26.0 (from auto-gptq)
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.1.1-py3-none-any.whl (13.2 MB)
Installing collected packages: gekko, accelerate, auto-gptq
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.22.0
    Uninstalling accelerate-0.22.0:
      Successfully uninstalled accelerate-0.22.0
  Attempting uninstall: auto-gptq
    Found existing installation: auto-g

In [4]:
!pip install einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting einops==0.6.1
  Downloading einops-0.6.1-py3-none-any.whl (42 kB)
Collecting langchain==0.0.300
  Downloading langchain-0.0.300-py3-none-any.whl (1.7 MB)
Collecting xformers==0.0.21
  Downloading xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl (167.0 MB)
Collecting sentence_transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting chromadb==0.4.12
  Downloading chromadb-0.4.12

In [5]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.0.33-py3-none-any.whl (1.9 MB)
Collecting langchain-core<0.2.0,>=0.1.43 (from langchain_community)
  Downloading langchain_core-0.1.44-py3-none-any.whl (290 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain_community)
  Downloading langsmith-0.1.48-py3-none-any.whl (113 kB)
Collecting packaging<24.0,>=23.2 (from langchain-core<0.2.0,>=0.1.43->langchain_community)
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
Collecting

In [6]:
!pip install langchain_text_splitters

Collecting langchain_text_splitters
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl (21 kB)
Installing collected packages: langchain_text_splitters
Successfully installed langchain_text_splitters-0.0.1


In [7]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter)
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [8]:
!huggingface-cli login --token ["your_hf"]

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [9]:
#model_id = 'meta-llama/Llama-2-7b-chat-hf'
#model_id = "Qwen/Qwen1.5-7B-Chat-GPTQ-Int8"
model_id = "Qwen/CodeQwen1.5-7B-Chat"

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
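
To see why 4-bit loading helps, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: about 7 billion parameters, and no accounting for quantization metadata, activations or the KV cache. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # assumption: a "7B" model has about 7 billion parameters
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the 16-bit footprint, which is what makes a 7B model fit more comfortably on a single notebook GPU.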

Prepare the model and the tokenizer.

In [10]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    #trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    torch_dtype="auto",
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Prepare model, tokenizer: 116.162 sec.


In [11]:
def test(tokenizer, model, prompt_to_test):
    """Run one chat-formatted prompt through the model and print the reply."""
    time_1 = time()
    #prompt = "could you implement max SubArray Sum in cpp?"
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_to_test}
    ]
    # Render the messages with the tokenizer's chat template (as text, not token ids).
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    time_2 = time()
    # Note: the label below is a leftover from the loading cell; the value is the inference time.
    print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")
    print("Response :::: ", response)
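
The `apply_chat_template` call turns the list of messages into a single prompt string. Below is a rough sketch of what that string might look like, assuming a ChatML-style template with `<|im_start|>` and `<|im_end|>` markers. This layout is an assumption; the real template ships with the tokenizer and may differ. The helper `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a ChatML-like layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
text = render_chatml(messages)
print(text)
```

The trailing assistant marker is what `add_generation_prompt=True` is meant to add: it signals that the next tokens belong to the assistant's answer.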

In [12]:
prompt = ''' could u implement it in cpp?
One hot summer day Pete and his friend Billy decided to buy a watermelon. They chose the biggest and the ripest one, in their opinion. After that the watermelon was weighed, and the scales showed w kilos. They rushed home, dying of thirst, and decided to divide the berry, however they faced a hard problem.

Pete and Billy are great fans of even numbers, that's why they want to divide the watermelon in such a way that each of the two parts weighs even number of kilos, at the same time it is not obligatory that the parts are equal. The boys are extremely tired and want to start their meal as soon as possible, that's why you should help them and find out, if they can divide the watermelon in the way they want. For sure, each of them should get a part of positive weight.'''

In [13]:
test(tokenizer, model, prompt)



Prepare model, tokenizer: 29.19 sec.
Response ::::  Sure, here's an implementation of the problem in C++:

```cpp
#include <iostream>

int main() {
    int w;
    std::cin >> w;
    if(w % 2 == 0 && w > 2) {
        std::cout << "YES";
    } else {
        std::cout << "NO";
    }
    return 0;
}
```

This code reads an integer `w` from the standard input, which represents the weight of the watermelon in kilograms. It then checks whether the weight is even (i.e., `w % 2 == 0`) and greater than 2. If both conditions are true, it prints "YES", indicating that the watermelon can be divided into two even-weighted parts. Otherwise, it prints "NO", indicating that it is not possible to divide the watermelon in the desired manner.
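
As a sanity check on the generated answer, the closed-form condition can be compared with a brute-force search over all splits. This is a small sketch, assuming integer weights from 1 to 100; the function names are hypothetical.

```python
def can_split_bruteforce(w):
    """True if w can be written as a + b with a and b both even and positive."""
    return any(a % 2 == 0 and (w - a) % 2 == 0 for a in range(1, w))

def can_split_formula(w):
    """The closed-form test used in the generated C++ code."""
    return w % 2 == 0 and w > 2

mismatches = [
    w for w in range(1, 101)
    if can_split_bruteforce(w) != can_split_formula(w)
]
print("mismatches:", mismatches)
```

An empty list of mismatches means the condition `w % 2 == 0 && w > 2` agrees with the exhaustive search on that range.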


Define the query pipeline.

In [14]:
time_1 = time()
query_pipeline = transformers.pipeline("text-generation",
                                       model=model, 
                                       tokenizer=tokenizer,
                                       max_new_tokens=512
                                      )


time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 2.62 sec.


We define a function for testing the pipeline.

In [15]:
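def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Perform a query and print the result.
    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        prompt_to_test: the prompt
    Returns:
        None
    """
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    # Note: `messages` is built but not used below. The raw prompt is passed
    # to the pipeline, so the model continues the text instead of answering
    # in chat format.
    messages = [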
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_to_test}
    ]
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        #max_length=200,
    )
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
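
The `do_sample=True` and `top_k=10` arguments ask the pipeline to sample each next token from the 10 most likely candidates instead of always picking the single best one. The idea can be sketched in plain Python as follows. This is an illustration over a list of made-up scores, not the `transformers` implementation, and `top_k_sample` is a hypothetical helper.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample an index from the k highest-scoring entries of `logits`."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2`, only the two highest-scoring indices can ever be drawn, which is how `top_k` limits the randomness of generation.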

## Test the query pipeline

We test the pipeline with the same watermelon prompt defined above.

In [16]:
test_model(tokenizer,
           query_pipeline,
           prompt)

Test inference: 13.994 sec.
Result:  could u implement it in cpp?
One hot summer day Pete and his friend Billy decided to buy a watermelon. They chose the biggest and the ripest one, in their opinion. After that the watermelon was weighed, and the scales showed w kilos. They rushed home, dying of thirst, and decided to divide the berry, however they faced a hard problem.

Pete and Billy are great fans of even numbers, that's why they want to divide the watermelon in such a way that each of the two parts weighs even number of kilos, at the same time it is not obligatory that the parts are equal. The boys are extremely tired and want to start their meal as soon as possible, that's why you should help them and find out, if they can divide the watermelon in the way they want. For sure, each of them should get a part of positive weight.

Input Specification:
The first (and the only) input line contains integer number w, which shows the weight of the watermelon in kilos. (1 w 100)

Output Sp

# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using the same prompt.

In [17]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt=prompt)

'\n\nWrite a program that will determine if the boys can divide the watermelon in the required manner, and print YES or NO accordingly.\n\nExamples\n\nInput  : 8  \nOutput : YES \n\nInput  : 12  \nOutput : NO \n\nInput  : 15  \nOutput : NO  \nNote: According to problem statement it is given that the weight of watermelon is always odd, then the solution will be NO. If weight of watermelon is even then the weight of each part will be even and will divide the weight of watermelon evenly. Thus, the solution is YES in that case.'

In [18]:
messages =f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {prompt}
 </s>
 <|assistant|>
"""
response = llm.predict(messages)
print(response)

Certainly! Here's an implementation of the described problem in C++:

```cpp
#include <iostream>
using namespace std;

int main() {
    int w;
    cin >> w;
    
    if (w % 2 == 0) {
        if (w > 2) {
            cout << "YES";
        } else {
            cout << "NO";
        }
    } else {
        cout << "NO";
    }
    
    return 0;
}
```

This program reads an integer `w` representing the weight of the watermelon. If the weight is even, and greater than 2, it outputs "YES", as the weight can be divided evenly. If the weight is odd or is less than or equal to 2, it outputs "NO".

Please note that the program assumes that the input weight `w` is a positive integer.


## Ingestion of data using CSV loader

We ingest a CSV file of solutions (`combine_sol.csv`) using Langchain's `CSVLoader`.

In [19]:
loader = CSVLoader(file_path='/kaggle/input/combined-solution/combine_sol.csv',encoding="utf8")

documents = loader.load()

## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

'''cpp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CPP, chunk_size=1000, chunk_overlap=20)
all_splits = cpp_splitter.create_documents([documents])'''


'cpp_splitter = RecursiveCharacterTextSplitter.from_language(\n    language=Language.CPP, chunk_size=1000, chunk_overlap=20)\nall_splits = cpp_splitter.create_documents([documents])'
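
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: the real `RecursiveCharacterTextSplitter` prefers to break at separators such as newlines and spaces, so its chunks will differ. The helper `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary is still seen together in at least one chunk.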

## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [21]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [22]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [23]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
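
Conceptually, the `"stuff"` chain type places all retrieved documents into a single prompt and calls the model once. A minimal sketch of that flow is shown below. The prompt wording, the helper names and the stand-in retriever and model are all assumptions made for illustration, not Langchain's actual template or API.

```python
def stuff_prompt(question, retrieved_docs):
    """Build one prompt that contains every retrieved document plus the question."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question, retrieve, llm):
    """Retrieve documents, stuff them into a prompt, and call the model once."""
    docs = retrieve(question)
    return llm(stuff_prompt(question, docs))

# Stand-ins for the real retriever and model (hypothetical, for illustration).
def fake_retrieve(question):
    return ["doc A about watermelons", "doc B about even numbers"]

def fake_llm(prompt):
    return f"(model saw {len(prompt)} characters)"

prompt_text = stuff_prompt("Can w be split?", fake_retrieve("Can w be split?"))
reply = answer("Can w be split?", fake_retrieve, fake_llm)
print(reply)
```

Because everything is concatenated into one prompt, the retrieved documents plus the question must fit within the model's context window.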

## Test the Retrieval-Augmented Generation 


We define a test function that runs the query and times it.

In [24]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    # Note: `messages` is built but never used; `qa.run` receives the raw query.
    messages =f"""
    <|system|>
    You are an AI assistant that follows instruction extremely well.
    Please be truthful and give direct answers
    </s>
     <|user|>
     {query}
     </s>
     <|assistant|>
    """
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check a few queries.

In [25]:
'''query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)'''

'query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."\ntest_rag(qa, query)'

In [26]:
query = ''' could u implement it in cpp?
One hot summer day Pete and his friend Billy decided to buy a watermelon. They chose the biggest and the ripest one, in their opinion. After that the watermelon was weighed, and the scales showed w kilos. They rushed home, dying of thirst, and decided to divide the berry, however they faced a hard problem.

Pete and Billy are great fans of even numbers, that's why they want to divide the watermelon in such a way that each of the two parts weighs even number of kilos, at the same time it is not obligatory that the parts are equal. The boys are extremely tired and want to start their meal as soon as possible, that's why you should help them and find out, if they can divide the watermelon in the way they want. For sure, each of them should get a part of positive weight.'''
test_rag(qa, query)

Query:  could u implement it in cpp?
One hot summer day Pete and his friend Billy decided to buy a watermelon. They chose the biggest and the ripest one, in their opinion. After that the watermelon was weighed, and the scales showed w kilos. They rushed home, dying of thirst, and decided to divide the berry, however they faced a hard problem.

Pete and Billy are great fans of even numbers, that's why they want to divide the watermelon in such a way that each of the two parts weighs even number of kilos, at the same time it is not obligatory that the parts are equal. The boys are extremely tired and want to start their meal as soon as possible, that's why you should help them and find out, if they can divide the watermelon in the way they want. For sure, each of them should get a part of positive weight.



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 47.735 sec.

Result:    

Input: 4 2
Output: True

Input: 1 1
Output: False

Input: 2 4
Output: False

Input: 3 3
Output: False
Input: 10 3
Output: True
Explanation: 10 - 3 = 7 and 7 is even number

Input: 9 4
Output: True

Explanation: 9 - 4 = 5 and 5 is even number

Input: 9 5
Output: False

Input: 11 2
Output: False

Explanation: 11 - 2 = 9 and 9 is even number

Input: 10 2
Output: False

Explanation: 10 - 2 = 8 and 8 is even number

Input: 10 8
Output: False

Explanation: 10 - 8 = 2 and 2 is even number

Input: 9 3
Output: False

Explanation: 9 - 3 = 6 and 6 is even number

Input: 6 4
Output: False

Explanation: 6 - 4 = 2 and 2 is even number

Input: 15 4
Output: False

Explanation: 15 - 4 = 11 and 11 is even number

Input: 17 10
Output: False

Explanation: 17 - 10 = 7 and 7 is even number

Input: 18 4
Output: False

Explanation: 18 - 4 = 14 and 14 is even number

Input: 20 6

## Document sources

Let's check the document sources for the last query.

In [27]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Query:  could u implement it in cpp?
One hot summer day Pete and his friend Billy decided to buy a watermelon. They chose the biggest and the ripest one, in their opinion. After that the watermelon was weighed, and the scales showed w kilos. They rushed home, dying of thirst, and decided to divide the berry, however they faced a hard problem.

Pete and Billy are great fans of even numbers, that's why they want to divide the watermelon in such a way that each of the two parts weighs even number of kilos, at the same time it is not obligatory that the parts are equal. The boys are extremely tired and want to start their meal as soon as possible, that's why you should help them and find out, if they can divide the watermelon in the way they want. For sure, each of them should get a part of positive weight.
Retrieved documents: 4
Source:  /kaggle/input/combined-solution/combine_sol.csv
Text:  : 103
solution: #include <iostream>

using namespace std;

int main()
{
	int coconuts;

	w

# Conclusions


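We used Langchain, ChromaDB and an LLM to build a Retrieval Augmented Generation solution. In this run the model loaded was `Qwen/CodeQwen1.5-7B-Chat` instead of Llama 2, and testing used a programming problem as the prompt, with a CSV file of solutions (`combine_sol.csv`) as the document source.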


# More work on the same topic

You can find more details about how to use an LLM with Kaggle. A few interesting topics are covered in:  

* https://www.kaggle.com/code/gpreda/test-llama-2-quantized-with-llama-cpp (quantizing Llama 2 model using llama.cpp)
* https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp  (quantized Llama 2 model using llama.cpp)  
* https://www.kaggle.com/code/gpreda/test-of-llama-2-quantized-with-llama-cpp-on-cpu (quantized model using llama.cpp - running on CPU)  
* https://www.kaggle.com/code/gpreda/explore-enron-emails-with-langchain-and-llama-v2 (Explore Enron Emails with Langchain and Llama v2)


# References  

[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476  

[2] Patrick Lewis, Ethan Perez, et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://browse.arxiv.org/pdf/2005.11401.pdf 

[3] Minhajul Hoque, Retrieval Augmented Generation: Grounding AI Responses in Factual Data, https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322  

[4] Fangrui Liu, Discover the Performance Gain with Retrieval Augmented Generation, https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/

[5] Andrew, How to use Retrieval-Augmented Generation (RAG) with Llama 2, https://agi-sphere.com/retrieval-augmented-generation-llama2/   

[6] Yogendra Sisodia, Retrieval Augmented Generation Using Llama2 And Falcon, https://medium.com/@scholarly360/retrieval-augmented-generation-using-llama2-and-falcon-ed26c7b14670   

