# Overview

In this notebook, we will use our own dataset as the knwoloedge of the RAG. We will load our data from CSV file. Next, loading `RagRetriever` witht the customise data. Finally, we will do some inference under the context of the customize data.


# Installing faiss(GPU+CPU)

Here we install Faceboook Faiss which is a library for efficient **similarity search** and **cluster of dense vectors**. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Some of the most useful algorithms are implemented on the GPU. Further more, the GPU implementation can accept input from either CPU or GPU memeory. 

In [1]:
%%capture
!conda install -c pytorch -c nvidia faiss-gpu=1.8.0 -y

In [2]:
%%capture
!pip install transformers==4.38.2
# !pip install accelerate==0.27.2
!pip install datasets==2.18.0 # Fix issue numpy attributes error of RagRetriever
# !pip install peft==0.9.0
# !pip install bitsandbytes==0.42.0

In [7]:
import os
import torch
import faiss # for checking faiss-gpu
import warnings

csv_path='/kaggle/input/knowledge-dataset/own_knwoledge_dataset.csv'
os.environ['EM_MODEL']='facebook/dpr-ctx_encoder-multiset-base'

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic=True
    # https://github.com/huggingface/transformers/issues/28731
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    device='cuda'
else:
    device='cpu'
    
warnings.filterwarnings('ignore')

# Loading Dataset

The dataset needed for RAG must have two columns:
- title(string): title of the document
- text(string): text of a passage of the document

We visualization the data and make sure it's format was corrected.

In [4]:
import pandas as pd


df=pd.read_csv(csv_path)
print(df.shape)

df.head()

(3, 2)


Unnamed: 0,Aaron,"Aaron Aaron ( or ; ""Ahärôn"") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother's spokesman (""prophet"") to the Pharaoh. Part of the Law (Torah) that Moses received from God at Sinai granted Aaron the priesthood for himself and his male descendants, and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. According to the Book of Exodus, Aaron first functioned as Moses' assistant. Because Moses complained that he could not speak well, God appointed Aaron as Moses' ""prophet"" (Exodus 4:10-17; 7:1). At the command of Moses, he let his rod turn into a snake. Then he stretched out his rod in order to bring on the first three plagues. After that, Moses tended to act and speak for himself. During the journey in the wilderness, Aaron was not always prominent or active. At the battle with Amalek, he was chosen with Hur to support the hand of Moses that held the ""rod of God"". When the revelation was given to Moses at biblical Mount Sinai, he headed the elders of Israel who accompanied Moses on the way to the summit."
0,Pokémon,"Pokémon , also known as in Japan, is a media f..."
1,Melbourne,Melborune is a beautiful city which is located...
2,RMIT,"RMIT is an university in Melbourne, the city c..."


In [18]:
from datasets import Dataset
from typing import List, Optional

def split_text(text: str, n=100, character=" ") -> List[str]:
    """
    Split the text every `n` -th occurence of `character`
    """
    text=text.split(character)
    return [character.join(text[i:1+n]).strip() for i in range(0, len(text), n)]


def split_documents(documents: dict) -> dict:
    """
    Split documents into passages
    """
    titles, texts=[],[]
    print(documents)
    for title, text in zip(documents["title"], documents["text"]):
        if text is not None:
            for passage in split_text(text):
                titles.append(title if title is not None else "")
                texts.append(passage)
    return {"title":titles, "text": texts}

dataset=Dataset.from_pandas(df, split="train")
dataset=dataset.map(split_documents, batched=True, num_proc=4) # 4 vCPUs in Kaggle

num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.


Map (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]

{'Aaron': ['Pokémon'], 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from God at Sinai granted Aaron the priesthood for himself and his male descendants, and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. Accor

KeyError: 'title'

# Adding and Computing the Embeddings

In [10]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

ctx_encoder=DPRContextEncoder.from_pretrained(os.getenv("EM_MODEL")).to(device)
ctx_tokenizer=DPRContextEncoderTokenizerFast.from_pretrained(os.getenv("EM_MODEL"))

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


In [11]:
from functools import partial
from datasets import Features, Sequence, Value


new_features=Features({
    "text": Value("string"),
    "title": Value("string"),
    "embeddings": Sequence(Value("float32")) # save as float32 instead of float64 to save space
    })


def embed(documents: dict, ctx_encoder: DPRContextEncoder, ctx_tokenizer: DPRContextEncoderToeknizerFast) -> dict:
    """
    Compute the DPR embeddings of document passages
    """
    input_ids=ctx_tokenizer(documents["title"], documents["text"], truncation=True, padding="longest",  return_tensors="pt")["input_ids"]
    embeddings=ctx_encoder(input_ids.to(device=device), return_dict=True).pooler_output
    return {"embeddings": embeddings.detach().cpu().numpy}
    

dataset=dataset.map(
    partial(embed, ctx_encoder=ctx_encoder, ctx_tokenizer=ctx_tokenizer),
    batched=True,
    batch_size=processing_args.batch_size,
    features=new_features,
)

path_passages=os.path.join('/kaggle/input/',"my_knowledge_dataset")
dataset.save_to_disk(path_passages)

NameError: name 'embed' is not defined

We can also load the dataset from the local disk.

```python
from datasets import load_from_disk

dataset=load_from_disk(path_passages)
```

# Implementation Fast Approximate Nearest Neighbor Search

We are going to use the Faiss implementation of HNSW for fast approcimate nearest neighbor search.

In [None]:
index=faiss.