# Overview

In this notebook, we use our own dataset as the knowledge base for RAG. We first load the data from a CSV file, then load `RagRetriever` with the custom data, and finally run inference in the context of that data.


# Installing Faiss (GPU + CPU)

Here we install Facebook's Faiss, a library for efficient **similarity search** and **clustering of dense vectors**. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Some of the most useful algorithms are implemented on the GPU. Furthermore, the GPU implementation can accept input from either CPU or GPU memory.
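
To make "similarity search" concrete, here is a minimal pure-Python sketch of exact (brute-force) maximum inner product search, the operation that Faiss indexes are designed to speed up. The function names and the toy vectors are illustrative assumptions and are not part of Faiss.

```python
from typing import List, Tuple


def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query: List[float], vectors: List[List[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k best (hypothetical helper)."""
    scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (made up for illustration)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = brute_force_search([1.0, 0.2], vectors, k=2)
print([i for i, _ in top])  # ids of the two best-scoring vectors
```

This scan costs one dot product per stored vector for every query, which is why a dedicated library becomes useful once the collection grows.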

In [None]:
%%capture
!conda install -c pytorch -c nvidia faiss-gpu=1.8.0 -y

In [None]:
%%capture
!pip install transformers==4.38.2
# !pip install accelerate==0.27.2
!pip install datasets==2.18.0 # Pinned to avoid a NumPy attribute error in RagRetriever
# !pip install peft==0.9.0
# !pip install bitsandbytes==0.42.0

In [None]:
import os
import torch
import faiss # for checking faiss-gpu
import warnings

os.environ['CSV_PATH']='/kaggle/input/knowledge-dataset/own_knwoledge dataset.csv'
os.environ['EM_MODEL']='facebook/dpr-ctx_encoder-multiset-base'

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic=True
    # https://github.com/huggingface/transformers/issues/28731
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    device='cuda'
else:
    device='cpu'
    
warnings.filterwarnings('ignore')

# Checking the Basic Information of Data

The dataset needed for RAG must have two columns:
- `title` (string): title of the document
- `text` (string): text of a passage of the document

We preview the data to make sure its format is correct.
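
Beyond eyeballing the first rows, the two-column requirement can be checked programmatically. The sketch below uses only the standard library and assumes a tab-separated file with no header row, as the `load_dataset` call later in this notebook suggests. `validate_rows` and the sample text are hypothetical.

```python
import csv
import io
from typing import List

REQUIRED_COLUMNS = ["title", "text"]


def validate_rows(tsv_text: str) -> List[str]:
    """Return a description of every row that does not fit the title/text layout."""
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if len(row) != len(REQUIRED_COLUMNS):
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
        elif not row[1].strip():
            problems.append(f"line {line_no}: empty text")
    return problems


# Made-up sample: one valid row, one row missing the tab, one row with empty text
sample = "Aaron\tAaron is a prophet.\nBroken line without a tab\nEmpty\t\n"
problems = validate_rows(sample)
for problem in problems:
    print(problem)
```

For a real file, the same function could be fed the file contents, for example `validate_rows(open(path, encoding="utf-8").read())`.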

In [None]:
import pandas as pd

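# The file is read as tab-separated without a header row, matching the load_dataset call below
df=pd.read_csv(os.getenv('CSV_PATH'), sep="\t", names=["title", "text"])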
print(df.shape)
df.head()

# Loading the Data from CSV

In [None]:
from datasets import Dataset, load_dataset
from typing import List, Optional


def split_text(text: str, n=100, character=" ") -> List[str]:
    """Split the text every ``n``-th occurrence of ``character``"""
    text = text.split(character)
    return [character.join(text[i : i + n]).strip() for i in range(0, len(text), n)]


def split_documents(documents: dict) -> dict:
    """Split documents into passages"""
    titles, texts = [], []
    for title, text in zip(documents["title"], documents["text"]):
        if text is not None:
            for passage in split_text(text):
                titles.append(title if title is not None else "")
                texts.append(passage)
    return {"title": titles, "text": texts}


# Building the dataset from a pandas DataFrame without setting the column names causes issues
# dataset=Dataset.from_pandas(df, split="train")
# dataset=dataset.map(split_documents, batched=True, num_proc=4) # 4 vCPUs in Kaggle


dataset = load_dataset("csv", data_files=[os.getenv('CSV_PATH')], split="train", delimiter="\t", column_names=["title", "text"])
dataset
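
To see what the splitting helpers above produce, here is a small worked example. The two functions are restated so the snippet runs on its own, and the sample batch is made up for illustration.

```python
from typing import List


def split_text(text: str, n: int = 100, character: str = " ") -> List[str]:
    """Split the text every n-th occurrence of character."""
    parts = text.split(character)
    return [character.join(parts[i : i + n]).strip() for i in range(0, len(parts), n)]


def split_documents(documents: dict) -> dict:
    """Split a batch of documents into passages, skipping rows without text."""
    titles, texts = [], []
    for title, text in zip(documents["title"], documents["text"]):
        if text is not None:
            for passage in split_text(text):
                titles.append(title if title is not None else "")
                texts.append(passage)
    return {"title": titles, "text": texts}


# Five words split into chunks of two words each
passages = split_text("a b c d e", n=2)
print(passages)  # ['a b', 'c d', 'e']

# A batch in the column-oriented layout that a batched map passes to the function
batch = {"title": ["Doc A", None, "Doc C"], "text": ["one two three", "untitled text", None]}
result = split_documents(batch)
print(result)
```

A missing title becomes an empty string and a row without text is dropped. Applying the function to the loaded dataset would presumably look like `dataset.map(split_documents, batched=True)`, mirroring the commented-out pandas variant above.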

# Adding and Computing the Embeddings

In [None]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

ctx_encoder=DPRContextEncoder.from_pretrained(os.getenv("EM_MODEL")).to(device)
ctx_tokenizer=DPRContextEncoderTokenizerFast.from_pretrained(os.getenv("EM_MODEL"))

In [None]:
from functools import partial
from datasets import Features, Sequence, Value


def embed(documents: dict, ctx_encoder: DPRContextEncoder, ctx_tokenizer: DPRContextEncoderTokenizerFast) -> dict:
    """Compute the DPR embeddings of document passages"""
    input_ids = ctx_tokenizer(documents["title"], documents["text"], truncation=True, padding="longest", return_tensors="pt")["input_ids"]
    embeddings = ctx_encoder(input_ids.to(device=device), return_dict=True).pooler_output
    return {"embeddings": embeddings.detach().cpu().numpy()}
    
new_features = Features({
    "text": Value("string"), 
    "title": Value("string"), 
    "embeddings": Sequence(Value("float32"))})  # optional, save as float32 instead of float64 to save space

dataset = dataset.map(
    partial(embed, ctx_encoder=ctx_encoder, ctx_tokenizer=ctx_tokenizer),
    batched=True,
    batch_size=16,
    features=new_features,
)

# /kaggle/input is read-only, so the dataset is saved under /kaggle/working
path_passages=os.path.join('/kaggle/working/',"my_knowledge_dataset")
dataset.save_to_disk(path_passages)

We can also load the dataset from the local disk.

```python
from datasets import load_from_disk

dataset=load_from_disk(path_passages)
```

# Implementing Fast Approximate Nearest Neighbor Search

We are going to use the Faiss implementation of HNSW for fast approximate nearest neighbor search.

In [None]:
index=faiss.