<a href="https://www.kaggle.com/code/aisuko/rag-q-a-with-customise-dataset?scriptVersionId=166910319" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we use our own dataset as the knowledge source for RAG. We first load the data from a CSV file, then load `RagRetriever` with the custom data, and finally run inference in the context of that data.


# Installing Faiss (GPU + CPU)

Here we install Facebook's Faiss, a library for efficient **similarity search** and **clustering of dense vectors**. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Some of the most useful algorithms are implemented on the GPU. Furthermore, the GPU implementation can accept input from either CPU or GPU memory.

In [1]:
%%capture
!conda install -c pytorch -c nvidia faiss-gpu=1.8.0 -y

In [2]:
%%capture
!pip install transformers==4.38.2
# !pip install accelerate==0.27.2
!pip install datasets==2.18.0 # Pinned to fix a numpy attribute error in RagRetriever
# !pip install peft==0.9.0
# !pip install bitsandbytes==0.42.0

In [3]:
import os
import torch
import faiss # for checking faiss-gpu
import warnings

os.environ['CSV_PATH']='/kaggle/input/knowledge-dataset/own_knwoledge dataset.csv'
os.environ['CSV_DEMO']='/kaggle/input/knowledge-dataset/my_knowledge_dataset.csv'
os.environ['EM_MODEL']='facebook/dpr-ctx_encoder-multiset-base'
os.environ['RAG_MODEL']='facebook/rag-sequence-nq'

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic=True
    # https://github.com/huggingface/transformers/issues/28731
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    device='cuda'
else:
    device='cpu'
    
warnings.filterwarnings('ignore')

print(device)

cuda


# Checking the Basic Information of Data

The dataset needed for RAG must have two columns:
- title(string): title of the document
- text(string): text of a passage of the document

We inspect the data to make sure its format is correct.
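
As a quick sanity check before building the index, the two required columns can be verified programmatically. The sketch below uses only the standard library; the helper name `check_rag_columns` and the in-memory sample are illustrative assumptions, not part of the notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("title", "text")  # the two columns the RAG dataset needs


def check_rag_columns(csv_source, delimiter=","):
    """Return the required columns that are missing from the CSV header.

    `check_rag_columns` is a hypothetical helper written for illustration.
    """
    reader = csv.DictReader(csv_source, delimiter=delimiter)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical in-memory sample standing in for a real CSV file.
sample = io.StringIO("title,text\nRMIT,RMIT is a university in Melbourne.\n")
missing = check_rag_columns(sample)
print(missing)  # an empty list means both columns are present
```

For a file on disk, the same helper could be called with an open file handle, for example `check_rag_columns(open(path, newline=""))`.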

In [4]:
# Bad case
import pandas as pd

df=pd.read_csv(os.getenv('CSV_PATH'))
print(df.shape)
df.head()

(4, 2)


Unnamed: 0,title,text
0,Aaron,"Aaron Aaron ( or ; ""Ahärôn"") is a prophet, hig..."
1,Pokémon,"Pokémon , also known as in Japan, is a media f..."
2,Melbourne,Melborune is a beautiful city which is located...
3,RMIT,"RMIT is an university in Melbourne, the city c..."


In [5]:

df1=pd.read_csv(os.getenv("CSV_DEMO"))
print(df1.shape)
df1.head()

(1, 18)


Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,"Aaron\tAaron Aaron ( or ; ""Ahärôn"") is a prophet",high priest,and the brother of Moses in the Abrahamic religions. Knowledge of Aaron,along with his brother Moses,comes exclusively from religious texts,such as the Bible and Quran. The Hebrew Bible relates that,unlike Moses,who grew up in the Egyptian royal court,Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites,"Aaron served as his brother's spokesman (""prophet"") to the Pharaoh. Part of the Law (Torah) that Moses received from God at Sinai granted Aaron the priesthood for himself and his male descendants",and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. According to the Book of Exodus,Aaron first functioned as Moses' assistant. Because Moses complained that he could not speak well,"God appointed Aaron as Moses' ""prophet"" (Exodus 4:10-17; 7:1). At the command of Moses",he let his rod turn into a snake. Then he stretched out his rod in order to bring on the first three plagues. After that,Moses tended to act and speak for himself. During the journey in the wilderness,Aaron was not always prominent or active. At the battle with Amalek,"he was chosen with Hur to support the hand of Moses that held the ""rod of God"". When the revelation was given to Moses at biblical Mount Sinai",he headed the elders of Israel who accompanied Moses on the way to the summit.
Pokémon\tPokémon,also known as in Japan,is a media franchise managed by The Pokémon Company,a Japanese consortium between Nintendo,Game Freak,and Creatures. The franchise copyright is shared by all three companies,but Nintendo is the sole owner of the trademark. The franchise was created by Satoshi Tajiri in 1995,"and is centered on fictional creatures called ""Pokémon""",which humans,known as Pokémon Trainers,catch and train to battle each other for spor...,a pair of video games for the original Game B...,with over in revenue up until March 2017. The...,"the ""Pokémon"" franchise includes the world's ...",the top-selling trading card game with over 2...,an anime television series that has become th...,000 episodes in 124 countries,as well as an anime film series,a,books,manga comics,music,and merchandise. The franchise is also repres...,"such as the ""Super Smash Bros."" series. In No...",4Kids Entertainment,which had managed the non-game related licens...,announced that it had agreed not to renew the...
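
The 18 columns above are a symptom of reading a tab-separated file with the default comma delimiter: the title and text are separated by `\t`, and every comma inside the text becomes a column boundary. A minimal sketch with the standard `csv` module, using a made-up one-line sample rather than the real file, shows the difference:

```python
import csv
import io

# Made-up sample line: title and text separated by a tab, commas inside the text.
raw = "Aaron\tAaron is a prophet, high priest, and the brother of Moses.\n"

# Default comma delimiter: the text is cut at every comma.
comma_rows = list(csv.reader(io.StringIO(raw)))

# Tab delimiter: exactly two fields, title and text.
tab_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

print(len(comma_rows[0]))  # number of fields when split on commas
print(len(tab_rows[0]))    # number of fields when split on tabs
```

This is why the loading cell further down passes `delimiter="\t"` together with explicit column names.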


# Loading the Data from CSV

We load the CSV file and define helpers for splitting the documents into passages of 100 words.

In [6]:
from datasets import Dataset, load_dataset
from typing import List, Optional


def split_text(text: str, n=100, character=" ") -> List[str]:
    """Split the text every ``n``-th occurrence of ``character``"""
    text = text.split(character)
    return [character.join(text[i : i + n]).strip() for i in range(0, len(text), n)]


def split_documents(documents: dict) -> dict:
    """Split documents into passages"""
    titles, texts = [], []
    for title, text in zip(documents["title"], documents["text"]):
        if text is not None:
            for passage in split_text(text):
                titles.append(title if title is not None else "")
                texts.append(passage)
    return {"title": titles, "text": texts}


# Using a pandas DataFrame without setting the column names causes issues
# dataset=Dataset.from_pandas(df, split="train")
# dataset=dataset.map(split_documents, batched=True, num_proc=4) # 4 vCPUs in Kaggle

# Load the tab-separated file with explicit column names
# (split_documents is not applied here, so the rows are left unsplit)
dataset = load_dataset("csv", data_files=[os.getenv('CSV_DEMO')], split="train", delimiter="\t", column_names=["title", "text"])
dataset

Dataset({
    features: ['title', 'text'],
    num_rows: 2
})
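
To make the 100-word passage splitting concrete, here is a small self-contained sketch that re-declares the two helpers from the cell above and applies them to a toy batch. The toy documents are assumptions chosen so the result is easy to check by hand.

```python
from typing import List


def split_text(text: str, n: int = 100, character: str = " ") -> List[str]:
    """Split the text every ``n``-th occurrence of ``character``."""
    words = text.split(character)
    return [character.join(words[i : i + n]).strip() for i in range(0, len(words), n)]


def split_documents(documents: dict) -> dict:
    """Split each document in a batch into passages of at most 100 words."""
    titles, texts = [], []
    for title, text in zip(documents["title"], documents["text"]):
        if text is not None:
            for passage in split_text(text):
                titles.append(title if title is not None else "")
                texts.append(passage)
    return {"title": titles, "text": texts}


# Toy batch: one 250-word document and one short document without a title.
long_text = " ".join(f"w{i}" for i in range(250))
batch = {"title": ["Doc", None], "text": [long_text, "a short passage"]}

passages = split_documents(batch)
print(passages["title"])
print([len(p.split(" ")) for p in passages["text"]])
```

The 250-word document yields three passages that all repeat its title, and the missing title becomes an empty string. With the `datasets` library, the same function would typically be applied through `dataset.map(split_documents, batched=True)`; the active cell loads the file without splitting, which is why `num_rows` stays at 2.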

# Create Dataset from CSV



In [7]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

ctx_encoder=DPRContextEncoder.from_pretrained(os.getenv("EM_MODEL")).to(device)
ctx_tokenizer=DPRContextEncoderTokenizerFast.from_pretrained(os.getenv("EM_MODEL"))

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


In [8]:
from functools import partial
from datasets import Features, Sequence, Value


def embed(documents: dict, ctx_encoder: DPRContextEncoder, ctx_tokenizer: DPRContextEncoderTokenizerFast) -> dict:
    """Compute the DPR embeddings of document passages"""
    print(documents["title"])
    print(documents["text"])
    input_ids = ctx_tokenizer(documents["title"], documents["text"], truncation=True, padding="longest", return_tensors="pt")["input_ids"]
    embeddings = ctx_encoder(input_ids.to(device=device), return_dict=True).pooler_output
    return {"embeddings": embeddings.detach().cpu().numpy()}
    
new_features = Features({
    "text": Value("string"), 
    "title": Value("string"), 
    "embeddings": Sequence(Value("float32"))})  # optional, save as float32 instead of float64 to save space


dataset = dataset.map(
    partial(embed, ctx_encoder=ctx_encoder, ctx_tokenizer=ctx_tokenizer),
    batched=True,
    batch_size=16,
    features=new_features,
)

dataset.save_to_disk('my_knowledge_dataset')

['Aaron', 'Pokémon']
['Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from God at Sinai granted Aaron the priesthood for himself and his male descendants, and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. Accord

We can also load the dataset from the local disk.

```python
from datasets import load_from_disk

dataset=load_from_disk('my_knowledge_dataset')  # same path passed to save_to_disk above
```

# Index the Dataset

We are going to use the Faiss implementation of HNSW for fast approximate nearest neighbor search.

In [9]:
dimension=768 # The dimension of the embeddings to pass to the HNSW Faiss index.
m=128 # The number of bi-directional links created for every new element during the HNSW index construction.

index=faiss.IndexHNSWFlat(dimension, m, faiss.METRIC_INNER_PRODUCT)
type(index)

faiss.swigfaiss_avx512.IndexHNSWFlat
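
HNSW returns approximate neighbours. The exact result it approximates under `METRIC_INNER_PRODUCT` is the set of vectors with the largest dot product against the query. The brute-force sketch below computes that in plain Python on made-up 3-dimensional vectors (the real embeddings have 768 dimensions). It does not use Faiss, and the function name `top_k_inner_product` is hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_inner_product(
    query: Sequence[float], vectors: List[Sequence[float]], k: int = 1
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors with the largest dot product."""
    scores = [
        (idx, sum(q * v for q, v in zip(query, vec)))
        for idx, vec in enumerate(vectors)
    ]
    # Sort by score, highest first, and keep the best k.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up passage embeddings; each row stands in for one passage.
passage_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
]
query_vector = [1.0, 0.0, 0.0]

best = top_k_inner_product(query_vector, passage_vectors, k=1)
print(best)  # the best-matching row index with its score
```

A scan like this costs time linear in the number of passages for every query, which is the cost an HNSW graph is meant to avoid on large collections. With only two passages, as in this notebook, both approaches should return the same neighbours.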

In [10]:
dataset.add_faiss_index("embeddings", custom_index=index)

Dataset({
    features: ['text', 'title', 'embeddings'],
    num_rows: 2
})

# Save the Index

In [11]:
dataset.get_index("embeddings").save("my_knowledge_hnsw_index.faiss")

# Loading RAG

We load `RagRetriever` and `RagSequenceForGeneration` separately.

In [12]:
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

retriever=RagRetriever.from_pretrained(os.getenv("RAG_MODEL"), index="custom", indexed_dataset=dataset)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizerFast'.


In [13]:
model=RagSequenceForGeneration.from_pretrained(os.getenv("RAG_MODEL"), retriever=retriever).to(device)

Some weights of the model checkpoint at facebook/rag-sequence-nq were not used when initializing RagSequenceForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagSequenceForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagSequenceForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
tokenizer=RagTokenizer.from_pretrained(os.getenv("RAG_MODEL"))

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called fr

# Inference

In [15]:
question="What does Moses' rod turn into?"

def inference(question:str):
    input_ids=tokenizer.question_encoder(question, return_tensors="pt").to(device)
    generated=model.generate(input_ids['input_ids'])
    generated_str=tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    return generated_str

inference(question)

' aaron'

In [16]:
question2="How many people live in Melbourne?"
inference(question2)

' trading card game'

In [17]:
question3="What is the relationship between Nintendo and Pokémon?"
inference(question3)

' trading card game'

In [18]:
question4="What does Moses' rod turn into?"
inference(question4)

' aaron'

In [19]:
question5="Where is Pokémon company?"
inference(question5)

' trading card game'

# Conclusion

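As the results show, when we use our own dataset we need to make sure our private data covers some basic general knowledge. Otherwise, the answers may be worse than expected. For example, the index here contains only the Aaron and Pokémon passages, so the question about Melbourne's population still returns ' trading card game'.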
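
One rough way to spot questions that fall outside the knowledge base is to check whether a question shares any content words with the indexed passages. The sketch below is a crude keyword heuristic, not part of the RAG model; the stop-word list, the length threshold, the helper names, and the abbreviated passages are all assumptions made for illustration.

```python
import re
from typing import Dict, Set

# Small hand-picked stop list; an assumption made for this sketch.
STOP_WORDS = {"what", "does", "into", "where", "many", "that", "this", "with", "from"}


def content_words(text: str) -> Set[str]:
    """Lower-case words of at least four characters that are not stop words."""
    return {
        word
        for word in re.findall(r"\w+", text.lower())
        if len(word) >= 4 and word not in STOP_WORDS
    }


def shared_words(question: str, passages: Dict[str, str]) -> Set[str]:
    """Content words that the question has in common with any title or passage."""
    vocabulary: Set[str] = set()
    for title, text in passages.items():
        vocabulary |= content_words(title) | content_words(text)
    return content_words(question) & vocabulary


# Abbreviated stand-ins for the two indexed passages.
passages = {
    "Aaron": "Aaron is a prophet, high priest, and the brother of Moses. "
    "At the command of Moses, he let his rod turn into a snake.",
    "Pokémon": "Pokémon is a media franchise managed by The Pokémon Company, "
    "a Japanese consortium between Nintendo, Game Freak, and Creatures.",
}

covered = shared_words("What does Moses' rod turn into?", passages)
uncovered = shared_words("How many people live in Melbourne?", passages)
print(covered)    # non-empty: the question overlaps with the Aaron passage
print(uncovered)  # empty: nothing in the index mentions Melbourne
```

An empty overlap is only a hint that the retriever has nothing relevant to return; a non-empty overlap does not guarantee a correct answer.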