### Benchmark Embedding models
The purpose of this notebook is to benchmark the embedding models on **french dataset FQUAD**.

In fact, embeddings are the main part of semantic search especially when dealing with question answering.

It's applications can be extended to RAG `Retrieval Augmentated Generation`.

In this notebook we propose to:
* Benchmark models mainly `multi-qa-mpnet-base-dot-v1` and `distiluse-base-multilingual-cased-v1`
* Finetune `CamemBert` model on **FQUAD** dataset
* Compare the different results obtained
* Analyse the strenghes and weaknesses of each model

Without further ado, let's start &#x1F64F;

#### Model Benchmarking

In [1]:
import pandas as pd
import torch
from datasets import load_dataset
from pathlib import Path
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from tqdm import tqdm

In [2]:
class QuestionAnsweringEmbeddingInference:
    def __init__(self, model_name_or_path, dataset_name_or_path):
        self.model_name_or_path = model_name_or_path
        self.dataset_name_or_path = dataset_name_or_path
        
    def _load_model(self):
        model = SentenceTransformer(self.model_name_or_path)
        return model
    
    def _transform_data(self):
        dataset = load_dataset("json", self.dataset_name_or_path, field="data", split='train+validation')
        
        # Transform dataset to dataframe
        df_dict = {
            "question": [],
            "context": [],
            "id": []
        }
        
        print(dataset)
        for row in tqdm(dataset):
            for paragraph in dataset["paragraphs"]:
                paragraph = paragraph[0]
                for question in paragraph["qas"]:
                    df_dict["question"].append(question["question"])
                    df_dict["context"].append(paragraph["context"])
                    df_dict["id"].append(question["id"])
            
        df = pd.DataFrame(df_dict) 
        
        # Extract no duplicates
        no_duplicates = df.drop_duplicates(
            subset="context",
            keep="first"
        )
        no_duplicates.drop(columns=["question"], inplace=True)
        no_duplicates["id"] = no_duplicates["id"].apply(
            lambda x: x+"con"
        )
        
        # Merge dataset
        df = df.merge(no_duplicates, how="inner", on="context")
        
        # Construct retrieval queries
        ir_queries = {
            row['id_x']: row['question'] for i, row in df.iterrows()
        }
        
        # Construct retrieval contexts
        ir_corpus = {
            row['id_y']: row['context'] for i, row in df.iterrows()
        }
        
        ir_relevant_docs = {key: [] for key in df['id_x'].unique()}
        for i, row in df.iterrows():
            ir_relevant_docs[row['id_x']].append(row['id_y'])
        ir_relevant_docs = {key: set(values) for key, values in ir_relevant_docs.items()}
            
        return ir_queries, ir_corpus, ir_relevant_docs
    
    def evaluate(self):
        
        ir_queries, ir_corpus, ir_relevant_docs = self._transform_data()
        ir_eval = InformationRetrievalEvaluator(
            ir_queries, ir_corpus, ir_relevant_docs
        )
        
        # Load the model
        model = self._load_model()
        
        return ir_eval(model)

In [3]:
models = ["multi-qa-mpnet-base-dot-v1", "distiluse-base-multilingual-cased-v1"]
dataset_name_or_path = 'valid.json'

model_results = {}
for model in models:
    
    qa_embbeding_inference = QuestionAnsweringEmbeddingInference(
        model_name_or_path=model,
        dataset_name_or_path=dataset_name_or_path
    )
    
    score = qa_embbeding_inference.evaluate()
    
    model_results[model] = score

Found cached dataset json (C:/Users/hella/.cache/huggingface/datasets/json/valid.json-a66e53f4fc1fb90c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)


Dataset({
    features: ['paragraphs', 'title'],
    num_rows: 135
})


100%|████████████████████████████████████████████████████████████████████████████████| 135/135 [01:11<00:00,  1.90it/s]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_duplicates.drop(columns=["question"], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_duplicates["id"] = no_duplicates["id"].apply(
Found cached dataset json (C:/Users/hella/.cache/huggingface/datasets/json/valid.json-a66e53f4fc1fb90c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)


Dataset({
    features: ['paragraphs', 'title'],
    num_rows: 135
})


100%|████████████████████████████████████████████████████████████████████████████████| 135/135 [01:08<00:00,  1.96it/s]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_duplicates.drop(columns=["question"], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_duplicates["id"] = no_duplicates["id"].apply(


Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/539M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)5f450/tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/452 [00:00<?, ?B/s]

Downloading (…)966465f450/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)465f450/modules.json:   0%|          | 0.00/341 [00:00<?, ?B/s]

In [4]:
for model, score in model_results.items():
    print(f"model: {model}, score: {score}")

model: multi-qa-mpnet-base-dot-v1, score: 0.7924400322463055
model: distiluse-base-multilingual-cased-v1, score: 0.8452430164472748
