<a href="https://colab.research.google.com/github/Calcifer777/learn-deep-learning/blob/main/learn-rag/sample_bge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Resources

## BGE

- HF page:https://huggingface.co/BAAI/bge-m3
- Suggestions: Hybrid retrieval with re-ranking
- https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse

## Tools

- Faiss: https://github.com/facebookresearch/faiss (MIT)
- PySerini: https://github.com/castorini/pyserini (Apache)

# Install dependencies

In [1]:
%%bash

python --version

Python 3.10.12


In [2]:
%%bash

apt install libomp-dev

Reading package lists...
Building dependency tree...
Reading state information...
libomp-dev is already the newest version (1:14.0-55~exp2).
0 upgraded, 0 newly installed, 0 to remove and 33 not upgraded.






In [3]:
%%bash

pip install transformers FlagEmbedding faiss-gpu pyserini faiss-cpu



In [29]:
%%bash

pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
accelerate                       0.27.2
aiohttp                          3.9.3
aiosignal                        1.3.1
alabaster                        0.7.16
albumentations                   1.3.1
altair                           4.2.2
annotated-types                  0.6.0
anyio                            3.7.1
appdirs                          1.4.4
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array-record                     0.5.0
arviz                            0.15.1
astropy                          5.3.4
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.0
attrs                            23.2.0
audioread                        3.0.1
autograd                         1.6.2
Babel                            2.14.0
backcall                         0.2.0
be

In [25]:
import os
from pathlib import Path

import numpy as np

import datasets

from FlagEmbedding import BGEM3FlagModel, FlagModel
import torch
from transformers import XLMRobertaTokenizer
from transformers import AutoModel, AutoTokenizer

import faiss
from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder
from pyserini.output_writer import get_output_writer, OutputFormat

In [5]:
model_name = "BAAI/bge-m3-retromae"
batch_size = 8
max_sentence_length = 512
output_dir = Path("./faiss/")

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [31]:
output_dir.mkdir(exist_ok=True, parents=True)

In [8]:
model = BGEM3FlagModel(model_name_or_path=model_name, use_fp16=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

Some weights of XLMRobertaModel were not initialized from the model checkpoint at /root/.cache/huggingface/hub/models--BAAI--bge-m3-retromae/snapshots/882999a13b472477c7fde523191ba3524f3cccf9 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The parameters of colbert_linear and sparse linear is new initialize. Make sure the model is loaded for training, not inferencing


In [9]:
model.device

device(type='cuda')

In [15]:
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)

In [10]:
sentences_1 = [
    "What is BGE M3?",
    "Defination of BM25",
]
sentences_2 = [
    "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
    "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document",
]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

encoding:   0%|          | 0/1 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
encoding: 100%|██████████| 1/1 [00:01<00:00,  1.96s/it]
encoding:   0%|          | 0/1 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
encoding: 100%|██████████| 1/1 [00:00<00:00,  2.91it/s]


# Initialize model

In [11]:
flag_model = FlagModel(
    model_name_or_path=model_name,
    pooling_method="cls",
    normalize_embeddings=True,
    use_fp16=True,
)

Some weights of XLMRobertaModel were not initialized from the model checkpoint at BAAI/bge-m3-retromae and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Create Faiss index from corpus

In [12]:
corpus = datasets.Dataset.from_list([{"id": idx, "content": s} for idx, s in enumerate(sentences_1)])

In [13]:
corpus_embeddings =  flag_model.encode_corpus(
    corpus["content"],
    batch_size=batch_size,
    max_length=max_sentence_length,

)
corpus_embeddings = corpus_embeddings.astype(np.float32)
dim = corpus_embeddings.shape[-1]
dim

1024

In [14]:
faiss_index = faiss.index_factory(
    dim,
    "Flat",
    faiss.METRIC_INNER_PRODUCT,
)

In [15]:
faiss_index.train(corpus_embeddings)
faiss_index.add(corpus_embeddings)

In [32]:
def save_result(index: faiss.Index, docid: list, index_save_dir: str):
    docid_save_path = os.path.join(index_save_dir, 'docid')
    index_save_path = os.path.join(index_save_dir, 'index')
    with open(docid_save_path, 'w', encoding='utf-8') as f:
        for _id in docid:
            f.write(str(_id) + '\n')
    faiss.write_index(index, index_save_path)


save_result(faiss_index, corpus["id"], output_dir)

# Dense Retrieval

In [23]:
query_encoder = AutoQueryEncoder(
    encoder_dir=model_name,
    device=device,
    pooling="cls",
    use_fp=16,
    l2_norm=True,  # normalizes embeddings
)

Some weights of XLMRobertaModel were not initialized from the model checkpoint at BAAI/bge-m3-retromae and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [33]:
faiss_searcher = FaissSearcher(
    index_dir=(output_dir).as_posix(),
    query_encoder=query_encoder,
)

In [37]:
search_results = faiss_searcher.batch_search(
    queries=sentences_2,
    q_ids=list(range(len(sentences_2))),
    k=2,  # hits
    threads=2,
)
search_results

{0: [DenseSearchResult(docid='0', score=0.8609249),
  DenseSearchResult(docid='1', score=0.8046838)],
 1: [DenseSearchResult(docid='1', score=0.8583928),
  DenseSearchResult(docid='0', score=0.81190693)]}