## 3_faiss_ranking.ipynb

This notebook implements a FAISS-based ranking system for retrieving the most relevant documents for given queries. The workflow includes:
1. Loading precomputed document embeddings and encoding the test queries with `BAAI/bge-m3`.
2. Building a FAISS index for fast nearest-neighbor search.
3. Retrieving the top-k nearest documents for each query by L2 distance (the ranking matches cosine similarity when the embeddings are unit-normalized).
4. Mapping the search results back to query and document IDs and displaying them.

### Output
- A dictionary `results` that maps each query ID to a list of `(document ID, distance)` tuples for its top-k documents.

### Notes
- FAISS is optimized for similarity search over dense vectors, which makes it suitable for large document collections.
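- The default `IndexFlatL2` index performs exact (brute-force) search and reports squared L2 distances, so smaller values mean closer matches.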
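
The workflow mentions both cosine similarity and L2 distance. For unit-normalized vectors the two produce the same ranking, because $\|a - b\|^2 = 2 - 2\cos(a, b)$. The NumPy sketch below illustrates this on random stand-in vectors, not on the notebook's embeddings. Whether the stored embeddings are already normalized is an assumption worth checking before relying on this equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in data: 100 "documents" and 1 "query", 16 dimensions each.
docs = rng.normal(size=(100, 16))
query = rng.normal(size=(16,))

# Normalize to unit length so that cosine similarity equals the dot product.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cosine = docs @ query                       # higher is more similar
sq_l2 = ((docs - query) ** 2).sum(axis=1)   # lower is more similar

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cosine)

# Ranking by ascending L2 distance matches ranking by descending cosine.
top_by_l2 = np.argsort(sq_l2)[:10]
top_by_cosine = np.argsort(-cosine)[:10]
same_ranking = np.array_equal(top_by_l2, top_by_cosine)
```

If the embeddings are not normalized, the two rankings can differ, and normalizing both document and query vectors before indexing is one way to get cosine-style ranking from an L2 index.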

In [None]:
import faiss
import numpy as np
import pandas as pd
import pickle

from collections import defaultdict
from FlagEmbedding import FlagModel

In [None]:
# Load the precomputed document vectors
with open('../pkl/m3_chunk_128/m3_chunk_128_embedding_1.pkl', 'rb') as f:
    doc_embeddings_dict = pickle.load(f)
doc_ids = list(doc_embeddings_dict.keys())
doc_embeddings = np.array([doc_embeddings_dict[doc_id] for doc_id in doc_ids]).astype('float32')
del doc_embeddings_dict

In [None]:
%%time

# Load the test queries
test_path = '../dataset/test.csv'
test_df = pd.read_csv(test_path)

# Load the model
model = FlagModel('BAAI/bge-m3',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# Embed the test queries
queries = test_df['query'].tolist()
query_ids = test_df['id'].tolist()
query_embeddings = model.encode(queries).astype('float32')

In [None]:
# Get vector dimensions
d = doc_embeddings.shape[1]
# Initialise the FAISS index
index = faiss.IndexFlatL2(d)
# Add document vectors to the index
index.add(doc_embeddings)
print(f"Added {index.ntotal} document vectors to the index")
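
`IndexFlatL2` needs no training and applies no compression: it stores the vectors as they are and compares every query against all of them, reporting squared L2 distances. The NumPy sketch below reproduces that computation on small random arrays to make it concrete. `brute_force_l2_search` is a hypothetical helper, not part of FAISS, and is only practical for small inputs.

```python
import numpy as np

def brute_force_l2_search(xb, xq, k):
    """Exact top-k search by squared L2 distance (hypothetical helper)."""
    # Pairwise squared distances via ||q||^2 - 2 q.b + ||b||^2, shape (nq, nb).
    sq_dists = (
        (xq ** 2).sum(axis=1, keepdims=True)
        - 2.0 * xq @ xb.T
        + (xb ** 2).sum(axis=1)
    )
    # Indices of the k smallest distances per query, in ascending order.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

rng = np.random.default_rng(0)
xb = rng.normal(size=(50, 8))   # stand-in document vectors
xq = xb[:3].copy()              # queries copied from the first three documents

dists, idx = brute_force_l2_search(xb, xq, k=5)
```

Because each query here is a copy of a stored vector, its nearest neighbor is that vector itself at a distance of approximately zero, which is a quick sanity check for any exact index.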

In [None]:
# Search for the 10 nearest documents for each query vector
k = 10
distances, indices = index.search(query_embeddings, k)

# Associate results with query IDs and document IDs
results = {}
for i, query_id in enumerate(query_ids):
    results[query_id] = [(doc_ids[indices[i][j]], distances[i][j]) for j in range(k)]

In [None]:
# Display the results for the first 5 queries
for query_id, docs in list(results.items())[:5]:
    print(f"Query ID: {query_id}")
    for doc_id, distance in docs:
        print(f"    Document ID: {doc_id}, distance: {distance}")
    print()
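
The result-assembly loop indexes `doc_ids` with every entry of `indices`. FAISS fills a slot with `-1` when it finds fewer than `k` neighbors (for example, when the index holds fewer than `k` vectors), and `doc_ids[-1]` would then silently return the last document. A more defensive variant, shown here on small made-up arrays, skips those entries. `build_results`, the sample IDs, and the placeholder distances are illustrative and not part of the notebook.

```python
import numpy as np

def build_results(query_ids, doc_ids, distances, indices):
    """Map each query ID to a list of (doc_id, distance) pairs."""
    results = {}
    for query_id, dist_row, idx_row in zip(query_ids, distances, indices):
        results[query_id] = [
            (doc_ids[idx], float(dist))
            for idx, dist in zip(idx_row, dist_row)
            if idx != -1  # -1 marks a missing neighbor
        ]
    return results

# Made-up search output for two queries with k = 3.
doc_ids = ["doc_a", "doc_b", "doc_c"]
query_ids = ["q1", "q2"]
indices = np.array([[2, 0, 1], [1, -1, -1]])
# np.inf is only a placeholder distance for the missing slots.
distances = np.array([[0.1, 0.4, 0.9], [0.2, np.inf, np.inf]], dtype="float32")

results = build_results(query_ids, doc_ids, distances, indices)
```

Converting each distance with `float()` also turns NumPy scalars into plain Python floats, which keeps `results` easy to serialize later.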