# Neural search for question answering

This exercise introduces passage retrieval, an important step in factual question answering.
It concentrates on methods for retrieving documents whose content might be useful for answering
the question. We compare lexical text representations (e.g. the default behaviour of ElasticSearch)
with dense text representations (e.g. the [multilingual E5](https://huggingface.co/intfloat/multilingual-e5-base) neural model).

## Tasks

Objectives (8 points)

1. Read the documentation of the [document store](https://docs.haystack.deepset.ai/docs/document_store) and the [retriever](https://docs.haystack.deepset.ai/docs/retriever) in the [Haystack framework](https://haystack.deepset.ai/).
2. Install the Haystack framework (e.g. with `pip install 'farm-haystack[all]'`).

In [None]:
!pip install 'farm-haystack[all]'
!pip install -U datasets
!pip install -U ipywidgets

In [22]:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever, DensePassageRetriever
import datasets
import pandas as pd
from tqdm import tqdm
import os
import shutil
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

3. Configure a document store based on Faiss, supported by the multilingual E5 model:
   1. For Faiss use the [multilingual E5](https://huggingface.co/intfloat/multilingual-e5-base) or the [silver retriever base](https://huggingface.co/ipipan/silver-retriever-base-v1) encoder.
   2. **Warning:** If you use E5, make sure to [properly configure](https://github.com/deepset-ai/haystack/issues/5242) the store.
   3. If you have problems using Faiss, you can use `InMemoryDocumentStore`, but this requires re-indexing
      all documents each time the script is run, which is time-consuming.
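
The E5 model card asks for every input to carry a role prefix, `query: ` for questions and `passage: ` for documents. A minimal sketch of that preprocessing, independent of Haystack, is shown below. The helper names `as_e5_query` and `as_e5_passage` are hypothetical and not part of any library; how the prefixes are wired into the document store is left to the linked issue.

```python
# Hypothetical helpers (not a Haystack or E5 API): add the role prefixes
# that the E5 model card expects before the text is embedded.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "


def as_e5_query(text: str) -> str:
    """Prefix a question, unless it already carries the prefix."""
    text = text.strip()
    return text if text.startswith(E5_QUERY_PREFIX) else E5_QUERY_PREFIX + text


def as_e5_passage(text: str) -> str:
    """Prefix a passage, unless it already carries the prefix."""
    text = text.strip()
    return text if text.startswith(E5_PASSAGE_PREFIX) else E5_PASSAGE_PREFIX + text


print(as_e5_query("What is an index fund?"))
print(as_e5_passage("An index fund tracks a market index."))
```

The silver retriever used in the cells below is a different model, so this step would only matter if E5 were chosen instead.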

Create Retriever

In [None]:
!rm -rf ./faiss_document_store.db

In [3]:
faiss_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    similarity="cosine"
)

In [4]:
faiss_retriever = EmbeddingRetriever(
    document_store=faiss_store,
    embedding_model='ipipan/silver-retriever-base-v1',
    model_format="sentence_transformers",
    use_gpu=True,
    progress_bar= False,
)

4. Load the documents (passages) from the FiQA corpus.

In [5]:
dataset_corpus = datasets.load_dataset("clarin-knext/fiqa-pl", "corpus", split="corpus")

In [6]:
dataset_corpus

Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 57638
})

In [7]:
df_corpus = pd.DataFrame(dataset_corpus)
df_corpus['_id'] = df_corpus['_id'].astype(int)
df_corpus.set_index('_id', inplace=True)

In [8]:
df_corpus

| _id | title | text |
|---|---|---|
| 3 | | Nie mówię, że nie podoba mi się też pomysł szk... |
| 31 | | Tak więc nic nie zapobiega fałszywym ocenom po... |
| 56 | | Nigdy nie możesz korzystać z FSA dla indywidua... |
| 59 | | Samsung stworzył LCD i inne technologie płaski... |
| 63 | | Oto wymagania SEC: Federalne przepisy dotycząc... |
| ... | ... | ... |
| 599946 | | >Cóż, po pierwsze, drogi to coś więcej niż hob... |
| 599953 | | Tak, robią. Na dotacje dla firm farmaceutyczny... |
| 599966 | | >To bardzo smutne, że nie rozumiesz ludzkiej n... |
| 599975 | | „Czy Twój CTO pozwolił dużej grupie użyć „„adm... |


In [9]:
documents = []

for index, row in df_corpus.iterrows():
    document = {
        'content': row['text'],
        'meta': {'id': index}
    }
    documents.append(document)

In [10]:
faiss_store.write_documents(documents)

Writing Documents: 60000it [02:02, 490.94it/s]


In [11]:
faiss_store.update_embeddings(faiss_retriever)

Documents Processed: 60000 docs [14:29, 69.04 docs/s]


In [12]:
faiss_store.save("store.faiss")
faiss_retriever.save("retriever.faiss")

5. Use the set of questions and the scorings defined in this corpus, to compute NDCG@5 for the dense retriever.

In [26]:
dataset_QA = datasets.load_dataset("clarin-knext/fiqa-pl-qrels")
df_qrels_test = pd.DataFrame(dataset_QA['test'])

dataset_queries = datasets.load_dataset("clarin-knext/fiqa-pl", "queries")
df_queries = pd.DataFrame(dataset_queries['queries'])
df_queries['_id'] = df_queries['_id'].astype(int)
df_queries.set_index('_id', inplace=True)

In [27]:
K = 5

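def calc_ndcg_k(scores):
    # Guard against a wrongly sized relevance vector.
    if len(scores) != K:
        raise ValueError(f"Invalid scores array size, expected {K}")
    # Position i (0-based) is discounted by log2(i + 2).
    discounts = np.log2(np.arange(2, len(scores) + 2))
    dcg = np.sum(scores / discounts)
    # Note: the ideal DCG is taken from the retrieved list re-sorted,
    # not from all passages marked relevant for the query.
    idcg = np.sum(np.sort(scores)[::-1] / discounts)
    return dcg / idcg if idcg > 0 else 0.0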

arr = np.array([0.0 for i in range(K)])
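
To make the metric concrete, here is a small stdlib-only sketch (not the notebook's code) that scores a single result list with binary relevance. It contrasts two normalisations: `ndcg_self_normalised` mirrors `calc_ndcg_k` above and divides by the DCG of the retrieved list re-sorted, while `ndcg_full` divides by the DCG of an ideal list built from the total number of relevant passages for the query. The function names and the `n_relevant` parameter are illustrative assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_self_normalised(gains):
    """Normalise by the best ordering of the retrieved list itself."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def ndcg_full(gains, n_relevant):
    """Normalise by an ideal list holding min(k, n_relevant) relevant items."""
    k = len(gains)
    ideal = dcg([1.0] * min(k, n_relevant) + [0.0] * max(0, k - n_relevant))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# One relevant passage retrieved at rank 2, out of 3 relevant passages in total.
gains = [0.0, 1.0, 0.0, 0.0, 0.0]
print(round(ndcg_self_normalised(gains), 4))     # 0.6309
print(round(ndcg_full(gains, n_relevant=3), 4))  # 0.2961
```

The self-normalised variant does not penalise relevant passages that were never retrieved, so scores computed with it may not be directly comparable with benchmark figures that use the full normalisation.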

In [28]:
ndcg = 0
iterator = 0

for query_id in tqdm(df_qrels_test['query-id'].unique()):
    question = df_queries.loc[query_id]['text']
    retriever_results = faiss_retriever.retrieve(query=question, top_k=K)
    # Passages marked relevant for this query in the qrels.
    corpus_ids = set(df_qrels_test[df_qrels_test['query-id'] == query_id]['corpus-id'])

    # Reset the relevance vector so that no value leaks from the previous query.
    arr[:] = 0.0
    for idx, val in enumerate(retriever_results):
        # Binary relevance: 1 if the retrieved passage is in the qrels, else 0.
        arr[idx] = 1.0 if int(val.meta['id']) in corpus_ids else 0.0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

100%|██████████| 648/648 [00:51<00:00, 12.65it/s]


In [30]:
mean_ndcg = ndcg / iterator
print("Mean NDCG for silver-retriever-base-v1:", mean_ndcg)

Mean NDCG for silver-retriever-base-v1: 0.26807909512976164


6. Compare the NDCG score from this exercise with the score from [lab 2](2-fts.md) and from [lab 6](6-classification.md).

**Lab2**
1. With Synonyms: 0.2669
2. Without Synonyms: 0.2657
3. Without Lemmatization: 0.2078

**Lab6**
1. With model: 0.1655
2. Without model: 0.2078

**Lab8**
1. For Retriever: 0.2680


In Lab 2, the use of Full-Text Search (FTS) with synonyms achieved a competitive NDCG score of 0.2669, slightly higher than without synonyms (0.2657). This indicates that synonyms add marginal improvements to query-document matching. However, the absence of lemmatization significantly reduced performance to 0.2078, showing that morphological normalization plays an important role in retrieval effectiveness.

In Lab 6, the classification-based approach yielded the lowest NDCG score of 0.1655, suggesting that the model struggled to align questions with relevant documents. Without the classification model the score was 0.2078, the same value as the Lab 2 run without lemmatization. Reranking with the classifier therefore lowered the score in this setup, which suggests that the model was not robust enough for this task.

Finally, in Lab 8, the dense retriever achieved the best score of the three labs, 0.2680. Its margin over FTS with synonyms (0.2669) is only about 0.001, which may not be meaningful on 648 test queries, but it clearly outperformed the classification-based approach (0.1655). The result is consistent with dense vector representations capturing semantic relationships between queries and documents, although the gap to a well-configured lexical baseline is small here.

Overall, the dense retriever scored highest, while FTS with lemmatization remains a competitive and simpler alternative. The classification approach would need further optimization to be effective in this context.

7. **Bonus** (+2p) Combine dense retrieval with the classification model from [lab 6](6-classification.md) to implement a two-step
   retrieval. Compute NDCG@5 for this combined model.
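
The two-step idea can be sketched independently of Haystack: a fast first stage proposes a candidate list, and a slower second stage reorders only those candidates. In the sketch below `first_stage` and `second_stage_score` are hypothetical stand-ins for the dense retriever and the classifier, and the toy passages and scoring rule are assumptions chosen only to make the example runnable.

```python
from typing import Callable, List


def two_step_retrieve(
    query: str,
    first_stage: Callable[[str, int], List[str]],
    second_stage_score: Callable[[str, str], float],
    n_candidates: int = 10,
    top_k: int = 5,
) -> List[str]:
    """Fetch n_candidates passages, rerank them, and keep the best top_k."""
    candidates = first_stage(query, n_candidates)
    # sorted() is stable, so ties keep the first-stage order.
    reranked = sorted(candidates, key=lambda p: second_stage_score(query, p), reverse=True)
    return reranked[:top_k]


# Toy stand-ins (assumptions): a fixed candidate list and a word-overlap scorer.
PASSAGES = ["tax on dividends", "weather today", "dividends and capital gains tax"]


def toy_first_stage(query: str, n: int) -> List[str]:
    return PASSAGES[:n]


def toy_score(query: str, passage: str) -> float:
    return float(len(set(query.split()) & set(passage.split())))


print(two_step_retrieve("capital gains tax", toy_first_stage, toy_score, top_k=2))
```

Fetching more candidates than are finally returned (10 versus 5 in the cells below) gives the second stage room to promote passages that the first stage ranked too low.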

In [62]:
model_path = "model"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

pl_tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

tokenizer_kwargs = {"padding": "max_length", "truncation": True}
pipe = TextClassificationPipeline(
    model=model,
    tokenizer=pl_tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    return_all_scores=False,
    **tokenizer_kwargs
)

print("Pipeline is ready and using GPU!" if torch.cuda.is_available() else "Pipeline is ready but using CPU.")


Using device: cuda
Pipeline is ready and using GPU!


In [63]:
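def reranking(query_id: int):
    question = df_queries.loc[query_id]['text']

    # Step 1: the dense retriever proposes 10 candidate passages.
    responses = list(faiss_retriever.retrieve(query=question, top_k=10))
    predictions = dict()

    # Step 2: the classifier scores each (question, passage) pair.
    for result in responses:
        corpus_id = result.meta['id']
        # Polish prompt: "Question: ... Context: ..."
        model_query = f"Pytanie: {question} Kontekst: {result.content}"
        pred = pipe(model_query)
        # Labels are assumed to end in the class index (e.g. 'LABEL_1').
        label = int(pred[0]['label'][-1])
        score = float(pred[0]['score'])
        # Signed confidence: positive for class 1, negative for class 0.
        predictions[corpus_id] = score if label == 1 else -score

    # Highest signed confidence first.
    return sorted(responses, key=lambda x: predictions[x.meta['id']], reverse=True)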


In [64]:
ndcg = 0
iterator = 0

for query_id in tqdm(df_qrels_test['query-id'].unique(), desc="Progress", unit="iteration", position=0, leave=True):
    responses = reranking(query_id)
    # Passages marked relevant for this query in the qrels.
    corpus_ids = set(df_qrels_test[df_qrels_test['query-id'] == query_id]['corpus-id'])

    # Reset the relevance vector so that no value leaks from the previous query.
    arr[:] = 0.0
    for idx, val in enumerate(responses[:K]):
        # Binary relevance: 1 if the reranked passage is in the qrels, else 0.
        arr[idx] = 1.0 if int(val.meta['id']) in corpus_ids else 0.0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

Progress: 100%|██████████| 648/648 [04:28<00:00,  2.41iteration/s]


In [65]:
mean_ndcg = ndcg / iterator
print("Mean NDCG for silver-retriever-base-v1 + lab_6 model:", mean_ndcg)

Mean NDCG for silver-retriever-base-v1 + lab_6 model: 0.23872316322021517


8. **Bonus** (+2p) Use a different dense encoder, e.g. [E5 large](https://huggingface.co/intfloat/multilingual-e5-large) or [Polish Roberta Base](https://huggingface.co/sdadas/mmlw-retrieval-roberta-base) and compute NDCG@5.

In [67]:
!rm -rf ./faiss_document_store.db

In [68]:
faiss_store_2 = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    similarity="cosine"
)

In [69]:
faiss_retriever_2 = EmbeddingRetriever(
    document_store=faiss_store_2,
    embedding_model='sdadas/mmlw-retrieval-roberta-base',
    use_gpu=True,
    progress_bar= False,
    model_format="sentence_transformers"
)

Embeddings

In [70]:
faiss_store_2.write_documents(documents)

Writing Documents: 60000it [01:56, 514.31it/s]


In [71]:
faiss_store_2.update_embeddings(faiss_retriever_2)

Documents Processed: 60000 docs [13:49, 72.34 docs/s]


Save State

In [72]:
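# Use separate paths so that the files saved for the first retriever are not overwritten.
faiss_store_2.save("store_mmlw.faiss")
faiss_retriever_2.save("retriever_mmlw.faiss")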

NDCG

In [73]:
ndcg = 0
iterator = 0

for query_id in tqdm(df_qrels_test['query-id'].unique()):
    question = df_queries.loc[query_id]['text']
    retriever_results = faiss_retriever_2.retrieve(query=question, top_k=K)
    # Passages marked relevant for this query in the qrels.
    corpus_ids = set(df_qrels_test[df_qrels_test['query-id'] == query_id]['corpus-id'])

    # Reset the relevance vector so that no value leaks from the previous query.
    arr[:] = 0.0
    for idx, val in enumerate(retriever_results):
        # Binary relevance: 1 if the retrieved passage is in the qrels, else 0.
        arr[idx] = 1.0 if int(val.meta['id']) in corpus_ids else 0.0
    ndcg += calc_ndcg_k(arr)
    iterator += 1

100%|██████████| 648/648 [00:48<00:00, 13.28it/s]


In [74]:
mean_ndcg = ndcg / iterator
print("Mean NDCG for mmlw-retrieval-roberta-base:", mean_ndcg)

Mean NDCG for mmlw-retrieval-roberta-base: 0.2835997656020073


Summary:

**Lab2**
1. With Synonyms: 0.2669
2. Without Synonyms: 0.2657
3. Without Lemmatization: 0.2078

**Lab6**
1. With model: 0.1655
2. Without model: 0.2078

**Lab8**
1. For Retriever: 0.2680
2. For Retriever + model: 0.2387
3. For Roberta: 0.2835

#### Questions (2 points)

**1. Which of the methods: lexical match (e.g. ElasticSearch) or dense representation works better?**

Dense representations work better here: the highest NDCG score (0.2835) was achieved by the Roberta-based retriever, compared with 0.2669 for the best lexical configuration. The margin depends on the encoder, though, since the silver retriever (0.2680) was only marginally ahead of lexical search. Dense retrieval methods capture semantic meaning and are better at matching queries with documents that use different words to express similar ideas. Lexical search, while effective in certain contexts, is limited to exact or near-exact word matches.
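
The limitation of lexical matching can be illustrated with a toy bag-of-words overlap score. This is a simplified stand-in and not how ElasticSearch's BM25 ranking works in detail; the function name `token_overlap` and the example sentences are hypothetical. A paraphrase that shares no tokens with the query scores zero, however close its meaning.

```python
def token_overlap(query: str, passage: str) -> float:
    """Jaccard overlap between lower-cased, whitespace-split token sets."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0


query = "how to buy a car"
literal = "how to buy a used car"
paraphrase = "purchasing an automobile"

print(token_overlap(query, literal))     # shares most tokens
print(token_overlap(query, paraphrase))  # shares none, scores 0.0
```

Synonym lists and lemmatization, as used in Lab 2, narrow this gap by mapping different surface forms onto the same token, whereas a dense encoder can place the paraphrase near the query in vector space without such rules.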

**2. Which of the methods is faster?**

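ElasticSearch is faster. It relies on an inverted index that is highly optimized for indexing and retrieving documents, which makes it efficient for large-scale datasets. Dense retrieval requires generating an embedding for every passage (about 14 minutes for this corpus on a GPU in the runs above) and for every query, followed by similarity computations against the stored vectors; with the exact `Flat` index every query is compared with every passage. In the runs above this gave roughly 13 queries per second.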
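
The cost of dense search with an exact index can be seen in a brute-force sketch: every query is scored against every stored vector, so the work grows linearly with the number of passages. The code below is a plain-Python illustration with made-up 3-dimensional vectors, not the FAISS implementation; `cosine` and `top_k_exact` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_k_exact(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score the query against every vector in the index and return the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings; real encoders produce hundreds of dimensions.
index = {"d1": [1.0, 0.0, 0.0], "d2": [0.7, 0.7, 0.0], "d3": [0.0, 0.0, 1.0]}
print(top_k_exact([1.0, 0.1, 0.0], index, k=2))
```

Approximate indexes trade some accuracy for speed by avoiding the full scan, which is one reason the choice of index type matters for larger corpora.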

**3. Try to determine the other pros and cons of using lexical search and dense document retrieval models.**

For Lexical Search/ElasticSearch:

- Pros:
  - Very fast and efficient for querying large datasets.
  - Highly customizable, with support for synonyms, lemmatization, and other preprocessing techniques.
- Cons:
  - Struggles to capture semantic meaning and context, relying only on word overlap.
  - Requires careful preprocessing for better results.

For Dense Retrieval Models:

- Pros:
  - Good at understanding semantic relationships and capturing synonyms or paraphrases.
  - Can adapt to specific domains through fine-tuning.
- Cons:
  - Slower than lexical methods due to the computational cost of embedding generation and similarity calculations.
  - Requires significant computational resources, including GPUs, and can be challenging to scale for extremely large datasets.




## Hints

1. Haystack is a framework for building question answering applications.
2. Lexical document retrieval is based on traditional NLP pipelines (e.g. lemmatization),
   i.e. models based on bag of words. The typical usage of ElasticSearch is based on a lexical search model.
3. Dense document retrieval is based on dense vector models provided by neural networks. These dense vectors might be
   generated directly, e.g. by averaging the vectors of word embeddings belonging to a given text fragment. Yet the
   performance of such models is inferior to that of sparse models.
4. More sophisticated models are trained directly on the document retrieval task. E.g. [DPR](https://arxiv.org/abs/2004.04906)
   uses a bi-encoder architecture that has a separate neural network for encoding the question and for encoding the passage.
   The [E5](https://arxiv.org/abs/2212.03533) model has a shared encoder architecture.
   These networks are trained to maximise the dot product of the vectors produced for matching question-passage pairs.
5. Using dense vector representation requires computing the dense vectors for all passages in the dataset.
   These vectors might be stored in document stores such as [FAISS](https://github.com/facebookresearch/faiss) for faster retrieval,
   especially when the dataset is very large (does not fit into memory).
6. [Polish retrieval benchmark](https://huggingface.co/spaces/sdadas/pirb) lists and compares the models that implement dense retrieval for Polish.
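
Hint 3 describes building a text vector by averaging word embeddings. A minimal sketch with a made-up two-dimensional vocabulary follows; the `WORD_VECTORS` table and the function names are hypothetical, and a real system would load pretrained embeddings instead. The dot product at the end is the kind of similarity score that hint 4 says retrieval models are trained to maximise.

```python
from typing import Dict, List

# Made-up 2-d word vectors, for illustration only.
WORD_VECTORS: Dict[str, List[float]] = {
    "bank": [1.0, 0.0],
    "loan": [0.8, 0.2],
    "rain": [0.0, 1.0],
}


def mean_embedding(text: str) -> List[float]:
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query_vec = mean_embedding("bank loan")
print(query_vec)
print(dot(query_vec, mean_embedding("loan")), dot(query_vec, mean_embedding("rain")))
```

Averaging discards word order and weights every word equally, which is one plausible reason such vectors lag behind encoders trained directly for retrieval.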
