# Retriever Model and Pipeline Evaluation on LegalQuAD

This notebook evaluates several retriever and reader models on the German LegalQuAD test set. 
The experiments were conducted for the paper "Question Answering in German Caselaw", published at Pro-Ve 2022.
The paper describes the experiments in detail. 

### Prerequisites: 

- Python >= 3.7.5
- farm-haystack >= 1.0
- PyTorch
- A running Elasticsearch instance for the DocumentStore (self-hosted or managed)
- GPU support is highly recommended

## Table of Contents:

* [1. Database setup](#chapter1)
* [2. Data Preprocessing & Database indexing](#chapter2)
* [3. Retriever and Reader model initialisation](#chapter3)
* [4. Evaluation & Results](#chapter4) 
    * [4.1 Retriever models standalone](#section_4_1) 
        * [4.1.1 Retriever model on top_k=10](#section_4_1_1)
        * [4.1.2 Retriever model with top_k=range(10,110,10)](#section_4_1_2)
    * [4.2 QA-Pipeline](#section_4_2) 
        * [4.2.1 QA-Pipeline with BM25](#section_4_2_1)
        * [4.2.2 QA-Pipeline with MFAQ](#section_4_2_2)
        * [4.2.3 QA Pipeline with Ensemble Retriever (BM25 + MFAQ)](#section_4_2_3)

## 1. Database setup <a class="anchor" id="chapter1"></a>

In [None]:
# Import ElasticSearch DocumentStore
from haystack.document_stores import ElasticsearchDocumentStore

# Define doc and label index for the database
doc_index = "eval_docs"
label_index = "eval_labels"

# Init and connect to ElasticSearch
document_store = ElasticsearchDocumentStore(host="d959cba9e11f9a.lhrtunnel.link",
                                            port=443,
                                            scheme='https',
                                            username="",
                                            password="",
                                            index=doc_index,
                                            label_index=label_index,
                                            embedding_field="emb",
                                            embedding_dim=768,
                                            excluded_meta_data=["emb"],
                                            similarity="dot_product")

## 2. Data Preprocessing & Database indexing <a class="anchor" id="chapter2"></a>

In [None]:
# Import Preprocessor
from haystack.nodes import PreProcessor

# Init preprocessor
# Split documents into passages of 200 words, without overlap
# and without respecting sentence boundaries
preprocessor = PreProcessor(
    language='de',
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False
)

# Delete all documents and labels from both indices to make sure there are no duplicates
document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

# Convert SQuAD dataset into haystack document format
document_store.add_eval_data(
    filename="../data/legal_squad_test.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor
)
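
The preprocessor above cuts every document into passages of 200 words with no overlap. A minimal sketch of that idea in plain Python follows; `split_by_words` and the toy document are hypothetical, tokenisation on whitespace is an assumption, and Haystack's `PreProcessor` may differ in its details.

```python
def split_by_words(text, split_length=200):
    """Split text into consecutive, non-overlapping chunks of at most split_length words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    words = text.split()  # whitespace tokenisation; an assumption of this sketch
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


# Toy document of 450 "words" (hypothetical data, not taken from LegalQuAD).
toy_document = " ".join(f"w{i}" for i in range(450))
passages = split_by_words(toy_document)
print([len(p.split()) for p in passages])  # [200, 200, 50]
```

With `split_respect_sentence_boundary=False`, a passage may end in the middle of a sentence, which is what this word-count-only split also does.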

## 3. Retriever and Reader model initialisation <a class="anchor" id="chapter3"></a>

In [None]:
# Import Retrieval methods 
from haystack.nodes import TfidfRetriever, ElasticsearchRetriever, DensePassageRetriever, EmbeddingRetriever

# Init TF-IDF
retriever_tfidf = TfidfRetriever(document_store=document_store)
# Init BM25
retriever_bm25 = ElasticsearchRetriever(document_store=document_store)
# Init EmbeddingRetriever with the clips/mfaq model from the Hugging Face Hub
retriever_emb = EmbeddingRetriever(document_store=document_store, embedding_model="clips/mfaq")
# Update passage embeddings inside the document store
document_store.update_embeddings(retriever_emb, index=doc_index)
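
The document store in chapter 1 is configured with `similarity="dot_product"`, so the embedding retriever ranks passages by the dot product between the query embedding and each stored passage embedding. The sketch below illustrates that ranking step only; the function names and the 3-dimensional toy vectors are hypothetical (the real embeddings have 768 dimensions and are scored inside Elasticsearch).

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_emb, passage_embs, top_k=2):
    """Return the top_k (passage_id, score) pairs, highest score first."""
    scored = [(pid, dot(query_emb, emb)) for pid, emb in passage_embs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings.
query = [1.0, 0.0, 2.0]
passages = {
    "p1": [0.5, 0.1, 1.0],  # score 2.5
    "p2": [1.0, 1.0, 0.0],  # score 1.0
    "p3": [0.0, 0.0, 3.0],  # score 6.0
}
print(rank_by_dot_product(query, passages))  # [('p3', 6.0), ('p1', 2.5)]
```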

## 4. Evaluation & Results <a class="anchor" id="chapter4"></a>

In [None]:
# Import Pipeline and pre-defined pipelines
from haystack import Pipeline
from haystack.pipelines import ExtractiveQAPipeline, DocumentSearchPipeline
from haystack.nodes import JoinDocuments
# Import Evaluation results and Labels
from haystack.schema import EvaluationResult, MultiLabel
# set evaluation labels
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=False)

### 4.1 Retriever models standalone <a class="anchor" id="section_4_1"></a>

#### 4.1.1 Retriever model on top_k=10 <a class="anchor" id="section_4_1_1"></a>

In [None]:
# Evaluate all retriever models on top_k=10


def get_retrieval_results(top_k=10):
    """
    Iterate over the retrievers initialized in chapter 3 and
    print their passage retrieval metrics for the given top_k.
    """
    
    # list with the retrieval methods initialized in chapter 3.  
    retrieval_methods = [retriever_tfidf, retriever_bm25, retriever_emb]

    for method in retrieval_methods:
        # init document search pipeline
        pipeline_ds = DocumentSearchPipeline(retriever=method)

        # init evaluation pipeline
        eval_result_pipeline = pipeline_ds.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": top_k}}
        )

        # calculate and print metrics
        metrics = eval_result_pipeline.calculate_metrics()
        print(f"*** RETRIEVER RESULTS: {type(method).__name__} ***")
        print(f'Retriever - Recall (single relevant document): {metrics["Retriever"]["recall_single_hit"]}')
        print(f'Retriever - Recall (multiple relevant documents): {metrics["Retriever"]["recall_multi_hit"]}')
        print(f'Retriever - Mean Reciprocal Rank: {metrics["Retriever"]["mrr"]}')
        print(f'Retriever - Precision: {metrics["Retriever"]["precision"]}')
        print(f'Retriever - Mean Average Precision: {metrics["Retriever"]["map"]}')
        print("******************************************************************")

# call method
get_retrieval_results()
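
For orientation, the recall and MRR values printed above can be understood per query as follows. This is a sketch of the textbook definitions, not Haystack's implementation (which may treat edge cases such as no-answer labels differently); the function names and document ids are hypothetical. The reported metrics are averages of such per-query values over all evaluation questions.

```python
def recall_single_hit(ranked_ids, relevant_ids):
    """1.0 if at least one relevant document was retrieved, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids) else 0.0


def recall_multi_hit(ranked_ids, relevant_ids):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & set(relevant_ids)) / len(set(relevant_ids))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval result for one query.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d4"}
print(recall_single_hit(ranked, relevant))  # 1.0
print(recall_multi_hit(ranked, relevant))   # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.3333333333333333
```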

#### 4.1.2 Retriever model with top_k=range(10,110,10) <a class="anchor" id="section_4_1_2"></a>

In [None]:
# Evaluate retriever models in range(10,110,10)

def get_retrieval_results_in_range():
    """
    For each top_k in range(10, 110, 10), evaluate every retriever
    initialized in chapter 3 and print its passage retrieval metrics.
    """
    
    # list with the retrieval methods  initialized in chapter 3.  
    retrieval_methods = [retriever_tfidf, retriever_bm25, retriever_emb]
    
    for top_k in range(10, 110, 10):
        print(f"*** Results on top_k: {top_k} ***")
        
        for method in retrieval_methods:
            # init document search pipeline
            pipeline_ds = DocumentSearchPipeline(retriever=method)

            # init evaluation pipeline
            eval_result_pipeline = pipeline_ds.eval(
                labels=eval_labels,
                params={"Retriever": {"top_k": top_k}}
            )

            # calculate and print metrics
            metrics = eval_result_pipeline.calculate_metrics()
            print(f"*** RETRIEVER RESULTS: {type(method).__name__} ***")
            print(f'Retriever - Recall (single relevant document): {metrics["Retriever"]["recall_single_hit"]}')
            print(f'Retriever - Recall (multiple relevant documents): {metrics["Retriever"]["recall_multi_hit"]}')
            print(f'Retriever - Mean Reciprocal Rank: {metrics["Retriever"]["mrr"]}')
            print(f'Retriever - Precision: {metrics["Retriever"]["precision"]}')
            print(f'Retriever - Mean Average Precision: {metrics["Retriever"]["map"]}')
            print("******************************************************************")

# call method
get_retrieval_results_in_range()
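
When reading the sweep, it helps to keep in mind that recall can only stay the same or rise as `top_k` grows, whereas precision typically falls because more non-relevant passages are returned. The sketch below shows precision at a cut-off and average precision for a single query using the textbook definitions; it is not Haystack's code, and the function names and document ids are hypothetical. Mean average precision is the mean of the per-query average precision.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank over the ranks at which a relevant document appears,
    normalised by the number of relevant documents."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical retrieval result for one query.
ranked = ["d3", "d1", "d7", "d4", "d9"]
relevant = {"d1", "d4"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(average_precision(ranked, relevant))  # 0.5
```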

### 4.2 QA-Pipeline <a class="anchor" id="section_4_2"></a>

In [None]:
# Import FARMReader
from haystack.nodes.reader import FARMReader

# Init fine_tuned reader
reader_fine_tuned = FARMReader("finetuned_models/GELECTRA-large-LegalQuAD-new", return_no_answer=True)

# Init base reader
reader_base = FARMReader("deepset/gelectra-base-germanquad", return_no_answer=True)

# Init large reader
reader_large = FARMReader("deepset/gelectra-large-germanquad", return_no_answer=True)

#### 4.2.1 QA-Pipeline with BM25 <a class="anchor" id="section_4_2_1"></a>

In [None]:
# Fine-tuned reader evaluation
pipeline_qa_fine_tuned = ExtractiveQAPipeline(reader=reader_fine_tuned, retriever=retriever_bm25)

eval_result_pipeline = pipeline_qa_fine_tuned.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader Fine-tuned ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")
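
The `f1` and `exact_match` values reported for the reader compare predicted answer spans with the gold answers. A minimal SQuAD-style sketch of both scores for one prediction follows; the normalisation here (lower-casing and whitespace splitting) is an assumption and may differ from what Haystack applies, and the German example strings are hypothetical.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if both answers are identical after simple normalisation, else 0.0."""
    return 1.0 if prediction.lower().split() == gold.lower().split() else 0.0


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match, one empty answer as a miss.
        return 1.0 if pred_tokens == gold_tokens else 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "die Klage ist unbegründet"
gold = "Klage ist unbegründet"
print(exact_match(prediction, gold))         # 0.0
print(round(token_f1(prediction, gold), 4))  # 0.8571
```

The example shows why F1 is usually higher than Exact Match: a prediction that contains one extra token still earns partial credit.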


In [None]:
# Base reader evaluation
pipeline_qa_base = ExtractiveQAPipeline(reader=reader_base, retriever=retriever_bm25)

eval_result_pipeline = pipeline_qa_base.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader BASE ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")


In [None]:
# Large reader evaluation
pipeline_qa_large = ExtractiveQAPipeline(reader=reader_large, retriever=retriever_bm25)

eval_result_pipeline = pipeline_qa_large.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader LARGE ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")


#### 4.2.2 QA-Pipeline with MFAQ <a class="anchor" id="section_4_2_2"></a>

In [None]:
# Fine-tuned reader evaluation
pipeline_qa_fine_tuned = ExtractiveQAPipeline(reader=reader_fine_tuned, retriever=retriever_emb)

eval_result_pipeline = pipeline_qa_fine_tuned.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader Fine-tuned ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")
print(f"*** All Metrics ***")
print(metrics)
print("******************************************************************")

In [None]:
# Base reader evaluation
pipeline_qa_base = ExtractiveQAPipeline(reader=reader_base, retriever=retriever_emb)

eval_result_pipeline = pipeline_qa_base.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader BASE ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")
print(f"*** All Metrics ***")
print(metrics)
print("******************************************************************")

In [None]:
# Large reader evaluation
pipeline_qa_large = ExtractiveQAPipeline(reader=reader_large, retriever=retriever_emb)

eval_result_pipeline = pipeline_qa_large.eval(
            labels=eval_labels,
            params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
        )

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print(f"*** RESULTS: Reader LARGE ***")
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
print("******************************************************************")
print(f"*** All Metrics ***")
print(metrics)
print("******************************************************************")

#### 4.2.3 QA Pipeline with Ensemble Retriever (BM25 + MFAQ) <a class="anchor" id="section_4_2_3"></a>

In [None]:
# Ensemble Pipeline evaluation
from haystack import Pipeline
from haystack.nodes import JoinDocuments

pipeline_ensemble = Pipeline()
pipeline_ensemble.add_node(component=retriever_bm25, name="Retriever_BM25", inputs=["Query"])
pipeline_ensemble.add_node(component=retriever_emb, name="Retriever_EMB", inputs=["Query"])
pipeline_ensemble.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["Retriever_BM25", "Retriever_EMB"])
pipeline_ensemble.add_node(component=reader_fine_tuned, name="Reader_Fine_tuned", inputs=["JoinResults"])


eval_result_pipeline = pipeline_ensemble.eval(
    labels=eval_labels,
    params={
        "Retriever_BM25": {"top_k": 10},
        "Retriever_EMB": {"top_k": 10},
        "Reader_Fine_tuned": {"top_k": 5},
    },
)

# calculate and print metrics
metrics = eval_result_pipeline.calculate_metrics()
print("*** RESULTS: Reader Fine-tuned (Ensemble Retriever) ***")
print(f'Reader - F1-Score: {metrics["Reader_Fine_tuned"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader_Fine_tuned"]["exact_match"]}')
print("******************************************************************")
print(f"*** All Metrics ***")
print(metrics)
print("******************************************************************")
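
In the ensemble pipeline the reader receives the union of the BM25 and MFAQ results, so with `top_k=10` for each retriever it sees at most 20 passages per question. The sketch below illustrates a concatenating join that keeps each passage once; it is a simplified stand-in, not the `JoinDocuments` implementation, and the function name, ids and texts are hypothetical.

```python
def concatenate_results(*result_lists):
    """Merge several ranked result lists, keeping the first occurrence of each id."""
    merged = {}
    for results in result_lists:
        for doc_id, text in results:
            merged.setdefault(doc_id, text)
    return list(merged.items())


# Hypothetical results of two retrievers for the same query.
bm25_results = [("d1", "passage one"), ("d2", "passage two")]
emb_results = [("d2", "passage two"), ("d3", "passage three")]

joined = concatenate_results(bm25_results, emb_results)
print([doc_id for doc_id, _ in joined])  # ['d1', 'd2', 'd3']
```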