# **Retrieval and QA Evaluation on German Legal Data**

This notebook evaluates different information retrieval and question answering (QA) methods on German legal documents.

We compare BM25 and Dense Passage Retrieval (DPR) for document and passage retrieval and take first steps in evaluating a BERT-based QA model.

## 1. Init Environment

- Install the latest Haystack version from its GitHub repository
- Download and start Elasticsearch


In [None]:
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

In [None]:
# Download the Elasticsearch 7.9.2 release archive, unpack it, and start it as a background process
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
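
The fixed `sleep 30` above can be too short on a slow machine and needlessly long on a fast one. A minimal sketch of an alternative, assuming Elasticsearch listens on its default HTTP port 9200 on localhost: poll the TCP port until it accepts connections. The helper name `wait_for_port` is hypothetical and not part of Haystack or Elasticsearch.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In this notebook, `wait_for_port("localhost", 9200)` could stand in for the `sleep 30` line. An open port only shows that the process is listening, not that the cluster is fully ready, so a short extra wait may still be needed.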

In [None]:
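# Check whether CUDA is available; FARM falls back to the CPU if it is not
from farm.utils import initialize_device_settings

device, n_gpu = initialize_device_settings(use_cuda=True)
print("Device:", device, "| Number of GPUs:", n_gpu)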

## 2. Create Elasticsearch DocumentStore

- Init Elasticsearch
- Init data PreProcessor
- Write documents to Elasticsearch (optional: preprocess documents)

In [None]:
# Names of the Elasticsearch indices that hold the evaluation documents and labels
doc_index = "evaluation_docs"
label_index = "evaluation_labels"

In [None]:
# Connect to Elasticsearch
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", 
                                            username="", 
                                            password="", 
                                            index="document",
                                            create_index=False, 
                                            embedding_field="emb",
                                            embedding_dim=768, 
                                            excluded_meta_data=["emb"])

In [None]:
from haystack.preprocessor import PreProcessor

# Write evaluation data to Elasticsearch Document Store
# split documents into passages using the PreProcessor

preprocessor = PreProcessor(
    split_length=100,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False
)
# make sure to delete documents before writing evaluation data!

document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

# add evaluation data 
document_store.add_eval_data(
    filename="FILENAME",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor
)

# Aggregate the labels into the format needed for retriever and reader evaluation
labels = document_store.get_all_labels_aggregated(index=label_index)
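
To make the effect of `split_length=100` and `split_overlap=0` concrete, here is a plain-Python sketch of fixed-length word splitting. It assumes the PreProcessor splits on words and ignores sentence boundaries, as configured above. It is an illustration only, not Haystack's implementation, and the function name `split_into_passages` is hypothetical.

```python
from typing import List


def split_into_passages(text: str, split_length: int = 100, split_overlap: int = 0) -> List[str]:
    """Split text into passages of at most `split_length` whitespace-separated words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap  # how far the window moves each time
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return passages


# Toy document with 250 "words"
doc = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(doc, split_length=100, split_overlap=0)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

With no overlap, an answer that straddles a passage boundary is cut in two, which is one reason to experiment with a non-zero `split_overlap`.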

In [None]:
# check number of documents in DocumentStore
document_store.get_document_count(index=doc_index)

## 3. Initialize Retrievers

- Init BM25
- Init DPR

In [None]:
# Init BM25 Retriever
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

# Init DensePassageRetriever
from haystack.retriever.dense import DensePassageRetriever
retriever_dense = DensePassageRetriever(document_store=document_store,
                                 query_embedding_model="deepset/gbert-base-germandpr-question_encoder",
                                 passage_embedding_model="deepset/gbert-base-germandpr-ctx_encoder",
                                 use_gpu=True,
                                 embed_title=True,
                                 batch_size=16)

# Update document embeddings in database 
document_store.update_embeddings(retriever_dense, index=doc_index)
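
For intuition about what the sparse retriever ranks by, the following is a small sketch of textbook Okapi BM25 scoring on tokenized toy documents. The parameters `k1=1.2` and `b=0.75` are common defaults and are assumed here; Elasticsearch's actual scoring may differ in details. The function name `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches for the same score
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "der mieter zahlt die miete".split(),
    "das gericht weist die klage ab".split(),
    "der vermieter verklagt den mieter".split(),
]
scores = bm25_scores("mieter miete".split(), docs)
print(scores)
```

The first document matches both query terms and scores highest, the third matches one term, and the second matches none and scores zero. DPR, by contrast, ranks passages by the similarity of learned query and passage embeddings rather than by term overlap.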

## 4. Evaluation of BM25

- Evaluate BM25 on documents for different values of `top_k` (10 to 100 in steps of 10)

In [None]:
for k in range(10, 110, 10):
  retriever_eval_results = retriever.eval(top_k=k, label_index=label_index, doc_index=doc_index)
  print("top_k:", k)
  # Retriever Recall is the proportion of questions for which the document containing the answer is among the top_k retrieved documents
  print("Retriever Recall:", retriever_eval_results["recall"])
  # Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
  print("Retriever Mean Avg Precision:", retriever_eval_results["map"])
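
The two metrics printed above can be illustrated on toy data. The sketch below uses textbook definitions of recall at k and mean average precision. Haystack's `eval()` may differ in details, for example in how several gold documents per question are counted. The function names, question IDs, and document IDs are made up for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of questions with at least one relevant document among the top k results."""
    hits = sum(1 for q, ranking in retrieved.items() if set(ranking[:k]) & relevant[q])
    return hits / len(retrieved)


def mean_average_precision(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Mean over questions of the average precision within the top k results."""
    total = 0.0
    for q, ranking in retrieved.items():
        found = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant[q]:
                found += 1
                precision_sum += found / rank  # precision at this rank
        total += precision_sum / len(relevant[q]) if relevant[q] else 0.0
    return total / len(retrieved)


# Ranked results per question and the gold documents per question (toy data)
retrieved = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4", "d6"], "q3": ["d7", "d8", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}}

print("Recall@3:", recall_at_k(retrieved, relevant, k=3))
print("MAP@3:", mean_average_precision(retrieved, relevant, k=3))  # 0.5
```

Two of the three questions have their gold document in the top 3, so recall is 2/3. For `q2` the gold document sits at rank 2, which halves its average precision: a lower rank reduces MAP but leaves recall unchanged.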

## 5. Evaluation of DPR

- Evaluate DPR on documents for different values of `top_k` (10 to 100 in steps of 10)

In [None]:
for k in range(10, 110, 10):
  # Evaluate the dense retriever on its own
  retriever_eval_results = retriever_dense.eval(top_k=k, label_index=label_index, doc_index=doc_index)
  print("top_k:", k)
  # Retriever Recall is the proportion of questions for which the document containing the answer is among the top_k retrieved documents
  print("Retriever Recall:", retriever_eval_results["recall"])
  # Retriever Mean Avg Precision rewards retrievers that give relevant documents a higher rank
  print("Retriever Mean Avg Precision:", retriever_eval_results["map"])