# BM25 Evaluation for Pira

This Jupyter notebook evaluates the performance of the BM25 retriever on the Pirá dataset.

The code is based on the BM25 implementation in the Haystack library: https://haystack.deepset.ai/overview/intro

The full Pirá repository is available on GitHub: https://github.com/C4AI/Pira

## Imports

In [None]:
import pandas as pd
from haystack.utils import launch_es
import os
from subprocess import Popen, PIPE, STDOUT
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.pipelines import DocumentSearchPipeline
import ast


## Dataset information

The following values are needed to load the dataset:

PATH_BASE -> Dataset path, used as a prefix for the train, validation, and test CSV file names.

ABSTRACT_COLUMN -> Index of the supporting text column. Use "10" for English or "-2" for Portuguese.

ANSWER_COLUMN -> Index of the answer column. Use "6" for English or "7" for Portuguese.

QUESTION_COLUMN -> Index of the question column. Use "2" for English or "3" for Portuguese.

In [None]:
PATH_BASE = './Data/'  # directory prefix; file names are appended when loading

ABSTRACT_COLUMN = 18
ANSWER_COLUMN = 7
QUESTION_COLUMN = 3

### Defining the index name for the document store

In [None]:
INDEX_KNOWLEDGE_BASE = "abstracts_100_pt"


## Loading Dataset

Each supporting text must be indexed only once, so duplicate texts are removed and every distinct text is assigned an id.

In [None]:
pira_train = pd.read_csv(PATH_BASE + "train.csv").values.tolist()
pira_val = pd.read_csv(PATH_BASE + "validation.csv").values.tolist()
pira_test = pd.read_csv(PATH_BASE + "test.csv").values.tolist()

pira_dataset = pira_train + pira_val + pira_test

# Assign a 1-based id to each distinct supporting text, in order of first appearance.
abstract_ids = {}
for row in pira_dataset:
    text = row[ABSTRACT_COLUMN]
    if text not in abstract_ids:
        abstract_ids[text] = len(abstract_ids) + 1

# Append the id of its supporting text to every row (read later as the last column).
for row in pira_dataset:
    row.append(abstract_ids[row[ABSTRACT_COLUMN]])

# Build the documents in the format expected by the document store.
dicts = [{'content': text, 'meta': {'idarticle': idarticle}}
         for text, idarticle in abstract_ids.items()]
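
The de-duplication step can be illustrated on toy data. This is a minimal sketch rather than the notebook's own code: the helper `assign_ids` and the sample rows are hypothetical and exist only for illustration.

```python
def assign_ids(rows, text_column):
    """Map each distinct text to a 1-based id, in order of first appearance."""
    ids = {}
    for row in rows:
        text = row[text_column]
        if text not in ids:
            ids[text] = len(ids) + 1
    return ids


# Toy rows of the form [question, supporting text]; two questions share one text.
rows = [
    ["q1", "text A"],
    ["q2", "text B"],
    ["q3", "text A"],
]

ids = assign_ids(rows, text_column=1)

# One document per distinct supporting text, carrying its id as metadata.
toy_docs = [{"content": text, "meta": {"idarticle": i}} for text, i in ids.items()]

print(ids)            # {'text A': 1, 'text B': 2}
print(len(toy_docs))  # 2
```

Three rows produce only two documents, so a shared supporting text is written to the index once.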

## Initializing ElasticSearch

To download the Elasticsearch files, uncomment the first three lines of the cell below.

In [None]:
#! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
#! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
#! sudo chown -R daemon:daemon elasticsearch-7.9.2


launch_es()

es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )

# wait until ES has started
! sleep 30
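
A fixed sleep may be too short on a slow machine and needlessly long on a fast one. As an alternative, the sketch below polls the server until it answers. It assumes Elasticsearch listens on its default HTTP port at `http://localhost:9200` without authentication; `es_is_up` and `wait_until` are hypothetical helper names, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until(probe, timeout=60.0, interval=1.0):
    """Call `probe` every `interval` seconds until it returns True
    or `timeout` seconds have elapsed. Return whether it succeeded."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Usage (requires a server that is starting up):
# if not wait_until(es_is_up, timeout=60):
#     raise RuntimeError("Elasticsearch did not start in time")
```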

## Creating the document store and writing supporting documents

In [None]:
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index=INDEX_KNOWLEDGE_BASE)
document_store.write_documents(dicts, batch_size=1000)  # `dicts` was built when loading the dataset

# document_store.delete_all_documents()  # Uncomment to delete the indexed documents if needed

## Creating the Retriever Module

In [None]:
retriever = ElasticsearchRetriever(document_store=document_store)


pipeline = DocumentSearchPipeline(retriever=retriever)
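
For intuition about what a BM25 retriever computes, here is a minimal pure-Python sketch of the textbook BM25 scoring formula. It is an illustration only, not Haystack's or Elasticsearch's code, and their implementations may differ in details such as tokenization and constant factors. The function name `bm25_scores`, the whitespace tokenization, and the toy corpus are assumptions; `k1 = 1.2` and `b = 0.75` are commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_corpus = [
    "the pre-salt layer holds oil reserves",
    "marine biodiversity along the coast",
    "oil spills affect marine life",
]
scores = bm25_scores("oil reserves", toy_corpus)

# Rank document indices from best to worst match.
ranking = sorted(range(len(toy_corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

The first document matches both query terms and ranks first; the second matches neither and scores zero.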

## Testing the retriever

In [None]:
question = "O que é o Pré-Sal ?"


result = pipeline.run(
    query=question,
    params={
        "Retriever": {
            "top_k": 5,
        }
    }
)

result

## Function that returns the accuracy for a given k

For each question, this function checks whether its supporting text is among the top k retrieved documents, then reports the fraction of questions for which it is (the accuracy at k).

In [None]:

def get_BM25_acc(questions, K_Values):
    """Return, for each k in K_Values, the fraction of questions whose
    supporting text is among the top-k retrieved documents."""
    maxK = max(K_Values)
    # Retrieve once per question with the largest k; smaller k values reuse a prefix.
    ids = []
    for line in questions:
        result = pipeline.run(query=line[QUESTION_COLUMN], params={"Retriever": {"top_k": maxK}})
        # Read the article id directly from each retrieved document's metadata.
        ids.append([int(doc.meta["idarticle"]) for doc in result["documents"]])
    accuracies = []
    for k in K_Values:
        # The supporting-text id was appended as the last column of each row.
        corrects = sum(int(line[-1]) in retrieved[:k] for line, retrieved in zip(questions, ids))
        accuracies.append(corrects / len(questions))
    return accuracies
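
The metric itself can be checked without a running retriever. The following is a minimal sketch on toy data: `accuracy_at_k` and the id lists are hypothetical, standing in for the ranked ids returned by the pipeline and the gold supporting-text id of each question.

```python
def accuracy_at_k(retrieved_ids, gold_ids, k_values):
    """For each k, return the fraction of queries whose gold id
    appears among the first k retrieved ids."""
    accuracies = []
    for k in k_values:
        hits = sum(gold in ranked[:k] for ranked, gold in zip(retrieved_ids, gold_ids))
        accuracies.append(hits / len(gold_ids))
    return accuracies


# Ranked ids retrieved for three toy questions, and the gold id of each.
retrieved = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
gold = [3, 1, 2]

toy_accs = accuracy_at_k(retrieved, gold, [1, 2, 3])
print(toy_accs)  # [0.3333333333333333, 0.6666666666666666, 1.0]
```

Accuracy can only stay equal or grow as k increases, because the top-k list for a smaller k is a prefix of the list for a larger k.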



## Evaluating BM25 performance for multiple k values

In [None]:
Ks = range(1, 101)
# Drop rows without a question before evaluating.
pira_test2 = pd.DataFrame(pira_test)
test = pira_test2.dropna(subset=[pira_test2.columns[QUESTION_COLUMN]]).values.tolist()
accs = get_BM25_acc(test, Ks)
for k, acc in zip(Ks, accs):
    print(f"accuracy for K = {k} -- is = {acc}")

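# Output path is not defined elsewhere in the notebook; this value is an assumption, adjust as needed.
PATH_SAVE_BM25_EVAL = './bm25_eval.csv'

df_accs = pd.DataFrame(accs)
df_accs.to_csv(PATH_SAVE_BM25_EVAL)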