# BERT model performance evaluation on a Blue Amazon QA dataset

This notebook evaluates the model bert-base-cased-squad-v1.1-portuguese (a BERT model trained on Brazilian Portuguese SQuAD) on a question-answering dataset, created by the authors, about the Amazônia Azul (the Blue Amazon, Brazil's exclusive economic zone).

This is part of the work *Interpretability of Attention Mechanisms in a Portuguese-Based Question Answering System about the Blue Amazon*,
published at Encontro Nacional de Inteligência Artificial e Computacional 2021 (ENIAC 2021).

Check the paper: https://sol.sbc.org.br/index.php/eniac/article/view/18302

Check the GitHub repository: https://github.com/C4AI/blab-qa-viz



## Installing the main libraries

In [1]:
!pip install transformers
!pip install rouge_score 

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

## Checking Colab GPU specs

In [2]:
!nvidia-smi

Thu Dec  9 04:00:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading punkt from nltk

The punkt tokenizer is needed later in the metrics section, where word_tokenize is used.

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Loading bert-base-cased-squad-v1.1-portuguese transformers model

Check the model at: https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese

In [4]:
from transformers import pipeline

# Loading the model
qa_pipeline = pipeline(
    "question-answering",
    model="pierreguillou/bert-base-cased-squad-v1.1-portuguese",
    tokenizer="pierreguillou/bert-base-cased-squad-v1.1-portuguese"
)

# Doing one prediction to check answer
predictions = qa_pipeline({
    'context': "O rio Nilo é o rio mais comprido do mundo",
    'question': "Qual é o rio mais comprido do mundo?"
})

print(predictions)

Downloading:   0%|          | 0.00/862 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/413M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/494 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

{'score': 0.8113065361976624, 'start': 0, 'end': 10, 'answer': 'O rio Nilo'}
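
The start and end fields in the output above are character offsets into the context, so slicing the context recovers the answer string. A minimal sketch, reusing the printed prediction as a literal dictionary (no model is loaded here):

```python
# Context and prediction copied from the cell and output above
context = "O rio Nilo é o rio mais comprido do mundo"
prediction = {'score': 0.8113065361976624, 'start': 0, 'end': 10, 'answer': 'O rio Nilo'}

# 'start' and 'end' index characters of the context; 'end' is exclusive
span = context[prediction['start']:prediction['end']]
print(span == prediction['answer'])
```

This is handy when the answer needs to be highlighted inside the passage rather than only printed.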


## Mounting Drive

To access the dataset from Colab, we mounted Google Drive.

In [5]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)


Mounted at /content/drive/


## Reading dataset using Pandas Library



In [6]:
import pandas as pd
df = pd.read_csv('/content/drive/Shareddrives/Explainable QA/Base de dados/QAs/Blue_Amazon_QA_dataset.csv')
df.head()

Unnamed: 0,#,Source,book/wikipedia,Passage,Question,Answer,Passagem,Questão,Resposta
0,1,Alcatrazes_Islands.txt,wikipedia,Alcatrazes possesses rich fauna and flora; by ...,How many birds live in the Alcatrazes islands?,Around 10000 birds.,Alcatrazes possui rica fauna e flora; Em dezem...,Quantas aves vivem nas ilhas de Alcatrazes?,Cerca de 10000 pássaros.
1,2,Alcatrazes_Islands.txt,wikipedia,The islands are believed to present their very...,What are the five slabs in Alcatrazes?,"Duplo, Singela, do Paredão, Farol and Negra",Acredita-se que as ilhas apresentem seu format...,Quais são as cinco lajes em Alcatrazes?,"Duplo, Singela, Paredão, Farol e Negra"
2,3,Recife_Port.txt,wikipedia,The port handles National and international cr...,is the recife port international?,yes,O porto lida com os cruzeiros nacionais e inte...,O Recife Port International?,sim
3,4,Campos_Basin.txt,wikipedia,Five tectonic stages have been identified in t...,How many tectonic stages have been identified ...,five,Cinco etapas tectônicas foram identificadas na...,Quantos estágios tectônicos foram identificado...,cinco
4,5,Port_of_Santos.txt,wikipedia,"Shaped by urban, economic and demographic deve...",What are the main exports that go through the ...,"The main exports are coffee, sugar, and soy.","Forma de desenvolvimento urbano, econômico e d...",Quais são as principais exportações que passam...,"As principais exportações são café, açúcar e s..."


## Predicting responses across the entire dataset



In [7]:
import time
# time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time
start_time = time.perf_counter()

true_answers = []
gen_answers = []
passages = []
questions = []
source = df.values.tolist()

# Loop over each question
for i in range(len(source)):
    # Print progress every 10 questions
    if i % 10 == 0:
        print(i)
    # Columns 6, 7 and 8 are Passagem, Questão and Resposta (the Portuguese fields)
    predictions = qa_pipeline({
        'context': source[i][6],
        'question': source[i][7]
    })
    # Saving to memory
    passages.append(source[i][6])
    questions.append(source[i][7])
    gen_answers.append(predictions["answer"])
    true_answers.append(source[i][8])
print(time.perf_counter() - start_time, "seconds")

  


0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
283.643553 seconds




## Metrics 

In this section, three metrics are used to evaluate the results:


*   Macro-averaged F1 score
*   Exact match (EM)
*   ROUGE-L

The F1 and exact-match code is adapted from the evaluation code of the SQuAD dataset, and the ROUGE-L code from that of MS MARCO.
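
To make the first two metrics concrete, here is a simplified, self-contained sketch. It is not the notebook's code: the names `normalize`, `token_f1` and `exact_match` are illustrative, and it splits on whitespace with `str.split` instead of NLTK's `word_tokenize`, which is assumed to give the same tokens for these short, punctuation-free strings.

```python
import collections
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text.lower() if ch not in exclude)
    return " ".join(text.split())


def token_f1(text_true, text_pred):
    """Token-overlap F1 between a reference and a predicted answer."""
    true_tokens = normalize(text_true).split()
    pred_tokens = normalize(text_pred).split()
    if not true_tokens or not pred_tokens:
        return float(true_tokens == pred_tokens)
    common = collections.Counter(true_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(text_true, text_pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(text_true) == normalize(text_pred))


# Reference has 4 tokens, prediction has 1, and 1 token is shared:
# precision = 1/1, recall = 1/4, F1 = 2 * 1 * 0.25 / 1.25 = 0.4
f1_partial = token_f1("Cerca de 10000 pássaros.", "10000")
em_partial = exact_match("Cerca de 10000 pássaros.", "10000")

# Punctuation is stripped before comparison, so "1.6" and "1,6" both become "16"
em_punct = exact_match("1.6 milhão", "1,6 milhão")

print(round(f1_partial, 2), em_partial, em_punct)
```

The example shows why exact match is much stricter than F1: a short but correct span earns partial F1 credit while scoring zero on exact match.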


In [8]:
# Imports (only the names used by the functions below)
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize
import string
import collections


# Supporting functions
def get_tokens(sentence):
    """Split a sentence into tokens with NLTK's word_tokenize."""
    tokens = word_tokenize(sentence)
    return tokens


def normalize_text(sentence):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_punc(lower(sentence)))


# F1-score
def f1_score_qa(text_true, text_pred):
    """Token-level (F1, precision, recall), adapted from the SQuAD evaluation."""
    text_true = normalize_text(text_true)
    text_pred = normalize_text(text_pred)
    true_tokens = get_tokens(text_true)
    pred_tokens = get_tokens(text_pred)
    common = collections.Counter(true_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if len(true_tokens) == 0 or len(pred_tokens) == 0:
        # If either side is empty, the score is 1 only when both are empty
        same = int(true_tokens == pred_tokens)
        return same, same, same
    if num_same == 0:
        return 0, 0, 0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1, precision, recall

# Exact Match
def em_qa(text_true,text_pred):
    return int(normalize_text(text_true) == normalize_text(text_pred))

# Rouge-l
def rouge_l_qa(text_true,text_pred):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score(text_true,text_pred)
    return scores["rougeL"].fmeasure

# Exact Match -> Loop over multiple instances
def em_qa_overall(text_true,text_pred):
    scores = 0
    for i in range(len(text_true)):
        scores += em_qa(text_true[i],text_pred[i])
    score = scores/len(text_true)
    return score

# F1-score -> Loop over multiple instances
def f1_score_qa_overall(text_true,text_pred):
    f = 0
    for i in range(len(text_true)):
        fi,pi,ri = f1_score_qa(text_true[i],text_pred[i])
        f+=fi
    f = f/len(text_true)
    return f

# Rouge-l -> Loop over multiple instances
def rouge_l_qa_overall(text_true,text_pred):
    scores = 0
    for i in range(len(text_true)):
        score = rouge_l_qa(text_true[i],text_pred[i])
        scores += score
    scores = scores/len(text_true)
    return scores


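def scores(func, text_true, text_pred):
    """Compute one per-example metric, selected by name."""
    if func == 'rouge_l_qa':
        score_value = rouge_l_qa(text_true, text_pred)
    elif func == 'em_qa':
        score_value = em_qa(text_true, text_pred)
    elif func == 'f1_score_qa':
        score_value = f1_score_qa(text_true, text_pred)
    else:
        # Previously an unknown name left score_value undefined
        raise ValueError("Unknown metric: " + str(func))
    return (func, score_value)


def overall(func, text_true, text_pred):
    """Compute one dataset-level (averaged) metric, selected by name."""
    if func == 'rouge_l_qa_overall':
        score_value = rouge_l_qa_overall(text_true, text_pred)
    elif func == 'em_qa_overall':
        score_value = em_qa_overall(text_true, text_pred)
    elif func == 'f1_score_qa_overall':
        score_value = f1_score_qa_overall(text_true, text_pred)
    else:
        raise ValueError("Unknown metric: " + str(func))
    return (func, score_value)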

## Getting metrics

In [9]:
print("f1-score: " + str(overall("f1_score_qa_overall", true_answers, gen_answers)))
print("EM: " + str(overall('em_qa_overall', true_answers, gen_answers)))
print("Rouge-L: " + str(overall('rouge_l_qa_overall', true_answers, gen_answers)))

f1-score: ('f1_score_qa_overall', 0.531251841957546)
EM: ('em_qa_overall', 0.2225)
Rouge-L: ('rouge_l_qa_overall', 0.5337985521116425)
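
ROUGE-L scores the longest common subsequence (LCS) of tokens shared by the reference and the prediction. The sketch below is a simplified pure-Python illustration, not the rouge_score implementation: it skips stemming, and the helper `ascii_tokens` (a hypothetical name) assumes that the default rouge_score tokenizer keeps only lowercase ASCII letters and digits. Under that assumption, accented Portuguese words are split into fragments, which may explain why ROUGE-L and F1 differ for some answers.

```python
import re


def ascii_tokens(text):
    """Assumed tokenizer behaviour: lowercase, keep only runs of a-z and 0-9."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(true_tokens, pred_tokens):
    """F-measure of LCS-based precision and recall."""
    if not true_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(true_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


ref, pred = "Região Nordeste", "Nordeste"
# ASCII-only tokens: "regi", "o", "nordeste" (3 reference tokens)
f_ascii = rouge_l_f(ascii_tokens(ref), ascii_tokens(pred))
# Whitespace tokens: "região", "nordeste" (2 reference tokens)
f_unicode = rouge_l_f(ref.lower().split(), pred.lower().split())
print(round(f_ascii, 2), round(f_unicode, 2))
```

For this pair the ASCII-only variant gives 0.5 and the whitespace variant about 0.67, the same values the per-answer results in this notebook list as rouge_L and f1-score for that question. If the assumption holds, a tokenizer that preserves accented characters would be a more faithful choice for Portuguese text.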


## Saving answers to a dataframe

In [10]:
df = pd.DataFrame({'question':questions, 'passages': passages, 'true_answers': true_answers, "gen_answers": gen_answers})



## Getting metrics for each answer

In [11]:
from tqdm import tqdm

# Assign by row index so that rows sharing the same question text
# do not overwrite each other's scores
for idx, row in tqdm(df.iterrows()):
    # F1-score (for this metric, scores returns (f1, precision, recall))
    _, f1_score = scores('f1_score_qa', row['true_answers'], row['gen_answers'])
    df.loc[idx, 'f1-score'] = round(f1_score[0], 2)

    # Rouge
    _, rouge_l = scores('rouge_l_qa', row['true_answers'], row['gen_answers'])
    df.loc[idx, 'rouge_L'] = round(rouge_l, 2)

    # Exact Match
    _, em = scores('em_qa', row['true_answers'], row['gen_answers'])
    df.loc[idx, 'EM'] = round(em, 2)

400it [00:01, 307.34it/s]


## Saving Answers and Results

In [12]:
df.to_csv('/content/drive/Shareddrives/Explainable QA/Base de dados/QAs/Blue_Amazon_QA-SQuAD_Results.csv')

## Printing some of the results

In [13]:
df.head(15)

Unnamed: 0,question,passages,true_answers,gen_answers,f1-score,rouge_L,EM
0,Quantas aves vivem nas ilhas de Alcatrazes?,Alcatrazes possui rica fauna e flora; Em dezem...,Cerca de 10000 pássaros.,10000,0.4,0.33,0.0
1,Quais são as cinco lajes em Alcatrazes?,Acredita-se que as ilhas apresentem seu format...,"Duplo, Singela, Paredão, Farol e Negra","Dupla, Singela, Paredão, Do Farol e Negra",0.77,0.8,0.0
2,O Recife Port International?,O porto lida com os cruzeiros nacionais e inte...,sim,um novo terminal de passageiros,0.0,0.0,0.0
3,Quantos estágios tectônicos foram identificado...,Cinco etapas tectônicas foram identificadas na...,cinco,Cinco,1.0,1.0,1.0
4,Quais são as principais exportações que passam...,"Forma de desenvolvimento urbano, econômico e d...","As principais exportações são café, açúcar e s...","café, açúcar e soja",0.67,0.62,0.0
5,Qual é a região brasileira com a maior peça az...,"A área pode ser expandida para 4,4 milhões de ...",Região Nordeste,Nordeste,0.67,0.5,0.0
6,Quais são os nomes das duas ilhotas de Rocas A...,A área da terra das duas ilhotas (Ilha Cemitér...,Ilha Cemitério e Farol Cay,"Ilha Cemitério, Southwest e Farol Cay, Northwest",0.83,0.86,0.0
7,Há viagens de visão vistos em Ilha Grande?,O ecoturismo de pequena escala está sendo enco...,"Sim, existem.",várias trilhas e cachoeiras das montanhas da i...,0.0,0.0,0.0
8,Qual é a maior ilha de todos os santos?,Farol da Barra (farol de Barra) no local de um...,Ilha de Itaparica.,Itaparica,0.5,0.5,0.0
9,Quais animais vivem na área do atol?,"Numerosas tartarugas, tubarões, golfinhos e pá...","Numerosas tartarugas, tubarões, golfinhos e pá...","tartarugas, tubarões, golfinhos e pássaros",0.71,0.78,0.0


## Printing some of the correct answers

In [15]:
df.loc[df['EM']==1,:]

Unnamed: 0,question,passages,true_answers,gen_answers,f1-score,rouge_L,EM
3,Quantos estágios tectônicos foram identificado...,Cinco etapas tectônicas foram identificadas na...,cinco,Cinco,1.0,1.0,1.0
12,Quanto tempo dura a Baía de Guanabara?,Guanabara Bay é de 31 quilômetros de comprimen...,31 quilômetros,31 quilômetros,1.0,1.0,1.0
21,Qual é a distância entre Trindade e Ilhas Mart...,As ilhas estão situadas cerca de 2100 quilômet...,2100 quilômetros,2100 quilômetros,1.0,1.0,1.0
31,Quantos turistas estrangeiros visitam o país a...,O turismo e a recreação tornaram-se entre os f...,1.6 milhão,"1,6 milhão",1.0,1.0,1.0
40,Quão longe é a cidade de São Paulo do porto de...,A localização da cidade de Santos foi escolhid...,79 km.,79 km,1.0,1.0,1.0
...,...,...,...,...,...,...,...
376,A ilha de Fernando de Noronha poderia ter sido...,Assumindo que Quaresma é de fato Fernando de N...,Quaresma,quaresma,1.0,1.0,1.0
377,Quais animais foram o foco de proteção do proj...,Embora o propósito inicial fosse proteger as t...,Tartarugas marinhas.,tartarugas marinhas,1.0,1.0,1.0
382,Qual é a qualidade do óleo encontrado na Amazô...,As reservas de petróleo encontradas na camada ...,média a alta qualidade.,média a alta qualidade,1.0,1.0,1.0
383,Onde está localizada a Baía de Guanabara?,"Guanabara Bay (Português: Baía de Guanabara, I...",Sudeste do Brasil no estado do Rio de Janeiro,sudeste do Brasil no estado do Rio de Janeiro,1.0,1.0,1.0


## Printing some of the incorrect answers

In [16]:
df.loc[df['f1-score']==0.,:]

Unnamed: 0,question,passages,true_answers,gen_answers,f1-score,rouge_L,EM
2,O Recife Port International?,O porto lida com os cruzeiros nacionais e inte...,sim,um novo terminal de passageiros,0.0,0.00,0.0
7,Há viagens de visão vistos em Ilha Grande?,O ecoturismo de pequena escala está sendo enco...,"Sim, existem.",várias trilhas e cachoeiras das montanhas da i...,0.0,0.00,0.0
17,Há quanto tempo a ilha de Monto de Trigo foi h...,"Nos últimos três séculos, a ilha foi permanent...",Por mais de 170 anos.,séculos,0.0,0.00,0.0
18,O porto do Recife lida com cruzeiros?,O porto lida com os cruzeiros nacionais e inte...,sim,cruzeiros nacionais e internacionais,0.0,0.00,0.0
23,Em qual expedição amerigo vespucci visitou a b...,O Italian Explorer Amerigo Vespucci foi o prim...,seu segundo,segunda expedição,0.0,0.00,0.0
...,...,...,...,...,...,...,...
385,É o tamanho da amazona azul comparável à flore...,A Amazon Blue (Português: A Amazônia Azul) ou ...,sim.,superfície,0.0,0.00,0.0
386,Qual é a ameaça natural ao ecossistema de Alca...,"Cerca de 10.000 pássaros vivem no arquipélago,...",O invasor de corais na tigela laranja,poluição do mar,0.0,0.18,0.0
390,Por que as guildas ibéricas foram originalment...,Os pescadores artesanais são organizados em gu...,Para a Marinha Brasileira,organizar as comunidades de pesca se espalhar ...,0.0,0.00,0.0
391,Quantos estados brasileiros não têm costa?,O ponto mais meridional do Brasil está localiz...,9 estados brasileiros.,O ponto mais meridional do Brasil,0.0,0.00,0.0
