# Model Metrics

We've put together our first Q&A model. In this notebook we combine the retriever and the reader into a single pipeline and measure the performance of the Q&A model as a whole on the SQuAD 2.0 validation set.

First, we load our SQuAD validation data.

In [11]:
import json

with open('./data/squad/dev.json', 'r') as f:
    squad = json.load(f)

# we will limit it to the first 1,000 samples in the interest of time
squad_min = squad[:1000]

In [12]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='squad_mini')



In [13]:
# create list of contexts (we cannot do this using current dictionary format)
contexts = [sample['context'] for sample in squad_min]

# convert to set to remove duplicates, then back to list
contexts = list(set(contexts))

# convert back to dictionary format we need
squad_docs_min = [{'content': sample} for sample in contexts]
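
One thing to keep in mind about the deduplication step above: `set` does not guarantee any ordering, so the order of the resulting documents can differ between runs. If a stable order matters, `dict.fromkeys` is one alternative. The snippet below is a minimal sketch on hypothetical toy samples (`samples` and its contents are made up for illustration and are not SQuAD data):

```python
# Hypothetical toy samples standing in for `squad_min`
samples = [
    {'question': 'q1', 'context': 'B'},
    {'question': 'q2', 'context': 'A'},
    {'question': 'q3', 'context': 'B'},
]

# dict keys are unique and keep insertion order, so duplicates are
# dropped while the first occurrence of each context keeps its position
unique_contexts = list(dict.fromkeys(s['context'] for s in samples))

# same document format as used for the document store above
docs = [{'content': c} for c in unique_contexts]
print(docs)
```

For this notebook the order of documents in the index should not affect retrieval, so the `set` approach works as well.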

In [29]:
len(squad_docs_min)

106

In [14]:
document_store.write_documents(squad_docs_min)

In [15]:
import requests
res = requests.get('http://localhost:9200/squad_mini/_count')

res.json()

{'count': 106,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

Next, let's set up the QA pipeline again using the `deepset/bert-base-cased-squad2` model.

In [17]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='squad_mini'
)



In [18]:
from haystack.nodes import BM25Retriever
from haystack.nodes import FARMReader

retriever = BM25Retriever(doc_store)  # BM25
reader = FARMReader(model_name_or_path='deepset/bert-base-cased-squad2',
                    context_window_size=1500,
                    use_gpu=True)

In [19]:
from haystack.pipelines import ExtractiveQAPipeline

qa = ExtractiveQAPipeline(reader=reader, retriever=retriever)

In [26]:
ans = qa.run(query='Who did Emma Marry?')
ans['answers'][0].answer

Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00,  2.64s/ Batches]


'King Ethelred II of England'

And now we build a list of predicted answers `model_out` and true answers `reference` and calculate the ROUGE score based on these.

In [30]:
from tqdm import tqdm

model_out = []
reference = []

for pair in tqdm(squad_min, leave=True):
    # run the full retriever-reader pipeline on the question
    ans = qa.run(query=pair['question'])
    # append the prediction and reference to the respective lists
    model_out.append(ans['answers'][0].answer)
    reference.append(pair['answer'])

Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.97s/ Batches]
...

In [33]:
import pickle

with open("model_out", "wb") as fp:
    pickle.dump(model_out, fp)
with open("reference", "wb") as fp:
    pickle.dump(reference, fp)


In [2]:
import pickle
with open("model_out", "rb") as fp:   # Unpickling
   model_out = pickle.load(fp)
with open("reference", "rb") as fp:   # Unpickling
   reference = pickle.load(fp)


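The inference loop above may take some time to process, which is why we pickle the predictions and references so they can be reloaded without rerunning it. The processing speed of our models will improve as we begin using more efficient implementations over the next few sections.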

Once that has finished processing, we can calculate our ROUGE scores just like we did before.

In [2]:
from rouge import Rouge

# initialize
rouge = Rouge()

# get scores
rouge.get_scores(model_out, reference, avg=True)

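ModuleNotFoundError: No module named 'rouge'

The error above means the `rouge` package is missing from this environment; installing it (for example with `pip install rouge`) and rerunning the cell produces the averaged scores. Those scores are not as high as we would expect. If we print some of the results, we can see why: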

In [16]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

Rollo,  |  Rollo  |  0.0
"Norseman, Viking".  |  Norseman, Viking  |  0.0
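
To see where these zeros come from, it helps to look at how a unigram overlap score behaves. The function below is a simplified sketch, not the `rouge` package's implementation: it assumes tokens are produced by splitting on whitespace and that only unique tokens are compared. The name `rouge1_f` is hypothetical.

```python
def rouge1_f(hypothesis: str, reference: str) -> float:
    """Simplified ROUGE-1 F-score over unique whitespace-separated tokens."""
    hyp_tokens = set(hypothesis.split())
    ref_tokens = set(reference.split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)     # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)

# 'Rollo,' and 'Rollo' are different tokens, so nothing overlaps
print(rouge1_f('Rollo,', 'Rollo'))                         # 0.0
print(rouge1_f('"Norseman, Viking".', 'Norseman, Viking')) # 0.0
print(rouge1_f('Rollo', 'Rollo'))                          # 1.0
```

Because punctuation stays attached to its neighbouring word, a single trailing comma is enough to turn an otherwise perfect match into a score of zero.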


Clearly, the punctuation differences cause ROUGE to treat these words as non-matching. To fix this, we'll import `re` and remove any characters that are not spaces, letters, or numbers.

In [17]:
import re

clean = re.compile('(?i)[^0-9a-z ]')

# apply this to both lists
model_out = [clean.sub('', text) for text in model_out]
reference = [clean.sub('', text) for text in reference]
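
As a quick sanity check, the same pattern can be applied to the two predictions printed earlier. This is a small sketch on hand-copied strings rather than the full lists; `samples` and `cleaned` are names introduced here for illustration, and the pattern is written with `re.IGNORECASE`, which is equivalent to the inline `(?i)` flag.

```python
import re

# same character class as above: keep only digits, ASCII letters and spaces
clean = re.compile(r'[^0-9a-z ]', flags=re.IGNORECASE)

samples = ['Rollo,', '"Norseman, Viking".']
cleaned = [clean.sub('', text) for text in samples]
print(cleaned)  # ['Rollo', 'Norseman Viking']

# caveat: non-ASCII letters are removed as well
print(clean.sub('', 'Beyonc\u00e9'))  # 'Beyonc'
```

Since the predictions and the references are cleaned with the same pattern, accented characters are dropped on both sides, so matching pairs should still match after cleaning.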

In [None]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

These scores are looking better now. Let's calculate the average again:

In [None]:
rouge.get_scores(model_out, reference, avg=True)

Now we are seeing much more realistic scores.