# Question Answering 1

Today, we will follow [Tunstall, von Werra, and Wolf](https://github.com/nlp-with-transformers)'s Question Answering chapter. Since this involves two new libraries, we will mostly use their code. We will not cover the whole chapter today, however, leaving the more advanced subjects of evaluating the retriever and reader separately and fine-tuning a QA pipeline for tomorrow.

## Dataset
We start by importing the QA dataset, which is based on Amazon reviews of electronics products.

In [1]:
from datasets import load_dataset

subjqa = load_dataset("subjqa", name="electronics", trust_remote_code = True)



Let us play around a little to get a sense of it.

In [6]:
subjqa['train'][0]

{'domain': 'electronics',
 'nn_mod': 'great',
 'nn_asp': 'bass response',
 'query_mod': 'excellent',
 'query_asp': 'bass',
 'q_reviews_id': '0514ee34b672623dff659334a25b599b',
 'question_subj_level': 5,
 'ques_subj_score': 0.5,
 'is_ques_subjective': False,
 'review_id': '882b1e2745a4779c8f17b3d4406b91c7',
 'id': '2543d296da9766d8d17d040ecc781699',
 'title': 'B00001P4ZH',
 'context': 'I have had Koss headphones in the past, Pro 4AA and QZ-99.  The Koss Portapro is portable AND has great bass response.  The work great with my Android phone and can be "rolled up" to be carried in my motorcycle jacket or computer bag without getting crunched.  They are very light and do not feel heavy or bear down on your ears even after listening to music with them on all day.  The sound is night and day better than any ear-bud could be and are almost as good as the Pro 4AA.  They are "open air" headphones so you cannot match the bass to the sealed types, but it comes close. For $32, you cannot go wrong.

This one has no answers, so let us look at the next one.

In [7]:
subjqa['train'][1]

{'domain': 'electronics',
 'nn_mod': 'harsh',
 'nn_asp': 'high',
 'query_mod': 'not strong',
 'query_asp': 'bass',
 'q_reviews_id': '7c46670208f7bf5497480fbdbb44561a',
 'question_subj_level': 1,
 'ques_subj_score': 0.5,
 'is_ques_subjective': False,
 'review_id': 'ce76793f036494eabe07b33a9a67288a',
 'id': 'd476830bf9282e2b9033e2bb44bbb995',
 'title': 'B00001P4ZH',
 'context': 'To anyone who hasn\'t tried all the various types of headphones, it is important to remember exactly what these are: cheap portable on-ear headphones. They give a totally different sound then in-ears or closed design phones, but for what they are I would say they\'re good. I currently own six pairs of phones, from stock apple earbuds to Sennheiser HD 518s. Gave my Portapros a run on both my computer\'s sound card and mp3 player, using 256 kbps mp3s or better. The clarity is good and they\'re very lightweight. The folding design is simple but effective. The look is certainly retro and unique, although I didn\'t fi

Better, this one has two answers. Notice that `title` is the same for both examples: this is because they are reviews of the same product. Let us look at the answers.

In [8]:
subjqa['train'][1]['answers']

{'text': ['Bass is weak as expected',
  'Bass is weak as expected, even with EQ adjusted up'],
 'answer_start': [1302, 1302],
 'answer_subj_level': [1, 1],
 'ans_subj_score': [0.5083333253860474, 0.5083333253860474],
 'is_ans_subjective': [True, True]}

The review itself is *not* the answer; it is the context. The answers were selected by annotators and correspond to spans of the context. In particular, we see that both answers start at character index 1302. We can check this:

In [13]:
len1, len2 = tuple(map(len, subjqa['train'][1]['answers']['text']))
print('First answer:', subjqa['train'][1]['context'][1302 : 1302 + len1])
print('Second answer:', subjqa['train'][1]['context'][1302 : 1302 + len2])

First answer: Bass is weak as expected
Second answer: Bass is weak as expected, even with EQ adjusted up
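
The same check can be applied to every answer of an example at once. The helper below is a minimal sketch; `spans_match` and the toy `example` record are hypothetical, made up to mirror the `context` / `answers` layout shown above.

```python
def spans_match(example):
    """Return True if every answer text equals the slice of the context
    that starts at its answer_start offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start : start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Toy record (hypothetical), following the SubjQA field layout
example = {
    "context": "Light and cheap. Bass is weak as expected, even with EQ.",
    "answers": {
        "text": ["Bass is weak as expected", "Bass is weak"],
        "answer_start": [17, 17],
    },
}
print(spans_match(example))
```

On the real data, something like `all(spans_match(ex) for ex in subjqa['train'])` should run the check over the whole split; an example with no answers passes trivially.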


Perfect! Let us move to the actual fun part.

## Hugging Face

Before we move to the actual Haystack solution, let us try a plain Hugging Face pipeline.

In [14]:
from transformers import pipeline

# We will take a model already fine-tuned on SQuAD 2.0, a well-known QA dataset
model_ckpt = "deepset/minilm-uncased-squad2"
pipe = pipeline("question-answering", model = model_ckpt)


Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [25]:
question = subjqa['train'][1]['question']
context = subjqa['train'][1]['context']
k = 3

results = pipe(question = question, context = context, top_k = k)
for i in range(k):
    print(results[i]['answer'])

Bass is weak as expected
Bass is weak
Bass is weak as expected




In [35]:
pipe(question = question, context = context, top_k = k)



[{'score': 0.25434815883636475,
  'start': 1302,
  'end': 1326,
  'answer': 'Bass is weak as expected'},
 {'score': 0.23717091977596283,
  'start': 1302,
  'end': 1314,
  'answer': 'Bass is weak'},
 {'score': 0.20091791450977325,
  'start': 1302,
  'end': 1326,
  'answer': 'Bass is weak as expected'}]
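
All three candidates start at the same character because an extractive reader scores possible start and end positions and then ranks (start, end) pairs. The sketch below illustrates that idea on made-up token scores. It is a simplification, not the exact procedure of the `transformers` pipeline, and `top_spans`, `start_scores`, `end_scores` and `max_answer_len` are hypothetical names.

```python
def top_spans(start_scores, end_scores, top_k=3, max_answer_len=15):
    """Rank (start, end) token spans by the sum of their start and end scores."""
    candidates = []
    for i, s in enumerate(start_scores):
        # Only keep spans with start <= end and at most max_answer_len tokens
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            candidates.append((s + end_scores[j], i, j))
    # Highest score first
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(i, j, score) for score, i, j in candidates[:top_k]]


# Made-up scores for a four-token context
start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [0.0, 0.5, 1.5, 1.0]
for start, end, score in top_spans(start_scores, end_scores, max_answer_len=3):
    print(start, end, score)
```

Since token 1 has by far the highest start score, the three best spans all begin there and differ only in where they end, much like the overlapping answers above.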

Looks good, but let us evaluate the zero-shot model on the whole test set. We will use the SQuAD metric for this.

In [53]:
import evaluate

squad_metric = evaluate.load("squad")

In [51]:
test_data = subjqa['test']
preds = []
refs = []
for rev in test_data:
    pred = pipe(question = rev['question'], context = rev['context'])
    # The prediction is used as is; an empty answer stays an empty string
    cpred = pred['answer']
    
    preds.append({
        'id': rev['id'],
        'prediction_text': cpred
    })
    
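    # The metric needs at least one reference per question, so
    # unanswerable questions get an empty string as their reference
    if len(rev['answers']['text']) == 0: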
        ctext = ['']
        cstart = [0]
    else:
        ctext = rev['answers']['text']
        cstart = rev['answers']['answer_start']

    refs.append({
        'id': rev['id'],
        'answers': {'text': ctext, 
                    'answer_start': cstart}
    })

squad_metric.compute(predictions = preds, references = refs)

{'exact_match': 2.793296089385475, 'f1': 12.891644169881934}

These scores are percentages, so things are not good: under 3% exact match and about 13% F1. We will need some fine-tuning.
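
To read these two numbers, it helps to look at how they are computed per example. The sketch below follows the logic of the SQuAD v1.1 evaluation: normalize both strings, compare them for exact match, and measure token overlap for F1. It is a simplified re-implementation for illustration, and the function names are hypothetical, not part of `evaluate`.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Number of tokens shared by prediction and reference (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


pred = "Bass is weak"
ref = "Bass is weak as expected"
print(exact_match(pred, ref), f1_score(pred, ref))
```

A prediction that covers only part of the reference gets no credit from exact match but partial credit from F1, which is why F1 is the higher of the two. For each question the metric keeps the best score over all reference answers, then averages over questions and scales to 0-100.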

## Document store

Haystack, the library that we will use for QA, requires a document store. We start by setting up this document store; following the book and my own experience, we use Elasticsearch. The code below assumes an Elasticsearch server is already running at `localhost:9200`.

In [2]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
print(es.info())


{'name': 'f197774c6f58', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'tb75MDvrTUauVE-GKzCJyw', 'version': {'number': '7.9.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'd34da0ea4a966c4e49417f2da2f244e3e97b4e6e', 'build_date': '2020-09-23T00:45:33.626720Z', 'build_snapshot': False, 'lucene_version': '8.6.2', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}


In [3]:
from haystack.document_stores import ElasticsearchDocumentStore

# Connect document store
document_store = ElasticsearchDocumentStore(return_embedding = True)

Now we will save the documents in the Elasticsearch server, largely following the book's code.

In [4]:
# Convert to pandas DataFrames so that we can use drop_duplicates
# (this can be done without pandas, with small changes to the code)
import pandas as pd
dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}
# Haystack now asks for a Document class for creating documents
from haystack import Document

# It's a good idea to flush Elasticsearch with each notebook restart
if len(document_store.get_all_documents()) > 0 or len(document_store.get_all_labels()) > 0:
    document_store.delete_documents("document")
    document_store.delete_documents("label")
     

for split, df in dfs.items():
    # Exclude duplicate reviews
    docs = [
        Document(
            content = row["context"],
            meta = {"item_id": row["title"], "question_id": row["id"],
                    "split": split})
        for _, row in df.drop_duplicates(subset="context").iterrows()
    ]
    document_store.write_documents(
        docs,
        index = "document"  # The index name (think of a table in SQL)
    )
print(f"Loaded {document_store.get_document_count()} documents")

Loaded 1615 documents


Let us take a look at one of the documents.

In [5]:
docs = document_store.get_all_documents(index = "document", batch_size = 1)
print("Content:", docs[0].content)
print("Metadata:", docs[0].meta)

Content: I have had Koss headphones in the past, Pro 4AA and QZ-99.  The Koss Portapro is portable AND has great bass response.  The work great with my Android phone and can be "rolled up" to be carried in my motorcycle jacket or computer bag without getting crunched.  They are very light and do not feel heavy or bear down on your ears even after listening to music with them on all day.  The sound is night and day better than any ear-bud could be and are almost as good as the Pro 4AA.  They are "open air" headphones so you cannot match the bass to the sealed types, but it comes close. For $32, you cannot go wrong.
Metadata: {'item_id': 'B00001P4ZH', 'question_id': '2543d296da9766d8d17d040ecc781699', 'split': 'train'}


Notice that, in general, Haystack also manages an embedding for each document. For example, we can see its dimension with

In [6]:
document_store.embedding_dim

768

But...

In [7]:
print("Embedding:", docs[0].embedding)  

Embedding: None


So, what is going on? Nothing! We have not yet specified an embedding model for the documents; 768 here is just a default placeholder (most people use BERT-like models for embeddings anyway).

## Retriever

Let us retrieve some elements of our dataset. For this, we will use Haystack with BM25 keyword search (a sparse retriever, which makes it quick for playing around).

In [22]:
# We first need to initialize the retriever
from haystack.nodes import BM25Retriever
es_retriever = BM25Retriever(document_store = document_store)
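
BM25 ranks documents by how often the query terms occur in them, discounting very common terms and long documents. The sketch below is a minimal pure-Python version of that scoring. It assumes whitespace tokenization, the usual defaults `k1 = 1.2` and `b = 0.75`, and a Lucene-style IDF. Elasticsearch's analyzer and implementation details differ, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than common ones
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy documents (made up for illustration)
docs = [
    "great for reading books at night",
    "plays movies great",
    "battery life is long",
]
print(bm25_scores("good for reading", docs))
```

Only the first toy document shares words with the query, so it is the only one with a non-zero score. Note that the match is purely lexical: "good" and "great" are unrelated as far as BM25 is concerned.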

We can choose one item (which will be used to filter the dataset) and then retrieve some documents for a question.

In [48]:
item_id = "B0074BW614"
query = "Is it good for reading?"
retrieved_docs = es_retriever.retrieve(
    query = query, top_k = 3, filters={"item_id":[item_id], "split":["train"]})

print('First retrieved:', retrieved_docs[0].content.replace('. ', '\n'))
print('-----------\nSecond retrieved:', retrieved_docs[1].content.replace('. ', '\n'))
print('-----------\nThird retrieved:', retrieved_docs[2].content.replace('. ', '\n'))

First retrieved: This is a gift to myself
 I have been a kindle user for 4 years and this is my third one
 I never thought I would want a fire for I mainly use it for book reading
 I decided to try the fire for when I travel I take my laptop, my phone and my iPod classic
 I love my iPod but watching movies on the plane with it can be challenging because it is so small
Laptops battery life is not as good as the Kindle
 So the Fire combines for me what I needed all three to do
So far so good.
-----------
Second retrieved: Plays Netflix great, WiFi capability has great range
Resolution on the screen is AMAZING! For the price you cannot go wrong
Bought one for my spouse and myself after becoming addicted to hers! Our son LOVES it and it is great for reading books when no light is available
Amazing sound but I suggest good headphones to really hear it all.Battery life is super long and can go 3 or 4 days without a recharge from moderate use.A steal at $199.99.
-----------
Third retrieved: I've

Super cool!
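
The `filters` argument narrows the search to documents whose metadata match. As far as I understand Haystack's dictionary filter format, a document must match every key, and for each key any value in the list is accepted. The sketch below mimics that logic in plain Python on toy metadata; `matches` is a hypothetical helper, not a Haystack function.

```python
def matches(meta, filters):
    """True if, for every filter key, the document's value is in the allowed list."""
    return all(meta.get(key) in allowed for key, allowed in filters.items())


# Toy metadata records (hypothetical), shaped like the ones stored earlier
metas = [
    {"item_id": "B0074BW614", "split": "train"},
    {"item_id": "B0074BW614", "split": "test"},
    {"item_id": "B00001P4ZH", "split": "train"},
]
filters = {"item_id": ["B0074BW614"], "split": ["train"]}
print([m for m in metas if matches(m, filters)])
```

Only the first record passes both conditions, which is why the retrieved reviews above all concern the same product and come from the training split.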

## Reader

Let us move to the reader, whose goal is exactly to read the retrieved documents and find the answer to the question.

In [17]:
from haystack.nodes import TransformersReader

model_ckpt = "deepset/minilm-uncased-squad2"
max_seq_length = 384
doc_stride = 128

reader = TransformersReader(
    model_name_or_path = model_ckpt,  # only this is needed
    max_seq_len = max_seq_length,
    doc_stride = doc_stride,
    use_gpu = True  # set to False if no GPU
)

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
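
Reviews can be longer than the model's input limit, which is what `max_seq_len` and `doc_stride` are for: long contexts are split into overlapping windows and the reader is run on each one. The sketch below shows the splitting on a list of token ids. It assumes `doc_stride` is the number of tokens shared by consecutive windows (the convention of the Hugging Face question-answering pipeline) and ignores the tokens taken up by the question and special tokens. `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split `tokens` into windows of at most `max_len` items, where
    consecutive windows share `overlap` items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start : start + max_len])
        # Stop once the current window reaches the end of the sequence
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


# Ten toy token ids, windows of four tokens overlapping by two
for window in sliding_windows(list(range(10)), max_len=4, overlap=2):
    print(window)
```

The overlap makes it less likely that an answer is cut in half at a window boundary, at the cost of running the model on more windows.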


In [34]:
k = 3

print('Question:', query, '\n\nAnswers:')

# Run the reader once, then print its top answers
answers = reader.predict(query = query, documents = retrieved_docs, top_k = k)['answers']
for ans in answers[:k]:
    print(ans.answer)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Question: Is it good for reading? 

Answers:
it is great for reading books when no light is available
I mainly use it for book reading
the larger screen compared to the Kindle makes for easier reading


Not too bad. But let us build a whole QA pipeline and evaluate it.

## Retriever + Reader

Making a pipeline now is very easy.

In [37]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader = reader, retriever = es_retriever)

All good, let us test a little.

In [52]:
preds = pipe.run(
    query = query,
    params = {
        "Retriever": {"top_k": k,
                      "filters": {"item_id": [item_id], "split": ["train"]}},
        "Reader": {"top_k": k}
    }
)

print(f"Question: {preds['query']} \n")
for i in range(k):
    print(f"Answer {i+1}: {preds['answers'][i].answer}")

Question: Is it good for reading? 

Answer 1: it is great for reading books when no light is available
Answer 2: I mainly use it for book reading
Answer 3: the larger screen compared to the Kindle makes for easier reading


Not too bad, but it asks for a more quantitative analysis. Let us again use the SQuAD metric!

In [120]:
test_data = subjqa['test']
preds = []
refs = []
for rev in test_data:
    pred = pipe.run(
        query = rev['question'],
        params = {
            "Retriever": {"top_k": k,
                          "filters": {"item_id": [rev['title']], "split": ["test"]}},
            "Reader": {"top_k": 1}
        }
    )
    pred = pred['answers']
    # If the reader returns no answer, use an empty string as the prediction
    if pred == []:
        pred = ''
    else:
        pred = pred[0].answer
    
    
    preds.append({
        'id': rev['id'],
        'prediction_text': pred
    })
    
    if len(rev['answers']['text']) == 0:
        ctext = ['']
        cstart = [0]
    else:
        ctext = rev['answers']['text']
        cstart = rev['answers']['answer_start']

    refs.append({
        'id': rev['id'],
        'answers': {'text': ctext, 
                    'answer_start': cstart}
    })

squad_metric.compute(predictions = preds, references = refs)



{'exact_match': 3.910614525139665, 'f1': 9.633795746112037}

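The F1 score is worse, although exact match is slightly higher. But notice that here we are in the much harder retrieval setting: the reader no longer receives the correct review directly, only the top k reviews retrieved for that product. In fact, increasing k should lead to improvements in performance.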

In [125]:
k = 10
preds = []
refs = []
for rev in test_data:
    pred = pipe.run(
        query = rev['question'],
        params = {
            "Retriever": {"top_k": k,
                          "filters": {"item_id": [rev['title']], "split": ["test"]}},
            "Reader": {"top_k": 1}
        }
    )
    pred = pred['answers']
    # If the reader returns no answer, use an empty string as the prediction
    if pred == []:
        pred = ''
    else:
        pred = pred[0].answer
    
    
    preds.append({
        'id': rev['id'],
        'prediction_text': pred
    })
    
    if len(rev['answers']['text']) == 0:
        ctext = ['']
        cstart = [0]
    else:
        ctext = rev['answers']['text']
        cstart = rev['answers']['answer_start']

    refs.append({
        'id': rev['id'],
        'answers': {'text': ctext, 
                    'answer_start': cstart}
    })

squad_metric.compute(predictions = preds, references = refs)



{'exact_match': 3.910614525139665, 'f1': 9.725831352702185}

Still, only a marginal improvement in F1 (EM was actually the same). This probably indicates that the problem lies with our reader; we will try to tackle it tomorrow.
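
One way to check whether the retriever or the reader is the bottleneck is to measure how often the review that contains the answer appears among the top k retrieved documents. The sketch below computes this recall@k on toy data; `recall_at_k`, the document ids and the rankings are all hypothetical and are not results from the run above.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold document is among the first k retrieved."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy rankings (hypothetical): one list of retrieved document ids per query
ranked_ids = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
# The document that actually contains the answer, one per query
gold_ids = ["d1", "d4", "d0"]

for k in (1, 2, 3):
    print(k, recall_at_k(ranked_ids, gold_ids, k))
```

If recall@k is already high for small k while EM and F1 stay low, that would point to the reader rather than the retriever.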