# Question Answering And Semantic Search   
The fine-tuning part of this notebook is adapted from the [Hugging Face example](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb).    
[Hugging Face](https://huggingface.co/docs) is one of the go-to libraries for transformer models. 
<div class="alert alert-block alert-info">
For this workshop, we have already downloaded the models. If you run the model-loading cells locally and nothing is found in the cache_dir, the dataset and models will be downloaded from the internet. 
</div>  

As introduced in the lecture, pre-trained models are pushing the NLP field into a new era. In this notebook we show you how to build a simple widget that gets answers from a corpus using semantic search and a fine-tuned model. If you can't find a model for your use case, you'll need to fine-tune a pre-trained model on your own data. The **Challenge** section demonstrates the steps to fine-tune a model for a Q&A task.

**Outline**  

- Use the BERT model to get answers
- Semantic search for articles in the corpus that are relevant to the question
- The Q&A widget  
- Challenge: Fine-tune BERT
    - Load pre-trained models from the Hugging Face library
    - Fine-tune the BERT model for the Q&A task using SQuAD data

**Estimated time:** 
 45 mins (excluding challenge)

In [None]:
### Change notebook directory, for Gadi environment only
import os
working_path = os.path.expandvars("/scratch/vp91/$USER/Introduction-to-NLP/")
os.chdir(working_path)
data_path = '/scratch/vp91/NLP-2024/data/'
model_path = '/scratch/vp91/NLP-2024/model/'

# Without this setting, Hugging Face tokenizers warn about a possible deadlock when the process is forked.
# Our dataset is small, so we disable tokenizer parallelism here to hide the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# local paths
# working_path = './'
# data_path = '../data/'
# model_path = '../model/'

In [None]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import BertTokenizerFast, BertForQuestionAnswering
from transformers import TrainingArguments, Trainer, default_data_collator

import sentence_transformers
import IPython
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated
import logging
import pickle
import numpy as np
import pandas as pd
import time
import os

## Use Q&A Model 
Hugging Face provides a `pipeline` API for easy inference with its models. In this notebook we use the fine-tuned tokenizer and model directly.

The commented-out cell below downloads and saves the tokenizer and model. Here we can load them from the download directory.

In [None]:
# # download fine-tuned model and tokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
# # save to directory
# tokenizer.save_pretrained("../model/fine-tuned/tokenizer/")
# model.save_pretrained("../model/fine-tuned/bert/")

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path + "fine-tuned/tokenizer/")
qa_model = AutoModelForQuestionAnswering.from_pretrained(model_path + "fine-tuned/bert/")
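
For comparison, the `pipeline` helper mentioned above can wrap tokenization, inference and answer decoding in a single call. The sketch below is only an illustration: `answer_with_pipeline` is a hypothetical helper name, and it assumes the tokenizer and model were saved in the directories used in the loading cell.

```python
def answer_with_pipeline(question, context, model_dir, tokenizer_dir):
    """Answer one question from one context with the question-answering pipeline.

    Hypothetical helper for illustration; model_dir and tokenizer_dir are
    assumed to be directories written by save_pretrained().
    """
    # Imported lazily so that defining this helper does not load transformers.
    from transformers import pipeline

    qa = pipeline("question-answering", model=model_dir, tokenizer=tokenizer_dir)
    result = qa(question=question, context=context)
    # The result is a dict with the keys 'score', 'start', 'end' and 'answer'.
    return result["answer"], result["score"]
```

It could be called as `answer_with_pipeline(question, context, model_path + "fine-tuned/bert/", model_path + "fine-tuned/tokenizer/")`. The rest of this notebook calls the tokenizer and model directly instead.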

The cell below demonstrates how to use the model and tokenizer for a Q&A task.

In [None]:
def getAnswer(contexts, questions, tokenizer, model):
    print('>>>> Looking for answers in {} documents...'.format(len(contexts)))
    t = time.time()
    answers = []
    for question in questions:
        for context in contexts:
            # Truncate only the context so that long passages fit BERT's 512-token limit
            inputs = tokenizer(question, context, return_tensors="pt",
                               truncation="only_second", max_length=512)
            # word to id representation
            input_ids = inputs["input_ids"].tolist()[0]
            # The model outputs one score per token of the whole sequence (question and text),
            # for both the start and the end position of the answer.
            # We only run inference here, so gradients are not needed.
            with torch.no_grad():
                outputs = model(**inputs)  # use the model passed in, not the global qa_model
            answer_start_scores = outputs.start_logits
            answer_end_scores = outputs.end_logits

            # Get the most likely beginning of the answer with the argmax of the start scores
            answer_start = torch.argmax(answer_start_scores)
            # Get the most likely end of the answer with the argmax of the end scores
            # (+1 because the slice below excludes its upper bound)
            answer_end = torch.argmax(answer_end_scores) + 1
            # Get the answer string based on the start and end token positions
            answer = tokenizer.convert_tokens_to_string(
                tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
            )
            answers.append(answer)
    print('>>>> Answers extracted in : {}s'.format(time.time() - t))
    return answers
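
To make the span-selection step easier to follow, here is a minimal pure-Python sketch of the same argmax logic on a toy example. The token list and the scores are made up for illustration, and `extract_span` is a hypothetical helper, not part of any library.

```python
def extract_span(tokens, start_scores, end_scores):
    """Return the tokens between the highest-scoring start and end positions."""
    # argmax over the start scores
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # argmax over the end scores, +1 because slicing excludes the upper bound
    end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1
    # If end <= start the slice is empty: the two argmaxes did not form a valid span
    return tokens[start:end]


# Made-up tokens and scores standing in for the tokenizer and model outputs
tokens = ["[CLS]", "who", "wrote", "it", "[SEP]", "ada", "lovelace", "wrote", "it", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 4.2, 1.0, 0.3, 0.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 3.9, 0.2, 0.0, 0.0]

answer_tokens = extract_span(tokens, start_scores, end_scores)
print(" ".join(answer_tokens))
```

Because the start and end positions are chosen independently, the end can land before the start, in which case the extracted answer is empty.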

In [None]:
# Let's try with our simple example
questions = ['What is extractive question answering?']
context = [r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""]
getAnswer(context, questions, tokenizer, qa_model)

## Semantic Search in Corpus

Our model can now find the answer span within a given context. But how can we query the entire corpus? We need to perform semantic search to find the documents that are most relevant to our question/query. To do this, we produce embeddings for our corpus as well as for each query we pass in. Then we can calculate the similarity/distance between the question and each document in our corpus to find the relevant ones.  
The `sentence_transformers` package provides the model we are using today, which was trained on 215M question-answer pairs and performs well across search tasks and domains.  
![semanticsearch](../img/semanticsearch.png)  
image from: https://www.sbert.net/examples/applications/semantic-search/README.html

In [None]:
from sentence_transformers import SentenceTransformer, util
model_name = 'multi-qa-MiniLM-L6-cos-v1'
bi_encoder = SentenceTransformer(model_name, cache_folder= model_path + 'bi_encoder')

query_embedding = bi_encoder.encode('How big is London')
passage_embedding = bi_encoder.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                       'London is known for its financial district'])

print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
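
As a rough illustration of what the cosine similarity measures, the sketch below computes it by hand for tiny made-up vectors. `cosine_similarity` is a hypothetical pure-Python helper, and real embeddings have hundreds of dimensions rather than two.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional "embeddings"
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way as the query, only longer
orthogonal = [0.0, 3.0]       # shares nothing with the query

sim_same = cosine_similarity(query_vec, same_direction)
sim_orth = cosine_similarity(query_vec, orthogonal)
print(sim_same, sim_orth)
```

The score depends only on the direction of the vectors, not on their length, which is why a longer vector pointing the same way still gets the maximum similarity.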

## Build Q&A Widget for Dataset

### Load Dataset 
Now we load the downloaded Simple English Wikipedia dataset.

In [None]:
import json
import gzip
wikipedia_filepath = data_path + 'simplewiki-2020-11-01.jsonl.gz'
passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])


In [None]:
print(len(passages), passages[:5])
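
The loading code above relies on each line of the file being one JSON record with a `title` and a list of `paragraphs`. The sketch below writes a tiny made-up file in that layout to a temporary directory and reads it back the same way; the two records and the file name are invented for illustration.

```python
import gzip
import json
import os
import tempfile

# Made-up records in the same layout as the lines read above
records = [
    {"title": "Example A", "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"title": "Example B", "paragraphs": ["Only paragraph."]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl.gz")

    # Write one JSON object per line, gzip-compressed
    with gzip.open(sample_path, "wt", encoding="utf8") as f_out:
        for record in records:
            f_out.write(json.dumps(record) + "\n")

    # Read it back and flatten it into [title, paragraph] pairs
    sample_passages = []
    with gzip.open(sample_path, "rt", encoding="utf8") as f_in:
        for line in f_in:
            data = json.loads(line.strip())
            for paragraph in data["paragraphs"]:
                sample_passages.append([data["title"], paragraph])

print(len(sample_passages), sample_passages[0])
```

Each article therefore contributes one passage per paragraph, and every passage keeps the title of the article it came from.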

### Create Dataset Embeddings

Now we encode the dataset as we did in the example above. To save time, we load existing embeddings from the data folder. For this model and the wiki dataset, it took 2 hours to compute the embeddings with 20 workers. The model is fine-tuned from a variant of Microsoft's MiniLM (*MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers*).

In [None]:
# If a cached embedding file exists for this model, load the existing embeddings and texts
embeddings_filepath = data_path + 'corpus_emb_MiniLM-L6.pkl'
if model_name == 'multi-qa-MiniLM-L6-cos-v1' and os.path.isfile(embeddings_filepath):
    with open(embeddings_filepath, "rb") as fIn:
        cache_data = pickle.load(fIn)
    passages = cache_data['sentences']
    corpus_embeddings = cache_data['embeddings']
    corpus_embeddings = corpus_embeddings.float()  # Convert the embeddings to float32
# Otherwise generate new embeddings and store them in a pickle file
else:
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)
    # Note: the new file is written to working_path, not data_path
    with open(working_path + 'corpus_emb_MiniLM-L6.pkl', "wb") as f:
        pickle.dump({'sentences': passages, 'embeddings': corpus_embeddings}, f, protocol=4)

For a small corpus (up to 1 million documents), we can compute the cosine similarity between the query and every document in the corpus with `util.cos_sim()` and retrieve the top k documents with `torch.topk`. Fortunately, this is already done for us by `sentence_transformers.util.semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>)`.
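
As a rough illustration of the top-k step, the sketch below ranks a list of made-up similarity scores in pure Python and returns them as dictionaries with `corpus_id` and `score` keys, the same keys read from each hit later in this notebook. `top_k_hits` is a hypothetical helper and is not how the library implements the search.

```python
import heapq


def top_k_hits(scores, top_k):
    """Return the top_k highest scores as dicts, best first."""
    # enumerate() pairs each score with its position in the corpus
    best = heapq.nlargest(top_k, enumerate(scores), key=lambda pair: pair[1])
    return [{"corpus_id": idx, "score": score} for idx, score in best]


# Made-up similarity scores between one query and five documents
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
hits = top_k_hits(scores, top_k=3)
for hit in hits:
    print("{:.3f}\t{}".format(hit["score"], hit["corpus_id"]))
```

The `corpus_id` is what lets us look up the original passage once the best-scoring documents are known.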

<div class="alert alert-block alert-warning">
<b>Task 1. Try it out</b> <br>
Embed the question and use the function util.semantic_search( ) to get the top_k relevant documents. <br>

</div>

In [None]:
%%time
top_k = 5

query = 'How many people Aileen Carol killed?'
### TODO 
#  create an embedding for this query using bi_encoder
query_embedding = 
# call semantic search with the top_k variable 
hits = 

hits = hits[0]
print("Input question:", query)
for hit in hits:
    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
print("\n\n========\n")

### Helper Functions

Now let's pack the code above into a function to use later.

<div class="alert alert-block alert-warning">
<b>Task 2.</b> <br>
Write the function searchContext( ): <br>
1. Encode the query<br>
2. Perform semantic search between question embedding and corpus embedding<br>
3. Return the <b>passage indexes</b> of the top_k hits
</div>

In [None]:
def searchContext(top_k, bi_encoder, query, corpus_embedding):
    indexes = []
    t = time.time()
    ### TODO
    #  create an embedding for the input query using bi_encoder
    question_embedding = 
    # return a list of indexes for the list of hits extracted
    hits = 
    return indexes

Below is a helper function to get the context from top hit ids.

In [None]:
# extract the top ranked passages to feed BERT
def getContext(indexes, passages):
    contexts = []
    for k in indexes:
        contexts.append(passages[k][1])
    return contexts

### Make the Widget 1.0

<div class="alert alert-block alert-warning">
<b>Task 3.</b> <br>
Complete the function QandA( ): <br>
1. Extract the top matching document ids, then get the contexts and answers<br>
2. Display the original documents that were matched<br>
</div>

In [None]:
def QandA(corpus_embeddings, passages):
    # Ask for question input
    promptQ = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Ask me a question</b>:</div>')
    display(promptQ)
    question = input()
    questions = [question]

    # ask for top k value
    promptK = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>How many results would you like?</b></div>')
    display(promptK)
    top_k = int(input())

    # display question
    question_HTML = '<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Query</b>: '+question+'</div>'
    display(HTML(question_HTML))
    
    ### TODO 3.1 search the corpus for relevant passages and answers, using the searchContext(), getContext() and getAnswer() functions
    top_k_ids = 
    contexts = 
    answers = 
    
    for a in answers:
        answers_HTML = '<div style="font-family: Times New Roman; font-size: 18px; margin-bottom:1pt"><b>Answer found</b>: '+a+'</div>'
        display(HTML(answers_HTML))
    # warning text
    warning_HTML = '<div style="font-family: Times New Roman; font-size: 15px; padding-bottom:15px; color:#E76f51; margin-top:1pt"> These are extracted answers from original documents. Please see the documents below:</div>'
    display(HTML(warning_HTML))
    
    ### TODO 3.2 get the list of original documents from search results
    doc = 
    df_hits = pd.DataFrame(doc, columns=['title','text'])
    df_hits['text'] = df_hits['text'].str.wrap(100)  # assign the result, str.wrap does not modify in place
    display(HTML(df_hits.to_html(render_links=True, escape=False)))

In [None]:
QandA(corpus_embeddings, passages)

## Challenge: Improve The Widget  
### Retrieve & Re-rank with Cross-Encoder

There are several ways to improve this workflow. Here we introduce the **Retrieve & Re-rank Pipeline**, which performs better for long documents and complex searches. It first retrieves the relevant passages using a bi-encoder (what we did above), then re-ranks them by the classification score of each question-passage pair.  
![bi-cross encode](../img/Bi_vs_Cross-Encoder.png )   
image from: https://www.sbert.net/examples/applications/cross-encoder/README.html  
Cross-encoders do not produce embeddings, so they are inefficient when millions of pairs have to be compared. In this pipeline, however, we first narrow the scope with the bi-encoder and then use the cross-encoder to improve the accuracy of the results.
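
The two-stage idea can be sketched in pure Python with toy scoring functions. Everything here is an assumption made for illustration: `cheap_score` stands in for the bi-encoder similarity, `careful_score` stands in for the cross-encoder, and neither resembles what the real models compute.

```python
def cheap_score(query, passage):
    """Stage 1 stand-in: number of distinct words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def careful_score(query, passage):
    """Stage 2 stand-in: shared words as a fraction of the passage length."""
    words = passage.lower().split()
    return len(set(query.lower().split()) & set(words)) / len(words)


def retrieve_and_rerank(query, corpus, n_candidates, top_k):
    # Stage 1: score every passage cheaply and keep only n_candidates of them
    candidates = sorted(range(len(corpus)),
                        key=lambda i: cheap_score(query, corpus[i]),
                        reverse=True)[:n_candidates]
    # Stage 2: run the more expensive scorer on the candidates only
    reranked = sorted(candidates,
                      key=lambda i: careful_score(query, corpus[i]),
                      reverse=True)
    return reranked[:top_k]


corpus = [
    "london is a big city and london has a big river",
    "london is big",
    "paris lies in france",
    "cats sleep all day",
]
top_ids = retrieve_and_rerank("how big is london", corpus, n_candidates=3, top_k=2)
print(top_ids)
```

The first stage decides which passages are considered at all, and the second stage only changes their order, so a passage missed by the first stage can never be recovered by the second.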

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder
# downloading from library
# cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# cross_encoder.save('../model/cross_encoder')
cross_encoder = CrossEncoder(model_path + 'cross_encoder')

In [None]:
# example
query = 'When is Aileen Wuornos born'

# The cross-encoder takes [query, passage] pairs as input and performs a classification task on each pair
cross_input = [[query, 'Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".'],
                [query, 'Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.'],
                [query, 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.'],
                [query, 'The movie, "Monster" is about her life. Two documentaries were made about her.'], 
                [query, 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14.']]

cross_scores = cross_encoder.predict(cross_input)
cross_scores

<div class="alert alert-block alert-warning">
<b>Task 4: Complete function rerank( ) and QandArerank( )</b> <br>
    1. The function rerank( ) takes one question and compares it with a list of contexts. We use the list of indexes and the passages data to get the contexts. The list of indexes comes from the bi-encoder search. The function returns the list of indexes sorted by cross-encoder score.<br>
    2. Use the bi-encoder to search for the top 20 relevant passages and use rerank( ) to get the top_k passage indexes, where top_k is the number entered by the user.
</div>  

In [None]:
def rerank(question, indexs, passages):
    print('>>>> Re-ranking results...')
    ### TODO 4.1
    cross_in = 
    cross_result = 
    rank_top = []
    for i, v in enumerate(indexs):
        rank_top.append([v, cross_result[i]])
    rank_top = sorted(rank_top, key=lambda x: x[1], reverse=True)
    rank_top = [e[0] for e in rank_top]
    return rank_top

In [None]:
def QandArerank(corpus_embeddings, passages):
    # Ask for question input
    promptQ = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Ask me a question</b>:</div>')
    display(promptQ)
    question = input()
    questions = [question]

    # ask for top k value after cross-encoder
    promptK = HTML('<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>How many results would you like?</b></div>')
    display(promptK)
    top_k = int(input())

    # display question
    question_HTML = '<div style="font-family: Times New Roman; font-size: 20px; padding-bottom:28px; margin-top:1pt"><b>Query</b>: '+question+'</div>'
    display(HTML(question_HTML))
    
    ### TODO 4.2 
    bi_top_k = 
    cross_top_k = 
    contexts = getContext(cross_top_k, passages)
    answers = getAnswer(contexts, questions, tokenizer, qa_model)
    
    for a in answers:
        answers_HTML = '<div style="font-family: Times New Roman; font-size: 18px; margin-bottom:1pt"><b>Answer found</b>: '+a+'</div>'
        display(HTML(answers_HTML))
    # warning text
    warning_HTML = '<div style="font-family: Times New Roman; font-size: 15px; padding-bottom:15px; color:#E76f51; margin-top:1pt"> These are extracted answers from original documents. Please see the documents below:</div>'
    display(HTML(warning_HTML))
    
    doc = [passages[k] for k in cross_top_k]
    
    df_hits = pd.DataFrame(doc, columns=['title','text'])
    df_hits['text'] = df_hits['text'].str.wrap(100)  # assign the result, str.wrap does not modify in place
    display(HTML(df_hits.to_html(render_links=True, escape=False)))

In [None]:
QandArerank(corpus_embeddings, passages)

>Explore the dataset and ask different questions to see the performance  

>Pay attention to the **first** answer in both of your widgets, can you see the improvement?  

>Switch to a bigger dataset with longer documents, and try these widgets at home!

<div class="alert alert-block alert-info">
Apart from the retrieve and re-rank pipeline, there are other methods to improve the workflow, such as Elasticsearch, FAISS indexing or nearest-neighbour search. The right choice depends on the size of the dataset and the task you wish to perform.  

</div>  

------------------------  
By the end of this notebook, you have created and improved a small widget that finds answers for you in the articles, i.e. information retrieval. In the [next notebook](4-Topic_Modelling.ipynb), we will have a look at topic modelling, another useful application for exploring a large amount of text data.    

--------------------------
