# **Introduction**


Business and organisational data keeps growing, which calls for better ways to find a specific piece of information when it is needed. When that data is natural-language text (PDFs and similar documents), semantic search is a good fit.

**Search** is a vital feature in applications across many companies. Most applications have a search function, whether internal or external.

"Most organizations rely on a cluster of keyword-based search interfaces hosted on various ‘internal portals’ to deal with language data. If done well, this can satisfy business requirements for some of that data."


---

**Use of Keywords:**


For people who know exactly what they are looking for, keywords may suffice. Keyword search has limitations, though, such as when the searcher does not know the terminology used in the documents. With large data repositories, this can become a herculean task and a drain on productivity.

---

**Solution:**

Semantic search via question answering. With semantic search we can search a data repository by concept and closely related phrases rather than by exact keywords.

Question answering (QA) does this by taking a natural-language question as the query and returning relevant documents and specific answers.

## **Open-domain QA (ODQA)**

ODQA can be split into:



*   **Extractive QA:** This combines an information retrieval (IR) step and a reading comprehension (RC) step.

*   **Abstractive QA:** "In open-book abstractive QA, the first IR step is the same as extractive QA; relevant contexts are retrieved from an external source. These contexts are passed to the text generation model (such as GPT) and used to generate (not extract) an answer. OpenAI’s GPT models are well-known generative transformer models."





### **Extractive QA**

Using Extractive QA we can ask a question and then extract an answer from a short text.


It is not a single model but actually consists of three components:


*   Indexed data (document store/vector database)
*   Retriever model
*   Reader model

---

A traditional retriever uses **sparse vector retrieval** with TF-IDF or BM25. Elasticsearch is a popular choice for this kind of keyword-based retrieval.

The other option is **dense vector retrieval**, with sentence vectors built by transformer models like BERT. Dense vectors enable search by semantics, that is, searching by the meaning of a question rather than its exact words.

In short, there are two fundamentally different categories of retrievers: sparse (e.g. TF-IDF, BM25) and dense (e.g. DPR, sentence-transformers).
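To make the limitation of sparse retrieval concrete, here is a minimal pure-Python sketch of TF-IDF scoring. The helper names (`tokenize`, `tfidf_scores`), the smoothing used for IDF, and the two toy documents are assumptions made for illustration, not part of any library used in this notebook.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    tokens = (tok.strip(".,?!").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def tfidf_scores(query, docs):
    """Score every document against the query with a simple TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores

docs = [
    "The Denver Broncos won Super Bowl 50.",
    "The champions lifted the trophy in Santa Clara.",
]
# Shares the words "won", "super", "bowl" and "50" with the first document.
keyword_scores = tfidf_scores("Who won Super Bowl 50?", docs)
# Same intent, but no word overlap with either document.
synonym_scores = tfidf_scores("Which side was victorious?", docs)
print(keyword_scores, synonym_scores)
```

The second query scores zero against both documents because it shares no terms with them, which is the gap dense retrievers aim to close.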

**Example 1**

We'll use a QA dataset, the Stanford Question Answering Dataset (SQuAD), and download it with Hugging Face’s datasets library.

In [1]:
!pip install datasets



In [2]:
import datasets

qa=datasets.load_dataset('squad', split='validation')
qa

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

In [3]:
qa[0]

{'answers': {'answer_start': [177, 177, 177],
  'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56be4db0acb8001400a502ec',
 'question': 'Which NFL team represented the AFC at Super Bo

In [4]:
qa[10]

{'answers': {'answer_start': [334, 334, 334],
  'text': ['February 7, 2016', 'February 7', 'February 7, 2016']},
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'id': '56bea9923aeaaa14008c91bb',
 'question': 'What day was the Super Bowl played on?',
 'tit
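Each record stores its answers as text plus a character offset (`answer_start`) into the context. The sketch below shows how that offset can be used to slice the answer back out. It builds a small record in the same shape rather than loading SQuAD, and the helper name `answer_spans` is an assumption for illustration.

```python
context = (
    "The American Football Conference (AFC) champion Denver Broncos "
    "defeated the National Football Conference (NFC) champion Carolina Panthers."
)
answer_text = "Denver Broncos"

# A toy record in the same shape as a SQuAD row; the offset is computed here,
# not copied from the dataset.
record = {
    "context": context,
    "answers": {
        "text": [answer_text],
        "answer_start": [context.find(answer_text)],
    },
}

def answer_spans(row):
    """Slice each annotated answer out of the context using its character offset."""
    spans = []
    answers = row["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        spans.append(row["context"][start:start + len(text)])
    return spans

recovered = answer_spans(record)
print(recovered)
```

The same check can be run over real rows as a sanity test that the offsets and answer texts agree.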

**Example 2**: Dense Vector Approach

First let's encode our contexts with a QA model like multi-qa-MiniLM-L6-cos-v1 from **sentence-transformers**.

In [5]:
#We initialize the model with:
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

"Using **the model**, we encode the contexts inside our dataset object ***qa*** to create the sentence vector representations to be indexed in our vector database."

In [6]:
qa=qa.map(lambda x: {
    'encoding': model.encode(x['context']).tolist()
}, batched=True, batch_size=32
)
qa

  0%|          | 0/331 [00:00<?, ?ba/s]

Dataset({
    features: ['answers', 'context', 'encoding', 'id', 'question', 'title'],
    num_rows: 10570
})

We store the encoded contexts inside a **vector database**. We will use **Pinecone** in this example (which requires a free API key).

Let's initialise a connection to Pinecone, create a new index, and connect to it.

In [8]:
!pip install pinecone-client
import pinecone

# replace YOUR_API_KEY with your own Pinecone API key
pinecone.init(api_key='YOUR_API_KEY', environment='us-west1-gcp')

# check if index already exists, if not we create it
if 'qa-index' not in pinecone.list_indexes():
    pinecone.create_index(name='qa-index', dimension=len(qa[0]['encoding']))

# connect to index
index = pinecone.Index('qa-index')



After that, we **upsert** (upload and insert) our vectors to the Pinecone index. We do this in batches where each sample is a tuple of (id, vector).

In [9]:
from tqdm.auto import tqdm  # progress bar

upserts = [(v['id'], v['encoding']) for v in qa]
# now upsert in chunks
for i in tqdm(range(0, len(upserts), 50)):
    i_end = i + 50
    if i_end > len(upserts): i_end = len(upserts)
    index.upsert(vectors=upserts[i:i_end])

  0%|          | 0/212 [00:00<?, ?it/s]
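The batching logic in the loop above can be factored into a small generator. This is a sketch only: `batched` and `toy_upserts` are illustrative names, and the toy vectors stand in for the real encodings.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so no manual clamping is needed.
        yield items[start:start + batch_size]

# Hypothetical (id, vector) pairs standing in for the real `upserts` list.
toy_upserts = [(f"id-{i}", [0.0, 0.0, 0.0]) for i in range(120)]
batch_sizes = [len(batch) for batch in batched(toy_upserts, 50)]
print(batch_sizes)
```

With a helper like this, the upload loop would reduce to iterating over `batched(upserts, 50)` and passing each batch to `index.upsert`.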

**QA Process**

After indexing the contexts, we can run the QA process.


"Given a question/query, the retriever creates a sparse/dense vector representation called a query vector. This query vector is compared against all of the already indexed context vectors in the database. The n most similar are returned."

In [10]:
query = "Which NFL team represented the AFC at Super Bowl 50?"
xq = model.encode([query]).tolist()

In [11]:
xc = index.query(xq, top_k=5)
xc

{'results': [{'matches': [{'id': '56be4db0acb8001400a502ec',
                           'score': 0.685847402,
                           'values': []},
                          {'id': '56be4db0acb8001400a502f0',
                           'score': 0.685847402,
                           'values': []},
                          {'id': '56be4db0acb8001400a502ef',
                           'score': 0.685847402,
                           'values': []},
                          {'id': '56be4db0acb8001400a502ee',
                           'score': 0.685847402,
                           'values': []},
                          {'id': '56be4db0acb8001400a502ed',
                           'score': 0.685847402,
                           'values': []}],
              'namespace': ''}]}
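What the vector database does with the query vector can be sketched as a brute-force search. The sketch assumes cosine similarity as the metric; `cosine`, `top_k` and the three-dimensional toy vectors are illustrative, and a production vector database typically relies on approximate nearest-neighbour search rather than scoring every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k):
    """Return the k (id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy index: ids mapped to made-up context vectors.
toy_index = {
    "ctx-a": [1.0, 0.0, 0.0],
    "ctx-b": [0.7, 0.7, 0.0],
    "ctx-c": [0.0, 0.0, 1.0],
}
matches = top_k([0.9, 0.1, 0.0], toy_index, k=2)
print(matches)
```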

In [12]:
ids = [x['id'] for x in xc['results'][0]['matches']]
contexts = qa.filter(lambda x: x['id'] in ids)
contexts['context']

  0%|          | 0/11 [00:00<?, ?ba/s]

['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Foo
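All five matches share one score and the retrieved contexts are identical, which is what happens when several SQuAD questions point at the same paragraph. In addition, `filter` returns rows in dataset order rather than in ranking order. One possible refinement, sketched here with hypothetical names (`unique_contexts`, `toy_rows`), is to keep each distinct context once, in the order the retriever ranked it.

```python
def unique_contexts(rows, ranked_ids):
    """Return distinct context strings, ordered by the rank of their first matching id."""
    by_id = {row["id"]: row["context"] for row in rows}
    seen = set()
    ordered = []
    for doc_id in ranked_ids:
        context = by_id.get(doc_id)
        if context is not None and context not in seen:
            seen.add(context)
            ordered.append(context)
    return ordered

# Toy rows: two questions share one paragraph, a third has its own.
toy_rows = [
    {"id": "q1", "context": "Paragraph about Super Bowl 50."},
    {"id": "q2", "context": "Paragraph about Super Bowl 50."},
    {"id": "q3", "context": "Paragraph about Levi's Stadium."},
]
distinct = unique_contexts(toy_rows, ranked_ids=["q3", "q1", "q2"])
print(distinct)
```

Deduplicating this way would also avoid running the reader several times over the same text.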

For our reader model:

Let's use the deepset/electra-base-squad2 model from Hugging Face’s transformers as our reader model.

We'll set up a 'question-answering' pipeline and pass our query and contexts to it one by one.

In [13]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
nlp = pipeline(tokenizer=model_name, model=model_name, task='question-answering')

In [14]:
print(query)
for context in contexts['context']:
  print(nlp(question=query, context=context))

Which NFL team represented the AFC at Super Bowl 50?
{'score': 0.9998526573181152, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
{'score': 0.9998526573181152, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
{'score': 0.9998526573181152, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
{'score': 0.9998526573181152, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
{'score': 0.9998526573181152, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
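The reader returns one dict per context, so repeated contexts produce repeated answers. A small post-processing step can collapse them and keep the most confident answer. This is a sketch under assumptions: `best_answer` is an illustrative helper, and the third toy output (the low-scoring one) is invented to show the selection.

```python
def best_answer(reader_outputs):
    """Collapse duplicate answers and return the highest-scoring one."""
    best_by_text = {}
    for out in reader_outputs:
        text = out["answer"]
        if text not in best_by_text or out["score"] > best_by_text[text]["score"]:
            best_by_text[text] = out
    return max(best_by_text.values(), key=lambda out: out["score"])

# Dicts shaped like the question-answering pipeline output shown above.
toy_outputs = [
    {"score": 0.9998, "start": 177, "end": 191, "answer": "Denver Broncos"},
    {"score": 0.9998, "start": 177, "end": 191, "answer": "Denver Broncos"},
    {"score": 0.4100, "start": 249, "end": 266, "answer": "Carolina Panthers"},  # hypothetical
]
top = best_answer(toy_outputs)
print(top["answer"], top["score"])
```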


### **Abstractive QA**

Abstractive QA can be split into two types: open-book and closed-book.


---
Open Book: 

"Rather than extracting answers, contexts are used as input (alongside the question) to a generative sequence-to-sequence (seq2seq) model. The model uses the question and context to generate an answer." 

The seq2seq model is commonly BART- or T5-based.


Let's initialise a seq2seq pipeline using a BART model fine-tuned for abstractive QA: yjernite/bart_eli5.

In [15]:
from transformers import pipeline

model_name = 'yjernite/bart_eli5'
seq2seq = pipeline('text2text-generation', model=model_name, tokenizer=model_name)

The initial question was specific: we’re looking for a short, concise answer (Denver Broncos). Abstractive QA is not ideal for these types of questions:

In [16]:
for context in contexts['context']:
  answer = seq2seq(
      f"question: {query} context: {context}",
      num_beams=4,
      do_sample=True,
      temperature=1.5,
      max_length=64
  )
  print(answer)






[{'generated_text': ' Which NFL team? The AFC. Which NFL team? The AFC. Which NFL team? The AFC. Which NFL team? The AFC. The AFC. Which NFL team? The AFC. Which NFL team? The AFC. Which AFC team? The AFC. Which AFC team? The AFC. Which AFC team'}]
[{'generated_text': ' The AFC won the AFC championship. Since the AFC was the division champs and the AFC was a wildcard, one of the AFC teams played the AFC team in the AFC championship game. That team was the Patriots.'}]
[{'generated_text': ' Which NFL team represented the AFC in Super Bowl 50? Which NFL team represented the AFC at Super Bowl 50? Which team represented the AFC at Super Bowl 50? Which NFL team represented the AFC in Super Bowl 50? Which NFL team represented the AFC at Super Bowl 50?'}]
[{'generated_text': ' The team that won the AFC Championship played the team from the AFC Championship game. That was the Denver Broncos, not the Carolina Panthers, so it was not the Carolina Panthers that played the Panthers at the Super Bo

The advantage of abstractive QA shows with more ‘abstract’ questions like "Do NFL teams only care about playing at the Super Bowl?" Here we are effectively asking for an opinion, so an exact answer is unlikely.

I'll **experiment below:**


**Retriever**

Here we use a **DensePassageRetriever**.

**Alternatives:**

*   **ElasticsearchRetriever** with custom queries (e.g. boosting) and filters
*   **EmbeddingRetriever** to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
*   **TfidfRetriever** in combination with a SQL or in-memory document store for simple prototyping and debugging

In [17]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-gt4bz5yu
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-gt4bz5yu


In [20]:
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

In [21]:
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
#document_store.update_embeddings(retriever)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0


Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/493 [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-question_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-question_encoder-single-nq-base


Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/492 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find facebook/dpr-ctx_encoder-single-nq-base locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded facebook/dpr-ctx_encoder-single-nq-base


In [23]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer, \
                         DPRQuestionEncoder, DPRQuestionEncoderTokenizer

In [None]:
ctx_model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

question_model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

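In [33]:
query = "Do NFL teams only care about playing at the Super Bowl?"
# The original call, retriever.encode([query]), raised a TypeError here.
# The Pinecone index was built from vectors produced by the sentence-transformers
# model, so we encode the query with that same model.
xq = model.encode([query]).tolist()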

In [None]:
xc = index.query(xq, top_k=5)
ids = [x['id'] for x in xc['results'][0]['matches']]
contexts = qa.filter(lambda x: x['id'] in ids)

In [None]:
for context in contexts['context']:
  answer = seq2seq(
      f"question: {query} context: {context}",
      num_beams=4,
      do_sample=True,
      temperature=1.5,
      max_length=64
  )
  print(answer)

These answers are much better than those for the ‘specific’ question.

Note that the returned contexts don’t include direct information about whether the teams care about being in the Super Bowl. Instead, they contain snippets of concrete NFL/Super Bowl details.

**Closed Book**

There is no retrieval step; the system is nothing more than a generator model.


---



Dropping the retriever model doesn’t mean we have to stick with the same reader model.

As we saw before, the yjernite/bart_eli5 model requires input like:

`question: <our question> context: <a (hopefully) relevant context>`
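Because the model is sensitive to this format, it can help to build the input string in one place. The helper below is a minimal sketch; `build_prompt` is a hypothetical name, and collapsing whitespace is an assumption about what makes a clean input, not a documented requirement of the model.

```python
def build_prompt(question, context):
    """Format a question and a context as 'question: ... context: ...'."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    question = " ".join(question.split())
    context = " ".join(context.split())
    return f"question: {question} context: {context}"

prompt = build_prompt(
    "Which NFL team represented the AFC at Super Bowl 50?",
    "The AFC champion Denver Broncos\n defeated the Panthers.",
)
print(prompt)
```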

*Without* the context input, the previous model does not perform as well. 

The **seq2seq model** is optimised to produce coherent answers when given both **question and context.** If our input is in a new, unexpected format, performance suffers:

In [27]:
query = "Do NFL teams only care about playing at the Super Bowl?"

seq2seq(
    f"question: {query} context: unknown",
    num_beams=4,
    do_sample=True,
    temperature=1.5,
    max_length=64
)



[{'generated_text': ' The best ELI5 I\'ve seen on the matter is: "You\'re playing at the Super Bowl and you\'re playing really well. But you\'re not going to win it, you\'re just going to tie it, which isn\'t good for you. But you\'re not going to win it either,'}]

"The model doesn’t know the answer and flips the direction of questioning. This isn’t what we want. However, there are many alternative models we can try. The GPT models from OpenAI are well-known examples of generative transformers and can produce good results."

GPT-3 from OpenAI is only accessible through an API, but there are open-source alternatives like GPT-Neo from EleutherAI.

We'll use **GPT-Neo models.**

In [28]:
gen = pipeline('text-generation', model='EleutherAI/gpt-neo-125M', tokenizer='EleutherAI/gpt-neo-125M')

gen(query, max_length=32)

Downloading:   0%|          | 0.00/0.98k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/502M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/560 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/357 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Do NFL teams only care about playing at the Super Bowl?\n\nThe NFL is a great place to play, but it’s not the only place'}]

Here we’re using the 'text-generation' pipeline. "All we do here is to generate text following a question. We do get an interesting answer which is true but doesn’t necessarily answer the question. We can try a few more questions."

In [29]:
gen("Where do cats come from?", max_length=32)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Where do cats come from?\n\nCats are a group of animals that are found in the wild. They are the most common species in the wild,'}]

In [30]:
gen("Who was the first person on the moon?", max_length=32)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Who was the first person on the moon?\n\nThe moon is the most important part of the Earth's atmosphere. It is the most important part of the"}]

In [31]:
gen("What is the moon made of?", max_length=32)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'What is the moon made of?\n\nThe moon is a kind of material that is made of a material that is made of a material that is made of'}]

Let's tweak parameters to reduce the likelihood of repetition.

In [32]:
gen("What is the moon made of?", max_length=32, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'What is the moon made of? If it was to look like, it might be. It’s called the moon. It’s the only'}]
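With `do_sample=True`, the next token is drawn from the model's probability distribution instead of always taking the most likely one, and `temperature` (used with the seq2seq model earlier) rescales that distribution before sampling. The sketch below illustrates the effect on made-up logits; `softmax_with_temperature` and `toy_logits` are illustrative names, not library functions.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=1.5)
print(sharp, flat)
```

A low temperature concentrates probability on the top token, which tends toward repetitive output, while a higher temperature spreads probability out and gives more varied but less predictable text.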

Looking at the results, it's clear that closed-book abstractive QA is a more challenging task. "Larger models store more internal knowledge; thus, closed-book performance is very much tied to model size. With bigger models, we can get better results, but for consistent answers, the open-book alternatives tend to outperform the closed-book approach."

**References**

*   Question Answering with Pinecone: https://www.pinecone.io/learn/question-answering/
*   deepset-ai-haystack-python-natural-language-processing: https://pythonrepo.com/repo/deepset-ai-haystack-python-natural-language-processing
*   Better Retrieval via DPR notebook: [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb#scrollTo=wgjedxx_A6N6)

