# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch


# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Because indexing the entire dataset takes time, this demo uses only a subset: passages whose `section_title` is exactly "History". The original demo targets 50,000 such passages; the run shown here stops after the first 1,000 to keep things fast. You can use the complete dataset if you want, since a Pinecone vector database can manage millions of documents.

In [2]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True
).shuffle(seed=960)


The repository for vblagoje/wikipedia_snippets_streamed contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/vblagoje/wikipedia_snippets_streamed.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9GB) to download. Instead, records are downloaded one at a time as we iterate over them.
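To see why streaming avoids the up-front download, here is a minimal, self-contained sketch that uses a plain Python generator as a stand-in for the streamed dataset. `fake_stream` and its records are hypothetical and not part of the `datasets` library. Records are produced only when the consumer asks for them, so taking the first few matching records touches only a handful of items.

```python
from itertools import islice

produced = []  # indices of the records the stand-in stream has actually generated

def fake_stream(n_total):
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(n_total):
        produced.append(i)
        yield {
            "section_title": "History" if i % 2 == 0 else "Other",
            "passage_text": f"passage {i}",
        }

# lazily keep only "History" records, then take the first three
history_only = (r for r in fake_stream(1_000_000) if r["section_title"] == "History")
first_three = list(islice(history_only, 3))

print([r["passage_text"] for r in first_three])
print(len(produced))  # far fewer than 1,000,000 records were generated
```

The same idea is why the notebook can filter and sample a 9GB dataset without downloading it in full.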

In [3]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [5]:
# keep only documents whose section_title is exactly "History"
history = wiki_data.filter(lambda example: example["section_title"] == "History")


Let's iterate through the filtered dataset and collect historical passages, stopping after the first 1,000 (raise the limit, for example to 50,000, for a larger index). We will extract `article_title`, `section_title` and `passage_text` from each document.

In [32]:
from tqdm.auto import tqdm  # progress bar

# number of passages to collect (1,000 keeps this run fast; raise it for a larger index)
total_doc_count = 1000

docs = []

# iterate through the filtered stream and collect the fields we need
for d in tqdm(history, total=total_doc_count):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"]
    })

    # stop once we have collected enough passages
    if len(docs) >= total_doc_count:
        break

In [33]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
| --- | --- | --- | --- |
| 0 | Taupo District | History | was not until the 1950s that the region starte... |
| 1 | The Bishop Wand Church of England School | History | The Bishop Wand Church of England School Histo... |
| 2 | Surface Hill Uniting Church | History | in perpetual reminder that work and worship go... |
| 3 | The Electras (band) | History | as its B-side. However, copies of the single, ... |
| 4 | Swanton House | History | it. Lane provided funds for restoration by the... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need a Pinecone API key. You can get one for free from the [Pinecone console](https://app.pinecone.io/). We then initialize the connection as follows:

In [35]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where the index will be deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [36]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We set the metric to "cosine" and the dimension to 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [39]:
# give your index a meaningful name
index_name = "abstractive-question-answering"

In [40]:
import time

# create the index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=spec
    )

    # wait until the new index is ready to accept requests
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

# connect to the index now that it exists
index = pc.Index(index_name)

print(f"Index '{index_name}' is ready.")


Index 'abstractive-question-answering' is ready.


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).

In [41]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base', device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
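The `Normalize()` layer in the model summary means the embeddings are scaled to unit length, in which case cosine similarity reduces to a plain dot product. The sketch below illustrates this with made-up 3-dimensional vectors (real embeddings from this model have 768 dimensions). The vectors and helper names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query_vec = [1.0, 2.0, 2.0]        # made-up "question" embedding
passage_vec = [2.0, 4.0, 4.5]      # points in nearly the same direction
unrelated_vec = [-2.0, 1.0, 0.0]   # orthogonal to the query

# for unit-length vectors, the dot product equals the cosine similarity
q, p = normalize(query_vec), normalize(passage_vec)
print(abs(dot(q, p) - cosine_similarity(query_vec, passage_vec)) < 1e-9)

# the passage pointing in a similar direction scores higher than the unrelated one
print(cosine_similarity(query_vec, passage_vec) > cosine_similarity(query_vec, unrelated_vec))
```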

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches, which speeds up both embedding and uploading to the Pinecone index. For each passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, and passage text.

In [42]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))

    # extract batch
    batch = df.iloc[i:i_end]

    # generate embeddings for batch
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()

    # prepare metadata
    meta = batch[["article_title", "section_title", "passage_text"]].to_dict(orient="records")

    # generate unique IDs
    ids = [str(x) for x in range(i, i_end)]

    # upsert to Pinecone
    to_upsert = list(zip(ids, emb, meta))
    index.upsert(vectors=to_upsert)

# check how many vectors in the index
index.describe_index_stats()



{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}
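The record layout used in the loop above, `(id, vector, metadata)` tuples grouped into fixed-size batches, can be sketched without Pinecone or a model. Here `fake_embed` is a hypothetical stand-in for `retriever.encode`, and the documents are made up.

```python
def fake_embed(texts):
    """Hypothetical stand-in for retriever.encode: one tiny vector per text."""
    return [[float(len(t)), 0.0] for t in texts]

def build_batches(docs, batch_size):
    """Yield lists of (id, vector, metadata) tuples, batch_size records at a time."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        vectors = fake_embed([d["passage_text"] for d in batch])
        # ids follow the position of each document in the full list
        ids = [str(i) for i in range(start, start + len(batch))]
        yield list(zip(ids, vectors, batch))

docs = [
    {"article_title": f"Article {i}", "section_title": "History",
     "passage_text": "x" * (i + 1)}
    for i in range(5)
]

batches = list(build_batches(docs, batch_size=2))
print([len(b) for b in batches])  # the last batch may be smaller than batch_size
print(batches[-1][0][0])          # id of the first record in the last batch
```

Deriving ids from the document position keeps them unique across batches, which matters because upserting an existing id overwrites that vector.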

# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string will look as follows:

>question: What is a sonic boom? context: `<P>` A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. `<P>` Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. `<P>` Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).
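As a minimal sketch of the input format described above, the helper below joins a question and a list of passages into one string, prefixing each passage with `<P>`. The name `build_generator_input` and the shortened passages are hypothetical, used only for illustration.

```python
def build_generator_input(question, passages):
    """Concatenate a question and context passages, each passage prefixed by <P>."""
    context = " ".join(f"<P> {p}" for p in passages)
    return f"question: {question} context: {context}"

example = build_generator_input(
    "What is a sonic boom?",
    [
        "A sonic boom is a sound associated with shock waves.",
        "Sonic booms generate enormous amounts of sound energy.",
    ],
)
print(example)
```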

Let's initialize the BART model using transformers.

In [43]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are now in place and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [29]:
def query_pinecone(query, top_k=3):
    # generate embedding for the query
    xq = retriever.encode([query]).tolist()

    # search pinecone index for relevant context
    response = index.query(
        vector=xq[0],
        top_k=top_k,
        include_metadata=True
    )

    # return only matches (with metadata, scores, etc.)
    matches = response.get("matches", [])
    return matches


In [44]:
def format_query(query, context):
    # extract passage_text from Pinecone search results
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]

    # join all context passages
    context = "\n".join(context)

    # combine query and context into final input
    query = f"question: {query}\ncontext: {context}"

    return query


Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages, and then pass them to the `format_query` function.

In [45]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

[{'id': '769',
  'metadata': {'article_title': 'Energy in the United States',
               'passage_text': 'Energy in the United States History From its '
                               'founding until the late 19th century, the '
                               'United States was a largely agrarian country '
                               'with abundant forests. During this period, '
                               'energy consumption overwhelmingly focused on '
                               'readily available firewood. Rapid '
                               'industrialization of the economy, urbanization, '
                               'and the growth of railroads led to increased '
                               'use of coal, and by 1885 it had eclipsed wood '
                               "as the nation's primary energy source.\n"
                               'Coal remained dominant for the next seven '
                               'decades, but by 1950, it was surpassed in

In [46]:
from pprint import pprint

In [49]:
# format the query in the form generator expects the input

formatted_query = format_query(query, result)
print(formatted_query)

question: when was the first electric power system built?
context: <P> Energy in the United States History From its founding until the late 19th century, the United States was a largely agrarian country with abundant forests. During this period, energy consumption overwhelmingly focused on readily available firewood. Rapid industrialization of the economy, urbanization, and the growth of railroads led to increased use of coal, and by 1885 it had eclipsed wood as the nation's primary energy source.
Coal remained dominant for the next seven decades, but by 1950, it was surpassed in turn by both petroleum and natural gas. The 1973 oil embargo precipitated an energy crisis in the United States. In


The output looks great. Now let's write a function to generate answers.

In [52]:
def generate_answer(query):
    # tokenize the query to get input_ids (truncate inputs longer than 1024 tokens)
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)

    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)

    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    # return the answer
    return answer


In [53]:
generate_answer(query)

'Electricity was first used in the 19th century. The first electric power system built was a steam engine.'

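Note that this call passed the raw `query` rather than `formatted_query`, so the answer above was generated without the retrieved context. To ground the answer in the retrieved passages, pass the output of `format_query` to `generate_answer`, as the following cells do. Let's run some more queries.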

In [57]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)


'The first wireless message was sent in the early 1900s. The first wireless message was sent in the early 1900s. The first wireless message was sent in the early 1900s. The first'
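Repetition like this is a known failure mode of text generation. As a rough post-processing sketch (an assumption, not part of the original notebook), the helper below drops sentences that have already appeared and any trailing fragment that lacks end punctuation. `tidy_answer` is a hypothetical name.

```python
import re

def tidy_answer(answer):
    """Drop repeated sentences and a trailing fragment without end punctuation."""
    # split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    seen, kept = set(), []
    for s in sentences:
        if not s.endswith((".", "!", "?")):
            continue  # incomplete fragment, usually cut off by max_length
        if s not in seen:
            seen.add(s)
            kept.append(s)
    return " ".join(kept)

raw = (
    "The first wireless message was sent in the early 1900s. "
    "The first wireless message was sent in the early 1900s. The first"
)
print(tidy_answer(raw))
```

This only cleans up the text after the fact; it does not make the answer more accurate.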

To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [59]:
for doc in context:
    print(doc["metadata"]["passage_text"], end='\n---\n')


communications arose in the mid-1990s, when messaging and real-time communications began to combine. In 1993, ThinkRite (VoiceRite) developed the unified messaging system, POET, for IBM's internal use. It was installed in 55 IBM US Branch Offices for 54,000 employees and integrated with IBM OfficeVision/VM (PROFS) and provided IBMers with one phone number for voicemail, fax, alphanumeric paging and follow-me.  POET was in use until 2000.  In the late 1990s, a New Zealand-based organization called IPFX developed a commercially available presence product, which let users see the location of colleagues, make decisions on how to contact them, and define
---
Instead, the handset lived on the network as another computer device.  The transport of audio was therefore no longer a variation in voltages or modulation of frequency such as with the handsets from before, but rather encoding the conversation using a CODEC (G.711 originally) and transporting it with a protocol such as the Real-time Tr

In this case, the answer is plausible, but the passages shown above do not clearly support it. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19:

In [61]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

"COVID-19 is a virus that infects your body. It's not a virus that causes cancer, it's a virus that causes your body to produce more cancer. It's a"

In [62]:
for doc in context:
    print(doc["metadata"]["passage_text"], end='\n---\n')


as each appeared.
In 1991, a group of Russian chemistry students discovered a simplified synthesis route which used phosgene instead of phenethylamine. Soon, abuse of the drug became widespread, causing a tenth of overdoses in the Moscow region. α-Methylfentanyl became notorious for low safety, and production declined.
---
Α-Methylfentanyl History α-Methylfentanyl was initially discovered by a team at Janssen Pharmaceutica in the 1960s. In 1976, it began to appear mixed with "china white" heroin as an additive. It was first identified in the bodies of two drug overdose victims in Orange County, California, in December 1979, who appeared to have died from opiate overdose but tested negative for any known drugs of this type. Over the next year, there were 13 more deaths, and eventually the responsible agent was identified as α-methylfentanyl.
α-Methylfentanyl was placed on the Schedule I list in September 1981, only two years after its appearance
---
hockey team. The tenth WickFest will 
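One possible safeguard, sketched here as an assumption rather than something this notebook does, is to drop matches whose similarity score falls below a threshold and skip generation when nothing remains. The threshold value `0.5` and the sample matches are made up for illustration; a suitable value would need tuning on real queries. Each match is assumed to carry `score` and `metadata` keys, as in the results returned by `query_pinecone`.

```python
def filter_relevant(matches, min_score=0.5):
    """Keep only matches whose similarity score reaches min_score (assumed threshold)."""
    return [m for m in matches if m["score"] >= min_score]

# made-up matches mimicking the shape of Pinecone query results
sample_matches = [
    {"id": "1", "score": 0.71, "metadata": {"passage_text": "relevant passage"}},
    {"id": "2", "score": 0.18, "metadata": {"passage_text": "unrelated passage"}},
]

relevant = filter_relevant(sample_matches)
if relevant:
    print([m["id"] for m in relevant])
else:
    print("No sufficiently relevant context found; skipping generation.")
```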

Let’s finish with a final few questions.

In [63]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

'The war of currents is a term used to refer to a series of events in the history of ocean currents. The most famous of these events is the Gulf Stream, which is the result of'

In [64]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

'The first person to walk on the moon was Neil Armstrong, who walked on the moon in 1969. He was the first person to walk on the moon, and he was the first person to'

In [65]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

'I\'m not sure what you mean by "most expensive project". The National Science Foundation has a budget of about $1.5 trillion, which is a lot of money, but it\'s'

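As we can see, the results are mixed. The moon landing answer is reasonable, but the other two are not: the "war of currents" refers to the rivalry between AC and DC electric power systems, not ocean currents, and the NASA question was not answered at all. With only 1,000 indexed passages, relevant context is often missing, and the generator fills the gap with confident-sounding text.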

#### Add a few more questions

In [70]:
question = "What is the history of the Eiffel Tower?"
context = query_pinecone(question, top_k=3)
context = format_query(question, context)
generate_answer(context)


'The Eiffel Tower was built in 1889 by the architect Emilio Zocchi. It was the tallest building in the world at the time, and was the tallest building in the world'