# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

Note: you may need to restart the kernel to use updated packages.


# Load and Prepare Dataset

Our source data is taken from the CNN/DailyMail dataset (version 3.0.0), a collection of news articles. Since indexing the entire dataset may take some time, this demo keeps at most 50,000 articles whose text contains the word "history". If you want, you may use the complete dataset; a Pinecone vector database can manage millions of documents.

In [2]:
from datasets import load_dataset

streamed_data = load_dataset("cnn_dailymail", "3.0.0", split="train", streaming=True).shuffle(seed=42)

history = streamed_data.filter(lambda example: "history" in example["article"].lower())

sample = next(iter(history))
print(sample["article"])





Whether grainy movie reels featuring marching armies of khaki-clad soldiers heading towards the front line or harrowing images that reveal the aftermath in bloody detail, World War One was among the first to be documented in photos and on film. Now a new exhibition is to combine rarely seen photos of men and women fighting in the Great War with a series of harrowing artworks that shed light on the human tragedy that ensued. The Great War in Portraits, which debuts at the National Portrait Gallery next month, also tells the stories of some of the most fascinating participants, among them a Russian female soldier, a British nurse executed by the Germans and the first Nepalese recipient of the Victoria Cross. The Gassed and Wounded: Eric Kennington's 1918 work was based on sketches made on the front line . Harrowing: Two of the portraits from Henry Tonks' series, Soldiers With Facial Wounds . Others, among them the striking Henry Tonks' series, Soldiers With Facial Wounds, document the ex

We load the dataset in streaming mode so that we don't have to wait for the whole dataset to download. Instead, records are downloaded iteratively as we consume them.
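To illustrate the lazy behaviour described above without touching the network, here is a minimal sketch. `fake_stream` is a hypothetical stand-in for the streamed dataset, not part of the `datasets` library; the point is only that filtering and taking the first few records never materializes the whole stream.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    topics = ["history", "sport", "history of art", "weather"]
    n = 0
    while True:  # endless on purpose: nothing is produced until requested
        yield {"article": f"Story {n} about {topics[n % len(topics)]}"}
        n += 1

# Lazily keep only records that mention "history", as the notebook's filter does
matching = (r for r in fake_stream() if "history" in r["article"].lower())

# Pull just the first three matches; the rest of the stream is never generated
first_three = list(islice(matching, 3))
print([r["article"] for r in first_three])
```

The same `islice` pattern could be applied to the filtered stream in this notebook as an alternative to a manual counter.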

In [3]:
# show the contents of a single document in the dataset
next(iter(history))

{'article': "Whether grainy movie reels featuring marching armies of khaki-clad soldiers heading towards the front line or harrowing images that reveal the aftermath in bloody detail, World War One was among the first to be documented in photos and on film. Now a new exhibition is to combine rarely seen photos of men and women fighting in the Great War with a series of harrowing artworks that shed light on the human tragedy that ensued. The Great War in Portraits, which debuts at the National Portrait Gallery next month, also tells the stories of some of the most fascinating participants, among them a Russian female soldier, a British nurse executed by the Germans and the first Nepalese recipient of the Victoria Cross. The Gassed and Wounded: Eric Kennington's 1918 work was based on sketches made on the front line . Harrowing: Two of the portraits from Henry Tonks' series, Soldiers With Facial Wounds . Others, among them the striking Henry Tonks' series, Soldiers With Facial Wounds, do

In [4]:
# Filter articles that mention "history" in the article text
history = streamed_data.filter(lambda example: "history" in example["article"].lower())



Let's iterate through the filtered stream and collect up to 50,000 history-related articles. CNN/DailyMail records have no `article_title` or `section_title` fields, so we fill those with placeholder values and store the article text as `passage_text`.

In [5]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 50000

counter = 0
docs = []

# Iterate through the filtered articles
for d in tqdm(history, total=total_doc_count):
    # cnn_dailymail doesn't have article_title or section_title
    article_title = "CNN/DailyMail Article"
    section_title = "History-related"
    passage_text = d["article"]  # this is the actual passage

    docs.append({
        "article_title": article_title,
        "section_title": section_title,
        "passage_text": passage_text
    })

    counter += 1
    if counter >= total_doc_count:
        print("Reached 50,000 documents.")
        break


 59%|█████▉    | 29446/50000 [00:26<00:18, 1128.05it/s]


In [6]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | CNN/DailyMail Article | History-related | Whether grainy movie reels featuring marching ... |
| 1 | CNN/DailyMail Article | History-related | THE FINAL SEASON by Nigel McCrery (Random Hous... |
| 2 | CNN/DailyMail Article | History-related | Phone companies have revealed the cost of usin... |
| 3 | CNN/DailyMail Article | History-related | A 14-year-old Florida boy had been bullied for... |
| 4 | CNN/DailyMail Article | History-related | Advertising magnate Lord Maurice Saatchi has s... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free [here](https://app.pinecone.io/). After that, we initialize the connection as follows:

In [7]:
pip install pinecone-client

Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [9]:
import os
from dotenv import load_dotenv

# Load keys from .env file
load_dotenv()

# Access the API key
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Fail early with a clear message if the key is missing
if not PINECONE_API_KEY:
    raise ValueError("❌ PINECONE_API_KEY not found. Check your .env file.")


In [10]:
from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)


In [11]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
# PINECONE_API_KEY was loaded from the .env file in the earlier cell
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we set up our index specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [12]:
from pinecone import ServerlessSpec

cloud = 'aws'
region = 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. We will name it "abstract", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [13]:
index_name = 'abstract' #give your index a meaningful name

In [14]:
from pinecone import ServerlessSpec
import time

# Define cloud + region — these must match your Pinecone project settings
cloud = 'aws'
region = 'us-east-1'

# Define spec
spec = ServerlessSpec(cloud=cloud, region=region)

# Use a valid index name
index_name = "abstract"

# Create index if it doesn't already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=spec
    )
    print(f"✅ Index '{index_name}' created.")
else:
    print(f"ℹ️ Index '{index_name}' already exists.")

# Get index object for use
index = pc.Index(index_name)


ℹ️ Index 'abstract' already exists.
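A freshly created index may need a moment before it accepts requests. The following is a minimal polling sketch; `wait_until` is a hypothetical helper written for illustration, and the commented Pinecone call assumes the client exposes `describe_index(...).status["ready"]` as in the client version pinned in this notebook.

```python
import time

def wait_until(is_ready, timeout_s=60.0, poll_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    return False

# Assumed usage with the objects defined in this notebook:
# wait_until(lambda: pc.describe_index(index_name).status["ready"])

# Self-contained demo: a fake readiness check that succeeds on the third call
calls = {"n": 0}

def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_until(fake_ready, timeout_s=5.0, poll_s=0.01)
print(ok, calls["n"])
```

Returning `False` on timeout, rather than raising, leaves it to the caller to decide whether a slow index is an error.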


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).
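As a small illustration of the similarity measure mentioned above, here is a plain-Python sketch of cosine similarity on toy two-dimensional vectors. The vectors are made up for the example; real embeddings from the retriever have 768 dimensions, and Pinecone computes the score for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
close_passage = [2.0, 0.0]      # same direction, different length
unrelated_passage = [0.0, 3.0]  # orthogonal direction

sim_close = cosine_similarity(query_vec, close_passage)
sim_unrelated = cosine_similarity(query_vec, unrelated_passage)
print(sim_close)      # 1.0
print(sim_unrelated)  # 0.0
```

Because only direction matters, a passage vector that points the same way as the query scores 1.0 regardless of its length.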

In [15]:
import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = 'flax-sentence-embeddings/all_datasets_v3_mpnet-base'

retriever = SentenceTransformer(model_name)
retriever = retriever.to(device)

print(f"Retriever loaded on: {device}")


Retriever loaded on: cpu


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches so that we can generate embeddings and upload them to the Pinecone index more quickly. When passing the documents to Pinecone, we need an id (a unique value), a context embedding, and metadata for each context passage in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, and passage text.
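The batching arithmetic can be sketched on its own, without a model or an index. `batched_ids` is a hypothetical helper for illustration; it mirrors the `id-<row number>` scheme used by the upsert loop in this notebook and shows that the last batch may be shorter than the others.

```python
def batched_ids(n_items, batch_size):
    """Yield (start, end, ids) for each batch of row positions."""
    for start in range(0, n_items, batch_size):
        end = min(start + batch_size, n_items)  # last batch may be short
        yield start, end, [f"id-{k}" for k in range(start, end)]

batches = list(batched_ids(n_items=10, batch_size=4))
for start, end, ids in batches:
    print(start, end, ids)
```

Deriving ids from the row position keeps them unique across batches, so re-running the upsert overwrites existing vectors instead of duplicating them.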

In [None]:
from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):

    i_end = min(i + batch_size, len(df))

    batch = df.iloc[i:i_end]

    emb = retriever.encode(batch["passage_text"].tolist()).tolist()

    meta = batch[["article_title", "section_title", "passage_text"]].to_dict(orient="records")
    ids = [f"id-{i+j}" for j in range(len(batch))] 

    to_upsert = list(zip(ids, emb, meta))

    index.upsert(vectors=to_upsert)

index.describe_index_stats() 


100%|██████████| 461/461 [1:14:14<00:00,  9.66s/it]


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 29440}},
 'total_vector_count': 29440}

# Initialize Generator

For the generator we will use ELI5 BART, a sequence-to-sequence model trained on the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and how the ELI5 BART model was trained is described [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [17]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects its input.

In [18]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode(query).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return xc

In [19]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the query and context passages
    query = f"question: {query} context: {context}"
    return query
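The passages in this notebook are full news articles, and the generator's tokenizer is later called with `max_length=1024`, so a long first passage can crowd out the rest of the context. One possible mitigation, sketched below under assumptions, is to trim each passage before concatenation. `trim_passage`, `format_query_trimmed` and the word budget are hypothetical, and a word count is only a rough proxy for the tokenizer's token count.

```python
def trim_passage(text, max_words=150):
    """Keep at most max_words whitespace-separated words of a passage."""
    words = text.split()
    return " ".join(words[:max_words])

def format_query_trimmed(query, matches, max_words=150):
    """Like format_query, but trims every passage to a word budget first."""
    context = " ".join(
        f"<P> {trim_passage(m['metadata']['passage_text'], max_words)}"
        for m in matches
    )
    return f"question: {query} context: {context}"

# Hand-written stand-ins for Pinecone matches (only the fields used here)
fake_matches = [
    {"metadata": {"passage_text": "one two three four five"}},
    {"metadata": {"passage_text": "six seven"}},
]
trimmed = format_query_trimmed("a question?", fake_matches, max_words=3)
print(trimmed)
```

With a per-passage budget, every retrieved passage contributes at least its opening sentences to the generator input.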

Let's test the helper functions. We will query the Pinecone index we created earlier with the `query_pinecone` function to get context passages and pass them to the `format_query` function.

In [20]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': 'id-3278',
              'metadata': {'article_title': 'CNN/DailyMail Article',
                           'passage_text': 'The City of Los Angeles has grown '
                                           'more than any major metropolitan '
                                           'city in America since the '
                                           'beginning of the twentieth '
                                           'century. In 1900, the city Angels '
                                           'had little over 100,000 people but '
                                           "it wasn't until people moved west, "
                                           'especially after World War II, the '
                                           'population in the Los Angeles area '
                                           'really exploded. In the late 1880s '
                                           'several small independent electric '
                              

In [21]:
from pprint import pprint

In [22]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('question: when was the first electric power system built? context: <P> The '
 'City of Los Angeles has grown more than any major metropolitan city in '
 'America since the beginning of the twentieth century. In 1900, the city '
 "Angels had little over 100,000 people but it wasn't until people moved west, "
 'especially after World War II, the population in the Los Angeles area really '
 'exploded. In the late 1880s several small independent electric companies '
 'worked to bring power to Southern California. In 1897, West Side Lighting '
 'Co. and Los Angeles Electric Co. merged to form Edison Electric Co. of Los '
 'Angeles, . As electricity expanded it also played a vital role in creating '
 'and expanding the infrastructure. Edison Company photographers also '
 'documented the process, leaving a vast archive of photos that reveal the '
 'interiors of businesses, restaurants, nightclubs,hotels and other '
 'architectural gems of early Los Angeles. Organised by William . Deverell, 

The output looks great. Now let's write a function to generate answers.

In [23]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating to at most 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer (pprint returns None, so the function returns nothing)
    pprint(answer)

In [24]:
generate_answer(query)



('The first electric power system in the US was built in the late 19th '
 'century. The first electric power lines were built in California in the late '
 '1880s. The first electric power lines were')


As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [25]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by a telegraph. The first telegraph was '
 'sent in 1844, and the first wireless message was sent in 1845. The first '
 'wireless message was sent')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [26]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

(CNN Student News) -- May 24, 2010 . Download PDF maps related to today's show: . • Panama City, Florida • Shanghai & Beijing, China • West Point, New York . Transcript . THIS IS A RUSH TRANSCRIPT. THIS COPY MAY NOT BE IN ITS FINAL FORM AND MAY BE UPDATED. CARL AZUZ, CNN STUDENT NEWS ANCHOR: May 24th, 1844: The first telegraph message was sent. May 24th, 2010: You can watch CNN Student News on TV, online and on iTunes. A lot changes over 150 years. I'm Carl Azuz. Let's do this. First Up: Gulf Coast Oil Spill . AZUZ: Construction delays, permit problems, the threat of hurricanes: Officials in Panama City, Florida have dealt with a lot of problems to open up a new airport. Here's a new one: a massive oil spill in the Gulf of Mexico. We've talked about some of the different industries that this thing is affecting. Tourism is a big one, especially in Panama City. This new airport is expected to bring in millions of dollars. The problem is, people are worried about how clean the water is, a

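In this case, the 1844 date for the first telegraph message is supported by the retrieved passage, although the claim about a wireless message in 1845 does not appear in the portion of the context shown above. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19: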

In [27]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("It's not clear where COVID-19 came from, but it's possible that it's "
 'descended from the 1918 Spanish Flu. The 1918 Spanish Flu pandemic killed up '
 'to 80 million people')


In [28]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

By . Anna Hodgekiss . PUBLISHED: . 11:30 EST, 1 May 2013 . | . UPDATED: . 11:51 EST, 1 May 2013 . A deadly bird flu virus sweeping through China has taken the first steps towards becoming a global threat to humans, experts have revealed. In the space of one month, the avian strain known as H7N9 has spread through all 31 Chinese provinces and claimed 125 victims, killing a fifth of those infected. Scientists say it is mutating rapidly and already has two of five genetic changes believed to be necessary for human-to-human transmission. Experts speaking in London today said . there was no room for complacency over H7N9, and warned against the . mistake of assuming it was a far-away foreign problem. A Chinese tourist wears a face mask in front of a portrait of leader Sun Yat-sen at Tiananmen Square in Beijing. Scientists say it is mutating rapidly and becoming more of a threat to humans . GPs have been sent letters advising them on how to identify cases and what action to take if one is su
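One way to guard against this failure mode is to skip generation when none of the retrieved passages is similar enough to the question. The sketch below assumes each match carries a `score` field, as Pinecone query results do. The `filter_by_score` helper, the 0.5 threshold and the sample matches are hypothetical; a sensible threshold would have to be chosen by inspecting real scores.

```python
def filter_by_score(matches, min_score=0.5):
    """Keep only matches whose similarity score reaches min_score."""
    return [m for m in matches if m["score"] >= min_score]

# Hand-written stand-ins for Pinecone matches (scores are made up)
fake_matches = [
    {"id": "id-1", "score": 0.71, "metadata": {"passage_text": "relevant text"}},
    {"id": "id-2", "score": 0.18, "metadata": {"passage_text": "unrelated text"}},
]

kept = filter_by_score(fake_matches, min_score=0.5)
if kept:
    print([m["id"] for m in kept])
else:
    print("No sufficiently similar context found; skipping generation.")
```

Declining to answer is usually preferable to generating from unrelated context, since the generator will still produce fluent text.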

Let’s finish with a final few questions.

In [29]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The War of Currents was a series of conflicts that took place in the early '
 '20th century, and was the culmination of a series of conflicts that had been '
 'going on for centuries. The')


In [30]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first person to walk on the moon was Neil Armstrong, who walked on the '
 'moon on July 20, 1969.')


In [31]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of the US '
 'government. It cost about $10 billion to build.')


As we can see, the model can generate some decent answers.

#### Add a few more questions

In [32]:
query = "When the wheel was first invented?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The wheel was first invented in the early 19th century. The wheel was '
 'invented by a man named Charles Wicksteed, who was an engineer by trade. The '
 'wheel was first used in')


In [33]:
query = "What is the highest mounten on earth?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The tallest mountain on Earth is Mount Everest, which is about 6,000 feet '
 'tall. The tallest mountain on the Moon is the highest mountain on the Moon, '
 'which is about 6,000')
