# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch


# Load and Prepare Dataset

Our source data is taken from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Indexing the entire dataset takes some time, so this demo uses only passages whose `section_title` is "History": up to 50,000 of them in the full demo, and 1,000 in the run recorded below (see `total_doc_count`). You can use the complete dataset if you want; the Pinecone vector database is built to manage millions of documents.

In [2]:
pip install ipywidgets


In [8]:
pip install --upgrade datasets pyarrow


In [4]:
from datasets import load_dataset

wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True,
    trust_remote_code=True
)

We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9 GB) to download. Instead, records are downloaded iteratively as we consume them.

In [6]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7593707',
 'start_paragraph': 2,
 'start_character': 0,
 'end_paragraph': 6,
 'end_character': 511,
 'article_title': "St John the Baptist's Church, Atherton",
 'section_title': 'History',
 'passage_text': "St John the Baptist's Church, Atherton History There have been three chapels or churches on the site of St John the Baptist parish church. The first chapel at Chowbent was built in 1645 by John Atherton as a chapel of ease of Leigh Parish Church. It was sometimes referred to as the Old Bent Chapel. It was not consecrated and used by the Presbyterians as well as the Vicar of Leigh. In 1721 Lord of the manor Richard Atherton expelled the dissenters who subsequently built Chowbent Chapel. The first chapel was consecrated in 1723 by the Bishop of Sodor and"}

In [7]:
# keep only documents whose section_title is "History"
history = wiki_data.filter(lambda x: x["section_title"] == "History")

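Let's iterate through the dataset and apply our filter to select historical passages, stopping once we have collected `total_doc_count` of them (1,000 in this run; raise it to 50,000 for the full demo). We extract `article_title`, `section_title` and `passage_text` from each document.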

In [None]:
from tqdm import tqdm

total_doc_count = 1000  # number of History passages to collect
counter = 0
docs = []

for d in tqdm(history):
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage": d["passage_text"]
    })
    counter += 1
    if counter >= total_doc_count:
        break

999it [00:19, 51.16it/s] 
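
The counter-and-break pattern above can also be expressed with `itertools.islice`, which stops pulling from a lazy iterator after a fixed number of items. The following is a minimal sketch on a stand-in generator (`fake_stream` and `take_history` are hypothetical names, and the generator only imitates the shape of the streamed records), so it runs without downloading anything:

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for the streamed dataset: yields records lazily."""
    sections = ["History", "Geography", "History", "Career"]
    n = 0
    while True:  # endless, like a dataset too large to materialize
        yield {
            "article_title": f"Article {n}",
            "section_title": sections[n % len(sections)],
            "passage_text": f"Passage number {n}",
        }
        n += 1

def take_history(stream, limit):
    """Lazily keep only History records and stop after `limit` of them."""
    history_only = (d for d in stream if d["section_title"] == "History")
    return [
        {
            "article_title": d["article_title"],
            "section_title": d["section_title"],
            "passage": d["passage_text"],
        }
        for d in islice(history_only, limit)
    ]

sample_docs = take_history(fake_stream(), limit=5)
print(len(sample_docs), sample_docs[0]["article_title"])
```

Because both the filter and `islice` are lazy, no record beyond the last one needed is ever requested from the stream.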


In [9]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage |
|---|---|---|---|
| 0 | St John the Baptist's Church, Atherton | History | St John the Baptist's Church, Atherton History... |
| 1 | St John the Baptist's Church, Atherton | History | Man.\nThe first chapel was replaced by a new S... |
| 2 | Star Music | History | Star Music History Star Music was founded in F... |
| 3 | Star Music | History | in order to strengthen its production base and... |
| 4 | Star Music | History | market. By December of the same year, the labe... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free at [app.pinecone.io](https://app.pinecone.io/); after that, we initialize the connection as follows:

In [26]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)
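
Note that the cell above falls back to the literal string `'PINECONE_API_KEY'` when the environment variable is unset, so a missing key only shows up later as an authentication error. A minimal sketch of failing early instead, using a hypothetical helper `require_env` (not part of the Pinecone client) and a dummy variable for the demonstration:

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# demonstration with a dummy variable (not a real key)
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could be called as `require_env("PINECONE_API_KEY")` before constructing the client.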

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [27]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

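Now we create a new index. We name it "historical-passages" here, but you can name it anything you want. We specify the metric as "cosine" and the dimension as 1536, because the index dimension must match the length of the vectors we upsert, and the `text-embedding-ada-002` embeddings generated later in this notebook are 1536-dimensional. (The MPNet retriever loaded below outputs 768-dimensional vectors, so an index built for it would use 768 instead.)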

In [12]:
index_name = "historical-passages" #give your index a meaningful name

In [None]:
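import os
import time

from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

# read PINECONE_API_KEY from a local .env file, if present
load_dotenv()
api_key = os.getenv("PINECONE_API_KEY")

pc = Pinecone(api_key=api_key)

index_name = "historical-passages"

# start from a clean slate: drop the index if it already exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    print(f"Index '{index_name}' deleted.")

# the dimension must match the embedding model used below
# (text-embedding-ada-002 returns 1536-dimensional vectors)
pc.create_index(
    name=index_name,
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
print(f"Index '{index_name}' created.")

# wait until the new index is ready to accept requests
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

# connect to the index and confirm that it is empty
index = pc.Index(index_name)
print(index.describe_index_stats())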

Index 'historical-passages' deleted.
Index 'historical-passages' created.
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
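
The stats above report a dimension of 1536, and every vector sent to the index has to have exactly that length. As a rough illustration, the check below validates a batch locally before it is uploaded; `check_dimension` is a hypothetical helper, not a Pinecone function, and the vectors are dummy values:

```python
def check_dimension(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return True

INDEX_DIM = 1536  # dimension reported by describe_index_stats() above
ok_batch = [[0.0] * INDEX_DIM, [0.1] * INDEX_DIM]
print(check_dimension(ok_batch, INDEX_DIM))
```

A 768-dimensional vector, such as one produced by an MPNet-based SentenceTransformer, would fail this check against a 1536-dimensional index.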


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever creates embeddings such that questions and the passages that answer them lie close to one another in the vector space. We load a SentenceTransformer model based on Microsoft's MPNet as our retriever; it performs well at comparing queries and documents and outputs 768-dimensional vectors. Query and context vectors are compared with cosine similarity (Pinecone computes this for us). Note that the embedding and query cells later in this notebook call OpenAI's `text-embedding-ada-002` through `get_embedding` rather than this retriever, which is why the index was created with dimension 1536.
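
For intuition, cosine similarity and a top-k lookup can be sketched in a few lines of plain Python. This is only an illustration on made-up three-dimensional vectors; Pinecone performs the real search server-side over the full index, and the names `cosine_similarity`, `top_k` and `toy_passages` are assumptions made for this example:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, passages, k=2):
    """Return the ids of the k passages most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in passages.items()]
    scored.sort(reverse=True)
    return [(pid, round(score, 3)) for score, pid in scored[:k]]

# made-up embeddings: p1 and p2 point in nearly the same direction as the query
toy_passages = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_passages, k=2))
```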

In [None]:
import torch
from sentence_transformers import SentenceTransformer


device = 'cuda' if torch.cuda.is_available() else 'cpu'


retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base")
retriever = retriever.to(device)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.

In [15]:
print(df.columns)

Index(['article_title', 'section_title', 'passage'], dtype='object')


In [16]:
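import openai  # pre-1.0 interface (openai==0.28, installed in the next cell)

def get_embedding(text, model="text-embedding-ada-002"):
    # request one embedding for a single text;
    # the client reads the API key from the OPENAI_API_KEY environment variable
    response = openai.Embedding.create(
        input=text,
        model=model
    )
    return response['data'][0]['embedding']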

In [17]:
pip install openai==0.28


In [None]:
from tqdm import tqdm

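batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find the end of the current batch
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end]

    # embed each passage in the batch (one API request per passage)
    embeddings = [get_embedding(text) for text in batch["passage"]]

    # use the dataframe rows as metadata
    metadata = batch.to_dict(orient="records")

    # upsert an id, the embedding and the metadata for each passage
    index.upsert([
        {
            "id": f"{i + j}",
            "values": emb,
            "metadata": meta
        }
        for j, (emb, meta) in enumerate(zip(embeddings, metadata))
    ])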

100%|██████████| 16/16 [07:54<00:00, 29.66s/it]
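
The loop above sends one embedding request per passage, which accounts for most of the roughly 30 seconds per batch. The pre-1.0 `openai` client also accepts a list of texts as `input`, so a whole batch can be embedded in one request. The sketch below assumes that interface; `embed_batch` is a hypothetical helper, and `fake_create` is a stub standing in for `openai.Embedding.create` so the example runs without an API key:

```python
def embed_batch(texts, create_fn, model="text-embedding-ada-002"):
    """Embed a list of texts with a single request.

    `create_fn` is expected to behave like `openai.Embedding.create` from the
    pre-1.0 client: it accepts a list as `input` and returns a response whose
    `data` entries carry an `index` and an `embedding`.
    """
    response = create_fn(input=list(texts), model=model)
    # sort by index so the embeddings line up with the input order
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]

def fake_create(input, model):
    """Hypothetical stub so the sketch runs without an API key or network."""
    data = [{"index": i, "embedding": [float(len(t)), 0.0]} for i, t in enumerate(input)]
    return {"data": list(reversed(data))}  # deliberately out of order

vectors = embed_batch(["first passage", "second"], create_fn=fake_create)
print(vectors)
```

In the notebook, the call would look like `embed_batch(batch["passage"].tolist(), openai.Embedding.create)` in place of the per-passage list comprehension.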


# Initialize Generator

For the generator we use ELI5 BART, a sequence-to-sequence model trained on the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string will look as follows:

>question: What is a sonic boom? context: <P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. <P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. <P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.
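
The format above can be assembled with simple string joining. The following is a minimal sketch; `build_lfqa_input` is a hypothetical helper written for this illustration, separate from the helper functions defined in this notebook:

```python
def build_lfqa_input(question, passages):
    """Join a question and its context passages in the format shown above."""
    context = " ".join(f"<P> {p}" for p in passages)
    return f"question: {question} context: {context}"

example = build_lfqa_input(
    "What is a sonic boom?",
    [
        "A sonic boom is a sound associated with shock waves.",
        "Sonic booms generate enormous amounts of sound energy.",
    ],
)
print(example)
```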

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [35]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects its input.

In [36]:
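def query_pinecone(query, top_k=5):
    # generate an embedding for the query
    xq = get_embedding(query)
    # search the Pinecone index for the most similar context passages
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return xc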

In [45]:
def format_query(query, context):
    context = [f"<P> {m['metadata']['passage']}" for m in context]
    context = "\n".join(context)
    query = query + "\n" + context
    return query

Let's test the helper functions. We query the Pinecone index we created earlier with `query_pinecone` to get context passages and pass them to the `format_query` function.

In [46]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '952',
              'metadata': {'article_title': 'Holy Family Parish, Pittsfield',
                           'passage': 'insufficient to meet the needs of '
                                      'growing populations and plans to build '
                                      'a new church began. Nearby a disused '
                                      'power generation station for the '
                                      'Pittsfield Electric Street Railway '
                                      'system was for sale. It had been closed '
                                      'on August 11, 1912 after only slightly '
                                      'more than 5-1/2 years of service. The '
                                      'new generation station on East Street '
                                      'made the Seymour Street plant surplus. '
                                      'After consulting with parishioners, Fr. '
                                

In [48]:
from pprint import pprint
pprint(result["matches"][0])

{'id': '952',
 'metadata': {'article_title': 'Holy Family Parish, Pittsfield',
              'passage': 'insufficient to meet the needs of growing '
                         'populations and plans to build a new church began. '
                         'Nearby a disused power generation station for the '
                         'Pittsfield Electric Street Railway system was for '
                         'sale. It had been closed on August 11, 1912 after '
                         'only slightly more than 5-1/2 years of service. The '
                         'new generation station on East Street made the '
                         'Seymour Street plant surplus. After consulting with '
                         'parishioners, Fr. Stanczyk purchased the building on '
                         'Seymour Street for $8500. Parishioners willingly '
                         'undertook the soliciting of funds to pay for '
                         'acquiring the building and the costs of '
    

In [49]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('when was the first electric power system built?\n'
 '<P> insufficient to meet the needs of growing populations and plans to build '
 'a new church began. Nearby a disused power generation station for the '
 'Pittsfield Electric Street Railway system was for sale. It had been closed '
 'on August 11, 1912 after only slightly more than 5-1/2 years of service. The '
 'new generation station on East Street made the Seymour Street plant surplus. '
 'After consulting with parishioners, Fr. Stanczyk purchased the building on '
 'Seymour Street for $8500. Parishioners willingly undertook the soliciting of '
 'funds to pay for acquiring the building and the costs of re-construction and '
 'renovation into a church edifice.')


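The query and its context are now combined into a single string. Note that `format_query` joins them with newlines and does not add the `question:` and `context:` prefixes shown in the example input earlier, which may affect answer quality. Now let's write a function to generate answers.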

In [50]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer; pprint returns None, so this function returns None as well
    return pprint(answer)

In [51]:
generate_answer(query)

('The first electric power system was built in the early 1900s in the United '
 'States. The first electric power system was built in New York City in 1906. '
 'The first electric power system was built')


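The generator returns a fluent answer, but compare it with the context: the retrieved passage is about a church in Pittsfield and never says when the first electric power system was built, so the dates in the answer are not grounded in the retrieved text. With only 1,000 passages indexed, a relevant passage is often simply not available. Let's run some more queries.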

In [52]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent in the early 1900s. The first wireless '
 'message was sent in the early 1900s. The first wireless message was sent in '
 'the early 1900s. The first')


To check whether this answer is supported, we can inspect the contexts used to generate it.

In [54]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

Government Association then funded the operation of the station.
On April 25, 1971, WEGL Radio signed on the air with 10 watts of power and began broadcasting at 91.1 megahertz (MHz), as assigned by the FCC. The first song broadcast was  "Another Day" by Paul McCartney. The first WEGL studio was located in  room 1239 of Haley Center. After one year of operation, a student committee submitted a proposal to the Auburn University Board of Student Communication requesting a power increase. With the support of the University’s President, WEGL’s effective radiated power (ERP) increased to 380 watts in 1975.
---
WEGL History WEGL was not the first radio station at Auburn University. In 1922, WMAV began broadcasting from Broun Hall with a 1,500 watt homemade transmitter. It became part of the University’s Extension Service and received a new name, WAPI-AM (WAPI) (for the school’s name at the time: Alabama Polytechnic Institute.) WAPI was later moved to Birmingham, Alabama.
On June 1, 1970, the

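In this case, the passages shown are about a university radio station and do not describe the first wireless message, so the answer sounds plausible but is not grounded in the context. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19: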

In [55]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('COVID-19 is a strain of the bacterium SARS-CoV. It is a strain of the '
 'bacterium SARS-CoV, which is a strain of the bacter')


In [57]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

Basidiobolus ranarum History In 1886, the fungus was first isolated from the dung and intestinal contents of frogs by Eidam. In 1927, it was found in the intestines of toads, slowworms, and salamanders by Levisohn. In 1956, Joe et al. reported and described the first four cases of zygomycosis in Indonesia. Since then, hundreds of the cases of this infection have been reported. In 1955, Drechsler isolated it from decaying plants material in North America. In 1971, it was first isolated by Nickerson and Hutchison from aquatic animals, suggesting that B. ranarum can survive in a wild range of ecological
---
to develop the concept. Initial financial support came in 1997 from Belgian government foreign development aid funds. In 2000 it moved its training and headquarters to Sokoine University of Agriculture (SUA) in Morogoro, Tanzania, partnering with the Tanzanian People's Defence Force.
In 2003 APOPO was awarded a grant from the World Bank, which provided seed funding to research another 
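
One way to guard against this failure mode is to drop low-scoring matches before calling the generator and to skip generation when nothing relevant remains. The sketch below works on plain dictionaries shaped like the entries of a query response (each match carries a similarity `score`); `filter_matches` is a hypothetical helper, and the 0.8 threshold is an arbitrary assumption that would need tuning for a real embedding model:

```python
def filter_matches(matches, min_score=0.8):
    """Keep only matches whose similarity score reaches `min_score`."""
    return [m for m in matches if m["score"] >= min_score]

# hypothetical matches shaped like the entries of a query response
toy_matches = [
    {"id": "12", "score": 0.86, "metadata": {"passage": "relevant passage"}},
    {"id": "40", "score": 0.71, "metadata": {"passage": "unrelated passage"}},
]

kept = filter_matches(toy_matches, min_score=0.8)
if kept:
    print([m["id"] for m in kept])
else:
    print("No sufficiently similar context found; skipping generation.")
```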

Let’s finish with a final few questions.

In [58]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The war of currents was not a war of currents, it was a war of currents '
 'against the prevailing winds. The prevailing winds were the prevailing winds '
 'of the day, and the prevailing winds were')


In [59]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the moon was Neil Armstrong in 1969. He was the '
 'first man to walk on the moon.')


In [60]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of the US '
 'government. It cost about $100 billion to build.')


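As we can see, the quality varies: the moon-landing answer is decent, while the "war of currents" answer is nonsensical. Specific figures, such as the cost of the Space Shuttle, should be checked against the retrieved contexts before being trusted.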

#### Add a few more questions

In [62]:
questions = [
    "What was NASA's most expensive project?",
    "When was the Holy Family Parish church purchased?",
    "Why was the Seymour Street plant considered surplus?",
    "What caused the Pittsfield Electric Street Railway to close?",
    "Who was Fr. Stanczyk and what did he do?"
]

for q in questions:
    print(f"Q: {q}")
    context = query_pinecone(q, top_k=3)
    formatted_query = format_query(q, context["matches"])
    # generate_answer pretty-prints the answer itself and returns None,
    # so call it directly rather than printing its return value
    generate_answer(formatted_query)
    print("-" * 50)

Q: What was NASA's most expensive project?
("The Space Shuttle was the most expensive project in NASA's history. It cost "
 '$2.5 billion to build.')
--------------------------------------------------
Q: When was the Holy Family Parish church purchased?
('The church was purchased by the city of Pittsfield in 1894. The city of '
 'Pittsfield was founded in 1867 and the city of Pittsfield was incorporated '
 'in 1871')
--------------------------------------------------
Q: Why was the Seymour Street plant considered surplus?
('The Seymour Street plant was closed in 1912 after only slightly more than '
 '5-1/2 years of service. It had been closed on August 11, 1912 after only '
 'slightly more than 5-')
--------------------------------------------------
Q: What caused the Pittsfield Electric Street Railway to close?
("I'm not sure if this is what you're looking for, but I can tell you that the "
 'Pittsfield Electric Street Railway closed in 1958. It had been closed