# LAB | Abstractive Question Answering

Abstractive question answering focuses on generating multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthesize an answer. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch


# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset would take some time, this demo scans only the first 10,000 streamed passages and keeps those that include "history" in the `section_title` column. If you want, you can use the complete dataset; the Pinecone vector database can manage millions of documents.

In [2]:
from datasets import load_dataset

# load the dataset from Hugging Face in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True,
    trust_remote_code=True
).shuffle(seed=960)



We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9 GB) to download. Instead, we iteratively download records one at a time.
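To make the idea of lazy, record-at-a-time iteration concrete, here is a minimal sketch that uses a plain Python generator as a stand-in for the streamed dataset. The `record_stream` function and its fields are hypothetical and only imitate the shape of the real records; nothing here touches Hugging Face.

```python
from itertools import islice

def record_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    n = 0
    while True:  # conceptually unbounded, like a very large dataset
        title = "History" if n % 3 == 0 else "Geography"
        yield {"id": n, "section_title": title}
        n += 1

# islice pulls only the first 10 records; the rest are never produced
first_ten = list(islice(record_stream(), 10))

# apply the same kind of case-insensitive filter used in this notebook
history_only = [r["id"] for r in first_ten if "history" in r["section_title"].lower()]
print(history_only)
```

The same pattern (an iterator wrapped in `islice`) is what lets the notebook inspect a sample of the data without downloading all of it.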

In [3]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [4]:
from itertools import islice

history_docs = []
sample_size = 10000  # Limit to the first 10,000 documents to reduce processing time

for doc in islice(wiki_data, sample_size):
    section = doc.get("section_title", "")
    if "history" in section.lower():  # Check if "history" is in the section title (case-insensitive)
        history_docs.append(doc)

print(f"\n✅ Found {len(history_docs)} documents with 'history' in section_title.")



✅ Found 815 documents with 'history' in section_title.


In [5]:
for i, doc in enumerate(history_docs[:3]):
    print(f"\n--- History Doc #{i+1} ---")
    print(doc.get("passage_text", "No passage text available"))




--- History Doc #1 ---
1768. Prior to this, the only known visit by Europeans to an area where S. spinulosa occurs was the voyage of Dutch mariner Willem de Vlamingh, who explored Rottnest Island and the Swan River in December 1696 and January 1697 respectively. It is therefore very likely, but not proven, that the specimen was collected during that voyage, and thus predates by nearly three years the oldest authenticated collection of Australian plants, that made by William Dampier in 1699. It is known that Dutch botanist Nicolaas Witsen asked Vlamingh to collect plants for him during the voyage, and it is recorded that

--- History Doc #2 ---
was not until the 1950s that the region started to develop, with forestry and the construction of the Wairakei geothermal power station.

--- History Doc #3 ---
Sutarfeni History strand-like pheni were Phenakas mentioned in various indian texts. Phenakas is a broad term which includes various dishes prepared by using layered fried dough. Vijayan

Let's iterate through the filtered passages and extract `article_title`, `section_title` and `passage_text` from each document. The 50,000 figure in the code is only an upper cap; the filter above found 815 passages.

In [6]:
from tqdm.auto import tqdm  # progress bar

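total_doc_count = 50000  # upper cap on how many documents to keep
counter = 0
docs = []

# iterate through the filtered history documents
# (size the progress bar to the number of documents actually available)
for d in tqdm(history_docs, total=min(len(history_docs), total_doc_count)):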
    # extract the fields we need
    article_title = d.get("article_title", "")
    section_title = d.get("section_title", "")
    passage_text = d.get("passage_text", "")

    # store the result in a new list
    docs.append({
        "article_title": article_title,
        "section_title": section_title,
        "passage_text": passage_text
    })

    counter += 1

    # break early if we hit the limit
    if counter >= total_doc_count:
        break

print(f"✅ Collected {len(docs)} documents from history sections.")



✅ Collected 815 documents from history sections.


In [7]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | Synaphea spinulosa | Taxonomic history | 1768. Prior to this, the only known visit by E... |
| 1 | Taupo District | History | was not until the 1950s that the region starte... |
| 2 | Sutarfeni | History & Western asian analogues | Sutarfeni History strand-like pheni were Phena... |
| 3 | The Bishop Wand Church of England School | History | The Bishop Wand Church of England School Histo... |
| 4 | Teufelsmoor | History & Situation today | made to preserve the original landscape, altho... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free at [app.pinecone.io](https://app.pinecone.io/). We then initialize the connection as follows:

In [44]:
import os
from pinecone import Pinecone

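# initialize connection to pinecone (get API key at app.pinecone.io)
# read the key from the PINECONE_API_KEY environment variable;
# the fallback string is only a placeholder to replace with your own key
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'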

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [45]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. The code below names it "quickstart", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [28]:
pip install pinecone

Collecting pinecone
  Downloading pinecone-6.0.2-py3-none-any.whl.metadata (9.0 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-6.0.2-py3-none-any.whl (421 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone
Successfully installed pinecone-6.0.2 pinecone-plugin-interface-0.0.7


In [34]:
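import os
from pinecone import Pinecone, ServerlessSpec

# never hard-code a real API key in a notebook; read it from the environment instead
# (the fallback string is only a placeholder)
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY") or "PINECONE_API_KEY")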

In [35]:
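index_name = "quickstart"

# create the index only if it does not exist yet
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # must match the retriever's embedding size
        metric="cosine",  # the retriever is optimized for cosine similarity
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )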

In [46]:
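import os

# `!export` runs in a temporary subshell, so the variable would not reach
# this notebook's Python process; set it through os.environ instead
# (replace the placeholder with your own key)
os.environ["PINECONE_API_KEY"] = "my key"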


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).
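As a small illustration of the similarity measure mentioned above, the following sketch computes cosine similarity by hand for toy 2-dimensional vectors. The vectors are made up for the example (real retriever embeddings have hundreds of dimensions), and Pinecone performs this comparison for us on the server side.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
close_passage = [2.0, 0.0]  # same direction as the query, different length
far_passage = [0.0, 3.0]    # orthogonal to the query

sim_close = cosine_similarity(query_vec, close_passage)
sim_far = cosine_similarity(query_vec, far_passage)
print(sim_close, sim_far)
```

Because the measure depends only on direction, the longer `close_passage` vector still scores as a perfect match, while the orthogonal vector scores zero.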

In [38]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from the Hugging Face model hub
retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base', device=device)
retriever

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.
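The batching scheme described above can be sketched without any external service. In this sketch, `fake_embed` is a hypothetical stand-in for the retriever and `build_batches` is an illustrative helper, not part of Pinecone or sentence-transformers; the point is only the shape of the `(id, embedding, metadata)` tuples and how ids stay unique across batches.

```python
def fake_embed(texts):
    """Hypothetical embedder: maps each text to a tiny 2-number vector."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def build_batches(records, batch_size):
    """Yield lists of (id, vector, metadata) tuples, one list per batch."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        vectors = fake_embed([r["passage_text"] for r in batch])
        # ids are derived from the global position, so they never collide
        ids = [str(start + offset) for offset in range(len(batch))]
        yield list(zip(ids, vectors, batch))

records = [
    {"article_title": f"Article {n}", "section_title": "History",
     "passage_text": f"passage number {n}"}
    for n in range(5)
]
batches = list(build_batches(records, batch_size=2))
print([len(b) for b in batches])
```

With five records and a batch size of two, the last batch is simply shorter; each batch is what a single upsert call would receive.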

In [None]:
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

# Load the retriever model (it outputs 768-dimensional embeddings)
retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base', device=device)

# Connect to the index through the `pc` client created earlier
# (the legacy pinecone.init(...) call was removed from current Pinecone clients)
index = pc.Index("quickstart")

# `df` is the DataFrame of history passages built earlier in this notebook

# Use batches of 64 (or fewer if you have a small dataset)
batch_size = 64

# Loop through batches and upsert to Pinecone
for i in tqdm(range(0, len(df), batch_size)):
    end = i + batch_size
    batch = df.iloc[i:end]
    texts = batch["passage_text"].tolist()
    metadatas = batch[["article_title", "section_title", "passage_text"]].to_dict(orient="records")
    ids = [f"id-{i+j}" for j in range(len(batch))]

    # Generate embeddings
    embeds = retriever.encode(texts).tolist()

    # Upsert (id, embedding, metadata) tuples into Pinecone
    to_upsert = list(zip(ids, embeds, metadatas))
    index.upsert(vectors=to_upsert)


# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).
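Because the generator receives one concatenated string, long or numerous passages can exceed its input limit. The sketch below shows one way to add passages behind `<P>` markers until a budget is used up. The `build_input` helper and the whitespace word count are assumptions made for illustration; a real implementation would count tokens with the generator's tokenizer.

```python
def build_input(question, passages, max_words=50):
    """Concatenate passages behind <P> markers within an approximate word budget."""
    parts = [f"question: {question} context:"]
    used = len(parts[0].split())
    for passage in passages:
        cost = len(passage.split()) + 1  # +1 for the <P> marker
        if used + cost > max_words:
            break  # stop before overflowing the budget
        parts.append(f"<P> {passage}")
        used += cost
    return " ".join(parts)

passages = ["one two three", "four five six seven", "eight nine"]
short = build_input("Why?", passages, max_words=8)
full = build_input("Why?", passages, max_words=50)
print(short)
print(full)
```

Passages are assumed to arrive in relevance order, so dropping from the end discards the least relevant context first.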

Let's initialize the BART model using transformers.

In [48]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)


All the components of our abstractive QA system are now in place. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query the way the generator expects.

In [24]:
def query_pinecone(query, top_k):
    # generate an embedding for the query as a flat list of floats
    xq = retriever.encode(query).tolist()
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return xc

In [25]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    query = f"question: {query} context: {context}"
    return query

Let's test the helper functions. We will query the Pinecone index we created earlier with `query_pinecone` to get context passages and pass them to the `format_query` function.

In [27]:
def query_pinecone(query, top_k=1):
    # Step 1: Embed the query with the same retriever that embedded the passages,
    # so that query and passage vectors live in the same space
    xq = retriever.encode(query).tolist()  # flat list of floats

    # Step 2: Perform the query on the Pinecone index
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)

    return xc

# Example query
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)

# Display the result
print(result)


{'matches': [], 'namespace': '', 'usage': {'read_units': 1}}


In [19]:
!pip uninstall pinecone-client -y
!pip install pinecone


Found existing installation: pinecone-client 6.0.0
Uninstalling pinecone-client-6.0.0:
  Successfully uninstalled pinecone-client-6.0.0
Collecting pinecone
  Downloading pinecone-6.0.2-py3-none-any.whl.metadata (9.0 kB)
Downloading pinecone-6.0.2-py3-none-any.whl (421 kB)
Installing collected packages: pinecone
Successfully installed pinecone-6.0.2


In [21]:
import os
from pinecone import Pinecone

# === Step 1: API key (read from the environment; the fallback is a placeholder) ===
api_key = os.environ.get("PINECONE_API_KEY") or "my key"

# === Step 2: Initialize Pinecone using the class-based client ===
# (the `environment` value was never used, so the lookup against the
# legacy controller endpoint is dropped)
pc = Pinecone(api_key=api_key)

# === Step 3: Connect to the index ===
index = pc.Index("quickstart")

# === Step 4: Reuse the embedding model that embedded the passages ===
model = retriever

print("✅ Pinecone initialized and model loaded.")


✅ Pinecone initialized and model loaded.


In [22]:
from pprint import pprint

In [28]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

'question: when was the first electric power system built? context: '


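The context is empty because the query above returned no matches, so check that the upsert step has populated the index before relying on the output. Now let's write a function to generate answers.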

In [29]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # return the text itself (pprint() prints and returns None)
    return answer

In [31]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Step 1: Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('t5-small')  # or any model you're using
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')  # or any model you're using

# Step 2: Answer generation function
def generate_answer(query):
    # Tokenize the query to get input_ids
    inputs = tokenizer([query], max_length=1024, return_tensors="pt", padding=True, truncation=True)

    # Use the model to generate answer
    ids = model.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)

    # Decode the generated ids to text
    answer = tokenizer.decode(ids[0], skip_special_tokens=True)
    return answer

# Step 3: Test the function
query = "when was the first electric power system built?"
answer = generate_answer(query)
print(answer)



Wann wurde das erste elektrische Stromsystem gebaut? Wann wurde das erste elektrische Stromsystem gebaut?


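The output is a German rendering of the question rather than an answer. This cell swaps in `t5-small` for the ELI5 BART generator and passes the bare question without any retrieved context, so it does not exercise the full pipeline. Let's run some more queries.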

In [41]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# === Load generator models ===
tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# === Retriever ===
# Reuse the `retriever` that embedded the passages at upsert time; a query
# embedded with a different model would not be comparable to the stored vectors.

# === Define Pinecone Query Function ===
def query_pinecone(query, top_k=5):
    # Generate a flat embedding vector for the query
    xq = retriever.encode(query).tolist()

    # Query Pinecone (make sure 'index' is already initialized)
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return xc

# === Format query with context ===
def format_query(query, context_matches):
    # the passages were stored under the "passage_text" metadata key
    formatted_query = query + " " + " ".join([match["metadata"]["passage_text"] for match in context_matches])
    return formatted_query

# === Generate Answer ===
def generate_answer(query):
    inputs = tokenizer(query, return_tensors="pt", max_length=1024, truncation=True, padding=True)
    ids = model.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    answer = tokenizer.decode(ids[0], skip_special_tokens=True)
    return answer

# === Example run ===
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
answer = generate_answer(query)

print(answer)


Wie wurde der erste Wireless-Message send send? Wie wurde der erste Wireless-Message send?


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [42]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

Nothing is printed here, which means the query returned no matches, so the answer above was generated without any context. When no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19:

In [43]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

'Wo entstammt COVID-19? COVID-19 - COVID-19?'

In [44]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')
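One way to guard against answers generated from missing or irrelevant context is to abstain when no match clears a similarity threshold. The sketch below assumes matches shaped like Pinecone query results (a `score` plus `metadata`); the `answer_or_abstain` helper, the `fake_generate` stand-in and the `0.3` threshold are hypothetical and would need tuning against real data.

```python
NO_CONTEXT = "No relevant context was retrieved for this question."

def answer_or_abstain(question, matches, generate, min_score=0.3):
    """Call the generator only when at least one match clears the score threshold."""
    passages = [
        m["metadata"]["passage_text"] for m in matches if m["score"] >= min_score
    ]
    if not passages:
        return NO_CONTEXT
    context = " ".join(f"<P> {p}" for p in passages)
    return generate(f"question: {question} context: {context}")

def fake_generate(prompt):
    """Stand-in for the real generator so the sketch runs offline."""
    return f"answer built from {prompt.count('<P>')} passage(s)"

weak = [{"score": 0.05, "metadata": {"passage_text": "Unrelated text."}}]
strong = [{"score": 0.82, "metadata": {"passage_text": "Relevant text."}}]

print(answer_or_abstain("where did COVID-19 originate?", [], fake_generate))
print(answer_or_abstain("where did COVID-19 originate?", weak, fake_generate))
print(answer_or_abstain("what was the war of currents?", strong, fake_generate))
```

Returning an explicit message in the first two cases makes the failure visible to the caller instead of passing an empty context to the generator.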

Let’s finish with a final few questions.

In [45]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

'Was war der Krieg der Strömungen? Was war der Krieg der Strömungen?'

In [46]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

'Wer war die erste Person auf dem moon? Wer war die erste Person auf dem moon??'

In [47]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

'Was war das größte Projekt der NASA, das es in der Geschichte der NASA kostenaufwändig war?'

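As we can see, these outputs restate the question in German instead of answering it, so the retrieval step and the choice of generator both need checking before the answers can be trusted.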

#### Add a few more questions

In [48]:
query = "What is the theory of relativity?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)


'Was ist die Theorie der relativität? Was ist die Theorie der relativ relativität??'

In [49]:
query = "How did the industrial revolution impact society?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)


'Wie hat die industrielle revolution Einfluss die Gesellschaft auf die industrielle revolution? Wie hat die industrielle revolution die industrielle revolution Auswirkungen auf die Gesellschaft?'

In [50]:
query = "What caused the fall of the Roman Empire?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)


'Was caused the fall of the Roman Empire? Was caused the fall of the Roman Empire? What caused the fall of the Roman Empire?'