<a href="https://colab.research.google.com/github/Affiwhizz/lab-abstractive-question-answering/blob/main/lab-abstractive-question-answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.

# Load and Prepare Dataset

Our source data is taken from the English Wikipedia dataset on the Hugging Face Hub (`wikimedia/wikipedia`, configuration `20231101.en`). Since indexing the entire dataset would take a long time, this demo uses only a shuffled sample of 5,000 articles. If you want, you can use the complete dataset; a Pinecone vector database can manage millions of documents.

In [8]:
from datasets import load_dataset
from itertools import islice

wiki_stream = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
).shuffle(buffer_size=10_000, seed=960)

sample_size = 5000
wiki_small_iter = islice(wiki_stream, sample_size)

wiki_small = list(wiki_small_iter)
len(wiki_small)

5000

We load the dataset in streaming mode so that we don't have to wait for the whole dataset to download. Instead, records are downloaded iteratively as we consume them.
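To illustrate why streaming plus `islice` only touches the records we ask for, here is a minimal sketch that uses a plain Python generator as a stand-in for the streamed dataset. The `record_stream` generator and the `produced` log are hypothetical names introduced for this illustration; the real pipeline additionally fills a shuffle buffer before yielding anything.

```python
from itertools import islice

def record_stream(log):
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    i = 0
    while True:
        log.append(i)  # remember that record i was actually produced
        yield {"id": str(i), "text": f"passage {i}"}
        i += 1

produced = []
# take the first 5 records; the (infinite) stream is never fully materialized
sample = list(islice(record_stream(produced), 5))

print(len(sample))
print(len(produced))
```

Both counts are equal here because `islice` stops pulling from the generator as soon as it has the requested number of records.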

In [9]:
# show the contents of a single document in the dataset
next(iter(wiki_small))

{'id': '2117822',
 'url': 'https://en.wikipedia.org/wiki/57th%20United%20States%20Congress',
 'title': '57th United States Congress',
 'text': 'The 57th United States Congress was a meeting of the legislative branch of the United States federal government, composed of the United States Senate and the United States House of Representatives. It met in Washington, DC from March 4, 1901, to March 4, 1903, during the final six months of William McKinley\'s presidency, and the first year and a half of the first administration of his successor, Theodore Roosevelt. The apportionment of seats in the House of Representatives was based on the 1890 United States census. Both chambers had a Republican majority.\n\nMajor events\n\n September 6, 1901: Leon Czolgosz shot President William McKinley at the Pan-American Exposition in Buffalo, New York\n September 14, 1901: President William McKinley died. Vice President Theodore Roosevelt became President of the United States\n October 16, 1901: Presiden

In [10]:
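# filter only documents with History as section_title
# (records in this dataset have no "section_title" field, so this list is empty
# and is not used below)
history = [d for d in wiki_small if d.get("section_title") == "History"]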

Let's iterate through the sample and extract the fields we need. We map `title` to `article_title` and `text` to `passage_text`; `section_title` is set to `None` because this dataset has no such field.

In [11]:
from tqdm.auto import tqdm  # progress bar

docs = []
# iterate through the sample and keep only the fields we need
for d in tqdm(wiki_small, total=len(wiki_small)):
    # map title -> article_title and text -> passage_text
    docs.append({
        "article_title": d.get("title"),
        "section_title": None,  # not present in this dataset
        "passage_text": d.get("text")
    })

In [12]:
len(docs)

5000

In [13]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | 57th United States Congress | | The 57th United States Congress was a meeting ... |
| 1 | Augustin-Norbert Morin | | Augustin-Norbert Morin (October 13, 1803 – Jul... |
| 2 | OK Go (album) | | OK Go is the debut studio album by American ro... |
| 3 | Shree 420 | | Shree 420 (also spelled as Shri 420; ) is a 19... |
| 4 | Débora Sulca | | Débora Susan Sulca Cravero (born 1986 in Lima,... |


## Data Cleaning

In [14]:
import re
def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r"\[\d+\]", "", text)   # remove citation markers
    text = re.sub(r"\s+", " ", text)      # normalize whitespace
    text = re.sub(r"\(.*?\)", "", text)   # remove text in parentheses
    text = text.strip()
    return text

# Apply cleaning
df["clean_text"] = df["passage_text"].apply(clean_text)

# Truncation function
def truncate_text(text, max_words=500):
    words = text.split()
    return " ".join(words[:max_words])

# Apply truncation
df["clean_text"] = df["clean_text"].apply(truncate_text)

df.head()

| | article_title | section_title | passage_text | clean_text |
|---|---|---|---|---|
| 0 | 57th United States Congress | | The 57th United States Congress was a meeting ... | The 57th United States Congress was a meeting ... |
| 1 | Augustin-Norbert Morin | | Augustin-Norbert Morin (October 13, 1803 – Jul... | Augustin-Norbert Morin was a Canadien journali... |
| 2 | OK Go (album) | | OK Go is the debut studio album by American ro... | OK Go is the debut studio album by American ro... |
| 3 | Shree 420 | | Shree 420 (also spelled as Shri 420; ) is a 19... | Shree 420 is a 1955 Indian Hindi comedy-drama ... |
| 4 | Débora Sulca | | Débora Susan Sulca Cravero (born 1986 in Lima,... | Débora Susan Sulca Cravero is a Peruvian model... |
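Removing parenthesized text can leave a stray space in front of punctuation (for example `Packet , Xerox`, which shows up in retrieved passages later in this notebook). The following is a minimal sketch of an optional extra cleanup step, not part of the original pipeline; `tidy_punctuation` is a hypothetical helper and the input sentence is an illustrative example.

```python
import re

def tidy_punctuation(text: str) -> str:
    """Hypothetical helper: remove whitespace left in front of punctuation."""
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # "Packet , Xerox" -> "Packet, Xerox"
    return re.sub(r"\s+", " ", text).strip()      # collapse remaining whitespace runs

# illustrative input, not taken from the dataset
raw = "The code routed PARC Universal Packet (PUP), Xerox Network Systems (XNS) and Chaosnet."
no_parens = re.sub(r"\(.*?\)", "", raw)  # same parenthesis rule as clean_text
tidied = tidy_punctuation(no_parens)

print(no_parens)
print(tidied)
```

If this step were adopted, it would most naturally run at the end of `clean_text`, after the parentheses have been removed.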


# Initialize Pinecone Index

The Pinecone index stores vector representations of our passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free [here](https://app.pinecone.io/). We then initialize the connection as follows:

In [15]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects). In Colab, we first copy the API key from the notebook secrets into the environment.

In [19]:
from google.colab import userdata
import os

os.environ["PINECONE_API_KEY"] = userdata.get("PINECONE_API_KEY")

print("PINECONE_API_KEY found:", bool(os.environ.get("PINECONE_API_KEY")))

PINECONE_API_KEY found: True


In [20]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

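Now we create a new index. We name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors. Note that the index creation cell below reassigns `index_name` to "extractive-question-answering", and the stats it prints already report 18,891 vectors before anything is upserted in this notebook.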

In [21]:
index_name = "abstractive-question-answering" #give your index a meaningful name

In [23]:
from pinecone import Pinecone, ServerlessSpec
import os, time

# Initialize Pinecone client
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

# Define your index details
index_name = "extractive-question-answering"
spec = ServerlessSpec(cloud="aws", region="us-east-1")

# Check if index exists; create if not
if index_name not in [i["name"] for i in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=384,       # must equal the retriever's embedding size (384 for all-MiniLM-L6-v2)
        metric="cosine",
        spec=spec
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

# Connect to the index
index = pc.Index(index_name)

# View index stats to confirm it’s active
print(index.describe_index_stats())

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever creates embeddings such that questions and the passages that answer them are close to one another in the vector space. We will use `all-MiniLM-L6-v2`, a SentenceTransformer model based on Microsoft's MiniLM that outputs 384-dimensional vectors. This model performs quite well for comparing the similarity between queries and documents. We can use cosine similarity to compare the query and context vectors generated by this model (Pinecone does this for us automatically).

In [27]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the retriever model
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)
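As a reminder of what the cosine metric computes, here is a minimal pure-Python sketch on toy 3-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions; in the notebook the comparison runs inside Pinecone on 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors standing in for a query embedding and two passage embeddings
query_vec = [1.0, 0.0, 1.0]
passages = {
    "close": [2.0, 0.0, 2.0],       # same direction as the query, different length
    "orthogonal": [0.0, 3.0, 0.0],  # no overlap with the query
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in passages.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Because cosine similarity ignores vector length, the `close` passage scores (approximately) 1 even though it is twice as long as the query vector.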


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.

In [28]:
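from tqdm.auto import tqdm

# Assumption: `vectors` is not defined in any cell above, so we build it here
# by encoding the cleaned passages with the retriever
embeddings = retriever.encode(df["clean_text"].tolist(), show_progress_bar=True)
vectors = [
    {"id": str(i), "values": emb.tolist(),
     "metadata": {"article_title": row.article_title,
                  "section_title": row.section_title,
                  "text": row.clean_text}}
    for i, (row, emb) in enumerate(zip(df.itertuples(index=False), embeddings))
]

# Upsert (insert) into Pinecone in batches
batch_size = 100

for i in tqdm(range(0, len(vectors), batch_size)):
    batch = vectors[i:i+batch_size]

    # Pinecone metadata values cannot be None, so cast every value to a string
    for v in batch:
        v["metadata"] = {k: str(v["metadata"][k]) for k in v["metadata"]}

    try:
        index.upsert(vectors=batch, namespace="default")  # write to the "default" namespace
    except Exception as e:
        print(f"Error on batch {i//batch_size}: {e}")
        break
else:
    # runs only if no batch failed
    print("Upsert complete!")

print(index.describe_index_stats())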

Upsert complete!
{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}, 'default': {'vector_count': 5000}},
 'total_vector_count': 23891}
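The record shape (id, values, metadata) and the batching used for the upsert can be tried out without a live index. The sketch below uses placeholder zero vectors instead of real embeddings; `make_records` and `batches` are hypothetical helpers introduced only for this illustration.

```python
def make_records(titles, texts, embeddings):
    """Build one record per passage: unique string id, vector values, metadata."""
    return [
        {"id": str(i), "values": list(emb), "metadata": {"article_title": t, "text": x}}
        for i, (t, x, emb) in enumerate(zip(titles, texts, embeddings))
    ]

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f"Article {i}" for i in range(250)]
texts = [f"passage {i}" for i in range(250)]
fake_embeddings = [[0.0] * 384 for _ in range(250)]  # placeholders, not real embeddings

records = make_records(titles, texts, fake_embeddings)
sizes = [len(b) for b in batches(records, 100)]
print(sizes)
```

The last batch is simply shorter than the others, which is why the loop in the notebook needs no special handling for a sample size that is not a multiple of `batch_size`.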


# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and how the ELI5 BART model was trained is described [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [29]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are in place and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [30]:
def query_pinecone(query, top_k=3):
    # Generate the query embedding as a flat list of floats
    xq = retriever.encode(query).tolist()

    # Search Pinecone for similar passages
    search_results = index.query(vector=xq, top_k=top_k, include_metadata=True, namespace="default")

    # Extract matches
    matches = search_results["matches"]
    return matches

In [31]:
def format_query(query, context):
    # Extract passage text from Pinecone results and wrap them in <P> tags
    context_texts = [f"<P> {m['metadata']['text']}" for m in context if 'metadata' in m and 'text' in m['metadata']]

    # Concatenate all context passages
    combined_context = " ".join(context_texts)

    # Combine query with the retrieved context
    final_input = f"question: {query} context: {combined_context}"

    return final_input

Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages and pass them to the `format_query` function.

In [35]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

[{'id': '1791',
  'metadata': {'article_title': 'State electrician',
               'section_title': 'None',
               'text': 'The official title of "state electrician" was given to '
                       'some American state executioners in states using the '
                       'electric chair during the early 20th century, including '
                       'the New York State electrician. Hangings had usually '
                       'been carried out by untrained county sheriffs, but when '
                       'electrocution was introduced in New York, the first in '
                       'the world ever to adopt it, it was felt that a trained '
                       'electrician should be hired to operate the chair. Edwin '
                       "Davis was New York's first state electrician. He "
                       'carried out the execution of William Kemmler, the first '
                       'man executed with the electric chair, and that of '
           

In [36]:
from pprint import pprint

In [38]:
# format the query in the form generator expects the input
query = format_query(query, result)
pprint(query)

('question: when was the first electric power system built? context: <P> The '
 'official title of "state electrician" was given to some American state '
 'executioners in states using the electric chair during the early 20th '
 'century, including the New York State electrician. Hangings had usually been '
 'carried out by untrained county sheriffs, but when electrocution was '
 'introduced in New York, the first in the world ever to adopt it, it was felt '
 'that a trained electrician should be hired to operate the chair. Edwin Davis '
 "was New York's first state electrician. He carried out the execution of "
 'William Kemmler, the first man executed with the electric chair, and that of '
 'Martha M. Place, the first woman to be legally electrocuted. Davis also held '
 'patents on certain features of the electric chair and trained two of his '
 'successors, Robert G. Elliott and John Hulbert, who served as his assistants '
 'during executions. In New York, state electricians were no

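The formatted input has the structure the generator expects. Note, however, that the retrieved passage is about state electricians rather than electric power systems; with only 5,000 articles indexed, the closest match may not contain the answer. Now let's write a function to generate answers.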

In [42]:
def generate_answer(query):
    # tokenize the query to get input_ids
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
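    # use generator to predict output ids
    # (max_length=40 caps the answer at 40 tokens, so outputs may stop mid-sentence)
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer; pprint() returns None, so this function returns None as well
    return pprint(answer)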

In [43]:
generate_answer(query)

('The first wireless message was sent by a telegraph. It was sent by a '
 'telegraph operator.')


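The generator produced a fluent answer, but it is about the first wireless message rather than electric power systems, so it does not answer the question formatted above, most likely because the notebook cells were run out of order. Let's run some more queries.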

In [45]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

('The first wireless message was sent in the early 1900s. The first wireless '
 'message was sent in the early 1900s. The first wireless message was sent in '
 'the early 1900s. The first')


To check whether this answer is supported, we can inspect the contexts used to generate it.

In [48]:
for doc in context:
    print(doc["metadata"]["text"], end='\n---\n')

William "Bill" Yeager is an American engineer. He is best known for being the inventor of a packet-switched, "Ships in the Night", multiple-protocol router in 1981, during his 20-year tenure at Stanford's Knowledge Systems Laboratory as well as the Stanford University Computer Science department. Biography The code routed PARC Universal Packet , Xerox Network Systems , Internet Protocol and Chaosnet. The router used Bill's Network Operating System . The NOS also supported the EtherTIPS that were used throughout the Stanford LAN for terminal access to both the LAN and the Internet. This code was licensed by Cisco Systems in 1987 and comprised the core of the first Cisco IOS. This provided the groundwork for a new, global communications approach. He is also known for his role in the creation of the Internet Message Access Protocol mail protocol. In 1984 he conceived of a client/server protocol, designed its functionality, applied for and received the grant money for its implementation. I

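In this case, the passage shown above (about the engineer Bill Yeager) does not mention the first wireless message, so it does not support the answer, and the generated text simply repeats one sentence. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19: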

In [49]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context)
generate_answer(query)

('COVID-19 is a virus that infects humans. It is not a virus that infects '
 'animals. It is a virus that infects humans. It is not a virus that infect')


In [50]:
for doc in context:
    print(doc["metadata"]["text"], end='\n---\n')

Tioman virus is a paramyxovirus first isolated from the urine of island fruit bats on Tioman Island, Malaysia in 2000. The virus was discovered during efforts to identify the natural host of Nipah virus which was responsible for a large outbreak of encephalitic illness in humans and pigs in Malaysia and Singapore in 1998–99. Taxonomy Tioman virus is antigenically related to Menangle virus which is also harboured by Pteropid fruit bats and caused an outbreak of foetal deformities in pigs in Australia in 1997. Clinical importance Although there is no evidence that Tioman virus can cause illness in humans or animals, its close relationship to other disease-causing paramyxoviruses suggests the possibility that it may cause disease upon crossing the species barrier. The recent emergence of a number of zoonotic bat-borne viruses in the Asia-Pacific region also demonstrates that conditions are increasingly favouring this type of event. References Rubulaviruses
---
DRV may refer to: Darunavir,

Let’s finish with a final few questions.

In [51]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context)
generate_answer(query)

('The War of Currents is a term used to describe a series of events that '
 'occurred in the late 19th and early 20th centuries. The most famous of these '
 'events was the Battle of')


In [52]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context)
generate_answer(query)

('The first person to go to the moon was Neil Armstrong, who landed on the '
 'moon in 1969. He was the first man to walk on the moon.')


In [53]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context)
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $2.5 billion to build, and it was launched in 1969. The Space '
 'Shuttle was the first')


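As we can see, the model can generate fluent and sometimes decent answers, but they are not always accurate: the Space Shuttle answer above, for example, gives a launch year of 1969, whereas the Shuttle first flew in 1981.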

#### Add a few more questions

In [74]:
import io
from contextlib import redirect_stdout

def ask(question, top_k=5, min_score=0.3):
    print(f"\n Question: {question}\n{'-'*80}")

    # Retrieve top matches from Pinecone and keep those above the score threshold
    context = query_pinecone(question, top_k=top_k)
    filtered_context = [doc for doc in context if doc.get('score', 0) >= min_score]

    if not filtered_context:
        print("\n No relevant contexts found above the score threshold.")
        return

    # Format query for generator
    formatted_query = format_query(question, filtered_context)

    # generate_answer() pretty-prints its result and returns None, so capture
    # what it prints; the captured text is the pprint representation, with
    # quotes and parentheses included
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        raw_answer = generate_answer(formatted_query)
    printed_output = buffer.getvalue().strip()

    # Prefer returned value or captured print
    answer = raw_answer or printed_output

    # Normalize to a string
    if isinstance(answer, (tuple, list)):
        answer = answer[0]
    if not isinstance(answer, str):
        answer = str(answer)

    # Clean repetition: keep only the first occurrence of each word. This also
    # drops legitimately repeated words, so the result can read oddly.
    words = answer.split()
    clean_answer = " ".join(dict.fromkeys(words))

    # Print only the clean generated answer (inside function)
    print(f"\n Generated Answer:\n{clean_answer}\n")
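Word-level de-duplication with `dict.fromkeys` removes every repeated word, including ones the sentence needs. A gentler alternative, sketched here under the assumption that the answer is available as a plain string, is to drop repeated sentences instead. `drop_repeated_sentences` is a hypothetical helper and the input text is an illustrative example, not model output.

```python
import re

def drop_repeated_sentences(answer: str) -> str:
    """Keep the first occurrence of each sentence; drop a dangling final fragment."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    seen, kept = set(), []
    for sentence in sentences:
        key = sentence.lower()
        if sentence and key not in seen:
            seen.add(key)
            kept.append(sentence)
    # a cut-off generation often ends in an unfinished fragment; drop it
    # unless it is the only text we have
    if len(kept) > 1 and not kept[-1].endswith((".", "!", "?")):
        kept.pop()
    return " ".join(kept)

# illustrative text, not model output
text = "AI is broad. ML is narrow. AI is broad. ML is"
word_level = " ".join(dict.fromkeys(text.split()))
sentence_level = drop_repeated_sentences(text)

print(word_level)
print(sentence_level)
```

The word-level result loses the second "is", while the sentence-level result keeps both sentences intact and discards only the repeat and the unfinished fragment.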

In [63]:
query = "What causes earthquakes?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context)
generate_answer(query)

("I'm not a geologist, but I do have a degree in seismology. I'm not sure if "
 "this is what you're looking for, but I can give you a general idea")


In [75]:
ask("What is artificial intelligence?")


 Question: What is artificial intelligence?
--------------------------------------------------------------------------------

 Generated Answer:
('Artificial intelligence is the ability of a computer to learn and adapt ' 'its environment. It its 'environment. a')



In [82]:
ask("What is the difference between AI and ML?")


 Question: What is the difference between AI and ML?
--------------------------------------------------------------------------------

 Generated Answer:
('AI is the process of making a computer do something. ML ' 'making something.')



In [83]:
ask("How does machine learning work?")


 Question: How does machine learning work?
--------------------------------------------------------------------------------

 No relevant contexts found above the score threshold.


In [85]:
ask("When was the first car invented?")


 Question: When was the first car invented?
--------------------------------------------------------------------------------

 Generated Answer:
('The first car was invented in the early 19th century. The ' 'probably a bicycle. bicycle late 18th 'The in')



In [86]:
ask("what is consciousness?")


 Question: what is consciousness?
--------------------------------------------------------------------------------

 Generated Answer:
("I'm not sure if this is what you're looking for, but I'll give it a shot. " "Consciousness the ability to think. It's physical thing, it's")



In [87]:
ask("How does AI improve customer experience?")


 Question: How does AI improve customer experience?
--------------------------------------------------------------------------------

 No relevant contexts found above the score threshold.
