[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/question-answering/abstractive-question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/question-answering/abstractive-question-answering.ipynb)

# Abstractive Question Answering

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [132]:
# pinecone-client is pinned below 3.0 because this notebook uses the older `pinecone.init` API
# !pip install -qU datasets "pinecone-client<3" sentence-transformers torch
# PyMuPDF provides the `fitz` module; the separate `fitz` package on PyPI is unrelated
# !pip install --upgrade PyMuPDF
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

In [133]:
from nltk.corpus import stopwords
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re
import string

# Load and Prepare Dataset

The original Pinecone example uses the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. In this notebook we instead index text extracted from a local PDF (`MeatLife.pdf`): we read each page with PyMuPDF, join the pages into one string, split that string on periods into sentence-like passages, and strip basic punctuation. The result is a small DataFrame with one passage per row in the `passage_text2` column. The Pinecone vector database can manage millions of documents, so the same steps apply to much larger collections.

In [183]:
import os

# Check if the 'static/' directory exists, and create it if not
static_directory = 'static/'
if not os.path.exists(static_directory):
    os.makedirs(static_directory)


import fitz  # PyMuPDF
import pandas as pd

# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    texts = []

    for page_number in range(len(doc)):
        page = doc[page_number]

        # Extract text from the page
        page_text = page.get_text()
        texts.append(page_text)

    return texts

# Provide a list of PDF file paths
#pdf_file_paths = ["file1.pdf", "file2.pdf", "file3.pdf","file4.pdf", "file6.pdf", "file7.pdf"]

pdf_file_paths = ["MeatLife.pdf"]

# Create a list to store documents
docs = []

for pdf_file_path in pdf_file_paths:
    pdf_texts = extract_text_from_pdf(pdf_file_path)
    docs.append({"passage_text": pdf_texts})

# Create a pandas DataFrame with the documents
df1 = pd.DataFrame(docs)
df1['passage_text'] = df1['passage_text'].apply(lambda x: ' '.join(x))
df1.head()

delimiter = '.'
split_data = df1['passage_text'].str.split(delimiter, expand=True)
split_data = split_data.stack().reset_index(level=1, drop=True)
split_data.name = 'passage_text2'
split_data
# Create a new DataFrame with the split data
df = pd.concat([df1, split_data], axis=1)

# Drop the original 'passage_text' column now that it has been split
df = df.drop('passage_text', axis=1)

# Remove newlines and basic punctuation from a passage
def clean_txt(text):
   text = text.replace('\n', ' ')
   text = text.replace("'", '')
   text = text.replace('.', '')
   text = text.replace('"', '')
   text = text.replace(',', '')
   return text

df['passage_text2'] = df['passage_text2'].apply(clean_txt)
df.head()


Unnamed: 0,passage_text2
0,Podcasts YouTube Need to know Ad Sign up for T...
0,Accordingly MML will be restructured to separ...
0,The company will transform into a business pl...
0,Masan MEATLife (a member company of Masan Gro...
0,The company invested in the rest of the suppl...


In [184]:
df.shape

(182, 1)
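
Splitting on every `.` also cuts inside decimal numbers and leaves PDF line breaks inside passages. As a possible alternative, the sketch below splits only where sentence-ending punctuation is followed by whitespace and a capital letter, digit, or opening quote. The function name `split_sentences` and the sample text are hypothetical, and abbreviations such as "Mr. Smith" would still be split, so treat this as a starting point rather than a complete sentence segmenter.

```python
import re

# Split after '.', '!' or '?' only when followed by whitespace and then an
# uppercase letter, a digit, or an opening quote. A period inside a number
# such as "1.5" is not followed by whitespace, so it is left alone.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"“])')

def split_sentences(text):
    # Collapse newlines and repeated spaces left over from PDF extraction.
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in _SENTENCE_BOUNDARY.split(text) if s]

# Hypothetical sample text, not taken from the PDF.
sample = "The fund will invest US$1.5 million.\nThe deal closes in Q3. It is final."
for sentence in split_sentences(sample):
    print(sentence)
```

The resulting list could be placed in the `passage_text2` column in place of the output of `str.split('.')`.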

In [168]:
## Preprocess data

In [181]:
# stop = stopwords.words('english')
# stop_words_ = set(stopwords.words('english'))
# wn = WordNetLemmatizer()

# def black_txt(token):
#     return  token not in stop_words_ and token not in list(string.punctuation)  and len(token)>2



Unnamed: 0,passage_text2
0,KKR Extends Partnership with Masan Group Doubl...
0,KKR will invest US$200 million in addition t...
0,Madhur Maini CEO of Masan Group commented “O...
0,We believe KKR is the right partner to broade...
0,” Masan Group is one of the largest publicly-l...


In [185]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
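
The final `Normalize()` module in the printout means the retriever returns unit-length vectors. For unit-length vectors, cosine similarity reduces to a plain dot product, which is why a `cosine` index metric is a natural fit for this model. The following sketch illustrates that identity with small hand-written vectors (the helper names are hypothetical and no model is loaded):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    # Scale the vector to unit length, as the Normalize() module does.
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# After normalization the dot product matches the cosine similarity.
print(cosine_similarity(a, b), dot(a_hat, b_hat))
```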

In [186]:

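import os
import pinecone

# connect to pinecone environment
# the API key is read from an environment variable rather than hardcoded;
# the variable name PINECONE_API_KEY is an assumption, use whatever name you set
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="gcp-starter"  # find next to API key in console
)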

index_name = "abstractive-question-answering"

# Check if the abstractive-question-answering index exists
if index_name not in pinecone.list_indexes():
    # Create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=retriever.get_sentence_embedding_dimension(),
        metric="cosine"
    )

# Connect to the abstractive-question-answering index
index = pinecone.Index(index_name)

# The retriever loaded above is reused here.
# Upsert one passage per request; a larger batch_size means fewer API calls
batch_size = 1

for i in range(0, len(df), batch_size):
    # Extract a single document from the DataFrame
    batch = df.iloc[i:i + batch_size]
    # Generate embeddings for the document
    emb = retriever.encode(batch["passage_text2"].tolist()).tolist()
    # Get metadata
    meta = batch.to_dict(orient="records")
    # Create unique IDs, one per row actually present in this batch
    ids = [f"{idx}" for idx in range(i, i + len(batch))]
    # Add all to the upsert list
    to_upsert = list(zip(ids, emb, meta))
    # Upsert/insert these records into Pinecone (this is the slow step)
    _ = index.upsert(vectors=to_upsert)

# Check that we have all vectors in the index
index.describe_index_stats()


{'dimension': 768,
 'index_fullness': 0.00692,
 'namespaces': {'': {'vector_count': 692}},
 'total_vector_count': 692}
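
The index reports 692 vectors while the DataFrame above has 182 rows, which suggests that vectors from earlier runs are still stored. Positional IDs also give identical passages separate vectors. One way to make upserts repeatable is to derive each ID from the passage text, so the same passage always maps to the same record. The sketch below is a minimal illustration under that assumption; `passage_id`, `build_records`, `batched`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for `retriever.encode(text).tolist()`.

```python
import hashlib

def passage_id(text):
    # Hash the passage text so the same passage always gets the same ID.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]

def build_records(passages, embed):
    """Return (id, vector, metadata) tuples, one per unique passage."""
    records = {}
    for text in passages:
        pid = passage_id(text)
        if pid not in records:
            records[pid] = (pid, embed(text), {"passage_text2": text})
    return list(records.values())

def batched(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(text):
    # Stand-in for the real retriever; returns a one-dimensional "embedding".
    return [float(len(text))]

passages = ["alpha", "beta", "alpha", "gamma"]
records = build_records(passages, fake_embed)
batches = list(batched(records, 2))
print(len(records), [len(b) for b in batches])
```

Each batch has the same `(id, vector, metadata)` shape as `to_upsert` in the loop above, so it could be passed to `index.upsert(vectors=batch)`. Vectors already stored under old positional IDs would still need to be deleted separately.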

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and how the ELI5 BART model was trained is described [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [140]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

# from transformers import GPT2Tokenizer, GPT2LMHeadModel

# # Load GPT-2 tokenizer and model from Hugging Face
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# generator = GPT2LMHeadModel.from_pretrained('gpt2').to(device)



All the components of our abstractive QA system are now ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [141]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(xq, top_k=top_k, include_metadata=True)
    return xc

In [142]:
def format_query(query, context):
    # extract passage_text2 from each Pinecone match and prepend the <P> tag
    context = [f"<P> {m['metadata']['passage_text2']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the query and context passages
    query = f"question: {query} context: {context}"
    return query

Let's test the helper functions. We will query the Pinecone index we created earlier with `query_pinecone` to get context passages, then pass them to `format_query`.

In [143]:
query = "when was winlife launched?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '363',
              'metadata': {'passage_text2': 'crownx tcx masan integrate '
                                            'consumer retail platform '
                                            'consolidate wincommerce wcm masan '
                                            'consumer hold mch grow net '
                                            'revenues'},
              'score': 0.330994904,
              'values': []}],
 'namespace': ''}

In [144]:
from pprint import pprint

In [145]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('question: when was winlife launched? context: <P> crownx tcx masan integrate '
 'consumer retail platform consolidate wincommerce wcm masan consumer hold mch '
 'grow net revenues')


In [146]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return pprint(answer)

In [147]:
generate_answer(query)


('WinLife was launched in 2011. It was a game that allowed you to buy and play '
 'virtual currencies. It was a game that allowed you to play virtual '
 'currencies.')


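Note that the single retrieved passage does not say when WinLife was launched, so the date in the generated answer is not supported by the context. With `top_k=1` and a similarity score of only 0.33, the generator had little relevant information to work with. Let's run some more queries.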

In [148]:
query = "Is MeatDeli profitable?"
context = query_pinecone(query, top_k=15)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is not profitable. It is profitable because it has a very large '
 'customer base and a very high profit margin. It is not profitable because it '
 'has a very low profit margin.')


In [149]:
context

{'matches': [{'id': '288',
              'metadata': {'passage_text2': 'nearly years market launch '
                                            'meatdeli distribution system '
                                            'point sale hanoi chi minh city '
                                            'surround areas customer base '
                                            'millions people'},
              'score': 0.641336322,
              'values': []},
             {'id': '454',
              'metadata': {'passage_text2': 'nearly years market launch '
                                            'meatdeli distribution system '
                                            'point sale hanoi chi minh city '
                                            'surround areas customer base '
                                            'millions people'},
              'score': 0.641336322,
              'values': []},
             {'id': '111',
              'metadata': {'passage_text2': 'dec masa
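
The first two matches above carry identical passage text under different IDs, so part of the `top_k` budget is spent on repeated context. A small filter applied before `format_query` can drop such repeats. The sketch below assumes matches shaped like the Pinecone results shown in this notebook; `dedupe_matches` and the sample data are hypothetical.

```python
def dedupe_matches(matches):
    """Keep the first match for each distinct passage text."""
    seen = set()
    unique = []
    for match in matches:
        # Normalize whitespace and case so line-break variants compare equal.
        key = " ".join(match["metadata"]["passage_text2"].split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(match)
    return unique

# Hypothetical matches mimicking the structure of a Pinecone query result.
sample_matches = [
    {"id": "1", "score": 0.64, "metadata": {"passage_text2": "first passage"}},
    {"id": "2", "score": 0.64, "metadata": {"passage_text2": "first\npassage"}},
    {"id": "3", "score": 0.51, "metadata": {"passage_text2": "second passage"}},
]
print([m["id"] for m in dedupe_matches(sample_matches)])
```

With this helper, a call such as `format_query(query, dedupe_matches(context["matches"]))` would pass each distinct passage to the generator once.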

In [152]:
query = "What is winlife?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('WinLife is a company that provides a service that allows you to buy a '
 'subscription to a service. WinLife is a service that allows you to buy a '
 'subscription to a service that allows you')
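
Several generated answers repeat the same phrase. The `generate` method in transformers accepts a `no_repeat_ngram_size` argument that blocks any n-gram from appearing twice in the output, which may reduce this kind of looping. The sketch below collects the settings used in `generate_answer` plus that one addition (the value 3 is an untested assumption for this model) and adds a hypothetical helper, `has_repeated_ngram`, for flagging repetitive answers after the fact.

```python
# Settings from generate_answer, plus no_repeat_ngram_size (value is an assumption).
# Intended use: generator.generate(inputs["input_ids"], **GENERATION_KWARGS)
GENERATION_KWARGS = {
    "num_beams": 2,
    "min_length": 20,
    "max_length": 40,
    "no_repeat_ngram_size": 3,
}

def has_repeated_ngram(text, n=3):
    """Return True if any sequence of n consecutive words occurs more than once."""
    tokens = text.lower().split()
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False

looping = ("WinLife is a service that allows you to buy a "
           "subscription to a service that allows you")
print(has_repeated_ngram(looping))
```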


In [153]:
query = "Who is the CEO of Masan?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The CEO of Masan is the Chairman of the Board of Directors. The CEO is the '
 'CEO of Masan Group. The CEO of Masan Group is the Chairman of the Board of '
 'Directors')


In [154]:
context

{'matches': [{'id': '231',
              'metadata': {'passage_text2': 'currently chief executive ofﬁcer '
                                            'masan group'},
              'score': 0.838496923,
              'values': []},
             {'id': '612',
              'metadata': {'passage_text2': 'currently chief executive oﬃcer '
                                            'masan group'},
              'score': 0.833334863,
              'values': []},
             {'id': '634',
              'metadata': {'passage_text2': 'nguyen thieu nam deputy ceo masan '
                                            'group'},
              'score': 0.780959904,
              'values': []},
             {'id': '639',
              'metadata': {'passage_text2': 'michael hang nguyen deputy ceo '
                                            'masan group join masan since '
                                            'michael leverage experience focus '
                                            'str

In [155]:
query = "What was the most profitable product for Masan?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The most profitable product for Masan was the "Masa" brand. It was a brand '
 'that was very popular in Japan, and it was very profitable. It was also very '
 'profitable in')


In [156]:
query = "What was the most profitable product for MeatDeli?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli was profitable for a long time. They were able to make a lot of '
 'money off of the meat they sold. They were able to make a lot of money off '
 'of the')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [117]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text2"], end='\n---\n')


Truning a proﬁt and creating momentum for breakthrough
In the third quarter of 2021, MEATDeli, MML's branded chilled meat business 
(excluding farms, 3F Viet, and feed) marked an important milestone whenby 
delivering its ﬁrst ever positive net proﬁt after tax
---

 Truning a proﬁt and creating momentum for breakthrough
In the third quarter of 2021, MEATDeli, MML's branded chilled meat business (excluding farms, 3F Viet,
and feed) marked an important milestone whenby delivering its ﬁrst ever positive net proﬁt after tax
---
 This creates conditions for MML's clean
meat brand MEATDeli to have a lot of potential for growth
---
 This creates conditions for MML's clean meat brand 
MEATDeli to have a lot of potential for growth
---

Thanks to the launch of the MEATDeli brand and its investment in 3F VIET, MML is well-positioned to
disrupt the $15 billion meat market in Vietnam, unlocking its potential for sustainable growth as well
as signiﬁcantly higher proﬁt margin
---

Thanks to the lau

In this case, the retrieved contexts are relevant to the question, although the generated answer only loosely reflects them. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about the WinLife launch date:

In [101]:
query = "when was WinLife launched?"
context = query_pinecone(query, top_k=10) ##Tune this for better results
query = format_query(query, context["matches"])
generate_answer(query)

('WinLife was launched in 2011. It was a free service that was available to '
 'everyone. It was a free service that was available to everyone. It was free '
 'for everyone. It was free')
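
One way to limit answers like this is to check the similarity scores before calling the generator and to decline to answer when nothing scores high enough. The sketch below is a minimal illustration of that idea. The function names, the abstain message, and the 0.5 threshold are all assumptions that would need tuning on real queries, and `generate` is a stand-in for a function that returns the generator's answer as a string.

```python
ABSTAIN_MESSAGE = "No sufficiently relevant context was found."

def filter_by_score(matches, min_score=0.5):
    # Keep only matches whose similarity score reaches the threshold.
    return [m for m in matches if m["score"] >= min_score]

def answer_or_abstain(query, matches, generate, min_score=0.5):
    relevant = filter_by_score(matches, min_score)
    if not relevant:
        return ABSTAIN_MESSAGE
    # Same prompt layout as format_query: each passage is prefixed with <P>.
    context = " ".join(f"<P> {m['metadata']['passage_text2']}" for m in relevant)
    return generate(f"question: {query} context: {context}")

def echo(prompt):
    # Stand-in generator that returns the prompt so the sketch runs offline.
    return prompt

weak = [{"score": 0.33, "metadata": {"passage_text2": "unrelated text"}}]
strong = [{"score": 0.84, "metadata": {"passage_text2": "relevant text"}}]

print(answer_or_abstain("example question?", weak, echo))
print(answer_or_abstain("example question?", strong, echo))
```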


In [102]:
query = "when was MeatDeli launched?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is what you're looking for, but here's a link to the "
 'original article:')


In [103]:
query = "Who is the CEO of Masan? "
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('The CEO of Masan is the Chairman of Masan Group. He is the CEO of 6 '
 'subsidiaries of Masan Group, including the role of Chairman of Masan '
 'High-Tech Materials Corporation')


In [104]:
query = "Is MeatDeli profitable?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

("It's not profitable, but it's profitable in the sense that it's profitable "
 "in the sense that it's profitable in the sense that it's profitable in the "
 "sense that it's profitable in")


In [71]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text2"], end='\n---\n')


Truning a proﬁt and creating momentum for breakthrough
In the third quarter of 2021, MEATDeli, MML's branded chilled meat business 
(excluding farms, 3F Viet, and feed) marked an important milestone whenby 
delivering its ﬁrst ever positive net proﬁt after tax
---

 Truning a proﬁt and creating momentum for breakthrough
In the third quarter of 2021, MEATDeli, MML's branded chilled meat business (excluding farms, 3F Viet,
and feed) marked an important milestone whenby delivering its ﬁrst ever positive net proﬁt after tax
---
 This creates conditions for MML's clean
meat brand MEATDeli to have a lot of potential for growth
---
 This creates conditions for MML's clean meat brand 
MEATDeli to have a lot of potential for growth
---

Thanks to the launch of the MEATDeli brand and its investment in 3F VIET, MML is well-positioned to
disrupt the $15 billion meat market in Vietnam, unlocking its potential for sustainable growth as well
as signiﬁcantly higher proﬁt margin
---

Thanks to the lau

Let’s finish with a final few questions.

In [None]:
query = "What was revenue for Masan?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('Masaan was a company that sold food products. They sold a lot of food '
 'products, but they also sold a lot of other things. They also sold a lot of '
 'other things.')


In [None]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the moon was Neil Armstrong. He was the first '
 'person to walk on the moon.')


In [None]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $10 billion to build.')


As we can see, the model can generate decent answers when relevant context is retrieved, but it tends to repeat itself or invent details when it is not.

# Example Application

To try out an application like this one, see this [example application](https://huggingface.co/spaces/pinecone/abstractive-question-answering).