[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/question-answering/abstractive-question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/question-answering/abstractive-question-answering.ipynb)

# Abstractive Question Answering

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client sentence-transformers torch
!pip install PyPDF2

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.4/179.4 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.5/62.5 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m300.4/300.4 kB[0m [31m13

# Load and Prepare Dataset

Our source data will be taken from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. But, since indexing the entire dataset may take some time, we will only utilize 50,000 passages in this demo that include "History" in the "section title" column. If you want, you may utilize the complete dataset. Pinecone vector database can effortlessly manage millions of documents for you.

In [132]:
!pip install -qU datasets pinecone-client sentence-transformers torch
!pip install PyPDF2
!pip install fitz
!pip install frontend
!pip install tools
!pip install --upgrade PyMuPDF
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [169]:
import os
import re
import string
import time
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords
wn = WordNetLemmatizer()

# Check if the 'static/' directory exists, and create it if not
static_directory = 'static/'
if not os.path.exists(static_directory):
    os.makedirs(static_directory)


import fitz  # PyMuPDF
import pandas as pd

# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    texts = []

    for page_number in range(len(doc)):
        page = doc[page_number]

        # Extract text from the page
        page_text = page.get_text()
        texts.append(page_text)

    return texts

# Provide a list of PDF file paths
#pdf_file_paths = ["file1.pdf", "file2.pdf", "file3.pdf","file4.pdf", "file6.pdf", "file7.pdf"]

pdf_file_paths = ["MeatLife2.pdf"]

# Create a list to store documents
docs = []

for pdf_file_path in pdf_file_paths:
    pdf_texts = extract_text_from_pdf(pdf_file_path)
    docs.append({"passage_text": pdf_texts})

# Create a pandas DataFrame with the documents
df1 = pd.DataFrame(docs)
df1['passage_text2'] = df1['passage_text'].apply(lambda x: ' '.join(x))

df=df1
df.head()


# def split_text_into_rows(text, max_length):
#     split_text = [text[i:i + max_length] for i in range(0, len(text), max_length)]
#     return split_text

# # Maximum character count for each row
# max_length = 100

# # Split the 'text' column into multiple rows
# df1['passage_text2'] = df1['passage_text'].apply(lambda x: split_text_into_rows(x, max_length))

# # Expand the list of split text into multiple rows
# df1 = df1.explode('passage_text2')

# # Reset the DataFrame index
# df1.reset_index(drop=True, inplace=True)
# df1.head()

# delimiter = '.'
# split_data = df1['passage_text'].str.split(delimiter, expand=True)
# split_data = split_data.stack().reset_index(level=1, drop=True)
# split_data.name = 'passage_text2'
# split_data
# # Create a new DataFrame with the split data
# df = pd.concat([df1, split_data], axis=1)

# # Drop the original 'Text' column if it's no longer needed
# #df = df1.drop('passage_text2', axis=1)
# def clean_txt(text):
#    text = text.replace('\n', ' ')
#    text = text.replace("'", '')
#    #text = text.replace('.', '')
#    text = text.replace('"', '')
#    text = text.replace(',', '')
#    clean_text = [ wn.lemmatize(word, pos="v") for word in word_tokenize(text.lower())]
#    #clean_text2 = [word for word in clean_text if black_txt(word)]
#    return " ".join(clean_text)
#    #return text


# df['passage_text2'] = df['passage_text2'].apply(clean_txt)
# df.head()


Unnamed: 0,passage_text,passage_text2
0,[Advantages of Masan MEATLife \nafter restruct...,Advantages of Masan MEATLife \nafter restructu...


In [170]:
# import pandas as pd


# # # # Function to split text into multiple rows
# def split_text_into_rows(text, max_length):
#     split_text = [text[i:i + max_length] for i in range(0, len(text), max_length)]
#     return split_text

# # Maximum character count for each row
# max_length = 100

# # Split the 'text' column into multiple rows
# df['passage_text2'] = df1['passage_text'].apply(lambda x: split_text_into_rows(x, max_length))

# # Expand the list of split text into multiple rows
# df = df.explode('passage_text2')

# # Reset the DataFrame index
# df.reset_index(drop=True, inplace=True)
# df.head()



# delimiter = '.'
# split_data = df1['passage_text'].str.split(delimiter, expand=True)
# split_data = split_data.stack().reset_index(level=1, drop=True)
# split_data.name = 'passage_text2'
# split_data
# # Create a new DataFrame with the split data
# df = pd.concat([df1, split_data], axis=1)

# # Drop the original 'Text' column if it's no longer needed
# df = df.drop('passage_text', axis=1)




In [171]:
df.shape

(1, 2)

In [172]:
# def clean_txt(text):
#    text = text.replace('\n', ' ')
#    text = text.replace("'", '')
#    text = text.replace('.', '')
#    text = text.replace('"', '')
#    text = text.replace(',', '')
#    return text

# df['passage_text2'] = df['passage_text2'].apply(clean_txt)
# df.head()

In [173]:
#df = df1
df.shape


(1, 2)

In [174]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [175]:

import pinecone

# connect to pinecone environment
pinecone.init(
    api_key="bb491778-4927-4fab-913d-18ca0578500a",
    environment="gcp-starter"  # find next to API key in console
)

index_name = "abstractive-question-answering"

# Check if the abstractive-question-answering index exists
if index_name not in pinecone.list_indexes():
    # Create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=retriever.get_sentence_embedding_dimension(),
        metric="cosine"
    )

# Connect to the abstractive-question-answering index
index = pinecone.Index(index_name)

import torch
from sentence_transformers import SentenceTransformer

# Set the device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the retriever model from the Hugging Face model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)

# We will use batches of 1 since we're processing a single PDF document
batch_size = 64

for i in range(0, len(df), batch_size):
    # Extract a single document from the DataFrame
    batch = df.iloc[i:i + batch_size]
    # Generate embeddings for the document
    emb = retriever.encode(batch["passage_text2"].tolist()).tolist()
    # Get metadata
    meta = batch.to_dict(orient="records")
    # Create unique IDs
    ids = [f"{idx}" for idx in range(i, i + batch_size)]
    # Add all to the upsert list
    to_upsert = list(zip(ids, emb, meta))
    # Upsert/insert these records into Pinecone
    _ = index.upsert(vectors=to_upsert)#Time taking step

# Check that we have all vectors in the index
index.describe_index_stats()


{'dimension': 768,
 'index_fullness': 0.03107,
 'namespaces': {'': {'vector_count': 3107}},
 'total_vector_count': 3107}

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [176]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

# from transformers import GPT2Tokenizer, GPT2LMHeadModel

# # Load GPT-2 tokenizer and model from Hugging Face
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# generator = GPT2LMHeadModel.from_pretrained('gpt2').to(device)



All the components of our abstract QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from Pinecone index and to format the query in the way the generator expects the input.

In [177]:
df['passage_text2']

0    Advantages of Masan MEATLife \nafter restructu...
Name: passage_text2, dtype: object

In [178]:
# from google.colab import files
# df.to_csv('test.csv')
# files.download('test.csv')

In [197]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(xq, top_k=top_k, include_metadata=True)
    return xc

In [198]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text2']}" for m in context]
    # concatinate all context passages
    context = " ".join(context)
    # contcatinate the query and context passages
    query = f"question: {query} context: {context}"
    return query

Let's test the helper functions. We will query the Pinecone index function we created earlier with the `query_pinecone` to get context passages and pass them to the `format_query` function.

In [199]:
# # format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('question: question: question: Is MeatDeli profitable? context: <P> this '
 'create condition for mmls clean meat brand meatdeli to have a lot of '
 'potential for growth <P> after nearly 3 years of market launch meatdeli own '
 'a distribution system of more than 2700 point of sale in hanoi ho chi minh '
 'city and surround areas with a customer base of millions of people <P> er '
 'nearly 3 years of market launch MEATDeli owns  <P> er nearly 3 years of '
 'market launch MEATDeli owns  <P> er nearly 3 years of market launch MEATDeli '
 'owns  context: <P>  After nearly 3 years of market launch MEATDeli owns a '
 'distribution  system of more than 2700 points of sale in Hanoi Ho Chi Minh '
 'City  and surrounding areas with a customer base of millions of people '
 'context: <P>  After nearly 3 years of market launch MEATDeli owns a '
 'distribution  system of more than 2700 points of sale in Hanoi Ho Chi Minh '
 'City  and surrounding areas with a customer base of millions of people')

In [200]:
def generate_answer(query):
    # tokenize the query to get input_ids
    inputs = tokenizer([query], max_length=1024, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return pprint(answer)

In [201]:
generate_answer(query)

('MeatDeli is not profitable. They are not profitable because they are not '
 'profitable because they are not profitable because they are not profitable '
 'because they are not profitable because they are not profitable because they')


As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [202]:
query = "Is MeatDeli profitable?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is not profitable. They are not profitable because they are not '
 'profitable because they are not profitable because they are not profitable '
 'because they are not profitable because they are not profitable because they')


In [203]:
query = "What was the most profitable product for MeatDeli?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli was the most profitable product for MeatDeli. They made a lot of '
 'money on the meat they sold. They made a lot of money on the meat they sold. '
 'Meat')


In [204]:
query = "What type of meat is sold the most by MeatDeli?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is a chain of restaurants that specialize in meat. MeatDeli is a '
 'chain of restaurants that specialize in meat. MeatDeli is a chain of '
 'restaurants that specialize in meat')


In [205]:
query = "When was MeatLife launched?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatLife is a company that sells meat insurance. It was founded by Masan '
 'Masan, the founder of Masan. Masan is a company that sells meat insurance. '
 'MeatLife was')


In [206]:
query = "how many distribution centres does meatdeli own?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli has a distribution system of more than 2700 point of sale in hanoi '
 'ho chi minh city and surround areas with a customer base of millions of '
 'people')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [120]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text2"], end='\n---\n')

er nearly 3 years of market launch MEATDeli owns 
---
er nearly 3 years of market launch MEATDeli owns 
---
er nearly 3 years of market launch MEATDeli owns 
---


In [121]:
query = "What was the most profitable product for MeatDeli?"
context = query_pinecone(query, top_k=10) ##Tune this for better results
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli was the most profitable product for MeatDeli. MeatDeli was the most '
 'profitable product for MeatDeli. MeatDeli was the most profitable product '
 'for MeatDeli')


In [122]:
query = "when was MeatDeli launched?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli was founded in the early 90s in Vietnam. It was founded by a couple '
 'of Vietnamese businessmen who wanted to start their own restaurant. The idea '
 'was that they would sell their')


In [123]:
query = "What does MeatDeli sell "
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is a company that sells meat. MeatDeli sells a lot of different '
 'things. MeatDeli sells a lot of different things. MeatDeli sells a lot of '
 'different')


In [124]:
query = "Is MeatDeli profitable?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is not profitable. They are not profitable because they are not '
 'profitable because they are not profitable because they are not profitable '
 'because they are not profitable because they are not profitable because they')


In [125]:
query = "What was the revenue growth of MeatDeli?"
context = query_pinecone(query, top_k=15)
query = format_query(query, context["matches"])
generate_answer(query)

('MeatDeli is a chain of restaurants that specialize in fresh meat. They have '
 'been around for a long time, and have been around for a long time. They have '
 'been around for a')


In [126]:
query = "What is MML?"
context = query_pinecone(query, top_k=15)
query = format_query(query, context["matches"])
generate_answer(query)

('MLM stands for "Multimedia Learning Machine". It\'s basically a computer '
 "program that is used to learn how to make things. It's used to learn how to "
 'make video games, music')


In [127]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text2"], end='\n---\n')

oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
oing its role well MML focuses on developing the 
---
 It can be seen that the restructuring of MML is part of the strategy  announced by Mr
---
 It can be seen that the restructuring of MML is part of the strategy announced by Mr
---
een that the restructuring of MML is part of the s
---
een that the restructuring of MML is part of the s
---
een that the restructuring of MML is part of the s
---


Let’s finish with a final few questions.

In [None]:
query = "What was revenue for Masan?"
context = query_pinecone(query, top_k=12)
query = format_query(query, context["matches"])
generate_answer(query)

('Masaan was a company that sold food products. They sold a lot of food '
 'products, but they also sold a lot of other things. They also sold a lot of '
 'other things.')


In [None]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the moon was Neil Armstrong. He was the first '
 'person to walk on the moon.')


In [None]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $10 billion to build.')


As we can see, the model can generate some decent answers.

# Example Application

To try out an application like this one, see this [example application](https://huggingface.co/spaces/pinecone/abstractive-question-answering).