# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for notebook to run:

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [2]:
!pip install -qU datasets pinecone-client sentence-transformers torch ipywidgets

# Load Dataset

Now let's load the SQUAD dataset from the HuggingFace Model Hub. We load the dataset into a pandas dataframe and filter the title, question, and context columns, and we drop any duplicate context passages.

In [3]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [4]:
# select only title and context column
df = df[['title', 'context']]

# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=['context'])
print(df.head())

                       title  \
0   University_of_Notre_Dame   
5   University_of_Notre_Dame   
10  University_of_Notre_Dame   
15  University_of_Notre_Dame   
20  University_of_Notre_Dame   

                                              context  
0   Architecturally, the school has a Catholic cha...  
5   As at most other universities, Notre Dame's st...  
10  The university is the major seat of the Congre...  
15  The College of Engineering was established in ...  
20  All of Notre Dame's undergraduate students are...  


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages which we can retrieve using another vector (query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key]("https://app.pinecone.io/"), and then we initialize the connection like so:

In [5]:
from pinecone import Pinecone, ServerlessSpec, PineconeApiException

# connect to pinecone environment
pc = Pinecone(api_key = PINECONE_API_KEY)

Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [6]:
index_name = "question-answering"

# check if the extractive-question-answering index exists
try:
    # Check if the "question-answering" index exists
    if index_name not in pc.list_indexes():
        # Create the index if it does not exist
        pc.create_index(
            name=index_name,
            dimension=384,       # Specify 384 dimensions for the vectors
            metric='cosine',     # Use cosine similarity
            spec=ServerlessSpec(
                cloud="aws",     # Specify the cloud provider as AWS
                region="us-east-1"  # Specify the region
            )
        )
        print(f"Index '{index_name}' created successfully.")
    else:
        # Index already exists
        print(f"Index '{index_name}' already exists.")
except PineconeApiException as e:
    if e.status == 409:  # Conflict error
        print(30*"*")
        print(f"Index '{index_name}' already created previously.")
        print(30*"*")
    else:
        print(f"An error occurred: {e}")

# connect to extractive-question-answering index we created
#index = pinecone.Index(index_name)
index = pc.Index(index_name)

print(f"Connected to index: {index_name}")


Index 'question-answering' created successfully.
Connected to index: question-answering


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [7]:
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display
# Determine the available device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'  # For Apple Silicon
else:
    device = 'cpu'

print(f"Using device: {device}")

# Load the retriever model from the HuggingFace model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# Move the model to the selected device
retriever = retriever.to(device)

# Confirm that the model is loaded and moved to the correct device
print("Model loaded and moved to device:")
print(retriever)


Using device: mps
Model loaded and moved to device:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.

In [8]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch['context'].tolist()).tolist()
    # get metadata
    meta = batch.to_dict(orient='records')
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()


  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [9]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x17f0e7c20>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context embeddings containing answers to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [10]:
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode(question).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    
    # Print the structure of the first match to see what's available
    if xc['matches']:
        print("Structure of first match:", xc['matches'][0])
    
    # extract the context passage from pinecone search result
    # Use a try-except block to handle potential KeyErrors
    try:
        c = [x['metadata']['passage_text'] for x in xc['matches']]
    except KeyError:
        # If 'passage_text' is not available, try to use whatever is available in metadata
        c = [str(x['metadata']) for x in xc['matches']]
    
    return c

In [11]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the result based on the score from reader model
    sorted_result = pprint(sorted(results, key=lambda x: x['score'], reverse=True))
    return sorted_result

In [12]:
# Now let's try the query again
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
print("Context:", context)

Structure of first match: {'id': '18008',
 'metadata': {'context': 'Egypt was producing 691,000 bbl/d of oil and '
                         '2,141.05 Tcf of natural gas (in 2013), which makes '
                         'Egypt as the largest oil producer not member of the '
                         'Organization of the Petroleum Exporting Countries '
                         '(OPEC) and the second-largest dry natural gas '
                         'producer in Africa. In 2013, Egypt was the largest '
                         'consumer of oil and natural gas in Africa, as more '
                         'than 20% of total oil consumption and more than 40% '
                         'of total dry natural gas consumption in Africa. '
                         'Also, Egypt possesses the largest oil refinery '
                         'capacity in Africa 726,000 bbl/d (in 2012). Egypt is '
                         'currently planning to build its first nuclear power '
                        

As we can see, the retiever is working fine and gets us the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [13]:
extract_answer(question, context)

[{'answer': '691,000 bbl/d',
  'context': "{'context': 'Egypt was producing 691,000 bbl/d of oil and "
             '2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the '
             'largest oil producer not member of the Organization of the '
             'Petroleum Exporting Countries (OPEC) and the second-largest dry '
             'natural gas producer in Africa. In 2013, Egypt was the largest '
             'consumer of oil and natural gas in Africa, as more than 20% of '
             'total oil consumption and more than 40% of total dry natural gas '
             'consumption in Africa. Also, Egypt possesses the largest oil '
             'refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is '
             'currently planning to build its first nuclear power plant in El '
             "Dabaa city, northern Egypt.', 'title': 'Egypt'}",
  'end': 46,
  'score': 0.9999750852584839,
  'start': 33}]


The reader model predicted with 99% accuracy the correct answer *691,000 bbl/d* as seen from the context passage. Let's run few more queries.

In [14]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

Structure of first match: {'id': '18200',
 'metadata': {'context': 'According to a story that has often been repeated in '
                         'the media, Hurley and Chen developed the idea for '
                         'YouTube during the early months of 2005, after they '
                         'had experienced difficulty sharing videos that had '
                         "been shot at a dinner party at Chen's apartment in "
                         'San Francisco. Karim did not attend the party and '
                         'denied that it had occurred, but Chen commented that '
                         'the idea that YouTube was founded after a dinner '
                         'party "was probably very strengthened by marketing '
                         'ideas around creating a story that was very '
                         'digestible".',
              'title': 'YouTube'},
 'score': 0.542637825,
 'values': []}
[{'answer': 'Hurley and Chen',
  'context': "{'context': 'Ac

In [15]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

Structure of first match: {'id': '16241',
 'metadata': {'context': 'Albert Einstein is known for his theories of special '
                         'relativity and general relativity. He also made '
                         'important contributions to statistical mechanics, '
                         'especially his mathematical treatment of Brownian '
                         'motion, his resolution of the paradox of specific '
                         'heats, and his connection of fluctuations and '
                         'dissipation. Despite his reservations about its '
                         'interpretation, Einstein also made contributions to '
                         'quantum mechanics and, indirectly, quantum field '
                         'theory, primarily through his theoretical studies of '
                         'the photon.',
              'title': 'Modern_history'},
 'score': 0.509426713,
 'values': []}
[{'answer': 'his theories of special relativity and general

Let's run another question. This time for top 3 context passages from the retriever.

In [16]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '2563',
 'metadata': {'context': 'The trip to the Moon took just over three days. '
                         'After achieving orbit, Armstrong and Aldrin '
                         'transferred into the Lunar Module, named Eagle, and '
                         'after a landing gear inspection by Collins remaining '
                         'in the Command/Service Module Columbia, began their '
                         'descent. After overcoming several computer overload '
                         'alarms caused by an antenna switch left in the wrong '
                         'position, and a slight downrange error, Armstrong '
                         'took over manual flight control at about 180 meters '
                         '(590 ft), and guided the Lunar Module to a safe '
                         'landing spot at 20:18:04 UTC, July 20, 1969 (3:17:04 '
                         'pm CDT). The first humans on the Moon would wait '
                 

In [17]:
question = "How many countries are in the European Union?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '16119',
 'metadata': {'context': 'With continuing European integration, the European '
                         'Union is increasingly being seen as a great power in '
                         'its own right, with representation at the WTO and at '
                         'G8 and G-20 summits. This is most notable in areas '
                         'where the European Union has exclusive competence '
                         '(i.e. economic affairs). It also reflects a '
                         "non-traditional conception of Europe's world role as "
                         'a global "civilian power", exercising collective '
                         'influence in the functional spheres of trade and '
                         'diplomacy, as an alternative to military dominance. '
                         'The European Union is a supranational union and not '
                         'a sovereign state, and has limited scope in the '
                 

In [18]:
question = "When did the Berlin Wall fall?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '12175',
 'metadata': {'context': 'Spring 1989 saw the people of the Soviet Union '
                         'exercising a democratic choice, albeit limited, for '
                         'the first time since 1917, when they elected the new '
                         "Congress of People's Deputies. Just as important was "
                         "the uncensored live TV coverage of the legislature's "
                         'deliberations, where people witnessed the previously '
                         'feared Communist leadership being questioned and '
                         'held accountable. This example fueled a limited '
                         'experiment with democracy in Poland, which quickly '
                         'led to the toppling of the Communist government in '
                         'Warsaw that summer – which in turn sparked uprisings '
                         'that overthrew communism in the other five Warsaw '
         

In [19]:
question = "Who discovered penicillin?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '212',
 'metadata': {'context': 'Florey and Chain succeeded in purifying the first '
                         'penicillin, penicillin G, in 1942, but it did not '
                         'become widely available outside the Allied military '
                         'before 1945. Later, Norman Heatley developed the '
                         'back extraction technique for efficiently purifying '
                         'penicillin in bulk. The chemical structure of '
                         'penicillin was determined by Dorothy Crowfoot '
                         'Hodgkin in 1945. Purified penicillin displayed '
                         'potent antibacterial activity against a wide range '
                         'of bacteria and had low toxicity in humans. '
                         'Furthermore, its activity was not inhibited by '
                         'biological constituents such as pus, unlike the '
                         'synthetic sulfon

In [20]:
question = "What is the capital of Australia?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '6947',
 'metadata': {'context': 'Melbourne (/ˈmɛlbərn/, AU i/ˈmɛlbən/) is the capital '
                         'and most populous city in the Australian state of '
                         'Victoria, and the second most populous city in '
                         'Australia and Oceania. The name "Melbourne" refers '
                         'to the area of urban agglomeration (as well as a '
                         'census statistical division) spanning 9,900 km2 '
                         '(3,800 sq mi) which comprises the broader '
                         'metropolitan area, as well as being the common name '
                         'for its city centre. The metropolis is located on '
                         'the large natural bay of Port Phillip and expands '
                         'into the hinterlands towards the Dandenong and '
                         'Macedon mountain ranges, Mornington Peninsula and '
                         'Yarra Va

In [21]:
question = "Who wrote 'To Kill a Mockingbird'?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '731',
 'metadata': {'context': 'To Kill a Mockingbird is a novel by Harper Lee '
                         'published in 1960. It was immediately successful, '
                         'winning the Pulitzer Prize, and has become a classic '
                         'of modern American literature. The plot and '
                         "characters are loosely based on the author's "
                         'observations of her family and neighbors, as well as '
                         'on an event that occurred near her hometown in 1936, '
                         'when she was 10 years old.',
              'title': 'To_Kill_a_Mockingbird'},
 'score': 0.829025209,
 'values': []}
[{'answer': 'Harper Lee',
  'context': '{\'context\': "To Kill a Mockingbird is a novel by Harper Lee '
             'published in 1960. It was immediately successful, winning the '
             'Pulitzer Prize, and has become a classic of modern American '
             'liter

In [22]:
question = "What is the speed of light in a vacuum?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '14785',
 'metadata': {'context': 'Albert Einstein proposed that the laws of physics '
                         'should be based on the principle of relativity. This '
                         'principle holds that the rules of physics must be '
                         'the same for all observers, regardless of the frame '
                         'of reference that is used, and that light propagates '
                         'at the same speed in all reference frames. This '
                         "theory was motivated by Maxwell's equations, which "
                         'show that electromagnetic waves propagate in a '
                         "vacuum at the speed of light. However, Maxwell's "
                         'equations give no indication of what this speed is '
                         'relative to. Prior to Einstein, it was thought that '
                         'this speed was relative to a fixed medium, called '
                

In [23]:
question = "Who was the first President of the United States?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '601',
 'metadata': {'context': 'In 1785, the assembly of the Congress of the '
                         'Confederation made New York the national capital '
                         'shortly after the war. New York was the last capital '
                         'of the U.S. under the Articles of Confederation and '
                         'the first capital under the Constitution of the '
                         'United States. In 1789, the first President of the '
                         'United States, George Washington, was inaugurated; '
                         'the first United States Congress and the Supreme '
                         'Court of the United States each assembled for the '
                         'first time, and the United States Bill of Rights was '
                         'drafted, all at Federal Hall on Wall Street. By '
                         '1790, New York had surpassed Philadelphia as the '
                         '

In [24]:
question = "What is the largest planet in our solar system?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '15607',
 'metadata': {'context': 'Neptune is the eighth and farthest known planet from '
                         'the Sun in the Solar System. It is the '
                         'fourth-largest planet by diameter and the '
                         'third-largest by mass. Among the giant planets in '
                         'the Solar System, Neptune is the most dense. Neptune '
                         'is 17 times the mass of Earth and is slightly more '
                         'massive than its near-twin Uranus, which is 15 times '
                         'the mass of Earth and slightly larger than '
                         'Neptune.[c] Neptune orbits the Sun once every 164.8 '
                         'years at an average distance of 30.1 astronomical '
                         'units (4.50×109 km). Named after the Roman god of '
                         'the sea, its astronomical symbol is ♆, a stylised '
                         "version of

In [25]:
question = "Which country hosted the 2016 Summer Olympics?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '9430',
 'metadata': {'context': 'Mexico City remains the only Latin American city to '
                         'host the Olympic Games, having held the Summer '
                         'Olympics in 1968, winning bids against Buenos Aires, '
                         'Lyon and Detroit. (This too will change thanks to '
                         'Rio, 2016 Summer Games host). The city hosted the '
                         '1955 and 1975 Pan American Games, the last after '
                         'Santiago and São Paulo withdrew. The ICF Flatwater '
                         'Racing World Championships were hosted here in 1974 '
                         'and 1994. Lucha libre is a Mexican style of '
                         'wrestling, and is one of the more popular sports '
                         'throughout the country. The main venues in the city '
                         'are Arena México and Arena Coliseo.',
              'title': 'Mexico_City'},

In [26]:
question = "Who is Sporting Clube Farense, from Faro?"
context = get_context(question, top_k=3)
extract_answer(question, context)

Structure of first match: {'id': '1018',
 'metadata': {'context': 'Football is the most popular sport in Portugal. '
                         'There are several football competitions ranging from '
                         'local amateur to world-class professional level. The '
                         'legendary Eusébio is still a major symbol of '
                         'Portuguese football history. FIFA World Player of '
                         'the Year winners Luís Figo and Cristiano Ronaldo who '
                         "won the FIFA Ballon d'Or for 2013 and 2014, are "
                         'among the numerous examples of other world-class '
                         'football players born in Portugal and noted '
                         'worldwide. Portuguese football managers are also '
                         'noteworthy, with José Mourinho, André Villas-Boas, '
                         'Fernando Santos, Carlos Queiroz and Manuel José '
                         'among 

The result looks pretty good.

In [27]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?

## Key observations from testing our abstractive question-answering system with additional historical questions:

1. Context relevance: The system generally retrieves relevant information from the Pinecone index.

2. Answer quality: Responses are mostly coherent and relevant, with quality varying based on question specificity and available context.

3. Versatility: The system handles different types of historical questions (events, people, concepts) with varying degrees of success.

4. Consistency: Well-known facts are usually answered consistently across multiple runs.

5. Limitations: The system may struggle with very specific or nuanced questions not well-represented in the dataset.

6. Temporal understanding: The system demonstrates ability to handle questions from various historical periods.

7. Information synthesis: The generator can combine information from multiple contexts to form answers.

8. Ambiguity handling: For multi-faceted questions, the system prioritizes certain aspects in its responses.

9. Response detail: Answers typically provide a moderate level of detail in a concise format.

10. Potential hallucination: Occasional instances of plausible but potentially incorrect information, especially with less relevant retrieved contexts.

These observations highlight the system's strengths in retrieving and synthesizing historical information, while also revealing areas for potential improvement in handling nuanced queries and maintaining factual accuracy.