# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading our API keys and installing the packages the notebook needs:

In [13]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# os.getenv takes the NAME of the environment variable, not the secret itself;
# keep the actual keys in a .env file that is never committed
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [2]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep the title, question, and context columns, and drop any duplicate context passages.

In [14]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [15]:
df = df[["title", "question", "context"]].copy()


In [16]:
df = df.drop_duplicates(subset=["context"]).reset_index(drop=True)


In [8]:
df.shape


(18891, 3)

In [9]:
df.head()

| | title | question | context |
|---|---|---|---|
| 0 | University_of_Notre_Dame | To whom did the Virgin Mary allegedly appear i... | Architecturally, the school has a Catholic cha... |
| 1 | University_of_Notre_Dame | When did the Scholastic Magazine of Notre dame... | As at most other universities, Notre Dame's st... |
| 2 | University_of_Notre_Dame | Where is the headquarters of the Congregation ... | The university is the major seat of the Congre... |
| 3 | University_of_Notre_Dame | How many BS level degrees are offered in the C... | The College of Engineering was established in ... |
| 4 | University_of_Notre_Dame | What entity provides help with the management ... | All of Notre Dame's undergraduate students are... |


In [None]:
# select only title and context column
#df = None ## done above
# drop rows containing duplicate context passages
#df = None ## did above
#df

| | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 5 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 10 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 15 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 20 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
| ... | ... | ... |
| 87574 | Kathmandu | Institute of Medicine, the central college of ... |
| 87579 | Kathmandu | Football and Cricket are the most popular spor... |
| 87584 | Kathmandu | The total length of roads in Nepal is recorded... |
| 87589 | Kathmandu | The main international airport serving Kathman... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [10]:
pip uninstall -y pinecone pinecone-client


Found existing installation: pinecone 7.3.0
Uninstalling pinecone-7.3.0:
  Successfully uninstalled pinecone-7.3.0
Found existing installation: pinecone-client 6.0.0
Uninstalling pinecone-client-6.0.0:
  Successfully uninstalled pinecone-client-6.0.0


In [20]:
pip install "pinecone>=4.0.0"




In [34]:
import os
# Set the key only if it is not already in the environment;
# replace the placeholder with your own key and never commit a real one
os.environ.setdefault("PINECONE_API_KEY", "YOUR_PINECONE_API_KEY")
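Hardcoding a key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative is to read the key from the environment and prompt for it only when it is missing. The helper name `get_api_key` and its `env` and `prompt` parameters are hypothetical and exist only for this illustration; they are not part of the Pinecone SDK.

```python
import os
from getpass import getpass


def get_api_key(name, env=None, prompt=getpass):
    """Return the secret stored under `name`, prompting only if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        # Ask interactively so the key never appears in the notebook source
        value = prompt(f"Enter {name}: ")
    if not value:
        raise ValueError(f"{name} is not set")
    return value


# Hypothetical usage in this notebook:
# pc = Pinecone(api_key=get_api_key("PINECONE_API_KEY"))
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment.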


In [35]:
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
print([idx["name"] for idx in pc.list_indexes()])

['squad-contexts', 'extractive-question-answering']


In [40]:
import os
from pinecone import Pinecone, ServerlessSpec

api_key = os.environ.get("PINECONE_API_KEY")

pinecone = Pinecone(api_key=api_key)

index_name = "squad-contexts"
dimension = 384
if index_name not in [idx["name"] for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)
print("Authenticated and connected to index:", index_name)



Authenticated and connected to index: squad-contexts


Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [41]:
index_name = "question-answering"

# check whether the question-answering index already exists
existing = [idx["name"] for idx in pc.list_indexes()]
if index_name not in existing:
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,                # output dimension of the embedding model
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",              # or "gcp" if you prefer
            region="us-east-1",       # use the same region in later steps
        ),
    )

# connect to the question-answering index
index = pc.Index(index_name)

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
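To make the ranking step concrete, here is a small sketch of cosine similarity and top-k selection in plain Python. The three-dimensional vectors and the passage names are made up for illustration (the real embeddings have 384 dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers, not functions from Pinecone or sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k=1):
    """Return the k passage names ranked by cosine similarity to the query."""
    ranked = sorted(
        passages,
        key=lambda name: cosine_similarity(query, passages[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors, made up for illustration only
query = [1.0, 0.0, 1.0]
passages = {
    "passage_a": [0.9, 0.1, 0.8],
    "passage_b": [0.0, 1.0, 0.0],
    "passage_c": [0.5, 0.5, 0.5],
}
print(top_k(query, passages, k=2))  # ['passage_a', 'passage_c']
```

A vector index performs the same kind of ranking, but over many vectors and typically with approximate search rather than an exhaustive sort.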

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [42]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
retriever.to(device)

retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches so that embedding and uploading to the Pinecone index go faster. For each context passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to our embeddings, such as the article title and the context passage.

In [43]:
from tqdm.auto import tqdm
import numpy as np

df = df.dropna(subset=["context"]).copy()
for col in ["title", "question"]:
    if col in df.columns:
        df[col] = df[col].fillna("")

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    end = min(i + batch_size, len(df))

    batch = df.iloc[i:end]
    contexts = batch["context"].tolist()

    # generate embeddings (normalized for cosine similarity)
    emb = retriever.encode(
        contexts,
        batch_size=batch_size,
        convert_to_numpy=True,
        normalize_embeddings=True,
        show_progress_bar=False,
    )

    # build the vectors to upsert
    vectors = []
    for k, (row_idx, row) in enumerate(batch.iterrows()):
        vec_id = f"squad-{row_idx}"
        meta = {
            "title": row.get("title", ""),
            "question": row.get("question", ""),
            "context": row["context"],
        }
        vectors.append({
            "id": vec_id,
            "values": emb[k].tolist(),
            "metadata": meta,})

    # upsert into the index
    index.upsert(vectors=vectors)

# check the index statistics
stats = index.describe_index_stats()
stats


  0%|          | 0/296 [00:00<?, ?it/s]

KeyboardInterrupt: 
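The run above stopped with a `KeyboardInterrupt`, so it is worth seeing how the batching and record layout behave in isolation. The sketch below uses placeholder vectors instead of real embeddings, and `iter_batches` and `build_records` are hypothetical helper names. It assumes, as in the loop above, that each id is derived from the row position; because an upsert with an existing id overwrites that record, re-running the loop after an interruption should not create duplicates.

```python
def iter_batches(items, batch_size):
    """Yield (start_offset, batch) pairs covering `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def build_records(rows, embeddings, offset=0, prefix="squad"):
    """Pair each row with its vector and a deterministic id."""
    records = []
    for k, (row, values) in enumerate(zip(rows, embeddings)):
        records.append({
            "id": f"{prefix}-{offset + k}",
            "values": list(values),
            "metadata": {"title": row["title"], "context": row["context"]},
        })
    return records


# Made-up rows and placeholder vectors, not real SQuAD data or embeddings
rows = [{"title": f"T{i}", "context": f"passage {i}"} for i in range(5)]
fake_embeddings = [[float(i), 0.0] for i in range(5)]

all_ids = []
for start, batch in iter_batches(rows, batch_size=2):
    vectors = fake_embeddings[start:start + len(batch)]
    records = build_records(batch, vectors, offset=start)
    all_ids.extend(r["id"] for r in records)

print(all_ids)  # ['squad-0', 'squad-1', 'squad-2', 'squad-3', 'squad-4']
```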

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [44]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cpu


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7a8cce797290>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most similar to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [45]:
# gets context passages from the pinecone index
def get_context(question, top_k=5):
    # 1) generate embedding for the question (normalize for cosine)
    xq = retriever.encode(question, convert_to_numpy=True, normalize_embeddings=True)
    if xq.ndim > 1:
        xq = xq[0]  # ensure shape (384,)

    # 2) search pinecone for most similar contexts
    xc = index.query(
        vector=xq.tolist(),
        top_k=top_k,
        include_metadata=True )

    # 3) extract context passages (sorted by score already)
    matches = xc["matches"] if isinstance(xc, dict) else xc.matches
    c = [m["metadata"]["context"] for m in matches]

    return c


In [46]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader model's score, highest first
    # (pprint returns None, so sort first, then print, then return the list)
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    pprint(sorted_result)
    return sorted_result

In [47]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['The economy is a mixture of village agriculture and handicrafts, an industrial sector based largely on petroleum, support services, and a government characterized by budget problems and overstaffing. Petroleum extraction has supplanted forestry as the mainstay of the economy. In 2008, oil sector accounted for 65% of the GDP, 85% of government revenue, and 92% of exports. The country also has large untapped mineral wealth.']

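The retriever returns the passage it ranks closest to the question. In this run the passage describes an oil-dependent economy but does not mention Egypt or a daily production figure, so it cannot contain the answer; the upsert loop above was interrupted, so the index may not hold every passage. Now let's use the reader to extract an answer from the context passage.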

In [48]:
extract_answer(question, context)

[{'answer': '92% of exports',
  'context': 'The economy is a mixture of village agriculture and handicrafts, '
             'an industrial sector based largely on petroleum, support '
             'services, and a government characterized by budget problems and '
             'overstaffing. Petroleum extraction has supplanted forestry as '
             'the mainstay of the economy. In 2008, oil sector accounted for '
             '65% of the GDP, 85% of government revenue, and 92% of exports. '
             'The country also has large untapped mineral wealth.',
  'end': 372,
  'score': 9.096978465095162e-05,
  'start': 358}]


In this run the reader returned *92% of exports* with a score of about 9e-05, which signals very low confidence; the expected answer, *691,000 bbl/d*, does not appear in the retrieved passage. Let's run a few more queries.

In [49]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Sir Tim Berners-Lee',
  'context': 'The first web browser was invented in 1990 by Sir Tim '
             'Berners-Lee. Berners-Lee is the director of the World Wide Web '
             "Consortium (W3C), which oversees the Web's continued "
             'development, and is also the founder of the World Wide Web '
             'Foundation. His browser was called WorldWideWeb and later '
             'renamed Nexus.',
  'end': 65,
  'score': 5.350358177336201e-12,
  'start': 46}]
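The `start` and `end` fields in the reader output are character offsets into the context, so slicing the context with them reproduces the `answer` string. The sketch below checks this with the values from the output above; the context is shortened to its first two sentences for brevity.

```python
# Context shortened from the output above; offsets are unchanged because
# the answer sits in the first sentence
context = (
    "The first web browser was invented in 1990 by Sir Tim Berners-Lee. "
    "Berners-Lee is the director of the World Wide Web Consortium (W3C)."
)
prediction = {"answer": "Sir Tim Berners-Lee", "start": 46, "end": 65}

# Slice the context with the character offsets returned by the reader
span = context[prediction["start"]:prediction["end"]]
print(span)  # Sir Tim Berners-Lee
```

Note that the score for this prediction is about 5e-12, so the reader itself assigns almost no confidence to it as an answer to the YouTube question.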


In [None]:
question = "What is Albert Einstein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 86,
  'score': 0.9500371217727661,
  'start': 29}]


Let's try another question, this time retrieving the top 3 context passages from the retriever.

In [None]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

The result looks pretty good.

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?




In [50]:
question = "Who painted mona lisa?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Sergei Fyodorov',
  'context': 'In the 1990s two icons by the Russian icon painter Sergei '
             'Fyodorov were hung in the abbey. On 6 September 1997 the funeral '
             'of Diana, Princess of Wales, was held at the Abbey. On 17 '
             'September 2010 Pope Benedict XVI became the first pope to set '
             'foot in the abbey.',
  'end': 66,
  'score': 0.036416420801856475,
  'start': 51},
 {'answer': 'Michelangelo',
  'context': 'Outside of these genealogies, comics theorists and historians '
             'have seen precedents for comics in the Lascaux cave paintings in '
             'France (some of which appear to be chronological sequences of '
             "images), Egyptian hieroglyphs, Trajan's Column in Rome, the "
             '11th-century Norman Bayeux Tapestry, the 1370 bois Protat '
             'woodcut, the 15th-century Ars moriendi and block books, '
             "Michelangelo's The Last Judgment in the Sistine Chapel, and "
  

In [None]:
## The top-ranked answer is wrong: Sergei Fyodorov is an icon painter mentioned in an unrelated passage
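One way to handle low-confidence predictions like this one is to reject answers whose reader score falls below a threshold. The sketch below is only an illustration: `best_answer` is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and the scores are copied from the outputs above.

```python
def best_answer(results, min_score=0.5):
    """Return the highest-scoring result, or None if it is below min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    return top if top["score"] >= min_score else None


# Scores copied from the reader outputs above
mona_lisa = [{"answer": "Sergei Fyodorov", "score": 0.036416420801856475}]
einstein = [{
    "answer": "his theories of special relativity and general relativity",
    "score": 0.9500371217727661,
}]

print(best_answer(mona_lisa))           # None
print(best_answer(einstein)["answer"])  # his theories of special relativity and general relativity
```

A threshold only filters answers the reader is unsure about; a confidently wrong prediction would still pass.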

In [51]:
question = "What is the capital of Brazil?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Rio de Janeiro',
  'context': 'With the occupation by Napoleon, Portugal began a slow but '
             'inexorable decline that lasted until the 20th century. This '
             'decline was hastened by the independence in 1822 of the '
             "country's largest colonial possession, Brazil. In 1807, as "
             "Napoleon's army closed in on Lisbon, the Prince Regent João VI "
             'of Portugal transferred his court to Brazil and established Rio '
             'de Janeiro as the capital of the Portuguese Empire. In 1815, '
             'Brazil was declared a Kingdom and the Kingdom of Portugal was '
             'united with it, forming a pluricontinental State, the United '
             'Kingdom of Portugal, Brazil and the Algarves.',
  'end': 371,
  'score': 0.9978076219558716,
  'start': 357},
 {'answer': 'Elio',
  'context': 'Because of the acceptance of miscegenation, Brazil has avoided '
             'the binary polarization of society into blac

In [None]:
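## The answers are not correct: the top answer, Rio de Janeiro, comes from a passage about the capital of the Portuguese Empire (the capital of Brazil is Brasília)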