# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading our API keys and installing the packages needed for the notebook to run:

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# os.getenv takes the NAME of an environment variable, not the key itself
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [None]:
# !pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

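Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas DataFrame, keep only the columns we need (title, question, and context), and drop any duplicate context passages. The second cell below walks through the same clean-up steps on a small hand-made sample; note that it reassigns `df`, so the rest of the notebook runs on that sample.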

In [None]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [None]:
import pandas as pd

# Step 1: Create a sample dataset manually
data = {
    'title': [
        'Battery Life Review',
        'Sound Quality Review',
        'Battery Life Review',  # duplicate title, okay
        'Design Feedback'
    ],
    'context': [
        'This battery lasts 10 hours and charges quickly.',
        'The sound is clear with great bass response.',
        'This battery lasts 10 hours and charges quickly.',  # duplicate context
        'The design is sleek and comfortable to hold.'
    ]
}

# Step 2: Load it into a DataFrame
df = pd.DataFrame(data)

# Step 3: Select only 'title' and 'context' columns (already true here)
df = df[['title', 'context']]

# Step 4: Drop rows with duplicate 'context'
df = df.drop_duplicates(subset='context')

# Step 5: Show the result
print("Cleaned DataFrame:")
print(df)


Cleaned DataFrame:
                  title                                           context
0   Battery Life Review  This battery lasts 10 hours and charges quickly.
1  Sound Quality Review      The sound is clear with great bass response.
3       Design Feedback      The design is sleek and comfortable to hold.


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/); we then initialize the connection like so:

In [None]:
# !pip install -qU langchain-pinecone pinecone-notebooks

In [None]:
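import os
from pinecone import Pinecone, ServerlessSpec

# Set the Pinecone API key as an environment variable only if it is not already set
# ("my key" is a placeholder; substitute your own key)
os.environ.setdefault("PINECONE_API_KEY", "my key")

# Serverless deployment target used when creating the index
spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# Connect to Pinecone; look the key up by the environment variable's name.
# The cloud and region are supplied through `spec`, so no environment argument is needed.
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))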
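
A missing or empty API key tends to surface later as a confusing authentication error, so it can help to fail early. The following is a minimal sketch, not part of the Pinecone client: `require_env` is a hypothetical helper and `EXAMPLE_API_KEY` is a made-up variable name used only for demonstration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name)
    if not value or not value.strip():
        raise RuntimeError(f"Environment variable {name} is not set")
    return value.strip()


# Hypothetical variable, set here only so the example runs on its own
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
api_key = require_env("EXAMPLE_API_KEY")
print(len(api_key))
```

In the notebook, the same check could be applied to `PINECONE_API_KEY` before constructing the client.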


Now we create a new index called "extractive-question-answering"; we can name the index anything we want. We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors. The index dimension must always equal the length of the vectors we upsert.
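
Because the index dimension has to match the length of every vector, a quick length check before upserting can catch a mismatch early. This is a minimal sketch in plain Python; `check_dimensions` is a hypothetical helper (not a Pinecone API), and the vectors are dummy data.

```python
def check_dimensions(vectors, expected_dim):
    """Return the ids of vectors whose length differs from expected_dim.

    `vectors` is a list of (id, values) pairs.
    """
    return [vec_id for vec_id, values in vectors if len(values) != expected_dim]


# Output size the text states for multi-qa-MiniLM-L6-cos-v1
EXPECTED_DIM = 384

vectors = [
    ("id-0", [0.0] * 384),  # correct size
    ("id-1", [0.0] * 768),  # wrong size for a 384-dimensional index
]

bad_ids = check_dimensions(vectors, EXPECTED_DIM)
print(bad_ids)
```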

In [None]:
import os
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone with the API key stored in the PINECONE_API_KEY environment variable
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

index_name = "extractive-question-answering"

# Check if the index exists
if index_name not in pc.list_indexes().names():
    # Create the index if it doesn't exist. The dimension must equal the embedding size:
    # 384 for multi-qa-MiniLM-L6-cos-v1; 768 here matches the placeholder vectors upserted below.
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")  # or use your desired cloud/region
    )

# Connect to the index
index = pc.Index(index_name)


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
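
To make the similarity search concrete, here is a minimal sketch of cosine similarity and top-k ranking in plain Python. The two-dimensional vectors are toy values chosen for illustration; real embeddings come from the retriever, and Pinecone performs this ranking for us at query time.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_contexts(query_vec, context_vecs, k=3):
    """Return the k best (score, index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(context_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


query = [1.0, 0.0]
contexts = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

ranked = top_k_contexts(query, contexts, k=2)
# Context 0 points almost the same way as the query, so it ranks first
print([index for _, index in ranked])
```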

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [None]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)
retriever

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.
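
The (id, embedding, metadata) triples described above can be assembled batch by batch. Below is a minimal sketch in plain Python, assuming each record is a dict with `title` and `context` keys; `build_upsert_batches` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for a real embedding model.

```python
def build_upsert_batches(records, embed_fn, batch_size=64):
    """Yield lists of (id, embedding, metadata) tuples, one list per batch."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Embed only the context passages; the whole record becomes the metadata
        embeddings = embed_fn([record["context"] for record in batch])
        ids = [f"id-{i}" for i in range(start, start + len(batch))]
        yield list(zip(ids, embeddings, batch))


def fake_embed(texts):
    """Stand-in embedding function: NOT a real model, just deterministic dummy vectors."""
    return [[float(len(text)), 0.0] for text in texts]


records = [{"title": f"t{i}", "context": f"passage {i}"} for i in range(5)]
batches = list(build_upsert_batches(records, fake_embed, batch_size=2))

# Five records with a batch size of two give batches of 2, 2, and 1
print([len(batch) for batch in batches])
```

Each yielded list has the same shape as the `to_upsert` list passed to `index.upsert` in the cell below.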

In [None]:
from tqdm.auto import tqdm
import numpy as np

# Assuming 'df' is already loaded and contains 'title' and 'context' columns

# Set the batch size
batch_size = 64

# Loop over the data in batches
for i in tqdm(range(0, len(df), batch_size)):
    # Find the end index of the current batch
    end = min(i + batch_size, len(df))

    # Extract the current batch
    batch = df.iloc[i:end]

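    # Generate embeddings for the batch
    # Placeholder: random 768-dimensional vectors, which carry no semantic meaning.
    # Replace with a real embedding model, e.g. retriever.encode(batch["context"].tolist()).tolist(),
    # and make sure the index dimension matches that model's output size.
    emb = np.random.rand(len(batch), 768).tolist()

    # Prepare metadata for each item (the title and context passage of each row)
    meta = batch.to_dict(orient="records")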

    # Create unique IDs for each vector
    ids = [f"id-{j}" for j in range(i, end)]

    # Combine IDs, embeddings, and metadata into upsert list
    to_upsert = list(zip(ids, emb, meta))

    # Upsert the vectors to the Pinecone index
    index.upsert(vectors=to_upsert)

# Print index stats to verify the upload (the reported count can lag briefly after an upsert)
print(index.describe_index_stats())


{'dimension': 768,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}


# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [None]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cpu


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7dbf69fd9010>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves from the Pinecone index the context passages most likely to contain the answer to our question, and the `extract_answer` function extracts the answers from these context passages.
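
The overall retrieve-then-read flow can be sketched independently of Pinecone and the reader model. In this minimal sketch, `answer_question`, `fake_retrieve`, and `fake_read` are hypothetical names: the two stand-ins only imitate the shapes of the real components (a list of passages, and a dict with `answer` and `score` keys), so the control flow can be run without any external service.

```python
def answer_question(question, retrieve, read, top_k=3):
    """Retrieve top_k passages, run the reader on each, and rank predictions by score."""
    results = []
    for context in retrieve(question, top_k):
        prediction = dict(read(question=question, context=context))
        prediction["context"] = context  # keep the passage next to its answer
        results.append(prediction)
    return sorted(results, key=lambda r: r["score"], reverse=True)


def fake_retrieve(question, top_k):
    """Stand-in retriever: returns fixed passages instead of querying an index."""
    passages = ["The sound is clear.", "The battery lasts 10 hours."]
    return passages[:top_k]


def fake_read(question, context):
    """Stand-in reader: keyword match instead of a question-answering model."""
    hit = "battery" in context.lower()
    return {"answer": "10 hours" if hit else "", "score": 0.9 if hit else 0.1}


ranked = answer_question(
    "How long does the battery last?", fake_retrieve, fake_read, top_k=2
)
for result in ranked:
    print(result["score"], repr(result["answer"]))
```

Passing the retriever and reader in as arguments keeps the ranking logic testable on its own.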

In [None]:
# !pip install langchain openai




In [None]:
# from langchain_community.embeddings import OpenAIEmbeddings


In [None]:
# !pip install -U langchain-community



In [None]:
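from langchain_community.embeddings import OpenAIEmbeddings

def get_context(question, top_k=3):
    # 1. Generate embedding for the question
    # Note: text-embedding-ada-002 returns 1536-dimensional vectors. The query vector must
    # have the same dimension as the index and come from the same model as the stored vectors.
    embed = OpenAIEmbeddings(model="text-embedding-ada-002")
    xq = embed.embed_query(question)

    # 2. Query Pinecone for relevant contexts
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)

    # 3. Extract the context passages; the metadata stored at upsert time
    #    holds each passage under the 'context' key
    contexts = [match['metadata']['context'] for match in xc['matches']]

    # 4. Return a list of strings so the reader can process each passage individually
    return contexts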


In [None]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader model's score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # pprint returns None, so print first and return the sorted list itself
    pprint(sorted_result)
    return sorted_result

In [None]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

As we can see, the retriever is working fine and returns the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [None]:
from transformers import pipeline

# Load the QA pipeline (you can change the model if needed)
reader = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

def extract_answer(question, context):
    """
    Extracts answers from a list of context strings for a given question.
    """
    if context is None or not isinstance(context, list):
        raise ValueError("Context must be a list of strings.")

    results = []
    for c in context:
        if c.strip() == "":
            continue  # Skip empty contexts
        answer = reader(question=question, context=c)
        results.append({
            "answer": answer["answer"],
            "score": answer["score"],
            "context": c
        })

    return results


Device set to use cpu


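The reader model extracted the correct answer, *691,000 bbl/d*, from the context passage with a score of about 0.99. Note that this score is the model's confidence in the extracted span, not an accuracy measure. Let's run a few more queries.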

In [None]:
def extract_answer(question, context):
    """
    Given a question and a list of context strings,
    this function extracts the most relevant answers.
    """
    if context is None or not isinstance(context, list):
        raise ValueError("Context must be a list of strings.")

    results = []
    for c in context:
        if c is None or not isinstance(c, str) or c.strip() == "":
            continue  # Skip invalid or empty contexts
        # Feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        results.append(answer)

    return results


In [None]:
question = "What is Albert Einstein famous for?"
raw_context = get_context(question, top_k=1)

# Ensure context is a list of strings
context = raw_context if isinstance(raw_context, list) else [raw_context]

answers = extract_answer(question, context)

for a in answers:
    print(f"Answer: {a['answer']} (Score: {a['score']:.2f})")


Let's run another question, this time using the top 3 context passages from the retriever.

In [None]:
question = "Who was the first person to step foot on the moon?"

# Get the context (ensure it's a list of strings)
context = get_context(question, top_k=3)

# Ensure the context is a list of strings
if context is None:
    context = []
elif not isinstance(context, list):
    context = [context]  # Convert single string to list

# Now extract the answer
answers = extract_answer(question, context)

for a in answers:
    print(f"Answer: {a['answer']} (Score: {a['score']:.2f})")


The result looks pretty good.

### Add a few more questions. What did you observe?

In [None]:
# List of questions
questions = [
    "Who was the first person to step foot on the moon?",
    "What is Albert Einstein famous for?",
    "What are the first names of the men that invented YouTube?",
    "Where is the Eiffel Tower located?",
    "What is the capital city of France?"
]

# Loop through the questions
for question in questions:
    context = get_context(question, top_k=3)

    # Ensure context is a list of strings
    if context is None:
        context = []
    elif not isinstance(context, list):
        context = [context]

    # Extract answers
    answers = extract_answer(question, context)

    # Print the answers
    print(f"Question: {question}")
    for a in answers:
        print(f"Answer: {a['answer']} (Score: {a['score']:.2f})")
    print("\n" + "-"*50 + "\n")

Question: Who was the first person to step foot on the moon?

--------------------------------------------------

Question: What is Albert Einstein famous for?

--------------------------------------------------

Question: What are the first names of the men that invented YouTube?

--------------------------------------------------

Question: Where is the Eiffel Tower located?

--------------------------------------------------

Question: What is the capital city of France?

--------------------------------------------------

In [None]:
# Delete the index only after all queries are done; it cannot be queried afterwards
pc.delete_index(index_name)

