# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for notebook to run:

In [1]:
import os

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
PINECONE_API_KEY= "pcsk_6vPgNJ_HUk7bSjErBvshxnRxbaNxUgTfF14i4Mja2VSqy6D1CScrkiRsTfrwEBctq8tgKB"

In [2]:
pip install python-dotenv



# Install Dependencies

In [3]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQUAD dataset from the HuggingFace Model Hub. We load the dataset into a pandas dataframe and filter the title, question, and context columns, and we drop any duplicate context passages.

In [4]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
# select only title and context column
df = df[['title', 'context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=['context'])
df

Unnamed: 0,title,context
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
5,University_of_Notre_Dame,"As at most other universities, Notre Dame's st..."
10,University_of_Notre_Dame,The university is the major seat of the Congre...
15,University_of_Notre_Dame,The College of Engineering was established in ...
20,University_of_Notre_Dame,All of Notre Dame's undergraduate students are...
...,...,...
87574,Kathmandu,"Institute of Medicine, the central college of ..."
87579,Kathmandu,Football and Cricket are the most popular spor...
87584,Kathmandu,The total length of roads in Nepal is recorded...
87589,Kathmandu,The main international airport serving Kathman...


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages which we can retrieve using another vector (query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key]("https://app.pinecone.io/"), and then we initialize the connection like so:

In [6]:
!pip install -qU langchain-pinecone pinecone-notebooks

[0m

In [7]:
import pinecone
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to pinecone environment
pc = Pinecone(
    api_key = PINECONE_API_KEY,
    environment='us-east-1'  # find next to API key in console
)

In [8]:
!pip install pinecone-client



In [9]:
!pip install -U pinecone-client



Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [10]:
index_name = "extractive-question-answering"

# check if the "extractive-question-answering" index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,  # dimension of the embedding
        metric="cosine",
        spec=spec
    )

# connect to the "extractive-question-answering" index we created
index = pc.Index(index_name)

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [11]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1').to(device)


 #use the 'multi-qa-MiniLM-L6-cos-v1' model from HuggingFace to build the retriever
retriever



SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [12]:
pc.describe_index("extractive-question-answering")['dimension']

384

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.

In [13]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = i + batch_size
    # extract batch
    batch = df.iloc[i:end]
    # generate embeddings for batch
    emb = retriever.encode(batch["context"].tolist(), device=device).tolist()
    # get metadata
    meta = [{"title": t, "context": c} for t, c in zip(batch["title"], batch["context"])]
    # create unique IDs
    ids = [str(j) for j in range(i, end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}

In [14]:
# delete the old index
pc.delete_index(index_name)

# re-create index with correct dimension
pc.create_index(
    name=index_name,
    dimension=384,  # match the model output

    metric="cosine",
    spec=spec
)

# re-connect to the new index
index = pc.Index(index_name)

# Initialize Reader

In [15]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = i + batch_size
    # extract batch
    batch = df.iloc[i:end]
    # generate embeddings for batch
    emb = retriever.encode(batch["context"].tolist(), device=device).tolist()
    # get metadata
    meta = [{"title": t, "context": c} for t, c in zip(batch["title"], batch["context"])]
    # create unique IDs
    ids = [str(j) for j in range(i, end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [16]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7b977e0afd50>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context embeddings containing answers to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [17]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode([question]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq[0], top_k=top_k, include_metadata=True)
    # extract the context passage from pinecone search result
    c = xc['matches'][0]['metadata']['context']
    return c

In [18]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the result based on the score from reader model
    sorted_result = pprint(sorted(results, key=lambda x: x['score'], reverse=True))
    return sorted_result

In [19]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.'

As we can see, the retiever is working fine and gets us the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [20]:
extract_answer(question, context)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'answer': 'E',
  'context': 'E',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'E',
  'context': 'E',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689300537,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500016689

The reader model predicted with 99% accuracy the correct answer *691,000 bbl/d* as seen from the context passage. Let's run few more queries.

In [21]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)
context

[{'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999965727329254,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.

'According to a story that has often been repeated in the media, Hurley and Chen developed the idea for YouTube during the early months of 2005, after they had experienced difficulty sharing videos that had been shot at a dinner party at Chen\'s apartment in San Francisco. Karim did not attend the party and denied that it had occurred, but Chen commented that the idea that YouTube was founded after a dinner party "was probably very strengthened by marketing ideas around creating a story that was very digestible".'

In [22]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)
context

[{'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'end': 1,
  'score': 5.621192826765764e-07,
  'start': 0},
 {'answer': 'i',
  'context': 'i',
  'en

'Albert Einstein is known for his theories of special relativity and general relativity. He also made important contributions to statistical mechanics, especially his mathematical treatment of Brownian motion, his resolution of the paradox of specific heats, and his connection of fluctuations and dissipation. Despite his reservations about its interpretation, Einstein also made contributions to quantum mechanics and, indirectly, quantum field theory, primarily through his theoretical studies of the photon.'

Let's run another question. This time for top 3 context passages from the retriever.

In [23]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.24999800324440002,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.

In [24]:
question = "Who is the inventor of the turbo jet engine?"
context = get_context(question, top_k=3)
extract_answer(question, context)
context

[{'answer': 'E',
  'context': 'E',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437904358,
  'start': 0},
 {'answer': 'e',
  'context': 'e',
  'end': 1,
  'score': 0.2500024437

"GE's history of working with turbines in the power-generation field gave them the engineering know-how to move into the new field of aircraft turbosuperchargers.[citation needed] Led by Sanford Alexander Moss, GE introduced the first superchargers during World War I, and continued to develop them during the Interwar period. Superchargers became indispensable in the years immediately prior to World War II, and GE was the world leader in exhaust-driven supercharging when the war started. This experience, in turn, made GE a natural selection to develop the Whittle W.1 jet engine that was demonstrated in the United States in 1941. GE ranked ninth among United States corporations in the value of wartime production contracts. Although their early work with Whittle's designs was later handed to Allison Engine Company, GE Aviation emerged as one of the world's largest engine manufacturers, second only to the British company, Rolls-Royce plc."

In [25]:
question = "who is youngest female famous pilot?"
context = get_context(question, top_k=3)
extract_answer(question, context)
context

[{'answer': ')',
  'context': ')',
  'end': 1,
  'score': 1.7390034656727948e-07,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'c',
  'context': 'c',
  'end': 1,
  'score': 4.26619095605929e-09,
  'start': 0},
 {'answer': 'C',
  'context': 'C',
  'end': 1,
 

'Bell was a supporter of aerospace engineering research through the Aerial Experiment Association (AEA), officially formed at Baddeck, Nova Scotia, in October 1907 at the suggestion of his wife Mabel and with her financial support after the sale of some of her real estate. The AEA was headed by Bell and the founding members were four young men: American Glenn H. Curtiss, a motorcycle manufacturer at the time and who held the title "world\'s fastest man", having ridden his self-constructed motor bicycle around in the shortest time, and who was later awarded the Scientific American Trophy for the first official one-kilometre flight in the Western hemisphere, and who later became a world-renowned airplane manufacturer; Lieutenant Thomas Selfridge, an official observer from the U.S. Federal government and one of the few people in the army who believed that aviation was the future; Frederick W. Baldwin, the first Canadian and first British subject to pilot a public flight in Hammondsport, N

In [26]:
question = "who is the first Saudi female astronaut?"
context = get_context(question, top_k=3)
extract_answer(question, context)
context

[{'answer': ')',
  'context': ')',
  'end': 1,
  'score': 1.239100839711682e-07,
  'start': 0},
 {'answer': ')',
  'context': ')',
  'end': 1,
  'score': 1.239100839711682e-07,
  'start': 0},
 {'answer': 'S',
  'context': 'S',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's',
  'end': 1,
  'score': 4.6173720846809374e-08,
  'start': 0},
 {'answer': 's',
  'context': 's

"The Soviet Union duplicated its dual-launch feat with Vostok 5 and Vostok 6 (June 16, 1963). This time they launched the first woman (also the first civilian), Valentina Tereshkova, into space on Vostok 6. Launching a woman was reportedly Korolev's idea, and it was accomplished purely for propaganda value. Tereshkova was one of a small corps of female cosmonauts who were amateur parachutists, but Tereshkova was the only one to fly. The USSR didn't again open its cosmonaut corps to women until 1980, two years after the United States opened its astronaut corps to women."

The result looks pretty good.

In [27]:
#pc.delete_index(index_name)

### Add a few more questions. What did you observe?

 - hellucanition in the last query at high level XDD



 

