<a href="https://colab.research.google.com/github/IshuDhana/lab-extractive-question-answering/blob/main/lab_extractive_question_answering_checked.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading the retriever model and the API keys the notebook needs; the packages themselves are installed in the next section:

In [1]:
import torch
from sentence_transformers import SentenceTransformer
# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load a small, lightweight model
retriever = SentenceTransformer("all-MiniLM-L6-v2", device=device)
print("Retriever loaded successfully on", device)

Retriever loaded successfully on cuda


In [13]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')

In [14]:
from google.colab import userdata
import os

# Retrieve the API keys
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')

# Set as environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY

print("OpenAI API key loaded and set as environment variable.")

OpenAI API key loaded and set as environment variable.
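Both cells above assume the keys exist. `os.getenv` returns `None` when a variable is missing, and the failure then surfaces much later, at the first API call. A minimal sketch of a stricter lookup follows; the helper name `get_api_key` and the variable `DEMO_API_KEY` are hypothetical and only serve the illustration.

```python
import os


def get_api_key(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to a .env file or to the Colab secrets"
        )
    return value


# Demo with a hypothetical variable so that no real key is needed.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = get_api_key("DEMO_API_KEY")
print("DEMO_API_KEY loaded:", bool(demo_key))
```

In the notebook, the same helper could be called with `'PINECONE_API_KEY'` after the secrets have been copied into `os.environ`.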


# Install Dependencies

In [3]:
!pip install -qU datasets pinecone-client

In [4]:
!pip install sentence-transformers torch



# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep only the title and context columns, and drop any duplicate context passages.

In [5]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [6]:
# select only title and context column
df = df[['title', 'context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=['context'])
df

| | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 5 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 10 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 15 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 20 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
| ... | ... | ... |
| 87574 | Kathmandu | Institute of Medicine, the central college of ... |
| 87579 | Kathmandu | Football and Cricket are the most popular spor... |
| 87584 | Kathmandu | The total length of roads in Nepal is recorded... |
| 87589 | Kathmandu | The main international airport serving Kathman... |


In [7]:
df = df.sample(5000)

# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [8]:
!pip install -qU langchain-pinecone pinecone-notebooks

In [15]:
from pinecone import Pinecone, ServerlessSpec
spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)
# connect to Pinecone; a serverless index takes its cloud and region from the spec
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we create a new index called "question-answering"; we can name the index anything we want. We set the metric to "cosine" and the dimension to 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors.

In [20]:
index_name = "question-answering"

# check if the index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,  # embedding size of both 'multi-qa-MiniLM-L6-cos-v1' and 'all-MiniLM-L6-v2'
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# connect to the index we created
index = pc.Index(index_name)
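The index dimension has to match the length of every vector we later upsert. A minimal sketch of a local check that can run before any upload is shown below; the helper `check_dimensions` and the toy vectors are hypothetical, not part of the Pinecone client.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )
    return True


INDEX_DIMENSION = 384  # same value that is passed to create_index

# Toy vectors standing in for real embeddings.
good_batch = [[0.0] * 384, [0.1] * 384]
print(check_dimensions(good_batch, INDEX_DIMENSION))
```

Running such a check on the first batch of embeddings catches a mismatched retriever model before any data reaches the index.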

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
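To make the idea concrete, here is a small sketch of cosine similarity and top-k ranking in plain Python. The 3-dimensional vectors and passage labels are made up for illustration (real embeddings have 384 dimensions), and the sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, contexts, k):
    """Return the k (score, label) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in contexts.items()]
    return sorted(scored, reverse=True)[:k]


# Toy vectors: passage_a points the same way as the query,
# passage_b is orthogonal to it, passage_c overlaps partially.
query = [1.0, 0.0, 1.0]
contexts = {
    "passage_a": [1.0, 0.0, 1.0],
    "passage_b": [0.0, 1.0, 0.0],
    "passage_c": [1.0, 1.0, 0.0],
}
ranked = top_k(query, contexts, k=2)
print(ranked)
```

The vector index performs the same kind of ranking, only over many more vectors and with data structures built for fast search.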

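The lab suggests a SentenceTransformer model named `multi-qa-MiniLM-L6-cos-v1`, which is designed for semantic search and trained on 215M (question, answer) pairs from diverse sources. The cells in this notebook load `all-MiniLM-L6-v2` instead, a general-purpose sentence embedding model that also outputs 384-dimensional vectors, so it is compatible with the index we created.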

In [21]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# name of the retriever model on the Hugging Face Hub
# (the lab suggests 'multi-qa-MiniLM-L6-cos-v1'; this run uses 'all-MiniLM-L6-v2')
retriever_name = 'all-MiniLM-L6-v2'
retriever_name

'all-MiniLM-L6-v2'

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.

In [22]:
retriever = SentenceTransformer('all-MiniLM-L6-v2', device=device)

In [23]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch['context'].tolist()).tolist()
    # get metadata
    meta = batch[['title', 'context']].to_dict('records')
    # create unique IDs
    ids = [f"id_{j}" for j in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 5000}},
 'total_vector_count': 5000,
 'vector_type': 'dense'}
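The batching pattern used in the upsert loop can be tried without a model or an index. The sketch below builds the same `(id, embedding, metadata)` tuples from plain dictionaries; `build_upsert_batches` and `fake_embed` are hypothetical names, and `fake_embed` is only a stand-in for `retriever.encode`.

```python
def build_upsert_batches(records, embed, batch_size):
    """Yield lists of (id, embedding, metadata) tuples, one list per batch."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # one embedding per context passage in the batch
        embeddings = embed([record["context"] for record in batch])
        # IDs follow the position of each record in the full list
        ids = [f"id_{start + offset}" for offset in range(len(batch))]
        yield list(zip(ids, embeddings, batch))


def fake_embed(texts):
    # Hypothetical stand-in for the retriever: one 2-dimensional vector per text.
    return [[float(len(text)), 0.0] for text in texts]


records = [{"title": f"t{i}", "context": "x" * (i + 1)} for i in range(5)]
batches = list(build_upsert_batches(records, fake_embed, batch_size=2))
print([len(batch) for batch in batches])
```

Because the last batch may be shorter than `batch_size`, the IDs are derived from the actual batch length rather than from a fixed step.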

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [24]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7e4910e2b200>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most similar to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [25]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode(question).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # extract the context passage from pinecone search result
    c = [x['metadata']['context'] for x in xc['matches']]
    return c
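`get_context` discards the similarity score that comes back with each match. A small sketch of a variant that keeps the score and drops weak matches is shown below. It runs on a hand-written dictionary shaped like the query result used above (`matches`, `score`, `metadata`); the values and the `min_score` threshold are made up for illustration.

```python
def contexts_with_scores(response, min_score=0.0):
    """Return (score, context) pairs for matches scoring at least min_score."""
    pairs = []
    for match in response["matches"]:
        if match["score"] >= min_score:
            pairs.append((match["score"], match["metadata"]["context"]))
    return pairs


# Hypothetical response with two matches; not real query output.
mock_response = {
    "matches": [
        {"id": "id_0", "score": 0.82,
         "metadata": {"title": "A", "context": "first passage"}},
        {"id": "id_1", "score": 0.31,
         "metadata": {"title": "B", "context": "second passage"}},
    ]
}
pairs = contexts_with_scores(mock_response, min_score=0.5)
print(pairs)
```

Looking at the scores helps to tell a weak retrieval apart from a weak reader when an answer turns out to be wrong.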

In [26]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader's score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # print the results, then return them (pprint itself returns None)
    pprint(sorted_result)
    return sorted_result

In [28]:
# Test it with a question
question = "What is machine learning?"
contexts = get_context(question, top_k=3)  # Get relevant contexts
answers = extract_answer(question, contexts)  # Extract answers

[{'answer': 'produce qualified professionals that can apply their knowledge '
            'and skills',
  'context': 'Its mission is to provide high quality education, training and '
             'research in the areas of science and technology to produce '
             'qualified professionals that can apply their knowledge and '
             "skills in the country's development.",
  'end': 187,
  'score': 9.082054219788915e-09,
  'start': 114},
 {'answer': 'any computer with a minimum capability',
  'context': 'The ability to store and execute lists of instructions called '
             'programs makes computers extremely versatile, distinguishing '
             'them from calculators. The Church–Turing thesis is a '
             'mathematical statement of this versatility: any computer with a '
             'minimum capability (being Turing-complete) is, in principle, '
             'capable of performing the same tasks that any other computer can '
             'perform. Therefore,

In [29]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

["Egypt's economy depends mainly on agriculture, media, petroleum imports, natural gas, and tourism; there are also more than three million Egyptians working abroad, mainly in Saudi Arabia, the Persian Gulf and Europe. The completion of the Aswan High Dam in 1970 and the resultant Lake Nasser have altered the time-honored place of the Nile River in the agriculture and ecology of Egypt. A rapidly growing population, limited arable land, and dependence on the Nile all continue to overtax resources and stress the economy."]

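The retriever returns the passage it scores as most similar to the question. In this run only a random sample of 5,000 passages was indexed, and the passage returned here mentions petroleum but does not state Egypt's daily oil production. Now let's use the reader to try to extract an answer from the context passage.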

In [30]:
extract_answer(question, context)

[{'answer': 'on',
  'context': "Egypt's economy depends mainly on agriculture, media, petroleum "
             'imports, natural gas, and tourism; there are also more than '
             'three million Egyptians working abroad, mainly in Saudi Arabia, '
             'the Persian Gulf and Europe. The completion of the Aswan High '
             'Dam in 1970 and the resultant Lake Nasser have altered the '
             'time-honored place of the Nile River in the agriculture and '
             'ecology of Egypt. A rapidly growing population, limited arable '
             'land, and dependence on the Nile all continue to overtax '
             'resources and stress the economy.',
  'end': 33,
  'score': 0.010330144315958023,
  'start': 31}]


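In this run the reader returned the span "on" with a score of about 0.01, far from a confident prediction. The answer this lab expects, *691,000 bbl/d*, does not appear in the retrieved passage, so the reader has nothing correct to extract. Let's run a few more queries.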
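Since the reader always returns some span, it helps to discard low-scoring answers. Here is a minimal sketch, assuming result dictionaries with the `answer` and `score` keys shown in the outputs; the threshold of 0.5 and the second result are hypothetical.

```python
def confident_answers(results, min_score=0.5):
    """Sort results by score (highest first) and keep those at or above min_score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r for r in ranked if r["score"] >= min_score]


results = [
    # score taken from the output of the Egypt query
    {"answer": "on", "score": 0.010330144315958023},
    # hypothetical high-confidence result, added only for contrast
    {"answer": "hypothetical confident span", "score": 0.93},
]
kept = confident_answers(results)
print([r["answer"] for r in kept])
```

An empty list then signals that no retrieved passage yielded a trustworthy answer, which is more informative than a span such as "on".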

In [31]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Marc Andreessen',
  'context': 'In 1993, browser software was further innovated by Marc '
             'Andreessen with the release of Mosaic, "the world\'s first '
             'popular browser", which made the World Wide Web system easy to '
             "use and more accessible to the average person. Andreesen's "
             'browser sparked the internet boom of the 1990s. The introduction '
             'of Mosaic in 1993 – one of the first graphical web browsers – '
             'led to an explosion in web use. Andreessen, the leader of the '
             'Mosaic team at National Center for Supercomputing Applications '
             '(NCSA), soon started his own company, named Netscape, and '
             'released the Mosaic-influenced Netscape Navigator in 1994, which '
             "quickly became the world's most popular browser, accounting for "
             '90% of all web use at its peak (see usage share of web '
             'browsers).',
  'end': 66,
  'sco

In [32]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Industrial Prince"',
  'context': 'In October 1919, Albert went up to Trinity College, Cambridge, '
             'where he studied history, economics and civics for a year. On 4 '
             'June 1920, he was created Duke of York, Earl of Inverness and '
             'Baron Killarney. He began to take on more royal duties. He '
             'represented his father, and toured coal mines, factories, and '
             'railyards. Through such visits he acquired the nickname of the '
             '"Industrial Prince". His stammer, and his embarrassment over it, '
             'together with his tendency to shyness, caused him to appear much '
             'less impressive than his older brother, Edward. However, he was '
             'physically active and enjoyed playing tennis. He played at '
             "Wimbledon in the Men's Doubles with Louis Greig in 1926, losing "
             'in the first round. He developed an interest in working '
             'conditions, an

Let's run another question. This time we retrieve the top 3 context passages from the retriever.

In [33]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'answer': 'Yuri Gagarin',
  'context': 'By 1959, American observers believed that the Soviet Union would '
             'be the first to get a human into space, because of the time '
             "needed to prepare for Mercury's first launch. On April 12, 1961, "
             'the USSR surprised the world again by launching Yuri Gagarin '
             'into a single orbit around the Earth in a craft they called '
             'Vostok 1. They dubbed Gagarin the first cosmonaut, roughly '
             'translated from Russian and Greek as "sailor of the universe". '
             'Although he had the ability to take over manual control of his '
             'spacecraft in an emergency by opening an envelope he had in the '
             'cabin that contained a code that could be typed into the '
             'computer, it was flown in an automatic mode as a precaution; '
             'medical science at that time did not know what would happen to a '
             'human in the weightless

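In this run the top-ranked answer is "Yuri Gagarin", taken from a passage about the first human spaceflight, so the question about the Moon is not answered correctly.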

In [35]:
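# delete the index to free resources; run this only after all queries, including the extra questions
pc.delete_index(index_name)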

### Add a few more questions. What did you observe?
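One way to approach this exercise is to loop over several questions and keep only the best-scoring answer for each. The sketch below takes the retrieval and reading steps as parameters so that it runs without Pinecone or a model; `run_questions`, `stub_retrieve`, and `stub_read` are hypothetical names introduced for the illustration.

```python
def run_questions(questions, retrieve, read, top_k=3):
    """Return {question: best result dict, or None if nothing was retrieved}."""
    summary = {}
    for question in questions:
        contexts = retrieve(question, top_k)
        # attach each context to the reader's result so both can be inspected
        results = [dict(read(question=question, context=c), context=c) for c in contexts]
        summary[question] = max(results, key=lambda r: r["score"]) if results else None
    return summary


def stub_retrieve(question, top_k):
    # Stand-in for get_context: returns fixed passages.
    return ["passage one", "passage two"][:top_k]


def stub_read(question, context):
    # Stand-in for the reader pipeline: scores "passage two" higher.
    return {"answer": context.split()[-1], "score": 0.9 if "two" in context else 0.1}


summary = run_questions(["toy question?"], stub_retrieve, stub_read, top_k=2)
print(summary["toy question?"]["answer"])
```

In the notebook, `get_context` and `reader` should fit the `retrieve` and `read` parameters, since they are called with the same arguments in the cells above.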

In [34]:
# Test it with a question
question = "What is RAG?"
contexts = get_context(question, top_k=3)  # Get relevant contexts
answers = extract_answer(question, contexts)  # Extract answers

[{'answer': 'Cwarmê is a pure walloon',
  'context': 'The Carnival of Malmedy is locally called Cwarmê. Even if '
             'Malmedy is located in the east Belgium, near the German-speaking '
             'area, the Cwarmê is a pure walloon and Latin carnival. The '
             'celebration takes place during 4 days before the Shrove Tuesday. '
             'The Cwarmê Sunday is the most important and insteresting to see. '
             'All the old traditional costumes parade in the street. The '
             'Cwarmê is a "street carnival" and is not only a parade. People '
             'who are disguised pass through the crowd and perform a part of '
             'the traditional costume they wear. The famous traditional '
             'costumes at the Cwarmê of Malmedy are the Haguète, the '
             'Longuès-Brèsses and the Long-Né.',
  'end': 157,
  'score': 7.565150728083836e-09,
  'start': 133},
 {'answer': 'Wool is considered as pure and is used as a ritual cloth',
  'c