# **BERT for Question Answering**

This notebook uses a BERT (Bidirectional Encoder Representations from Transformers) model fine-tuned on SQuAD 2.0 to answer questions posed against a context passage. It walks through tokenization, inference, answer extraction, and handling contexts that exceed the model's 512-token limit.

## Setup and Imports

Before we start, install the pinned dependencies, then import the libraries used for modeling, data handling, and visualization.

In [None]:
!pip install transformers[torch]==4.38.2
!pip install datasets==2.13.1
!pip install plotly



In [None]:
import torch
from transformers import (
    BertForQuestionAnswering,
    BertTokenizerFast,
)
from scipy.special import softmax
import plotly.express as px
import pandas as pd
import numpy as np


## Data Preparation

Define the context and question that will be used to demonstrate the BERT model's question-answering capabilities.

In [None]:
context = "The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes were thought to be one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into up to eight extant species due to new research into their mitochondrial and nuclear DNA, as well as morphological measurements. Seven other extinct species of Giraffa are known from the fossil record."
question = "How many giraffe species are there?"


## Model Setup

Load the BERT model and tokenizer designed for question answering tasks.

In [None]:
model_name = "deepset/bert-base-cased-squad2"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)



Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Tokenization for Question Answering

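Tokenize the question and the context together as a single sentence-pair input. The second line of the cell tokenizes the context on its own to show how WordPiece splits rare words into subword pieces (for example, `giraffe` becomes `g`, `##ira`, `##ffe`).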

The result, `inputs`, is a dictionary containing:
- `input_ids`: The token IDs of the tokenized input.
- `token_type_ids`: Segment IDs to distinguish between the question and the context.
- `attention_mask`: Mask to avoid attending to padding tokens.
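
To make the three fields concrete, the sketch below builds them by hand for a question/context pair. It is a toy illustration only: `encode_pair` and its whitespace splitting are hypothetical stand-ins, not the real `BertTokenizerFast`, which uses WordPiece and returns integer IDs. It shows the `[CLS] question [SEP] context [SEP]` layout that BERT uses for sentence pairs, with segment ID 0 for the question and 1 for the context.

```python
def encode_pair(question, context, pad_to=None):
    """Toy layout of a BERT sentence-pair input (whitespace split, not WordPiece)."""
    q_tokens = question.split()
    c_tokens = context.split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    # Real tokens get mask 1; padding (if any) gets mask 0.
    attention_mask = [1] * len(tokens)
    if pad_to is not None and pad_to > len(tokens):
        n_pad = pad_to - len(tokens)
        tokens += ["[PAD]"] * n_pad
        token_type_ids += [0] * n_pad
        attention_mask += [0] * n_pad
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


enc = encode_pair("How tall ?", "Giraffes are tall .", pad_to=12)
for tok, seg, mask in zip(enc["tokens"], enc["token_type_ids"], enc["attention_mask"]):
    print(f"{tok:10} segment={seg} mask={mask}")
```

In the notebook's single-example call no padding is added, so every entry of `attention_mask` is 1.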


In [None]:
inputs = tokenizer(question, context, return_tensors="pt")
tokenizer.tokenize(context)


['The',
 'g',
 '##ira',
 '##ffe',
 'is',
 'a',
 'large',
 'African',
 'ho',
 '##of',
 '##ed',
 'ma',
 '##mmal',
 'belonging',
 'to',
 'the',
 'genus',
 'G',
 '##ira',
 '##ff',
 '##a',
 '.',
 'It',
 'is',
 'the',
 'tallest',
 'living',
 'terrestrial',
 'animal',
 'and',
 'the',
 'largest',
 'r',
 '##umi',
 '##nant',
 'on',
 'Earth',
 '.',
 'Traditionally',
 ',',
 'g',
 '##ira',
 '##ffe',
 '##s',
 'were',
 'thought',
 'to',
 'be',
 'one',
 'species',
 ',',
 'G',
 '##ira',
 '##ff',
 '##a',
 'came',
 '##lop',
 '##ard',
 '##alis',
 ',',
 'with',
 'nine',
 'subspecies',
 '.',
 'Most',
 'recently',
 ',',
 'researchers',
 'proposed',
 'dividing',
 'them',
 'into',
 'up',
 'to',
 'eight',
 'extant',
 'species',
 'due',
 'to',
 'new',
 'research',
 'into',
 'their',
 'mitochondrial',
 'and',
 'nuclear',
 'DNA',
 ',',
 'as',
 'well',
 'as',
 'm',
 '##or',
 '##phological',
 'measurements',
 '.',
 'Seven',
 'other',
 'extinct',
 'species',
 'of',
 'G',
 '##ira',
 '##ff',
 '##a',
 'are',
 'known',
 

In [None]:
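# Run the model without tracking gradients, then convert logits to probabilities
with torch.no_grad():
    outputs = model(**inputs)
# Softmax over the token axis; [0] selects the single example in the batch
start_scores = softmax(outputs.start_logits.numpy(), axis=-1)[0]
end_scores = softmax(outputs.end_logits.numpy(), axis=-1)[0]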
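
For reference, softmax maps a vector of logits to probabilities that sum to 1, so each entry of `start_scores` can be read as the probability that the answer starts at that token. A minimal standard-library sketch with made-up logits (the values are illustrative, not model output, and `softmax_1d` is a hypothetical helper):

```python
import math


def softmax_1d(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-2.0, 0.5, 6.0, 1.0]  # hypothetical start logits for four tokens
probs = softmax_1d(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(best, round(sum(probs), 6))
```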


### Visualization of Token Scores 🎯

Plot the start and end probabilities for every token position to see where the model places the answer span.

In [None]:
# Creating a dataframe with the scores and plotting them
scores_df = pd.DataFrame({
    "Token Position": list(range(len(start_scores))) * 2,
    "Score": list(start_scores) + list(end_scores),
    "Score Type": ["Start"] * len(start_scores) + ["End"] * len(end_scores),
})
px.bar(scores_df, x="Token Position", y="Score", color="Score Type", barmode="group", title="Start and End Scores for Tokens")


### Extracting the Answer

Select the answer span by taking the positions with the highest start and end scores, then decode the tokens between them back into text.

In [None]:
# Getting the answer from the model
start_idx = np.argmax(start_scores)
end_idx = np.argmax(end_scores)
answer_ids = inputs.input_ids[0][start_idx: end_idx + 1]
answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
answer = tokenizer.convert_tokens_to_string(answer_tokens)


In [None]:
start_idx, end_idx

(85, 85)

In [None]:
answer_ids

tensor([2022])

In [None]:
answer_tokens

['eight']

In [None]:
answer

'eight'
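
Taking the two argmaxes independently works here, but nothing forces the end index to come at or after the start index. A common alternative is to search for the pair (start, end) with start <= end that maximizes the product of the two probabilities, optionally capping the span length. The sketch below assumes plain Python lists of probabilities; `best_span` and `max_len` are illustrative names, not part of the notebook or of `transformers`.

```python
def best_span(start_probs, end_probs, max_len=30):
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Made-up scores where independent argmax gives start=3, end=1 (an invalid span).
start_probs = [0.05, 0.30, 0.05, 0.60]
end_probs = [0.05, 0.55, 0.10, 0.30]
span = best_span(start_probs, end_probs)
print(span[0], span[1])
```

The double loop is quadratic in sequence length, which is acceptable for 512 tokens; the `max_len` cap keeps it cheaper still.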

## Advanced Usage: Sliding Windows & Stride for QA 🚀

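Define a function that wraps tokenization, inference, and answer extraction so it can be reused with new contexts and questions. Inputs longer than 512 tokens are truncated, which is what motivates the overlapping chunks built later in this section.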

In [None]:
# Predict the answer to a question given a context
def predict_answer(context, question):
    # Truncate inputs that exceed BERT's 512-token limit
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the token axis; [0] selects the single example in the batch
    start_scores = softmax(outputs.start_logits.numpy(), axis=-1)[0]
    end_scores = softmax(outputs.end_logits.numpy(), axis=-1)[0]
    start_idx = np.argmax(start_scores)
    end_idx = np.argmax(end_scores)
    # Confidence is the mean of the best start and end probabilities
    confidence_score = (start_scores[start_idx] + end_scores[end_idx]) / 2
    answer_ids = inputs.input_ids[0][start_idx: end_idx + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
    answer = tokenizer.convert_tokens_to_string(answer_tokens)
    # An empty span (end before start) or a span that decodes to [CLS] means no answer was found
    if answer and answer != tokenizer.cls_token:
        return answer, confidence_score
    return None, confidence_score



### Evaluating the Function

Test the `predict_answer` function with various questions and contexts.

In [None]:
# Defining a new context and predicting answers to some questions
context = """Coffee is a beverage brewed from roasted coffee beans. Darkly colored, bitter, and slightly acidic, coffee has a stimulating effect on humans, primarily due to its caffeine content. It has the highest sales in the world market for hot drinks.[2]

The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans. The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee. It is usually served hot, although chilled or iced coffee is common. Coffee can be prepared and presented in a variety of ways (e.g., espresso, French press, caffè latte, or already-brewed canned coffee). Sugar, sugar substitutes, milk, and cream are often added to mask the bitter taste or enhance the flavor.

Though coffee is now a global commodity, it has a long history tied closely to food traditions around the Red Sea. The earliest credible evidence of coffee drinking as the modern beverage appears in modern-day Yemen in southern Arabia in the middle of the 15th century in Sufi shrines, where coffee seeds were first roasted and brewed in a manner similar to how it is now prepared for drinking.[3] The coffee beans were procured by the Yemenis from the Ethiopian Highlands via coastal Somali intermediaries, and cultivated in Yemen. By the 16th century, the drink had reached the rest of the Middle East and North Africa, later spreading to Europe.

The two most commonly grown coffee bean types are C. arabica and C. robusta.[4] Coffee plants are cultivated in over 70 countries, primarily in the equatorial regions of the Americas, Southeast Asia, the Indian subcontinent, and Africa. As of 2023, Brazil was the leading grower of coffee beans, producing 35% of the world's total. Green, unroasted coffee is traded as an agricultural commodity. Despite sales of coffee reaching billions of dollars worldwide, farmers producing coffee beans disproportionately live in poverty. Critics of the coffee industry have also pointed to its negative impact on the environment and the clearing of land for coffee-growing and water use. The global coffee industry is massive and worth $495.50 billion as of 2023.[5] Brazil, Vietnam, and Colombia are the top exporters of coffee beans as of 2023. \n Rapid growth in coffee production in South America during the second half of the 19th century was matched by an increase in consumption in developed countries, though nowhere has this growth been as pronounced as in the United States, where a high rate of population growth was compounded by doubling of per capita consumption between 1860 and 1920. Though the United States was not the heaviest coffee-drinking nation at the time (Belgium, the Netherlands and Nordic countries all had comparable or higher levels of per capita consumption), due to its sheer size, it was already the largest consumer of coffee in the world by 1860, and, by 1920, around half of all coffee produced worldwide was consumed in the US.[37]

Coffee has become a vital cash crop for many developing countries. Over one hundred million people in developing countries have become dependent on coffee as their primary source of income. It has become the primary export and economic backbone for African countries like Uganda, Burundi, Rwanda, and Ethiopia,[40] as well as many Central American countries.

"""

len(tokenizer.tokenize(context))
predict_answer(context, "What is coffee?")
predict_answer(context, "What are the most common coffee beans?")
predict_answer(context, "How can I make ice coffee?")
predict_answer(context[4000:], "How many people are dependent on coffee for their income?")


Token indices sequence length is longer than the specified maximum sequence length for this model (714 > 512). Running this sequence through the model will result in indexing errors


(None, 0.9988429546356201)

In [None]:
len(tokenizer.tokenize(context))


714
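
Since 714 tokens exceed the 512-token limit, a single forward pass with `truncation=True` drops the tail of the context. One way to see what a sliding window would cover is to compute the window boundaries directly. The sketch below is an illustration under stated assumptions: `window_bounds` is a hypothetical helper, the overlap of 128 tokens is an arbitrary choice, and it ignores the question and special tokens, which also count toward the limit in practice.

```python
def window_bounds(n_tokens, window, overlap):
    """Return [start, end) index pairs that cover n_tokens with overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:  # this window reached the end of the text
            break
        start += step
    return bounds


bounds = window_bounds(714, window=512, overlap=128)
print(bounds)
```

Under these assumptions two windows are enough, and the 128 tokens they share reduce the chance that an answer is cut in half at a boundary.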

In [None]:
predict_answer(context, "What is coffee?")


('a beverage brewed from roasted coffee beans', 0.814056396484375)

In [None]:
predict_answer(context, "What are the most common coffee beans?")


('C. arabica and C. robusta', 0.9379842877388)

In [None]:
predict_answer(context, "Tell me what do you know about the most used coffee beans?")

('C. arabica and C. robusta', 0.46422529220581055)


In [None]:
predict_answer(context, "How can I make ice coffee?")


(None, 0.9901087284088135)

In [None]:
predict_answer(context[4000:], "How many people are dependent on coffee for their income?")


(None, 0.9988429546356201)

In [None]:
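# Split a list of sentences into overlapping chunks
def chunk_sentences(sentences, chunk_size, stride):
    """Return chunks of up to `chunk_size` sentences; consecutive chunks share `stride` sentences."""
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")
    chunks = []
    num_sentences = len(sentences)
    # Advance by chunk_size - stride so that each chunk overlaps the previous one
    for i in range(0, num_sentences, chunk_size - stride):
        chunk = sentences[i: i + chunk_size]
        chunks.append(chunk)
    return chunks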

In [None]:
# Split the context on newlines; each entry is a paragraph (or an empty line), not a single sentence
sentences = context.split("\n")
chunked_sentences = chunk_sentences(sentences, chunk_size=3, stride=1)


questions = [
    "What is coffee?",
    "What are the most common coffee beans?",
    "How can I make ice coffee?",
    "How many people are dependent on coffee for their income?",
]

# Best (answer, score) found so far for each question
answers = {}

for chunk in chunked_sentences:
    sub_context = "\n".join(chunk)
    for question in questions:
        answer, score = predict_answer(sub_context, question)
        # Keep the highest-scoring non-empty answer across all chunks
        if answer and (question not in answers or score > answers[question][1]):
            answers[question] = (answer, score)

print(answers)

{'What is coffee?': ('a beverage brewed from roasted coffee beans', 0.9282469749450684), 'What are the most common coffee beans?': ('C. arabica and C. robusta', 0.9481030106544495), 'How many people are dependent on coffee for their income?': ('Over one hundred million', 0.8877787590026855)}
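
The loop above keeps, for each question, the highest-scoring non-empty answer seen across all chunks. That is why "How can I make ice coffee?" is absent from the result: every chunk returned `None` for it. The same bookkeeping can be sketched without the model; the `(question, answer, score)` tuples below are invented for illustration and `merge_best` is a hypothetical helper.

```python
def merge_best(predictions):
    """Keep the highest-scoring non-empty answer per question.

    predictions: iterable of (question, answer_or_None, score) tuples,
    one per (chunk, question) pair.
    """
    best = {}
    for question, answer, score in predictions:
        if not answer:  # skip chunks where no answer was found
            continue
        if question not in best or score > best[question][1]:
            best[question] = (answer, score)
    return best


# Invented per-chunk results standing in for predict_answer output.
fake_predictions = [
    ("What is coffee?", "a beverage", 0.81),
    ("What is coffee?", None, 0.99),
    ("What is coffee?", "a beverage brewed from roasted coffee beans", 0.93),
    ("How can I make ice coffee?", None, 0.99),
]
merged = merge_best(fake_predictions)
print(merged)
```

Note that the scores attached to `None` results are discarded rather than compared with answer scores, so a confident "no answer" from one chunk never overrides an answer found in another.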


In [None]:
sentences = [
    "Coffee is a beverage brewed from roasted coffee beans.",
    "Darkly colored, bitter, and slightly acidic, coffee has a stimulating effect on humans, primarily due to its caffeine content.",
    "It has the highest sales in the world market for hot drinks.",
    "The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans.",
    "The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee.",
    "Rapid growth in coffee production in South America during the second half of the 19th century was matched by an increase in consumption in developed countries, though nowhere has this growth been as pronounced as in the United States, where a high rate of population growth was compounded by doubling of per capita consumption between 1860 and 1920.",
    "Though the United States was not the heaviest coffee-drinking nation at the time (Belgium, the Netherlands and Nordic countries all had comparable or higher levels of per capita consumption), due to its sheer size, it was already the largest consumer of coffee in the world by 1860, and, by 1920, around half of all coffee produced worldwide was consumed in the US.",
    "Coffee has become a vital cash crop for many developing countries.",
    "Over one hundred million people in developing countries have become dependent on coffee as their primary source of income.",
    "It has become the primary export and economic backbone for African countries like Uganda, Burundi, Rwanda, and Ethiopia, as well as many Central American countries."
]


In [None]:
chunked_sentences = chunk_sentences(sentences, chunk_size=3, stride=1)
chunked_sentences

[['Coffee is a beverage brewed from roasted coffee beans.',
  'Darkly colored, bitter, and slightly acidic, coffee has a stimulating effect on humans, primarily due to its caffeine content.',
  'It has the highest sales in the world market for hot drinks.'],
 ['It has the highest sales in the world market for hot drinks.',
  "The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans.",
  'The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee.'],
 ['The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee.',
  'Rapid growth in coffee production in South America during the second half of the 19th century was matched by an increase in consumption in developed countries, though nowhere has this growth been as pronounced as in the United States, where a high rate of population growth wa

In [None]:
sentences = [
    "Sentence 1.",
    "Sentence 2.",
    "Sentence 3.",
    "Sentence 4.",
    "Sentence 5.",
    "Sentence 6.",
    "Sentence 7.",
    "Sentence 8.",
    "Sentence 9.",
    "Sentence 10."
]

chunked_sentences = chunk_sentences(sentences, chunk_size=3, stride=1)

In [None]:
chunked_sentences

[['Sentence 1.', 'Sentence 2.', 'Sentence 3.'],
 ['Sentence 3.', 'Sentence 4.', 'Sentence 5.'],
 ['Sentence 5.', 'Sentence 6.', 'Sentence 7.'],
 ['Sentence 7.', 'Sentence 8.', 'Sentence 9.'],
 ['Sentence 9.', 'Sentence 10.']]
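
With `chunk_size=3` and `stride=1`, the window advances by `chunk_size - stride = 2` sentences, so consecutive chunks share exactly one sentence. One edge case worth knowing: when the final start index lands on a sentence already covered by the previous chunk, the last chunk adds nothing new. The sketch below restates the notebook's chunking logic so that it runs on its own and checks both properties on placeholder sentences; `is_redundant_tail` is a hypothetical helper.

```python
def chunk_sentences(sentences, chunk_size, stride):
    # Same logic as the notebook's function: step forward by chunk_size - stride.
    return [sentences[i:i + chunk_size]
            for i in range(0, len(sentences), chunk_size - stride)]


def is_redundant_tail(chunks):
    """True if every sentence of the last chunk already appears in the one before it."""
    return len(chunks) > 1 and all(s in chunks[-2] for s in chunks[-1])


ten = [f"Sentence {i}." for i in range(1, 11)]
nine = ten[:9]

chunks_ten = chunk_sentences(ten, chunk_size=3, stride=1)
chunks_nine = chunk_sentences(nine, chunk_size=3, stride=1)

# Count the sentences shared by each pair of consecutive chunks.
overlaps = [len(set(a) & set(b)) for a, b in zip(chunks_ten, chunks_ten[1:])]
print(overlaps)
print(is_redundant_tail(chunks_ten), is_redundant_tail(chunks_nine))
```

A redundant tail costs one extra model call per question but does not change the best answer, so dropping it is an optimization rather than a correctness fix.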