
# Question Answering with a Fine-Tuned BERT
*by Chris McCormick*

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB

In [2]:
import torch

In [3]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [6]:
question = "what was the ambition?"
answer_text = "(1) I had ambition not only to go further than any man had been before; but as far as it was possible for man to go, wrote James Cook, the explorer, who added Australia and New Zealand to the BritishEmpire.(ii) James Cook, the father of Antarctic exploration, was born in Marton village, Cleveland, on October 28, 1728. From his boyhood, he was interested in seafaring. One day the lad made up his mind; he, too, was going to sea in order to visit glamorous lands. At the age of twenty-seven, Cook had risen to the position of first mate. The first service Cook saw was in Canada, where he was employed in the dangerous task of surveying the St. Lawrence Island. (iii) When Cook, on August 25, 1768, with a company of eighty-three men (including a party of scientists, among whom was the great Sir Joseph Banks) set sail in the Endeavour, they had before them the possibility of filling in a substantial area of the globe's surface. They reached Tahiti in the spring of 1769. Cook sailed south on his quest for the unknown continent, and skirting the Society Islands, at length reached New Zealand. The tattooed natives met them. Cook greeted these Maori warriors with friendly signs and eventually prevailed on them to lay down their spears in sign of truce. After circumnavigating the North and South Islands, Cook surveyed the coastline and landed at Queen Charlotte's Sound. He then hoisted the Union Jack and informed his company that he had taken possession of the islands on behalf of His Majesty George the Third."

We'll need to run the BERT tokenizer against both the `question` and the `answer_text`. To feed these into BERT, we concatenate them into a single sequence and place the special `[SEP]` token between them.
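
As a rough illustration of that layout, the sketch below assembles a text pair by hand. The whitespace `split()` is only a stand-in for BERT's WordPiece tokenizer, and `build_pair` is a hypothetical helper, not part of `transformers`. It also shows the `[CLS]` token at the front and the second `[SEP]` that closes the passage.

```python
def build_pair(question_tokens, passage_tokens):
    """Lay out a question/passage pair as one BERT-style sequence."""
    # [CLS] question [SEP] passage [SEP]
    return ['[CLS]'] + question_tokens + ['[SEP]'] + passage_tokens + ['[SEP]']

# Toy tokenization by whitespace (an assumption for illustration only).
question_tokens = "what was the ambition ?".split()
passage_tokens = "i had ambition to go further".split()

pair = build_pair(question_tokens, passage_tokens)
print(pair)
```

In the notebook itself, `tokenizer.encode(question, answer_text)` does this assembly for us and returns token IDs instead of strings.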


In [7]:
# Apply the tokenizer to the input text, treating them as a text-pair.
input_ids = tokenizer.encode(question, answer_text)

print('The input has a total of {:} tokens.'.format(len(input_ids)))

The input has a total of 343 tokens.


In [8]:
# BERT only needs the token IDs, but for the purpose of inspecting the 
# tokenizer's behavior, let's also get the token strings and display them.
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# For each token and its ID...
for token, token_id in zip(tokens, input_ids):

    # If this is the [SEP] token, add some space around it to make it stand out.
    if token_id == tokenizer.sep_token_id:
        print('')

    # Print the token string and its ID in two columns.
    print('{:<12} {:>6,}'.format(token, token_id))

    if token_id == tokenizer.sep_token_id:
        print('')
    

[CLS]           101
what          2,054
was           2,001
the           1,996
ambition     16,290
?             1,029

[SEP]           102

(             1,006
1             1,015
)             1,007
i             1,045
had           2,018
ambition     16,290
not           2,025
only          2,069
to            2,000
go            2,175
further       2,582
than          2,084
any           2,151
man           2,158
had           2,018
been          2,042
before        2,077
;             1,025
but           2,021
as            2,004
far           2,521
as            2,004
it            2,009
was           2,001
possible      2,825
for           2,005
man           2,158
to            2,000
go            2,175
,             1,010
wrote         2,626
james         2,508
cook          5,660
,             1,010
the           1,996
explorer     10,566
,             1,010
who           2,040
added         2,794
australia     2,660
and           1,998
new           2,047
zealand       3,41

In [9]:
# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1

# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b

# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)
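
To make the segment logic easier to check in isolation, here is the same computation wrapped in a small function and run on a short, hand-written ID list. The name `make_segment_ids` is hypothetical; the IDs are taken from the token listing above (101 is `[CLS]`, 102 is `[SEP]`).

```python
def make_segment_ids(input_ids, sep_token_id):
    """Return 0 for every token up to and including the first [SEP], 1 after it."""
    sep_index = input_ids.index(sep_token_id)
    num_seg_a = sep_index + 1
    num_seg_b = len(input_ids) - num_seg_a
    return [0] * num_seg_a + [1] * num_seg_b

# [CLS] what was [SEP] i had [SEP]
toy_ids = [101, 2054, 2001, 102, 1045, 2018, 102]
print(make_segment_ids(toy_ids, sep_token_id=102))
# prints [0, 0, 0, 0, 1, 1, 1]
```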

In [10]:
# Run our example through the model. With transformers 3.0.2 (installed above),
# the model returns the start and end scores as a tuple.
start_scores, end_scores = model(
    torch.tensor([input_ids]),                   # The tokens representing our input text.
    token_type_ids=torch.tensor([segment_ids]),  # The segment IDs to differentiate question from answer_text.
)


In [11]:
# Find the tokens with the highest `start` and `end` scores.
answer_start = torch.argmax(start_scores)
answer_end = torch.argmax(end_scores)

# Combine the tokens in the answer and print it out.
answer = ' '.join(tokens[answer_start:answer_end+1])

print('Answer: "' + answer + '"')

Answer: "to go further than any man had been before ; but as far as it was possible for man to go"
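
The two `argmax` calls above are independent, so nothing guarantees that the end index lands at or after the start index. One possible safeguard, sketched here as an assumption and not something this notebook does, is to score every valid `(start, end)` pair and keep the best one. The function name `best_span` and the `max_len` cap are hypothetical, and plain Python lists stand in for the score tensors.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return the (start, end) pair with the highest summed score, with start <= end."""
    best = (0, 0)
    best_score = float('-inf')
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy scores where independent argmax would give start=3, end=1.
start = [0.1, 0.2, 0.3, 5.0, 0.1]
end = [0.1, 6.0, 0.2, 0.3, 4.0]
print(best_span(start, end))
# prints (3, 4)
```

With real model output, the score tensors would first need converting, for example with `start_scores[0].tolist()`.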


In [12]:
# Start with the first token.
answer = tokens[answer_start]

# Select the remaining answer tokens and join them with whitespace.
for i in range(answer_start + 1, answer_end + 1):
    
    # If it's a subword token, then recombine it with the previous token.
    if tokens[i][0:2] == '##':
        answer += tokens[i][2:]
    
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + tokens[i]

print('Answer: "' + answer + '"')

Answer: "to go further than any man had been before ; but as far as it was possible for man to go"
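
The answer above happens to contain no subword tokens, so both versions print the same string. To see the recombination take effect, the sketch below runs the same logic on a hand-written token list. `join_wordpieces` is a hypothetical helper, and the example split is made up for illustration; it is not actual tokenizer output.

```python
def join_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuation pieces onto the previous token."""
    if not tokens:
        return ''
    text = tokens[0]
    for token in tokens[1:]:
        if token.startswith('##'):
            # Subword piece: drop the '##' marker and append without a space.
            text += token[2:]
        else:
            text += ' ' + token
    return text

# Hypothetical WordPiece-style split, for illustration only.
print(join_wordpieces(['circum', '##na', '##vi', '##gating', 'the', 'islands']))
# prints circumnavigating the islands
```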
