## Question Answering
---

In [50]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

!pip install transformers
!pip install googletrans

import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer



### Data loading

In [51]:
coqa = pd.read_json('http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json')
coqa.head()

| | version | data |
|---|---|---|
| 0 | 1 | {'source': 'wikipedia', 'id': '3zotghdk5ibi9ce... |
| 1 | 1 | {'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7... |
| 2 | 1 | {'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn... |
| 3 | 1 | {'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya... |
| 4 | 1 | {'source': 'gutenberg', 'id': '3urfvvm165iantk... |


### Data cleaning

In [52]:
del coqa["version"]

In [53]:
cols = ["text", "question", "answer"]

# Flatten the nested JSON: one row per (story, question, answer) triple.
comp_list = []
for _, row in coqa.iterrows():
    story = row["data"]["story"]
    questions = row["data"]["questions"]
    answers = row["data"]["answers"]
    # Questions and answers are stored as parallel lists for each story.
    for q, a in zip(questions, answers):
        comp_list.append([story, q["input_text"], a["input_text"]])
new_df = pd.DataFrame(comp_list, columns=cols)

### Saving the data as CSV

In [54]:
new_df.to_csv("CoQA_data.csv", index=False)

### Loading the data

In [55]:
data = pd.read_csv("CoQA_data.csv")
data.head()

| | text | question | answer |
|---|---|---|---|
| 0 | The Vatican Apostolic Library (), more commonl... | When was the Vat formally opened? | It was formally established in 1475 |
| 1 | The Vatican Apostolic Library (), more commonl... | what is the library for? | research |
| 2 | The Vatican Apostolic Library (), more commonl... | for what subjects? | history, and law |
| 3 | The Vatican Apostolic Library (), more commonl... | and? | philosophy, science and theology |
| 4 | The Vatican Apostolic Library (), more commonl... | what was started in 2014? | a project |


In [56]:
print("Number of questions and answers: ", len(data))

Number of questions and answers:  108647


### Building the chatbot

In [57]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [58]:
random_num = np.random.randint(0,len(data))

question = data["question"][random_num]
text = data["text"][random_num]

In [59]:
print(question, "\n", text)

Where they the only ones? 
 Jerusalem (CNN) -- The Indian nanny who saved the life of an Israeli boy during the Mumbai terror attacks in 2008 has been granted honorary citizenship and temporary residency in Israel. 

At a ceremony Monday, the Israeli interior ministry in Jerusalem handed Sandra Samuel her identity card. 

"I hope I will honor the citizenship and love Israel. I would give my heart and soul for Israel," she said. 

Samuel has been caring for the boy, Moshe Holtzberg, since his parents died in the terror attacks on a Jewish cultural center, Chabad House, and several luxury hotels in India's financial capital. 

They were among six people who were killed at Chabad House. Altogether, more than 160 people died in the attacks. 

During the raids, 10 men also attacked buildings including the luxury Taj Mahal Palace and Tower and Oberoi-Trident hotels and the city's Chhatrapati Shivaji train station. 

The only surviving gunman, Mohammed Ajmal Kasab, a Pakistani, was convicted 

In [60]:
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 369 tokens.


In [61]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, id in zip(tokens, input_ids):
    print('{:8}{:8,}'.format(token,id))

[CLS]        101
where      2,073
they       2,027
the        1,996
only       2,069
ones       3,924
?          1,029
[SEP]        102
jerusalem   6,744
(          1,006
cnn       13,229
)          1,007
-          1,011
-          1,011
the        1,996
indian     2,796
nanny     19,174
who        2,040
saved      5,552
the        1,996
life       2,166
of         1,997
an         2,019
israeli    5,611
boy        2,879
during     2,076
the        1,996
mumbai     8,955
terror     7,404
attacks    4,491
in         1,999
2008       2,263
has        2,038
been       2,042
granted    4,379
honorary   5,756
citizenship   9,068
and        1,998
temporary   5,741
residency  14,079
in         1,999
israel     3,956
.          1,012
at         2,012
a          1,037
ceremony   5,103
monday     6,928
,          1,010
the        1,996
israeli    5,611
interior   4,592
ministry   3,757
in         1,999
jerusalem   6,744
handed     4,375
sandra    12,834
samuel     5,212
her        2,014
identit

In [62]:
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print(sep_idx)

#number of tokens in segment A - question
num_seg_a = sep_idx+1
print(num_seg_a)

#number of tokens in segment B - text
num_seg_b = len(input_ids) - num_seg_a
print(num_seg_b)

segment_ids = [0]*num_seg_a + [1]*num_seg_b
print(segment_ids)

assert len(segment_ids) == len(input_ids)

7
8
361
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
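The segment-ID construction in the cell above can be isolated into a small helper, which makes it easier to check on a short input. The following is a minimal pure-Python sketch: `build_segment_ids` is a hypothetical name, and the default of 102 for `[SEP]` is taken from the token listing printed earlier rather than read from the tokenizer.

```python
def build_segment_ids(input_ids, sep_token_id=102):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    # Position of the first [SEP]; raises ValueError if the separator is missing.
    sep_idx = input_ids.index(sep_token_id)
    num_seg_a = sep_idx + 1                   # [CLS] + question + first [SEP]
    num_seg_b = len(input_ids) - num_seg_a    # text + final [SEP]
    return [0] * num_seg_a + [1] * num_seg_b


# Toy input shaped like [CLS] q q [SEP] t t t [SEP]; ids reused from the listing above.
example_ids = [101, 2073, 2027, 102, 6744, 1006, 13229, 102]
example_segments = build_segment_ids(example_ids)
print(example_segments)
```

Calling the tokenizer directly on the pair, as in `tokenizer(question, text)`, should also return a `token_type_ids` entry with the same layout, so the manual construction is mainly useful for seeing what the model receives.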

In [63]:
#input_ids encodes the question/text pair as token ids
#segment_ids (passed as token_type_ids) marks question tokens with 0 and text tokens with 1
output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
#print(output.start_logits, output.end_logits)

In [64]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits)
#print(answer_start, answer_end)
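Taking the argmax of the start and end logits independently can produce an end index that lies before the start index. One common alternative is to score every pair with `start <= end` and keep the best one. The sketch below does this in plain Python on toy numbers; `best_span` and `max_answer_len` are hypothetical names, and the logits are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits where independent argmax gives start=3, end=1 (an invalid span).
toy_start = [0.1, 0.2, 0.3, 2.0]
toy_end = [0.0, 1.5, 0.2, 1.0]
span = best_span(toy_start, toy_end)
print(span)
```

With the model output from the cell above, the inputs would be `output.start_logits[0].tolist()` and `output.end_logits[0].tolist()`.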

In [65]:
if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
else:
    #fall back to a message so that `answer` is always defined for the print below
    answer = "I am unable to find the answer to this question. Can you please ask another question?"

print("Text:\n{}".format(text.capitalize()))
print("\nQuestion:\n{}".format(question.capitalize()))
print("\nAnswer:\n{}.".format(answer.capitalize()))

Text:
Jerusalem (cnn) -- the indian nanny who saved the life of an israeli boy during the mumbai terror attacks in 2008 has been granted honorary citizenship and temporary residency in israel. 

at a ceremony monday, the israeli interior ministry in jerusalem handed sandra samuel her identity card. 

"i hope i will honor the citizenship and love israel. i would give my heart and soul for israel," she said. 

samuel has been caring for the boy, moshe holtzberg, since his parents died in the terror attacks on a jewish cultural center, chabad house, and several luxury hotels in india's financial capital. 

they were among six people who were killed at chabad house. altogether, more than 160 people died in the attacks. 

during the raids, 10 men also attacked buildings including the luxury taj mahal palace and tower and oberoi-trident hotels and the city's chhatrapati shivaji train station. 

the only surviving gunman, mohammed ajmal kasab, a pakistani, was convicted of murder, conspiracy,

### Code to join the broken words

In [72]:
answer = tokens[answer_start]

for i in range(answer_start+1, answer_end+1):
    if tokens[i][0:2] == "##":
        answer += tokens[i][2:]
    else:
        answer += " " + tokens[i]
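The same joining logic can be written as a standalone function that works on any list of WordPiece tokens. This is a minimal sketch: `join_wordpieces` is a hypothetical name, and the way the sample word is split is invented for illustration, not taken from the real tokenizer.

```python
def join_wordpieces(tokens):
    """Merge WordPiece tokens into a string: pieces starting with '##' are
    glued to the previous token, all others are separated by a space."""
    if not tokens:
        return ""
    text = tokens[0]
    for token in tokens[1:]:
        if token.startswith("##"):
            text += token[2:]
        else:
            text += " " + token
    return text


# Hypothetical split of a rare word, for illustration only.
pieces = ["moshe", "hol", "##tz", "##berg", "'", "s", "parents"]
joined = join_wordpieces(pieces)
print(joined)
```

Punctuation tokens such as the apostrophe are still separated by spaces, since only the `##` prefix triggers a merge.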

### Function

In [73]:
def question_answer(question, text):

    #tokenize question and text in ids as a pair
    input_ids = tokenizer.encode(question, text)

    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)

    #number of tokens in segment A - question
    num_seg_a = sep_idx+1

    #number of tokens in segment B - text
    num_seg_b = len(input_ids) - num_seg_a

    #list of 0s and 1s
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    assert len(segment_ids) == len(input_ids)

    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

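    #reconstructing the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)

    #default reply, used when the predicted span is invalid (end before start)
    answer = "Unable to find the answer to your question."

    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]

    #a span starting at [CLS] means the model found no answer in the text
    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."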

    print("\nAnswer:\n{}".format(answer.capitalize()))
    return answer.capitalize()
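The function above encodes the whole text in one pass, while this BERT checkpoint accepts at most 512 tokens per input, so longer stories would need to be split. A possible approach is to cut the text tokens into overlapping windows and run the model on each one. The sketch below shows only the windowing step in plain Python; `split_into_windows`, `max_len` and `stride` are hypothetical names chosen for this example.

```python
def split_into_windows(context_ids, question_len, max_len=512, stride=128):
    """Split context token ids into overlapping windows.

    Each window leaves room for [CLS] + question + [SEP] + window + [SEP],
    and consecutive windows overlap by `stride` tokens.
    """
    window_size = max_len - question_len - 3   # 3 special tokens
    if window_size <= 0:
        raise ValueError("question is too long for the chosen max_len")
    if stride >= window_size:
        raise ValueError("stride must be smaller than the window size")
    windows = []
    start = 0
    while True:
        windows.append(context_ids[start:start + window_size])
        if start + window_size >= len(context_ids):
            break
        start += window_size - stride
    return windows


# Toy example: 20 context tokens, 2 question tokens, tiny limits for readability.
toy_windows = split_into_windows(list(range(20)), question_len=2, max_len=13, stride=3)
print(toy_windows)
```

The overlap reduces the chance that an answer is cut in half at a window boundary. The Hugging Face tokenizers offer similar behaviour through the `truncation`, `stride` and `return_overflowing_tokens` arguments, which may be preferable in practice.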

In [74]:
text = """New York (CNN) -- More than 80 Michael Jackson collectibles -- including the late pop star's famous rhinestone-studded glove from a 1983 performance -- were auctioned off Saturday, reaping a total $2 million. Profits from the auction at the Hard Rock Cafe in New York's Times Square crushed pre-sale expectations of only $120,000 in sales. The highly prized memorabilia, which included items spanning the many stages of Jackson's career, came from more than 30 fans, associates and family members, who contacted Julien's Auctions to sell their gifts and mementos of the singer. Jackson's flashy glove was the big-ticket item of the night, fetching $420,000 from a buyer in Hong Kong, China. Jackson wore the glove at a 1983 performance during \"Motown 25,\" an NBC special where he debuted his revolutionary moonwalk. Fellow Motown star Walter \"Clyde\" Orange of the Commodores, who also performed in the special 26 years ago, said he asked for Jackson's autograph at the time, but Jackson gave him the glove instead. "The legacy that [Jackson] left behind is bigger than life for me,\" Orange said. \"I hope that through that glove people can see what he was trying to say in his music and what he said in his music.\" Orange said he plans to give a portion of the proceeds to charity. Hoffman Ma, who bought the glove on behalf of Ponte 16 Resort in Macau, paid a 25 percent buyer's premium, which was tacked onto all final sales over $50,000. Winners of items less than $50,000 paid a 20 percent premium."""
question = "Where was the Auction held?"

question_answer(question, text)


Answer:
Hard rock cafe in new york ' s times square


"Hard rock cafe in new york ' s times square"

In [75]:
print("Original answer:\n", data.loc[data["question"] == question]["answer"].values[0])

Original answer:
 Hard Rock Cafe
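The predicted answer contains the reference answer but is longer, so a plain string comparison would count it as wrong. A token-level F1 score, in the style commonly used for SQuAD-type evaluation, gives partial credit instead. The following is a minimal sketch; `normalize_answer` and `f1_score` are hypothetical helper names, and the normalization rules (lowercasing, dropping punctuation and English articles) are an assumption rather than part of this notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


predicted = "Hard rock cafe in new york ' s times square"
reference = "Hard Rock Cafe"
exact = normalize_answer(predicted) == normalize_answer(reference)
score = f1_score(predicted, reference)
print(exact, round(score, 2))
```

For this pair the exact match is `False` while the F1 is 0.5: all three reference tokens are recovered, but only three of the nine predicted tokens are correct.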


### Chatbot Q/A

In [91]:
random_num = np.random.randint(0, len(data))
question = data["question"][random_num]
text = data["text"][random_num]
correct_answer = data["answer"][random_num]

In [94]:
from googletrans import Translator

print(" ------------------- English ---------------------")
print(f"Text: {text}")
print(f"Question: {question}")

# question_answer() is defined in the "Function" section above
answer = question_answer(question, text)

print("-------------- Romanian Equivalent --------------")

translator = Translator()
translation_text = await translator.translate(text, dest='ro')
print(f"Text: {translation_text.text}")
translation_question = await translator.translate(question, dest='ro')
print(f"Intrebare: {translation_question.text}")
translation_answer = await translator.translate(answer, dest='ro')
print(f"Raspuns: {translation_answer.text}")
print(f"Raspuns corect: {(await translator.translate(correct_answer, dest='ro')).text}")


 ------------------- English ---------------------
Text: (CNN)As "Mad Men" returned for its seventh season, many viewers tuned in to see what happened next for Don, Peggy, Pete and the other characters of the hit AMC show. Many were eager to see the fabulous clothes the actors wore. 

We can't help but wonder -- was all that glamour real, or is it just the magic of TV? We asked readers to share their snapshots from 1967-69 and show us what the late '60s really looked like. 

Janie Lambert, 61, says she thinks "Mad Men" portrays the decade's conservative fashion and mod look accurately. But she remembers the late 1960s as more colorful and vibrant. 

"My favorite looks in the '60s were the bright colors and bold patterns, stripes and polka dots, miniskirts, long hair and pale lipstick," Lambert says. 

'Mad Men' and the other 1960s 

Many iReporters strived to keep up with the fast pace of the changing fashion in the late '60s. Patricia Anne Alfano, 66, went from a British-inspired mod 

In [83]:
translation = await translator.translate(question, dest = 'ro')
translation.text

'Cine este prima persoană menționată?'