<a href="https://colab.research.google.com/github/Shadidi/NLP_Chatbot/blob/main/Answering_Bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Answering Bot**

*Developed by: Adam Kharsa*

*Reviewed by: Souad Hadidi*

In [None]:
# verify GPU availability
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
# install huggingface libraries
!pip install pytorch-pretrained-bert pytorch-nlp pytorch_transformers



In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from pytorch_transformers import BertTokenizer, BertConfig, BertModel
from pytorch_transformers import AdamW, BertForQuestionAnswering
from tqdm import tqdm, trange
import pandas as pd
import io
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# specify GPU device (all required imports were done in the previous cell)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if torch.cuda.is_available():
    # only query and clear the GPU when one is actually present
    print(torch.cuda.get_device_name(0))
    torch.cuda.empty_cache()

In [None]:
# To run this notebook, place the attached final_project folder in the root
# directory of a Google Drive account.
# Once it is there, the saved weights and the cached training files are loaded
# from that folder, so the training step mostly amounts to loading the weights
# into the model.
from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [None]:
!ls /drive/My\ Drive/final_project

cache_train	  checkpoint-6000		 custom_predictions.json
cache_validation  checkpoint-7000		 dev-v2.0.json
checkpoint-1000   checkpoint-8000		 nbest_predictions.json
checkpoint-10000  checkpoint-9000		 null_odds.json
checkpoint-2000   checkpoint-final		 predictions.json
checkpoint-3000   custom_input.json		 train-v2.0.json
checkpoint-4000   custom_nbest_predictions.json
checkpoint-5000   custom_null_odds.json


In [None]:
import sys
sys.path.append('/drive/My Drive/final_project')

In [None]:
!wget 'https://raw.githubusercontent.com/nlpyang/pytorch-transformers/master/examples/utils_squad.py'
!wget 'https://raw.githubusercontent.com/nlpyang/pytorch-transformers/master/examples/utils_squad_evaluate.py'

--2021-05-05 04:47:22--  https://raw.githubusercontent.com/nlpyang/pytorch-transformers/master/examples/utils_squad.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41529 (41K) [text/plain]
Saving to: ‘utils_squad.py.6’


2021-05-05 04:47:23 (11.2 MB/s) - ‘utils_squad.py.6’ saved [41529/41529]

--2021-05-05 04:47:23--  https://raw.githubusercontent.com/nlpyang/pytorch-transformers/master/examples/utils_squad_evaluate.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12493 (12K) [text/plain]
Saving to: ‘utils_squad_evaluate.py.6

In [None]:
from utils_squad import (read_squad_examples, convert_examples_to_features,
                         RawResult, write_predictions,
                         RawResultExtended, write_predictions_extended)
from utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad, plot_pr_curve

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [None]:
input_file = '/drive/My Drive/final_project/train-v2.0.json'
examples = read_squad_examples(input_file=input_file,
                                is_training=True,
                                version_2_with_negative=True)
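For orientation, `read_squad_examples` reads the nested SQuAD v2.0 JSON layout: articles, then paragraphs, then question-answer entries, with `is_impossible` marking unanswerable questions. The following is a minimal sketch of that layout on a tiny hand-made sample. The helper name `flatten_squad` and the sample record are illustrative assumptions, not part of `utils_squad.py`.

```python
import json

# Tiny hand-made sample in the SQuAD v2.0 layout (illustrative, not real data).
sample_json = json.dumps({
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "BERT was released in 2018 by Google.",
            "qas": [
                {"id": "q1", "question": "When was BERT released?",
                 "is_impossible": False,
                 "answers": [{"text": "2018", "answer_start": 21}]},
                {"id": "q2", "question": "Who released GPT?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
})


def flatten_squad(raw_json):
    """Hypothetical helper: one flat record per question."""
    records = []
    for article in json.loads(raw_json)["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                impossible = qa.get("is_impossible", False)
                # Unanswerable questions carry no answer span.
                answer = None if impossible else qa["answers"][0]
                records.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answer_text": answer["text"] if answer else "",
                    "answer_start": answer["answer_start"] if answer else -1,
                    "is_impossible": impossible,
                })
    return records


records = flatten_squad(sample_json)
for record in records:
    print(record["id"], record["is_impossible"], repr(record["answer_text"]))
```

Note that `answer_start` here is a character offset into the context, whereas the `start_position` and `end_position` fields produced by `read_squad_examples` are token indices.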

In [None]:
examples[:5]

[qas_id: 56be85543aeaaa14008c9063, question_text: When did Beyonce start becoming popular?, doc_tokens: [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".], start_position: 39, end_position: 42,
 qas_id: 56be85543aeaaa14008c9065, question_text: What areas did Beyonce compete in when she was growing up?, doc_tokens: [Beyoncé Giselle Knowles-Carter (/biːˈ

In [None]:
train_data = pd.DataFrame.from_records([vars(example) for example in examples])
train_data.head()

Unnamed: 0,qas_id,question_text,doc_tokens,orig_answer_text,start_position,end_position,is_impossible
0,56be85543aeaaa14008c9063,When did Beyonce start becoming popular?,"[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",in the late 1990s,39,42,False
1,56be85543aeaaa14008c9065,What areas did Beyonce compete in when she was...,"[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",singing and dancing,28,30,False
2,56be85543aeaaa14008c9066,When did Beyonce leave Destiny's Child and bec...,"[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",2003,82,82,False
3,56bf6b0f3aeaaa14008c9601,In what city and state did Beyonce grow up?,"[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...","Houston, Texas",22,23,False
4,56bf6b0f3aeaaa14008c9602,In which decade did Beyonce become famous?,"[Beyoncé, Giselle, Knowles-Carter, (/biːˈjɒnse...",late 1990s,41,42,False


In [None]:
sample = train_data.sample(frac=1).head(1)
context = sample.doc_tokens.values
train_data[train_data.doc_tokens.values==context]

Unnamed: 0,qas_id,question_text,doc_tokens,orig_answer_text,start_position,end_position,is_impossible
79997,5727b298ff5b5019007d92ca,What group was the main antagonist during the ...,"[The, Second, Sino-Japanese, War, was, soon, f...",the Communists,21,22,False
79998,5727b298ff5b5019007d92cb,Who led the defense of Chongqing in November 1...,"[The, Second, Sino-Japanese, War, was, soon, f...",Chiang Kai-Shek,58,59,False
79999,5727b298ff5b5019007d92cc,On what date in 1949 did Chengdu fall to the c...,"[The, Second, Sino-Japanese, War, was, soon, f...",10 December,86,87,False
80000,5727b298ff5b5019007d92cd,Why did Sichuan see some communist activity?,"[The, Second, Sino-Japanese, War, was, soon, f...",it was one area on the road of the Long March,47,57,False
80001,5a513846ce860b001aa3fc2f,What resumed after the First Sino-Japanese War?,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True
80002,5a513846ce860b001aa3fc30,What happened to the cities of west China?,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True
80003,5a513846ce860b001aa3fc31,What government fled sichuan again?,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True
80004,5a513846ce860b001aa3fc32,Who flew from Chongqing to Tawian to lead the ...,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True
80005,5a513846ce860b001aa3fc33,What other city fell following the fall of Che...,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True
80006,5a68f3838476ee001a58a8f6,What group was the on the defense during the C...,"[The, Second, Sino-Japanese, War, was, soon, f...",,-1,-1,True


In [None]:
import random
def print_squad_sample(train_data, line_length=14, separator_length=120):
  sample = train_data.sample(frac=1).head(1)
  context = sample.doc_tokens.values
  print('='*separator_length)
  print('CONTEXT: ')
  print('='*separator_length)
  lines = [' '.join(context[0][idx:idx+line_length]) for idx in range(0, len(context[0]), line_length)]
  for l in lines:
      print(l)
  print('='*separator_length)
  questions = train_data[train_data.doc_tokens.values==context]
  print('QUESTION:', ' '*(3*separator_length//4), 'ANSWER:')
  for idx, row in questions.iterrows():
    question = row.question_text
    answer = row.orig_answer_text
    print(question, ' '*(3*separator_length//4-len(question)+9), (answer if answer else 'No answer found'))

In [None]:
print_squad_sample(train_data)

CONTEXT: 
The army's major campaign against the Indians was fought in Florida against Seminoles. It
took long wars (1818–58) to finally defeat the Seminoles and move them to Oklahoma.
The usual strategy in Indian wars was to seize control of the Indians winter
food supply, but that was no use in Florida where there was no winter.
The second strategy was to form alliances with other Indian tribes, but that too
was useless because the Seminoles had destroyed all the other Indians when they entered
Florida in the late eighteenth century.
QUESTION:                                                                                            ANSWER:
What Indian tribe was the Army's major campaign against?                                             Seminoles
During what years did the wars between the Army and the Seminoles take place?                        1818–58
What state were the Seminoles moved to?                                                              Oklahoma
What did the Army tr

In [None]:

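# paragraph_len counts whitespace-separated tokens (doc_tokens is a list),
# while question_len counts characters (question_text is a string)
train_data['paragraph_len'] = train_data['doc_tokens'].apply(len)
train_data['question_len'] = train_data['question_text'].apply(len)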
train_data.sample(frac=1).head(5)

Unnamed: 0,qas_id,question_text,doc_tokens,orig_answer_text,start_position,end_position,is_impossible,paragraph_len,question_len
114394,57301295b2c2fd14005687ff,How long did william tubman rule?,"[Longstanding, political, tensions, from, the,...",27 year,5,6,False,108,33
68019,5726b79bdd62a815002e8deb,Where is the North Carolina State Fair?,"[The, Time, Warner, Cable, Music, Pavilion, at...",Dorton Arena,101,102,False,131,39
5776,56daec56e7c41114004b4b20,Which original judge was a choreographer?,"[American, Idol, employs, a, panel, of, judges...",Paula Abdul,27,28,False,85,41
113723,5acd8d4907355d001abf46f7,Ice and snow covered prehistoric Ireland and w...,"[As, with, most, of, Europe,, prehistoric, Bri...",,-1,-1,True,196,64
115806,5ad4a115ba00c4001a268e6d,How many licenses did Microsoft sell in the fi...,"[In, March, 2013,, Microsoft, also, amended, i...",,-1,-1,True,96,62


In [None]:
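max_seq_length = 256
# Share of contexts whose whitespace-token count fits within max_seq_length.
# WordPiece tokenization can produce more tokens than this count.
print("Percentage of contexts no longer than max_seq_length = %s%%" % (len([l for l in train_data['paragraph_len'] if l <= max_seq_length])/len(train_data) * 100))

Percentage of contexts no longer than max_seq_length = 98.19289589392184%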


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
doc_stride = 128
max_seq_length = 256
max_query_length = 64
# batch size of 64 if RAM available.
batch_size = 14
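The `doc_stride` and `max_seq_length` settings control how a paragraph that does not fit in one input is cut into overlapping windows, so one question can yield several features. The sketch below reproduces that sliding-window idea in plain Python, assuming three positions are reserved for `[CLS]` and the two `[SEP]` tokens. The function name `doc_spans` is hypothetical, and the real logic lives in `convert_examples_to_features`.

```python
def doc_spans(num_doc_tokens, num_query_tokens, max_seq_length=256, doc_stride=128):
    """Return (start, length) windows over the paragraph tokens."""
    # Room left for the paragraph after the query, [CLS] and two [SEP] tokens.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("query leaves no room for paragraph tokens")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the paragraph
        start += min(length, doc_stride)
    return spans


# A 400-token paragraph with a 20-token question needs three windows.
print(doc_spans(num_doc_tokens=400, num_query_tokens=20))
# A short paragraph fits in a single window.
print(doc_spans(num_doc_tokens=100, num_query_tokens=20))
```

A smaller `doc_stride` gives more overlap between consecutive windows, at the cost of more features per question.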

In [None]:
cached_features_file = '/drive/My Drive/final_project/cache_train'

In [None]:
if not os.path.exists(cached_features_file):
  features = convert_examples_to_features(examples=examples,
                                        tokenizer=tokenizer,
                                        max_seq_length=max_seq_length,
                                        doc_stride=doc_stride,
                                        max_query_length=max_query_length,
                                        is_training=True)
  torch.save(features, cached_features_file)
else:
  features = torch.load(cached_features_file)

In [None]:
def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

In [None]:
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)

all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                        all_start_positions, all_end_positions,
                        all_cls_index, all_p_mask)

In [None]:
train_sampler = RandomSampler(dataset)
train_dataloader = DataLoader(dataset, sampler=train_sampler, batch_size=batch_size, drop_last=True)

In [None]:
import glob
checkpoints = sorted(glob.glob('/drive/My Drive/final_project/checkpoint*-[0-9]*'))
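One caveat with `sorted` on path strings: the order is lexicographic, so `checkpoint-10000` sorts before `checkpoint-2000` and the last element is `checkpoint-9000`, which matches the checkpoint named in the load message further down. A minimal sketch of picking the newest checkpoint by its numeric step instead is shown here; the helper name `latest_checkpoint` is an assumption introduced for illustration.

```python
def latest_checkpoint(paths):
    """Return the path with the highest numeric step suffix, or None."""
    # Keep only names that end in '-<digits>' (this skips 'checkpoint-final').
    numbered = [p for p in paths if p.rsplit('-', 1)[-1].isdigit()]
    if not numbered:
        return None
    return max(numbered, key=lambda p: int(p.rsplit('-', 1)[-1]))


example_paths = [
    '/drive/My Drive/final_project/checkpoint-1000',
    '/drive/My Drive/final_project/checkpoint-10000',
    '/drive/My Drive/final_project/checkpoint-2000',
    '/drive/My Drive/final_project/checkpoint-9000',
    '/drive/My Drive/final_project/checkpoint-final',
]

print(sorted(example_paths)[-2])         # lexicographic: last numbered entry
print(latest_checkpoint(example_paths))  # numeric: highest step
```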

In [None]:
def to_list(tensor):
    return tensor.detach().cpu().tolist()

In [None]:
if len(checkpoints) > 0:
  global_step = checkpoints[-1].split('-')[-1]
  ckpt_name = '/drive/My Drive/final_project/checkpoint-{}'.format(global_step)
  print("Loading model from checkpoint %s" % ckpt_name)
  model = BertForQuestionAnswering.from_pretrained(ckpt_name)
  train_loss_set_ckpt = torch.load(ckpt_name + '/training_loss.pt')
  train_loss_set = to_list(train_loss_set_ckpt)
  tr_loss = train_loss_set[-1]
else:
  global_step = 0
  train_loss_set = []
  tr_loss = 0.0
  model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

model.cuda()

Loading model from checkpoint /drive/My Drive/final_project/checkpoint-9000


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [None]:
param_optimizer = list(model.named_parameters())
print(param_optimizer[-2])
print(param_optimizer[-1])

('qa_outputs.weight', Parameter containing:
tensor([[-1.2649e-02, -6.6199e-03, -3.1451e-03,  ...,  6.7657e-05,
         -3.6961e-02,  4.1573e-03],
        [-5.7847e-02, -2.8350e-02, -1.7493e-02,  ..., -3.5200e-02,
         -2.6129e-03,  1.4186e-02]], device='cuda:0', requires_grad=True))
('qa_outputs.bias', Parameter containing:
tensor([0.0150, 0.0154], device='cuda:0', requires_grad=True))


In [None]:
learning_rate = 5e-5
adam_epsilon=1e-8
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
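The two parameter groups above are selected purely by substring match on parameter names: anything containing `bias` or `LayerNorm.weight` is exempt from weight decay. The following sketch applies the same filter to a handful of names written in the style of the model printout; the helper `split_by_weight_decay` and the short name list are illustrative assumptions.

```python
def split_by_weight_decay(param_names, no_decay=('bias', 'LayerNorm.weight')):
    """Partition parameter names into (decayed, not_decayed) lists."""
    decayed = [n for n in param_names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in param_names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# A few names in the style of model.named_parameters() (hand-picked subset).
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.embeddings.LayerNorm.bias',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'qa_outputs.weight',
    'qa_outputs.bias',
]

decayed, not_decayed = split_by_weight_decay(names)
print('weight_decay=0.01:', decayed)
print('weight_decay=0.0: ', not_decayed)
```

Every name lands in exactly one group, so no parameter is passed to the optimizer twice.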

In [None]:
torch.cuda.empty_cache()
num_train_epochs = 1

print("***** Running training *****")
print("  Num examples = %d" % len(dataset))
print("  Num Epochs = %d" % num_train_epochs)
print("  Batch size = %d" % batch_size)
print("  Total optimization steps = %d" % (len(train_dataloader) * num_train_epochs))

model.zero_grad()
train_iterator = trange(num_train_epochs, desc="Epoch")
set_seed()

global_step = int(global_step)
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      if step < global_step + 1:
        continue

      model.train()
      batch = tuple(t.to(device) for t in batch)

      inputs = {'input_ids':       batch[0],
                'attention_mask':  batch[1], 
                'token_type_ids':  batch[2],  
                'start_positions': batch[3], 
                'end_positions':   batch[4]}

      # every tensor in `inputs` is already on `device`; a dict has no .to() method

      outputs = model(**inputs)

      loss = outputs[0]
      # store a plain float so the computation graph is not kept alive
      train_loss_set.append(loss.item())
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

      tr_loss += loss.item()
      optimizer.step()
      model.zero_grad()
      global_step += 1

      if global_step % 1000 == 0:
        print("Train loss: {}".format(tr_loss/global_step))
        output_dir = '/drive/My Drive/final_project/checkpoint-{}'.format(global_step)
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
        model_to_save.save_pretrained(output_dir)
        # no leading slash here, otherwise os.path.join discards output_dir
        loss_file = os.path.join(output_dir, 'training_loss.pt')
        print("Saving training_loss.pt to %s" % loss_file)
        torch.save(torch.tensor(train_loss_set), loss_file)
        print("Saving model checkpoint to %s" % output_dir)

***** Running training *****
  Num examples = 144262
  Num Epochs = 1
  Batch size = 14
  Total optimization steps = 10304

In [None]:
output_dir = '/drive/My Drive/final_project/checkpoint-final'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
model_to_save = model.module if hasattr(model, 'module') else model
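model_to_save.save_pretrained(output_dir)
# also save the loss history, which the next cell loads for plotting
torch.save(torch.tensor([float(l) for l in train_loss_set]), os.path.join(output_dir, 'training_loss.pt'))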

In [None]:
train_loss_set_ckpt = torch.load('/drive/My Drive/final_project/checkpoint-final/training_loss.pt')
train_loss_set = to_list(train_loss_set_ckpt)

In [None]:
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

**Load the validation (dev) dataset**

In [None]:
input_file = '/drive/My Drive/final_project/dev-v2.0.json'
val_examples = read_squad_examples(input_file=input_file,
                                is_training=False,
                                version_2_with_negative=True)
doc_stride = 128
max_seq_length = 256
max_query_length = 64
cached_features_file = '/drive/My Drive/final_project/cache_validation'

# Cache features for faster loading
if not os.path.exists(cached_features_file):
  features = convert_examples_to_features(examples=val_examples,
                                        tokenizer=tokenizer,
                                        max_seq_length=max_seq_length,
                                        doc_stride=doc_stride,
                                        max_query_length=max_query_length,
                                        is_training=False)
  torch.save(features, cached_features_file)
else:
  features = torch.load(cached_features_file)

In [None]:
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)

all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                        all_example_index, all_cls_index, all_p_mask)

In [None]:
validation_sampler = SequentialSampler(dataset)
# keep the last partial batch so that every feature gets a prediction
validation_dataloader = DataLoader(dataset, sampler=validation_sampler, batch_size=batch_size, drop_last=False)

**Evaluate on the validation (dev) dataset**

In [None]:

def evaluate(model, tokenizer):
  print("***** Running evaluation *****")
  print("  Num examples = %d" % len(dataset))
  print("  Batch size = %d" % batch_size)
  all_results = []
  predict_file = '/drive/My Drive/final_project/dev-v2.0.json'
  for batch in tqdm(validation_dataloader, desc="Evaluating", miniters=100, mininterval=5.0):
    model.eval()
    batch = tuple(t.to(device) for t in batch)
    with torch.no_grad():
      inputs = {'input_ids':      batch[0],
                'attention_mask': batch[1],
                'token_type_ids': batch[2]
                }
      example_indices = batch[3]
      outputs = model(**inputs)

    for i, example_index in enumerate(example_indices):
      eval_feature = features[example_index.item()]
      unique_id = int(eval_feature.unique_id)

      result = RawResult(unique_id    = unique_id,
                         start_logits = to_list(outputs[0][i]),
                         end_logits   = to_list(outputs[1][i]))
      all_results.append(result)

  # Compute predictions
  output_prediction_file = "/drive/My Drive/final_project/predictions.json"
  output_nbest_file = "/drive/My Drive/final_project/nbest_predictions.json"
  output_null_log_odds_file = "/drive/My Drive/final_project/null_odds.json"
  output_dir = "/drive/My Drive/final_project/predict_results"

  write_predictions(val_examples, features, all_results, 10,
                  30, True, output_prediction_file,
                  output_nbest_file, output_null_log_odds_file, False,
                  True, 0.0)

  # Evaluate with the official SQuAD script
  evaluate_options = EVAL_OPTS(data_file=predict_file,
                               pred_file=output_prediction_file,
                               na_prob_file=output_null_log_odds_file,
                               out_image_dir=None)
  results = evaluate_on_squad(evaluate_options)
  return results
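In the `write_predictions` call above, the positional values 10 and 30 appear to correspond to the n-best size and the maximum answer length in tokens. As a rough illustration of that idea, the sketch below picks the best answer span from start and end logits under both limits. It is a simplified stand-in written for this notebook: the name `best_span` is hypothetical, and it omits the null-answer scoring that the SQuAD v2.0 utilities perform.

```python
def best_span(start_logits, end_logits, n_best_size=10, max_answer_length=30):
    """Return (start, end, score) of the best valid span, or None."""
    def top_indexes(logits):
        # Indexes of the n_best_size largest logits, best first.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return ranked[:n_best_size]

    best = None
    for start in top_indexes(start_logits):
        for end in top_indexes(end_logits):
            # Skip spans that end before they start or are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 3.0, 0.1]
print(best_span(start_logits, end_logits))
print(best_span(start_logits, end_logits, max_answer_length=1))
```

Scoring start and end jointly avoids the case where the independently best start position falls after the independently best end position.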

In [None]:
!touch "/drive/My Drive/final_project/custom_input.json"

In [None]:
results = evaluate(model, tokenizer)

In [None]:
import json
# one single-key dict per metric, in the order returned by the evaluation script
results_json = [{name: value} for name, value in results.items()]
print(results_json)
with open('results.json', 'w') as f:
  json.dump(results_json, f)
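The metrics stored in `results` come from the SQuAD evaluation script, whose headline numbers are exact match and token-level F1 computed on normalized answers. The sketch below shows how those two scores are typically computed for a single prediction. It is written from the general description of the metric, so the function names and details are assumptions rather than a copy of `utils_squad_evaluate.py`.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def exact_match(gold, pred):
    return int(normalize_answer(gold) == normalize_answer(pred))


def f1_score(gold, pred):
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        # For unanswerable questions both sides must be empty to score 1.
        return float(gold_tokens == pred_tokens)
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match('in the late 1990s', 'late 1990s'), f1_score('in the late 1990s', 'late 1990s'))
print(exact_match('The Seminoles', 'seminoles'), f1_score('', ''))
```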

**Evaluate on any text**

In [None]:
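from transformers import BertTokenizer

# load the tokenizer once instead of on every call
qa_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def getAnswerToQuestion(question, paragraph, model):
  # encode the pair as [CLS] question [SEP] paragraph [SEP]
  encoding = qa_tokenizer.encode_plus(text=question, text_pair=paragraph, add_special_tokens=True)

  inputs = encoding['input_ids']
  # segment ids: 0 for question tokens, 1 for paragraph tokens
  sentence_embedding = encoding['token_type_ids']
  tokens = qa_tokenizer.convert_ids_to_tokens(inputs)

  model = model.to(device)
  model.eval()

  # inference only, so no gradients are needed
  with torch.no_grad():
    outputs = model(input_ids=torch.tensor([inputs], device=device),
                    token_type_ids=torch.tensor([sentence_embedding], device=device))
  start_scores, end_scores = outputs[0], outputs[1]

  # most likely start and end token positions
  start_index = torch.argmax(start_scores).item()
  end_index = torch.argmax(end_scores).item()

  # if the end falls before the start, the slice is empty and so is the answer
  answer = ' '.join(tokens[start_index:end_index+1])

  corrected_answer = ''
  for word in answer.split():
      # if it's a subword token, glue it onto the previous token
      if word[0:2] == '##':
          corrected_answer += word[2:]
      else:
          corrected_answer += ' ' + word

  return corrected_answer.strip()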



In [None]:
# If there is no good answer, the bot won't output an answer.

question = '''What is Cartography?'''

paragraph = '''Cartography studies the representation of the Earth's surface with abstract symbols (map making). Although other subdisciplines of geography rely on maps for presenting their analyses, the actual making of maps is abstract enough to be regarded separately. Cartography has grown from a collection of drafting techniques into an actual science.

Cartographers must learn cognitive psychology and ergonomics to understand which symbols convey information about the Earth most effectively, and behavioural psychology to induce the readers of their maps to act on the information. They must learn geodesy and fairly advanced mathematics to understand how the shape of the Earth affects the distortion of map symbols projected onto a flat surface for viewing. It can be said, without much controversy, that cartography is the seed from which the larger field of geography grew. Most geographers will cite a childhood fascination with maps as an early sign they would end up in the field.''' 

print(getAnswerToQuestion(question,paragraph,model))
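The answer string is rebuilt from WordPiece tokens, where a leading `##` marks a piece that continues the previous token. A minimal standalone sketch of that merging step is given below. The helper name `merge_wordpieces` and the token lists are hand-written assumptions for illustration, not actual tokenizer output.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into words, gluing '##' pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            # Continuation piece: append it without the '##' marker.
            words[-1] += token[2:]
        else:
            # A new word (a leading '##' with nothing before it is kept as-is).
            words.append(token)
    return ' '.join(words)


print(merge_wordpieces(['cart', '##ography', 'studies', 'maps']))
print(repr(merge_wordpieces([])))
```

Collecting words in a list and joining once avoids the leading space that plain string concatenation would otherwise leave at the front of the answer.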

The code was heavily inspired by the following blog post: https://towardsdatascience.com/bert-nlp-how-to-build-a-question-answering-bot-98b1d1594d7b