<a href="https://colab.research.google.com/github/Jhansoll/nlp_tutorials/blob/main/open_qa_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install --quiet faiss-cpu
!pip install --quiet tensorflow
!pip install --quiet tensorflow_hub
!pip install --quiet tf-models-official

In [2]:
import pickle
import faiss
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from official.nlp.bert import tokenization

In [3]:
BERT_MODEL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3"

tfhub_handle_passage_encoder = hub.load(BERT_MODEL)
tfhub_handle_question_encoder = hub.load(BERT_MODEL)
tfhub_handle_reader_encoder = hub.load(BERT_MODEL)

vocab_file = tfhub_handle_passage_encoder.vocab_file.asset_path.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case=tfhub_handle_passage_encoder.do_lower_case)

In [4]:
tfhub_handle_passage_encoder.do_lower_case

<tf.Variable 'Variable:0' shape=() dtype=bool, numpy=True>

`Uncased` means that the text is lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The `Uncased` model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the `Uncased` model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).

When using a cased model, make sure to pass `--do_lower_case=False` to the training scripts, or `do_lower_case=False` to `FullTokenizer` when you build the tokenizer yourself, as this notebook does.
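
As a rough illustration of what uncased preprocessing does to raw text before WordPiece is applied, the sketch below lowercases a string and strips accent marks using only the standard library. The helper name `uncased_normalize` is hypothetical; it approximates the behaviour described above and is not the tokenizer's own code.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase text and drop combining accent marks (hypothetical helper)."""
    text = text.lower()
    # NFD splits an accented character into a base character plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("John Smith"))   # john smith
print(uncased_normalize("Café Müller"))  # cafe muller
```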


In [5]:
!gdown --id {"1UmhRecUPwug7djN2gkNtfjDol2Pm3qhv"} -O dpr_weights.pkl

Downloading...
From: https://drive.google.com/uc?id=1UmhRecUPwug7djN2gkNtfjDol2Pm3qhv
To: /content/dpr_weights.pkl
1.31GB [00:10, 120MB/s]


# 1. Data


In [6]:
question1 = "How many Harry Potter books are there?"
question2 = "Who is the author of Harry Potter?"
question3 = "What is the name of Harry's best friends?"

passage1 = "Harry Potter is a series of seven fantasy novels written by British author, J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and Muggles (non-magical people)."
passage2 = "Since the release of the first novel, Harry Potter and the Philosopher's Stone, on 26 June 1997, the books have found immense popularity, positive reviews, and commercial success worldwide. They have attracted a wide adult audience as well as younger readers and are often considered cornerstones of modern young adult literature.[2] As of February 2018, the books have sold more than 500 million copies worldwide, making them the best-selling book series in history, and have been translated into eighty languages.[3] The last four books consecutively set records as the fastest-selling books in history, with the final installment selling roughly eleven million copies in the United States within twenty-four hours of its release."
passage3 = "The series was originally published in English by two major publishers, Bloomsbury in the United Kingdom and Scholastic Press in the United States. A play, Harry Potter and the Cursed Child, based on a story co-written by Rowling, premiered in London on 30 July 2016 at the Palace Theatre, and its script was published by Little, Brown. The original seven books were adapted into an eight-part namesake film series by Warner Bros. Pictures, which is the third highest-grossing film series of all time as of February 2020. In 2016, the total value of the Harry Potter franchise was estimated at $25 billion,[4] making Harry Potter one of the highest-grossing media franchises of all time."
passage4 = "A series of many genres, including fantasy, drama, coming of age, and the British school story (which includes elements of mystery, thriller, adventure, horror, and romance), the world of Harry Potter explores numerous themes and includes many cultural meanings and references.[5] According to Rowling, the main theme is death.[6] Other major themes in the series include prejudice, corruption, and madness."
passage5 = "The success of the books and films has allowed the Harry Potter franchise to expand with numerous derivative works, a travelling exhibition that premiered in Chicago in 2009, a studio tour in London that opened in 2012, a digital platform on which J. K. Rowling updates the series with new information and insight, and a pentalogy of spin-off films premiering in November 2016 with Fantastic Beasts and Where to Find Them, among many other developments. Most recently, themed attractions, collectively known as The Wizarding World of Harry Potter, have been built at several Universal Parks & Resorts amusement parks around the world."

questions = [question1, question2, question3]
passages = [passage1, passage2, passage3, passage4, passage5]


In [7]:
def build_input(tokenizer, sentence1, sentence2=None, max_seq_length=512):
  """Generate (input_ids, input_mask, segment_ids) for a sentence or a sentence pair."""
  # Tokenize the first sentence and wrap it as [CLS] ... [SEP]
  tokens = tokenizer.tokenize(sentence1)
  tokens = ["[CLS]"] + tokens + ["[SEP]"]
  len_token1 = len(tokens)
  len_token2 = 0
  if sentence2:
    # Append the second sentence, closed by its own [SEP]
    tokens2 = tokenizer.tokenize(sentence2)
    len_token2 = len(tokens2) + 1
    tokens = tokens + tokens2 + ["[SEP]"]
  if len(tokens) > max_seq_length:
    # Fail loudly instead of returning arrays of inconsistent length
    raise ValueError("Input has {} tokens, more than max_seq_length={}".format(len(tokens), max_seq_length))
  ids = tokenizer.convert_tokens_to_ids(tokens)

  # Pad the ids to max sequence length; the mask is 1 for real tokens, 0 for padding
  pad_len = max_seq_length - len(ids)
  input_ids = ids + [0]*pad_len
  input_mask = [1]*len(ids) + [0]*pad_len

  # Segment ids are 0 for the first sentence and the padding, 1 for the second sentence
  segment_ids = [0]*len_token1 + [1]*len_token2 + [0]*pad_len
  return (input_ids, input_mask, segment_ids)

In [8]:
# Convert the questions to BERT inputs
question_inputs = [build_input(tokenizer, s) for s in questions]

# Slice to batch each input tensor
question_input_ids = np.array([x[0] for x in question_inputs], dtype=np.int32)
question_input_masks = np.array([x[1] for x in question_inputs], dtype=np.int32)
question_segment_ids = np.array([x[2] for x in question_inputs], dtype=np.int32)

# Convert the passages to BERT inputs
passage_inputs = [build_input(tokenizer, s) for s in passages]

# Slice to batch each input tensor
passage_input_ids = np.array([x[0] for x in passage_inputs], dtype=np.int32)
passage_input_masks = np.array([x[1] for x in passage_inputs], dtype=np.int32)
passage_segment_ids = np.array([x[2] for x in passage_inputs], dtype=np.int32)

In [9]:
question_input_ids.shape

(3, 512)

In [10]:
question_input_ids

array([[ 101, 2129, 2116, ...,    0,    0,    0],
       [ 101, 2040, 2003, ...,    0,    0,    0],
       [ 101, 2054, 2003, ...,    0,    0,    0]], dtype=int32)

In [11]:
tokenizer.convert_ids_to_tokens(question_input_ids[0][:15])

['[CLS]',
 'how',
 'many',
 'harry',
 'potter',
 'books',
 'are',
 'there',
 '?',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

In [12]:
question_input_masks[0][:15]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=int32)

In [13]:
passage_segment_ids[0][:15]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

# 2. Retriever model

In [14]:
passage_encoder = hub.KerasLayer(tfhub_handle_passage_encoder)
question_encoder = hub.KerasLayer(tfhub_handle_question_encoder)

with open("dpr_weights.pkl", "rb") as f:
  get_passage, get_question, get_reader = pickle.load(f)

In [15]:
passage_encoder.set_weights(get_passage)
question_encoder.set_weights(get_question)

In [16]:
passage_outputs = passage_encoder({"input_word_ids":passage_input_ids, "input_mask":passage_input_masks, "input_type_ids":passage_segment_ids})
question_outputs = question_encoder({"input_word_ids":question_input_ids, "input_mask":question_input_masks, "input_type_ids":question_segment_ids})

passage_vectors = passage_outputs['sequence_output'][:, 0, :] # [CLS]
question_vectors = question_outputs['sequence_output'][:, 0, :]

# 3. Find a passage!

In [17]:
vectors_size = 768
index = faiss.IndexFlatL2(vectors_size)

p_vectors = np.array(passage_vectors)
q_vectors = np.array(question_vectors)

In [18]:
print(p_vectors.shape, q_vectors.shape)

(5, 768) (3, 768)


In [19]:
index.add(p_vectors) 

In [20]:
index.ntotal

5

In [21]:
k = 1
D, I = index.search(q_vectors, k)

In [22]:
D

array([[66.06739 ],
       [81.45181 ],
       [87.274216]], dtype=float32)

In [23]:
I

array([[2],
       [0],
       [0]])
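
For intuition about the `D` and `I` arrays, the sketch below reproduces on toy data what a flat L2 index computes: the squared Euclidean distance from each query to every stored vector, followed by the `k` smallest. The vectors and the `flat_l2_search` helper are made up for illustration; the real search above runs inside FAISS.

```python
def flat_l2_search(stored, queries, k=1):
    """Brute-force k-nearest-neighbour search by squared L2 distance (toy sketch)."""
    all_distances, all_indices = [], []
    for q in queries:
        # Squared Euclidean distance from this query to every stored vector
        scored = [
            (sum((a - b) ** 2 for a, b in zip(q, p)), idx)
            for idx, p in enumerate(stored)
        ]
        scored.sort()  # smallest distance first; ties broken by lower index
        top = scored[:k]
        all_distances.append([d for d, _ in top])
        all_indices.append([i for _, i in top])
    return all_distances, all_indices

# Toy 2-dimensional "passage" and "question" vectors (assumed values)
toy_passages = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_questions = [[0.9, 1.2], [3.0, 0.5]]
toy_D, toy_I = flat_l2_search(toy_passages, toy_questions, k=1)
print(toy_I)  # [[1], [2]]
```

DPR encoders are trained with a dot-product similarity, so an inner-product index such as `faiss.IndexFlatIP` may rank passages differently from `IndexFlatL2`; which of the two suits these weights better is worth checking on your own data.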

# 4. Reader model

In [24]:
reader_inputs = [build_input(tokenizer, questions[q], passages[p[0]]) for q, p in enumerate(I)]

In [25]:
# Slice to batch each input tensor
reader_input_ids = np.array([x[0] for x in reader_inputs], dtype=np.int32)
reader_input_masks = np.array([x[1] for x in reader_inputs], dtype=np.int32)
reader_segment_ids = np.array([x[2] for x in reader_inputs], dtype=np.int32)


In [26]:
print(reader_input_ids.shape)
print(reader_input_masks.shape)
print(reader_segment_ids.shape)

(3, 512)
(3, 512)
(3, 512)


In [27]:
tokenizer.convert_ids_to_tokens(reader_input_ids[0][:180])

['[CLS]',
 'how',
 'many',
 'harry',
 'potter',
 'books',
 'are',
 'there',
 '?',
 '[SEP]',
 'the',
 'series',
 'was',
 'originally',
 'published',
 'in',
 'english',
 'by',
 'two',
 'major',
 'publishers',
 ',',
 'blooms',
 '##bury',
 'in',
 'the',
 'united',
 'kingdom',
 'and',
 'scholastic',
 'press',
 'in',
 'the',
 'united',
 'states',
 '.',
 'a',
 'play',
 ',',
 'harry',
 'potter',
 'and',
 'the',
 'cursed',
 'child',
 ',',
 'based',
 'on',
 'a',
 'story',
 'co',
 '-',
 'written',
 'by',
 'row',
 '##ling',
 ',',
 'premiered',
 'in',
 'london',
 'on',
 '30',
 'july',
 '2016',
 'at',
 'the',
 'palace',
 'theatre',
 ',',
 'and',
 'its',
 'script',
 'was',
 'published',
 'by',
 'little',
 ',',
 'brown',
 '.',
 'the',
 'original',
 'seven',
 'books',
 'were',
 'adapted',
 'into',
 'an',
 'eight',
 '-',
 'part',
 'namesake',
 'film',
 'series',
 'by',
 'warner',
 'bros',
 '.',
 'pictures',
 ',',
 'which',
 'is',
 'the',
 'third',
 'highest',
 '-',
 'grossing',
 'film',
 'series',
 'of'

In [28]:
reader_input_masks[0][:180]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0], dtype=int32)

In [29]:
reader_segment_ids[0][:180]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0], dtype=int32)
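
The mask and segment arrays above follow a fixed layout: segment 0 covers `[CLS]`, the question, and the first `[SEP]`; segment 1 covers the passage and the final `[SEP]`; padding gets mask 0 and segment 0. The sketch below rebuilds that layout for pre-tokenized toy input. The `pair_layout` helper and the token lists are hypothetical and skip the real WordPiece tokenizer.

```python
def pair_layout(question_tokens, passage_tokens, max_seq_length=16):
    """Return (tokens, input_mask, segment_ids) for a question/passage pair."""
    first = ["[CLS]"] + question_tokens + ["[SEP]"]
    second = passage_tokens + ["[SEP]"]
    tokens = first + second
    pad_len = max_seq_length - len(tokens)
    if pad_len < 0:
        raise ValueError("pair is longer than max_seq_length")
    # 1 marks real tokens, 0 marks padding
    input_mask = [1] * len(tokens) + [0] * pad_len
    # 0 for the question part and the padding, 1 for the passage part
    segment_ids = [0] * len(first) + [1] * len(second) + [0] * pad_len
    return tokens + ["[PAD]"] * pad_len, input_mask, segment_ids

toy_tokens, toy_mask, toy_segments = pair_layout(
    ["who", "wrote", "it", "?"], ["row", "##ling", "did", "."], max_seq_length=12
)
print(toy_mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(toy_segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```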

In [30]:
reader_encoder = hub.KerasLayer(tfhub_handle_reader_encoder)
reader_encoder.set_weights(get_reader[:200])

In [31]:
class Reader(tf.keras.Model):
  """Extractive reader: BERT encoder plus a span (start/end) head and a relevance head."""
  def __init__(self):
    super().__init__()
    self.encoder = reader_encoder
    # Span head: projects each token vector to a (start, end) logit pair
    self.qa_outputs_weight = get_reader[-4]
    self.qa_outputs_bias = get_reader[-3]
    # Relevance head: scores the passage from the [CLS] vector
    self.qa_classifier_weight = get_reader[-2]
    self.qa_classifier_bias = get_reader[-1]

  def call(self, x):
    batch_size = x["input_word_ids"].shape[0]

    x = self.encoder(x)
    sequence_output = x["sequence_output"]

    # Per-token start and end logits, each reshaped to [batch_size, seq_len]
    qa_outputs = tf.matmul(sequence_output, self.qa_outputs_weight) + self.qa_outputs_bias
    start_logits, end_logits = tf.split(qa_outputs, 2, axis=-1)
    start_logits = tf.reshape(start_logits, [batch_size, -1])
    end_logits = tf.reshape(end_logits, [batch_size, -1])

    # Passage relevance logit computed from the [CLS] position
    cls_output = sequence_output[:, 0, :]
    relevance_logits = tf.matmul(cls_output, self.qa_classifier_weight) + self.qa_classifier_bias

    return start_logits, end_logits, relevance_logits


In [32]:
reader = Reader()

In [33]:
inputs = {"input_word_ids":reader_input_ids, "input_mask":reader_input_masks, "input_type_ids":reader_segment_ids}

start_logits, end_logits, relevance_logits = reader(inputs)

# 5. Inference

In [34]:
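def get_original(tokens_list):
  """Join WordPiece tokens into text, merging '##' continuation pieces into the previous token."""
  rm_special = [token[2:] if token.startswith("##") else " "+token for token in tokens_list]
  text = "".join(rm_special)
  return text[1:]  # drop the space added before the first token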

In [35]:
for i in range(len(questions)):
  tokens = tokenizer.convert_ids_to_tokens(reader_input_ids[i])

  start_index = tf.argmax(start_logits, 1)[i].numpy()
  # +1 so that the slice below includes the predicted end token
  end_index = tf.argmax(end_logits, 1)[i].numpy() + 1
  print('Question : {}'.format(questions[i]))
  print('Answer : {}\n'.format(get_original(tokens[start_index:end_index])))

Question : How many Harry Potter books are there?
Answer : seven

Question : Who is the author of Harry Potter?
Answer : j . k . rowling

Question : What is the name of Harry's best friends?
Answer : hermione granger
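
Taking the two argmaxes independently, as the loop above does, can occasionally produce an end index that falls before the start index. One common alternative, sketched here on toy logits, scores every (start, end) pair with start <= end and a bounded length, and keeps the highest-scoring pair. The `best_span` helper and the `max_answer_len` parameter are assumptions for illustration, not part of the notebook's reader.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return the (start, end) pair maximizing start_logit + end_logit with start <= end."""
    best_score, best = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits where the independent argmaxes disagree (start=2, end=0)
toy_start = [0.1, 0.2, 3.0, 0.3]
toy_end = [2.5, 0.1, 0.4, 2.0]
print(best_span(toy_start, toy_end))  # (2, 3)
```

The returned end index is inclusive, so the matching token slice would be `tokens[start:end + 1]`.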

