## BERT & Extraction
- [Reference](https://keras.io/examples/nlp/text_extraction_with_bert/)

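### Introduction

Overview
- This is a question-answering (QA) task on SQuAD, the standard benchmark dataset most commonly used in papers. Each example contains a question and a paragraph that serves as its context.

Goal
- Find the answer span (start position, end position) in the paragraph that answers the question. The metric is "Exact Match": the percentage of predictions that match a ground-truth answer.

Pipeline
1. Feed the context and the question into BERT.
2. Learn two vectors, S and T, with the same dimension as BERT's hidden states (so that dot products with them are defined).
3. Compute the probability of each token being the start of the answer: take the dot product of S with each token's final-layer hidden state, then apply a softmax over all tokens. T is used the same way for the end position.
4. Fine-tune BERT and learn S and T along the way.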
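
Steps 2 and 3 can be sketched in plain NumPy. This is a minimal illustration under assumed toy sizes, not the notebook's model: the random `hidden` array stands in for BERT's final-layer hidden states, and `S` and `T` are randomly initialised here rather than learned.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, hidden_size = 6, 8          # toy sizes (BERT-base uses 768)
hidden = rng.normal(size=(seq_len, hidden_size))  # stand-in for BERT's last layer
S = rng.normal(size=hidden_size)     # start vector (learned in practice)
T = rng.normal(size=hidden_size)     # end vector (learned in practice)

start_probs = softmax(hidden @ S)    # P(token i is the answer start)
end_probs = softmax(hidden @ T)      # P(token i is the answer end)

print(start_probs.shape, float(start_probs.sum()))
```

Each distribution has one entry per token and sums to 1, which is why the model built later can be trained with a categorical cross-entropy loss on the true start and end indices.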

In [1]:
!pip install tokenizers transformers

In [2]:
import os
import re
import json
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, TFBertModel, BertConfig

In [3]:
max_len = 384
config = BertConfig()       # the defaults follow the hyperparameters from the paper

### BERT tokenizer

In [4]:
slow_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
save_path = 'bert_base_uncased/'

os.makedirs(save_path, exist_ok=True)
slow_tokenizer.save_pretrained(save_path)

# Reload the saved vocabulary with the fast WordPiece tokenizer
tokenizer = BertWordPieceTokenizer(vocab=save_path + 'vocab.txt', lowercase=True)

### Load the dataset

In [5]:
train_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
train_path = keras.utils.get_file("train.json", train_data_url)
eval_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
eval_path = keras.utils.get_file("eval.json", eval_data_url)

Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json


### Process the data
- Load the JSON files into Python objects
- Build x_train, y_train, x_eval, y_eval from those objects

In [6]:
# The JSON files load as nested dicts

with open(train_path) as f:
    raw_train_data = json.load(f)

with open(eval_path) as f:
    raw_eval_data = json.load(f)

In [7]:
len(raw_train_data['data']), len(raw_train_data['data'][0])

(442, 2)

In [8]:
raw_train_data.keys()

dict_keys(['data', 'version'])

In [9]:
raw_train_data['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [10]:
raw_train_data['data'][0]['title']

'University_of_Notre_Dame'

In [11]:
len(raw_train_data['data'][0]['paragraphs']), raw_train_data['data'][0]['paragraphs'][0].keys()

(55, dict_keys(['context', 'qas']))

In [12]:
raw_train_data['data'][0]['paragraphs'][0]['context']

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [13]:
raw_train_data['data'][0]['paragraphs'][0]['qas']

[{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
  'id': '5733be284776f41900661182',
  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
 {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ'}],
  'id': '5733be284776f4190066117f',
  'question': 'What is in front of the Notre Dame Main Building?'},
 {'answers': [{'answer_start': 279, 'text': 'the Main Building'}],
  'id': '5733be284776f41900661180',
  'question': 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?'},
 {'answers': [{'answer_start': 381,
    'text': 'a Marian place of prayer and reflection'}],
  'id': '5733be284776f41900661181',
  'question': 'What is the Grotto at Notre Dame?'},
 {'answers': [{'answer_start': 92,
    'text': 'a golden statue of the Virgin Mary'}],
  'id': '5733be284776f4190066117e',
  'question': 'What sits on top of the Main Building at Notre Dame?'}]

In [14]:
help(tokenizer.encode)

Help on method encode in module tokenizers.implementations.base_tokenizer:

encode(sequence: Union[str, List[str], Tuple[str]], pair: Union[str, List[str], Tuple[str], NoneType] = None, is_pretokenized: bool = False, add_special_tokens: bool = True) -> tokenizers.Encoding method of tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer instance
    Encode the given sequence and pair. This method can process raw text sequences as well
    as already pre-tokenized sequences.
    
    Args:
        sequence: InputSequence:
            The sequence we want to encode. This sequence can be either raw text or
            pre-tokenized, according to the `is_pretokenized` argument:
    
            - If `is_pretokenized=False`: `InputSequence` is expected to be `str`
            - If `is_pretokenized=True`: `InputSequence` is expected to be
                `Union[List[str], Tuple[str]]`
    
        is_pretokenized: bool:
            Whether the input is already pre-tokenized.
    
  

In [15]:
# Tokenizer usage

# Encode a str into an Encoding object
encoding = tokenizer.encode('Aaron will be one of the best data scientist!')
print(encoding)

# Read the fields you need through its attributes
encoding.ids

Encoding(num_tokens=12, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


[101, 7158, 2097, 2022, 2028, 1997, 1996, 2190, 2951, 7155, 999, 102]

In [16]:
# What are offsets? For each token returned by the tokenizer, `offsets` holds a (start, end) tuple of character positions in the original input string (end is exclusive). Special tokens such as [CLS] and [SEP] get (0, 0).

encoding.offsets

[(0, 0),
 (0, 5),
 (6, 10),
 (11, 13),
 (14, 17),
 (18, 20),
 (21, 24),
 (25, 29),
 (30, 34),
 (35, 44),
 (44, 45),
 (0, 0)]
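
Because each offset is a pair of character positions in the input string, slicing the string with an offset recovers the text behind a token. A small sketch using the sentence and offsets printed above (hard-coded here so it runs without the tokenizer):

```python
text = 'Aaron will be one of the best data scientist!'
offsets = [(0, 0), (0, 5), (6, 10), (11, 13), (14, 17), (18, 20),
           (21, 24), (25, 29), (30, 34), (35, 44), (44, 45), (0, 0)]

# Slice the original string with each (start, end) pair; end is exclusive.
pieces = [text[start:end] for start, end in offsets]
print(pieces)
```

The two `(0, 0)` entries yield empty strings, which is how the special tokens can be told apart from real text.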

In [17]:
tokenizer.id_to_token(101)

'[CLS]'

In [18]:
# When the question is appended to the context, its first id must be skipped because it is [CLS]
print(raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['question'])
tokenizer.encode(raw_train_data['data'][0]['paragraphs'][0]['qas'][0]['question']).ids

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


[101,
 2000,
 3183,
 2106,
 1996,
 6261,
 2984,
 9382,
 3711,
 1999,
 8517,
 1999,
 10223,
 26371,
 2605,
 1029,
 102]

In [19]:
# SquadExample class: turns one raw (question, context, answer) record into model-ready inputs

class SquadExample():
    def __init__( self, question, context, start_idx, answer_text, all_answers):
        self.question = question
        self.context = context
        self.start_idx = start_idx
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False       # set to True when the example cannot be used and must be skipped
    
    def preprocess(self):
        # Normalize whitespace in the context, question, and answer
        context = ' '.join(str(self.context).split())
        question = ' '.join(str(self.question).split())
        answer = ' '.join(str(self.answer_text).split())

        # Compute the answer's end character index (exclusive)
        end_idx = self.start_idx + len(answer)
        if end_idx >= len(context):
            self.skip = True
            return      # stop processing this example
        
        # Mark which context characters belong to the answer: 0 -> outside, 1 -> inside
        is_char_in_ans = [0] * len(context)
        for idx in range(self.start_idx, end_idx):
            is_char_in_ans[idx] = 1
        
        # tokenize the context
        tokenized_context = tokenizer.encode(context)

        # *Find tokens that were created from answer characters
        ans_token_idx = []
        for idx, (start, end) in enumerate(tokenized_context.offsets):
            if sum(is_char_in_ans[start:end]) > 0:
                ans_token_idx.append(idx)
        
        # Skip the example if no token overlaps the answer
        if len(ans_token_idx) == 0:
            self.skip = True
            return
        
        # Token indices of the answer's start and end
        start_token_idx = ans_token_idx[0]
        end_token_idx = ans_token_idx[-1]

        # tokenize the question
        tokenized_question = tokenizer.encode(question)

        # Assemble the inputs in the form BERT expects
        input_ids = tokenized_context.ids + tokenized_question.ids[1:]      # drop the question's [CLS]; the context already starts with one
        token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(tokenized_question.ids[1:])
        attention_mask = [1] * len(input_ids)

        # Pad and create attention masks
        # Skip the example if it would need truncation
        padding_length = max_len - len(input_ids)
        if padding_length > 0:
            input_ids += [0] * padding_length
            attention_mask += [0] * padding_length
            token_type_ids += [0] * padding_length      # the value is irrelevant here because attention_mask is 0 for padding
        elif padding_length < 0:
            self.skip = True
            return
        
        self.input_ids = input_ids
        self.token_type_ids = token_type_ids
        self.attention_mask = attention_mask
        self.start_token_idx = start_token_idx
        self.end_token_idx = end_token_idx
        self.context_token_to_char = tokenized_context.offsets
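
The character-to-token mapping in `preprocess` can be tried in isolation. The sketch below is a toy under stated assumptions: `char_span_to_token_span` is a hypothetical helper name, and the offsets are written by hand for a short sentence instead of coming from the WordPiece tokenizer.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) indices of tokens overlapping [start_char, end_char)."""
    hits = [
        idx
        for idx, (tok_start, tok_end) in enumerate(offsets)
        # A token overlaps the answer if the two half-open ranges intersect.
        if tok_start < end_char and tok_end > start_char
    ]
    if not hits:
        return None  # mirrors the `skip` case: no token covers the answer
    return hits[0], hits[-1]

context = "the grotto at Lourdes, France"
# Hand-written offsets; (0, 0) plays the role of [CLS].
offsets = [(0, 0), (0, 3), (4, 10), (11, 13), (14, 21), (21, 22), (23, 29)]
answer = "Lourdes"
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span)
```

The class above reaches the same result with a per-character 0/1 list; the interval test here is only a more compact way to express the overlap check.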
    

In [20]:
def create_squad_examples(raw_data):
    """
        Convert the loaded dict into preprocessed SquadExample objects.
    """
    squad_examples = []
    for item in raw_data['data']:
        for para in item['paragraphs']:
            # Each paragraph has a single context
            # but several question-answer pairs
            context = para['context']
            for qa in para['qas']:
                question = qa['question']
                answer_text = qa['answers'][0]['text']
                all_answers = [_['text'] for _ in qa['answers']]     # texts of all reference answers
                start_idx = qa['answers'][0]['answer_start']
                squad_eg = SquadExample(
                    question, context, start_idx, answer_text, all_answers
                )
                squad_eg.preprocess()
                squad_examples.append(squad_eg)
    return squad_examples


def create_inputs_targets(squad_examples: list):
    """
        Collect the fields of the SquadExample objects into model inputs and targets.
    """
    dataset_dict = {
        'input_ids': [],
        'token_type_ids': [],
        'attention_mask': [],
        'start_token_idx': [],
        'end_token_idx': []
    }
    for item in squad_examples:
        if not item.skip:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))        # read the attribute named by key
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])             # one ndarray per field; x and y below are plain lists of these arrays, one per model input/output

    x = [
        dataset_dict["input_ids"],
        dataset_dict["token_type_ids"],
        dataset_dict["attention_mask"],
    ]
    y = [dataset_dict['start_token_idx'], dataset_dict['end_token_idx']]
    return x, y

In [21]:
# Build the training and evaluation sets

train_squad_examples = create_squad_examples(raw_train_data)
x_train, y_train = create_inputs_targets(train_squad_examples)
print(f"{len(train_squad_examples)} training points created.")

eval_squad_examples = create_squad_examples(raw_eval_data)
x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print(f"{len(eval_squad_examples)} evaluation points created.")

87599 training points created.
10570 evaluation points created.


### Build the QA model with the Keras Functional API

In [24]:
# encoder = TFBertModel.from_pretrained("bert-base-uncased")

def create_model():
    # BERT Encoder
    encoder = TFBertModel.from_pretrained('bert-base-uncased')

    # QA Model
    input_ids = layers.Input(shape=(max_len, ), dtype=tf.int32)
    token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
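    embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask).last_hidden_state  # the original example indexes the output with [0]; element 0 of the model output is last_hidden_state, so the two are equivalent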

    print(embedding.shape)

    start_logits = layers.Dense(1, name='start_logit')(embedding)
    start_logits = layers.Flatten()(start_logits)

    end_logits = layers.Dense(1, name='end_logit')(embedding)
    end_logits = layers.Flatten()(end_logits)

    start_probs = layers.Activation('softmax')(start_logits)
    end_probs = layers.Activation('softmax')(end_logits)

    model = keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[start_probs, end_probs]
    )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model

In [25]:
use_tpu = True
if use_tpu:
    # Create distribution strategy
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)

    # Create model
    with strategy.scope():
        model = create_model()
else:
    model = create_model()

model.summary()

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.81.217.146:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


(None, 384, 768)
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_4 (InputLayer)           [(None, 384)]        0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, 384)]        0           []                               
                                                                                                  
 input_5 (InputLayer)           [(None, 384)]        0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_4[0][0]',                
                                thPoolingAndCrossAt               'input_6[0]


### Create evaluation Callback
This callback will compute the exact match score using the validation data after every epoch.

In [26]:
def normalize_text(text):
    text = text.lower()

    # Remove punctuations
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in exclude)

    # Remove articles
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    text = re.sub(regex, " ", text)

    # Remove extra white space
    text = " ".join(text.split())
    return text


class ExactMatch(keras.callbacks.Callback):
    """
    Each `SquadExample` object contains the character level offsets for each token
    in its input paragraph. We use them to get back the span of text corresponding
    to the tokens between our predicted start and end tokens.
    All the ground-truth answers are also present in each `SquadExample` object.
    We calculate the percentage of data points where the span of text obtained
    from model predictions matches one of the ground-truth answers.
    """

    def __init__(self, x_eval, y_eval):
        super().__init__()
        self.x_eval = x_eval
        self.y_eval = y_eval

    def on_epoch_end(self, epoch, logs=None):
        pred_start, pred_end = self.model.predict(self.x_eval)      # per-token probabilities
        count = 0
        eval_examples_no_skip = [_ for _ in eval_squad_examples if not _.skip]
        for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
            squad_eg = eval_examples_no_skip[idx]
            offsets = squad_eg.context_token_to_char
            start = np.argmax(start)        # predicted start token index
            end = np.argmax(end)            # predicted end token index
            if start >= len(offsets):       # prediction falls outside the context tokens: invalid
                continue
                continue
            pred_char_start = offsets[start][0]
            if end < len(offsets):
                pred_char_end = offsets[end][1]
                pred_ans = squad_eg.context[pred_char_start:pred_char_end]
            else:
                pred_ans = squad_eg.context[pred_char_start:]

            normalized_pred_ans = normalize_text(pred_ans)
            normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
            if normalized_pred_ans in normalized_true_ans:
                count += 1
        acc = count / len(self.y_eval[0])
        print(f"\nepoch={epoch+1}, exact match score={acc:.2f}")
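
To make the metric concrete, here is a small standalone sketch of exact-match scoring. The `exact_match` helper and the toy predictions are illustrative assumptions; the normalization repeats `normalize_text` from the cell above so that the snippet runs on its own.

```python
import re
import string

def normalize_text(text):
    # Lowercase, drop punctuation, drop English articles, collapse whitespace.
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text.lower() if ch not in exclude)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions equal to any reference answer after normalization."""
    hits = sum(
        normalize_text(pred) in {normalize_text(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)

preds = ["The Denver Broncos", "Santa Clara", "gold"]
refs = [["Denver Broncos"], ["Santa Clara, California"], ["Gold", "golden"]]
print(exact_match(preds, refs))
```

In this toy case the first and third predictions match a reference after normalization, while the second is only a partial match and therefore counts as wrong; exact match gives no credit for overlap.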

In [27]:
exact_match_callback = ExactMatch(x_eval, y_eval)
model.fit(
    x_train,
    y_train,
    epochs=3,  # For demonstration, 3 epochs are recommended
    batch_size=64,
    callbacks=[exact_match_callback],
)

Epoch 1/3


epoch=1, exact match score=0.77
Epoch 2/3
epoch=2, exact match score=0.79
Epoch 3/3
epoch=3, exact match score=0.78


<keras.callbacks.History at 0x7ff024029a10>

In [28]:
# Predict

y_pred = model.predict(x_eval)

In [29]:
y_pred[0],  y_pred[1]

(array([[5.9186207e-09, 1.9206613e-05, 1.4013402e-07, ..., 1.3489361e-11,
         1.5574445e-11, 1.6293548e-11],
        [1.6131739e-08, 3.4378456e-05, 1.8297284e-07, ..., 1.4263093e-10,
         1.6351513e-10, 1.7102693e-10],
        [2.9683221e-07, 3.5957509e-04, 2.6142588e-06, ..., 9.4501384e-09,
         8.7614227e-09, 9.4266692e-09],
        ...,
        [2.4830035e-07, 2.0170484e-03, 5.9491657e-03, ..., 1.6196097e-09,
         1.6146300e-09, 1.6107262e-09],
        [2.3781251e-07, 5.2778289e-04, 2.2249920e-03, ..., 5.3055133e-10,
         5.2479976e-10, 5.1583554e-10],
        [4.1277110e-07, 1.4055843e-03, 3.8796843e-03, ..., 1.0579613e-09,
         1.0704446e-09, 1.0332432e-09]], dtype=float32),
 array([[2.5357842e-09, 3.0078019e-07, 1.0804186e-07, ..., 2.4031123e-11,
         2.2678592e-11, 2.2076561e-11],
        [4.2492228e-09, 2.9729611e-07, 9.7134119e-08, ..., 1.2735134e-10,
         1.2118233e-10, 1.1722004e-10],
        [5.5991296e-07, 3.9844649e-06, 5.0460953e-06, ...,

In [30]:
# start probabilities
y_pred[0].shape

(10331, 384)

In [31]:
# end probabilities
y_pred[1].shape

(10331, 384)

In [32]:
# Sanity check on one example: each row of probabilities should sum to about 1

print(sum(y_pred[0][0]))
print(sum(y_pred[1][0]))

1.0000001233202513
1.0000001436599704


In [33]:
# Run the same evaluation loop inline; afterwards squad_eg holds the last example so it can be inspected

pred_start, pred_end = model.predict(x_eval)      # per-token probabilities
count = 0
eval_examples_no_skip = [_ for _ in eval_squad_examples if not _.skip]

for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
    squad_eg = eval_examples_no_skip[idx]
    offsets = squad_eg.context_token_to_char
    start = np.argmax(start)        # predicted start token index
    end = np.argmax(end)            # predicted end token index
    if start >= len(offsets):       # prediction falls outside the context tokens: invalid
        continue
    pred_char_start = offsets[start][0]
    if end < len(offsets):
        pred_char_end = offsets[end][1]
        pred_ans = squad_eg.context[pred_char_start:pred_char_end]
    else:
        pred_ans = squad_eg.context[pred_char_start:]

    normalized_pred_ans = normalize_text(pred_ans)
    normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
    if normalized_pred_ans in normalized_true_ans:
        count += 1
acc = count / len(y_eval[0])
print(f"exact match score={acc:.2f}")

exact match score=0.78
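
The loop above picks the start and end tokens independently, so the predicted end can fall before the predicted start. One common alternative, which this notebook does not use, is to search for the pair with the highest joint probability subject to start <= end. A sketch with made-up probabilities; `best_span` and `max_answer_len` are hypothetical names.

```python
import numpy as np

def best_span(start_probs, end_probs, max_answer_len=30):
    """Return (start, end) maximizing start_probs[s] * end_probs[e] with s <= e."""
    best, best_score = (0, 0), -1.0
    for s, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_probs = np.array([0.05, 0.10, 0.60, 0.25])
end_probs = np.array([0.50, 0.05, 0.15, 0.30])
print(int(np.argmax(start_probs)), int(np.argmax(end_probs)))  # independent argmax
print(best_span(start_probs, end_probs))                       # constrained search
```

With these numbers the independent argmax yields an end index before the start index, whereas the constrained search returns a valid span.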


In [34]:
# Context
print(squad_eg.context)
# Question
print(squad_eg.question)
# Answer
print(squad_eg.answer_text)

The pound-force has a metric counterpart, less commonly used than the newton: the kilogram-force (kgf) (sometimes kilopond), is the force exerted by standard gravity on one kilogram of mass. The kilogram-force leads to an alternate, but rarely used unit of mass: the metric slug (sometimes mug or hyl) is that mass that accelerates at 1 m·s−2 when subjected to a force of 1 kgf. The kilogram-force is not a part of the modern SI system, and is generally deprecated; however it still sees use for some purposes as expressing aircraft weight, jet thrust, bicycle spoke tension, torque wrench settings and engine output torque. Other arcane units of force include the sthène, which is equivalent to 1000 N, and the kip, which is equivalent to 1000 lbf.
What is the seldom used force unit equal to one thousand newtons?
sthène


In [35]:
# Wrap the loop in a function that also returns the examples and their predicted answers

def look():
    squad_egs, pred_text_answers = [], []
    pred_start, pred_end = model.predict(x_eval)      # per-token probabilities
    count = 0
    eval_examples_no_skip = [_ for _ in eval_squad_examples if not _.skip]

    for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
        squad_eg = eval_examples_no_skip[idx]
        offsets = squad_eg.context_token_to_char
        start = np.argmax(start)        # predicted start token index
        end = np.argmax(end)            # predicted end token index
        if start >= len(offsets):       # prediction falls outside the context tokens: invalid
            continue
        pred_char_start = offsets[start][0]
        if end < len(offsets):
            pred_char_end = offsets[end][1]
            pred_ans = squad_eg.context[pred_char_start:pred_char_end]
        else:
            pred_ans = squad_eg.context[pred_char_start:]

        normalized_pred_ans = normalize_text(pred_ans)
        normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
        if normalized_pred_ans in normalized_true_ans:
            count += 1
        
        # Store the example and its normalized prediction
        squad_egs.append(squad_eg)
        pred_text_answers.append(normalized_pred_ans)
    acc = count / len(y_eval[0])
    print(f"exact match score={acc:.2f}")

    return squad_egs, pred_text_answers

squad_egs, pred_text_answers = look()

In [36]:
for i in range(10):
    squad_eg = squad_egs[i]
    print(f'Sample {i}')
    # Context
    print(f'Context: {squad_eg.context}')
    # Question
    print(f'Question: {squad_eg.question}')
    # Answer
    print(f'Answer: {squad_eg.answer_text}')
    # Prediction
    print(f'Prediction: {pred_text_answers[i]}')
    print('-'*30)

Sample 0
Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Question: Which NFL team represented the AFC at Super Bowl 50?
Answer: Denver Broncos
Prediction: denver broncos
------------------------------
Sample 1
Context: Super Bowl 50 was an American football game to determine the champion 