# BERT Input Pipeline

In [1]:
import tensorflow.compat.v1 as tf

import utils
import bert_utils

In [2]:
# parameters

MAX_SEQ_LEN = 512
BERT_PATH = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

## Loading Training Set

The `load_ag_news_dataset` function does the following:

1. Fetches the dataset from the Internet
2. Loads it into memory
3. Performs basic preprocessing and shuffling

The test set can be loaded in the same way by setting the argument `test=True`.
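
The preprocessing and shuffling steps might look roughly like the sketch below. This is not the actual `utils` implementation: the helper name `preprocess_and_shuffle`, the `(label, title, description)` row layout, the whitespace-based truncation, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np


def preprocess_and_shuffle(rows, max_seq_len, seed=0):
    """Hypothetical helper: `rows` is a list of (label, title, description)."""
    texts, labels = [], []
    for label, title, description in rows:
        # Join title and body, collapse runs of whitespace, and keep at most
        # max_seq_len whitespace-separated words.
        words = f"{title} {description}".split()
        texts.append(" ".join(words[:max_seq_len]))
        labels.append(int(label))

    # Apply the same random permutation to texts and labels so pairs stay aligned.
    order = np.random.RandomState(seed).permutation(len(texts))
    text_arr = np.array(texts, dtype=object)[order][:, np.newaxis]  # shape (N, 1)
    label_arr = np.array(labels)[order]                             # shape (N,)
    return text_arr, label_arr


# Made-up rows, only to exercise the helper.
rows = [
    (1, "Title one", "first   body text"),
    (3, "Title two", "second body text"),
    (2, "Title three", "third body text"),
]
demo_text, demo_label = preprocess_and_shuffle(rows, max_seq_len=4)
```

The `(N, 1)` shape for the text array is chosen to match the `text[0]` indexing used when the examples are printed in this notebook.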

In [3]:
train_text, train_label, num_classes = utils.load_ag_news_dataset(max_seq_len=MAX_SEQ_LEN,
                                                                  test=False)

Loaded training set from: /home/jovyan/.keras/datasets/ag_news
Examples: 120000 Classes: 4


For demonstration purposes, this notebook uses only the first 4 training examples.

In [4]:
train_text, train_label = train_text[:4], train_label[:4]

Display some examples:

In [5]:
for i, text in enumerate(train_text):
    print(train_label[i], text[0], "\n")

1 They really do drive for show Theirs is a golfing world without bunkers and hazards, sloping greens and Sunday pins. Only one thing matters to the professional long driver, and it's measured in yards, not strokes. 

3 Google Plans Desktop Search Tool for Apple PCs LOS ANGELES (Reuters) - Google Inc. &lt;A HREF="http://www.reuters.co.uk/financeQuoteLookup.jhtml?ticker=GOOG.O qtype=sym infotype=info qcat=news"&gt;GOOG.O&lt;/A&gt; plans to release a version of its desktop search tool for computers running on the Mac operating system from Apple Computer Inc. &lt;A HREF="http://www.reuters.co.uk/financeQuoteLookup.jhtml?ticker=AAPL.O qtype=sym infotype=info qcat=news"&gt;AAPL.O&lt;/A&gt;, Google chief executive Eric Schmidt said on Friday. 

3 Sony #39;s Vaio X: Like TiVo on Steroids CHIBA, JAPAN -- Sony will begin selling in Japan in November a combination personal computer and video server that can record up to seven channels of television simultaneously, it said at the Ceatec 2004 exhi

## Create the BERT Tokenizer

In [6]:
tokenizer = bert_utils.create_tokenizer_from_hub_module(BERT_PATH,
                                                        tf.Session())
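
BERT tokenizers split words into WordPiece sub-tokens. As an illustration only, the greedy longest-match-first lookup at the core of WordPiece can be sketched as follows. The function name `wordpiece_tokenize` and the tiny vocabulary are made up for this example; a real tokenizer also handles lowercasing and punctuation splitting, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical), just large enough for the examples.
toy_vocab = {"un", "##aff", "##able", "drive", "##r"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("driver", toy_vocab))
```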

Tokenize the text and convert it into `InputExample` objects:

In [7]:
train_examples = bert_utils.convert_text_to_examples(train_text, train_label)

In [8]:
print(train_examples)

[<bert_utils.InputExample object at 0x7f338c733710>, <bert_utils.InputExample object at 0x7f338c733748>, <bert_utils.InputExample object at 0x7f338c733780>, <bert_utils.InputExample object at 0x7f338c7337b8>]
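
The default `repr` above does not show what each object holds. Below is a minimal sketch of such a container, assuming `bert_utils.InputExample` mirrors the class of the same name in the original BERT repository's `run_classifier.py` (fields `guid`, `text_a`, `text_b`, `label`). The local module may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputExample:
    """Assumed layout: one (optionally paired) text plus its label."""
    guid: Optional[str]
    text_a: str
    text_b: Optional[str] = None  # only used for sentence-pair tasks
    label: Optional[int] = None


example = InputExample(guid=None, text_a="They really do drive for show", label=1)
print(example)  # the dataclass-generated repr lists every field
```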


## Creating Training Data

Convert the `InputExample` objects into NumPy arrays that can be fed to the model during training.

In [9]:
feat = bert_utils.convert_examples_to_features(tokenizer,
                                               train_examples,
                                               max_seq_length=MAX_SEQ_LEN,
                                               verbose=1)

(train_input_ids, train_input_masks, train_segment_ids, train_labels) = feat

Converting examples to features: 100%|██████████| 4/4 [00:00<00:00, 759.84it/s]


### Input Tokens

Each ID identifies a WordPiece in the BERT vocabulary. Each row is one training example, zero-padded to `MAX_SEQ_LEN` (512) positions.

In [10]:
train_input_ids

array([[  101,  2027,  2428, ...,     0,     0,     0],
       [  101,  8224,  3488, ...,     0,     0,     0],
       [  101,  8412,  1001, ...,     0,     0,     0],
       [  101, 16565,  2012, ...,     0,     0,     0]])
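
Each row above starts with 101 and ends in zeros. In the uncased BERT vocabulary, 101 is `[CLS]`, 102 is `[SEP]`, and 0 is `[PAD]`. The sketch below shows how one such row could be assembled from WordPiece IDs. `build_input_ids` is a hypothetical helper, not the `bert_utils` code.

```python
import numpy as np

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs in the uncased BERT vocabulary


def build_input_ids(piece_ids, max_seq_len):
    """Hypothetical helper: wrap WordPiece IDs in [CLS]/[SEP] and zero-pad."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    piece_ids = list(piece_ids)[: max_seq_len - 2]
    ids = [CLS_ID] + piece_ids + [SEP_ID]
    # Fill the remaining positions with the padding ID.
    ids += [PAD_ID] * (max_seq_len - len(ids))
    return np.array(ids)


# 2027 and 2428 are the first two WordPiece IDs of the first example above.
row = build_input_ids([2027, 2428], max_seq_len=8)
```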

### Input Masks

In [11]:
train_input_masks

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
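
The input mask marks real tokens with 1 and padding with 0, so that attention ignores the padded positions. Because the padding ID is 0, an equivalent mask can be recovered from the token IDs. A small sketch with a made-up, shortened ID array:

```python
import numpy as np

# Shortened, illustrative ID rows (not the notebook's real 512-wide arrays).
input_ids = np.array([
    [101, 2027, 2428, 102, 0, 0],
    [101, 8224, 102, 0, 0, 0],
])

# 1 where there is a real token, 0 where the row is padded.
input_mask = (input_ids != 0).astype(np.int64)

# Summing the mask gives the unpadded length of each example.
seq_lengths = input_mask.sum(axis=1)
```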

### Segment IDs

In [12]:
train_segment_ids

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
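
Segment IDs tell BERT which sentence a token belongs to. They are all zero here because each example is a single text. In a sentence-pair task, the tokens of the second sentence would be marked with 1. A sketch of that convention, using a hypothetical helper `build_segment_ids`:

```python
def build_segment_ids(len_a, len_b, max_seq_len):
    """Hypothetical helper: segment IDs for [CLS] A [SEP] (B [SEP]) + padding."""
    # [CLS], the tokens of sentence A, and the first [SEP] belong to segment 0.
    segment_ids = [0] * (len_a + 2)
    if len_b > 0:
        # The tokens of sentence B and the closing [SEP] belong to segment 1.
        segment_ids += [1] * (len_b + 1)
    if len(segment_ids) > max_seq_len:
        raise ValueError("sequence does not fit into max_seq_len")
    # Padding positions get segment 0.
    segment_ids += [0] * (max_seq_len - len(segment_ids))
    return segment_ids


single = build_segment_ids(len_a=3, len_b=0, max_seq_len=8)
pair = build_segment_ids(len_a=2, len_b=2, max_seq_len=10)
```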

### Labels

In [13]:
train_labels

array([[1],
       [3],
       [3],
       [2]])
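
The labels come back as a column vector of shape `(4, 1)`. Depending on the loss function used later, they may need to be flattened or one-hot encoded. The sketch below assumes the labels are zero-based class indices in `[0, num_classes)`, which this notebook does not confirm.

```python
import numpy as np

labels = np.array([[1], [3], [3], [2]])  # same values as train_labels above
num_classes = 4

# Flatten the column vector to shape (4,).
flat_labels = labels.ravel()

# One row per example, with a single 1.0 at the index of its class.
one_hot = np.eye(num_classes, dtype=np.float32)[flat_labels]
```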