<a href="https://colab.research.google.com/github/RichardXiao13/Google_Code_In/blob/master/Use_ALBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Albert Classification for TF2**
Albert is "A Lite" version of BERT that handles many natural language processing tasks. Its greatly reduced parameter count makes it easier to use on limited hardware without compromising on performance. The one catch is that Albert needs preprocessed text before it can train on or predict new data, which can make the initial setup seem difficult. This Colab tries to make that setup as easy as possible, then uses Albert from TensorFlow Hub to build a classifier that distinguishes between positive and negative movie reviews.

# **Setup**
To begin this Colab, we first select TensorFlow 2.x. We also install bert-for-tf2, which lets Albert work in this environment without TensorFlow 1.x, along with sentencepiece. Once those are installed, we import TensorFlow Hub to set up Albert. Albert requires preprocessed text in the form of tokens according to its vocab file, so we also import tokenization from bert.

In [0]:
try:
  # %tensorflow_version is a Colab-only magic that selects TensorFlow 2.x.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

TensorFlow 2.x selected.


In [0]:
!pip install bert-for-tf2
!pip install sentencepiece


In [0]:
import tensorflow_hub as hub
from bert import tokenization

# **Download Albert**
Using TensorFlow Hub, we download and extract the Albert archive to find the vocab file used for tokenization. Then, we have to edit the vocab file to include the entry `["UNK"]   0` so that the tokenizer can map unknown words to `UNK` tokens.

In [0]:
tf.keras.utils.get_file("ALBERT.tar.gz", "https://tfhub.dev/google/albert_base/2?tf-hub-format=compressed", extract=True, cache_dir="/content/")

Downloading data from https://tfhub.dev/google/albert_base/2?tf-hub-format=compressed


'/content/datasets/ALBERT.tar.gz'
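
The returned path shows where the download lands: `get_file` stores files in a `datasets` subdirectory of `cache_dir` by default. The sketch below reproduces that path arithmetic with the standard library only. The location of the vocab file assumes the archive unpacks an `assets/` folder next to the archive, as the next cell implies for the TF 2.x release this Colab ran on; other versions may extract elsewhere.

```python
import posixpath

cache_dir = "/content/"
cache_subdir = "datasets"   # default value of get_file's cache_subdir argument
fname = "ALBERT.tar.gz"

archive_path = posixpath.join(cache_dir, cache_subdir, fname)
# Assumption: the archive contents sit beside the archive, under assets/.
vocab_guess = posixpath.join(posixpath.dirname(archive_path), "assets", "30k-clean.vocab")

print(archive_path)  # /content/datasets/ALBERT.tar.gz
print(vocab_guess)   # /content/datasets/assets/30k-clean.vocab
```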

In [0]:
vocab_path = "/content/datasets/assets/30k-clean.vocab"

# **Tokenizer**
Albert requires its inputs as preprocessed tokens. This means we have to create a tokenizer from the vocab file we extracted and use it to convert sentences to tokens. We also have to create masks and segment ids from the sentences so that Albert can return pooled values. We establish a max sequence length so that each sentence contains the same number of tokens.

In [0]:
def create_tokenizer(vocab_file, do_lower_case=True):
  tokenizer = tokenization.albert_tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

  return tokenizer

In [0]:
tokenizer = create_tokenizer(vocab_path)

In [0]:
def sentence_to_features(sentence, tokenizer, max_sent_len):
    tokens = ["[CLS]"]
    tokens.extend(tokenizer.tokenize(sentence))
    if len(tokens) > max_sent_len-1:
        tokens = tokens[:max_sent_len-1]
    tokens.append("[SEP]")
    
    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)

    zero_mask = [0] * (max_sent_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)
    
    return input_ids, input_mask, segment_ids
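
To see what these three lists look like without loading the real vocab file, the sketch below applies the same truncate-then-pad logic using a hypothetical whitespace tokenizer and a six-entry toy vocabulary. `ToyTokenizer`, `toy_vocab`, and `toy_sentence_to_features` are illustrative names only and are not part of bert-for-tf2.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the Albert tokenizer: lowercases and splits on whitespace."""

    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Words missing from the vocabulary map to the [UNK] id.
        return [self.vocab.get(token, self.vocab["[UNK]"]) for token in tokens]


def toy_sentence_to_features(sentence, tokenizer, max_sent_len):
    # Reserve one slot for [SEP], then pad every list with zeros to max_sent_len.
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence)
    tokens = tokens[:max_sent_len - 1] + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = [0] * len(input_ids)
    padding = [0] * (max_sent_len - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "great": 4, "movie": 5}
toy_tokenizer = ToyTokenizer(toy_vocab)

toy_ids, toy_mask, toy_segs = toy_sentence_to_features("Great movie overall", toy_tokenizer, 8)
print(toy_ids)   # [2, 4, 5, 1, 3, 0, 0, 0]
print(toy_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The mask marks real tokens with 1 and padding with 0, and the segment ids stay at 0 because each input is a single sentence.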

In [0]:
def convert_sentences_to_features(sentences, tokenizer, max_sent_len):
    # Build fixed-length ids, masks, and segment ids for each sentence.
    new_input_ids = []
    new_input_masks = []
    new_segment_ids = []
    
    for sentence in sentences:
        input_ids, input_mask, segment_ids = sentence_to_features(sentence, tokenizer, max_sent_len)
        new_input_ids.append(input_ids)
        new_input_masks.append(input_mask)
        new_segment_ids.append(segment_ids)
    
    # Only the first sentence's features are returned, so pass one sentence (or one review) per call.
    return tf.constant(new_input_ids[0]), tf.constant(new_input_masks[0]), tf.constant(new_segment_ids[0])

# **Load imdb_reviews From Tensorflow Datasets**
This is the dataset we use to classify whether a movie review is negative or positive.

In [0]:
import tensorflow_datasets as tfds

In [0]:
dataset = tfds.load("imdb_reviews", as_supervised=True)

Downloading and preparing dataset imdb_reviews (80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/0.1.0...

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/0.1.0. Subsequent calls will reuse this data.


In [0]:
train_data = dataset["train"]
train_data = train_data.batch(1)

In [0]:
test_data = dataset["test"]
test_data = test_data.batch(1)

# **Convert to Pooled Values**
For our classifier, we have to convert the sentences in the dataset into data that Albert understands. First, we create a function that takes the dataset and returns the ids, masks, segments, and labels for all the data. These values are the inputs to a second function that passes them through the Albert module and returns pooled values, which can then be used for classification. Finally, we create a new dataset from the pooled values and the labels we extracted previously. Note that Albert returns two types of outputs: pooled and sequence outputs. Here we only use the pooled output, which represents the entire sequence; sequence outputs hold a value for each token in the sequence.

In [0]:
albert_module = hub.load("https://tfhub.dev/google/albert_base/2")

Instructions for updating:
If using Keras pass *_constraint arguments to layers.



In [0]:
def create_albert(input_ids, input_mask, segment_ids):
  albert_outputs = albert_module.signatures["tokens"](input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)
  pooled_output = albert_outputs["pooled_output"]
  sequence_output = albert_outputs["sequence_output"]
  return pooled_output, sequence_output
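
To make the difference between the two outputs concrete, here is a toy sketch in plain Python. It assumes the usual BERT-style convention, in which the pooled output is the first token's vector passed through a dense layer with a tanh activation; the hub module's internals are not shown in this Colab, so treat this as an illustration rather than its actual implementation. `toy_pooler` and the tiny weights are hypothetical.

```python
import math


def toy_pooler(sequence_output, weights, bias):
    """Apply a dense layer plus tanh to the first token's vector."""
    first_token = sequence_output[0]
    pooled = []
    for row, b in zip(weights, bias):
        pooled.append(math.tanh(sum(w * x for w, x in zip(row, first_token)) + b))
    return pooled


# One sequence of 3 tokens with hidden size 2 (Albert base uses 768).
toy_sequence_output = [[0.5, -1.0], [0.1, 0.2], [0.0, 0.3]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, purely for illustration
toy_bias = [0.0, 0.0]

toy_pooled = toy_pooler(toy_sequence_output, toy_weights, toy_bias)
print(len(toy_sequence_output), "token vectors ->", len(toy_pooled), "pooled values")
```

The sequence output keeps one vector per token, while the pooled output collapses the sequence into a single vector of the hidden size.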

In [0]:
max_seq_len = 128

In [0]:
def preprocess(data, num=7500):
  # Tokenize the first `num` reviews; the default matches the `num` used when pooling.
  ids = []
  masks = []
  segs = []
  lbls = []
  for text, label in data.take(num):
    lbl = []
    # Each batch holds one review, so `text` is an array with a single byte string.
    text = text.numpy()
    in_ids, in_masks, in_segs = convert_sentences_to_features(text, tokenizer, max_seq_len)
    ids.append(in_ids)
    masks.append(in_masks)
    segs.append(in_segs)
    lbl.append(int(label.numpy()[0]))
    lbls.append(lbl)
  return ids, masks, segs, lbls

In [0]:
ids, masks, segs, lbls = preprocess(train_data)

In [0]:
test_ids, test_masks, test_segs, test_lbls = preprocess(test_data)

In [0]:
def get_pools(ids, masks, segs, num):
  pooled_outputs = []
  for i in range(num):
    pools, sequences = create_albert(ids[i], masks[i], segs[i])
    pooled_outputs.append(pools)
  return pooled_outputs

In [0]:
num = 7500

In [0]:
pooled_outputs = get_pools(ids, masks, segs, num)

In [0]:
test_pooled_outputs = get_pools(test_ids, test_masks, test_segs, num)

In [0]:
train_set = tf.data.Dataset.from_tensor_slices((pooled_outputs, lbls))
train_set = train_set.batch(64)

In [0]:
test_set = tf.data.Dataset.from_tensor_slices((test_pooled_outputs, test_lbls))
test_set = test_set.batch(64)
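
As a quick sanity check on the batching, 7500 examples at a batch size of 64 do not divide evenly, so the last batch is smaller than the rest. The arithmetic below uses only the two values from this Colab and agrees with the "Train for 118 steps" line in the training log.

```python
import math

num_examples = 7500   # value of `num` in this Colab
batch_size = 64

steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # 118
print(last_batch_size)  # 12
```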

# **Create the Model**
We can feed the new dataset into a model whose input has shape (None, max_seq_len, 768). We use max_seq_len and 768 because that is the shape of the tensor Albert returns for each review here.

In [0]:
model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(max_seq_len, 768)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(256, activation="relu"),
  tf.keras.layers.Dense(1, activation="sigmoid")
])
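
Flattening a (max_seq_len, 768) input gives the first dense layer a very wide input, so almost all of the trainable parameters sit in that layer. A rough count, assuming `max_seq_len = 128` as set earlier:

```python
max_seq_len = 128
hidden_size = 768

flat_features = max_seq_len * hidden_size        # width of the Flatten output
dense_256_params = flat_features * 256 + 256     # weights plus biases
dense_1_params = 256 * 1 + 1                     # weights plus bias

print(flat_features)                      # 98304
print(dense_256_params + dense_1_params)  # 25166337
```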

In [0]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [0]:
history = model.fit(train_set, epochs=20)

Train for 118 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


# **Try it Out**
Here, we can try our own sentence and see what our model predicts!

In [0]:
# Note: convert_sentences_to_features returns features for the first sentence in the list only.
sentence = ["This movie was pretty bad!", "I would not want to recommend this to a friend."]
test_ids, test_masks, test_segs = convert_sentences_to_features(sentence, tokenizer, max_seq_len)
test_pools, _ = create_albert(test_ids, test_masks, test_segs)

In [0]:
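test_pools = tf.expand_dims(test_pools, axis=0)
predictions = model.predict(test_pools)
# The model has a single sigmoid output, so threshold it at 0.5;
# argmax over a single value would always return 0.
predictions = tf.cast(predictions[:, 0] > 0.5, tf.int32)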

In [0]:
labels = ["Negative", "Positive"]
for i in predictions:
  # Convert the scalar tensor to a plain int before indexing the label list.
  pred = labels[int(i.numpy())]
  print(pred)

Negative
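
A model that ends in a single sigmoid unit outputs one probability per example, so the class comes from comparing that probability with a threshold. Taking an argmax over a one-element vector always gives index 0, whatever the probability is. The plain-Python sketch below shows the difference; the probabilities are made up for illustration and are not outputs of the model above.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


toy_labels = ["Negative", "Positive"]
toy_probs = [[0.93], [0.08]]  # hypothetical sigmoid outputs, one per review

by_argmax = [toy_labels[argmax(p)] for p in toy_probs]
by_threshold = [toy_labels[int(p[0] > 0.5)] for p in toy_probs]

print(by_argmax)     # ['Negative', 'Negative']
print(by_threshold)  # ['Positive', 'Negative']
```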
