# Embedding with BERT using GluonNLP

This notebook gives a short example of installing GluonNLP and then using a pretrained BERT model to encode a sentence. The resulting encoding can serve as the input features for any downstream learning algorithm, which makes it a practical starting point during the early stages of product development. If the application seems sound, the model can be fine-tuned for additional performance. Scripts to aid in this task can be found in the [GluonNLP BERT model zoo](https://gluon-nlp.mxnet.io/master/model_zoo/bert/index.html).

In [1]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
# SPDX-License-Identifier: MIT-0

# Install or upgrade the required libraries
!pip install --upgrade pip
!pip install --upgrade mxnet gluonnlp

import warnings
warnings.filterwarnings('ignore')

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (20.0.2)
Requirement already up-to-date: mxnet in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (1.5.1.post0)
Requirement already up-to-date: gluonnlp in /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (0.8.3)


In [2]:
import gluonnlp as nlp
import mxnet as mx

# Load the pretrained BERT Base model (12 layers, 768 hidden units, 12 attention heads)
model, vocab = nlp.model.get_model('bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', use_classifier=False, use_decoder=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=512, pair=False, pad=False)

# Transform the text into token ids, valid length, and segment ids
sample = transform(['AWS Embark provides onboarding, training, and implementation support to launch your machine learning journey!'])
words, valid_len, segments = mx.nd.array([sample[0]]), mx.nd.array([sample[1]]), mx.nd.array([sample[2]])

# Encode: one encoding per token, plus a pooled encoding for the whole sentence
seq_encoding, cls_encoding = model(words, segments, valid_len)

The first step in using a transformer is to split the sentence into tokens drawn from a fixed vocabulary. GluonNLP handles this cleanly, but we can look inside to see how the sentence is split. Notice that the model uses a subword vocabulary, in which some words are split into constituent parts: "onboarding" becomes `'onboard'` and `'##ing'`.

In [3]:
[vocab.to_tokens(int(w.asscalar())) for w in words[0]]

['[CLS]',
 'aw',
 '##s',
 'embark',
 'provides',
 'onboard',
 '##ing',
 ',',
 'training',
 ',',
 'and',
 'implementation',
 'support',
 'to',
 'launch',
 'your',
 'machine',
 'learning',
 'journey',
 '!',
 '[SEP]']
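
The subword split shown above can be approximated by a greedy longest-match-first procedure over the vocabulary. The sketch below is a simplified illustration, not GluonNLP's actual implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first.

    Hypothetical helper for illustration; pieces that do not start the
    word carry the '##' continuation prefix, as in the output above.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption, far smaller than the real BERT vocabulary).
toy_vocab = {"aw", "##s", "embark", "onboard", "##ing"}

for w in ["aws", "embark", "onboarding"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, "onboarding" is split the same way as in the notebook output, because "onboard" is the longest prefix present in the vocabulary and "##ing" covers the remainder.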

We can now look at `cls_encoding`, an embedding of the whole sentence that can be used for downstream tasks like classification. The other output, `seq_encoding`, gives an encoding for each token in the sentence and can be used for tasks like machine translation or part-of-speech tagging.

In [4]:
cls_encoding


[[-0.93319327 -0.64595526 -0.9838873   0.8745588   0.83370477 -0.27255702
   0.83646435  0.5017094  -0.96900374 -0.99999803 -0.80741197  0.9707317
   0.9694664   0.8220918   0.94080067 -0.8032321  -0.5185906  -0.68784434
   0.5078429  -0.19376715  0.82072085  0.99999994 -0.4470562   0.46086797
   0.72556126  0.9984201  -0.91346705  0.9421183   0.9667642   0.8444159
  -0.7319919   0.4337138  -0.98731196 -0.42980403 -0.9875463  -0.99363977
   0.6884772  -0.78827167 -0.02625772 -0.22823371 -0.9275603   0.49193987
   0.99999875  0.3563488   0.8028779  -0.44576868 -1.          0.39309335
  -0.8957847   0.99088544  0.9376737   0.97079736  0.4507445   0.71947384
   0.67045027 -0.55917674  0.11889828  0.2813297  -0.45418862 -0.73720706
  -0.6546041   0.5714999  -0.9635186  -0.91710484  0.98809004  0.9682401
  -0.34150386 -0.41875562 -0.35155183  0.26857936  0.9030835   0.39205298
  -0.4644852  -0.9238823   0.884571    0.5233499  -0.82955945  1.
  -0.7029816  -0.9657539   0.9412182   0.9366446

etc.
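
One simple downstream use of a sentence vector such as `cls_encoding` is to compare two sentences by cosine similarity. The sketch below uses only the standard library; the helper name `cosine_similarity` and the short placeholder vectors are assumptions for illustration and are not real model output. In the notebook, a real vector could be obtained with something like `cls_encoding[0].asnumpy()`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for two sentence encodings (made up).
sentence_a = [0.5, -0.25, 1.0, 0.0]
sentence_b = [0.5, -0.25, 1.0, 0.0]
sentence_c = [-0.5, 0.25, -1.0, 0.0]

print(cosine_similarity(sentence_a, sentence_b))  # identical direction
print(cosine_similarity(sentence_a, sentence_c))  # opposite direction
```

Values near 1 indicate vectors pointing in similar directions, and values near -1 indicate opposite directions. Whether raw `cls_encoding` vectors give useful similarities without fine-tuning depends on the task, so treat this as a starting point to evaluate rather than a guaranteed result.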