# Sentiment Analysis in 10 Minutes with BERT and TensorFlow
- Original Article  
  https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671

- Data Source - Stanford Data Repository:  
  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

- https://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
     |████████████████████████████████| 1.9MB 16.1MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 48.9MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 52.7MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=b71e9ae098

In [2]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [3]:
myModelPath = '/gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/'

In [4]:
!ls {myModelPath.replace(' ', '\ ')} -lh

total 1.5G
-rw------- 1 root root  433 Feb 23 18:20 config.json
-rw------- 1 root root 421M Feb 23 18:20 pytorch_model.bin
-rw------- 1 root root 8.8K Feb 23 18:20 README.md
-rw------- 1 root root 510M Feb 23 18:20 rust_model.ot
-rw------- 1 root root 512M Feb 23 18:20 tf_model.h5
-rw------- 1 root root   28 Feb 23 18:20 tokenizer_config.json
-rw------- 1 root root 456K Feb 23 18:20 tokenizer.json
-rw------- 1 root root 227K Feb 23 18:20 vocab.txt


In [5]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained(myModelPath)
tokenizer = BertTokenizer.from_pretrained(myModelPath)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at /gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/ and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
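
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard hidden size of 768 for bert-base models and two output classes. The variable names are illustrative only.

```python
# Hidden size of bert-base models and the number of target classes
# (2 here: negative / positive). Both values are assumptions of this sketch.
hidden_size = 768
num_labels = 2

# A Dense layer has one weight per (input, output) pair plus one bias per output.
classifier_params = hidden_size * num_labels + num_labels

# Encoder parameter count copied from the model summary.
bert_params = 109_482_240
total_params = bert_params + classifier_params

print(classifier_params)  # 1538
print(total_params)       # 109483778
```

Only these 1,538 classifier parameters are newly initialized, which is why the loading message recommends training on a downstream task before using the model for predictions.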


# IMDB Dataset
IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The [IMDB Reviews](https://ai.stanford.edu/~amaas/data/sentiment/) dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. 

In addition, there are 50,000 unlabeled reviews that we will not use in this case study.

In this case study, we will only use the training dataset.

## Initial Imports
We start with two imports: TensorFlow and pandas.

In [7]:
import tensorflow as tf
import pandas as pd

In [8]:
myDataPath = '/gdrive/MyDrive/Colab Notebooks/Transformers/data/'
!ls {myDataPath.replace(' ', '\ ')} -lh

total 221M
-rw------- 1 root root 6.4M Feb 25 18:19 aclImdb_v1_test.csv
-rw------- 1 root root  26M Feb 25 18:19 aclImdb_v1_train.csv
-rw------- 1 root root  58M Feb 24 09:30 ToxicComments_test.csv
-rw------- 1 root root  66M Feb 24 09:30 ToxicComments_train_conv.csv
-rw------- 1 root root  66M Feb 24 09:30 ToxicComments_train.csv


In [9]:
test = pd.read_csv(myDataPath + 'aclImdb_v1_test.csv', sep="|")
test

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
0,0,I can't believe that so much talent can be was...
1,0,This movie blows - let's get that straight rig...
2,0,"The saddest thing about this ""tribute"" is that..."
3,0,I'm only rating this film as a 3 out of pity b...
4,1,Something surprised me about this movie - it w...
...,...,...
4995,1,Maybe one of the most entertaining Ninja-movie...
4996,0,"Sometimes, making something strange and contem..."
4997,1,If you like cars you will love this film!<br /...
4998,1,Our imp of the perverse did good his first tim...


In [10]:
train = pd.read_csv(myDataPath + 'aclImdb_v1_train.csv', sep="|")
train

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
0,1,Canadian director Vincenzo Natali took the art...
1,1,I gave this film 10 not because it is a superb...
2,1,I admit to being somewhat jaded about the movi...
3,1,"For a long time, 'The Menagerie' was my favori..."
4,0,A truly frightening film. Feels as if it were ...
...,...,...
19995,1,Well this movie was probobly one of the funnie...
19996,1,"I love this movie, but can't get what is in th..."
19997,1,<br /><br />Superb film with no actual spoken ...
19998,1,David Beckham is a British soccer star and the...


## Creating Input Sequences
We have two pandas DataFrame objects that need to be converted into a form the BERT model can consume. We will use the InputExample class, which wraps a single text (or text pair) together with its label. An InputExample can be constructed as follows:

In [11]:
# transformers.InputExample
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

Now we will create two main functions:

1. `convert_data_to_examples`: accepts our train and test datasets and converts each row into an InputExample object.

2. `convert_examples_to_tf_dataset`: tokenizes the InputExample objects, builds the input features the model expects from the tokens, and wraps them in a `tf.data.Dataset` that we can feed to the model.

In [12]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples  

In [13]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                           test, 
                                                                           'DATA_COLUMN', 
                                                                           'LABEL_COLUMN')

In [14]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads on the right up to max_length; pad_to_max_length is deprecated
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )
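
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch of what fixed-length encoding does to a single sentence. It is a simplification, not the tokenizer's actual implementation. `pad_and_truncate` is a hypothetical helper and the token IDs in the examples are arbitrary placeholders. The special-token IDs (101, 102, 0) are those of the bert-base-uncased vocabulary.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_truncate(token_ids, max_length):
    """Simplified single-sentence encoding to a fixed length (hypothetical helper)."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Right-pad both lists up to max_length; padded positions are masked out.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad

    # Single-sentence inputs use segment 0 everywhere.
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# Arbitrary placeholder token IDs, not real vocabulary lookups.
short = pad_and_truncate([2001, 2002, 2003], max_length=8)
long = pad_and_truncate(list(range(1000, 1020)), max_length=8)

print(short["input_ids"])       # [101, 2001, 2002, 2003, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

Because every example ends up with the same length, the generator-based dataset can be batched without any extra padding logic.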


In [15]:
DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

In [16]:
print (str(type(DATA_COLUMN)) + ' ' + DATA_COLUMN)
print (str(type(LABEL_COLUMN)) + ' ' + LABEL_COLUMN)

<class 'str'> DATA_COLUMN
<class 'str'> LABEL_COLUMN


In [17]:
train.head(5)

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
0,1,Canadian director Vincenzo Natali took the art...
1,1,I gave this film 10 not because it is a superb...
2,1,I admit to being somewhat jaded about the movi...
3,1,"For a long time, 'The Menagerie' was my favori..."
4,0,A truly frightening film. Feels as if it were ...


In [18]:
%%time

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)



CPU times: user 1min 54s, sys: 178 ms, total: 1min 55s
Wall time: 1min 55s
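
It helps to know how many batches the `.batch(32).repeat(2)` pipeline yields, since Keras treats one full pass over a finite dataset as one epoch. The following is a rough sketch of that arithmetic. `steps_per_keras_epoch` is a hypothetical helper, and the row count of 20,000 is an assumption based on the truncated table output, not a verified figure.

```python
import math

def steps_per_keras_epoch(num_examples, batch_size=32, repeats=2):
    # batch() without drop_remainder keeps the final partial batch,
    # so the number of batches per pass is rounded up.
    batches_per_pass = math.ceil(num_examples / batch_size)
    # repeat(n) applied after batch() replays those batches n times.
    return batches_per_pass * repeats

# Assumed training-set size; substitute len(train) in the notebook.
steps = steps_per_keras_epoch(20_000)
print(steps)      # 1250 batches per Keras epoch
print(steps * 2)  # 2500 optimizer steps when fitting with epochs=2
```

Under these assumptions, fitting with `epochs=2` passes over the training data four times in total.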


## Configuring the BERT model and Fine-tuning
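We will use Adam as our optimizer, SparseCategoricalCrossentropy as our loss function (the labels are integer class IDs and the model outputs raw logits, hence `from_logits=True`), and SparseCategoricalAccuracy as our accuracy metric. Fine-tuning the model for 2 epochs should give us around 95% accuracy, which is great. Note that because the training dataset was built with `.repeat(2)`, each of these epochs passes over the training data twice.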

In [19]:
%%time

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Epoch 2/2
CPU times: user 13min 22s, sys: 13min 46s, total: 27min 9s
Wall time: 37min 49s
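
For intuition about the loss passed to `compile`, here is a minimal pure-Python sketch of sparse categorical cross-entropy computed from raw logits for a single example. It illustrates the formula only and is not the Keras implementation, which also averages the per-example losses over the batch. The function name and the logits are illustrative assumptions.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Loss for one example: -log(softmax(logits)[label])."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log softmax(logits)[label] == log_sum_exp - logits[label]
    return log_sum_exp - logits[label]

# Made-up logits for a two-class problem (0 = negative, 1 = positive).
uncertain = sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=1)
confident = sparse_categorical_crossentropy_from_logits([-5.0, 5.0], label=1)
wrong = sparse_categorical_crossentropy_from_logits([5.0, -5.0], label=1)

print(uncertain)  # equals log(2): the model is undecided
print(confident)  # close to 0: high probability on the true class
print(wrong)      # large: high probability on the wrong class
```

The label is passed as an integer index rather than a one-hot vector, which is what "sparse" refers to.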


Training the model might take a while, so make sure GPU acceleration is enabled in the Notebook Settings. Once training is complete, we can move on to making sentiment predictions.

## Making Predictions
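I created a list of four reviews. The first is positive and the second is clearly negative. The third is the opening sentence of a negatively labeled training review shown earlier, and the fourth is a short negative remark.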

In [20]:
pred_sentences = ['This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
                  'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie',
                  'A truly frightening film.',
                  'What a waste of time.']

We need to tokenize our reviews with our pre-trained BERT tokenizer. We then feed the tokenized sequences to the model and apply a softmax to the output logits to get class probabilities. The argmax of those probabilities tells us whether the predicted sentiment is negative (index 0) or positive (index 1). Finally, we print the results with a simple for loop:

In [21]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]] +" with score: "+ str(tf_predictions[i][label[i]].numpy()))
  print ()

This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good : 
 Positive with score: 0.9982886

One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie : 
 Negative with score: 0.99944216

A truly frightening film. : 
 Positive with score: 0.9975387

What a waste of time. : 
 Negative with score: 0.99877983
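
The softmax and argmax steps can be illustrated without TensorFlow. The sketch below reimplements both in plain Python on made-up logits. `softmax` and `argmax` here are stand-ins for `tf.nn.softmax` and `tf.argmax`, and the logit values are not taken from the model.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest element.
    return max(range(len(values)), key=values.__getitem__)

labels = ["Negative", "Positive"]
logits = [-1.5, 2.0]  # made-up logits for one review

probs = softmax(logits)
pred = argmax(probs)
print(labels[pred], "with score:", probs[pred])
```

The printed score is simply the probability assigned to the winning class, so it never falls below 0.5 in a two-class setup.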



## Debugging the Final Tensor Shape

In [22]:
tf_predictions.shape

TensorShape([4, 2])

In [23]:
for i in range(len(tf_predictions)):
  print (tf_predictions[i])

tf.Tensor([0.00171144 0.9982886 ], shape=(2,), dtype=float32)
tf.Tensor([9.9944216e-01 5.5781094e-04], shape=(2,), dtype=float32)
tf.Tensor([0.00246128 0.9975387 ], shape=(2,), dtype=float32)
tf.Tensor([0.99877983 0.00122019], shape=(2,), dtype=float32)


In [24]:
for i in range(len(tf_predictions)):
  print (str(tf_predictions[i][0]) + ' - ' + str(tf_predictions[i][1]))

tf.Tensor(0.0017114393, shape=(), dtype=float32) - tf.Tensor(0.9982886, shape=(), dtype=float32)
tf.Tensor(0.99944216, shape=(), dtype=float32) - tf.Tensor(0.00055781094, shape=(), dtype=float32)
tf.Tensor(0.002461283, shape=(), dtype=float32) - tf.Tensor(0.9975387, shape=(), dtype=float32)
tf.Tensor(0.99877983, shape=(), dtype=float32) - tf.Tensor(0.0012201883, shape=(), dtype=float32)


In [25]:
for i in range(len(tf_predictions)):
  print(tf_predictions[i][label[i]].numpy())

0.9982886
0.99944216
0.9975387
0.99877983


With the code above, you can predict the sentiment of as many reviews as you like.
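
For a long list of reviews, sending everything through the model in one call may exhaust memory, so it can help to process the list in chunks. The following is a minimal sketch of that idea. `chunked` and `predict_in_batches` are hypothetical helpers, and `predict_fn` stands for any function that wraps the tokenize, model, softmax, and argmax steps and returns one result per sentence.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(sentences, predict_fn, batch_size=32):
    """Apply predict_fn to each chunk and collect the results in order."""
    results = []
    for batch in chunked(sentences, batch_size):
        results.extend(predict_fn(batch))
    return results

# Dummy predict_fn for demonstration only: it returns sentence lengths
# instead of calling a real model.
demo = predict_in_batches(["good", "bad", "so-so"], lambda b: [len(s) for s in b], batch_size=2)
print(demo)  # [4, 3, 5]
```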

# Congratulations

You have successfully built a Transformer network with a pre-trained BERT model and achieved ~95% accuracy on the sentiment analysis of the IMDB reviews dataset! If you are curious about saving your model, I would like to direct you to the Keras Documentation. After all, to efficiently use an API, one must learn how to read and use the documentation.
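
As an alternative to the Keras saving APIs, Hugging Face models and tokenizers provide a `save_pretrained` method that writes files in the same layout `from_pretrained` reads. The sketch below wraps the two calls. `save_fine_tuned` is a hypothetical helper, and the target directory is whatever path you choose.

```python
import os

def save_fine_tuned(model, tokenizer, save_dir):
    """Write model weights/config and tokenizer files into save_dir."""
    os.makedirs(save_dir, exist_ok=True)
    # Both objects are assumed to expose Hugging Face's save_pretrained method.
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    return save_dir
```

After calling it with the fine-tuned `model` and `tokenizer`, the saved directory can be passed to `TFBertForSequenceClassification.from_pretrained` and `BertTokenizer.from_pretrained`, just as `myModelPath` was at the start of the notebook.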