# MultiClass Classification in 10 Minutes with BERT-TensorFlow and SoftMax
- Based on Article  
  https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671

- Data source:
  - Dataset home page:  
    http://qwone.com/~jason/20Newsgroups/

  - Download link (unpack the tar.gz file once after downloading):  
    http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
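
The one-time unpacking step can be sketched with Python's standard library. This is a minimal sketch, not part of the original notebook: the helper name `extract_once` and the paths in the usage comment are assumptions for illustration.

```python
import tarfile
from pathlib import Path


def extract_once(archive_path, target_dir):
    """Unpack a .tar.gz archive unless target_dir already has content.

    Returns True if files were extracted, False if extraction was skipped.
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        return False  # already unpacked on an earlier run
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return True


# Hypothetical usage; adjust both paths to your own Drive layout:
# extract_once("20news-bydate.tar.gz", "20news-bydate")
```

Skipping the extraction when the target folder already has content keeps repeated notebook runs from unpacking the archive again.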

## Install the Transformers Python Library to Run in Colab

In [1]:
pip install transformers



## Mount Google Drive to Read Data & Model from Local Storage

In [2]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [3]:
myModelPath = '/gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/'

In [4]:
!ls {myModelPath.replace(' ', '\ ')} -lh

total 1.5G
-rw------- 1 root root  433 Feb 23 18:20 config.json
-rw------- 1 root root 421M Feb 23 18:20 pytorch_model.bin
-rw------- 1 root root 8.8K Feb 23 18:20 README.md
-rw------- 1 root root 510M Feb 23 18:20 rust_model.ot
-rw------- 1 root root 512M Feb 23 18:20 tf_model.h5
-rw------- 1 root root   28 Feb 23 18:20 tokenizer_config.json
-rw------- 1 root root 456K Feb 23 18:20 tokenizer.json
-rw------- 1 root root 227K Feb 23 18:20 vocab.txt


# 20 Newsgroups Dataset
The article this notebook is based on uses the [IMDB Reviews](https://ai.stanford.edu/~amaas/data/sentiment/) dataset for binary sentiment classification, that is, deciding whether a review is positive or negative. This notebook instead uses the 20 Newsgroups dataset linked above, in which every post belongs to one of 20 newsgroup categories, so the task becomes a 20-class classification problem.

In this case study, we load 11,270 labeled posts from a CSV file and split them into training and test sets ourselves.

## Initial Imports
We will first have two imports: TensorFlow and Pandas.

In [5]:
import tensorflow as tf
import pandas as pd

In [6]:
myDataPath = '/gdrive/MyDrive/Colab Notebooks/20_Newsgroups TextClassification/data/'
!ls {myDataPath.replace(' ', '\ ')} -lh

total 58M
drwx------ 2 root root 4.0K Feb 12 14:58 20news-bydate
-rw------- 1 root root  14M Feb 12 14:35 20news-bydate.tar.gz
-rw------- 1 root root  23M Mar  2 14:17 20_newsgroups_data.csv
-rw------- 1 root root  22M Mar  2 14:20 20_newsgroups_data_no_filenames.csv
drwx------ 2 root root 4.0K Feb 12 15:46 model


In [7]:
df = pd.read_csv(myDataPath + "20_newsgroups_data_no_filenames.csv", sep='|')

In [8]:
df.head()

Unnamed: 0,category,news
0,rec.sport.baseball,From: cubbie@garnet.berkeley.edu ( ...
1,comp.sys.mac.hardware,From: gnelson@pion.rutgers.edu (Gregory Nelson...
2,sci.crypt,From: crypt-comments@math.ncsu.edu\nSubject: C...
3,alt.atheism,From: keith@cco.caltech.edu (Keith Allan Schne...
4,comp.sys.mac.hardware,From: taihou@chromium.iss.nus.sg (Tng Tai Hou)...


In [9]:
df['label'] = pd.Categorical(df.category, ordered=True).codes
df['label'].unique()

array([ 9,  4, 11,  0,  5, 13, 12, 17, 10,  6,  7,  2,  8, 14,  1,  3, 16,
       18, 19, 15], dtype=int8)

In [10]:
mapLabels = pd.DataFrame(df.groupby(['category', 'label']).count())

#drop count column
mapLabels.drop(['news'], axis = 1, inplace = True)
label2Index = mapLabels.to_dict(orient='index')

print (f"label2Index :{label2Index}")
print (type(label2Index))
#print (f"index2Label :{index2Label}")

label2Index :{('alt.atheism', 0): {}, ('comp.graphics', 1): {}, ('comp.os.ms-windows.misc', 2): {}, ('comp.sys.ibm.pc.hardware', 3): {}, ('comp.sys.mac.hardware', 4): {}, ('comp.windows.x', 5): {}, ('misc.forsale', 6): {}, ('rec.autos', 7): {}, ('rec.motorcycles', 8): {}, ('rec.sport.baseball', 9): {}, ('rec.sport.hockey', 10): {}, ('sci.crypt', 11): {}, ('sci.electronics', 12): {}, ('sci.med', 13): {}, ('sci.space', 14): {}, ('soc.religion.christian', 15): {}, ('talk.politics.guns', 16): {}, ('talk.politics.mideast', 17): {}, ('talk.politics.misc', 18): {}, ('talk.religion.misc', 19): {}}
<class 'dict'>


In [11]:
index2label = {}

for key in label2Index:
  print (f"{key[1]} -> {key[0]}")
  index2label[key[1]] = key[0]

0 -> alt.atheism
1 -> comp.graphics
2 -> comp.os.ms-windows.misc
3 -> comp.sys.ibm.pc.hardware
4 -> comp.sys.mac.hardware
5 -> comp.windows.x
6 -> misc.forsale
7 -> rec.autos
8 -> rec.motorcycles
9 -> rec.sport.baseball
10 -> rec.sport.hockey
11 -> sci.crypt
12 -> sci.electronics
13 -> sci.med
14 -> sci.space
15 -> soc.religion.christian
16 -> talk.politics.guns
17 -> talk.politics.mideast
18 -> talk.politics.misc
19 -> talk.religion.misc


In [12]:
label2Index = {v: k for k, v in index2label.items()}

print (f'label2Index: {label2Index}')
print (f'index2label: {index2label}')

label2Index: {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}
index2label: {0: 'alt.atheism', 1: 'comp.graphics', 2: 'comp.os.ms-windows.misc', 3: 'comp.sys.ibm.pc.hardware', 4: 'comp.sys.mac.hardware', 5: 'comp.windows.x', 6: 'misc.forsale', 7: 'rec.autos', 8: 'rec.motorcycles', 9: 'rec.sport.baseball', 10: 'rec.sport.hockey', 11: 'sci.crypt', 12: 'sci.electronics', 13: 'sci.med', 14: 'sci.space', 15: 'soc.religion.christian', 16: 'talk.politics.guns', 17: 'talk.politics.mideast', 18: 'talk.politics.misc', 19: 'talk.religion.misc'}
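
The two dictionaries can also be built directly from the category names. The printed mapping above shows that the integer codes follow the alphabetical order of the category names, so sorting the distinct names reproduces it. The following is a minimal pure-Python sketch on a toy list; the variable names (`categories`, `label2index`, `labels`) are illustrative and not the notebook's own.

```python
# Toy category column; the real notebook has 20 distinct newsgroup names.
categories = [
    "rec.sport.baseball",
    "comp.sys.mac.hardware",
    "sci.crypt",
    "alt.atheism",
    "comp.sys.mac.hardware",
]

# Sorting the distinct names gives each one a stable integer code.
index2label = dict(enumerate(sorted(set(categories))))
label2index = {name: idx for idx, name in index2label.items()}

# Integer label for every row, analogous to the notebook's 'label' column.
labels = [label2index[name] for name in categories]

print(index2label)
print(labels)
```

With only four distinct categories the codes here run from 0 to 3, so they differ from the 0 to 19 codes of the full dataset.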


In [13]:
df.head()

Unnamed: 0,category,news,label
0,rec.sport.baseball,From: cubbie@garnet.berkeley.edu ( ...,9
1,comp.sys.mac.hardware,From: gnelson@pion.rutgers.edu (Gregory Nelson...,4
2,sci.crypt,From: crypt-comments@math.ncsu.edu\nSubject: C...,11
3,alt.atheism,From: keith@cco.caltech.edu (Keith Allan Schne...,0
4,comp.sys.mac.hardware,From: taihou@chromium.iss.nus.sg (Tng Tai Hou)...,4


In [14]:
df.rename(columns = {'label' : 'LABEL_COLUMN', 'news' : 'DATA_COLUMN'}, inplace = True)

In [15]:
# Remove email addresses to avoid additional noise
df['DATA_COLUMN'] = df['DATA_COLUMN'].replace(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', regex=True)
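
To see what this regular expression does to a single post header, here is a small standalone sketch using Python's `re` module. The sample text and the helper name `strip_emails` are made up for illustration.

```python
import re

# Same pattern as in the cell above: local part, '@', domain, dot, rest of domain.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')


def strip_emails(text):
    """Delete every substring that looks like an email address."""
    return EMAIL_PATTERN.sub('', text)


sample = "From: jane@example.edu (Jane Doe)\nSubject: Hello"
print(repr(strip_emails(sample)))
```

Only the address itself is removed, so the space before and after it stays. That is why the cleaned rows shown further down begin with `From:` followed by a gap.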

In [16]:
df = df[['LABEL_COLUMN','DATA_COLUMN']]

In [17]:
df.head()

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
0,9,From: ( )\nSubj...
1,4,From: (Gregory Nelson)\nSubject: Thanks Apple...
2,11,From: \nSubject: Cryptography FAQ 10/10 - Refe...
3,0,From: (Keith Allan Schneider)\nSubject: Re: <...
4,4,From: (Tng Tai Hou)\nSubject: ADB and graphic...


In [18]:
df.count()

LABEL_COLUMN    11270
DATA_COLUMN     11270
dtype: int64

In [19]:
splitSize = df.count() * .8
splitSize

LABEL_COLUMN    9016.0
DATA_COLUMN     9016.0
dtype: float64

In [20]:
# Hold out 25% of the rows for testing (the 80% splitSize computed above is not used)
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)

In [21]:
print (f"{test.count()} \n\n{train.count()}")

LABEL_COLUMN    2818
DATA_COLUMN     2818
dtype: int64 

LABEL_COLUMN    8452
DATA_COLUMN     8452
dtype: int64
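
The row counts above follow from the sampling fraction. As a rough sketch of the same idea without pandas (the helper `split_indices` is hypothetical, and it assumes the sample size is the rounded value of `frac * n_rows`, which matches the counts printed above):

```python
import random


def split_indices(n_rows, frac, seed=0):
    """Randomly pick round(frac * n_rows) row indices for training; the rest is test."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train_idx = set(rng.sample(range(n_rows), n_train))
    test_idx = [i for i in range(n_rows) if i not in train_idx]
    return sorted(train_idx), test_idx


train_idx, test_idx = split_indices(11270, 0.75)
print(len(train_idx), len(test_idx))
```

The two index lists are disjoint and together cover every row, which is the same guarantee that `df.drop(train.index)` gives in the notebook.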


In [22]:
uniqueLabels = df['LABEL_COLUMN'].unique()
print (f'Number of Labels: {len(uniqueLabels)},\nLabels:{uniqueLabels}')

Number of Labels: 20,
Labels:[ 9  4 11  0  5 13 12 17 10  6  7  2  8 14  1  3 16 18 19 15]


## Load the Model
See the Load and Save notebooks in this repository to understand how Transformers models can be:
1. downloaded,
2. stored locally, and
3. used from local storage.

This is useful if you work in a cloud environment without an Internet connection.

Here we tell the model that we wish to train on **20 label values** instead of the 2 labels (0 or 1) that the classification head is configured for by default. Because the new 20-way `classifier` layer is freshly initialized rather than loaded from the checkpoint, the warning below tells us that we should train this model. So, training it we will :-)

In [23]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained(myModelPath, num_labels=20)
tokenizer = BertTokenizer.from_pretrained(myModelPath)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at /gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/ and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  15380     
Total params: 109,497,620
Trainable params: 109,497,620
Non-trainable params: 0
_________________________________________________________________
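
The parameter count of the `classifier` row can be checked by hand. The sketch below assumes the hidden size of 768 used by BERT-base; a dense layer has one weight per input/output pair plus one bias per output.

```python
hidden_size = 768   # BERT-base hidden size (assumption stated above)
num_labels = 20     # number of newsgroup categories

# Dense layer: weight matrix (hidden_size x num_labels) plus one bias per label.
classifier_params = hidden_size * num_labels + num_labels

bert_params = 109_482_240  # taken from the 'bert' row of the summary
total_params = bert_params + classifier_params

print(classifier_params, total_params)
```

Both numbers agree with the summary: 15,380 parameters for the new head and 109,497,620 in total.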


## Creating Input Sequences
We have two pandas DataFrame objects waiting to be converted into suitable inputs for the BERT model. We will take advantage of the `InputExample` class, which wraps one text (and its label) from our dataset. An `InputExample` can be created as follows:

In [25]:
# transformers.InputExample
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

Now we will create two main functions:

1. `convert_data_to_examples`: accepts our train and test datasets and converts each row into an `InputExample` object.

2. `convert_examples_to_tf_dataset`: tokenizes the `InputExample` objects, builds the required input format from the tokenized objects, and finally creates an input dataset that we can feed to the model.

In [26]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples  

In [27]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                           test, 
                                                                           'DATA_COLUMN', 
                                                                           'LABEL_COLUMN')

In [28]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length', # pads to the right by default; replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )
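
To make the padding and truncation step more concrete, here is a toy sketch of what happens to one sequence of token IDs. It is not the tokenizer's actual implementation: the helper `encode_fixed_length` is hypothetical, the body token IDs are arbitrary, and the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
def encode_fixed_length(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token IDs in [CLS] ... [SEP], then truncate or right-pad to max_length."""
    body = token_ids[: max_length - 2]           # leave room for the two special tokens
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)        # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad                # 0 marks padding
    token_type_ids = [0] * max_length            # single-segment input
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


short = encode_fixed_length([7592, 2088], max_length=6)
long = encode_fixed_length(list(range(1000, 1010)), max_length=6)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Every encoded example ends up with exactly `max_length` entries per field, which is what allows the dataset to be batched later.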


In [29]:
DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

In [30]:
print (str(type(DATA_COLUMN)) + ' ' + DATA_COLUMN)
print (str(type(LABEL_COLUMN)) + ' ' + LABEL_COLUMN)

<class 'str'> DATA_COLUMN
<class 'str'> LABEL_COLUMN


In [31]:
train.head(5)

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
9705,5,From: (Tom LaStrange)\nSubject: Re: Forcing a...
1180,6,From: (Space Gigolo)\nSubject: Laser Printer ...
2503,6,From: (Steve Holmertz)\nSubject: Parametric E...
9007,11,From: (Ken Arromdee)\nSubject: Re: Once tappe...
6474,15,From: shellgate! (Larry L. Overacker)\nSubject...


In [32]:
%%time

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)



CPU times: user 2min 14s, sys: 889 ms, total: 2min 15s
Wall time: 2min 15s
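
The effect of `.batch(32).repeat(2)` on the number of training steps can be worked out with a little arithmetic. This is a small sketch based on the row counts printed earlier in the notebook:

```python
import math

train_rows, validation_rows = 8452, 2818  # row counts printed earlier
batch_size, repeats = 32, 2

# The last batch is smaller than 32 rows, hence the ceiling.
train_batches = math.ceil(train_rows / batch_size)
steps_per_epoch = train_batches * repeats   # repeat(2) doubles the batches per epoch
validation_steps = math.ceil(validation_rows / batch_size)

print(train_batches, steps_per_epoch, validation_steps)
```

One pass over the training data is 265 batches, so with `.repeat(2)` each Keras epoch runs 530 steps. Validation uses 89 batches.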


## Configuring the BERT model and Fine-tuning
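We will use Adam as our optimizer, SparseCategoricalCrossentropy as our loss function (our labels are integer class indices rather than one-hot vectors), and SparseCategoricalAccuracy as our accuracy metric. Fine-tuning the model for 2 epochs will give us around 93% accuracy, which is great. Note that `train_data` was built with `.repeat(2)`, so each epoch passes over the training set twice.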

In [33]:
%%time

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Epoch 2/2
CPU times: user 6min, sys: 5min 32s, total: 11min 32s
Wall time: 16min 30s
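
As a sketch of what the loss computes for a single example: with `from_logits=True` the model outputs raw scores, and the loss is the negative log of the softmax probability assigned to the true class. The function below is a plain-Python illustration with a hypothetical name, not the Keras implementation.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]), computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# An untrained 20-class head that scores every class equally:
uniform_loss = sparse_crossentropy_from_logits([0.0] * 20, label=9)

# A confident, correct prediction for class 9 (made-up logits):
confident_logits = [0.0] * 20
confident_logits[9] = 10.0
confident_loss = sparse_crossentropy_from_logits(confident_logits, label=9)

print(uniform_loss, confident_loss)
```

With equal scores for all 20 classes the loss equals ln 20, roughly 3.0, which is a useful reference point for the loss values early in training.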


Training the model might take a while, so make sure you have enabled GPU acceleration in the Notebook Settings. After training is completed, we can move on to predicting newsgroup categories.

## Making Predictions
Here is a list of four short texts to classify. The first two clearly point to one newsgroup (baseball and sci.crypt), while the last two are more ambiguous.

In [34]:
pred_sentences = ['This season so far, Morgan and Guzman helped to lead the Cubs at top in ERA, even better than THE rotation at Atlanta.',
                  'This is the tenth of ten parts of the sci.crypt FAQ.',
                  'I think that domestication will change behavior to a large degree. Domesticated animals exhibit behaviors not found in the wild.',
                  "If anybody wants these changes, they're welcome to them, but you'll have to have the source available and be comfortable munging with it a bit."]

We need to tokenize our texts with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and apply a final softmax to the output logits to get class probabilities. We can then use the argmax function to find the most probable of the 20 categories for each text. Finally, we will print out the results with a simple for loop. The following cells perform these operations:

In [35]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
tf_predictions

<tf.Tensor: shape=(4, 20), dtype=float32, numpy=
array([[1.9084792e-04, 3.4088054e-04, 2.0756450e-04, 8.6477579e-05,
        2.9989792e-04, 2.6212877e-04, 1.9374983e-04, 3.5850867e-04,
        1.5933276e-04, 9.9522001e-01, 4.5725811e-04, 8.2916165e-05,
        3.6863686e-04, 1.2279976e-04, 2.3117411e-04, 1.3213107e-04,
        3.1335116e-04, 4.4700602e-04, 2.9876176e-04, 2.2643096e-04],
       [2.1327195e-04, 4.0231278e-04, 1.5158435e-04, 1.2487071e-03,
        1.2888614e-03, 1.4331866e-03, 4.6543666e-04, 1.8467591e-04,
        4.1545741e-04, 3.4950839e-04, 1.3036636e-03, 9.8752570e-01,
        1.5377196e-03, 3.4859296e-04, 4.5611229e-04, 5.4127991e-04,
        2.9390829e-04, 9.1435417e-04, 2.4715456e-04, 6.7858823e-04],
       [8.5387528e-03, 1.3884273e-03, 3.6678996e-02, 1.5499455e-02,
        2.6900356e-03, 4.8267641e-03, 3.8973048e-02, 8.9422697e-03,
        6.8014547e-02, 3.9243731e-03, 9.7063286e-03, 7.5493036e-03,
        1.5494078e-03, 4.4692346e-01, 3.0200190e-03, 1.8781705e-0

In [36]:
tf.argmax(tf_predictions, axis=1).numpy()
index2label[11]

'sci.crypt'

In [37]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

# Get index of predicted label for each sentence
label = tf.argmax(tf_predictions, axis=1).numpy()

# output human readable label predictions
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", index2label[label[i]] +" with score: "+ str(tf_predictions[i][label[i]].numpy()))
  print ()

This season so far, Morgan and Guzman helped to lead the Cubs at top in ERA, even better than THE rotation at Atlanta. : 
 rec.sport.baseball with score: 0.99522

This is the tenth of ten parts of the sci.crypt FAQ. : 
 sci.crypt with score: 0.9875257

I think that domestication will change behavior to a large degree. Domesticated animals exhibit behaviors not found in the wild. : 
 sci.med with score: 0.44692346

If anybody wants these changes, they're welcome to them, but you'll have to have the source available and be comfortable munging with it a bit. : 
 comp.windows.x with score: 0.450415
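
The softmax and argmax steps used in the prediction cells can be illustrated on plain Python lists. This is a minimal sketch with made-up logits and only three of the 20 labels. It mirrors what `tf.nn.softmax` and `tf.argmax` compute for one row but is not their implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Subset of the notebook's index2label mapping, for illustration only.
index2label = {0: 'alt.atheism', 1: 'comp.graphics', 2: 'comp.os.ms-windows.misc'}

logits = [0.5, 3.0, -1.0]                        # made-up model outputs for one text
probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])   # argmax

print(index2label[best], probs[best])
```

Softmax does not change which class has the highest score. It only rescales the scores so they can be read as probabilities.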



## Debugging the Final Tensor Shape

In [38]:
tf_predictions.shape

TensorShape([4, 20])

In [39]:
for i in range(len(tf_predictions)):
  print (tf_predictions[i])

tf.Tensor(
[1.9084792e-04 3.4088054e-04 2.0756450e-04 8.6477579e-05 2.9989792e-04
 2.6212877e-04 1.9374983e-04 3.5850867e-04 1.5933276e-04 9.9522001e-01
 4.5725811e-04 8.2916165e-05 3.6863686e-04 1.2279976e-04 2.3117411e-04
 1.3213107e-04 3.1335116e-04 4.4700602e-04 2.9876176e-04 2.2643096e-04], shape=(20,), dtype=float32)
tf.Tensor(
[2.1327195e-04 4.0231278e-04 1.5158435e-04 1.2487071e-03 1.2888614e-03
 1.4331866e-03 4.6543666e-04 1.8467591e-04 4.1545741e-04 3.4950839e-04
 1.3036636e-03 9.8752570e-01 1.5377196e-03 3.4859296e-04 4.5611229e-04
 5.4127991e-04 2.9390829e-04 9.1435417e-04 2.4715456e-04 6.7858823e-04], shape=(20,), dtype=float32)
tf.Tensor(
[0.00853875 0.00138843 0.036679   0.01549945 0.00269004 0.00482676
 0.03897305 0.00894227 0.06801455 0.00392437 0.00970633 0.0075493
 0.00154941 0.44692346 0.00302002 0.18781705 0.02925803 0.00800841
 0.01692865 0.09976259], shape=(20,), dtype=float32)
tf.Tensor(
[0.00210652 0.32576597 0.08870737 0.05647071 0.01478916 0.450415
 0.0065489

In [40]:
for i in range(len(tf_predictions)):
  print (str(tf_predictions[i][0]) + ' - ' + str(tf_predictions[i][1]))

tf.Tensor(0.00019084792, shape=(), dtype=float32) - tf.Tensor(0.00034088054, shape=(), dtype=float32)
tf.Tensor(0.00021327195, shape=(), dtype=float32) - tf.Tensor(0.00040231278, shape=(), dtype=float32)
tf.Tensor(0.008538753, shape=(), dtype=float32) - tf.Tensor(0.0013884273, shape=(), dtype=float32)
tf.Tensor(0.0021065164, shape=(), dtype=float32) - tf.Tensor(0.32576597, shape=(), dtype=float32)


In [41]:
for i in range(len(tf_predictions)):
  print(tf_predictions[i][label[i]].numpy())

0.99522
0.9875257
0.44692346
0.450415
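
For uncertain predictions it can help to look beyond the single best label. The sketch below ranks the probabilities printed above for the third sentence (the domestication example) and lists the three most likely newsgroups. The helper `top_k` is a hypothetical addition, not part of the original notebook.

```python
# The 20 newsgroup names in label order (index 0 to 19).
newsgroups = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Probabilities printed above for the third sentence.
probabilities = [
    0.00853875, 0.00138843, 0.036679, 0.01549945, 0.00269004, 0.00482676,
    0.03897305, 0.00894227, 0.06801455, 0.00392437, 0.00970633, 0.0075493,
    0.00154941, 0.44692346, 0.00302002, 0.18781705, 0.02925803, 0.00800841,
    0.01692865, 0.09976259,
]


def top_k(probs, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(newsgroups[i], probs[i]) for i in ranked[:k]]


for name, p in top_k(probabilities):
    print(f"{name}: {p:.3f}")
```

For this sentence the runners-up are `soc.religion.christian` and `talk.religion.misc`, which shows that the model spreads its probability over several categories instead of committing to `sci.med`.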


With the prediction code above (cell 37), you can classify as many texts as you like.

# Congratulations

You have successfully fine-tuned a pre-trained BERT model with the Transformers library and achieved ~93% accuracy on classifying posts from the 20 Newsgroups dataset! If you are curious about saving your model, I would like to direct you to the [Keras Documentation](https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model). After all, to efficiently use an API, one must learn how to read and use the documentation.