<a href="https://colab.research.google.com/github/Dirkster99/PyNotes/blob/master/Transformers/40_MultiLabel_MultiClass_Classification_in_10_Minutes_with_BERT_TensorFlow_Sigmoid_LocalModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multilabel MultiClass Classification in 10 Minutes with BERT-TensorFlow and SoftMax
- Based on Article  
  https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671

- Data Source: Toxic Comment Classification Challenge (Kaggle)

## Install the Transformers Python Library to Run in Colab

In [None]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=08ae7

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/gdrive') #, force_remount=True)

Mounted at /gdrive


In [None]:
myModelPath = '/gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/'
myDataPath = '/gdrive/MyDrive/Colab Notebooks/MultiLabelText/ToxicComments/data/'
!ls {myModelPath.replace(' ', '\ ')} -lh

total 1.5G
-rw------- 1 root root  433 Feb 23 18:20 config.json
-rw------- 1 root root 421M Feb 23 18:20 pytorch_model.bin
-rw------- 1 root root 8.8K Feb 23 18:20 README.md
-rw------- 1 root root 510M Feb 23 18:20 rust_model.ot
-rw------- 1 root root 512M Feb 23 18:20 tf_model.h5
-rw------- 1 root root   28 Feb 23 18:20 tokenizer_config.json
-rw------- 1 root root 456K Feb 23 18:20 tokenizer.json
-rw------- 1 root root 227K Feb 23 18:20 vocab.txt


# Dataset
Here, we use the Toxic Comment Classification Challenge dataset from Kaggle. It provides Wikipedia comments that have been labeled by human raters for toxic behavior. A comment can carry several labels at once, or none at all. The types of toxicity are:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
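
Since one comment can have several of these labels at once, each comment's target is a vector with one 0/1 entry per label, which is the shape `LABEL_COLUMN` takes later in this notebook. The sketch below illustrates that encoding in plain Python; the helper names `encode_labels` and `decode_labels` are hypothetical and are not used elsewhere in the notebook.

```python
# Label order follows the list above (lower-case, as in the CSV header).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def encode_labels(active):
    """Return a multi-hot vector with a 1 for every active label name."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


def decode_labels(vector):
    """Return the label names whose position holds a 1."""
    return [name for name, flag in zip(LABELS, vector) if flag == 1]


print(encode_labels({"toxic", "insult"}))  # [1, 0, 0, 0, 1, 0]
print(decode_labels([0, 0, 0, 0, 0, 0]))   # [] (a comment with no toxicity label)
```

An all-zero vector is a valid target here: it simply means none of the six labels applies.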

## Prepare Dataset

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd

import os

TRAIN_DATA = myDataPath + 'train.csv'
df = pd.read_csv(TRAIN_DATA)

In [None]:
textLabels = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [None]:
# convert all label columns into 1 label column containing a list of values
df['LABEL_COLUMN'] = (df[textLabels].to_numpy().tolist())
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,LABEL_COLUMN
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"


In [None]:
# select label and value columns
df = df[['LABEL_COLUMN', 'comment_text']]
df.head()

Unnamed: 0,LABEL_COLUMN,comment_text
0,"[0, 0, 0, 0, 0, 0]",Explanation\nWhy the edits made under my usern...
1,"[0, 0, 0, 0, 0, 0]",D'aww! He matches this background colour I'm s...
2,"[0, 0, 0, 0, 0, 0]","Hey man, I'm really not trying to edit war. It..."
3,"[0, 0, 0, 0, 0, 0]","""\nMore\nI can't make any real suggestions on ..."
4,"[0, 0, 0, 0, 0, 0]","You, sir, are my hero. Any chance you remember..."


In [None]:
# rename column
df.rename(columns = {'comment_text' : 'DATA_COLUMN'}, inplace = True)

In [None]:
# print data input at the end of staging
print (textLabels)
df.head()

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
0,"[0, 0, 0, 0, 0, 0]",Explanation\nWhy the edits made under my usern...
1,"[0, 0, 0, 0, 0, 0]",D'aww! He matches this background colour I'm s...
2,"[0, 0, 0, 0, 0, 0]","Hey man, I'm really not trying to edit war. It..."
3,"[0, 0, 0, 0, 0, 0]","""\nMore\nI can't make any real suggestions on ..."
4,"[0, 0, 0, 0, 0, 0]","You, sir, are my hero. Any chance you remember..."


In [None]:
splitSize = df.count() * .8
splitSize

LABEL_COLUMN    127656.8
DATA_COLUMN     127656.8
dtype: float64

In [None]:
# 75/25 train/test split; random_state=0 makes the sampling reproducible
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)

In [None]:
print (f"{test.count()} \n\n{train.count()}")

LABEL_COLUMN    39893
DATA_COLUMN     39893
dtype: int64 

LABEL_COLUMN    119678
DATA_COLUMN     119678
dtype: int64
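
The row counts printed above can be checked by hand. Note that `splitSize` (80% of the rows) is computed earlier but not used; the split itself is done with `frac=0.75`. The snippet below only re-does the arithmetic with the numbers from the outputs above.

```python
train_rows = 119678  # from the train.count() output above
test_rows = 39893    # from the test.count() output above
total_rows = train_rows + test_rows

train_fraction = train_rows / total_rows

print(total_rows)                   # 159571
print(round(train_fraction, 4))     # 0.75
print(round(total_rows * 0.8, 1))   # 127656.8, the unused splitSize value
```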


In [None]:
print (f'Number of Labels: {len(textLabels)},\nLabels:{textLabels}')

Number of Labels: 6,
Labels:['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


## Load the Model
See the Load and Save notebooks in this repository to understand how Transformers models can be:
1. Downloaded
2. Stored locally
3. Used from local storage

This is useful if you work in a cloud environment without an Internet connection.

Here we tell the model that we wish to train on **6 label values** (one per entry in `textLabels`) instead of the original 1 label (with 1 or 0 values) for which the original model was designed. The classification head for these labels is newly initialized, which is why the message below tells us that we should train this model before using it. So, training it we will :-)

In [None]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained(myModelPath, num_labels=len(textLabels))
tokenizer = BertTokenizer.from_pretrained(myModelPath)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at /gdrive/MyDrive/Colab Notebooks/Transformers/LocalModelUsage/bert-base-uncased/ and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  4614      
Total params: 109,486,854
Trainable params: 109,486,854
Non-trainable params: 0
_________________________________________________________________
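
The `classifier` row in the summary above can be verified by hand: a dense layer has one weight per (input, output) pair plus one bias per output. The sketch assumes the hidden size of 768 that `bert-base-uncased` uses.

```python
hidden_size = 768  # hidden size of bert-base-uncased (assumed, not printed above)
num_labels = 6     # len(textLabels)

weights = hidden_size * num_labels
biases = num_labels
classifier_params = weights + biases

print(classifier_params)              # 4614
print(109482240 + classifier_params)  # 109486854
```

Both numbers match the summary, which confirms that the head was built for 6 outputs.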


## Creating Input Sequences
We have two pandas DataFrame objects waiting to be converted into suitable objects for the BERT model. We will take advantage of the `InputExample` class, which helps us create sequences from our dataset. An `InputExample` can be constructed as follows:

In [None]:
# transformers.InputExample
InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

Now we will create two main functions:

1 — `convert_data_to_examples`: This will accept our train and test datasets and convert each row into an InputExample object.

2 — `convert_examples_to_tf_dataset`: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.

In [None]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples  

In [None]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, 
                                                                           test, 
                                                                           'DATA_COLUMN', 
                                                                           'LABEL_COLUMN')

In [None]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # encode_plus tokenizes one text and returns a dict holding
        # input_ids, token_type_ids and attention_mask
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length', # pads to the right by default; replaces the deprecated pad_to_max_length=True
            truncation=True
        )
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([None]), # each label is a list with one 0/1 entry per class, not a scalar
        ),
    )
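
To make the padding and truncation step more concrete, here is a simplified plain-Python sketch of what happens to one sequence of token ids. It is an illustration only, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the example ids are made up for illustration, a pad id of 0 is assumed, and the real tokenizer also takes care to keep the special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


ids, mask = pad_or_truncate([101, 7592, 1010, 2088, 102], max_length=8)
print(ids)   # [101, 7592, 1010, 2088, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every example ends up with the same length, so the examples can be stacked into batches, and the attention mask tells the model which positions to ignore.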


In [None]:
DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

In [None]:
print (str(type(DATA_COLUMN)) + ' ' + DATA_COLUMN)
print (str(type(LABEL_COLUMN)) + ' ' + LABEL_COLUMN)

<class 'str'> DATA_COLUMN
<class 'str'> LABEL_COLUMN


In [None]:
train.head(5)

Unnamed: 0,LABEL_COLUMN,DATA_COLUMN
74251,"[0, 0, 0, 0, 0, 0]","""\nI haven't paraphrased you at all, Gary. Yo..."
131406,"[1, 0, 0, 0, 0, 0]",I BLOCKED REVERS! I BLOCKED REVERS! I BLOCKED ...
120969,"[0, 0, 0, 0, 1, 0]",I'm sorry. I'd like to unreservedly retract my...
121827,"[0, 0, 0, 0, 0, 0]",I don't know if this is exactly like the Press...
4771,"[0, 0, 0, 0, 0, 0]","Thank you all, we'll all improve the Wikipedia..."


In [None]:
%%time

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)



CPU times: user 5min 5s, sys: 504 ms, total: 5min 5s
Wall time: 5min 5s
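
A quick count shows what the `batch(32).repeat(2)` pipeline above implies for training. The numbers come from the outputs earlier in this notebook; `epochs = 2` is the value passed to `model.fit` in the next section.

```python
import math

train_rows = 119678  # from train.count() earlier
batch_size = 32
repeats = 2          # .repeat(2) is applied after .batch(32)
epochs = 2           # epochs passed to model.fit in the next section

batches_per_pass = math.ceil(train_rows / batch_size)
steps_per_epoch = batches_per_pass * repeats
passes_over_data = repeats * epochs

print(batches_per_pass, steps_per_epoch, passes_over_data)  # 3740 7480 4
```

Because the dataset is repeated twice before it reaches `fit`, each Keras epoch walks over the training data twice, so two epochs amount to four passes.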


## Configuring the BERT model and Fine-tuning
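We will use Adam as our optimizer, BinaryCrossentropy (computed on the logits) as our loss function, and BinaryAccuracy as our accuracy metric. Each of the six labels is an independent yes/no decision, so a per-label binary loss fits the multi-hot targets, whereas SparseCategoricalCrossentropy expects a single class index per example. We fine-tune the model for 2 epochs; the ~93% accuracy quoted at the end of this notebook refers to the 20 Newsgroups variant, not to this dataset.

In [None]:
%%time

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), 
              # the model outputs logits, so a threshold of 0.0 equals a probability of 0.5
              metrics=[tf.keras.metrics.BinaryAccuracy('accuracy', threshold=0.0)])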

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported


InvalidArgumentError: ignored
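
For multi-hot targets like the ones in `LABEL_COLUMN`, a common choice is a per-label binary cross-entropy on sigmoid outputs, because every label is its own yes/no decision. The following is a minimal plain-Python sketch of that loss for a single example; the logits are made-up numbers, and the function is an illustration rather than the Keras implementation.

```python
import math


def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy_from_logits(logits, targets):
    """Mean over labels of -[y*log(p) + (1-y)*log(1-p)], with p = sigmoid(logit)."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.5]  # made-up scores, one per label
matching = [1, 0, 1, 0, 1, 0]                # targets that agree with the signs
opposite = [0, 1, 0, 1, 0, 1]                # targets that disagree

loss_matching = binary_cross_entropy_from_logits(logits, matching)
loss_opposite = binary_cross_entropy_from_logits(logits, opposite)
print(loss_matching < loss_opposite)  # True
```

The loss is small when each logit's sign agrees with its target and grows when they disagree, independently for each of the six labels.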

Training the model might take a while, so make sure you have enabled GPU acceleration in the Notebook Settings. After our training is completed, we can move on to making predictions.

## Making Predictions
Here is a list of four sample sentences to classify. They are carried over from the 20 Newsgroups variant of this notebook, so they are not toxic comments.

In [None]:
pred_sentences = ['This season so far, Morgan and Guzman helped to lead the Cubs at top in ERA, even better than THE rotation at Atlanta.',
                  'This is the tenth of ten parts of the sci.crypt FAQ.',
                  'I think that domestication will change behavior to a large degree. Domesticated animals exhibit behaviors not found in the wild.',
                  "If anybody wants these changes, they're welcome to them, but you'll have to have the source available and be comfortable munging with it a bit."]

We need to tokenize our sentences with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and apply a final softmax to the logits to get the predictions. We can then use the argmax function to pick the highest-scoring label for each sentence. Finally, we will print out the results with a simple for loop, using `index2label` to map each predicted index to its label name. The following lines do all of these operations:

In [None]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
tf_predictions

<tf.Tensor: shape=(4, 20), dtype=float32, numpy=
array([[1.9084792e-04, 3.4088054e-04, 2.0756450e-04, 8.6477579e-05,
        2.9989792e-04, 2.6212877e-04, 1.9374983e-04, 3.5850867e-04,
        1.5933276e-04, 9.9522001e-01, 4.5725811e-04, 8.2916165e-05,
        3.6863686e-04, 1.2279976e-04, 2.3117411e-04, 1.3213107e-04,
        3.1335116e-04, 4.4700602e-04, 2.9876176e-04, 2.2643096e-04],
       [2.1327195e-04, 4.0231278e-04, 1.5158435e-04, 1.2487071e-03,
        1.2888614e-03, 1.4331866e-03, 4.6543666e-04, 1.8467591e-04,
        4.1545741e-04, 3.4950839e-04, 1.3036636e-03, 9.8752570e-01,
        1.5377196e-03, 3.4859296e-04, 4.5611229e-04, 5.4127991e-04,
        2.9390829e-04, 9.1435417e-04, 2.4715456e-04, 6.7858823e-04],
       [8.5387528e-03, 1.3884273e-03, 3.6678996e-02, 1.5499455e-02,
        2.6900356e-03, 4.8267641e-03, 3.8973048e-02, 8.9422697e-03,
        6.8014547e-02, 3.9243731e-03, 9.7063286e-03, 7.5493036e-03,
        1.5494078e-03, 4.4692346e-01, 3.0200190e-03, 1.8781705e-0

In [None]:
tf.argmax(tf_predictions, axis=1).numpy()
index2label[11]

'sci.crypt'

In [None]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

# Get index of predicted label for each sentence
label = tf.argmax(tf_predictions, axis=1).numpy()

# output human readable label predictions
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", index2label[label[i]] +" with score: "+ str(tf_predictions[i][label[i]].numpy()))
  print ()

This season so far, Morgan and Guzman helped to lead the Cubs at top in ERA, even better than THE rotation at Atlanta. : 
 rec.sport.baseball with score: 0.99522

This is the tenth of ten parts of the sci.crypt FAQ. : 
 sci.crypt with score: 0.9875257

I think that domestication will change behavior to a large degree. Domesticated animals exhibit behaviors not found in the wild. : 
 sci.med with score: 0.44692346

If anybody wants these changes, they're welcome to them, but you'll have to have the source available and be comfortable munging with it a bit. : 
 comp.windows.x with score: 0.450415
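
The softmax and argmax steps above always pick exactly one label per sentence. For multi-label targets such as the six toxicity labels, a common alternative is to apply a sigmoid to each logit independently and keep every label whose probability passes a threshold. The sketch below shows that idea in plain Python; the logits are made-up numbers, `predict_labels` is a hypothetical helper, and the 0.5 threshold is an assumption that would normally be tuned.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(LABELS, probs) if p >= threshold]


print(predict_labels([3.1, -2.0, 1.2, -4.5, 0.8, -3.3]))
# ['toxic', 'obscene', 'insult']
print(predict_labels([-3.0, -5.0, -2.5, -6.0, -1.0, -4.0]))
# []
```

Unlike argmax, this can return several labels for one comment, or none at all.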



## Debugging the Final Tensor Shape

In [None]:
tf_predictions.shape

TensorShape([4, 20])

In [None]:
for i in range(len(tf_predictions)):
  print (tf_predictions[i])

tf.Tensor(
[1.9084792e-04 3.4088054e-04 2.0756450e-04 8.6477579e-05 2.9989792e-04
 2.6212877e-04 1.9374983e-04 3.5850867e-04 1.5933276e-04 9.9522001e-01
 4.5725811e-04 8.2916165e-05 3.6863686e-04 1.2279976e-04 2.3117411e-04
 1.3213107e-04 3.1335116e-04 4.4700602e-04 2.9876176e-04 2.2643096e-04], shape=(20,), dtype=float32)
tf.Tensor(
[2.1327195e-04 4.0231278e-04 1.5158435e-04 1.2487071e-03 1.2888614e-03
 1.4331866e-03 4.6543666e-04 1.8467591e-04 4.1545741e-04 3.4950839e-04
 1.3036636e-03 9.8752570e-01 1.5377196e-03 3.4859296e-04 4.5611229e-04
 5.4127991e-04 2.9390829e-04 9.1435417e-04 2.4715456e-04 6.7858823e-04], shape=(20,), dtype=float32)
tf.Tensor(
[0.00853875 0.00138843 0.036679   0.01549945 0.00269004 0.00482676
 0.03897305 0.00894227 0.06801455 0.00392437 0.00970633 0.0075493
 0.00154941 0.44692346 0.00302002 0.18781705 0.02925803 0.00800841
 0.01692865 0.09976259], shape=(20,), dtype=float32)
tf.Tensor(
[0.00210652 0.32576597 0.08870737 0.05647071 0.01478916 0.450415
 0.0065489

In [None]:
for i in range(len(tf_predictions)):
  print (str(tf_predictions[i][0]) + ' - ' + str(tf_predictions[i][1]))

tf.Tensor(0.00019084792, shape=(), dtype=float32) - tf.Tensor(0.00034088054, shape=(), dtype=float32)
tf.Tensor(0.00021327195, shape=(), dtype=float32) - tf.Tensor(0.00040231278, shape=(), dtype=float32)
tf.Tensor(0.008538753, shape=(), dtype=float32) - tf.Tensor(0.0013884273, shape=(), dtype=float32)
tf.Tensor(0.0021065164, shape=(), dtype=float32) - tf.Tensor(0.32576597, shape=(), dtype=float32)


In [None]:
for i in range(len(tf_predictions)):
  print(tf_predictions[i][label[i]].numpy())

0.99522
0.9875257
0.44692346
0.450415


With the code above, you can predict labels for as many sentences as you like.

# Congratulations

You have successfully built a transformers network with a pre-trained BERT model and achieved ~93% accuracy on the 20 Newsgroups classification dataset! If you are curious about saving your model, I would like to direct you to the [Keras Documentation](https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model). After all, to efficiently use an API, one must learn how to read and use the documentation.