## BERT Fine-tuning (DistilBERT for sequence classification)

I'm using `TFDistilBertForSequenceClassification` from Hugging Face as the base classifier. All I have to do is tokenize the inputs and put them into the format the model expects for fine-tuning. This is easy because the model already includes a classification head that ends in a dense layer with one node per label (`num_labels`).

Since the model is big and fine-tuning is time-consuming, I'm training for just 1 epoch. Still, the results are good: 0.93 accuracy and 0.88 macro F1 on the test set.

In [None]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

model_path = '/models/tf_bert_sequence'
tokenizer_path = '/models/tf_bert_sequence_tok'

In [None]:
#Load datasets and pre-trained models.
train_dataset = pd.read_csv('/dataset/train_clean.csv',index_col=False,encoding='utf-8')
test_dataset = pd.read_csv('/dataset/test_clean.csv',index_col=False,encoding='utf-8')
val_dataset = pd.read_csv('/dataset/val_clean.csv',index_col=False,encoding='utf-8')

bert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
bert_base_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=6)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [4]:
#Extract token_ids and attention_masks. Then format into BERT input.
emoji_label = {'sadness': 0,
               'anger': 1,
               'joy': 2,
               'love': 3,
               'surprise': 4,
               'fear': 5}

def process_datasets(df_dataset, batch_size=32):
    #Tokenize all texts at once: truncate to 64 tokens and pad to the longest sequence.
    tokens = bert_tokenizer(text=df_dataset['text'].tolist(),
                            add_special_tokens=True,
                            max_length=64,
                            truncation=True,
                            padding=True,
                            return_tensors='tf',
                            return_token_type_ids=False,
                            return_attention_mask=True,
                            verbose=True)
    #One-hot encode the labels. num_classes keeps the width fixed even if a split lacks a class.
    labels = tf.keras.utils.to_categorical([emoji_label[e] for e in df_dataset['emoji'].tolist()],
                                           num_classes=len(emoji_label))
    tf_dataset = tf.data.Dataset.from_tensor_slices(
        (tokens['input_ids'], tokens['attention_mask'], labels)).batch(batch_size)
    #Repackage each batch as (model inputs dict, labels).
    return tf_dataset.map(lambda ids, mask, label: ({'input_ids': ids, 'attention_mask': mask}, label))

tf_trainset = process_datasets(train_dataset)
tf_testset = process_datasets(test_dataset,1)
tf_valset = process_datasets(val_dataset)
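
The labels above are one-hot encoded before batching. As a rough illustration of what that encoding produces, here is a NumPy-only sketch. The helper `one_hot` and the `sample` list are hypothetical and not part of the notebook. Note that `tf.keras.utils.to_categorical` infers the number of classes from the largest label it sees unless `num_classes` is passed, so fixing the width at six columns is the safer choice when a split might lack a class.

```python
import numpy as np

emoji_label = {'sadness': 0, 'anger': 1, 'joy': 2,
               'love': 3, 'surprise': 4, 'fear': 5}

def one_hot(names, mapping):
    """Hypothetical helper: map label names to one-hot rows of fixed width."""
    ids = np.array([mapping[name] for name in names])
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(len(mapping), dtype=np.float32)[ids]

# Hypothetical sample that does not contain every class.
sample = ['joy', 'sadness', 'anger']
encoded = one_hot(sample, emoji_label)

print(encoded.shape)  # (3, 6)
print(encoded[0])     # 'joy' has a 1.0 at index 2
```

The output has six columns even though the sample only contains three classes, which matches the six output nodes of the classifier.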

In [5]:
#Configure pre-trained model for fine-tuning.
bert_base_model.trainable = True
bert_base_model.summary()
bert_base_model.compile(optimizer=tf.keras.optimizers.Adam(5e-5), #One of the learning rates suggested in the BERT paper.
                        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), #The model outputs logits; the loss applies softmax.
                        metrics=[tf.keras.metrics.CategoricalAccuracy('accuracy')]) #Plain accuracy, not class-balanced; check per-class recall in the final report.

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  4614      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66958086 (255.42 MB)
Trainable params: 66958086 (255.42 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
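
The parameter counts of the two dense layers in the summary can be checked by hand: a dense layer has `inputs * units` weights plus `units` biases. The sketch below assumes the DistilBERT base hidden size of 768; `dense_params` is a hypothetical helper, and the backbone count is copied from the summary.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

hidden_size = 768        # assumed hidden size of distilbert-base-uncased
num_labels = 6           # number of emotion classes
backbone = 66362880      # 'distilbert' row of the summary

pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
total = backbone + pre_classifier + classifier

print(pre_classifier)  # 590592
print(classifier)      # 4614
print(total)           # 66958086
```

Only the 4614-parameter `classifier` layer depends on the number of labels; everything else keeps the same shape regardless of the task.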


In [6]:
#Train model. Do only one epoch since my computer is not that powerful.
history = bert_base_model.fit(tf_trainset, validation_data=tf_valset, epochs=1)



In [7]:
#Save model and tokenizer.
bert_base_model.save_pretrained(model_path)
bert_tokenizer.save_pretrained(tokenizer_path)

('/models/tf_bert_sequence_tok/tokenizer_config.json',
 '/models/tf_bert_sequence_tok/special_tokens_map.json',
 '/models/tf_bert_sequence_tok/vocab.txt',
 '/models/tf_bert_sequence_tok/added_tokens.json',
 '/models/tf_bert_sequence_tok/tokenizer.json')

In [8]:
#Evaluate model with test set.
predicted = bert_base_model.predict(tf_testset)
predicted = np.argmax(predicted.logits, axis=1)
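
Taking `argmax` directly on the logits is enough for prediction, because softmax preserves the ordering of the values within each row. A small NumPy sketch with made-up logits (the `logits` array is hypothetical, not model output):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for two samples and six classes.
logits = np.array([[2.0, -1.0, 0.5, 0.0, -3.0, 1.0],
                   [-0.5, 0.2, 0.1, 3.0, 0.0, -2.0]])
probs = softmax(logits)

print(np.argmax(logits, axis=1))  # [0 3]
print(np.argmax(probs, axis=1))   # [0 3]
```

Softmax is only needed when the class probabilities themselves are of interest, for example to threshold low-confidence predictions.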



In [9]:
#Recover the true class ids from the one-hot labels (works for any batch size).
y_test = np.concatenate([np.argmax(labels, axis=1) for _, labels in tf_testset.as_numpy_iterator()])
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.97      0.94      0.96       581
           1       0.90      0.93      0.92       275
           2       0.91      0.98      0.95       695
           3       0.99      0.67      0.80       159
           4       0.87      0.68      0.76        66
           5       0.87      0.95      0.91       224

    accuracy                           0.93      2000
   macro avg       0.92      0.86      0.88      2000
weighted avg       0.93      0.93      0.92      2000
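
Because the classes are unbalanced, plain accuracy can hide weak minority classes. Balanced accuracy is the unweighted mean of per-class recall, which is the same quantity as the macro-average recall in the report above (0.86, against 0.93 plain accuracy). A minimal NumPy sketch, using hypothetical toy labels rather than the notebook's data:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for cls in np.unique(y_true):
        mask = y_true == cls
        # Recall for this class: fraction of its samples predicted correctly.
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.mean(recalls))

# Hypothetical toy labels: class 0 is common, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In the report, the gap between the two numbers comes mainly from classes 3 and 4, whose recall is 0.67 and 0.68.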

