## Fine_tuning_BERT

fine-tuning BERT, one of the most powerful natural language processing (NLP) models available. BERT, which stands for "Bidirectional Encoder Representations from Transformers," has been trained on a massive amount of text data and can be fine-tuned for a variety of NLP tasks, such as text classification, machine translation, and named entity recognition.

### I will be working on pretrained hugging face dataset using TFAutoModel library

In [1]:
# Libraries 
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
from datasets import load_dataset

# using TFAutoModel we can load any pre-trained model avaliable in huggging face

2024-06-22 20:09:16.326001: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-22 20:09:16.658155: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-22 20:09:18.030969: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from .autonotebook import tqdm as notebook_tqdm


In [2]:
model = TFAutoModel.from_pretrained('bert-base-uncased')
# bert-base consists of 12 transformers encoders stacked on top of each other and uncased simply means upper_case == lower_case

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

### Tokenize
The process of converting a sequence of text into smaller parts, known as tokens. 

In [3]:
tokenize = AutoTokenizer.from_pretrained('bert-base-uncased')


In [4]:
# Now let's see how tokenize works

input = tokenize(['hey, where are you going, do you mind if i come with you'], padding=True, truncation=True, return_tensors='tf')

print(input)

{'input_ids': <tf.Tensor: shape=(1, 17), dtype=int32, numpy=
array([[ 101, 4931, 1010, 2073, 2024, 2017, 2183, 1010, 2079, 2017, 2568,
        2065, 1045, 2272, 2007, 2017,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 17), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 17), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}


### 101 and 102 represents the start and end, between that two number they represents the text

In [5]:
output = model(input)
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 17, 768), dtype=float32, numpy=
array([[[ 0.04948921,  0.08287561, -0.17273793, ..., -0.20462075,
          0.28403854,  0.4113491 ],
        [ 0.36707157,  0.00485902,  0.70743006, ...,  0.10564689,
          0.82837844,  0.23365366],
        [ 0.09414288,  0.10443684,  0.21189845, ..., -0.1996503 ,
          0.7478905 , -0.00196859],
        ...,
        [ 0.54930645,  0.02389764,  0.79320383, ..., -0.06597991,
         -0.25350222,  0.04375348],
        [ 0.1795084 , -0.34013513,  0.00304011, ...,  0.23138331,
         -0.00539792, -0.6717876 ],
        [ 0.64187133, -0.20005888, -0.4429922 , ...,  0.07763167,
         -0.4773076 , -0.13752647]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-0.8102302 , -0.44006595, -0.9292423 ,  0.7642224 ,  0.70369965,
        -0.19765958,  0.8371096 ,  0.3066233 , -0.7498013 , -0.9999701 ,
        -0.4604843 ,  0.947

Output returns two tensors, one is hidden_htates and the other is pooler_output

### Now let's try on hugging face dataset

In [6]:
from datasets import load_dataset

emotion = load_dataset('SetFit/emotion')

Repo card metadata block was not found. Setting CardData to empty.


In [7]:
# let's see the dataset
emotion

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

Dataset already has train, test and validation. Train dataset has 16000 samples and test and validation has 2000 sample each

In [8]:
def token(batch):
    return tokenize(batch['text'], padding=True, truncation=True)

In [9]:
emotion_encoded = emotion.map(token, batched=True, batch_size=None)

In [10]:
emotion_encoded

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [11]:
# setting 'input_ids', 'attention_mask', 'token_type_ids', and 'label'
# to the tensorflow format. Now if you access this dataset you will get these
# columns in `tf.Tensor` format

emotion_encoded.set_format('tf', 
                           columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])

# setting batch_size 
batch_size = 64

def order(inp):
    '''
    This function will group all the inputs of BERT
    into a single dictionary and then output it with
    labels.
    '''
    data = list(inp.values())
    return {
        'input_ids': data[1],
        'attention_mask': data[2],
        'token_type_ids': data[3]
    }, data[0]

# converting train split of `emotions_encoded` to tensorflow format
train_dataset = tf.data.Dataset.from_tensor_slices(emotion_encoded['train'][:])
# set batch_size and shuffle
train_dataset = train_dataset.batch(batch_size).shuffle(1000)
# map the `order` function
train_dataset = train_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)

# ... doing the same for test set ...
test_dataset = tf.data.Dataset.from_tensor_slices(emotion_encoded['test'][:])
test_dataset = test_dataset.batch(batch_size)
test_dataset = test_dataset.map(order, num_parallel_calls=tf.data.AUTOTUNE)


In [12]:
inp, out = next(iter(train_dataset)) # a batch from train_dataset
print(inp, '\n\n', out)


{'input_ids': <tf.Tensor: shape=(64, 87), dtype=int64, numpy=
array([[ 101, 1045, 2228, ...,    0,    0,    0],
       [ 101, 1045, 2514, ...,    0,    0,    0],
       [ 101, 1045, 2001, ...,    0,    0,    0],
       ...,
       [ 101, 1045, 2514, ...,    0,    0,    0],
       [ 101, 1045, 2031, ...,    0,    0,    0],
       [ 101, 1045, 4963, ...,    0,    0,    0]])>, 'attention_mask': <tf.Tensor: shape=(64, 87), dtype=int64, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])>, 'token_type_ids': <tf.Tensor: shape=(64, 87), dtype=int64, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])>} 

 tf.Tensor(
[1 1 1 5 1 1 5 0 0 2 3 1 2 4 2 4 1 0 3 0 1 1 0 1 1 1 1 1 0 0 0 

In [13]:
class BertClassification(tf.keras.Model): # inheriting
    def __init__(self, bert_model, num_class):
        super().__init__()
        self.bert = bert_model
        self.fc = tf.keras.layers.Dense(num_class, activation='softmax')

    def call(self, inputs):
        x = self.bert(inputs)[1]
        return self.fc(x)


In [14]:
classifier = BertClassification(model, num_class=6)

classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=['accuracy']
)

In [15]:
history = classifier.fit(train_dataset, epochs=2)

Epoch 1/2


KeyboardInterrupt: 

Since my pc couldn't stand to fit the epochs, i stopped the execution. I hope this will do fine in your machine.

In [None]:
classifier.evaluate(test_dataset)
