# BERT for Tweet Sentiment Analysis

In this notebook, we will be exploring three techniques of using BERT for sentimental classification. We would be keeping track of the out of vocabulary words when using pre-trained tokenizers on the dataset.

## Pre-Trained Model with Fine Tuning

We will make use of pre-trained Tokenizers and Classifiers, which by default allow all the layers of the encoder to be trainable (so that the gradients can be backpropogated to the same).

We will also using this section to creating a uiltity called `get_input_from_data` which will convert the passed `Pandas` dataframe to a dictionary containing two keys: `input_ids` and `attention_mask`. Both of them are utilised by the `BERT` models. We would be using `TFBertForSequenceClassification` model, which is a `BERT` Model accompanied by a neural network layer on top of it for classification.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFBertForSequenceClassification

# Load the dataset
df = pd.read_csv("data/sample_tweets.csv", sep=",", names=["label", "text"], header=0)

df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)
df_test, df_val = train_test_split(df_test, test_size=0.5, random_state=42)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = TFBertForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=2
)
max_len = 128

def get_input_from_data(df):
    X = tokenizer(
        text=df["text"].tolist(),
        add_special_tokens=True,
        truncation=True,
        max_length=max_len,
        padding="max_length",
        return_tensors="tf",
        return_token_type_ids=False,
        return_attention_mask=True,
        verbose=True,
    )

    x = {
        "input_ids": X["input_ids"],
        "attention_mask": X["attention_mask"],
    }
    y = df["label"]

    return x, y

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, GlobalMaxPool1D

# Convert the dataset to the format required by BERT
x_train, y_train = get_input_from_data(df_train)
x_val, y_val = get_input_from_data(df_val)
x_test, y_test = get_input_from_data(df_test)

bert.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = bert.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=5,
    batch_size=32
)

# Evaluate the model
results = bert.evaluate(x_test, y_test, batch_size=32)
# print("Test loss, Test acc:", results)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [3]:
# Get the out of vocab ratio for the dataset using the pretrained tokenizer
vocab = set(tokenizer.get_vocab())
total_count = 0
out_of_vocab_count = 0

for sentence in df["text"].values:
    for word in sentence.split():
        total_count += 1
        if word not in vocab:
            out_of_vocab_count += 1

print(f"Total words: {total_count}")
print(f"Out of vocab words: {out_of_vocab_count}")
print(f"Out of vocab ratio: {out_of_vocab_count / total_count}")

Total words: 11123
Out of vocab words: 7077
Out of vocab ratio: 0.6362492133417244


As we can see that almost 63% off the words are out-of-vocabulary for the tokenizer that we are using. In the next step, we will use the same tokenizer and the same BERT model, but we would set the layer to be non-trainable.

## Pre-Trained BERT Model

We would be using the same model as in the above example, but instead of fine-tuning it, we would use it directly with the pre-trained weights. We would only allow the neural network layers to update their weights to match the classification task.

In [4]:
bert = TFBertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Freeze the layers to use the pre-trained model
for _layer in bert.layers:
    print(f"Model layer: {_layer.name}")
    if _layer.name == "bert":
        _layer.trainable = False
        print("Freezed this layer")

# Compile the model and perform evaluation
bert.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = bert.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=5,
    batch_size=32
)

results = bert.evaluate(x_test, y_test, batch_size=32)
print("Test loss, Test acc:", results)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model layer: bert
Freezed this layer
Model layer: dropout_75
Model layer: classifier
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss, Test acc: [3.2134838104248047, 0.3733333349227905]


We would not be performing the vocabulary analysis for the same, as since we are using the same tokenizer, the proportion of out of vocabulary words will remain the same. However we can see the difference fine-tuning has on the performance of the model.

## Training from Scratch

In this example, we would be using the exisitng tokenizer, and training two models from scratch on the given data.
1. Using only the `TFBert` model and adding a neural network classifier on the top of it manually. \
2. Using the `TFBertForSequenceClassification` layer directly.

In [5]:
# Install the required transformers library
%pip install transformers



In [7]:
from transformers import TFBertModel, BertConfig

# Set the BERT config
config = BertConfig(
    vocab_size=len(tokenizer.get_vocab()),
    hidden_size=128,
    num_attention_heads=4,
    initializer_range=0.01,
)
bert = TFBertModel(config)

# Model Architecture
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
# 0 = last hidden state, 1 = poller_output
embeddings = bert(input_ids, attention_mask=input_mask)[0]
out = GlobalMaxPool1D()(embeddings)
out = Dense(128, activation="relu")(out)
out = Dropout(0.1)(out)
out = Dense(1, activation="softmax")(out)

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=out)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Train and evaluate
model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=5,
    batch_size=32
)
results = model.evaluate(x_test, y_test, batch_size=32)
print("Test loss, Test acc:", results)

Epoch 1/5




Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss, Test acc: [0.6730362176895142, 0.3733333349227905]


In [8]:
# Set the BERT config
config = BertConfig(
    vocab_size=len(tokenizer.get_vocab()),
    hidden_size=128,
    num_attention_heads=4,
    num_labels=2,
    initializer_range=0.01,
)
bert = TFBertForSequenceClassification(
    config
)

bert.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Train and evaluate
model.fit(
    x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=32
)
results = model.evaluate(x_test, y_test, batch_size=32)
print("Test loss, Test acc:", results)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss, Test acc: [0.6641815304756165, 0.3733333349227905]


One thing to note that we have considerably lower accuracy in the case of these models, as we have random initialization of weights, and the model isn't able to learn much about the semantics of the langauge quickly. Some of the accuracy can be attributed to using a pretrained tokenizer as well (though it supports only about 37% of the vocabulary, so how much effect it effectively has is still a bit unclear).