Model 4 with the following updates:
- Reduces the maximum sequence length from 128 to 64
- Uses a smaller pre-trained model (DistilBERT instead of BERT)
- Increases the batch size from 8 to 32
- Uses mixed precision training to speed up training

- Test accuracy: 0.76
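
As a rough illustration of how the batch-size and sequence-length changes interact, the sketch below compares optimizer steps per epoch and padded token positions per batch for the two configurations. The training-set size of 35,000 is an assumed round number (the notebook does not state it), and `steps_per_epoch` and `tokens_per_batch` are hypothetical helper names.

```python
import math


def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(n_examples / batch_size)


def tokens_per_batch(batch_size: int, max_length: int) -> int:
    """Upper bound on token positions in one batch padded to max_length."""
    return batch_size * max_length


N_TRAIN = 35_000  # assumed training-set size, for illustration only

old = {"batch_size": 8, "max_length": 128}  # Model 4 settings
new = {"batch_size": 32, "max_length": 64}  # settings used in this notebook

for name, cfg in (("old", old), ("new", new)):
    steps = steps_per_epoch(N_TRAIN, cfg["batch_size"])
    tokens = tokens_per_batch(cfg["batch_size"], cfg["max_length"])
    print(f"{name}: {steps} steps/epoch, up to {tokens} token positions per batch")
```

Under these assumptions the new configuration needs about a quarter as many steps per epoch, while each batch holds at most twice as many token positions.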

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.callbacks import EarlyStopping
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizerFast
from tensorflow.keras import mixed_precision

In [None]:
# Check for GPU
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-967f8dd2-46d1-a564-b826-2bb6a953a60c)


In [None]:
# Enable mixed precision training
mixed_precision.set_global_policy('mixed_float16')

# Read in CSV Data for Twitter Sentiment Analysis
df = pd.read_csv("twitter_sentiment_data.csv")

# Get sentences and labels as NumPy arrays
sentences = df["message"].to_numpy()
labels = df["sentiment"].to_numpy()

# Remap the -1 label to 3 so that all class indices are non-negative
labels[labels == -1] = 3

# One-hot encode the labels
num_classes = 4
labels = tf.keras.utils.to_categorical(labels, num_classes)

# Split into train and test
train_sentences, test_sentences, train_labels, test_labels = train_test_split(sentences, labels, test_size=0.2, random_state=1)

# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(list(train_sentences), truncation=True, padding=True, max_length=64)
test_encodings = tokenizer(list(test_sentences), truncation=True, padding=True, max_length=64)

# Create datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))

# Initialize the model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_classes)

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

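# Create EarlyStopping instance
# Note: validation below uses the test set, so early stopping (with restore_best_weights)
# picks the best epoch on the same data the final metrics are reported on
early_stopping = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)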

# Fit the model
model_history = model.fit(train_dataset.shuffle(1000).batch(32),
                          epochs=5,
                          validation_data=test_dataset.batch(32),
                          callbacks=[early_stopping])

# Evaluate the model on the test set
model_results = model.evaluate(test_dataset.batch(32))
print(f"Loss: {model_results[0]}, Accuracy: {model_results[1]}")

# Make Predictions
model_pred = model.predict(test_dataset.batch(32))

# Convert predictions to labels
model_pred_labels = np.argmax(model_pred.logits, axis=1)

# Print the confusion matrix
cm = confusion_matrix(np.argmax(test_labels, axis=1), model_pred_labels)
print(cm)

# Print the classification report
cr = classification_report(np.argmax(test_labels, axis=1), model_pred_labels)
print(cr)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Loss: 0.5809661149978638, Accuracy: 0.7627716660499573
[[ 542  786  144   92]
 [  80 4136  279   37]
 [  14  230 1617    5]
 [  45  333   40  409]]
              precision    recall  f1-score   support

           0       0.80      0.35      0.48      1564
           1       0.75      0.91      0.83      4532
           2       0.78      0.87      0.82      1866
           3       0.75      0.49      0.60       827

    accuracy                           0.76      8789
   macro avg       0.77      0.66      0.68      8789
weighted avg       0.77      0.76      0.74      8789
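
The per-class figures in the report can be re-derived from the confusion matrix printed above it. A minimal pure-Python sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = [
    [542, 786, 144, 92],
    [80, 4136, 279, 37],
    [14, 230, 1617, 5],
    [45, 333, 40, 409],
]


def per_class_metrics(matrix):
    """Return overall accuracy and a list of (precision, recall) per class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = []
    for i in range(n):
        predicted_i = sum(matrix[r][i] for r in range(n))  # column sum
        actual_i = sum(matrix[i])  # row sum (support)
        precision = matrix[i][i] / predicted_i if predicted_i else 0.0
        recall = matrix[i][i] / actual_i if actual_i else 0.0
        metrics.append((precision, recall))
    return correct / total, metrics


accuracy, metrics = per_class_metrics(cm)
print(f"accuracy: {accuracy:.2f}")
for cls, (precision, recall) in enumerate(metrics):
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
```

Class 0 stands out: 786 of its 1,564 examples are predicted as class 1, which is why its recall is only 0.35 despite a precision of 0.80. Class 3 is the original -1 label after remapping.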

