This notebook builds a neural network sentiment classifier for use with the Butters web application.

I followed this tutorial on transfer learning:
https://www.youtube.com/watch?v=6LXKugY5bFU

If I weren't so lazy, I would label my own YouTube dataset. But I am lazy, so I use an already labelled set of Amazon reviews instead.

In [1]:
# Data Processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Modeling
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Huggingface Dataset
from datasets import Dataset

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

In [2]:
# Read file
df_amz = pd.read_csv('amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Train test split
X_train, X_test, y_train, y_test = train_test_split(df_amz['review'],
                                                    df_amz['label'],
                                                    test_size=0.20,
                                                    random_state=42)

print(f'The train dataset has a length of {len(X_train)} records.')
print(f'The test dataset has a length of {len(X_test)} records.')

The train dataset has a length of 800 records.
The test dataset has a length of 200 records.
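The 800/200 counts follow from `test_size=0.20` applied to 1000 rows. As a rough illustration of what a shuffled split does, here is a minimal pure-Python sketch. The helper `shuffle_split` is hypothetical, and it will not reproduce the exact partition scikit-learn picks for `random_state=42`; it only shows the idea of shuffling once and cutting at a fixed fraction.

```python
import random


def shuffle_split(items, test_size=0.20, seed=42):
    """Shuffle a copy of items, then cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the 1000 review rows: just their row indices.
row_ids = range(1000)
train_ids, test_ids = shuffle_split(row_ids)

print(len(train_ids), len(test_ids))  # 800 200
```

Every row lands in exactly one of the two lists, which is the property that keeps the test score honest.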


In [3]:
# Tokenizer from pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize the reviews
tokenized_data_train = tokenizer(X_train.to_list(), return_tensors='np', padding=True)
tokenized_data_test = tokenizer(X_test.to_list(), return_tensors='np', padding=True)

# Labels as one-dimensional NumPy arrays
labels_train = np.array(y_train)
labels_test = np.array(y_test)

# Tokenized ids
print(tokenized_data_train['input_ids'][0])

[  101 17554   112   189  2080  2965   119   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]
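In the row printed above, 101 and 102 are the ids of BERT's `[CLS]` and `[SEP]` special tokens, and the trailing zeros are `[PAD]` tokens added by `padding=True` so every row has the same length. The tokenizer also returns an `attention_mask` that tells the model which positions are padding. The sketch below rebuilds that mask by hand for this one row, assuming the pad id is 0 as in the `bert-base-cased` vocabulary; in the notebook the mask already lives in `tokenized_data_train['attention_mask']`.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-cased vocabulary

# First training row as printed above: 8 real tokens, then padding up to length 53.
input_ids = [101, 17554, 112, 189, 2080, 2965, 119, 102] + [PAD_ID] * 45

# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = [int(token_id != PAD_ID) for token_id in input_ids]

print(len(input_ids))       # 53
print(sum(attention_mask))  # 8
```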


In [4]:
# Load model
model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])
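`from_logits=True` matters because the model outputs raw scores rather than probabilities, so the loss has to apply the softmax itself. For a single example the quantity is `-log(softmax(logits)[label])`. Here is a minimal pure-Python sketch of that formula; the function name and the example logits are made up for illustration, and Keras additionally averages the value over the batch by default.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for one two-class example.
example_logits = [-2.0, 2.0]

loss_when_right = sparse_ce_from_logits(example_logits, 1)  # label matches the larger logit
loss_when_wrong = sparse_ce_from_logits(example_logits, 0)  # label matches the smaller logit

print(loss_when_right < loss_when_wrong)  # True
```

The labels stay as plain integers (0 or 1), which is what the "sparse" in the loss name refers to; no one-hot encoding is needed.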

In [6]:
# Fit the model
model.fit(dict(tokenized_data_train),
    labels_train,
    validation_data=(dict(tokenized_data_test), labels_test),
    batch_size=4,
    epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2410ce17640>
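The epoch lines above carry no step counts, but the number of optimizer steps follows directly from the values already used in this notebook. A small arithmetic sketch:

```python
import math

n_train = 800    # training records reported after the split
batch_size = 4   # value passed to model.fit
epochs = 2       # value passed to model.fit

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 200
print(total_steps)      # 400
```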

In [21]:
# Predictions
y_test_predict = model.predict(dict(tokenized_data_test))['logits']

# First 5 predictions
y_test_predict[:5]



array([[-1.3897868,  1.4423205],
       [-1.2231984,  1.3496759],
       [-1.5523567,  1.5151092],
       [ 2.1606205, -1.8553412],
       [-1.4364722,  1.4405607]], dtype=float32)

In [10]:
# Predicted probabilities (apply softmax so each row sums to 1)
y_test_probabilities = tf.nn.softmax(y_test_predict)

# First 5 probabilities
y_test_probabilities[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[0.05561361, 0.94438636],
       [0.07090472, 0.92909527],
       [0.04446939, 0.9555306 ],
       [0.98229355, 0.01770644],
       [0.05330066, 0.9466993 ]], dtype=float32)>
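To make the softmax step less of a black box, here is a pure-Python sketch that recomputes the first row by hand from the first logit pair printed earlier. The helper `softmax` is my own illustration, not the TensorFlow implementation, though it computes the same formula.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of logits from the prediction output above.
first_logits = [-1.3897868, 1.4423205]
first_probs = softmax(first_logits)

print(first_probs)  # close to [0.0556, 0.9444], matching the first row above
```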

In [11]:
# Predicted Label
y_test_class_preds = np.argmax(y_test_probabilities, axis=1)

# First 5 labels
y_test_class_preds[:5]

array([1, 1, 1, 0, 1], dtype=int64)
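Since softmax preserves the ordering of the scores within a row, taking the argmax of the raw logits gives the same labels as taking it over the probabilities. The softmax step is only needed when the probabilities themselves are of interest. A quick check on the five rows printed above, using a small hypothetical `argmax` helper:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# First five rows of logits and probabilities, copied from the outputs above.
logits = [
    [-1.3897868, 1.4423205],
    [-1.2231984, 1.3496759],
    [-1.5523567, 1.5151092],
    [2.1606205, -1.8553412],
    [-1.4364722, 1.4405607],
]
probabilities = [
    [0.05561361, 0.94438636],
    [0.07090472, 0.92909527],
    [0.04446939, 0.9555306],
    [0.98229355, 0.01770644],
    [0.05330066, 0.9466993],
]

labels_from_logits = [argmax(row) for row in logits]
labels_from_probs = [argmax(row) for row in probabilities]

print(labels_from_logits)  # [1, 1, 1, 0, 1]
print(labels_from_logits == labels_from_probs)  # True
```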

In [12]:
# Accuracy on the held-out test data (true labels first, predictions second)
accuracy_score(y_test, y_test_class_preds)

0.94
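Accuracy is just the fraction of predictions that equal the true label, so 0.94 on 200 test records corresponds to 188 correct predictions. A minimal sketch of the same calculation without scikit-learn; the helper name and the five-element label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels: four of the five predictions are correct.
example_true = [1, 1, 0, 0, 1]
example_pred = [1, 1, 1, 0, 1]

print(accuracy(example_true, example_pred))  # 0.8
```

With only 200 test records the estimate is fairly coarse: each single prediction moves the score by half a percentage point.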

In [13]:
# Save tokenizer
tokenizer.save_pretrained('./sentiment_transfer_learning_tensorflow/')

# Save model
model.save_pretrained('./sentiment_transfer_learning_tensorflow/')

In [16]:
# Verify model works

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('./sentiment_transfer_learning_tensorflow/')

# Load model
loaded_model = TFAutoModelForSequenceClassification.from_pretrained('./sentiment_transfer_learning_tensorflow/')

Some layers from the model checkpoint at ./sentiment_transfer_learning_tensorflow/ were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./sentiment_transfer_learning_tensorflow/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [17]:
# Predict logits using the loaded model
y_test_predict = loaded_model.predict(dict(tokenized_data_test))['logits']

# First 5 predictions
y_test_predict[:5]



array([[-1.3897868,  1.4423205],
       [-1.2231984,  1.3496759],
       [-1.5523567,  1.5151092],
       [ 2.1606205, -1.8553412],
       [-1.4364722,  1.4405607]], dtype=float32)
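These five rows are identical to the ones produced before saving, which suggests the save and reload round trip preserved the fine-tuned weights. Rather than comparing by eye, the check can be automated. The sketch below uses a hypothetical `rows_close` helper on the printed values; in the notebook itself `np.allclose` on the full arrays would serve the same purpose, provided the original logits are kept under a different name, since the cell above overwrites `y_test_predict`.

```python
import math


def rows_close(a, b, abs_tol=1e-6):
    """True if two lists of rows have the same shape and near-equal values."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        if not all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(row_a, row_b)):
            return False
    return True


# First five logit rows before saving and after reloading, copied from the outputs.
before_save = [
    [-1.3897868, 1.4423205],
    [-1.2231984, 1.3496759],
    [-1.5523567, 1.5151092],
    [2.1606205, -1.8553412],
    [-1.4364722, 1.4405607],
]
after_reload = [
    [-1.3897868, 1.4423205],
    [-1.2231984, 1.3496759],
    [-1.5523567, 1.5151092],
    [2.1606205, -1.8553412],
    [-1.4364722, 1.4405607],
]

round_trip_ok = rows_close(before_save, after_reload)
print(round_trip_ok)  # True
```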