Ken Perry attribution
- The following code is adapted from the Course (as of late May 2022) example notebook `Fine_tune_HuggingFace_model_in_Keras_with_plain_datasets.ipynb`
- Change dataset to Financial Phrasebank
  - The "official" version of the data is hidden behind a download linke
  - To use it, you need to
   - go to the link, manually download it to your local machine
   - upload it to the `/content` directory on Colab
  - I give an alternate source, with more examples


In [11]:
try:
  from google.colab import drive
  IN_COLAB=True
except:
  IN_COLAB=False

if IN_COLAB:
  print("We're running Colab")

We're running Colab


In [12]:
import tensorflow as tf

print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
import re

version_match = re.match("([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )

Running TensorFlow version  2.8.2
Version 2, minor 8


In [13]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
if gpu_devices:
    print('Using GPU')
    tf.config.experimental.set_memory_growth(gpu_devices[0], True)
else:
    print('Using CPU')

Using GPU


In [14]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np

In [15]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
!pip install  datasets
from datasets import load_dataset



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# The dataset seems to be hidden behind a `download` link (`dataset_downlolad_url`). 

Invoke the link (`dataset_download_url`) manually to get the zip file instead

In [17]:
(dataset_name, dataset_subset_name) = ("financial_phrasebank", "sentences_allagree")
dataset_downlolad_url = "https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10/link/0c96051eee4fb1d56e000000/download"
dataset_zipfile_url = "https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip"

download_path = "/content/FinancialPhraseBank-v1.0.zip"

from datasets import load_dataset

# dataset = load_dataset("financial_phrasebank",'sentences_allagree')


The download URL is hidden behind a "download" link
- You must manually visit the URL
- Click on "Download"
- And then upload to the `/content` directory on Colab

In [18]:
import os

if not os.path.exists(download_path):
  print("You must manually go to the URL: ", dataset_zipfile_url, "\n\tdownload the file and upload it to Colab")
  # !wget $dataset_zipfile_url

  

In [19]:
unzipped_dir=download_path.replace(".zip", "")
unzipped_file=os.path.join(unzipped_dir, "Sentences_AllAgree.txt")


if not os.path.exists(unzipped_file):
  ! unzip $download_path

print("Loading: ", unzipped_file)

Archive:  /content/FinancialPhraseBank-v1.0.zip
   creating: FinancialPhraseBank-v1.0/
  inflating: FinancialPhraseBank-v1.0/License.txt  
   creating: __MACOSX/
   creating: __MACOSX/FinancialPhraseBank-v1.0/
  inflating: __MACOSX/FinancialPhraseBank-v1.0/._License.txt  
  inflating: FinancialPhraseBank-v1.0/README.txt  
  inflating: __MACOSX/FinancialPhraseBank-v1.0/._README.txt  
  inflating: FinancialPhraseBank-v1.0/Sentences_50Agree.txt  
  inflating: FinancialPhraseBank-v1.0/Sentences_66Agree.txt  
  inflating: FinancialPhraseBank-v1.0/Sentences_75Agree.txt  
  inflating: FinancialPhraseBank-v1.0/Sentences_AllAgree.txt  
Loading:  /content/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt


Unfortunately, the unzipped file is not encoded as utf-8, so `load_dataset` failes when it encounters a non-Unicode character.

Can read it as a CSV file by passing the proper encoding argument and separator.
Then write it back out as a CSV file in "standard" encoding.

In [20]:
import pandas as pd
df = pd.read_csv(unzipped_file, encoding='latin1', delimiter='@', header=None)


In [21]:
unzipped_file_mod = unzipped_file.replace(".txt", "_mod.csv")
(text_hdr, label_hdr) = ("text", "labels")
df.to_csv(unzipped_file_mod, sep="\t", header=[text_hdr, label_hdr], index=None)

In [22]:
raw_datasets = load_dataset("csv", data_files=unzipped_file_mod, delimiter="\t")

Using custom data configuration default-cd5e39cacf45ea2b


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-cd5e39cacf45ea2b/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-cd5e39cacf45ea2b/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [23]:

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 2264
    })
})

In [24]:
raw_datasets["train"][:2]

{'labels': ['neutral', 'positive'],
 'text': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."]}

In [25]:
label_to_int = { 'negative': 0, 'neutral': 1, 'positive': 2}
def process_example(example):
  text, label = example[text_hdr], example[label_hdr]

  # Replace label with integer:
  label_int = label_to_int[label]

  return { text_hdr: text, label_hdr: label_int }

In [26]:
example = raw_datasets["train"][0]
print("Raw example: ", example)

print("Processed example: ", process_example(example) )


Raw example:  {'text': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'labels': 'neutral'}
Processed example:  {'text': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .', 'labels': 1}


In [27]:
processed_dataset = raw_datasets.map(process_example)

  0%|          | 0/2264 [00:00<?, ?ex/s]

In [28]:
processed_dataset["train"][:2]

{'labels': [1, 2],
 'text': ['According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .',
  "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."]}

# Alternate (pre-processed) data source

In [29]:
dataset = load_dataset("nickmuchi/financial-classification")


Downloading:   0%|          | 0.00/768 [00:00<?, ?B/s]

Using custom data configuration nickmuchi--distilroberta-finetuned-finclass-abb0f3d5c2987b89


Downloading and preparing dataset None/None (download: 412.04 KiB, generated: 678.23 KiB, post-processed: Unknown size, total: 1.06 MiB) to /root/.cache/huggingface/datasets/nickmuchi___parquet/nickmuchi--distilroberta-finetuned-finclass-abb0f3d5c2987b89/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/378k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/44.1k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/nickmuchi___parquet/nickmuchi--distilroberta-finetuned-finclass-abb0f3d5c2987b89/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [30]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 4551
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 506
    })
})

In [31]:
dataset["train"][:2]

{'labels': [0, 1],
 'text': ['Finnish airline Finnair is starting the temporary layoffs of cabin crews in February 2010 .',
  'The corresponding increase in the share capital , in total EUR 300,00 was registered in the Finnish Trade Register on May 8 , 2008 .']}

# Re-using a `DistilBert` model with a task specific Classifer head

`BERT` is a *very large* Language Model.

`DistilBert` is a *much smaller* model obtained from `BERT` via a process known as distillation


Let's take a look at the model configuration of each model

In [32]:
from transformers import DistilBertConfig, BertConfig

DistilBertConfig()

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.21.0",
  "vocab_size": 30522
}

In [33]:
BertConfig()

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

A couple of comparisons between the two models
- both models produce a sequence of latent vectors (sequence length equal to length of input sequence)
- the latent dimension of BERT (`hidden_size`) and `DistilBert` (`dim`) are both 768
- the number of layers of BERT ('num_hiden_layers`) is 12; `DistilBert` (`n_layers`) is 6
- both has 12 attention heads per layer


# Instantiating the pre-trained model

We are going to adapt `DistilBert` to a new "Target" task
- `DistilBert` was trained on the Masked Language Modelling task
- So the complete model includes a Classification head for that task
- Our task is different: Text Sequence Classification
- We will therefore invoke a "headless" version of the model and graft on our own head
  - which will need to be trained



By invoking the model with the `*Model` architecture: we get a model that returns (the sequence of) hidden states.  That is: a model without a head.

Had we invoked it with the `*AutoModelForSequenceClassification` architecture, we get a model with a *binary* classification head.
- But this dataset has *three* classes, so have to design a head with 3 outputs


In [34]:
from transformers import DistilBertTokenizerFast, TFDistilBertModel
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/347M [00:00<?, ?B/s]

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


We get warning messages because
- `DistilBert` was trained for Masked Language Modelling and we are invoking a "headless" model
  - because we need to use it for a different task: Text Sequence Classification

We will accomplish this by deriving a sub-class of `kera.Model`
- that *contains* a `DistilBert` model
  - refered to as the "encoder"
- override the `call` method
  - to invoke the encoder
  - post-process the output (obtain the encoding of the special `[CLS]` input token
  - use the encoding of `[CLS]` as input to a task-specfiic Classifier head

In [35]:
class TextClassificationModel(keras.Model):
  def __init__(self, encoder, train_encoder=True):
    super(TextClassificationModel, self).__init__()
    self.encoder = encoder
    self.encoder.trainable = train_encoder
    self.dropout1 = layers.Dropout(0.1)
    self.dropout2 = layers.Dropout(0.1)
    self.dense1 = layers.Dense(20, activation="relu")
    self.dense2 = layers.Dense(3, activation='softmax')
  
  def call(self, input):
    x = self.encoder(input)
    x = x['last_hidden_state'][:, 0, :]
    x = self.dropout1(x)
    x = self.dense1(x)
    x = self.dropout2(x)
    x = self.dense2(x)
    return x

# Prepare the data
- split into train and test datasets
- tokenize train and test datasets
- create TensorFlow `tf.data.Dataset`

In [36]:
len( processed_dataset["train"]["labels"] )

2264

In [37]:
target_labels = list( set(raw_datasets["train"]["labels"]) )

print(f"Target task labels: {', '.join(target_labels)}")


Target task labels: positive, negative, neutral


In [38]:
print(f"Target task labels: {', '.join(target_labels)}")

Target task labels: positive, negative, neutral


First try:
- place all examples/labels in memory, rather than HF dataset
- then tokenize them and place them in a TF Dataset and free memory

In [39]:
train_texts, train_labels = processed_dataset["train"]["text"], processed_dataset["train"]["labels"]

In [40]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [41]:
len(val_texts)

453

In [42]:
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding="max_length", max_length=512)

In [43]:
type(train_encodings)

transformers.tokenization_utils_base.BatchEncoding

In [44]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [45]:
type( dict(train_encodings))

dict

In [46]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

Free up memory
- the data is now in the `tf.data.Datasets`, don't need to keep the original in memory

In [47]:
num_train = len(train_texts)

del(train_texts)
del(val_texts)

In [48]:
del(train_encodings)
del(val_encodings)

# Transfer Learning
- Just train the Classification head
- **do not** modify the weights of the "Encoder" (`DistilBert`) model contained within `text_classification_model`

Create `text_classification_model` by adding a trainable Classifiation head to a frozen `DistilBert` model 

In [49]:
text_classification_model = TextClassificationModel(bert, train_encoder=False)

We would like to do `text_classification_model.summary()` right now
- but it will fail because "the model hasn't been built"
  - this means that the size of the inputs are unknown as of yet
  - either we invokde `build` on the model and specify the input shape
  - or we call the model with some data, thus indicating the input shape
    - we do the latter
    - create a batched dataset
    - process the first batch through the model

In [50]:
first_batch_outputs = text_classification_model(next(iter(train_dataset.batch(4))))

print(f"First batch outputs -- number of examples in batch: {first_batch_outputs.shape[0]}")

print(f"First batch outputs -- number of classes: {first_batch_outputs.shape[1]}")

print(f"First batch outputs -- sum of outputs of each row: {tf.reduce_sum(first_batch_outputs, axis=1)}.")

print()
print(f"First batch outputs:")
first_batch_outputs

First batch outputs -- number of examples in batch: 4
First batch outputs -- number of classes: 3
First batch outputs -- sum of outputs of each row: [1. 1. 1. 1.].

First batch outputs:


<tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[0.4192836 , 0.19302693, 0.38768947],
       [0.40941504, 0.28450367, 0.3060813 ],
       [0.4528499 , 0.18922348, 0.35792667],
       [0.38919824, 0.22196409, 0.38883772]], dtype=float32)>

We can see from the above output shape
- 4 rows = batch size 4
- 3 columns: corresponds to the 3 classes
- column values appear to be probabilities (sum to 1), not logits

OK, time to get the model summary

In [51]:
def count_weights(weights_per_layer, prefix=None):
  count_weights = 0

  for layer, weights in enumerate(weights_per_layer):
    num_weights = np.prod(weights.shape)

    if prefix is not None:
      print(f"Trainable layer {layer} has {num_weights} weights")

    count_weights += num_weights

  return count_weights

def count_model_weights(model):
  all_weights = model.weights
  trainable_weights = model.trainable_weights

  num_layers = len( model.layers )

  # Control detailed output: supress if number of layers (length of trainable_weights) is too big
  out_prefix = "trainable" if len(trainable_weights) < 10 else None

  num_weights, num_trainable_weights = count_weights(all_weights, None), count_weights(trainable_weights, out_prefix)                                                                         

  return num_weights, num_trainable_weights
  


In [52]:
num_weights, num_trainable_weights = count_model_weights(text_classification_model)

print()
print(f"Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")

Trainable layer 0 has 15360 weights
Trainable layer 1 has 20 weights
Trainable layer 2 has 60 weights
Trainable layer 3 has 3 weights

Total number of weights 66,378,323, number of trainable weights 15,443


In [53]:
text_classification_model.summary()

Model: "text_classification_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 tf_distil_bert_model (TFDis  multiple                 66362880  
 tilBertModel)                                                   
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 dropout_20 (Dropout)        multiple                  0         
                                                                 
 dense (Dense)               multiple                  15380     
                                                                 
 dense_1 (Dense)             multiple                  63        
                                                                 
Total params: 66,378,323
Trainable params: 15,443
Non-trainable params: 66,362,880
________________________

Let's examine the last 2 layers

In [54]:
for i, layer in enumerate( text_classification_model.layers[-2:] ):
  print(f"Layer {-2 + i}: {type(layer)} weights {layer.weights[0].shape}, biases {layer.weights[1].shape}")


Layer -2: <class 'keras.layers.core.dense.Dense'> weights (768, 20), biases (20,)
Layer -1: <class 'keras.layers.core.dense.Dense'> weights (20, 3), biases (3,)


You can see from the above
- the latent representation size (of the single `[CLS]` token) is 768
- the next to last layer is `Dense`, converts this to 20 features
- the last layer (Classifier) converts to 20 features to 3 classes



## Train: head only

In [55]:
text_classification_model.compile(
    tf.keras.optimizers.Adam(learning_rate=5e-5), 
    "sparse_categorical_crossentropy", 
    metrics=["accuracy"])


In [56]:

from tensorflow.python.ops.gen_logging_ops import histogram_summary

def train_model(model, train_dataset, val_dataset, num_epochs=4):
    history = model.fit(
      train_dataset.shuffle(1000).batch(16), 
      epochs=num_epochs, 
      validation_data=val_dataset.batch(16)
      #callbacks=[tensorboard_callback]
    )
    
    return history
    
def train_model_in_chunks(model, train_dataset, val_dataset, num_chunks=4, num_epochs=1):
  # Divide training set into chunks
  chunk_size = num_train // num_chunks

  print(f"training on {num_train} examples in chunks of size {chunk_size}")

  for epoch_num in range(num_epochs):
    for chunk_num in range(num_chunks):
      print(f"Epoch {epoch_num}, chunk {chunk_num}:")
      history = model.fit(
        train_dataset.skip(chunk_num * chunk_size).take(chunk_size).shuffle(1000).batch(16), 
        epochs=1, 
        validation_data=val_dataset.batch(16)
        # validation_data=val_dataset.take(500).batch(16),
        #callbacks=[tensorboard_callback]
    )
      
  return history

In [57]:
train_model(text_classification_model, train_dataset, val_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f54a922afd0>

## Accuracy: Evaluate accuracy on validation data

In [58]:
from sklearn.metrics import accuracy_score 


In [59]:
def eval_model(model, val_dataset, val_labels, batch_size=16):
  val_logits = model.predict( val_dataset.batch(batch_size) )

  # Depending on the model, the return type of val_logits can vary
  # - ndarray
  # - Hugging Face Sequence Model output type

  try:
    # If it is a Hugging Face return type, the logits are in the result attribute 'logits'
    hf_logits = val_logits.logits
    val_logits = hf_logits
  except:
    pass

  val_preds = np.argmax( val_logits, axis=1)

  acc = accuracy_score( val_labels, val_preds)  
  return acc

In [60]:
print(f"Transfer learning (head-only) accuracy: {eval_model(text_classification_model, val_dataset, val_labels):3.2f}")

Transfer learning (head-only) accuracy: 0.65


## Fine tuning: train **all** layers

Now that the head has been trained, it's safe to update weights for the "Encoder"
- had we not trained the head first
- the gradients in the initial batches would have bee large
- and updateing the Encoder weights with these large gradients would have been harmful


Unfreeze the weights in the embedded "Encoder" `Distilbert`

In [61]:
text_classification_model.encoder.trainable = True

In [62]:

num_weights, num_trainable_weights = count_model_weights(text_classification_model)

print()
print(f"Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")


Total number of weights 66,378,323, number of trainable weights 66,378,323


In [63]:
train_model(text_classification_model, train_dataset, val_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f54a9164c10>

## Accuracy after fine-tuning: Evaluate accuracy on validation data

In [64]:
print(f"Transfer learning (fine-tuning -- all weights) accuracy: {eval_model(text_classification_model, val_dataset, val_labels):3.2f}")

Transfer learning (fine-tuning -- all weights) accuracy: 0.71


# Simpler approach: auto-generated Text Sequence Classification Head

Hugging Face has a generic `TFAutoModelForSequenceClassification` class
- that invoked the `*ForSequenceClassification` variant of a given model
- result is a model that *includes* 
  - the post-processing steps needed to feed a Classification head
  - an (uninitialized) Classification Head
    - we need to tell the head how many classes are possible: `num_labels` argument

Similarly: we can obtain the tokenizer used by a variant of a given model

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```
This is not necessary for us as we have already tokenized the data
- and convert to a `tf.data.Dataset`

In [65]:
from transformers import TFAutoModelForSequenceClassification
text_classification_model_hf = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(target_labels) )

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_40', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [66]:
num_weights, num_trainable_weights = count_model_weights(text_classification_model_hf)

print()
print(f"AutoModel: Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")


AutoModel: Total number of weights 66,955,779, number of trainable weights 66,955,779


From the above:
- looks like **all** weights are trainable

Probably not a good idea to Fine-Tune before training the Classifiction Head !

Let's address that:

In [67]:
text_classification_model_hf.layers

[<transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertMainLayer at 0x7f545956bc90>,
 <keras.layers.core.dense.Dense at 0x7f5459fe7190>,
 <keras.layers.core.dense.Dense at 0x7f5459fe7550>,
 <keras.layers.core.dropout.Dropout at 0x7f5459fe7890>]

Model architecture created by `TFAutoModelForSequenceClassification` is just like the one we created by hand.

Let's set the `TFDistilBert` model contained within to non-trainable

In [68]:
text_classification_model_hf.layers[0].trainable = False

num_weights, num_trainable_weights = count_model_weights(text_classification_model_hf)

print()
print(f"AutoModel -- head only: Total number of weights {num_weights:,}, number of trainable weights {num_trainable_weights:,}")

Trainable layer 0 has 589824 weights
Trainable layer 1 has 768 weights
Trainable layer 2 has 2304 weights
Trainable layer 3 has 3 weights

AutoModel -- head only: Total number of weights 66,955,779, number of trainable weights 592,899


In [69]:
text_classification_model_hf.compile(
    tf.keras.optimizers.Adam(learning_rate=5e-5), 
    "sparse_categorical_crossentropy", 
    metrics=["accuracy"])


In [70]:
train_model(text_classification_model_hf, train_dataset, val_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f5459f5b310>

In [71]:
print(f"Transfer learning (head-only) accuracy: {eval_model(text_classification_model_hf, val_dataset, val_labels):3.2f}")

Transfer learning (head-only) accuracy: 0.59


# Extra material: Understand datasets

The model takes a `dict` as argument, **not** an array of examples

The `dict` has keys for
- `input_ids`, `attention_mask`
- the value associated with each key is an array (of length equal to number of examples)

A batch of "examples" is thus a `dict` of arrays, **not** and array of `dict`'s !

`val_dataset` batch is:
- a tuple of length 2
  - features
    - a dict of key/value pairs
      - the values associated with a key is an array of size `batch_size`
  - labels
    - one label per example, hence an array of size `batch_size`

In [72]:
batch_size = 16

e = next( iter(val_dataset.batch(batch_size)) )
e_features, e_labels = e
e_features

{'attention_mask': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
 array([[  101,  1996,  3643, ...,     0,     0,     0],
        [  101,  1996,  4423, ...,     0,     0,     0],
        [  101,  2002,  2003, ...,     0,     0,     0],
        ...,
        [  101,  4082,  2765, ...,     0,     0,     0],
        [  101,  2041, 12184, ...,     0,     0,     0],
        [  101,  2045,  2097, ...,     0,     0,     0]], dtype=int32)>}

In [73]:
e_labels

<tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1], dtype=int32)>

To **manually** create a batch of examples (features only)
- Need to create a `dict`
- with the same keys
- whose arrays are sub-arrays (of length `batch_size`) of the entire set of examples

In [74]:
b = { k: e_features[k][:batch_size] for k in e_features.keys() }

# Try to predict using our manually created batch
text_classification_model.predict( b )

array([[0.1334348 , 0.7005338 , 0.16603138],
       [0.0846777 , 0.8001338 , 0.11518849],
       [0.08216178, 0.82754105, 0.0902971 ],
       [0.07863206, 0.8191407 , 0.10222724],
       [0.31440872, 0.0747483 , 0.610843  ],
       [0.09229304, 0.76575446, 0.14195257],
       [0.07644745, 0.8087645 , 0.11478806],
       [0.14345492, 0.66183734, 0.19470772],
       [0.0822947 , 0.81013393, 0.10757133],
       [0.10929512, 0.7545688 , 0.13613598],
       [0.1902226 , 0.4116093 , 0.3981681 ],
       [0.26054555, 0.19864237, 0.54081213],
       [0.14728542, 0.5823515 , 0.27036315],
       [0.27928934, 0.25023046, 0.47048017],
       [0.05744514, 0.8606822 , 0.08187271],
       [0.05708189, 0.83358884, 0.10932927]], dtype=float32)

In [75]:
# Compare to predict using the batch created by Dataset operations
# The dataset returns a pair: (features, labels).  Don't need labels to predict so the "[0]" is selecting the features from the pair
text_classification_model.predict( next( iter(val_dataset.batch(batch_size)) )[0] )


array([[0.1334348 , 0.7005338 , 0.16603138],
       [0.0846777 , 0.8001338 , 0.11518849],
       [0.08216178, 0.82754105, 0.0902971 ],
       [0.07863206, 0.8191407 , 0.10222724],
       [0.31440872, 0.0747483 , 0.610843  ],
       [0.09229304, 0.76575446, 0.14195257],
       [0.07644745, 0.8087645 , 0.11478806],
       [0.14345492, 0.66183734, 0.19470772],
       [0.0822947 , 0.81013393, 0.10757133],
       [0.10929512, 0.7545688 , 0.13613598],
       [0.1902226 , 0.4116093 , 0.3981681 ],
       [0.26054555, 0.19864237, 0.54081213],
       [0.14728542, 0.5823515 , 0.27036315],
       [0.27928934, 0.25023046, 0.47048017],
       [0.05744514, 0.8606822 , 0.08187271],
       [0.05708189, 0.83358884, 0.10932927]], dtype=float32)

In [76]:
num_val = 10

