# __Test another way to finetune BERT__

Based on [Classify text with BERT](https://www.tensorflow.org/text/tutorials/classify_text_with_bert) from TensorflowHub.

## __Setup__

### _Install_

```bash
!pip install -q -U "tensorflow-text==2.8.*"
!pip install -q tf-models-official==2.7.0
````

### _Import_

In [1]:
import os
import json
import shutil
import pandas as pd
from pathlib import Path

from sklearn import model_selection

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

### _Configuration info_

In [29]:
# Reproducibility
seed = 20220609

# Setting paths
work_dir          = Path.home() / "projects/plant_sci_hist/"
corpus_combo_file = work_dir / "2_text_classify/corpus_combo"

# Dataset
batch_size     = 32
shuffle_buffer = 2

# https://stackoverflow.com/questions/56613155/tensorflow-tf-data-autotune
# tf.data builds a performance model of the input pipeline and runs an 
# optimization algorithm to find a good allocation of its CPU budget across all
# parameters specified as AUTOTUNE
AUTOTUNE = tf.data.AUTOTUNE

# maximum number of tokens in a document
max_length = 512


## __Get text ready__

### _Read json to dataframe_

In [3]:
def split_train_validate_test(corpus_combo_file, rand_state):
  '''Load data and split train, validation, test subsets for the cleaned texts
  Args:
    corpus_combo_file (str): path to the json data file
    rand_state (int): for reproducibility
  Return:
    train, test, test (pandas dataframes): training, validation, testing sets
  '''
  # Load json file
  with corpus_combo_file.open("r+") as f:
      corpus_combo_json = json.load(f)

  # Convert json back to dataframe
  corpus_combo = pd.read_json(corpus_combo_json)

  # Cleaned corpus
  corpus = corpus_combo[['label','txt']]

  # Split train test
  train, test = model_selection.train_test_split(corpus, 
      test_size=0.2, stratify=corpus['label'], random_state=rand_state)

  # Split train validate
  train, valid = model_selection.train_test_split(train, 
      test_size=0.25, stratify=train['label'], random_state=rand_state)

  return train, valid, test

In [4]:
train, valid, test = split_train_validate_test(corpus_combo_file, seed)

### _Convert training dataframe to dataset_

- See the [pd_dataframe_to_tf_dataset](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/pd_dataframe_to_tf_dataset) function, but this needs tf 2.9, conflict with tensorflow_text.
- See [this](https://www.tensorflow.org/tutorials/load_data/pandas_dataframe): See the shuffle and batch functions. Does not work...
- See [this post](https://medium.com/when-i-work-data/converting-a-pandas-dataframe-into-a-tensorflow-dataset-752f3783c168):  Was able to create SicedDataset, then BatchDatabase after applying the batch function, then PrefetchDataset. But trying to retreive a test example from trainin dataset lead to:
  - InvalidArgumentError: Index out of range using input dim 0; input has only 0 dims [Op:StridedSlice] name: strided_slice/
  - Ok, as I was implmenting the next solution, realize that I did not call the right obj for prefetch. Can be the reason why.
- Ah, see [this post](https://stackoverflow.com/questions/58461609/how-to-convert-pandas-dataframe-to-tensorflow-dataset): key is to turn train_data to dictionary before calling from_tensor_slices.
  - A little comment below say need to do .to_dict() instead which make sense. Because if just do dict(train), the thing finish in 0.1 sec which does not make sense. But this fails and throw:
    - ValueError: Unbatching a tensor is only supported for rank >= 1
  - Found [this post](https://stackoverflow.com/questions/55560620/valueerror-unbatching-a-tensor-is-only-supported-for-rank-1): Now try to uses another syntax.

In [5]:
# This creates a TensorSliceDataset

# The following does not work
#raw_train_ds = (tf.data.Dataset.from_tensor_slices(
#        (tf.cast(train['txt'].values, tf.string),
#         tf.cast(train['label'].values, tf.int32),)))
#raw_train_ds = tf.data.Dataset.from_tensor_slices(train)
#raw_train_ds = tf.data.Dataset.from_tensor_slices(dict(train))
#raw_train_ds = tf.data.Dataset.from_tensor_slices(train.to_dict())

X_train      = train['txt']
y_train      = train['label']
raw_train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
type(raw_train_ds)

2022-06-18 17:25:12.873849: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:08:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-06-18 17:25:12.926073: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:08:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-06-18 17:25:12.926404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:08:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-06-18 17:25:12.928202: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate

tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

In [6]:
raw_train_ds_batch = raw_train_ds.batch(batch_size)
type(raw_train_ds_batch)

tensorflow.python.data.ops.dataset_ops.BatchDataset

In [7]:
# This throws:
#  AttributeError: 'TensorSliceDataset' object has no attribute 'class_names'
#class_names = raw_train_ds.class_names
class_names = [0,1]

In [8]:
# From BatchDataset, get PrefetchDataset
train_ds = raw_train_ds_batch.cache().prefetch(buffer_size=AUTOTUNE)
type(train_ds)

tensorflow.python.data.ops.dataset_ops.PrefetchDataset

### _Covert validation and test sets_

In [10]:
# Get validation dataset
X_valid      = valid['txt']
y_valid      = valid['label']
raw_valid_ds = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
raw_valid_ds_batch = raw_valid_ds.batch(batch_size)
valid_ds = raw_valid_ds_batch.cache().prefetch(buffer_size=AUTOTUNE)

# Get testing dataset
X_test      = test['txt']
y_test      = test['label']
raw_test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
raw_test_ds_batch = raw_test_ds.batch(batch_size)
test_ds = raw_test_ds_batch.cache().prefetch(buffer_size=AUTOTUNE)

### _Testing_

In [13]:
for text_batch, label_batch in train_ds.take(1):
  print(text_batch[0])
  print(len(text_batch))
  print(label_batch)
  print(len(label_batch))

tf.Tensor(b'Analysis of leaf microbiome composition of near-isogenic maize lines differing in broad-spectrum disease resistance.. Plant genotype strongly affects disease resistance, and also influences the composition of the leaf microbiome. However, these processes have not been studied and linked in the microevolutionary context of breeding for improved disease resistance. We hypothesised that broad-spectrum disease resistance alleles also affect colonisation by nonpathogenic symbionts. Quantitative trait loci (QTL) conferring resistance to multiple fungal pathogens were introgressed into a disease-susceptible maize inbred line. Bacterial and fungal leaf microbiomes of the resulting near-isogenic lines were compared with the microbiome of the disease-susceptible parent line at two time points in multiple fields. Introgression of QTL from disease-resistant lines strongly shifted the relative abundance of diverse fungal and bacterial taxa in both 3-wk-old and 7-wk-old plants. Neverthel

2022-06-18 17:26:19.455255: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [14]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(1):
    print(f'Review: {text_batch.numpy()[i]}')
    label = label_batch.numpy()[i]
    print(f'Label : {label} ({class_names[label]})')

Review: b'Analysis of leaf microbiome composition of near-isogenic maize lines differing in broad-spectrum disease resistance.. Plant genotype strongly affects disease resistance, and also influences the composition of the leaf microbiome. However, these processes have not been studied and linked in the microevolutionary context of breeding for improved disease resistance. We hypothesised that broad-spectrum disease resistance alleles also affect colonisation by nonpathogenic symbionts. Quantitative trait loci (QTL) conferring resistance to multiple fungal pathogens were introgressed into a disease-susceptible maize inbred line. Bacterial and fungal leaf microbiomes of the resulting near-isogenic lines were compared with the microbiome of the disease-susceptible parent line at two time points in multiple fields. Introgression of QTL from disease-resistant lines strongly shifted the relative abundance of diverse fungal and bacterial taxa in both 3-wk-old and 7-wk-old plants. Nevertheles

2022-06-18 17:26:29.062032: W tensorflow/core/kernels/data/cache_dataset_ops.cc:768] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


## __Define Hub models and initial testing__

### _Hub models to use_

Use BERT trained on MEDLINE/Pubmed:
- https://tfhub.dev/google/experts/bert/pubmed/2

In [16]:
tfhub_encoder = 'https://tfhub.dev/google/experts/bert/pubmed/2'
tfhub_preproc = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

### _Load and test preprocessing model_

In [17]:
bert_preprocess_model = hub.KerasLayer(tfhub_preproc)
type(bert_preprocess_model)

tensorflow_hub.keras_layer.KerasLayer

In [22]:
# Note this is not a layer but a saved model
bert_preprocess = hub.load(tfhub_preproc)
type(bert_preprocess)

tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject

In [28]:
text_test = ['This paper is about Plant, like maize, rice, and tomato!']

#######################
# CRITICAL STEP!!! NEED TO CHANGE DIMENSION From 128 to 512
#######################

tok = bert_preprocess.tokenize(tf.constant(text_test))
text_preprocessed = bert_preprocess.bert_pack_inputs([tok, tok], 
                                                     tf.constant(max_length))
#text_preprocessed = bert_preprocess_model(text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')

# The size is 128.
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :30]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :30]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :30]}')

Keys       : ['input_mask', 'input_word_ids', 'input_type_ids']
Shape      : (1, 512)
Word Ids   : [  101  2023  3259  2003  2055  3269  1010  2066 21154  1010  5785  1010
  1998 20856   999   102  2023  3259  2003  2055  3269  1010  2066 21154
  1010  5785  1010  1998 20856   999]
Input Mask : [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


### _Load and test BERT model_

In [30]:
bert_model = hub.KerasLayer(tfhub_encoder)
type(bert_model)

tensorflow_hub.keras_layer.KerasLayer

In [31]:
bert_results = bert_model(text_preprocessed)
print(f'Loaded BERT: {tfhub_encoder}')

Loaded BERT: https://tfhub.dev/google/experts/bert/pubmed/2


2022-06-18 17:56:46.339952: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.


In [33]:
# pooled_output: embedding of the document
# 768: size of the embedding vector
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :7]}')

Pooled Outputs Shape:(1, 768)
Pooled Outputs Values:[ 0.20681225 -0.6118757   0.01431939 -0.94472456 -0.35345343  0.38756847
 -0.90354735]


In [34]:
# sequence_output: embeddings of each token
# 512: number of tokens of text_preprocessed
# 768: size of the embedding vector
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :2]}')

Sequence Outputs Shape:(1, 512, 768)
Sequence Outputs Values:[[ 0.20988233 -0.7119897   0.0143176  ...  0.06562965  0.8226177
   0.42425305]
 [-0.8292844  -1.1429014   0.23615694 ... -0.15752412 -0.5507309
  -2.2226732 ]]


In [35]:
# encoder_outputs: intermediate activation of a transformer block
# Q: Assuming activation is the output value of the activation function.
# 12: number of transformer blocks
print(f'Encoder Outputs length:{len(bert_results["encoder_outputs"])}')

# Saem as sequence output values
print(f'Sequence Outputs shape:{bert_results["encoder_outputs"][0].shape}')
print(f'Sequence Outputs shape:{bert_results["encoder_outputs"][0][0, :2]}')

Encoder Outputs length:12
Sequence Outputs shape:(1, 512, 768)
Sequence Outputs shape:[[ 0.04168119 -0.05507107  0.05931066 ... -0.07429774  0.0130739
   0.04005793]
 [-0.7037214  -0.5592795   0.16818535 ... -0.1520418   0.34704968
  -0.40375572]]


## __Build classification model__

The challenge is how to use bert_pack_inputs as a prprocessing layer.
- [This hub page](https://www.tensorflow.org/hub/common_saved_model_apis/text) has some info.
- Remove the input layer and make it something

In [36]:
def build_classifier_model():
  # Input layer
  #text_input        = tf.keras.layers.Input(shape=(), dtype=tf.string, 
  #                                          name='txt')
  input_segments = [
      tf.keras.layers.Input(shape=(), dtype=tf.string, name=ft)
      for ft in sentence_features]

  # Will this work??  
  tokenizer_layer   = hub.KerasLayer(bert_preprocess.tokenize, name='tokenizer')
  tokenizer_outputs = tokenizer_layer(text_input)

  # Processing layer: This has the key change to allow longer texts.
  preproc_layer     = hub.KerasLayer(bert_preprocess.bert_pack_inputs, 
                                     arguments=dict(seq_lenght=max_length),
                                     name='preprocessing')
  preproc_outputs   = preproc_layer(tokenizer_outputs)

  # Initialize encoder
  encoder           = hub.KerasLayer(tfhub_encoder, trainable=True, 
                                   name='BERT_encoder')
  encoder_outputs   = encoder(preproc_outputs)
  # Q: Wonder if this is the dense layer mentioned above.
  print(type(encoder_outputs))

  # Get just the embeddings for each doc (ignore token level info)
  net            = encoder_outputs['pooled_output']

  # Dropout layer
  net            = tf.keras.layers.Dropout(0.1)(net)

  # output layer: single node, Q: Why??
  net            = tf.keras.layers.Dense(2, activation='softmax', 
                                                        name='classifier')(net)
  return tf.keras.Model(text_input, net)

Let's check that the model runs with the output of the preprocessing model.

In [37]:
classifier_model = build_classifier_model()

# tf.constant: create a Tensor from tensor like objects
tensor_test      = tf.constant(text_test)
bert_raw_result  = classifier_model(tensor_test)

print("Raw result   :", bert_raw_result)
print("Apply sigmoid:", tf.sigmoid(bert_raw_result))

ValueError: Exception encountered when calling layer "preprocessing" (type KerasLayer).

in user code:

    File "/home/shinhan/miniconda3/envs/tf/lib/python3.10/site-packages/tensorflow_hub/keras_layer.py", line 229, in call  *
        result = f()

    ValueError: Could not find matching concrete function to call loaded from the SavedModel. Got:
      Positional arguments (2 total):
        * tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("inputs:0", shape=(None,), dtype=int32), row_splits=Tensor("inputs_2:0", shape=(None,), dtype=int64)), row_splits=Tensor("inputs_1:0", shape=(None,), dtype=int64))
        * 128
      Keyword arguments: {'seq_lenght': 512}
    
     Expected these arguments to match one of the following 4 option(s):
    
    Option 1:
      Positional arguments (2 total):
        * [RaggedTensorSpec(TensorShape([None, None]), tf.int32, 1, tf.int64)]
        * TensorSpec(shape=(), dtype=tf.int32, name='seq_length')
      Keyword arguments: {}
    
    Option 2:
      Positional arguments (2 total):
        * [RaggedTensorSpec(TensorShape([None, None]), tf.int32, 1, tf.int64), RaggedTensorSpec(TensorShape([None, None]), tf.int32, 1, tf.int64)]
        * TensorSpec(shape=(), dtype=tf.int32, name='seq_length')
      Keyword arguments: {}
    
    Option 3:
      Positional arguments (2 total):
        * [RaggedTensorSpec(TensorShape([None, None, None]), tf.int32, 2, tf.int64), RaggedTensorSpec(TensorShape([None, None, None]), tf.int32, 2, tf.int64)]
        * TensorSpec(shape=(), dtype=tf.int32, name='seq_length')
      Keyword arguments: {}
    
    Option 4:
      Positional arguments (2 total):
        * [RaggedTensorSpec(TensorShape([None, None, None]), tf.int32, 2, tf.int64)]
        * TensorSpec(shape=(), dtype=tf.int32, name='seq_length')
      Keyword arguments: {}


Call arguments received:
  • inputs=tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("Placeholder:0", shape=(None,), dtype=int32), row_splits=Tensor("Placeholder_1:0", shape=(None,), dtype=int64)), row_splits=Tensor("Placeholder_2:0", shape=(None,), dtype=int64))
  • training=None

In [None]:
classifier_model.summary()

## Model training

You now have all the pieces to train a model, including the preprocessing module, BERT encoder, data, and classifier.

### Loss function

Since this is a binary classification problem and the model outputs a probability (a single-unit layer), you'll use `losses.BinaryCrossentropy` loss function.


In [None]:
loss    = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()

### Optimizer

For fine-tuning, let's use the same optimizer that BERT was originally trained with: the "Adaptive Moments" (Adam). This optimizer minimizes the prediction loss and does regularization by weight decay (not using moments), which is also known as [AdamW](https://arxiv.org/abs/1711.05101).

For the learning rate (`init_lr`), you will use the same schedule as BERT pre-training: linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (`num_warmup_steps`). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).

In [None]:
# train_ds: PrefetchDataset of the training data
# tf.data.experimental.cardinality: return the cadinality of dataset
# cardinality: number of elements in a set
# Here should be the number of batches
cardinality = tf.data.experimental.cardinality(train_ds)
type(cardinality), cardinality.numpy()

In [None]:
epochs = 5
steps_per_epoch  = cardinality.numpy()
num_train_steps  = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

# Initial learning rate
init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

### Loading the BERT model and training

Using the `classifier_model` you created earlier, you can compile the model with the loss, metric and optimizer.

In [None]:
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

Note: training time will vary depending on the complexity of the BERT model you have selected.

In [None]:
print(f'Training model with {tfhub_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

### Evaluate the model

Let's see how the model performs. Two values will be returned. Loss (a number which represents the error, lower values are better), and accuracy.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

### Plot the accuracy and loss over time

Based on the `History` object returned by `model.fit()`. You can plot the training and validation loss for comparison, as well as the training and validation accuracy:

In [None]:
history_dict = history.history
print(history_dict.keys())

acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))
fig.tight_layout()

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

In this plot, the red lines represent the training loss and accuracy, and the blue lines are the validation loss and accuracy.

## Export for inference

Now you just save your fine-tuned model for later use.

In [None]:
dataset_name = 'imdb'
saved_model_path = './{}_bert'.format(dataset_name.replace('/', '_'))

classifier_model.save(saved_model_path, include_optimizer=False)

Let's reload the model, so you can try it side by side with the model that is still in memory.

In [None]:
reloaded_model = tf.saved_model.load(saved_model_path)

Here you can test your model on any sentence you want, just add to the examples variable below.

In [None]:
def print_my_examples(inputs, results):
  result_for_printing = \
    [f'input: {inputs[i]:<30} : score: {results[i][0]:.6f}'
                         for i in range(len(inputs))]
  print(*result_for_printing, sep='\n')
  print()


examples = [
    'this is such an amazing movie!',  # this is the same sentence tried earlier
    'The movie was great!',
    'The movie was meh.',
    'The movie was okish.',
    'The movie was terrible...'
]

reloaded_results = tf.sigmoid(reloaded_model(tf.constant(examples)))
original_results = tf.sigmoid(classifier_model(tf.constant(examples)))

print('Results from the saved model:')
print_my_examples(examples, reloaded_results)
print('Results from the model in memory:')
print_my_examples(examples, original_results)

If you want to use your model on [TF Serving](https://www.tensorflow.org/tfx/guide/serving), remember that it will call your SavedModel through one of its named signatures. In Python, you can test them as follows:

In [None]:
serving_results = reloaded_model \
            .signatures['serving_default'](tf.constant(examples))

serving_results = tf.sigmoid(serving_results['classifier'])

print_my_examples(examples, serving_results)

## Next steps

As a next step, you can try [Solve GLUE tasks using BERT on a TPU tutorial](https://www.tensorflow.org/text/tutorials/bert_glue), which runs on a TPU and shows you how to work with multiple inputs.