# Finetuning and Deploying Hugging Face Models on Vertex AI

For this tutorial we recommend using 1 GPU to speed up processing. This notebook was run on the machine type n1-highcpu-8 (8 vCPUs, 7.199 GB RAM) with TensorFlow. To set up a notebook that uses GPUs, see [Spinning up a Vertex AI Notebook](../../../docs/vertexai.md).

This tutorial focuses on Hugging Face, a repository where users share and download machine learning models, datasets, and demos. We will load a model and a dataset from Hugging Face, then train and test the model before deploying it on Vertex AI. The model we will be deploying is Flan T5 and the dataset is [ccdv/pubmed-summarization](https://HuggingFace.co/datasets/ccdv/pubmed-summarization). The steps show how to fine-tune a model locally and how to launch a custom training job on Vertex AI Training. They are based on the Keras NLP tutorial for [abstractive summarization](https://keras.io/examples/nlp/t5_hf_summarization/).

You may be wondering why we are training a pretrained model. We are fine-tuning the pretrained model to improve its performance on a particular application, in our case summarizing scientific documents. This step is no longer always necessary, because newer methods such as zero-shot learning can also improve model performance; we will go over them in our next tutorial.

## Install Tools

Hugging Face **transformers** is an open-source framework that provides APIs and tools to download pretrained models, set hyperparameters, tokenize datasets, and further tune models to suit your needs. Here we install the transformers package and **datasets**, which gives us access to Hugging Face datasets, along with the rouge_score, evaluate, and keras_nlp packages used for evaluation.

In [1]:
!pip install "transformers" "datasets" "rouge_score" "evaluate" "keras_nlp"


## Download your dataset from Hugging Face

We will download the Hugging Face dataset 'ccdv/pubmed-summarization', which contains full articles and their abstracts and will help train our model to summarize scientific articles. When we load the dataset we split it into train, test, and validation datasets. Since these are large datasets, we will use only 5% of each split to help our process run faster.

In [211]:
from datasets import load_dataset

# load dataset
train, test, validation = load_dataset("ccdv/pubmed-summarization", split=["train[:5%]", "test[:5%]", "validation[:5%]" ])

Let's list the features of one of our datasets to determine what we will need to tokenize in a later step. This dataset's features are 'article' and 'abstract'.

In [114]:
print(train)

Dataset({
    features: ['article', 'abstract'],
    num_rows: 5996
})


## Finetuning our Model Locally

Now that we have our datasets we can load our model, which will be the small version of Flan T5.


**Flan T5** is a text-to-text generation model that builds on the original T5 model and can be run on both CPUs and GPUs. **Text-to-text** means that every task is framed as giving the model text as input and having a neural network generate new text as output. T5 models can be fine-tuned for a wide range of NLP tasks that we have seen and heard of before: text classification, summarization, translation, and question answering. Text-to-text is not to be confused with text2text generation, which is the task name Hugging Face uses for sequence-to-sequence models such as T5 rather than a separate or earlier model.

Because it is a seq2seq model we will use the transformers class **TFAutoModelForSeq2SeqLM** (specifically for TensorFlow models) to find and load our pretrained model architecture. Then we will create an **AutoTokenizer** to preprocess the text of our inputs (the test, train, and validation datasets) into arrays of numbers.

In [185]:
#model name
CHECKPOINT = "google/flan-t5-small"

In [184]:
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)


Now that we have loaded our model and its tokenizer, we can implement a tokenization function to start processing our datasets.
Since we are using a T5 model, we have to prefix the inputs with "summarize:" so the model knows which task to perform. We create a preprocess function that prepends the prefix to each row in the "article" column of our dataset; these rows become the inputs. The inputs are then tokenized and truncated to a set max length.

A similar process is applied to the "abstract" column of our dataset, except that we do not add the prefix and we store the result as **labels**.

**What is Truncating?**

The sequences in a group of inputs, or batch, will usually have different lengths, which makes them hard to convert to fixed-size tensors, and some may be longer than we want the model to process. **Truncation** addresses the long sequences by cutting tokens from any sequence that exceeds a maximum length, which we have set to **1024** for our inputs and **128** for our labels.


In [212]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=
            examples["abstract"], max_length=128, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
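
To make the effect of `truncation=True` concrete, here is a minimal sketch in plain Python. The helper `truncate` and the token IDs are made up for illustration; the real tokenizer additionally accounts for special tokens such as the end-of-sequence marker, which this sketch ignores.

```python
# Toy illustration of truncation; the token IDs below are invented.
def truncate(token_ids, max_length):
    """Keep at most max_length tokens, dropping everything after that."""
    return token_ids[:max_length]

short_ids = [21603, 10, 8, 1712, 1]   # 5 tokens, already under the limit
long_ids = list(range(2000))          # 2000 tokens, over the limit

print(len(truncate(short_ids, 1024)))  # 5
print(len(truncate(long_ids, 1024)))   # 1024
```

Sequences shorter than the limit are left untouched, so a batch can still contain sequences of different lengths after truncation.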

Now that we have our tokenization function, the next step is to use the **map** function to apply **preprocess_function** to our loaded datasets.

In [210]:
tokenized_train = train.map(preprocess_function, batched=True)

# The test and validation sets are needed by prepare_tf_dataset() in a later cell
tokenized_test = test.map(preprocess_function, batched=True)

tokenized_validation = validation.map(preprocess_function, batched=True)

Let's look at the structure of one of our new tokenized datasets. You should see 3 new features (**'input_ids', 'attention_mask', 'labels'**), making 5 features total:

- **input_ids:** Each text is broken up into tokens (which can be words or subwords), and each token is converted to an integer ID.
- **attention_mask:** Marks which tokens the model should attend to (1) and which it should ignore (0). Masking is needed when sequences of different lengths are padded so that they fit in the same tensor.
- **labels:** The token IDs of the tokenized abstract column.

In [11]:
print(tokenized_train)

Dataset({
    features: ['article', 'abstract', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 5996
})


A data collator is an object that dynamically pads the inputs and the labels in our batches. The reverse of truncating, **padding** adds a special padding token so that shorter sequences have the same length as the longest sequence in the batch, which is at most the max length we set in our preprocess_function.

In [12]:
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=CHECKPOINT, return_tensors="tf")
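
As a rough picture of what dynamic padding does to a batch, the sketch below pads toy sequences by hand. The helpers `pad_batch` and `attention_mask` and the token IDs are hypothetical; the padding values reflect T5's padding token ID (0) and the collator's default label padding value (-100), but this is not the collator's actual implementation.

```python
# Toy sketch of dynamic padding; helper names and token IDs are invented.
PAD_ID = 0            # padding token ID used by T5
LABEL_PAD_ID = -100   # default value the collator uses to pad labels

def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

def attention_mask(sequences):
    """1 marks a real token, 0 marks a position that pad_batch would fill."""
    longest = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]

batch_inputs = [[21603, 10, 8, 1], [21603, 10, 1]]
batch_labels = [[37, 1712, 1], [37, 1]]

padded_inputs = pad_batch(batch_inputs, PAD_ID)
mask = attention_mask(batch_inputs)
padded_labels = pad_batch(batch_labels, LABEL_PAD_ID)

print(padded_inputs)  # [[21603, 10, 8, 1], [21603, 10, 1, 0]]
print(mask)           # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(padded_labels)  # [[37, 1712, 1], [37, 1, -100]]
```

The negative label padding is why the metric function later in this tutorial replaces label values below 0 with the tokenizer's padding token before decoding.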

The last step is to convert our data to a format suitable for TensorFlow using the function **'prepare_tf_dataset()'**, which automatically inspects your model and keeps only the features that are necessary. As you can see in the output further down, the inputs keep only 2 of our features, **input_ids and attention_mask**, and the labels are returned separately as the training target.

In [13]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_train,
    shuffle=True,
    batch_size=10,
    collate_fn=data_collator,
    
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_test,
    shuffle=False,
    batch_size=10,
    collate_fn=data_collator,
    
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_validation,
    shuffle=False,
    batch_size=10,
    collate_fn=data_collator,
    
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [14]:
print (tf_train_set)

<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(10, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(10, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(10, None), dtype=tf.int64, name=None))>


**Learning rate** controls how much the model changes in response to the estimated error each time the model weights are updated. Too small a learning rate can result in a very slow training process that may eventually get stuck, whereas too large a value may result in an unstable training process. Setting the **weight decay** helps to avoid overfitting, keeps the weights small, and can help avoid exploding gradients.

In [15]:
from transformers import AdamWeightDecay
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)
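
To build some intuition for these two settings, the sketch below runs plain gradient descent on the toy function f(w) = w². It is an illustration under simplifying assumptions, not the Adam update that AdamWeightDecay performs; the function `gradient_descent` and the chosen values are hypothetical.

```python
def gradient_descent(lr, weight_decay=0.0, steps=20, w=1.0):
    """Minimize f(w) = w**2 with gradient descent plus decoupled weight decay."""
    for _ in range(steps):
        grad = 2.0 * w                              # derivative of w**2
        w = w - lr * grad - lr * weight_decay * w   # decay shrinks w toward 0
    return w

slow = gradient_descent(lr=0.01)       # tiny steps: still far from the minimum at 0
good = gradient_descent(lr=0.4)        # reasonable steps: very close to 0
unstable = gradient_descent(lr=1.1)    # oversized steps: overshoots and grows
decayed = gradient_descent(lr=0.01, weight_decay=0.5)  # decay pulls w down faster

print(abs(slow), abs(good), abs(unstable), abs(decayed))
```

With the same number of steps, the small learning rate barely moves, the moderate one converges, and the large one diverges, which mirrors the trade-off described above.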

The function metric_fn will help us calculate the **ROUGE** score between the ground truth and the predictions while training. ROUGE stands for **Recall-Oriented Understudy for Gisting Evaluation**. This metric compares a reference sentence with what our model produces to see if there is overlap; if there is, it calculates the precision and recall using the overlap.

As an example, say our model produced a sentence like so:

**'the cat was found under the bed'**

but the reference sentence, normally written by a human, is:

**'the cat was under the bed'**
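
For this pair, the longest common subsequence is the full reference (6 words), while the model's sentence has 7 words. The sketch below works through the ROUGE-L arithmetic by hand, assuming simple whitespace tokenization; the helper names `lcs_length` and `rouge_l_scores` are hypothetical, and the library metric used in the next cell applies its own tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_scores(reference, prediction):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    ref, pred = reference.split(), prediction.split()
    lcs = lcs_length(ref, pred)
    precision = lcs / len(pred)   # share of predicted words that match
    recall = lcs / len(ref)       # share of reference words that were recovered
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge_l_scores(
    "the cat was under the bed", "the cat was found under the bed"
)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.857 1.0 0.923
```

Recall is perfect because every reference word appears in order in the prediction, while precision is 6/7 because of the extra word "found".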

In [16]:
import keras_nlp

rouge_l = keras_nlp.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result

Using TensorFlow backend


We will use the validation dataset for calculating our ROUGE score. To calculate the ROUGE score while training is running, it's best to set up a **callback**. A callback is an object that can perform actions at various stages of training, such as writing logs after every batch of training to monitor your metrics, periodically saving your model to disk, and stopping early if need be. Here we are using the Keras callback system.

In [17]:
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=tf_validation_set, predict_with_generate=True, use_xla_generation=True)

Before we start to train our model, the last step is to set how many complete passes to make through the training dataset. Each pass is called an **epoch**, and we will set ours to 3. Now we can start to train our model using the function **'fit'** and save our artifacts to a directory. The artifact that holds our model weights will be a file named **tf_model.h5**.

In [18]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[metric_callback])

model.save_pretrained('saved_model')

Epoch 1/3


Epoch 2/3
Epoch 3/3


## Testing the Model

Here we will use a sample text that we want our model to summarize.

In [98]:
text = "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a \
highly transmissible and pathogenic coronavirus that emerged in late 2019 and has \
caused a pandemic of acute respiratory disease, named ‘coronavirus disease 2019’ (COVID-19), \
which threatens human health and public safety. In this Review, we describe the basic virology of \
SARS-CoV-2, including genomic characteristics and receptor use, highlighting its key difference \
from previously known coronaviruses. We summarize current knowledge of clinical, epidemiological and \
pathological features of COVID-19, as well as recent progress in animal models and antiviral treatment \
approaches for SARS-CoV-2 infection. We also discuss the potential wildlife hosts and zoonotic origin \
of this emerging virus in detail."

To generate a prediction, the code below tokenizes the text to produce the inputs, then uses **generate()** to generate a sequence of token IDs from our model. We then decode the output to translate the token IDs back into text.

Below you will see that we have provided a paragraph about SARS-CoV-2 as our input. We also specify some parameters that control generation so that we get a concise summary of what our text is about.

- **max_length:** Maximum number of tokens to generate.
- **num_return_sequences:** Number of different outputs to generate. For our example we want one sentence or sequence.
- **temperature:** Controls randomness. Higher values increase diversity, giving more varied responses, while lower values make the output more predictable. Must be a positive number.
- **top_p (nucleus):** The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Must be a number from 0 to 1.
- **top_k:** Sample from the k most likely next tokens at each step. Lower values of k restrict sampling to the highest-probability tokens, which removes more of the unlikely, less coherent words.
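
The interaction of temperature, top_k, and top_p can be sketched on a toy next-token distribution. This is a simplified illustration, not the library's implementation; the function `filter_distribution`, the candidate words, and their scores are all made up.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized token probabilities after temperature, top-k, top-p."""
    # Temperature rescales the scores: values below 1 sharpen, above 1 flatten.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k keeps only the k most likely tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(weight for _, weight in ranked)
    probs = [(tok, weight / total) for tok, weight in ranked]

    # Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

logits = {"virus": 2.0, "disease": 1.0, "pandemic": 0.5, "banana": -1.0}
candidates = filter_distribution(logits, temperature=0.6, top_k=3, top_p=0.95)
print(candidates)
```

With these toy scores, top_k=3 removes the least likely word, and lowering top_p to 0.9 would shrink the candidate set further to the two most likely words. The next token is then sampled from whatever remains.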

In [101]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
inputs = tokenizer.encode(text, return_tensors="tf")

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("saved_model")

outputs = model.generate(inputs, 
                         max_length=1000,
                         num_return_sequences = 1,
                         do_sample=True, 
                         temperature = 0.6,
                         top_k = 50, 
                         top_p = 0.95,)

tokenizer.decode(outputs[0], skip_special_tokens=True)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at saved_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


'We describe the basic virology of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its role in preventing the pandemic of acute respiratory disease, named ‘coronavirus disease 2019’ (COVID-19), which threatens human health and public safety.'

### Optional: Summarizing PDF Files

The process of summarizing scientific PDF files is largely the same, except that we first need to extract the text from the PDF. To do so, let's download a PDF file from PubMed.

In [94]:
! wget --user-agent="Chrome" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7784226/pdf/12248_2020_Article_532.pdf

--2023-11-02 20:07:00--  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7784226/pdf/12248_2020_Article_532.pdf
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5757370 (5.5M) [application/pdf]
Saving to: ‘12248_2020_Article_532.pdf’


2023-11-02 20:07:01 (7.25 MB/s) - ‘12248_2020_Article_532.pdf’ saved [5757370/5757370]



We'll install tooling that helps us extract only the text from our PDF file.

In [30]:
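# PyMuPDF provides the `fitz` module imported below; the separate PyPI package named "fitz" is unrelated and can conflict with it
!pip install "PyMuPDF"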


Now we can make a function **extract_text_from_pdf** to extract the text from the pdf and save it as a variable.

In [95]:
import fitz
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    return text

text_pdf=extract_text_from_pdf('12248_2020_Article_532.pdf')

Finally, we'll follow the same steps as before to encode our inputs, pass them to our model, and then decode our output. Notice that this time we set max_length and truncation for the input, since the full article is much longer than our sample paragraph.

In [97]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
inputs = tokenizer.encode(text_pdf, max_length=1000, truncation=True, return_tensors="tf")

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("saved_model")

outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

tokenizer.decode(outputs[0], skip_special_tokens=True)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at saved_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



## Finetuning our Model via Vertex AI Training API

### Setting up our Datasets for Training 

Although we have our datasets saved locally, in order to use the Vertex AI Training API we will need to store our datasets in a Cloud Storage bucket.

In [233]:
from datasets import load_dataset

# load dataset
train, test, validation = load_dataset("ccdv/pubmed-summarization", split=["train[:5%]", "test[:5%]", "validation[:5%]" ])

In [1]:
#load in the storage package and name our bucket
from google.cloud import storage
BUCKET='flan-t5-model-resources'
client = storage.Client()

In [105]:
#Create bucket
bucket = client.bucket(BUCKET)
bucket.create()

Conflict: 409 POST https://storage.googleapis.com/storage/v1/b?project=cit-oconnellka-9999&prettyPrint=false: Your previous request to create the named bucket succeeded and you already own it.

Convert our datasets to CSV and upload them to our bucket in one step!

In [60]:
from io import BytesIO

#convert train dataset to csv and push to GCS bucket
csv_buffer = BytesIO()
train.to_csv(csv_buffer)
client = storage.Client()
bucket = client.get_bucket(BUCKET)
# rewind=True moves the buffer back to the start before uploading
bucket.blob('train.csv').upload_from_file(csv_buffer, rewind=True, content_type='text/csv')

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]
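
One detail worth knowing when uploading from an in-memory buffer: after writing, the buffer's position sits at the end, so reading from it returns nothing until it is rewound. The Cloud Storage client's `upload_from_file` exposes a `rewind` parameter for this purpose. The standard-library sketch below shows the underlying behavior; the sample bytes are made up.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"article,abstract\n")  # writing leaves the position at the end

at_end = buffer.read()               # nothing is left to read from here
buffer.seek(0)                       # rewind to the start of the buffer
from_start = buffer.read()           # now the full content is returned

print(at_end)      # b''
print(from_start)  # b'article,abstract\n'
```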

In [61]:
#convert test dataset to csv and push to GCS bucket
csv_buffer = BytesIO()
test.to_csv(csv_buffer)
client = storage.Client()
bucket = client.get_bucket(BUCKET)
# rewind=True moves the buffer back to the start before uploading
bucket.blob('test.csv').upload_from_file(csv_buffer, rewind=True, content_type='text/csv')

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

In [62]:
#convert validation dataset to csv and push to GCS bucket
csv_buffer = BytesIO()
validation.to_csv(csv_buffer)
client = storage.Client()
bucket = client.get_bucket(BUCKET)
# rewind=True moves the buffer back to the start before uploading
bucket.blob('validation.csv').upload_from_file(csv_buffer, rewind=True, content_type='text/csv')

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Here we save the locations of our datasets so they can be used when we run the training of our model.

In [257]:
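# location of the train dataset in the GCS bucket
training_input_path = f'gs://{BUCKET}/train.csv'

# location of the test dataset in the GCS bucket
test_input_path = f'gs://{BUCKET}/test.csv'

# location of the validation dataset in the GCS bucket
validation_input_path = f'gs://{BUCKET}/validation.csv'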

### Training our Model via Vertex AI Training API

To train our model with the Vertex AI Training API you must first create a custom job. This is done by creating a package directory (autopkg) that holds your requirements.txt and task.py files in a specific structure like so:

```
autopkg-summarizer/
    + requirements.txt
    + trainer/
        + task.py
```

In [103]:
#Creates the following directories and files
!mkdir autopkg-summarizer
!touch autopkg-summarizer/requirements.txt
!mkdir autopkg-summarizer/trainer
!touch autopkg-summarizer/trainer/task.py

Fill in your requirements.txt file by adding the packages below:
```
nltk
transformers
keras_nlp
datasets
rouge_score
```

To create our training script we will be adding all the steps that we ran from the 'Finetuning our Model Locally' section of this tutorial to a file named task.py:

```
import nltk
import argparse
from datasets import load_dataset
#import evaluate
import numpy as np
from transformers import create_optimizer, AdamWeightDecay, TFAutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, set_seed
import tensorflow as tf
from tensorflow import keras
from transformers.keras_callbacks import KerasMetricCallback
import keras_nlp

def get_args():
    '''Parses args.'''
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '--model_name_or_path',
        required=True,
        type=str,
        help='name of model or path to load into tokenizer and class')
    parser.add_argument(
        '--train_file',
        required=True,
        type=str,
        help='train dataset in csv or json format')
    parser.add_argument(
        '--test_file',
        required=True,
        type=str,
        help='test dataset in csv or json format')
    parser.add_argument(
        '--validation_file',
        required=True,
        type=str,
        help='validation dataset in csv or json format used to calculate ROUGE score')
    parser.add_argument(
        '--text_column',
        required=True,
        type=str,
        help='The name of the column in the datasets containing the full texts (for summarization)')
    parser.add_argument(
        '--summary_column',
        required=True,
        type=str,
        help='The name of the column in the datasets containing the abstracts or summary of the full text')
    parser.add_argument(
        '--num_train_epochs',
        required=False,
        type=int,
        default=3,
        help='number of complete passes through the training dataset')
    parser.add_argument(
        '--source_prefix',
        required=False,
        type=str,
        default='',
        help='A prefix to add before every source text (needed for T5 models)')
    parser.add_argument(
        '--inputs_max_length',
        required=False,
        type=int,
        default=1024,
        help='max token length for model inputs')
    parser.add_argument(
        '--labels_max_length',
        required=False,
        type=int,
        default=128,
        help='max token length for model labels or targets')
    parser.add_argument(
        '--batch_size',
        required=False,
        type=int,
        default=10,
        help='number of examples per batch for training and evaluation')
    parser.add_argument(
        '--output_dir',
        required=True,
        type=str,
        help='bucket to store saved model, include gs://')
    
    args = parser.parse_args()
    return args

def main():
    
    args = get_args() 
 
    checkpoint = args.model_name_or_path
    
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    text = args.text_column
    summary = args.summary_column
    inputs_max_length = args.inputs_max_length
    labels_max_length = args.labels_max_length
    prefix = args.source_prefix 
    
    model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint) 
    
    data_files = {'train':args.train_file, 'test':args.test_file, 'validation':args.validation_file}
    extension = args.train_file.split(".")[-1]
    
    raw_datasets = load_dataset(
        extension,
        data_files=data_files)
    
    raw_datasets = raw_datasets.filter(lambda x: x[text] is not None) 
    
    train = raw_datasets["train"]
    test = raw_datasets["test"]
    validation = raw_datasets["validation"]
         
    def preprocess_function(examples):
        
        inputs = [prefix + doc for doc in examples[text]]
        model_inputs = tokenizer(inputs, max_length=inputs_max_length, truncation=True)

    #    labels = tokenizer(text_target=examples["abstract"], max_length=128, truncation=True)

        labels = tokenizer(text_target=
                examples[summary], max_length=labels_max_length, truncation=True
            )

        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    
    tokenized_train = train.map(preprocess_function, batched=True)
    tokenized_test = test.map(preprocess_function, batched=True)
    tokenized_validation = validation.map(preprocess_function, batched=True)
    
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")

    optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
    model.compile(optimizer=optimizer)

    tf_train_set = model.prepare_tf_dataset(
        tokenized_train,
        shuffle=True,
        batch_size=args.batch_size,
        collate_fn=data_collator
    )

    tf_test_set = model.prepare_tf_dataset(
        tokenized_test,
        shuffle=False,
        batch_size=args.batch_size,
        collate_fn=data_collator
    )
    
    tf_validation_set = model.prepare_tf_dataset(
        tokenized_validation,
        shuffle=False,
        batch_size=args.batch_size,
        collate_fn=data_collator
    )   
    
    def metric_fn(eval_predictions):
        predictions, labels = eval_predictions
        decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        for label in labels:
            label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        result = rouge_l(decoded_labels, decoded_predictions)
        # We will print only the F1 score, you can use other aggregation metrics as well
        result = {"RougeL": result["f1_score"]}

        return result
    
    rouge_l = keras_nlp.metrics.RougeL()

    metric_callback = KerasMetricCallback(
        metric_fn, eval_dataset=tf_validation_set, predict_with_generate=True, use_xla_generation=True)


    model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=args.num_train_epochs, callbacks=[metric_callback])
    model.save(f'{args.output_dir}/saved_model_artifacts_tf')
    model.save_pretrained(f'{args.output_dir}/saved_model_hf_tf')


if __name__ == "__main__":
    main()
```

### Hyperparameters (for the training script and custom AI job)

The first step in training our model, other than setting up our datasets, is to set our **hyperparameters**. The hyperparameters depend on your training script; for this one we need to specify our model, the locations of our train and test files, and so on.

The batch_size, inputs_max_length, num_train_epochs, and labels_max_length arguments already have default settings, the same as the ones we used in the first section of this tutorial!
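
As a small illustration of how those defaults behave, the sketch below uses a trimmed, hypothetical copy of the parser from task.py with only three of its arguments. Flags that are omitted fall back to their defaults, while flags that are passed override them.

```python
import argparse

# Trimmed, hypothetical copy of the parser in task.py: only three arguments shown.
parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", required=True, type=str)
parser.add_argument("--num_train_epochs", required=False, type=int, default=3)
parser.add_argument("--batch_size", required=False, type=int, default=10)

# --num_train_epochs is omitted, so its default applies; --batch_size is overridden.
args = parser.parse_args(
    ["--model_name_or_path", "google/flan-t5-small", "--batch_size", "4"]
)
print(args.num_train_epochs, args.batch_size)  # 3 4
```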

In [5]:
#to view options and defaults you can run the command below
!python autopkg-summarizer/trainer/task.py --help


In [258]:
#Parameters for task.py script
CHECKPOINT = "google/flan-t5-small"
train_file=training_input_path
test_file=test_input_path
validation_file=validation_input_path
text_column="article"
summary_column="abstract"
source_prefix="summarize: " 
output_dir= f'gs://{BUCKET}'

For the custom training job we need to set the machine type, the accelerator type for GPUs, and the prebuilt Docker image that will run our training. See here for more available containers: https://cloud.google.com/vertex-ai/docs/training/pre-built-containers.

In [20]:
#Parameters for custom AI job
display_name='flan-t5-training-tf'
BASE_GPU_IMAGE_tf='us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest'
machine_type='n1-standard-4'
accelerator_type='NVIDIA_TESLA_V100'

### Submit Custom AI Training Job

Finally, we can submit our training as a custom job! It will first build and deploy the container that we specified and then submit our model for training. This custom job can take 15 to 20 minutes using our sample datasets.

In [262]:
!gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=$display_name \
--args=--model_name_or_path=$CHECKPOINT \
--args=--train_file=$train_file \
--args=--test_file=$test_file \
--args=--validation_file=$validation_file \
--args=--text_column=$text_column \
--args=--summary_column=$summary_column \
--args=--output_dir=gs://$BUCKET \
--args=--source_prefix=$source_prefix \
--worker-pool-spec=machine-type=$machine_type,replica-count=1,accelerator-type=$accelerator_type,executor-image-uri=$BASE_GPU_IMAGE_tf,local-package-path=autopkg-summarizer,python-module=trainer.task

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Using endpoint [https://us-central1-aiplatform.googleapis.com/]
  self.stdin = io.open(p2cwrite, 'wb', bufsize)
  self.stdout = io.open(c2pread, 'rb', bufsize)
Sending build context to Docker daemon  18.99kB
Step 1/10 : FROM us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest
 ---> bd2bbbab7d71
Step 2/10 : RUN mkdir -m 777 -p /usr/app /home
 ---> Running in 358dbf3724e8
Removing intermediate container 358dbf3724e8
 ---> edf7be7209d7
Step 3/10 : WORKDIR /usr/app
 ---> Running in a23be90e59c5
Removing intermediate container a23be90e59c5
 ---> c35f2baa964c
Step 4/10 : ENV HOME=/home
 ---> Running in 0137537b093b
Removing intermediate container 0137537b093b
 ---> 64af9b387e54
Step 5/10 : ENV PYTHONDONTWRITEBYTECODE=1
 ---> Running in cc5806ee80a2
Removing intermediate container cc5806ee80a2
 ---> dfe914f7ecbc
Step 6/10 : RUN rm -rf /var/sitecustomize
 ---> Running in 3e7c5fa57fe2
Removing intermediate container 3e7c5fa57fe2
 ---> fa997bc68c88
Step 7/10 : COPY ["./requirements.txt
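The `gcloud` call above passes every script argument through its own `--args` flag. One detail to watch is that `source_prefix` ends with a space; the job listing later in this tutorial shows `--source_prefix=summarize:` without it, which suggests the unquoted value was trimmed by the shell. The sketch below is one possible way to assemble the same command as a list of tokens so that such values survive intact. The helper name `build_custom_job_command` is hypothetical, and only a subset of the script arguments is shown.

```python
import shlex


def build_custom_job_command(display_name, region, trainer_args, worker_pool_spec):
    """Assemble the gcloud command as a list of argv tokens (hypothetical helper)."""
    cmd = [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
    ]
    # One --args flag per script argument, as in the cell above.
    for name, value in trainer_args.items():
        cmd.append(f"--args=--{name}={value}")
    # The worker pool spec is a single flag made of comma-separated key=value pairs.
    spec = ",".join(f"{key}={value}" for key, value in worker_pool_spec.items())
    cmd.append(f"--worker-pool-spec={spec}")
    return cmd


cmd = build_custom_job_command(
    display_name="flan-t5-training-tf",
    region="us-central1",
    trainer_args={
        "model_name_or_path": "google/flan-t5-small",
        "text_column": "article",
        "summary_column": "abstract",
        "source_prefix": "summarize: ",  # note the trailing space
    },
    worker_pool_spec={
        "machine-type": "n1-standard-4",
        "replica-count": 1,
        "accelerator-type": "NVIDIA_TESLA_V100",
        "executor-image-uri": "us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest",
        "local-package-path": "autopkg-summarizer",
        "python-module": "trainer.task",
    },
)

# shlex.join quotes any token that contains spaces, so it can be pasted into a shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would skip the shell entirely, so no quoting is needed in that case.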

Once training starts, the command-line output should show the `gcloud ai custom-jobs stream-logs` command you can use to follow the progress of your training. You can also monitor the job and view its logs in the console by going to `Vertex AI > Training > Custom Jobs`, selecting your custom job, and clicking "View Logs".
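If you would rather wait for the job from code than watch the logs, a simple polling loop is one option. The sketch below is an assumption-laden illustration, not part of the original notebook: `wait_for_job` is a hypothetical helper that takes any callable returning the job's state name, and the demo feeds it a fake sequence of states so it runs without cloud access.

```python
import time

# Common terminal states of a Vertex AI custom job (names from the JobState enum).
TERMINAL_STATES = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or max_polls is reached."""
    state = get_state()
    polls = 1
    while state not in TERMINAL_STATES and polls < max_polls:
        sleep(poll_seconds)
        state = get_state()
        polls += 1
    return state


# Demo with a fake state sequence and a no-op sleep.
# Assumption: with the SDK installed and authenticated, get_state could instead be
#   lambda: aiplatform.CustomJob.get(resource_name).state.name
fake_states = iter(["JOB_STATE_PENDING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final_state = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

Injecting `get_state` and `sleep` keeps the loop independent of the SDK, so it can be tried out before any job exists.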

## Deploy the Model

### Upload the Model to Vertex AI's Model Registry

Once our model is done training you should see a `saved_model.pb` file in the `saved_model_artifacts_tf` directory of your bucket. We will need this in order to upload our model to the Model Registry. Here we are specifying a prebuilt Docker image that will run our predictions, the name of our model, and the directory in our bucket that holds our **saved_model.pb** file.
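Before uploading, it can help to confirm that the training job actually wrote a complete model directory. A TensorFlow SavedModel directory normally contains a `saved_model.pb` file and a `variables/` subdirectory. The sketch below checks a list of object names for those entries; `missing_savedmodel_files` is a hypothetical helper, and the listing is made-up sample data standing in for whatever a recursive listing of your bucket returns.

```python
# Files we expect inside a SavedModel directory (a minimal set, not exhaustive).
REQUIRED_FILES = ("saved_model.pb", "variables/variables.index")


def missing_savedmodel_files(object_names, model_dir):
    """Return the required SavedModel files that are absent under model_dir."""
    prefix = model_dir.rstrip("/") + "/"
    # Strip the directory prefix so names can be compared with REQUIRED_FILES.
    present = {name[len(prefix):] for name in object_names if name.startswith(prefix)}
    return [required for required in REQUIRED_FILES if required not in present]


# Sample listing (illustrative object names relative to the bucket root).
listing = [
    "saved_model_artifacts_tf/saved_model.pb",
    "saved_model_artifacts_tf/variables/variables.index",
    "saved_model_artifacts_tf/variables/variables.data-00000-of-00001",
    "train.csv",
]
missing = missing_savedmodel_files(listing, "saved_model_artifacts_tf")
print(missing)
```

An empty result means the directory looks complete enough to pass as `artifact_uri`; anything else points to a training job that did not finish saving.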

In [3]:
TF_PREDICTION_IMAGE_URI_RUNTIME = 'us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-12:latest'

In [14]:
from google.cloud import aiplatform

#give your model a name
MODEL_DISPLAY_NAME = "summarizer-tf-runtime"
MODEL_DESCRIPTION = "summarizes scientific texts and pdfs" #optional

#add your project ID and location
project='<PROJECT_ID>'
location='<LOCATION>'

aiplatform.init(project=project, location=location, staging_bucket=BUCKET)


model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    description=MODEL_DESCRIPTION,
    serving_container_image_uri=TF_PREDICTION_IMAGE_URI_RUNTIME,
    serving_container_args=["--allow_precompilation", "--allow_compression", "--use_tfrt"],
    artifact_uri=f'gs://{BUCKET}/saved_model_artifacts_tf', #directory where our artifacts are in our bucket
)

Creating Model
Create Model backing LRO: projects/144763482491/locations/us-central1/models/3296764669607280640/operations/1237604172191236096
Model created. Resource name: projects/144763482491/locations/us-central1/models/3296764669607280640@1
To use this Model in another session:
model = aiplatform.Model('projects/144763482491/locations/us-central1/models/3296764669607280640@1')


### Create an Endpoint and Deploy our Model to It

An **endpoint** is how the user of the model communicates with the model. A single model endpoint responds by returning a single inference from at least one model. It can take 20 minutes or more to establish an endpoint.

In [19]:
ENDPOINT_DISPLAY_NAME = "summarizer-endpoint" 
endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_DISPLAY_NAME)

model_endpoint = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=1,
    traffic_percentage=100,
    deploy_request_timeout=1200,
    sync=True,
)

Creating Endpoint
Create Endpoint backing LRO: projects/144763482491/locations/us-central1/endpoints/5468832298092724224/operations/884634551396073472
Endpoint created. Resource name: projects/144763482491/locations/us-central1/endpoints/5468832298092724224
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/144763482491/locations/us-central1/endpoints/5468832298092724224')
Deploying model to Endpoint : projects/144763482491/locations/us-central1/endpoints/5468832298092724224
Deploy Endpoint model backing LRO: projects/144763482491/locations/us-central1/endpoints/5468832298092724224/operations/5601029261159825408
Endpoint model deployed. Resource name: projects/144763482491/locations/us-central1/endpoints/5468832298092724224


Here we are creating an endpoint and deploying our model to it. We are deploying with 1 GPU, which can take 20 minutes to run; feel free to try other machine types that use more GPUs.

## Delete All Resources

**Warning:** Once you are done, don't forget to delete your endpoint, model, and buckets, and to shut down or delete your Vertex AI notebook to avoid additional charges!

First we will delete our custom job. The command below lists your custom jobs so that you can copy the job ID from the `name` field, which has the form **'projects/<PROJECT_ID>/locations/us-central1/customJobs/<JOB_ID>'**.

In [22]:
!gcloud ai custom-jobs list --project=$project --region=$location

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
---
createTime: '2023-11-03T17:43:15.502041Z'
displayName: flan-t5-training-tf3
endTime: '2023-11-03T18:03:29Z'
jobSpec:
  workerPoolSpecs:
  - containerSpec:
      args:
      - --model_name_or_path=google/flan-t5-small
      - --train_file=gs://flan-t5-model-resources/train.csv
      - --test_file=gs://flan-t5-model-resources/test.csv
      - --validation_file=gs://flan-t5-model-resources/validation.csv
      - --text_column=article
      - --summary_column=abstract
      - --output_dir=gs://flan-t5-model-resources/
      - '--source_prefix=summarize:'
      imageUri: gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.17.39.12.779660
    diskSpec:
      bootDiskSizeGb: 100
      bootDiskType: pd-ssd
    machineSpec:
      acceleratorCount: 1
      acceleratorType: NVIDIA_TESLA_V100
      machineType: n1-standard-4
    replicaCount: '1'
name: projects/144763482491/locations/us-central1/customJo
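The `name` field in the listing follows the pattern `projects/<PROJECT_ID>/locations/<LOCATION>/customJobs/<JOB_ID>`, so the job ID is simply its last path segment. As a small sketch, the hypothetical helper below splits such a resource name into its parts; the numbers in the example are made up.

```python
import re

# Pattern for a custom job resource name as printed in the `name` field.
_JOB_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/customJobs/(?P<job_id>[^/]+)$"
)


def parse_custom_job_name(name):
    """Split a custom job resource name into project, location, and job_id."""
    match = _JOB_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a custom job resource name: {name!r}")
    return match.groupdict()


# Made-up resource name used only for illustration.
parts = parse_custom_job_name("projects/123456/locations/us-central1/customJobs/987654321")
print(parts["job_id"])
```

The resulting `job_id` is the value to assign to `custom_job_id` in the next cell.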

In [26]:
from google.cloud import aiplatform
custom_job_id='<Custom_Job_ID_from_List>'

def delete_custom_job_sample(custom_job_id: str,
    project: str = project,
    location: str = location,
    api_endpoint: str = f'{location}-aiplatform.googleapis.com',
    timeout: int = 300,
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.JobServiceClient(client_options=client_options)
    name = client.custom_job_path(
        project=project, location=location, custom_job=custom_job_id
    )
    response = client.delete_custom_job(name=name)
    print("Long running operation:", response.operation.name)
    delete_custom_job_response = response.result(timeout=timeout)
    print("delete_custom_job_response:", delete_custom_job_response)
    
delete_custom_job_sample(custom_job_id)

Long running operation: projects/144763482491/locations/us-central1/operations/3654348322228928512
delete_custom_job_response: 


Now we will undeploy our model, delete the endpoint, and finally delete our model!

In [None]:
model_endpoint.undeploy_all()
model_endpoint.delete()
model.delete()

Delete the custom container stored in Container Registry or Artifact Registry. List the images to gather the tag ID.

In [41]:
#list the containers
!gcloud container images list-tags gcr.io/$project/cloudai-autogenerated/$display_name

Listed 0 items.
DIGEST        TAGS                      TIMESTAMP
1240e61185c9  20231103.17.39.12.779660  2023-11-03T17:42:05
ca99b71c4661  20231103.16.13.42.102563  2023-11-03T16:21:43
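If several images have piled up, picking the tag by eye gets tedious. The sketch below parses the table printed above into rows and builds the full image reference for a chosen tag. It assumes the three-column `DIGEST / TAGS / TIMESTAMP` layout shown here; `parse_list_tags` is a hypothetical helper, and `my-project` is a placeholder project ID.

```python
def parse_list_tags(table_text):
    """Parse the DIGEST / TAGS / TIMESTAMP table into a list of dicts."""
    rows = []
    seen_header = False
    for line in table_text.splitlines():
        fields = line.split()
        if fields[:3] == ["DIGEST", "TAGS", "TIMESTAMP"]:
            seen_header = True
            continue
        # Skip status lines before the header and rows without exactly three
        # columns (for example, untagged images).
        if not seen_header or len(fields) != 3:
            continue
        digest, tags, timestamp = fields
        rows.append({"digest": digest, "tags": tags.split(","), "timestamp": timestamp})
    return rows


sample_output = """Listed 0 items.
DIGEST        TAGS                      TIMESTAMP
1240e61185c9  20231103.17.39.12.779660  2023-11-03T17:42:05
ca99b71c4661  20231103.16.13.42.102563  2023-11-03T16:21:43
"""

rows = parse_list_tags(sample_output)
# ISO-style timestamps sort correctly as plain strings.
oldest = min(rows, key=lambda row: row["timestamp"])
image_ref = f"gcr.io/my-project/cloudai-autogenerated/flan-t5-training-tf:{oldest['tags'][0]}"
print(image_ref)
```

The tag picked this way can be assigned to `tag_id` in the next cell.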


In [50]:
#Save the tag ID
tag_id='<TAG_ID>'

In [51]:
#delete 
!gcloud container images delete gcr.io/$project/cloudai-autogenerated/$display_name:$tag_id --force-delete-tags --quiet

Digests:
- gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3@sha256:ca99b71c466168f467152e04791710a9e269e767985b22a6cd1702e4fac2f691
  Associated tags:
 - 20231103.16.13.42.102563
Tags:
- gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.16.13.42.102563
Deleted [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3:20231103.16.13.42.102563].
Deleted [gcr.io/cit-oconnellka-9999/cloudai-autogenerated/flan-t5-training-tf3@sha256:ca99b71c466168f467152e04791710a9e269e767985b22a6cd1702e4fac2f691].


And finally, delete our bucket:

In [None]:
!gcloud storage rm --recursive gs://$BUCKET/