In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI - Llama2 fine-tuning with LoRA and serving on TPUv5e

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_llama2_pytorch_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Ftraining%2Ftpuv5e_llama2_pytorch_finetuning_and_serving.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/notebooks/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/tpuv5e_llama2_pytorch_finetuning_and_serving.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br>
Open in Vertex AI Workbench
    </a> 
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/tpuv5e_llama2_pytorch_finetuning_and_serving.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br>
      View on GitHub
    </a>
  </td>
</table>

## Quota - Make sure this is complete before you start!

In order to run this example, you will need the following TPUv5e quota approved. You can make requests in the console via IAM & Admin > Quotas or by reaching out to your Google account team:

- `aiplatform.googleapis.com/custom_model_serving_tpu_v5e` (4-8 chips; 4 chips minimum for Llama2 7B)
- `aiplatform.googleapis.com/custom_model_training_tpu_v5e` (16 chips minimum)

Check the [TPU pricing page](https://cloud.google.com/tpu/pricing) for region availability and pricing.

## Overview

This notebook demonstrates fine-tuning of a Llama2 7B model with [LoRA](https://huggingface.co/docs/peft/v0.9.0/en/package_reference/lora#peft.LoraConfig), and uses a TPUv5e for both fine-tuning and serving. The fine-tuning is based on a [Hugging Face example](https://huggingface.co/google/gemma-7b/blob/main/examples/example_fsdp.py) that uses [fully sharded data parallel with PyTorch XLA](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/), and [SPMD](https://pytorch.org/blog/pytorch-xla-spmd/). Follow the links to learn more.


Fine-tuning runs as a [Vertex AI Custom Training Job](https://cloud.google.com/vertex-ai/docs/training/create-custom-job), which allows a higher level of customization and control over the fine-tuning job. All of the examples in this notebook use Low-Rank Adaptation ([LoRA](https://huggingface.co/docs/peft/en/package_reference/lora)) to reduce training and storage costs.

This notebook deploys the model using Hex-LLM, a High-Efficiency Large Language Model serving solution built with XLA that is being developed by Google Cloud.

### Objective

- Fine-tune and deploy Llama2 models with a Vertex AI Custom Training Job and Vertex Prediction endpoint.
- Send prediction requests to your fine-tuned Llama2 model.


### Dataset

In this example, you use the [english_quotes](https://huggingface.co/datasets/Abirate/english_quotes) dataset from Hugging Face to fine-tune the model.

### Costs 

This tutorial uses the following billable components of Google Cloud:

- Vertex AI (Training, Prediction, TPUv5e)
- Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Get started

### Install Vertex AI SDK for Python and other required packages


In [1]:
! pip3 install --upgrade --quiet google-cloud-aiplatform

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [2]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

Authenticate your environment on Google Colab.


In [3]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [4]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
PROJECT_ID = "vertexai-service-project"  # @param {type:"string"} ##TODO: DELETE LINE

### Location

You can also change the `LOCATION` variable used by Vertex AI. Learn more about [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations).

TPUv5e is available in the locations listed on the [TPU pricing page](https://cloud.google.com/tpu/pricing).

In [5]:
LOCATION = "us-central1"  # @param {type: "string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [6]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}
BUCKET_URI = f"gs://k-bucket-{PROJECT_ID}-tpuv5ellama"  # @param {type:"string"} ##TODO: Delete line


#### Set folder paths for staging, environment, and model artifacts

In [7]:
import os
STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [8]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://k-bucket-vertexai-service-project-tpuv5ellama/...


### Initialize Vertex AI SDK for Python

In [9]:
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=STAGING_BUCKET)

### Set up service account
The service account address has the following form:
`<service-account-name>@<project-id>.iam.gserviceaccount.com`
Follow the instructions at https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console
to create a service account with the `Vertex AI User` and `Storage Object Admin` roles.


In [10]:
# The service account for deploying fine tuned model.
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}
SERVICE_ACCOUNT = "tupv5e-notebook@vertexai-service-project.iam.gserviceaccount.com"  # @param {type:"string"} ##TODO: DELETE LINE


### Access Llama 2 pretrained and finetuned models
The original models from Meta are converted into the Hugging Face format for finetuning and serving in Vertex AI.

Accept the model agreement to access the models:

1. Navigate to the Vertex AI > Model Garden page in the Google Cloud console.
2. Search for Llama 2.
3. Review the agreement that pops up on the model card page.
4. Accept the Llama 2 agreement.
5. On the documentation tab, a Cloud Storage bucket containing Llama 2 pretrained and finetuned models will be shared.
6. Paste the Cloud Storage bucket link below and assign it to `VERTEX_AI_MODEL_GARDEN_LLAMA2`.

In [11]:
VERTEX_AI_MODEL_GARDEN_LLAMA2 = "<Bucket path from documentation tab of Llama 2 in Vertex Model Garden>"  # Shared after you accept the Llama 2 agreement in Vertex AI Model Garden.
VERTEX_AI_MODEL_GARDEN_LLAMA2 = "gs://vertex-model-garden-public-us-central1/llama2" ##TODO: DELETE LINE
VERTEX_MODEL_ID = "llama2-7b-hf"
HF_MODEL_ID = "meta-llama/Llama-2-7b-hf"

In [13]:
assert (
    VERTEX_AI_MODEL_GARDEN_LLAMA2
), "Please click the agreement of Llama 2 in Vertex AI Model Garden, and get the GCS path of Llama 2 model artifacts."
print(
    "Copy Llama 2 model artifacts from",
    VERTEX_AI_MODEL_GARDEN_LLAMA2,
    "to ",
    f"{BUCKET_URI}/{HF_MODEL_ID}",
)

# Copy model files to your bucket
! gcloud storage cp -R $VERTEX_AI_MODEL_GARDEN_LLAMA2/$VERTEX_MODEL_ID/* $BUCKET_URI/$HF_MODEL_ID

Copy Llama 2 model artifacts from gs://vertex-model-garden-public-us-central1/llama2 to  gs://k-bucket-vertexai-service-project-tpuv5ellama/meta-llama/Llama-2-7b-hf
Copying gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/LICENSE to gs://k-bucket-vertexai-service-project-tpuv5ellama/meta-llama/Llama-2-7b-hf/LICENSE
Copying gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/LLaMA V2 Model Preview User Guide.pdf to gs://k-bucket-vertexai-service-project-tpuv5ellama/meta-llama/Llama-2-7b-hf/LLaMA V2 Model Preview User Guide.pdf
Copying gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/MODEL_CARD.md to gs://k-bucket-vertexai-service-project-tpuv5ellama/meta-llama/Llama-2-7b-hf/MODEL_CARD.md
Copying gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/Notice-File.docx to gs://k-bucket-vertexai-service-project-tpuv5ellama/meta-llama/Llama-2-7b-hf/Notice-File.docx
Copying gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/Respo

### Create the Artifact Registry repository and set the custom docker image uri

In [14]:
REPOSITORY = "tpuv5e-training-repository-unique"

In [15]:
image_name_train = "llama2-7b-hf-lora-tuning-tpuv5e"
hostname = f"{LOCATION}-docker.pkg.dev"
tag = "latest"

In [16]:
# Register gcloud as a Docker credential helper
!gcloud auth configure-docker $LOCATION-docker.pkg.dev --quiet


{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev
Docker configuration file updated.


**If your repository doesn't already exist**: Run the following cell to create your Artifact Registry repository.

In [17]:
!gcloud artifacts repositories create $REPOSITORY --repository-format=docker \
--location=$LOCATION --description="Vertex TPUv5e training repository"

Create request issued for: [tpuv5e-training-repository-unique]
Waiting for operation [projects/vertexai-service-project/locations/us-central1/
operations/b7f908ad-b999-4e54-a79d-13613e968190] to complete...done.           
Created repository [tpuv5e-training-repository-unique].


In [18]:
# Define container image name
PYTORCH_TRAIN_DOCKER_URI = (
    f"{hostname}/{PROJECT_ID}/{REPOSITORY}/{image_name_train}:{tag}"
)

### Define common functions

In [19]:
from datetime import datetime, timedelta

def get_job_name_with_datetime(prefix: str) -> str:
    """Gets the job name with date time when triggering training or deployment
    jobs in Vertex AI.
    """
    return prefix + datetime.now().strftime("_%Y%m%d_%H%M%S")

### Build the Docker container files

#### Create the trainer directory

In [20]:
import os

if not os.path.exists("trainer"):
    os.makedirs("trainer")

#### Create the Dockerfile for the custom container

The Dockerfile installs Hugging Face `transformers`, `datasets`, `trl`, and `peft` for fine-tuning.

In [21]:
%%writefile trainer/Dockerfile
# This Dockerfile fine-tunes the Llama2 model using LoRA with PyTorch XLA
# Nightly TPU VM docker image
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240324

ENV DEBIAN_FRONTEND=noninteractive

# Install basic libs
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
        cmake \
        curl \
        wget \
        sudo \
        gnupg \
        libsm6 \
        libxext6 \
        libxrender-dev \
        lsb-release \
        ca-certificates \
        build-essential \
        git \
        libgl1

# Copy Apache license.
RUN wget https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/LICENSE

# Install required libs
RUN pip install --upgrade pip
RUN pip install --upgrade pip
RUN pip install transformers==4.38.2 -U
RUN pip install datasets==2.18.0
RUN pip install trl==0.8.1 peft==0.10.0
RUN pip install accelerate==0.28.0
RUN pip install --upgrade google-cloud-storage

# Copy other licenses.
# Use raw file URLs so the license text is saved rather than the GitHub HTML page.
RUN wget -O MIT_LICENSE https://raw.githubusercontent.com/pytest-dev/pytest/main/LICENSE
RUN wget -O BSD_LICENSE https://raw.githubusercontent.com/pytorch/xla/master/LICENSE
RUN wget -O BSD-3_LICENSE https://raw.githubusercontent.com/pytorch/pytorch/main/LICENSE

# Copy the installed libtpu.so to /lib
RUN find ./usr/local/lib -name 'libtpu.so' -exec cp {} /lib \;

WORKDIR /
COPY train.py train.py
ENV PYTHONPATH ./

ENTRYPOINT ["python", "train.py"]

Writing trainer/Dockerfile


#### Add the __init__.py file

In [22]:
!touch trainer/__init__.py

#### Add the train.py file
This code is adapted from the LoRA distributed fine-tuning example at https://ai.google.dev/gemma/docs/distributed_tuning

Here the script fine-tunes the Llama2 model on the `Abirate/english_quotes` dataset from Hugging Face. Additional logic is added to handle the TPU topology setting required by TPUv5e: https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config
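The topology handling can be tried on its own before building the container. The following is a minimal sketch that mirrors the arithmetic in `train.py`: it splits a topology string such as `4x4` into its two dimensions and derives the global batch size as 8 samples per chip, capped at 128. The function names `parse_tpu_topology` and `global_batch_size` are illustrative and do not appear in the script.

```python
from typing import Tuple


def parse_tpu_topology(topology: str) -> Tuple[int, int]:
    """Split a topology string such as "4x4" into its two integer dimensions."""
    x, y = topology.lower().split("x")
    return int(x), int(y)


def global_batch_size(topology: str, per_chip: int = 8, cap: int = 128) -> int:
    """Return `per_chip` samples for every chip, capped at `cap`.

    train.py sets the batch size to 128 for 16 or more chips; with 8 samples
    per chip that is the same as taking the minimum of the two values.
    """
    x, y = parse_tpu_topology(topology)
    return min(per_chip * x * y, cap)


for topology in ("2x2", "2x4", "4x4"):
    chips_x, chips_y = parse_tpu_topology(topology)
    print(topology, chips_x * chips_y, "chips, batch size", global_batch_size(topology))
```

With SPMD the value passed as `per_device_train_batch_size` acts as the global batch size, which is why the script scales it with the chip count.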


In [23]:
%%writefile trainer/train.py
import os, sys
import argparse

import torch
import torch_xla
import torch_xla.core.xla_model as xm

from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

from google.cloud import storage

# use spmd
import torch_xla.runtime as xr
xr.use_spmd()

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tpu_topology",
    help="Topology to use for the TPUv5e (2x2, 2x4, 4x4)",
    default="4x4",
    type=str
)
parser.add_argument(
    "--model_name",
    help="Llama2 model name (meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf)",
    default="meta-llama/Llama-2-7b-hf",
    type=str
)
parser.add_argument(
    "--bucket_name",
    help="The name of the bucket you copied the Llama2 model files to",
    required=True,
    type=str
)
parser.add_argument(
    "--output_folder",
    type=str,
    required=True,
    help="Output folder name",
)
parser.add_argument(
    "--checkpoint_directory",
    type=str,
    default="output_ckpt",
    help="Checkpoint Directory name",
)
parser.add_argument(
    "--epochs",
    type=int,
    default=10,
    help="Number of epochs to train",
)
parser.add_argument(
    "--merged_model_folder",
    type=str,
    default="llama2-7b-hf/modelfiles",
    help="Folder in the bucket where the merged model and adapter files are uploaded",
)
args = parser.parse_args()

GCS_PREFIX = "gs://"

def is_gcs_path(input_path: str) -> bool:
    return input_path.startswith(GCS_PREFIX)

def download_gcs_dir(gcs_dir: str, local_dir: str):
    """Download files in a GCS directory to a local directory.

    For example:
    download_gcs_dir(gs://bucket/foo, /tmp/bar)
    gs://bucket/foo/a -> /tmp/bar/a
    gs://bucket/foo/b/c -> /tmp/bar/b/c

    Arguments:
    gcs_dir: A string of directory path on GCS.
    local_dir: A string of local directory path.
    """
    if not is_gcs_path(gcs_dir):
        raise ValueError(f"{gcs_dir} is not a GCS path starting with gs://.")

    bucket_name = gcs_dir.split("/")[2]
    prefix = gcs_dir[len(GCS_PREFIX + bucket_name) :].strip("/")
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    for blob in blobs:
        if blob.name[-1] == "/":
            continue
        file_path = blob.name[len(prefix) :].strip("/")
        local_file_path = os.path.join(local_dir, file_path)
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        blob.download_to_filename(local_file_path)
        print (f'download of {local_file_path} complete')
    print (f'Show all files in directory {os.listdir(local_dir)}')

def upload_directory_with_transfer_manager(bucket_name, source_directory, blob_name_prefix, workers=8):
    """Upload every file in a directory, including all files in subdirectories.

    Each blob name is derived from the filename, not including the `directory`
    parameter itself. For complete control of the blob name for each file (and
    other aspects of individual blob metadata), use
    transfer_manager.upload_many() instead.
    """

    # bucket_name = "your-bucket-name"

    # The directory on your computer to upload. Files in the directory and its
    # subdirectories will be uploaded. An empty string means "the current
    # working directory".
    # source_directory=""

    # blob_name_prefix = prefix for the files being uploaded to GCS
    # example: file1 and file2 in a folder uploaded to my-bucket with blob_name_prefix=my-folder/a/
    # will be uploaded to gs://my-bucket/my-folder/a/file1 and gs://my-bucket/my-folder/a/file2
    
    # The maximum number of processes to use for the operation. The performance
    # impact of this value depends on the use case, but smaller files usually
    # benefit from a higher number of processes. Each additional process occupies
    # some CPU and memory resources until finished. Threads can be used instead
    # of processes by passing `worker_type=transfer_manager.THREAD`.
    # workers=8

    from pathlib import Path

    from google.cloud.storage import Client, transfer_manager

    storage_client = Client()
    bucket = storage_client.bucket(bucket_name)

    # Generate a list of paths (in string form) relative to the `directory`.
    # This can be done in a single list comprehension, but is expanded into
    # multiple lines here for clarity.

    # First, recursively get all files in `directory` as Path objects.
    directory_as_path_obj = Path(source_directory)
    paths = directory_as_path_obj.rglob("*")

    # Filter so the list only includes files, not directories themselves.
    file_paths = [path for path in paths if path.is_file()]

    # These paths are relative to the current working directory. Next, make them
    # relative to `directory`
    relative_paths = [path.relative_to(source_directory) for path in file_paths]

    # Finally, convert them all to strings.
    string_paths = [str(path) for path in relative_paths]

    print("Found {} files.".format(len(string_paths)))

    # Start the upload.
    print (f"source directory {source_directory}")
    results = transfer_manager.upload_many_from_filenames(
        bucket, string_paths, blob_name_prefix=blob_name_prefix, source_directory=source_directory, max_workers=workers
    )

    for name, result in zip(string_paths, results):
        # The results list is either `None` or an exception for each filename in
        # the input list, in order.

        if isinstance(result, Exception):
            print("Failed to upload {} due to exception: {}".format(name, result))
        else:
            print("Uploaded {} to {}/{}.".format(name, bucket.name, blob_name_prefix))
    
def main():
    x = args.tpu_topology.split("x")
    tpu_topology_x = int(x[0])
    tpu_topology_y = int(x[1])
    print (f'TPU topology is ({tpu_topology_x}, {tpu_topology_y})')
    print (f'Model name is {args.model_name}')
    
    # Set batch size to 8 for each chip
    BATCH_SIZE = 8 * tpu_topology_x * tpu_topology_y
    # For anything larger than an 8 chip instance, set the BATCH_SIZE to 128, since we run out of samples
    if (tpu_topology_x * tpu_topology_y) >=16:
        BATCH_SIZE = 128
    
    # Set download directory to a temporary folder
    DL_DIR="/tmp/modelfiles"
    if not os.path.exists(DL_DIR):
        os.makedirs(DL_DIR)

    print ('Downloading data to temporary folder')
    download_gcs_dir (f"gs://{args.bucket_name}/{args.model_name}", DL_DIR)
    
    # Create output folders
    if not os.path.exists(f"/tmp/{args.output_folder}"):
        os.makedirs(f"/tmp/{args.output_folder}")
    if not os.path.exists(f"/tmp/{args.checkpoint_directory}"):
        os.makedirs(f"/tmp/{args.checkpoint_directory}")

    device = xm.xla_device()
    
    # Set tokenizer parallelism to false to avoid warnings
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    tokenizer = AutoTokenizer.from_pretrained(DL_DIR)
    print ('Loaded tokenizer')
    base_model = AutoModelForCausalLM.from_pretrained(DL_DIR, torch_dtype=torch.bfloat16)
    print ('Loaded base model')

    # Set LoRA configuration
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["k_proj", "v_proj"],
    )
    
    # Required when using Llama2, as the tokenizer has no padding
    tokenizer.pad_token = tokenizer.eos_token

    # Load the dataset and format it for training.
    data = load_dataset("Abirate/english_quotes", split="train")
    max_seq_length = 512
    print ('Loaded dataset')

    # Set up the FSDP config. To enable FSDP via SPMD, set xla_fsdp_v2 to True.
    fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": [
            "LlamaDecoderLayer"
        ],
        "xla": True,
        "xla_fsdp_v2": True,
        "xla_fsdp_grad_ckpt": True}

    OUTPUT_DIR=f"/tmp/{args.output_folder}"
    CHECKPOINT_DIR=f"/tmp/{args.checkpoint_directory}"

    # Finally, set up the trainer and train the model.
    trainer = SFTTrainer(
        model=base_model,
        train_dataset=data,
        args=TrainingArguments(
            per_device_train_batch_size=BATCH_SIZE,  # This is actually the global batch size for SPMD.
            num_train_epochs=args.epochs,
            max_steps=-1,
            output_dir=OUTPUT_DIR,
            optim="adafactor",
            logging_steps=1,
            dataloader_drop_last = True,  # Required for SPMD.
            fsdp="full_shard",
            fsdp_config=fsdp_config,
        ),
        peft_config=lora_config,
        dataset_text_field="quote",
        max_seq_length=max_seq_length,
        packing=True,
    )

    # train
    trainer.train()
    
    adapter_model_id = "adapter_model"
    adapter_path = f"{CHECKPOINT_DIR}/{adapter_model_id}"
    merged_model_id = "merged_model"
    merged_model_path = f"{CHECKPOINT_DIR}/{merged_model_id}"
    
    trainer.model.to('cpu').save_pretrained(adapter_path)
    
    # Save the adapter, merged model, and tokenizer
    base_model = AutoModelForCausalLM.from_pretrained(DL_DIR, torch_dtype=torch.bfloat16)
    peft_model = PeftModel.from_pretrained(base_model, adapter_path)
    merged_model = peft_model.merge_and_unload()
    merged_model.save_pretrained(merged_model_path,safe_serialization=False)
    tokenizer.save_pretrained(merged_model_path)
    
    # Copy merged files to GCS folder
    OUTPUT_PREFIX=f"{args.merged_model_folder}/{merged_model_id}/{xr.process_index()}/"
    upload_directory_with_transfer_manager(bucket_name=args.bucket_name,source_directory=merged_model_path,
                                       blob_name_prefix=OUTPUT_PREFIX)
    print ('Uploaded merged model files')

    # copy adapter files to GCS folder
    OUTPUT_PREFIX=f"{args.merged_model_folder}/{adapter_model_id}/{xr.process_index()}/"
    upload_directory_with_transfer_manager(bucket_name=args.bucket_name,source_directory=adapter_path,
                                       blob_name_prefix=OUTPUT_PREFIX)
    print ('Uploaded adapter model files')

    print ('Exiting job')
    sys.exit(0)

if __name__ == "__main__":
    main()

Writing trainer/train.py
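The path handling inside `download_gcs_dir` is easy to check without touching Cloud Storage. The sketch below isolates it under two hypothetical helper names, `split_gcs_uri` and `local_path_for_blob`, which are not part of `train.py`. It uses `posixpath` so the result is the same on any operating system; the training container itself runs Linux, where `os.path.join` behaves identically.

```python
import posixpath
from typing import Tuple

GCS_PREFIX = "gs://"


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split "gs://bucket/some/prefix" into ("bucket", "some/prefix")."""
    if not gcs_uri.startswith(GCS_PREFIX):
        raise ValueError(f"{gcs_uri} is not a GCS path starting with gs://.")
    bucket_name, _, prefix = gcs_uri[len(GCS_PREFIX):].partition("/")
    return bucket_name, prefix.strip("/")


def local_path_for_blob(blob_name: str, prefix: str, local_dir: str) -> str:
    """Map a blob name under `prefix` to its destination under `local_dir`."""
    relative_path = blob_name[len(prefix):].strip("/")
    return posixpath.join(local_dir, relative_path)


bucket, prefix = split_gcs_uri("gs://bucket/foo")
print(bucket, prefix)
print(local_path_for_blob("foo/b/c", prefix, "/tmp/bar"))
```

The second call reproduces the mapping from the function's docstring, where `gs://bucket/foo/b/c` is downloaded to `/tmp/bar/b/c`.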


## Fine-tune with Vertex AI Custom Training Jobs

This section demonstrates how to fine-tune and deploy Llama2 models with PEFT LoRA on Vertex AI Custom Training Jobs. LoRA (Low-Rank Adaptation) is one approach of PEFT (Parameter Efficient Fine-tuning), where pretrained model weights are frozen and rank decomposition matrices representing the change in model weights are trained during fine-tuning. Read more about LoRA in the following publication: [Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W., 2021. Lora: Low-rank adaptation of large language models. *arXiv preprint arXiv:2106.09685*](https://arxiv.org/abs/2106.09685).
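To get a feel for how small the trained matrices are, the following sketch counts the LoRA parameters implied by the `LoraConfig` in `train.py` (`r=8`, target modules `k_proj` and `v_proj`). It assumes the Llama2 7B dimensions of 4096 hidden units and 32 decoder layers, with both projections mapping 4096 to 4096 features. The helper name `lora_param_count` is illustrative only.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed Llama2 7B dimensions; rank and target modules come from train.py.
hidden_size = 4096
num_layers = 32
rank = 8
target_modules = ("k_proj", "v_proj")

per_module = lora_param_count(hidden_size, hidden_size, rank)
total = num_layers * len(target_modules) * per_module
frozen_per_module = hidden_size * hidden_size

print(f"LoRA parameters per projection: {per_module:,}")
print(f"Total trainable LoRA parameters: {total:,}")
print(f"Share of each frozen projection: {per_module / frozen_per_module:.2%}")
```

Only these low-rank matrices receive gradient updates, which is what keeps the adapter checkpoint small compared with the merged model.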

#### Enable docker to run as a regular user

In [24]:
!sudo usermod -a -G docker ${USER}

#### Change to the trainer directory to build the docker container

In [25]:
%cd trainer

/home/jupyter/vertex-ai-samples/notebooks/official/training/trainer


#### Build the custom docker container and push to artifact registry

In [26]:
import sys

IS_COLAB = "google.colab" in sys.modules
    
if not IS_COLAB:
    !docker build -t $PYTORCH_TRAIN_DOCKER_URI -f Dockerfile .
    !docker push $PYTORCH_TRAIN_DOCKER_URI

else:
    ! gcloud builds submit --region={LOCATION} --tag={PYTORCH_TRAIN_DOCKER_URI}



Sending build context to Docker daemon  14.34kB
Step 1/19 : FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240324
nightly_3.10_tpuvm_20240324: Pulling from tpu-pytorch-releases/docker/xla

5f17d0c7: Pulling fs layer
675e1918: Pulling fs layer
b1746a83: Pulling fs layer
39aa9b63: Pulling fs layer
5db05c76: Pulling fs layer
d409a431: Pulling fs layer
e92977b7: Pulling fs layer
dbc1dd53: Pulling fs layer
8b771932: Pulling fs layer
4f7e71c8: Pulling fs layer
ef234518: Pulling fs layer
8a3d6129: Pulling fs layer
772c88a6: Pulling fs layer
3386ec0d: Pulling fs layer
76f7c809: Pulling fs layer
a259ce43: Pulling fs layer
d6b58ba1: Pulling fs layer
a0bd010a: Pull complete

#### Change back to the parent directory

In [27]:
%cd ..

/home/jupyter/vertex-ai-samples/notebooks/official/training


#### Set GCS folder locations and job configuration settings

In [28]:
# Set the GCS folder that stores the merged model (base model plus the
# fine-tuned LoRA adapter).
BUCKET_NAME = BUCKET_URI.replace("gs://", "")
OUTPUT_DIR_NAME = "output"
CHECKPOINT_DIR_NAME = "output_chk"
NUM_EPOCHS = 200
NUM_EPOCHS = 1##TODO: DELETE LINE
MERGED_MODEL_FOLDER = "llama2-7b-hf/modelfiles"

# Choose the machine type to match the number of chips in use.
# Topologies 2x2, 2x4, and 4x4 correspond to 4, 8, and 16 chips and consume quota
# from aiplatform.googleapis.com/custom_model_training_tpu_v5e
MACHINE_TYPE = "ct5lp-hightpu-4t"
TPU_TOPOLOGY = "4x4"

DISPLAY_NAME_PREFIX = f"llama2-7b-lora-train-{TPU_TOPOLOGY}"
tpuv5e_llama2_peft_job = {
    "display_name": get_job_name_with_datetime(DISPLAY_NAME_PREFIX),
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": MACHINE_TYPE,
                    "tpu_topology": TPU_TOPOLOGY,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": PYTORCH_TRAIN_DOCKER_URI,
                    "args": [
                        f"--tpu_topology={TPU_TOPOLOGY}",
                        f"--model_name={HF_MODEL_ID}",
                        f"--bucket_name={BUCKET_NAME}",
                        f"--output_folder={OUTPUT_DIR_NAME}",
                        f"--checkpoint_directory={CHECKPOINT_DIR_NAME}",
                        f"--epochs={NUM_EPOCHS}",
                        f"--merged_model_folder={MERGED_MODEL_FOLDER}",
                    ],
                },
            },
        ],
    },
}

tpuv5e_llama2_peft_job

{'display_name': 'llama2-7b-lora-train-4x4_20240611_090307',
 'job_spec': {'worker_pool_specs': [{'machine_spec': {'machine_type': 'ct5lp-hightpu-4t',
     'tpu_topology': '4x4'},
    'replica_count': 1,
    'container_spec': {'image_uri': 'us-central1-docker.pkg.dev/vertexai-service-project/tpuv5e-training-repository-unique/llama2-7b-hf-lora-tuning-tpuv5e:latest',
     'args': ['--tpu_topology=4x4',
      '--model_name=meta-llama/Llama-2-7b-hf',
      '--bucket_name=k-bucket-vertexai-service-project-tpuv5ellama',
      '--output_folder=output',
      '--checkpoint_directory=output_chk',
      '--epochs=1',
      '--merged_model_folder=llama2-7b-hf/modelfiles']}}]}}

#### Create job client and run job

In [None]:
job_client = aiplatform.gapic.JobServiceClient(
    client_options=dict(api_endpoint=f"{LOCATION}-aiplatform.googleapis.com")
)

In [None]:
create_tpuv5e_llama2_peft_job_response = job_client.create_custom_job(
    parent="projects/{project}/locations/{location}".format(
        project=PROJECT_ID, location=LOCATION
    ),
    custom_job=tpuv5e_llama2_peft_job,
)
print(create_tpuv5e_llama2_peft_job_response)

#### Check on job progress
This may take 20-60 minutes or more depending on the model size. Run this cell multiple times to check progress.

In [None]:
get_tpuv5e_llama2_peft_job_response = job_client.get_custom_job(
    name=create_tpuv5e_llama2_peft_job_response.name
)
get_tpuv5e_llama2_peft_job_response

#### Click on the console log url output from this cell to see your logs

In [None]:
job_id = create_tpuv5e_llama2_peft_job_response.name[
    create_tpuv5e_llama2_peft_job_response.name.rfind("/") + 1 :
]
STARTDATE = datetime.today() - timedelta(days=1)
STARTDATE = STARTDATE.strftime("%Y-%m-%dT%H:%M:%S.%f")
ENDDATE = datetime.today() + timedelta(days=0.1)
ENDDATE = ENDDATE.strftime("%Y-%m-%dT%H:%M:%S.%f")
print(
    f"https://console.cloud.google.com/logs/query;query=resource.labels.job_id=%22{job_id}%22;cursorTimestamp={ENDDATE}Z;startTime={STARTDATE}Z;endTime={ENDDATE}Z?project={PROJECT_ID}"
)

#### Wait until the training job is complete

In [None]:
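import time

from google.cloud.aiplatform import gapic as aip

# Terminal states other than success. Without this check the loop would
# never exit for a cancelled or expired job.
UNSUCCESSFUL_STATES = (
    aip.JobState.JOB_STATE_FAILED,
    aip.JobState.JOB_STATE_CANCELLED,
    aip.JobState.JOB_STATE_EXPIRED,
)

while True:
    response = job_client.get_custom_job(
        name=create_tpuv5e_llama2_peft_job_response.name
    )
    if response.state == aip.JobState.JOB_STATE_SUCCEEDED:
        print("Training has completed")
        break
    if response.state in UNSUCCESSFUL_STATES:
        raise RuntimeError(f"Training job ended in state {response.state.name}")
    print(f"Training is not complete and is in state {response.state.name}")
    time.sleep(60)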

### Deploy fine tuned models
This section uploads the model to Model Registry and deploys it using Hex-LLM, a High-Efficiency Large Language Model serving solution built with XLA that is being developed by Google Cloud.

The model deployment step will take 15-20 minutes to complete.

In [None]:
HEXLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/hex-llm-serve:20240328_RC01"

# GCS folder path where the merged model files were saved in your bucket
# MERGED_MODEL_FOLDER="llama2-7b-hf/modelfiles" set during fine-tuning
MERGED_MODEL_PATH = f"{MERGED_MODEL_FOLDER}/merged_model/0"
GCS_MODEL_PATH = f"{BUCKET_URI}/{MERGED_MODEL_PATH}"

DISPLAY_NAME_PREFIX = "llama2-7b-lora-deploy"  # @param {type:"string"}
JOB_NAME = get_job_name_with_datetime(DISPLAY_NAME_PREFIX)
GCS_MODEL_PATH

#### Check the model files in your GCS directory

Your output should list files similar to these (the path prefix will match your `GCS_MODEL_PATH`):
```
gs://<YOUR-BUCKET>/modelfiles/merged_model/config.json
gs://<YOUR-BUCKET>/modelfiles/merged_model/generation_config.json
gs://<YOUR-BUCKET>/modelfiles/merged_model/pytorch_model-00001-of-00003.bin
gs://<YOUR-BUCKET>/modelfiles/merged_model/pytorch_model-00002-of-00003.bin
gs://<YOUR-BUCKET>/modelfiles/merged_model/pytorch_model-00003-of-00003.bin
gs://<YOUR-BUCKET>/modelfiles/merged_model/pytorch_model.bin.index.json
gs://<YOUR-BUCKET>/modelfiles/merged_model/special_tokens_map.json
gs://<YOUR-BUCKET>/modelfiles/merged_model/tokenizer.json
gs://<YOUR-BUCKET>/modelfiles/merged_model/tokenizer_config.json
```

In [None]:
!gsutil ls $GCS_MODEL_PATH
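
One way to make this check less manual is to compare the listing against the file names you expect. The sketch below is a plain-Python helper that takes object URIs and reports which expected names are absent. `missing_model_files` and `REQUIRED_FILES` are hypothetical names, and the required set is an assumption taken from the sample listing above rather than a documented Hex-LLM requirement.

```python
# Assumed minimum set, based on the sample listing in this notebook.
REQUIRED_FILES = {
    "config.json",
    "pytorch_model.bin.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
}


def missing_model_files(uris, required=REQUIRED_FILES):
    """Return the required file names that do not appear in `uris`, sorted."""
    present = set()
    for uri in uris:
        uri = uri.strip()
        # Skip blank lines and "directory" entries, which end with a slash.
        if not uri or uri.endswith("/"):
            continue
        # Keep only the final path component of each object URI.
        present.add(uri.rsplit("/", 1)[-1])
    return sorted(set(required) - present)


# Example listing with a placeholder bucket name; tokenizer.json is absent.
listing = [
    "gs://my-bucket/modelfiles/merged_model/config.json",
    "gs://my-bucket/modelfiles/merged_model/pytorch_model-00001-of-00003.bin",
    "gs://my-bucket/modelfiles/merged_model/pytorch_model.bin.index.json",
    "gs://my-bucket/modelfiles/merged_model/tokenizer_config.json",
]
print(missing_model_files(listing))
```

In a notebook, `listing = !gsutil ls $GCS_MODEL_PATH` captures the command output as a list of lines that can be passed to the helper.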

#### Define function for deploying model

In [None]:
from typing import Tuple


def deploy_model_hexllm(
    model_name: str,
    model_id: str,
    service_account: str,
    machine_type: str = "ct5lp-hightpu-4t",
    max_num_batched_tokens: int = 11264,
    tokens_pad_multiple: int = 1024,
    seqs_pad_multiple: int = 32,
    sync: bool = True,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys models with Hex-LLM on TPU in Vertex AI."""
    endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

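    # The machine type suffix encodes the TPU chip count,
    # e.g. "ct5lp-hightpu-4t" -> 4. Parsing the whole suffix also handles
    # multi-digit counts, which indexing a single character would not.
    num_tpu_chips = int(machine_type.rsplit("-", 1)[-1].rstrip("t"))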
    hexllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        "--log_level=INFO",
        f"--model={model_id}",
        "--load_format=pt",  # Note: Using Pytorch bin format for weights
        f"--tensor_parallel_size={num_tpu_chips}",
        "--num_nodes=1",
        "--use_ray",
        "--batch_mode=continuous",
        f"--max_num_batched_tokens={max_num_batched_tokens}",
        f"--tokens_pad_multiple={tokens_pad_multiple}",
        f"--seqs_pad_multiple={seqs_pad_multiple}",
    ]

    env_vars = {
        "PJRT_DEVICE": "TPU",
        "RAY_DEDUP_LOGS": "0",
        "RAY_USAGE_STATS_ENABLED": "0",
    }

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=HEXLLM_DOCKER_URI,
        serving_container_command=["python", "-m", "hex_llm.entrypoints.api_server"],
        serving_container_args=hexllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=env_vars,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        deploy_request_timeout=1800,
        service_account=service_account,
        sync=sync,
    )
    return model, endpoint

#### Deploy model to Vertex AI
The `deploy_model_hexllm` function returns a reference to the model added to the Vertex AI Model Registry, along with the new endpoint where the model is deployed.

In [None]:
print("Using model from: ", GCS_MODEL_PATH)
model, endpoint = deploy_model_hexllm(
    model_name=JOB_NAME,
    model_id=GCS_MODEL_PATH,
    service_account=SERVICE_ACCOUNT,
    sync=False,
)
print("endpoint_name:", endpoint.name)

#### Review the logs after the model has been deployed

In [None]:
from datetime import timezone

ENDPOINT_ID = endpoint.name[endpoint.name.rfind("/") + 1 :]
# Use UTC so that the trailing "Z" on the timestamps in the URL is accurate.
STARTDATE = datetime.now(timezone.utc) - timedelta(days=1)
STARTDATE = STARTDATE.strftime("%Y-%m-%dT%H:%M:%S.%f")
ENDDATE = datetime.now(timezone.utc) + timedelta(days=0.1)
ENDDATE = ENDDATE.strftime("%Y-%m-%dT%H:%M:%S.%f")
print(
    f"https://console.cloud.google.com/logs/query;query=resource.type%3D%22aiplatform.googleapis.com%2FEndpoint%22%20resource.labels.endpoint_id%3D%22{ENDPOINT_ID}%22%20resource.labels.location%3D%22{LOCATION}%22;startTime={STARTDATE}Z;endTime={ENDDATE}Z?project={PROJECT_ID}"
)

#### Wait until the endpoint deployment is complete

In [None]:
endpoint.wait()

In [None]:
# (Optional) Wait 15 minutes while the model is downloaded and set up.
if os.getenv("IS_TESTING"):
    time.sleep(900)

NOTE: The overall deployment can take 30-40 minutes or more. The deployment step itself succeeds after roughly 15-20 minutes; the serving container then downloads the fine-tuned model from the GCS bucket used during training, which takes a further ~15-20 minutes depending on the model size. Allow for this additional wait **after** the model deployment step above succeeds and before you run the next step below. Otherwise, you might see a `ServiceUnavailable: 503 502:Bad Gateway` error when you send requests to the endpoint.
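
If you would rather not guess when the model has finished loading, you can retry the first request until the endpoint stops returning errors. The following is a minimal sketch of retry with exponential backoff. `call_with_retry` and its parameters are hypothetical, and the demo uses a stand-in function in place of `endpoint.predict`. In the notebook you would probably narrow `retry_on` to the exception your client actually raises, for example `google.api_core.exceptions.ServiceUnavailable`, which matches the error named in the note above.

```python
import time


def call_with_retry(
    fn,
    retry_on=(Exception,),
    max_attempts=8,
    initial_delay=30.0,
    backoff=2.0,
    max_delay=300.0,
    sleep=time.sleep,
):
    """Call fn() until it succeeds, backing off exponentially between tries."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * backoff, max_delay)


# Demo: a stand-in for endpoint.predict that fails twice before succeeding.
attempts = {"count": 0}


def flaky_predict():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("503 Bad Gateway")
    return "ok"


result = call_with_retry(
    flaky_predict,
    retry_on=(ConnectionError,),
    sleep=lambda seconds: None,  # Skip real waiting in the demo.
)
print(result)
```

The `sleep` parameter exists only so that the demo runs instantly; leave it at its default when calling a real endpoint.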

### Once deployment is ready, send a prediction request

Once deployment succeeds, you can send requests to the endpoint with text prompts. The first request takes a minute or two while the model warms up.

Example:

```
Prompt: Provide a list of the 3 best comedy movies in the 90s in 50 characters or less
Response:  1) The Cable Guy 2) Scooby-Doo 3) Beethoven Requirements
```

In [None]:
PROMPT = (
    "Provide a list of the 3 best comedy movies in the 90s in 50 characters or less"
)

instances = [
    {
        "prompt": PROMPT,
        "max_tokens": 80,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1.0,
    },
]

response = endpoint.predict(instances=instances)

for prediction in response.predictions:
    print(prediction)
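
Because `endpoint.predict` takes a list of instances, several prompts can be sent in one call. The following small sketch builds that list with the same fields used above. `build_instances` is a hypothetical helper, the second prompt is an invented example, and this notebook does not confirm how the serving container batches multiple instances.

```python
def build_instances(prompts, max_tokens=80, temperature=1.0, top_p=1.0, top_k=1.0):
    """Build one request instance per prompt, sharing the sampling settings."""
    return [
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
        }
        for prompt in prompts
    ]


instances = build_instances(
    [
        "Provide a list of the 3 best comedy movies in the 90s in 50 characters or less",
        "Name one science fiction movie from the 80s",
    ]
)
print(len(instances), instances[0]["max_tokens"])
```

The resulting list can be passed as `endpoint.predict(instances=instances)`, and the predictions iterated as in the cell above.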

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
import os

# Delete the training job.
job_client.delete_custom_job(name=create_tpuv5e_llama2_peft_job_response.name)

# Undeploy the model and delete the endpoint.
endpoint.delete(force=True)

# Delete the model from Model Registry.
model.delete()

# Delete the Cloud Storage bucket and its objects.
# Set delete_bucket to True to enable this step.
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI