
Releases: GoogleCloudPlatform/ai-on-gke

v1.1.2

06 May 21:31
e7b191a

Highlights

  • RAG, Ray & Jupyter Terraform solutions now support GKE Autopilot as the default cluster type #635
  • The RAG solution has improved test coverage: the E2E tests now (1) validate the notebook that generates vector embeddings #524 and (2) validate prompt responses from the LLM with context #511

What's Changed

Full Changelog: v1.1.0...v1.1.2

v1.1.0

05 Apr 12:39

We are excited to announce the release of AI on GKE v1.1! This release brings several new features, improvements, and bug fixes to enhance your experience with running AI workloads on Google Kubernetes Engine (GKE).

Highlights

AI on GKE Quick Starts

Get started with popular AI frameworks and tools using new quickstart guides for RAG, Ray and Jupyter notebooks on GKE.

RAG

Retrieval Augmented Generation (RAG) is a technique for giving Large Language Models (LLMs) additional context related to a prompt. RAG has many benefits, including providing external information (e.g., from knowledge repositories) and introducing “grounding”, which helps the LLM generate an appropriate response.
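As a minimal sketch of the pattern (embed_fn, vector_store, and llm are hypothetical stand-ins, not part of this solution):

```python
# Minimal sketch of the RAG flow; embed_fn, vector_store, and llm are
# hypothetical stand-ins for an embedding model, a vector database
# (e.g. pgvector), and an LLM client.
def answer_with_rag(llm, embed_fn, vector_store, question: str) -> str:
    query_vector = embed_fn(question)              # 1. embed the question
    docs = vector_store.search(query_vector, k=4)  # 2. retrieve nearest documents
    context = "\n".join(doc.text for doc in docs)
    # 3. ground the prompt in the retrieved context before calling the LLM
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)
```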

The new quick start deploys a RAG stack on a new or existing GKE cluster using open source tools and frameworks such as Ray, LangChain, HuggingFace TGI, and Jupyter notebooks. The model used for inference is Mistral-7B. The solution uses the GCS Fuse CSI driver to load the input dataset quickly and the Cloud SQL pgvector extension to store the generated vector embeddings for RAG. It includes features like authenticated access for your application via Identity-Aware Proxy, sensitive data protection, and text moderation. See the README to get started.

Ray

Ray is an open-source framework for scaling Python applications across multiple nodes in a cluster. It provides a simple API for building distributed, parallelized applications, especially for machine learning.
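For example, Ray's core API turns a plain Python function into a distributed task (a minimal sketch):

```python
import ray

ray.init()  # connects to a running cluster when RAY_ADDRESS is set

@ray.remote
def square(x: int) -> int:
    return x * x

# Tasks are scheduled in parallel across the cluster's workers.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```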

KubeRay enables Ray to be deployed on Kubernetes. You get the unified, Pythonic experience delivered by Ray and the enterprise reliability and scale of GKE's managed Kubernetes. Together, they offer scalability, fault tolerance, and ease of use for building, deploying, and managing distributed applications.

The new quick start deploys KubeRay on a new or existing GKE cluster along with a sample Ray cluster. See the README to get started.
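Once the sample cluster is up, you can submit work to it with Ray's job submission SDK; a sketch, assuming the dashboard port (8265 by default) is reachable, e.g. via port-forwarding, and that train.py is a placeholder for your own script:

```python
from ray.job_submission import JobSubmissionClient

# Assumes the Ray dashboard is reachable locally, e.g. via
# kubectl port-forward to the head service.
client = JobSubmissionClient("http://localhost:8265")
job_id = client.submit_job(
    entrypoint="python train.py",
    runtime_env={"working_dir": "./"},
)
print(client.get_job_status(job_id))
```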

Jupyter

JupyterHub is a powerful, multi-tenant, server-based web application that allows users to interact with and collaborate on Jupyter notebooks. Users can create custom computing environments, with custom images and computational resources, in which to run their notebooks. “Zero to JupyterHub for Kubernetes” (z2jh) is a Helm chart for installing JupyterHub on Kubernetes that provides numerous configuration options for complex user scenarios.

The new quick start solution sets up JupyterHub on GKE. Running your Jupyter notebooks and JupyterHub on Google Kubernetes Engine (GKE) provides a way to prototype your distributed, compute-intensive ML applications with security and scalability built in as core elements of the platform. See the README to get started.

Ray on GKE guide

Dive deeper into running Ray workloads on GKE with comprehensive guides and tutorials covering various use cases and best practices. See the Ray on GKE README to get started. We’ve also included a new user guide specifically for leveraging TPU Multihost and Multislice Support with Ray.

Inference Benchmarks

Evaluate and compare the performance of different AI models and frameworks on GKE using the newly added inference benchmarks. They support benchmarking popular LLMs such as Gemma, Llama 2, Falcon, and other models available on Hugging Face, as well as different model servers such as Text Generation Inference and Triton with TensorRT-LLM. You can measure the performance of these models and model servers on various GPU types in GKE. To get started, refer to the README.
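Conceptually, each benchmark boils down to timing requests against a model server. A minimal sketch against a Text Generation Inference endpoint (the tgi-service URL is illustrative; /generate is TGI's standard endpoint, and the real benchmarks are more thorough):

```python
import time
import requests

# Illustrative endpoint; in-cluster the TGI service name and port will differ.
URL = "http://tgi-service:8080/generate"
payload = {"inputs": "Explain KubeRay in one sentence.",
           "parameters": {"max_new_tokens": 64}}

latencies = []
for _ in range(20):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=60)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"p50={latencies[len(latencies) // 2]:.2f}s "
      f"p90={latencies[int(len(latencies) * 0.9)]:.2f}s")
```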

Guides, Tutorials and Examples

LLM Guides

We’ve introduced the following guides for serving LLMs on GKE:

  • Guide to Serving Mistral 7B-Instruct v0.1 on GKE Utilizing Nvidia L4 GPUs
  • Guide to Serving the Mixtral 8x7B Model on GKE Utilizing Nvidia L4 GPUs
  • RAG with Weaviate and Vertex AI

GKE ML Platform

Introducing the first MVP in the GKE ML Platform Solution, featuring:

  • Opinionated GKE Platform for AI/ML workloads
    • Comes with a sample deployment of Ray
    • Infrastructure automated through Terraform and GitOps for cluster configuration management
  • Parallel data processing using Ray, accelerating the notebook-to-cluster experience (see the sketch after this list)
    • Includes a sample data processing script for a publicly available dataset using Ray.
  • Resources:
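A minimal sketch of the parallel data-processing step referenced above (the bucket paths and the "text" column are illustrative; the solution ships its own script):

```python
import ray

ray.init()

# Illustrative input path; the solution's script targets a public dataset.
ds = ray.data.read_parquet("gs://my-bucket/raw/")

def clean(df):
    # Runs on pandas DataFrame batches, in parallel across the Ray cluster.
    df["text"] = df["text"].str.strip().str.lower()
    return df

ds = ds.map_batches(clean, batch_format="pandas")
ds.write_parquet("gs://my-bucket/clean/")
```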

TPU Provisioner

This release introduces the TPU Provisioner, a controller that automatically provisions new TPU node pools based on the requirements of pending pods, then deprovisions them when they are no longer in use. See the README for how to get started.

Bug fixes and improvements

  • Reorganized folders in the ai-on-gke repo
  • E2E tests for all quick start deployments are now running on Google Cloud Build
  • Introduced the modules directory, containing commonly used Terraform modules shared across our different deployments
  • Renamed the gke-platform directory to infrastructure with additional features and capabilities

v1.0.2

17 Nov 22:00
48e6cfe

Ray Serve

  • Introduced support for Ray on Autopilot with three predefined worker groups, small (CPU only), medium (1 GPU), and large (8 GPUs): 7082b13

Ray on GKE Storage

#87 provides examples for Ray on GKE storage solutions:

  • One-click deployment of a GCS bucket plus KubeRay access control
  • Leveraging the GKE GCS Fuse CSI driver to access GCS buckets as a shared filesystem with standard file semantics, eliminating the need for specialized fsspec libraries (see the sketch below)
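What "standard file semantics" buys you, in a sketch (the mount path is illustrative; the real path comes from the pod's volumeMounts for the GCS Fuse CSI volume):

```python
import os

# Illustrative mount point for the GCS Fuse CSI volume.
DATA_DIR = "/data/datasets"

# Standard file semantics: plain os and open() calls, no gcsfs/fsspec client.
print(os.listdir(DATA_DIR))
with open(os.path.join(DATA_DIR, "train.csv")) as f:
    print(f.readline())
```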

Ray Data

The Ray Data API tutorial, with an end-to-end Stable Diffusion fine-tuning example (PR), deploys a Ray training job from a Jupyter notebook to a Ray cluster on GKE and illustrates the following:

  • Caching a Hugging Face Stable Diffusion model checkpoint in a GCS bucket and mounting it to Ray workers in the Ray cluster hosted on GKE
  • Using Ray Data APIs to perform batch inference and generate the regularization images needed for fine-tuning
  • Using the Ray Train framework for distributed training with multiple GPUs in a multi-node GKE cluster (see the sketch after this list)
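A minimal sketch of the Ray Train pattern the tutorial builds on (the toy model and data are illustrative, not the tutorial's code):

```python
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model

def train_loop_per_worker(config):
    # Ray wires up torch.distributed; prepare_model wraps the model in DDP
    # and moves it onto this worker's GPU.
    model = prepare_model(torch.nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    features = torch.randn(64, 8, device=get_device())  # toy data
    labels = torch.randn(64, 1, device=get_device())
    for _ in range(config["epochs"]):
        loss = torch.nn.functional.mse_loss(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 5},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```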

KubeRay

  • Pinned the KubeRay version to v0.6.0 and the Helm chart version to v0.6.1
  • Installed the KubeRay operator in a dedicated namespace (ray-system)

Jupyter Notebooks

  • Secure authentication via Identity-Aware Proxy (IAP) is now enabled by default for JupyterHub, for both Standard and Autopilot clusters. Here is the sample user guide for configuring the IAP client in your JupyterHub installation. This ensures the JupyterHub endpoint is no longer exposed to the public internet.

Distributed training of PyTorch CNN

  • JobSet example for distributed training of a PyTorch CNN handwritten-digit classification model using the MNIST dataset.
  • Indexed Job example for distributed training of a PyTorch CNN handwritten-digit classification model on the MNIST dataset on NVIDIA T4 GPUs (see the sketch below).
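In the Indexed Job pattern, each pod derives its rank from the completion index Kubernetes injects; a sketch (MASTER_ADDR, MASTER_PORT, and WORLD_SIZE are assumed to be set in the Job manifest):

```python
import os
import torch.distributed as dist

# Kubernetes injects JOB_COMPLETION_INDEX into each pod of an Indexed Job;
# it maps naturally onto the torch.distributed rank. MASTER_ADDR,
# MASTER_PORT, and WORLD_SIZE are assumed to be set in the Job manifest.
rank = int(os.environ["JOB_COMPLETION_INDEX"])
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
# ... wrap the CNN in DistributedDataParallel and train as usual ...
```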

Inferencing using Saxml and an HTTP Server

  • Example deploying an HTTP server to handle requests to Sax, with support for features such as model publishing, listing, updating, unpublishing, and generating predictions. With an HTTP server, interaction with Sax can also extend beyond the VM level; for example, integration with GKE and load balancing enables requests to Sax from inside and outside the GKE cluster.

Finetuning and Serving Llama on L4 GPUs

  • Example for fine-tuning the Llama 7B model on GKE using 8 L4 GPUs
  • Example for serving the Llama 70B model on GKE with 2 L4 GPUs

Validation of Changes to Ray on GKE Templates

  • Pull requests now trigger Cloud Build tests to detect breaking changes to the GKE platform and KubeRay solution templates.

TPU support for Ray, persistent Ray logs & metrics, JupyterHub improvements

15 Sep 00:18
fad927e

AI on GKE 1.0.1

The 1.0.1 patch introduces TPU support for Ray, persistent and searchable Ray logs and metrics, and pre-configured resource profiles for JupyterHub.

Support for TPUs with Ray

TPUs are now first-class citizens in Ray’s resource orchestration layer, making the experience just like using GPUs. The user guide outlines how to get started with TPUs on Ray.
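As a sketch of what that looks like from application code (the "TPU" resource name follows the KubeRay convention; the chip count and the use of JAX are illustrative):

```python
import ray

ray.init()

# Request TPU chips the way you would request GPUs. The "TPU" custom
# resource is advertised by the TPU worker nodes in the Ray cluster.
@ray.remote(resources={"TPU": 4})
def tpu_task():
    import jax  # assumes the worker image ships with JAX
    return jax.device_count()

print(ray.get(tpu_task.remote()))
```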

Improvements to Ray observability

Ray on GKE automatically writes Ray logs and metrics to GCP, so users can view persistent logs and metrics across multiple clusters. Even if your Ray cluster dies, you still have visibility into previous jobs via GCP. See the Logging & Monitoring section for more details on usage.

  • Logs are exported via a Fluent Bit sidecar and tagged with the Ray job submission ID. The job submission ID can be used to filter Ray job logs in Cloud Logging (see the sketch below).

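A sketch of that filter via the Cloud Logging client library (the label key and job ID are illustrative; the actual tag attached to each entry depends on the Fluent Bit configuration):

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Illustrative label key and job ID; adjust to match your Fluent Bit tags.
job_filter = 'labels."ray-job-submission-id"="raysubmit_abc123"'
for entry in client.list_entries(filter_=job_filter):
    print(entry.payload)
```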

  • Metrics are exported via Prometheus and can be viewed in Cloud Monitoring.


Multiple user profiles support for JupyterHub

JupyterHub comes installed with different user profiles; each profile specifies a different set of resources (GPU/CPU, memory, image). This user guide outlines how to get started with JupyterHub and configure profiles for your use case.
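A minimal sketch of how such profiles are expressed with KubeSpawner (the profile names and resource numbers are illustrative, not this release's defaults):

```python
# jupyterhub_config.py -- `c` is the config object JupyterHub provides.
c.KubeSpawner.profile_list = [
    {
        "display_name": "CPU (small)",
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "8G"},
    },
    {
        "display_name": "GPU (1 x T4)",
        "kubespawner_override": {
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```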
