In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Spark on Ray on Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/spark_on_ray_on_vertex_ai.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fspark_on_ray_on_vertex_ai.ipynb">
      <img width="32px" src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/spark_on_ray_on_vertex_ai.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/spark_on_ray_on_vertex_ai.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview

This tutorial demonstrates how to run Spark on Ray on Vertex AI using [RayDP](https://github.com/oap-project/raydp).

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10

Learn more about [Ray on Vertex AI](https://cloud.google.com/vertex-ai/docs/open-source/ray-on-vertex-ai/overview) and [Spark on Ray on Vertex AI](https://cloud.google.com/vertex-ai/docs/open-source/ray-on-vertex-ai/run-spark-on-ray).

### Objective

In this tutorial, you learn how to use RayDP to run Spark applications on a Ray cluster on Vertex AI.

This tutorial uses the following Google Cloud services and resources:

- Ray on Vertex AI
- Artifact Registry
- Cloud Storage


The steps performed include:

- Create a custom Ray on Vertex AI container image
- Create a Ray cluster on Vertex AI using the custom container image
- Run Spark interactively on the cluster using RayDP
- Run a Spark application on the cluster via the Ray Job API
- Read files from Google Cloud Storage in a Spark application
- Use a Pandas UDF in a Spark application on Ray on Vertex AI
- Delete the Ray cluster on Vertex AI

### Dataset

This tutorial uses the [Guerry dataset](https://www.datavis.ca/gallery/guerry/guerrydat.html), which consists of 86 records in a CSV file.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Get started

### Install Vertex AI SDK for Python and other required packages


In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform[ray]==1.59.0
! pip3 install --upgrade --quiet pyspark
! pip3 install --upgrade --quiet ray[all]==2.9.3
! pip3 install --upgrade --quiet raydp

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.

In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

Authenticate your environment on Google Colab.


In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).
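
Before calling `vertexai.init`, it can help to fail fast if the placeholder project ID was left unchanged. The following is a minimal sketch, not part of the Vertex AI SDK: `resolve_project_id` is a hypothetical helper, and falling back to the `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about how your environment is configured.

```python
import os

# Placeholder value used by this notebook's PROJECT_ID parameter.
PLACEHOLDER = "[your-project-id]"


def resolve_project_id(project_id: str, env_var: str = "GOOGLE_CLOUD_PROJECT") -> str:
    """Return a usable project ID, or raise if none was provided.

    Hypothetical helper: falls back to an environment variable when the
    notebook placeholder was not replaced.
    """
    candidate = (project_id or "").strip()
    if candidate and candidate != PLACEHOLDER:
        return candidate
    from_env = os.environ.get(env_var, "").strip()
    if from_env:
        return from_env
    raise ValueError(f"Set PROJECT_ID or the {env_var} environment variable.")
```

The returned value could then be passed to `vertexai.init(project=..., location=LOCATION)`.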

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

## Create custom container image

[Ray on Vertex AI container images](https://cloud.google.com/vertex-ai/docs/supported-frameworks-list#ray) don't come with RayDP pre-installed, so you need a custom Ray on Vertex AI container image to run Spark applications on Ray on Vertex AI. The following section explains how to build a custom container image for Ray on Vertex AI that includes RayDP.

Create a directory to store the Dockerfile.

In [None]:
DOCKER_DIR = "docker_dir"
! mkdir -p {DOCKER_DIR}

Base the custom container image on the latest Ray on Vertex AI prebuilt image, and install any other Python packages that your Spark applications need. The Dockerfile below also installs OpenJDK, which Spark requires.

Note: `pyarrow==14.0` is pinned because of a dependency constraint in Ray 2.9.3. Reading CSV files also fails in this setup with pandas 2.2.0 or later, so pandas is pinned to 2.1.0.
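
To confirm that an environment actually carries the pinned versions, a small check like the following could be run inside the built image or on the cluster. This is a sketch using only the standard library. `find_pin_mismatches` is a hypothetical helper, and its prefix comparison is looser than pip's own version matching.

```python
from importlib import metadata


def find_pin_mismatches(pins: dict) -> dict:
    """Return {package: problem} for packages that are missing or off-pin.

    A pin matches when the installed version equals it or starts with
    "<pin>." (so "14.0" accepts "14.0.2"). This is a loose prefix check,
    not full PEP 440 matching.
    """
    problems = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != pinned and not installed.startswith(pinned + "."):
            problems[name] = f"installed {installed}, expected {pinned}"
    return problems


# Pins taken from the Dockerfile in this tutorial.
print(find_pin_mismatches({"pyarrow": "14.0", "pandas": "2.1.0"}))
```

An empty dictionary means every listed package matches its pin.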

In [None]:
%%writefile {DOCKER_DIR}/Dockerfile

FROM us-docker.pkg.dev/vertex-ai/training/ray-cpu.2-9.py310:latest

RUN apt-get update -y \
    && apt-get install openjdk-21-jdk -y \
    && pip install --no-cache-dir raydp pyarrow==14.0 pandas==2.1.0

### Enable Artifact Registry API
Enable the Artifact Registry API service for the Google Cloud project. This tutorial requires the [gcloud CLI](https://cloud.google.com/sdk/docs/install) to be installed.

In [None]:
! gcloud components update --quiet && gcloud services enable artifactregistry.googleapis.com

### Create a private Docker repository
Create a Docker repository in [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview).

In [None]:
DOCKER_REPOSITORY = "my-docker-repo"
IMAGE_NAME = "raydp-rov-image"
! gcloud artifacts repositories create {DOCKER_REPOSITORY} --repository-format=docker --location={LOCATION} --description="Docker repository"

### Build container image
This tutorial requires that [Docker](https://docs.docker.com/engine/install/) is installed and available in the work environment.
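
The build and push commands reference the image by its Artifact Registry path, which follows the pattern `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE`. The sketch below assembles that reference in one place. `artifact_registry_image_uri` is a hypothetical helper, and the explicit `tag` parameter is an addition for illustration (the commands in this tutorial rely on Docker's default `latest` tag).

```python
def artifact_registry_image_uri(
    location: str,
    project_id: str,
    repository: str,
    image: str,
    tag: str = "latest",
) -> str:
    """Build a Docker image reference for an Artifact Registry repository."""
    parts = {
        "location": location,
        "project_id": project_id,
        "repository": repository,
        "image": image,
        "tag": tag,
    }
    for name, value in parts.items():
        # Reject empty values and whitespace, which would yield a broken reference.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"Invalid {name}: {value!r}")
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


print(
    artifact_registry_image_uri(
        "us-central1", "my-project", "my-docker-repo", "raydp-rov-image"
    )
)
```

Using an explicit tag other than `latest` can make it easier to tell which build a cluster is running.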

In [None]:
! docker build {DOCKER_DIR} -t {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/{IMAGE_NAME}

### Push container image
Configure the [authentication for Google Artifact Registry's Docker repository](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#auth) before pushing the container image to the repository.

In [None]:
! gcloud auth configure-docker {LOCATION}-docker.pkg.dev --quiet

#### Push the container image to the Docker repository

In [None]:
!docker push {LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/{IMAGE_NAME}

## Create Ray cluster on Vertex AI
Use the custom container image to create a Ray cluster on Vertex AI with the Vertex AI SDK for Python.

In [None]:
from datetime import datetime, timezone

cluster_suffix = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")

CLUSTER_NAME = f"my-rov-cluster-{cluster_suffix}"
CUSTOM_CONTAINER_IMAGE_URI = (
    f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_REPOSITORY}/{IMAGE_NAME}"
)

In [None]:
from google.cloud.aiplatform import vertex_ray
from vertex_ray import Resources

head_node_type = Resources(
    machine_type="n1-standard-16",
    node_count=1,
    custom_image=CUSTOM_CONTAINER_IMAGE_URI,
)

worker_node_types = [
    Resources(
        machine_type="n1-standard-8",
        node_count=2,
        custom_image=CUSTOM_CONTAINER_IMAGE_URI,
    )
]

ray_cluster_resource_name = vertex_ray.create_ray_cluster(
    head_node_type=head_node_type,
    worker_node_types=worker_node_types,
    cluster_name=CLUSTER_NAME,
)

## Spark on Ray on Vertex AI using Ray client
A Ray [Task](https://docs.ray.io/en/latest/ray-core/tasks.html#ray-remote-functions) or [Actor](https://docs.ray.io/en/latest/ray-core/actors.html) is required to create a Spark session when using the [Ray client](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html) with a Ray cluster on Vertex AI. The following code shows how a Ray Actor can create a Spark session, run a Spark application, and stop the Spark session on a Ray cluster on Vertex AI using RayDP.


In [None]:
import ray


@ray.remote
class SparkExecutor:
    import pyspark

    spark: pyspark.sql.SparkSession = None

    def __init__(self):

        import raydp

        self.spark = raydp.init_spark(
            app_name="RAYDP ACTOR EXAMPLE",
            num_executors=1,
            executor_cores=1,
            executor_memory="500M",
        )

    def get_data(self):
        df = self.spark.createDataFrame(
            [
                ("sue", 32),
                ("li", 3),
                ("bob", 75),
                ("heo", 13),
            ],
            ["first_name", "age"],
        )
        return df.toJSON().collect()

    def stop_spark(self):
        import raydp

        raydp.stop_spark()

### Connect to Ray cluster on Vertex AI

The Ray server may take a while to accept connections, so use a retry loop to avoid connection timeout errors.

In [None]:
import time

RAY_ADDRESS = f"vertex_ray://{ray_cluster_resource_name}"


def ray_init():
    print(f"creating connection with ray. address: {RAY_ADDRESS}")
    return ray.init(address=RAY_ADDRESS)


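def retry(func, max_tries=10, delay_seconds=30):
    """Call func until it succeeds, waiting delay_seconds after each failure."""
    last_error = None
    for i in range(max_tries):
        try:
            print(f"Attempting to connect to Ray server ({i + 1}/{max_tries})...")
            return func()
        except Exception as e:
            last_error = e
            print(f"Connection failed: {e}. Retrying in {delay_seconds} seconds...")
            time.sleep(delay_seconds)
    # Surface the failure instead of continuing silently without a connection.
    raise RuntimeError(f"Could not connect after {max_tries} attempts") from last_error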


retry(ray_init)

### Call Ray Actor to get data

In [None]:
s = SparkExecutor.remote()
data = ray.get(s.get_data.remote())
print(data)

### Stop Spark Session

In [None]:
ray.get(s.stop_spark.remote())

### Disconnect from Ray cluster on Vertex AI

In [None]:
ray.shutdown()

## Spark on Ray on Vertex AI using Ray Job API
The Ray client is useful for small experiments that require an interactive connection to the Ray cluster. The Ray Job API is the recommended way to run long-running and production jobs on a Ray cluster. This also applies to running Spark applications on the Ray cluster on Vertex AI.

Create a Python script that contains Spark application code.

In [None]:
SCRIPT_DIR = "scripts"
! mkdir -p {SCRIPT_DIR}

In [None]:
%%writefile {SCRIPT_DIR}/my_raydp_job.py

import pyspark
import raydp

def get_data(spark: pyspark.sql.SparkSession):
    df = spark.createDataFrame(
        [
            ("sue", 32),
            ("li", 3),
            ("bob", 75),
            ("heo", 13),
        ],
        ["first_name", "age"],
    )
    return df.toJSON().collect()

def stop_spark():
    raydp.stop_spark()

if __name__ == '__main__':
    spark = raydp.init_spark(
        app_name="RAYDP JOB EXAMPLE",
        num_executors=1,
        executor_cores=1,
        executor_memory="500M",
    )
    print(get_data(spark))
    stop_spark()

### Submit the job using Ray Job API

#### Create Ray Job client

In [None]:
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient(RAY_ADDRESS)

#### Helper function for submitting Ray job

In [None]:
def submit_ray_job(script_name: str):
    job_id = client.submit_job(
        # Entrypoint shell command to execute
        entrypoint=f"python {script_name}",
        # Path to the local directory that contains the python script file.
        runtime_env={
            "working_dir": SCRIPT_DIR,
        },
    )
    return job_id

#### Submit Ray job

In [None]:
job_id = submit_ray_job("my_raydp_job.py")

### Monitor the job logs

The job logs can also be viewed via the Ray on Vertex AI OSS Dashboard in a web browser.
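
Logs fetched right after submission may be incomplete because the job is still running. One option is to poll the job status until it reaches a terminal state before reading the logs. The sketch below is written against a generic status callable so it can be tried without a cluster. `wait_for_job` is a hypothetical helper, and the terminal state names assume Ray's job statuses `SUCCEEDED`, `FAILED`, and `STOPPED`.

```python
import time
from typing import Callable

# Assumed terminal job states (names as used by Ray's JobStatus enum).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(
    get_status: Callable[[], object],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until a terminal state is reported, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        # Enum members expose .value; plain strings are used as they are.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {name} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In this notebook it could be wired up as `wait_for_job(lambda: client.get_job_status(job_id))` before calling `client.get_job_logs(job_id)`.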

In [None]:
client.get_job_logs(job_id)

## Reading Cloud Storage files from Spark application

The following section shows two different techniques for reading [Cloud Storage](https://cloud.google.com/storage/docs/buckets) files from Spark applications running on Ray on Vertex AI.

### Create a Cloud Storage bucket

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**If your bucket doesn't already exist**: Run the following cell to create your Google Cloud Storage bucket.

In [None]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

#### Copy [CSV file](https://www.datavis.ca/gallery/guerry/guerry.csv) to Cloud Storage bucket.

In [None]:
! curl https://www.datavis.ca/gallery/guerry/guerry.csv | gsutil cp - {BUCKET_URI}/guerry.csv

### Use the Google Cloud Storage Connector

The [Google Cloud Storage Connector for Hadoop](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/README.md) can be used to read files in a Google Cloud Storage bucket from a Spark application running on Ray on Vertex AI. To enable it, set a few Spark configuration parameters (the connector JAR and the `gs` file system implementations) when the Spark session is created using RayDP. The following code shows how a Spark application can read a CSV file stored in a Google Cloud Storage bucket.

This tutorial assumes that the IAM service account used by the Ray cluster on Vertex AI has been granted the IAM permissions required to read from the Google Cloud Storage bucket.

#### Create python script for Spark application

In [None]:
%%writefile {SCRIPT_DIR}/spark_gcs_connector.py

import raydp

spark = raydp.init_spark(
  app_name="RayDP GCS Example 1",
  configs={
      "spark.jars": "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22.jar",
      "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
      "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
  },
  num_executors=2,
  executor_cores=4,
  executor_memory="500M",
)

df = spark.read.csv("GCS_FILE_URI", header = True, inferSchema = True)
print(f"CSV data is: {df.toJSON().collect()}")
raydp.stop_spark()

#### Update CSV file path in the script
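
The `sed -i` command in the next cell works with GNU sed. On systems with BSD sed (for example macOS), `-i` expects a backup suffix argument. As a portable alternative, the substitution could be done in Python. This is a minimal sketch, and `replace_placeholder` is a hypothetical helper.

```python
from pathlib import Path


def replace_placeholder(script_path: str, placeholder: str, value: str) -> int:
    """Replace every occurrence of placeholder in a file and return the count."""
    path = Path(script_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(placeholder)
    if count == 0:
        # Fail loudly so a job is not submitted with the placeholder still in place.
        raise ValueError(f"{placeholder!r} not found in {script_path}")
    path.write_text(text.replace(placeholder, value), encoding="utf-8")
    return count
```

For this tutorial the call would look like `replace_placeholder(f"{SCRIPT_DIR}/spark_gcs_connector.py", "GCS_FILE_URI", f"{BUCKET_URI}/guerry.csv")`.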

In [None]:
! sed -i 's^GCS_FILE_URI^{BUCKET_URI}/guerry.csv^g' {SCRIPT_DIR}/spark_gcs_connector.py

#### Submit the job

In [None]:
submit_ray_job("spark_gcs_connector.py")

### Use Ray data

The [Ray Data API](https://docs.ray.io/en/latest/data/api/api.html) provides methods for reading files from a Google Cloud Storage bucket, and it uses Ray's distributed processing to read the data. The resulting Ray dataset can then be converted to a Spark DataFrame.

#### Create python script for Spark application

In [None]:
%%writefile {SCRIPT_DIR}/spark_gcs_ray_data.py

import raydp
import ray

spark = raydp.init_spark(
  app_name="RayDP GCS Example 2",
  num_executors=2,
  executor_cores=4,
  executor_memory="500M",
)

ray_dataset = ray.data.read_csv("GCS_FILE_URI")
df = ray_dataset.to_spark(spark)
print(f"CSV data is: {df.toJSON().collect()}")
raydp.stop_spark()

#### Update CSV file path in the script

In [None]:
! sed -i 's^GCS_FILE_URI^{BUCKET_URI}/guerry.csv^g' {SCRIPT_DIR}/spark_gcs_ray_data.py

#### Submit the job

In [None]:
submit_ray_job("spark_gcs_ray_data.py")

## Pandas UDF on Ray on Vertex AI

A [PySpark Pandas UDF](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) may require additional code when it's used in a Spark application running on a Ray cluster on Vertex AI.

### Handle Python package dependencies

The [Python dependencies](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/sdk.html#dependency-management) of an application can be installed using the runtime environment of the Ray Job API. When the Ray job is submitted to the cluster, Ray installs those dependencies in the Python virtual environment that it creates for running the job. A Pandas UDF, however, doesn't run in that virtual environment. It runs in the system Python environment instead. If a dependency isn't available in the system environment, it needs to be installed from within the Pandas UDF.
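
Because a UDF can be invoked many times, it may be worth checking whether a module is already importable before invoking pip. The sketch below shows one way to do that with the standard library. `ensure_package` and `_pip_install` are hypothetical helpers, and the `installer` parameter exists only so the logic can be exercised without running pip.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import Callable, Optional


def _pip_install(package: str) -> None:
    """Install a package into the interpreter that is running this code."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])


def ensure_package(
    module_name: str,
    package: Optional[str] = None,
    installer: Callable[[str], None] = _pip_install,
) -> bool:
    """Install package only if module_name is not importable.

    Returns True if an install was attempted, False if the module was
    already available.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    installer(package or module_name)
    # Make the freshly installed package visible to the import system.
    importlib.invalidate_caches()
    return True
```

Inside a Pandas UDF, a call such as `ensure_package("statsmodels")` could take the place of an unconditional `pip install`.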

#### Create Python script

In [None]:
%%writefile {SCRIPT_DIR}/pandas_udf_dependency.py

import pandas as pd
import pyspark
import raydp
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

def test_udf(spark: pyspark.sql.SparkSession):
    import pandas as pd
    
    df = spark.createDataFrame(pd.read_csv("https://www.datavis.ca/gallery/guerry/guerry.csv"))
    return df.select(func('Lottery','Literacy', 'Pop1831')).collect()


@pandas_udf(StringType())
def func(s1: pd.Series, s2: pd.Series, s3: pd.Series) -> str:
    import numpy as np
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "statsmodels"])
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    
    d = {'Lottery': s1, 
         'Literacy': s2,
         'Pop1831': s3}
    data = pd.DataFrame(d)

    # Fit regression model (using the natural log of one of the regressors)
    results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=data).fit()
    return results.summary().as_csv()


if __name__ == '__main__':
    
    spark = raydp.init_spark(
      app_name="RayDP UDF Example",
      num_executors=2,
      executor_cores=4,
      executor_memory="1500M",
    )
    
    print(test_udf(spark))
    
    raydp.stop_spark()

#### Submit the job

In [None]:
submit_ray_job("pandas_udf_dependency.py")

### Handle local Python dependencies

The best practice for handling Python dependencies is to use a Python package repository: publish your custom packages to the repository and install them with pip. If a Pandas UDF in your Spark application depends on a local Python package instead, additional code is required to add the package location to the Python module search path (`sys.path`) of the system Python environment on the nodes of the Ray cluster on Vertex AI.
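
The core of that additional code is extending the module search path before importing. A minimal sketch of the idea, runnable outside Spark, is shown here. `import_from_dir` is a hypothetical helper, and the membership check keeps the same directory from being appended repeatedly when the function is called more than once.

```python
import importlib
import sys


def import_from_dir(module_name: str, directory: str):
    """Import module_name after making sure directory is on sys.path."""
    if directory not in sys.path:
        sys.path.append(directory)
    # Ensure the import system notices files added after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

In a UDF, `directory` would be the location of the job's working directory on the cluster node.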

#### Create local python module

This is a simple Python file that defines one function. The function takes a string argument and prints it to the console.

In [None]:
%%writefile {SCRIPT_DIR}/my_module.py

def print_func(text: str):
    print(text)

#### Create python script

Reuse the Python script from the previous section to demonstrate how to handle a local dependency.

In [None]:
%%writefile {SCRIPT_DIR}/pandas_udf_local_dependency.py

import pandas as pd
import pyspark
import raydp
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

def test_udf(spark: pyspark.sql.SparkSession):
    import pandas as pd
    df = spark.createDataFrame(pd.read_csv("https://www.datavis.ca/gallery/guerry/guerry.csv"))
    import pathlib
    module_path = str(pathlib.Path(__file__).parent.resolve())
    return df.select(udf_wrapper_func('Lottery','Literacy', 'Pop1831', module_path)).collect()

# s1, s2, s3 are column names; module_path is captured by the UDF closure below.
def udf_wrapper_func(s1: str, s2: str, s3: str, module_path: str):

    @pandas_udf(StringType())
    def func(s1: pd.Series, s2: pd.Series, s3: pd.Series) -> str:
        import sys
        sys.path.append(module_path)
        
        # import local module
        import my_module
        my_module.print_func("This is a UDF local dependency test.")

        import numpy as np
        import subprocess
        subprocess.check_call([sys.executable, "-m", "pip", "install", "statsmodels"])
        import statsmodels.api as sm
        import statsmodels.formula.api as smf

        d = {'Lottery': s1, 
             'Literacy': s2,
             'Pop1831': s3}
        data = pd.DataFrame(d)

        # Fit regression model (using the natural log of one of the regressors)
        results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=data).fit()
        return results.summary().as_csv()

    return func(s1, s2, s3)


if __name__ == '__main__':
    
    spark = raydp.init_spark(
      app_name="UDF_TEST",
      num_executors=2,
      executor_cores=2,
      executor_memory="500M",
    )
    
    print(test_udf(spark))
    
    raydp.stop_spark()

#### Submit the job

In [None]:
submit_ray_job("pandas_udf_local_dependency.py")

The Ray job should complete successfully, and the job logs should contain the line `This is a UDF local dependency test.`

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, delete the resources created in this tutorial.

### Delete Ray cluster on Vertex AI

In [None]:
# Delete the cluster
vertex_ray.delete_ray_cluster(ray_cluster_resource_name)

### Delete Google Cloud Storage bucket

In [None]:
! gcloud storage rm --recursive {BUCKET_URI}

### Delete private docker repository

In [None]:
! gcloud artifacts repositories delete {DOCKER_REPOSITORY} --location={LOCATION} --quiet

### Delete local directories

In [None]:
! rm -rf {DOCKER_DIR}
! rm -rf {SCRIPT_DIR}