# Dataproc Spark Job
- Dataproc Cluster
- Job with BQ data
- Delete Dataproc Cluster

API Reference: https://googleapis.dev/python/dataproc/0.7.0/gapic/v1/api.html

## Setup

inputs:

In [9]:
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
DATANAME = 'fraud'
NOTEBOOK = 'dataproc'

DATAPROC_COMPUTE = "n1-standard-4"
DATAPROC_MAIN_INSTANCES = 1
DATAPROC_WORK_INSTANCES = 4

packages:

In [10]:
from google.cloud import dataproc_v1
from datetime import datetime

clients:

In [11]:
client_options = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
clients = {}

In [12]:
clients['cluster'] = dataproc_v1.ClusterControllerClient(client_options = client_options)
clients['job'] = dataproc_v1.JobControllerClient(client_options = client_options)

parameters:

In [13]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"

environment:

In [14]:
!rm -rf {DIR}
!mkdir -p {DIR}

## Define Job
- https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml#run_a_linear_regression

In [29]:
%%writefile {DIR}/gm.py
from __future__ import print_function
from pyspark.context import SparkContext
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import GaussianMixture
from pyspark.sql.session import SparkSession
# The imports above give access to Spark ML's Gaussian mixture clustering
# as well as the Vectors type.


# Define a function that collects the features of interest
# (mother_age, father_age, gestation_weeks, weight_gain_pounds, and
# apgar_5min) into a vector.
# Package the vector in a tuple containing the label (`weight_pounds`) for that
# row.
def vector_from_inputs(r):
  return (r["weight_pounds"], Vectors.dense(float(r["mother_age"]),
                                            float(r["father_age"]),
                                            float(r["gestation_weeks"]),
                                            float(r["weight_gain_pounds"]),
                                            float(r["apgar_5min"])))

sc = SparkContext()
spark = SparkSession(sc)

# Temporary GCS bucket the BigQuery connector uses to stage data when writing
spark.conf.set('temporaryGcsBucket',"statmike-mlops")

# Read the data from BigQuery as a Spark Dataframe.
natality_data = spark.read.format("bigquery").option("table", "bigquery-public-data.samples.natality").load()
# Create a view so that Spark SQL queries can be run against the data.
natality_data.createOrReplaceTempView("natality")

# subset data rows and columns
sql_query = """
SELECT weight_pounds, mother_age, father_age, gestation_weeks, weight_gain_pounds, apgar_5min
from natality
where weight_pounds is not null
and mother_age is not null
and father_age is not null
and gestation_weeks is not null
and weight_gain_pounds is not null
and apgar_5min is not null
"""
clean_data = spark.sql(sql_query)

# Create an input DataFrame for Spark ML using the above function.
training_data = clean_data.rdd.map(vector_from_inputs).toDF(["label", "features"])
training_data.cache()

# cluster the feature rows with GM
gm = GaussianMixture().setK(4).setSeed(1234567)
model = gm.fit(training_data)

# write data to BigQuery
model.gaussiansDF.write.format('bigquery').option("table", "statmike-mlops.fraud.gm_cluster").mode('overwrite').save()

Overwriting temp/dataproc/gm.py
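
The script above hard-codes the temporary bucket and the output table, so reusing it in another project means editing the file. Here is a minimal sketch of one way to lift those values into job arguments with `argparse`. The function name `parse_job_args` and the flags `--temp-bucket` and `--output-table` are hypothetical and not part of the original job. The defaults mirror the literals in gm.py.

```python
import argparse


def parse_job_args(argv=None):
    """Parse job settings; the flag names are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Gaussian mixture job settings")
    parser.add_argument(
        "--temp-bucket",
        default="statmike-mlops",
        help="GCS bucket the BigQuery connector uses for temporary files",
    )
    parser.add_argument(
        "--output-table",
        default="statmike-mlops.fraud.gm_cluster",
        help="BigQuery table that receives model.gaussiansDF",
    )
    return parser.parse_args(argv)


# An explicit empty list keeps the defaults and avoids reading the
# notebook kernel's own command-line arguments.
args = parse_job_args([])
print(args.temp_bucket, args.output_table)
```

Inside gm.py the parsed values would replace the literals, for example `spark.conf.set('temporaryGcsBucket', args.temp_bucket)`. For a cluster job, the flags would be passed through the `args` list of the `pyspark_job` specification.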


In [30]:
!gsutil cp {DIR}/gm.py {URI}/{TIMESTAMP}/gm.py

Copying file://temp/dataproc/gm.py [Content-Type=text/x-python]...
/ [1 files][  2.1 KiB/  2.1 KiB]                                                
Operation completed over 1 objects/2.1 KiB.                                      


## Method 1: Submit Serverless (Batch) Dataproc Job

During Private Preview: need to allowlist the project and user...

Note: Dataproc Serverless requires a subnet with Private Google Access. The first three cells below check whether it is enabled on the default subnet, enable it, and check again to confirm.

In [17]:
!gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"

False


In [18]:
!gcloud compute networks subnets update default --region={REGION} --enable-private-ip-google-access

Updated [https://www.googleapis.com/compute/v1/projects/statmike-mlops/regions/us-central1/subnetworks/default].


In [19]:
!gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"

True


In [20]:
!gcloud beta dataproc batches submit pyspark {DIR}/gm.py --project={PROJECT_ID} --region={REGION} --deps-bucket={BUCKET} --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

Batch [d3395366a410496987b5cdccb966fa26] submitted.
Using the default image serverless-spark-default:2.1
CONDA_HOME=/opt/dataproc/opt/conda/default
PYSPARK_PYTHON=/opt/dataproc/opt/conda/default/bin/python
Batch [d3395366a410496987b5cdccb966fa26] finished.
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
  batch: projects/statmike-mlops/locations/us-central1/batches/d3395366a410496987b5cdccb966fa26
  batchUuid: 29ff1546-039b-4c10-a3f7-386e3afd4502
  createTime: '2021-10-21T18:40:15.621906Z'
  description: Batch
  operationType: BATCH
name: projects/statmike-mlops/regions/us-central1/operations/fbaa0995-8859-314c-a065-baf30487edd1
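
The same batch can likely be submitted from Python instead of the gcloud CLI. The sketch below assumes a version of `google-cloud-dataproc` recent enough to include `BatchControllerClient`, and it assumes the script has already been copied to Cloud Storage, because the API takes a `gs://` URI and not a local path. The helper names `build_pyspark_batch` and `submit_batch` and the example URI are hypothetical.

```python
def build_pyspark_batch(main_uri, jar_uris):
    """Build the dict form of a Dataproc Serverless PySpark batch."""
    return {
        "pyspark_batch": {
            "main_python_file_uri": main_uri,
            "jar_file_uris": list(jar_uris),
        }
    }


def submit_batch(project_id, region, batch):
    """Submit the batch; needs google-cloud-dataproc and credentials."""
    # Imported here so the builder above works without the library installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.create_batch(
        request={
            "parent": f"projects/{project_id}/locations/{region}",
            "batch": batch,
        }
    )
    # Blocks until the long-running operation completes.
    return operation.result()


batch = build_pyspark_batch(
    "gs://my-bucket/path/gm.py",  # hypothetical URI for illustration
    ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
)
print(batch)
```

In this notebook the main file URI would be `f"{URI}/{TIMESTAMP}/gm.py"`, and the call would be `submit_batch(PROJECT_ID, REGION, batch)`.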


## Method 2: User Managed Dataproc Cluster

### Create Cluster
https://cloud.google.com/dataproc/docs/guides/create-cluster

In [21]:
cluster_specs = {
    "project_id": PROJECT_ID,
    "cluster_name": DATANAME,
    "config": {
        "master_config": {"num_instances": DATAPROC_MAIN_INSTANCES, "machine_type_uri": DATAPROC_COMPUTE},
        "worker_config": {"num_instances": DATAPROC_WORK_INSTANCES, "machine_type_uri": DATAPROC_COMPUTE}
    }
}

In [22]:
cluster = clients['cluster'].create_cluster(
    request = {
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster": cluster_specs
    }
)

In [23]:
cluster.result().cluster_name

'fraud'

### Submit Job
- https://cloud.google.com/dataproc/docs/samples/dataproc-submit-pyspark-job

In [31]:
job_specs = {
    "placement": {"cluster_name": DATANAME},
    "pyspark_job": {
        "main_python_file_uri": f"{URI}/{TIMESTAMP}/gm.py",
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"]
    }
}

In [32]:
job = clients['job'].submit_job(project_id = PROJECT_ID, region = REGION, job = job_specs)

In [33]:
job.reference.job_id

'84c95c0b-cb32-4363-a1a6-641bf1559108'

### Wait On Job

In [34]:
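import time

while True:
    ljob = clients['job'].get_job(project_id = PROJECT_ID, region = REGION, job_id = job.reference.job_id)
    # ERROR and CANCELLED are terminal states: stop instead of polling forever.
    if ljob.status.state.name in ("ERROR", "CANCELLED"):
        raise Exception(f"{ljob.status.state.name}: {ljob.status.details}")
    elif ljob.status.state.name == "DONE":
        print ("Finished")
        break
    time.sleep(10)  # pause between polls to avoid hammering the API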

Finished
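
The wait loop above can be factored into a small reusable helper with a timeout. This is a sketch, not part of the original notebook. The function name `wait_for_state` and its parameters are hypothetical, and the state lookup is injected as a callable so the helper can run here against a stand-in without a cluster.

```python
import time


def wait_for_state(fetch_state, poll_seconds=10.0, timeout_seconds=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it returns a terminal job state name.

    fetch_state is any zero-argument callable that returns a state name
    string such as "RUNNING" or "DONE".
    """
    terminal = {"DONE", "ERROR", "CANCELLED"}
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state()
        if state in terminal:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {state} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for the real API call, so the sketch runs without a cluster.
fake_states = iter(["PENDING", "RUNNING", "DONE"])
final = wait_for_state(lambda: next(fake_states), sleep=lambda seconds: None)
print(final)
```

With the notebook's client, the callable could be `lambda: clients['job'].get_job(project_id=PROJECT_ID, region=REGION, job_id=job.reference.job_id).status.state.name`. The caller then decides what to do when the returned state is `ERROR` or `CANCELLED`.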


### Review Results
- Go to BigQuery and review the output table (statmike-mlops.fraud.gm_cluster in this example).
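
The output table can also be inspected from the notebook instead of the console. The sketch below assumes the `google-cloud-bigquery` package is installed and that credentials are available. The helper names `preview_query` and `preview_table` are hypothetical.

```python
def preview_query(table, limit=10):
    """Return SQL that selects a few rows from a fully qualified table."""
    # Backticks are needed because the project ID contains a hyphen.
    return f"SELECT * FROM `{table}` LIMIT {int(limit)}"


def preview_table(table, limit=10):
    """Run the preview query; needs google-cloud-bigquery and credentials."""
    # Imported here so preview_query works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query(preview_query(table, limit)).result()
    return [dict(row.items()) for row in rows]


sql = preview_query("statmike-mlops.fraud.gm_cluster", limit=4)
print(sql)
```

Calling `preview_table("statmike-mlops.fraud.gm_cluster", limit=4)` would return the rows as dictionaries. The model was fit with `setK(4)`, so a limit of 4 should cover one row per mixture component.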

In [35]:
ljob

reference {
  project_id: "statmike-mlops"
  job_id: "84c95c0b-cb32-4363-a1a6-641bf1559108"
}
placement {
  cluster_name: "fraud"
  cluster_uuid: "80c51baf-d595-4301-a08b-44e799f638ac"
}
pyspark_job {
  main_python_file_uri: "gs://statmike-mlops/fraud/models/dataproc/20211021183942/gm.py"
  jar_file_uris: "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
}
status {
  state: DONE
  state_start_time {
    seconds: 1634896929
    nanos: 78067000
  }
}
yarn_applications {
  name: "gm.py"
  state: FINISHED
  progress: 1.0
  tracking_url: "http://fraud-m:8088/proxy/application_1634850985280_0002/"
}
status_history {
  state: PENDING
  state_start_time {
    seconds: 1634894938
    nanos: 741183000
  }
}
status_history {
  state: SETUP_DONE
  state_start_time {
    seconds: 1634894938
    nanos: 762715000
  }
}
status_history {
  state: RUNNING
  details: "Agent reported job success"
  state_start_time {
    seconds: 1634894938
    nanos: 928553000
  }
}
driver_control_files_uri: "gs:/

### Delete Cluster
https://cloud.google.com/dataproc/docs/guides/manage-cluster#delete_a_cluster

In [36]:
delCluster = clients['cluster'].delete_cluster(
    request = {
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster_name": cluster.result().cluster_name
    }
)