# Dataproc Spark Job
- Dataproc Cluster
- Job with BQ data
- Delete Dataproc Cluster

API Reference: https://googleapis.dev/python/dataproc/0.7.0/gapic/v1/api.html

## Setup

Enable the Dataproc API (only needed once per project):

In [2]:
!gcloud services enable dataproc.googleapis.com

inputs:

In [3]:
REGION = 'us-central1'
PROJECT_ID='statmike-demo3'
DATANAME = 'fraud'
NOTEBOOK = 'dataproc'

DATAPROC_COMPUTE = "n1-standard-4"
DATAPROC_MAIN_INSTANCES = 1
DATAPROC_WORK_INSTANCES = 4

packages:

In [4]:
from datetime import datetime

clients:

parameters:

In [5]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"
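
As a rough illustration of how these parameters fit together, the sketch below rebuilds the same strings and shows the GCS location a file would land in under a timestamped folder. The `staging_path` helper and the fixed example time are hypothetical, introduced only so the result is reproducible.

```python
from datetime import datetime

PROJECT_ID = "statmike-demo3"
DATANAME = "fraud"
NOTEBOOK = "dataproc"

BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"


def staging_path(filename, when):
    """Hypothetical helper: GCS path for a file under a timestamped folder."""
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{URI}/{timestamp}/{filename}"


# Fixed, made-up time used only for illustration
example = staging_path("lr.py", datetime(2024, 1, 2, 3, 4, 5))
print(example)
```

In the notebook itself the timestamp comes from `datetime.now()`, so each run writes to a new folder.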

environment:

In [6]:
!rm -rf {DIR}
!mkdir -p {DIR}

## Define Job
- https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml#run_a_linear_regression

In [7]:
%%writefile {DIR}/lr.py
from __future__ import print_function
from pyspark.context import SparkContext
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.session import SparkSession


# Define a function that collects the features of interest
def vector_from_inputs(r):
  return (r["Class"], Vectors.dense(float(r["Amount"]),
                                    int(r["Time"]),
                                    float(r["V1"]),
                                    float(r["V2"]),
                                    float(r["V3"]),
                                    float(r["V4"]),
                                    float(r["V5"]),
                                    float(r["V6"]),
                                    float(r["V7"]),
                                    float(r["V8"]),
                                    float(r["V9"]),
                                    float(r["V10"]),
                                    float(r["V11"]),
                                    float(r["V12"]),
                                    float(r["V13"]),
                                    float(r["V14"]),
                                    float(r["V15"]),
                                    float(r["V16"]),
                                    float(r["V17"]),
                                    float(r["V18"]),
                                    float(r["V19"]),
                                    float(r["V20"]),
                                    float(r["V21"]),
                                    float(r["V22"]),
                                    float(r["V23"]),
                                    float(r["V24"]),
                                    float(r["V25"]),
                                    float(r["V26"]),
                                    float(r["V27"]),
                                    float(r["V28"])
                                   ),
          r['splits'], r['transaction_id']
         )

sc = SparkContext()
spark = SparkSession(sc)

# temporary GCS bucket the connector uses to stage data when writing to BigQuery
spark.conf.set('temporaryGcsBucket',"statmike-demo3")

# Read the data from BigQuery as a Spark Dataframe.
input_data = spark.read.format("bigquery").option("table", "statmike-demo3.fraud.fraud_prepped").load()
input_data.createOrReplaceTempView("fraud")

# subset data rows and columns
sql_query = """
SELECT *
from fraud 
"""
clean_data = spark.sql(sql_query) 

# Create an input DataFrame for Spark ML using the above function.
all_data = clean_data.rdd.map(vector_from_inputs).toDF(["label", "features", "splits", "transaction_id"])
all_data.cache()

# logistic regression with pyspark.ml, fit on the TRAIN split only
lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter = 20)
lrModel = lr.fit(all_data.filter(all_data.splits == 'TRAIN'))
# score every row (all splits) with the fitted model
predictions = lrModel.transform(all_data)

# write data to BigQuery
predictions.write.format('bigquery').option("table", "statmike-demo3.fraud.dataproc_lr").mode('overwrite').save()

Writing temp/dataproc/lr.py
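
The feature list in `vector_from_inputs` is written out by hand. As a minimal sketch of the same column order in plain Python (no Spark needed), the names can be generated instead. `FEATURE_COLUMNS` and `features_from_row` are hypothetical names that are not part of the job script, and the row is a plain dict standing in for a Spark `Row`. One assumption to note: the script casts `Time` with `int(...)`, which would drop any fractional part, while this sketch uses `float` for every column.

```python
# Same column order as vector_from_inputs: Amount, Time, then V1..V28
FEATURE_COLUMNS = ["Amount", "Time"] + [f"V{i}" for i in range(1, 29)]


def features_from_row(row):
    """Return the feature values of one record as a list of floats."""
    return [float(row[name]) for name in FEATURE_COLUMNS]


# Made-up record for illustration only
row = {name: 0.0 for name in FEATURE_COLUMNS}
row.update({"Amount": "12.5", "Time": 406, "V1": -1.25})

features = features_from_row(row)
print(len(features), features[:3])
```

In the job script, a list built this way could be passed to `Vectors.dense` in place of the 30 explicit arguments.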


In [10]:
!gsutil cp {DIR}/lr.py {URI}/{TIMESTAMP}/lr.py

Copying file://temp/dataproc/lr.py [Content-Type=text/x-python]...
/ [1 files][  2.9 KiB/  2.9 KiB]                                                
Operation completed over 1 objects/2.9 KiB.                                      


## Submit Serverless (Batch) Dataproc Job

During Private Preview, the project and user need to be allowlisted...

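Note: Dataproc Serverless requires a subnet with Private Google Access. The first three cells below check whether it is enabled, update the setting, and check again to confirm. To enable it, the update command needs `--enable-private-ip-google-access`; the `--no-enable-private-ip-google-access` form used in the recorded run below turns it off, which is why the second check returns `False` and the batch submission fails.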

In [17]:
!gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"

True


In [20]:
!gcloud compute networks subnets update default --region={REGION} --no-enable-private-ip-google-access

Updated [https://www.googleapis.com/compute/v1/projects/statmike-demo3/regions/us-central1/subnetworks/default].


In [21]:
!gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"

False


In [22]:
!gcloud dataproc batches submit pyspark {DIR}/lr.py --project={PROJECT_ID} --region={REGION} --deps-bucket={BUCKET} --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

Batch [b73beb67bed74b7aa7a7c2d39fffef11] submitted.
ERROR: (gcloud.dataproc.batches.submit.pyspark) Batch job is FAILED. Detail: Subnetwork 'default' does not support Private Google Access which is required for Dataproc clusters when 'internal_ip_only' is set to 'true'. Enable Private Google Access on subnetwork 'default' or set 'internal_ip_only' to 'false'.
Running auto diagnostics on the batch. It may take few minutes before diagnostics output is available. Please check diagnostics output by running 'gcloud dataproc batches describe' command.
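
Because the batch fails when Private Google Access is off, one option is to check the subnet before submitting. The sketch below wraps the same `gcloud compute networks subnets describe` command used earlier with `subprocess`. The function names are hypothetical, and the `run` parameter exists only so the check can be exercised without calling gcloud.

```python
import subprocess


def describe_command(subnet, region):
    """Build the gcloud command that reads privateIpGoogleAccess for a subnet."""
    return [
        "gcloud", "compute", "networks", "subnets", "describe", subnet,
        f"--region={region}",
        "--format=get(privateIpGoogleAccess)",
    ]


def private_access_enabled(subnet, region, run=subprocess.run):
    """Return True only if gcloud prints 'True' for the subnet."""
    result = run(
        describe_command(subnet, region),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"
```

In the notebook this could guard the submit cell, for example by raising an error when `private_access_enabled("default", REGION)` returns `False`.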
