Cloud Dataproc API Examples

Sample command-line programs for interacting with the Cloud Dataproc API.

See the tutorial on using the Dataproc API with the Python client library for a walkthrough you can run to try out the Cloud Dataproc API sample code.

Note that while these samples demonstrate interacting with Dataproc via the API, the same functionality could also be accomplished using the Cloud Console or the gcloud CLI.

list_clusters.py is a simple command-line program that demonstrates connecting to the Cloud Dataproc API and listing the clusters in a region.

submit_job_to_cluster.py demonstrates how to create a cluster, submit a PySpark job, download the output from Google Cloud Storage, and print the result.

single_job_workflow.py uses the Cloud Dataproc InstantiateInlineWorkflowTemplate API to create an ephemeral cluster, run a job, and then delete the cluster with a single API request.
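For orientation, listing clusters with the google-cloud-dataproc client library looks roughly like the sketch below; the exact structure of list_clusters.py may differ.

```python
from google.cloud import dataproc_v1


def list_clusters(project_id, region):
    # The client must point at the regional endpoint to see that region's clusters.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    for cluster in client.list_clusters(
        request={"project_id": project_id, "region": region}
    ):
        print(f"{cluster.cluster_name}: {cluster.status.state.name}")
```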

pyspark_sort.py_gcs is the same as pyspark_sort.py but demonstrates reading the input from a GCS bucket.
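The sort job itself can be as small as the following sketch (the actual pyspark_sort.py may differ in details):

```python
import pyspark

# Create a Spark context, sort a small in-memory dataset, and print the result.
sc = pyspark.SparkContext()
rdd = sc.parallelize(["Hello,", "world!", "dog", "elephant", "panther"])
print(sorted(rdd.collect()))
```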

Prerequisites to run locally:

Go to the Google Cloud Console.

Under API Manager, search for the Google Cloud Dataproc API and enable it.

Set Up Your Local Dev Environment

To install the dependencies, run the following command. If you want to use virtualenv (recommended), run it within a virtualenv.

pip install -r requirements.txt


Authentication

Please see the Google Cloud authentication guide. The recommended approach for running these samples is a service account with a JSON key; point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the key file so the client library can find it.

Environment Variables

Set the following environment variables:

GOOGLE_CLOUD_PROJECT=your-project-id
REGION=us-central1 # or your region

Running the samples

To run list_clusters.py:

python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

submit_job_to_cluster.py can create the Dataproc cluster or use an existing cluster. To create a cluster before running the code, you can use the Cloud Console or run:

gcloud dataproc clusters create your-cluster-name --region=$REGION
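Cluster creation through the API looks roughly like the sketch below; the machine types and instance counts are illustrative, not necessarily what submit_job_to_cluster.py uses.

```python
from google.cloud import dataproc_v1


def create_cluster(project_id, region, cluster_name):
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }
    # create_cluster returns a long-running operation; result() blocks until done.
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")
```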

To run submit_job_to_cluster.py, first create a GCS bucket (used by Cloud Dataproc to stage files) from the Cloud Console or with gsutil:

gsutil mb gs://<your-staging-bucket-name>
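Staging the PySpark file in that bucket can also be done from Python with the google-cloud-storage library; a minimal sketch (the function and argument names here are illustrative):

```python
from google.cloud import storage


def upload_pyspark_file(bucket_name, filename):
    # Upload the local job file to the staging bucket so Dataproc can read it.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(filename)
    blob.upload_from_filename(filename)


upload_pyspark_file("your-staging-bucket-name", "pyspark_sort.py")
```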

Next, set the following environment variables:

BUCKET=your-staging-bucket
CLUSTER=your-cluster-name
Then, if you want to use an existing cluster, run:

python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Alternatively, to create a new cluster, which will be deleted at the end of the job, run:

python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

The script will set up a cluster, upload the PySpark file, submit the job, print the result, and then, if it created the cluster, delete it.
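The job-submission step, in sketch form, assuming the same dataproc_v1 client library (argument names are illustrative):

```python
from google.cloud import dataproc_v1


def submit_pyspark_job(project_id, region, cluster_name, bucket_name, filename):
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": f"gs://{bucket_name}/{filename}"},
    }
    # submit_job_as_operation returns a long-running operation that completes
    # when the job finishes running on the cluster.
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()
    print(f"Job finished with state {response.status.state.name}")
```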

Optionally, you can add the --pyspark_file argument to change from the default pyspark_sort.py included with this sample to your own script.
