Cloud Dataproc Initialization Actions

When creating a Dataproc cluster, you can specify initialization actions: executables or scripts that Dataproc runs on all nodes in the cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without installing dependencies at run time.

How initialization actions are used

Initialization actions must be stored in a Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
    [--initialization-actions [GCS_URI,...]] \
    [--initialization-action-timeout TIMEOUT]
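
To illustrate the comma-separated `GCS_URI,...` form, here is a minimal sketch that joins several script URIs and prints the resulting command instead of running it. The cluster name, bucket name, `custom/setup.sh` path, and `join_uris` helper are placeholders invented for this example; the `10m` timeout is just a sample value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholders (assumptions, not part of this repository): use your own values.
CLUSTER="example-cluster"
BUCKET="example-init-actions-bucket"
SCRIPTS=("presto/presto.sh" "custom/setup.sh")  # custom/setup.sh is hypothetical

# Build one gs:// URI per script and join them with commas,
# which is the list form --initialization-actions accepts.
join_uris() {
  local bucket="$1"; shift
  local uris=()
  local script
  for script in "$@"; do
    uris+=("gs://${bucket}/${script}")
  done
  local IFS=','
  echo "${uris[*]}"
}

URIS="$(join_uris "${BUCKET}" "${SCRIPTS[@]}")"

# Dry run: print the command for review rather than executing it.
echo gcloud dataproc clusters create "${CLUSTER}" \
  --initialization-actions "${URIS}" \
  --initialization-action-timeout 10m
```

Dropping the leading `echo` would run the command for real, which requires an authenticated gcloud installation and scripts that actually exist at those URIs.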

During development, you can create a Dataproc cluster using Dataproc-provided regional initialization actions buckets (for example goog-dataproc-initialization-actions-us-east1):

REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh

⚠️ NOTICE: For production use, it is strongly recommended that you copy initialization actions to your own Cloud Storage bucket before creating clusters. This guarantees that all Dataproc cluster nodes run the same initialization action code and prevents unintended upstream upgrades from reaching the cluster:

BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
gsutil cp presto/presto.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh

You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. Doing this is also useful if you want to modify initialization actions to meet your needs.

Why these samples are provided

These samples show how various packages and components can be installed on Dataproc clusters. You should understand how these samples work before running them on your clusters. The initialization actions in this repository are provided without support, and you use them at your own risk.

Actions provided

This repository currently offers the following actions for use with Dataproc clusters.

Install additional Apache Hadoop ecosystem components
Improve data science and interactive experiences
Configure the environment
- Configure a nice shell environment
- To switch to Python 3, use the conda initialization action
Connect to Google Cloud Platform services
- Install alternate versions of the Cloud Storage and BigQuery connectors. Specific versions of these connectors come pre-installed on Cloud Dataproc clusters.
- Share a Cloud SQL Hive Metastore, or simply read/write data from Cloud SQL.
Set up monitoring
- Stackdriver
- Ganglia

Removed actions

Previously, this repo provided init actions for the following, which have since been removed because equivalent functionality is now provided directly by Dataproc:

- Apache Tez: This is now pre-installed in all current Dataproc image versions.
- Datalab: Datalab has been replaced by Vertex AI Workbench, which integrates with Dataproc.
- Jupyter: This has been replaced with the Jupyter Optional Component.

Initialization actions on single node clusters

Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box because they run only on the master. Examples include notebooks, such as Apache Zeppelin, and libraries, such as Apache Tez. Actions that run on all nodes of the cluster, such as cloud-sql-proxy, also work out of the box.

Some initialization actions are known not to work on Single Node clusters. All of these expect to have daemons on multiple nodes.

- Apache Drill
- Apache Flink
- Apache Kafka
- Apache Zookeeper

Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.
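
An initialization action that needs daemons on multiple nodes could guard itself using the two metadata values described above. The following is only a sketch: the `is_single_node` and `read_attribute` helpers are hypothetical names, and the fallback defaults exist solely so the script can be tried on a machine that is not a Dataproc node.

```shell
#!/usr/bin/env bash
set -euo pipefail

# True when the given metadata values describe a Single Node cluster:
# dataproc-role is Master and dataproc-worker-count is 0.
is_single_node() {
  local role="$1" worker_count="$2"
  [[ "${role}" == 'Master' && "${worker_count}" == '0' ]]
}

# Read a metadata attribute on a cluster node. When the helper is absent
# (for example, off-cluster), print the supplied default instead.
read_attribute() {
  local name="$1" default="$2"
  local helper=/usr/share/google/get_metadata_value
  if [[ -x "${helper}" ]]; then
    "${helper}" "attributes/${name}" || echo "${default}"
  else
    echo "${default}"
  fi
}

ROLE="$(read_attribute dataproc-role Master)"
WORKER_COUNT="$(read_attribute dataproc-worker-count 0)"

if is_single_node "${ROLE}" "${WORKER_COUNT}"; then
  echo "Single Node cluster detected; skipping multi-node daemon setup."
else
  echo "Multi-node cluster; continuing with daemon setup."
fi
```

Whether to skip quietly or fail the action on a Single Node cluster is a design choice; skipping lets cluster creation succeed, while failing makes the incompatibility obvious.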

Using cluster metadata

Dataproc sets special metadata values for the instances that run in your cluster. You can use these values to customize the behavior of initialization actions, for example:

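# Look up this node's role from the instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # ... master specific actions ...
  :  # no-op placeholder so the branch is valid bash
else
  # ... worker specific actions ...
  :  # no-op placeholder so the branch is valid bash
fi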

You can also use the --metadata flag of the gcloud dataproc clusters create command to provide your own custom metadata:

gcloud dataproc clusters create cluster-name \
    --initialization-actions ... \
    --metadata name1=value1,name2=value2,... \
    ... other flags ...
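
Inside an initialization action, custom keys such as `name1` can then be read back with the same metadata helper. The sketch below assumes the helper exits with a non-zero status when a key is missing, so a default can be substituted; the `METADATA_HELPER` override, the `get_custom_metadata` function, and the fallback values are hypothetical and exist only to make the example runnable off-cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path of the metadata helper on Dataproc nodes. METADATA_HELPER is a
# hypothetical override so the sketch can be exercised off-cluster.
METADATA_HELPER="${METADATA_HELPER:-/usr/share/google/get_metadata_value}"

# Print the value of a custom metadata key, or the default if the lookup
# fails (key not set, or helper not available).
get_custom_metadata() {
  local key="$1" default="$2"
  "${METADATA_HELPER}" "attributes/${key}" 2>/dev/null || echo "${default}"
}

NAME1="$(get_custom_metadata name1 'fallback1')"
NAME2="$(get_custom_metadata name2 'fallback2')"
echo "name1=${NAME1} name2=${NAME2}"
```

Supplying a default keeps the action usable when the cluster is created without the corresponding `--metadata` entry.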

For more information

For more information, review the Dataproc documentation. You can also ask questions on Stack Overflow using the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repositories for sample applications and scaffolding for other frameworks and use cases.

Mailing list

Subscribe to cloud-dataproc-discuss@google.com for announcements and discussion.

Contributing changes

See CONTRIBUTING.md

Licensing

See LICENSE

FAQ

You might see an error message similar to the following when upgrading the agent, installing the agent, or running apt-get update on Debian/Ubuntu Linux:

E: Repository 'https://packages.cloud.google.com/apt google-cloud-monitoring-buster-all InRelease' changed its 'Origin' value from 'google-cloud-monitoring-buster' to 'namespaces/cloud-ops-agents-artifacts/repositories/google-cloud-monitoring-buster-all'
E: Repository 'https://packages.cloud.google.com/apt google-cloud-monitoring-buster-all InRelease' changed its 'Label' value from 'google-cloud-monitoring-buster' to 'namespaces/cloud-ops-agents-artifacts/repositories/google-cloud-monitoring-buster-all'

This message indicates that the package repository cache may have diverged from its source. To resolve this, run the following command:

apt-get --allow-releaseinfo-change update

Then, run the upgrade or install again.
