
Cloud Dataproc Initialization Actions

When creating a Google Cloud Dataproc cluster, you can specify initialization actions: executables or scripts that Cloud Dataproc runs on all nodes in the cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

How initialization actions are used

Initialization actions are stored in a Google Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
  [--initialization-actions [GCS_URI,...]] \
  [--initialization-action-timeout TIMEOUT]

For convenience, copies of the initialization actions in this repository are stored in the publicly accessible Cloud Storage bucket gs://dataproc-initialization-actions. The folder structure of this bucket mirrors this repository, so you can reference the bucket (and the initialization scripts within it) directly when creating your clusters.

For example:

gcloud dataproc clusters create my-presto-cluster \
  --initialization-actions gs://dataproc-initialization-actions/presto/presto.sh

You are strongly encouraged to copy initialization actions to your own GCS bucket in automated pipelines to ensure hermetic deployments. For example:

MY_BUCKET=<gcs-bucket>
gsutil cp presto/presto.sh gs://$MY_BUCKET/
gcloud dataproc clusters create my-presto-cluster \
  --initialization-actions gs://$MY_BUCKET/presto.sh

This is also useful if you want to modify initialization actions to fit your needs.
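
When a cluster needs more than one action, the --initialization-actions flag takes a comma-separated list of GCS URIs. The following is a minimal sketch of building that list for scripts copied into your own bucket. The function name init_action_uris, the bucket name my-bucket, and the zookeeper/zookeeper.sh path are assumptions made for illustration, and the sketch only prints the gsutil and gcloud commands rather than running them.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn repository-relative script paths into a
# comma-separated list of GCS URIs under the given bucket.
init_action_uris() {
  local bucket="$1"
  shift
  local uris=() path
  for path in "$@"; do
    uris+=("gs://${bucket}/${path}")
  done
  # Join the array elements with commas.
  local IFS=,
  echo "${uris[*]}"
}

MY_BUCKET="my-bucket"  # placeholder: replace with your own bucket name
# presto/presto.sh appears in this README; the second path is illustrative.
ACTIONS=(presto/presto.sh zookeeper/zookeeper.sh)
URIS="$(init_action_uris "${MY_BUCKET}" "${ACTIONS[@]}")"

# Print the commands instead of executing them, so the sketch is safe to try.
for path in "${ACTIONS[@]}"; do
  echo "gsutil cp ${path} gs://${MY_BUCKET}/${path}"
done
echo "gcloud dataproc clusters create my-cluster --initialization-actions ${URIS}"
```

Unlike the example above, this sketch keeps the repository's folder layout inside the bucket, which avoids name clashes when several scripts are copied.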

Why these samples are provided

These samples show how various packages and components can be installed on Cloud Dataproc clusters. You should understand how these samples work before running them on your clusters. The initialization actions in this repository are provided without support; use them at your own risk.

Actions provided

This repository presently offers initialization actions for a range of components for use with Cloud Dataproc clusters. Each action lives in its own directory, such as presto/ for Presto.

Initialization actions on single node clusters

Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box, as they run only on the master. Examples include notebooks (such as Apache Zeppelin) and libraries (such as Apache Tez). Actions that run on all nodes of the cluster (such as cloud-sql-proxy) similarly work out of the box.

Some initialization actions are known not to work on Single Node clusters. All of these expect to have daemons on multiple nodes.

  • Apache Drill
  • Apache Flink
  • Apache Kafka
  • Apache ZooKeeper

Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.

Using cluster metadata

Cloud Dataproc sets special metadata values for the instances that run in your cluster. You can use these values to customize the behavior of initialization actions, for example:

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # ... master-specific actions ...
  :
else
  # ... worker-specific actions ...
  :
fi
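
The role check can be combined with the worker count to recognize a Single Node cluster, which this README describes as having dataproc-role set to Master and dataproc-worker-count set to 0. The following is a rough sketch: the function name node_kind is hypothetical, and the attributes/dataproc-worker-count path in the comments is an assumption modeled on the attributes/dataproc-role path shown above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: classify a node from its role and the cluster's worker
# count. Values are passed as arguments so the function can be tried anywhere.
node_kind() {
  local role="$1" worker_count="$2"
  if [[ "${role}" == 'Master' && "${worker_count}" == '0' ]]; then
    echo 'single-node'
  elif [[ "${role}" == 'Master' ]]; then
    echo 'master'
  else
    echo 'worker'
  fi
}

# On a cluster node, the arguments would come from the metadata helper
# (the worker-count path is an assumption of this sketch):
#   ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
#   WORKER_COUNT=$(/usr/share/google/get_metadata_value attributes/dataproc-worker-count)
#   KIND=$(node_kind "${ROLE}" "${WORKER_COUNT}")

# Sample values for illustration:
node_kind Master 0   # prints "single-node"
node_kind Master 2   # prints "master"
node_kind Worker 2   # prints "worker"
```

An action that needs daemons on multiple nodes could use such a check to exit early with a clear message on a Single Node cluster.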

You can also use the --metadata flag of the gcloud dataproc clusters create command to provide your own custom metadata:

gcloud dataproc clusters create cluster-name \
  --initialization-actions ... \
  --metadata name1=value1,name2=value2... \
  ... other flags ...
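
Inside an initialization action, custom metadata can be read with the same helper used for dataproc-role. The sketch below falls back to a default when the key is unset. The function name get_metadata_or_default, the METADATA_HELPER override variable, and the fallback value are assumptions for illustration; name1 matches the placeholder key in the command above. The sketch also assumes the helper exits with a non-zero status or prints nothing when the key is missing.

```shell
#!/usr/bin/env bash

# Path to the metadata helper used elsewhere in this README. The override
# variable is an assumption of this sketch so it can be tried off-cluster.
METADATA_HELPER="${METADATA_HELPER:-/usr/share/google/get_metadata_value}"

# Hypothetical helper: print the value of a custom metadata key, or the given
# default when the key is unset or the metadata helper is unavailable.
get_metadata_or_default() {
  local key="$1" default="$2" value
  if value="$("${METADATA_HELPER}" "attributes/${key}" 2>/dev/null)" \
      && [[ -n "${value}" ]]; then
    echo "${value}"
  else
    echo "${default}"
  fi
}

# name1 is the placeholder key from the --metadata example; the default is
# illustrative.
NAME1="$(get_metadata_or_default name1 fallback-value)"
echo "name1=${NAME1}"
```

Supplying a default keeps the action usable on clusters that were created without the corresponding --metadata entry.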

For more information

For more information, review the Cloud Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repositories for sample applications and scaffolding for other frameworks and use cases.

Mailing list

Subscribe to cloud-dataproc-discuss@google.com for announcements and discussion.

Contributing changes

Licensing