Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
ojarjur Require openjdk to be authenticated (#621)
This commit stops passing the `--allow-unauthenticated` flag when installing the `openjdk-8-jre-headless` package.

That flag was added in #620 to work around an issue in installing the hive and spark-python packages. Since openjdk-8-jre-headless was being installed in the same command, the flag was applied to it as well, even though it was not necessary.

This change locks that back down by splitting the openjdk-8-jre-headless installation into a separate command that does not include the `--allow-unauthenticated` flag.

This change also increases the timeout for the Datalab test.
Latest commit 79db60e Sep 5, 2019
Type Name Latest commit message Commit time
beam Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
bigdl Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
bigtable Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
cloud-sql-proxy Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
cloudbuild Declare TESTS_TO_RUN as a global variable (#596) Jul 10, 2019
conda Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
connectors Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
datalab Require openjdk to be authenticated (#621) Sep 5, 2019
docker Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
dr-elephant Make all scripts executable (#434) Jan 24, 2019
drill Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
flink Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
ganglia Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
gobblin Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
gpu Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
hbase Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
hive-hcatalog Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
hue Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
integration_tests Extract "assert_instance_command" method (#600) Jul 16, 2019
ipython-notebook Deprecate the ipython-notebook init action in favor of Jupyter Notebook Sep 28, 2017
jupyter Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
jupyter2 Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
kafka Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
livy Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
oozie Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
openssl Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
post-init Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
presto Update presto version to 0.224 (#613) Aug 14, 2019
prometheus Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
python Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
ranger Remove dependency on non-existent bucket for ranger. (#609) Aug 9, 2019
rapids Extract "assert_instance_command" method (#600) Jul 16, 2019
rstudio Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
solr Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
spark-nlp Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
stackdriver Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
starburst-presto Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
tez Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
tony Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
user-environment Improve consistency of init action READMEs (#262) May 17, 2018
util Make all scripts executable (#434) Jan 24, 2019
zeppelin Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
zookeeper Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
.gitignore Ignore temp files (#567) Jun 13, 2019
CONTRIBUTING.md Add link to style guide to contributing.md (#257) May 17, 2018
LICENSE Initial Commit Oct 12, 2015
README.md Replace dataproc-initialization-actions with $MY_BUCKET in README (#606) Jul 25, 2019
favicon.ico Initial Commit Oct 12, 2015
push-to-gcs.sh Ignore temp files (#567) Jun 13, 2019

README.md

Cloud Dataproc Initialization Actions

When creating a Google Cloud Dataproc cluster, you can specify initialization actions, in the form of executables or scripts, that Cloud Dataproc will run on all nodes in the cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.

How initialization actions are used

Initialization actions are stored in a Google Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
  [--initialization-actions [GCS_URI,...]] \
  [--initialization-action-timeout TIMEOUT]

Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example:

MY_BUCKET=<gcs-bucket>
gsutil cp presto/presto.sh gs://$MY_BUCKET/
gcloud dataproc clusters create my-presto-cluster \
  --initialization-actions gs://$MY_BUCKET/presto.sh

You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. This is also useful if you want to modify initialization actions to fit your needs.
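
When a cluster needs more than one staged action, the `[GCS_URI,...]` form in the synopsis above suggests a comma-separated list. The following is a minimal sketch of building that list, assuming the scripts have already been copied to your bucket. The helper name `init_action_uris`, the fallback bucket name `my-bucket`, the second script name, and the `10m` timeout are illustrative assumptions, not part of this repository.

```shell
# Sketch: build the value for --initialization-actions from script names
# staged in your own bucket. Names below are placeholders.
MY_BUCKET="${MY_BUCKET:-my-bucket}"   # hypothetical fallback bucket name

# Print "gs://<bucket>/<script>" for each script, joined by commas.
init_action_uris() {
  bucket="$1"
  shift
  result=""
  for script in "$@"; do
    # Prepend a comma only when result is already non-empty.
    result="${result:+${result},}gs://${bucket}/${script}"
  done
  printf '%s\n' "${result}"
}

INIT_ACTIONS="$(init_action_uris "${MY_BUCKET}" presto.sh zookeeper.sh)"
echo "${INIT_ACTIONS}"

# To create the cluster for real, something along these lines should work:
# gcloud dataproc clusters create my-cluster \
#   --initialization-actions "${INIT_ACTIONS}" \
#   --initialization-action-timeout 10m
```

The `gcloud` call is left commented out so the snippet can be run safely to inspect the generated list first.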

Why these samples are provided

These samples show how various packages and components can be installed on Cloud Dataproc clusters. You should understand how these samples work before running them on your clusters. The initialization actions in this repository are provided without support, and you use them at your own risk.

Actions provided

This repository presently offers the following actions for use with Cloud Dataproc clusters.

Initialization actions on single node clusters

Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box, as they run only on the master. Examples include notebooks (such as Apache Zeppelin) and libraries (such as Apache Tez). Actions that run on all nodes of the cluster (such as cloud-sql-proxy) similarly work out of the box.

Some initialization actions are known not to work on Single Node clusters. All of these expect to have daemons on multiple nodes.

  • Apache Drill
  • Apache Flink
  • Apache Kafka
  • Apache Zookeeper

Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.
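
An action that expects daemons on multiple nodes could guard itself against Single Node clusters. The sketch below assumes that `dataproc-worker-count` can be read through the same metadata helper and `attributes/` prefix used for `dataproc-role` elsewhere in this README. The function name `is_single_node` is hypothetical.

```shell
# Sketch: exit early when the cluster is Single Node.
# Assumes dataproc-worker-count is exposed under attributes/, like dataproc-role.
HELPER="/usr/share/google/get_metadata_value"

# Succeeds when the given worker count denotes a Single Node cluster.
is_single_node() {
  [ "$1" = "0" ]
}

# The helper only exists on cluster nodes, so do nothing elsewhere.
if [ -x "${HELPER}" ]; then
  WORKER_COUNT="$("${HELPER}" attributes/dataproc-worker-count)"
  if is_single_node "${WORKER_COUNT}"; then
    echo "Single Node cluster detected; skipping multi-node setup." >&2
    exit 0
  fi
fi
```

Exiting with status 0 lets cluster creation continue. An action that should instead fail loudly could exit with a non-zero status.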

Using cluster metadata

Cloud Dataproc sets special metadata values for the instances that run in your cluster. You can use these values to customize the behavior of initialization actions, for example:

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # ... master-specific actions ...
  :
else
  # ... worker-specific actions ...
  :
fi

You can also use the --metadata flag of the gcloud dataproc clusters create command to provide your own custom metadata:

gcloud dataproc clusters create cluster-name \
  --initialization-actions ... \
  --metadata name1=value1,name2=value2... \
  ... other flags ...
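
Inside an initialization action, custom metadata can presumably be read with the same helper used for `dataproc-role`. The sketch below falls back to a default when the key is unset or the helper is missing. The function name `get_metadata_or_default` and the default value are assumptions for illustration; `name1` is the placeholder key from the command above.

```shell
# Sketch: read a custom metadata key with a fallback default.
HELPER="${HELPER:-/usr/share/google/get_metadata_value}"

# Print the value of attributes/<key>, or <default> when the helper is
# unavailable, fails, or returns an empty value.
get_metadata_or_default() {
  key="$1"
  default="$2"
  if [ -x "${HELPER}" ] \
      && value="$("${HELPER}" "attributes/${key}")" \
      && [ -n "${value}" ]; then
    printf '%s\n' "${value}"
  else
    printf '%s\n' "${default}"
  fi
}

NAME1="$(get_metadata_or_default name1 fallback-value)"
echo "name1=${NAME1}"
```

Supplying a default keeps the action usable on clusters created without the `--metadata` flag.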

For more information

For more information, review the Cloud Dataproc documentation. You can also ask questions on Stack Overflow using the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.

Mailing list

Subscribe to cloud-dataproc-discuss@google.com for announcements and discussion.

Contributing changes

Licensing
