Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow install-conda-env.sh to access conda env create functionality #58

Conversation

nehalecky
Copy link
Contributor

This is a substantial refactor of the conda init action:

  • exposes environment variables in bootstrap-conda.sh and install-conda-env.sh, allowing more configuration of both
  • allows for use of conda environment yaml files for specifying dependencies
  • updates documentation about it all, with examples.

nehalecky and others added 6 commits June 26, 2016 20:45
ENH: Add spark-csv as dependency to Jupyter pyspark kernel
BUG: Change from tee to > for jupyter log file redirect
BUG: jupyter command not found, updating PATH from within launch-jupyter.sh the same way setup-jupyter.sh does
ENH: Clean up launch-jupyter-interface.sh
ENH: Use python yaml package to parse cluster metadata
BUG: Fix PATH in kernel-based setup/launch scripts
ENH: Add zone as argument for launch-jupyter-interface.sh
@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request) and all commit authors, but as best as we can tell these commits were authored by someone else. If that's the case, please add them to this pull request and have them confirm that they're okay with these commits being contributed to Google. If we're mistaken and you did author these commits, just reply here to confirm.

@nehalecky
Copy link
Contributor Author

@dennishuo: any advice for how to get @jeffkpayne all squared away with the CLA on this PR?

Thanks!

@dennishuo dennishuo self-assigned this Jun 27, 2016
@dennishuo
Copy link
Contributor

@nehalecky There's still no automated process for the multi-author case, but any "LGTM" or +1 comment from @jeffkpayne on this PR will satisfy the current requirements for a committer to merge the PR despite the googlebot label.

@jeffkpayne
Copy link
Contributor

LGTM

@@ -13,15 +13,17 @@ sudo apt-get install -y git
# > Command failed: /bin/bash -x -e
# > /home/vagrant/miniconda-3.18.3/conda-bld/work/conda_build.sh
# If directory doesn't exist, attempt to create it.
if [ ! -d "~/.local/share/jupyter/" ]; then
if [[ ! -d "~/.local/share/jupyter/" ]]
then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do you currently have a particular shell style guide preference? We haven't quite been consistent ourselves in enforcing shell style, though we've been more strict in some subprojects than others (especially if produced initially within Google) and generally follow the Google bash style guide

For now I'm okay with some variation since we haven't really been providing guidance on style guide decisions yet and certainly no need to block this PR on updating style consistency on unrelated parts of the scripts, but as a heads up, in the near future we may go through and do some cleanup to improve style consistency between the init actions (which in this case would change back to "if [[ ! -d ... ]]; then" all on one line).

Let me know if such refactoring would be a potential pain point for you to re-merge into your forks and/or whether you'd prefer to update any pieces accordingly within this PR before I test and merge it.

@nehalecky
Copy link
Contributor Author

nehalecky commented Jun 28, 2016

Hi @dennishuo, thanks for the comments and feedback above.

In short, we'd be fine with changes you're proposing, and agree that it'd be great to follow a shell style convention—thanks for the link! We'd like to help, but likely better that you do a first pass, which would help us learn the style as applied to this code. At quick glance, I didn't see a convention for how to include default setting for environment variables, and hope you find our convention useful.

Expanding on a few more thoughts:
In reviewing the style guide, I couldn't agree more with the guide's top section When to use Shell, in particular:

If you are writing a script that is more than 100 lines long, you should probably be writing it in Python instead. Bear in mind that scripts grow. Rewrite your script in another language early to avoid a time-consuming rewrite at a later date.

Indeed, the complexity of our init action scripts continues to grow in a number of ways:

  • exposing and managing configurations
  • managing init action dependencies
  • overlapping common functionality (e.g., fetching metadata, retrieving objects from GCS, default launch templates)

We're of the opinion that, before we try and reinvent a provisioning library entirely in bash, we're at a critical point now where we should consider alternatives. We also agree that python would be a straightforward option, and in particular adoption of a higher level provisioning library (like ansible and salt, which already have playbooks and recipes that support installing many), would be a significant benefit (as mentioned in #15). Other alternatives might be in support of containers, etc.

We see this as saving us a considerable amount of effort supporting and expanding capabilities in provisioning a Dataproc cluster. So much so that, we're considering an init action to simply install one of these libraries across the cluster so we can leverage such functionality, however, we'd like to engage with you to ensure our efforts align with whatever roadmap you have established for this. Hope that's a possibility.

Thanks again, @dennishuo, appreciate your efforts. :)

@dennishuo
Copy link
Contributor

@nehalecky good points; the good news is that even though initially we had required strictly shell scripts for init actions, nowadays any executable can be used as an init action, so it should work today to use Python in place of bash, just making sure to have the shebang specifying python #!/usr/bin/python or preferably #!/usr/bin/env python.

Of course, that wouldn't solve the problems of custom shared init-action libraries, managing init action dependencies, etc.

The roadmap for higher-level provisioning libraries is still a bit murky, but we'll be happy to start bouncing ideas back and forth. Ansible might not be a good fit because we'd generally want to make sure the provisioning system plays well with runtime scaling adding/removing VMs safely even in the middle of a job, so running on new nodes and keeping the hosts file in sync could be tricky. On the other hand it may still be possible to reuse Ansible modules to some extent inside of init actions.

If you were to create an init action for installing a provisioning framework across the cluster, would it be more for extensibility of what to deploy on the initial deployment, or would it be as an admin tool for continuing to modify the cluster as it's already running? I filed #59 for further discussion of extensibility models.

@nehalecky
Copy link
Contributor Author

@dennishuo. Thanks. Wow, great to learn about support for python scripts, too! Excited to jump in with further discussions on #59 and hope to do so by the end of this week. :)

@dennishuo
Copy link
Contributor

Looks good, tested cleanly:

gsutil cp jupyter/jupyter.sh gs://hadoop-cloud-dev-dhuo/dataproc-initialization-actions/jupyter/jupyter.sh
gcloud dataproc clusters create dhuo-jupyter --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/bomboradata/dataproc-initialization-actions.git,INIT_ACTIONS_BRANCH=feature/allow-conda-env-yml" --bucket hadoop-cloud-dev-dhuo --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m --initialization-actions gs://hadoop-cloud-dev-dhuo/dataproc-initialization-actions/jupyter/jupyter.sh

@dennishuo
Copy link
Contributor

Closed by a4225b6 3fef7ce be01a58 4f7ad16

@dennishuo dennishuo closed this Jun 28, 2016
@nehalecky nehalecky deleted the feature/allow-conda-env-yml branch June 29, 2016 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants