ENH: Feature/conda init action/develop #18
Conversation
Hit a failure on initialization; $HOME fails to resolve for some reason:
Force-pushed from 497a562 to 6c1144a
Force-pushed from b6ed7d8 to 0c72e78
Hi @dennishuo, for everyone's sanity, I went ahead and cleaned up the commit history in this PR to something more reasonable; hope it helps. :) For reference, I also have a copy of the previous branch here. Please let me know what else I can do, and thanks for your help with this.
Force-pushed from 0c72e78 to 7b5b509
Thanks for the cleanups! How were you testing the scripts as initialization actions? When I create a cluster with the two scripts as initialization actions, and then run the pyspark job to check the paths, I get the following:
That's if I run as a regular (non-root) user.
When running as root I can confirm that the default python is the miniconda python:
@dennishuo, I'm checking on this and will get back to you soon. Thanks for the feedback. :)
FWIW I just tried changing to use the Python 2 version instead of Python 3 and it seemed to fix it:
Force-pushed from e692526 to 04373a8
Hi @dennishuo, thanks again for the feedback. Took me a while longer to get back to you than expected...

To test (besides getting the cluster to launch, the Spark shell to run, and executing a few examples), we need to ensure that the worker nodes (executors) reference the correct (conda) python distro. This can be done with a simple job that resolves the list of distinct paths to the python executable found across each partition in an RDD; what you're doing looks right to me. :) I'm working on updating the README with info on this and am adding some files to support it (e.g., get-sys-exec.py).

With that, addressing a few more topics:

1. Python 3 PYTHONHASHSEED exception

I was able to reproduce the exception you hit in Python 3, which somehow I hadn't hit previously (I've been working mostly in Python 2). A little sleuthing turns up that this is a known bug in PySpark, with the rdd.py module not setting the PYTHONHASHSEED environment variable. The reporter of the jira issue (assuming they have the same name) also posted a fix on his blog: Python 3 on Spark - Return of the PYTHONHASHSEED. I am working on implementing this in this init action and hope to have a fix soon.

2. Different users, different results

Resolving different paths for different users (e.g., root vs. a regular user) could be handled per-user or via global profiles. What are your thoughts / preferences?

3. Remote (Dataproc API) vs. local job submittal

You may have also noticed that the results depend on where you submit the job. Submitting locally on the master references the conda distribution:

```
> spark-submit get-sys-exec.py
...
['/usr/local/bin/miniconda/bin/python']
...
```

However, when submitting a job remotely using the Dataproc API, it references the default python distribution:

```
> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
...
['/usr/bin/python']
...
```

The Dataproc API jobs for sure run under a different shell, and that shell's PATH doesn't pick up the conda distribution.

Thanks much, look forward to your feedback.
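For readers following along, here is a minimal sketch of the kind of path-checking job described above (illustrative only; the actual get-sys-exec.py in this PR may differ):

```python
# Sketch of a job that collects the distinct python executable paths
# seen across the partitions of an RDD.
from pyspark import SparkContext


def executable_path(_):
    # Imported inside the function so it resolves on the executor.
    import sys
    return sys.executable


sc = SparkContext()

# Spread a trivial dataset across several partitions, resolve the python
# executable used for each element, and collect the distinct paths.
paths = sc.parallelize(range(64), 8).map(executable_path).distinct().collect()
print(paths)  # e.g. ['/usr/local/bin/miniconda/bin/python']
```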
Thanks for the summary.

For (1), if you have more trouble getting Python 3 to work, it seems we can at least just move forward with the Python 2 version of miniconda as the default for now, unless there's a pressing need to strongly prefer Python 3 initially.

For (2), I think global profiles would be a good approach; we do a similar thing in the related bdutil scripts, since bdutil installs from tarballs, as opposed to Dataproc's distro installation into actual user paths under /usr/bin.

For (3), I think it'd be nice to investigate a bit more how to get remote (Dataproc API) jobs to run against the conda distribution as well.
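A minimal sketch of the global-profile approach, assuming the Miniconda prefix seen elsewhere in this thread (/usr/local/bin/miniconda); the profile file name is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: put conda on PATH for all users via a global
# profile entry, rather than editing per-user ~/.bashrc files.
CONDA_BIN_PATH="/usr/local/bin/miniconda/bin"  # assumed install prefix

cat << EOF > /etc/profile.d/conda.sh
export PATH=${CONDA_BIN_PATH}:\$PATH
EOF
```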
Force-pushed from e98c562 to d711073
Hi @dennishuo, thanks for the follow-up.

If it sounds good to you, I'll keep this PR moving forward by reverting to having conda install Python 2 by default, and we can update this with Python 3 once we iron out the other issue. We still need to test whether it (correctly) supports remote jobs (i.e., that remote jobs run against the conda distribution). Thanks again for your time.
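The revert itself would be a small change in bootstrap-conda.sh; a hedged sketch, assuming Continuum's installer naming conventions of the time (variable names hypothetical):

```bash
# Hypothetical sketch: default to the Python 2 Miniconda installer.
MINICONDA_VARIANT="Miniconda2"   # switch back to "Miniconda3" for Python 3
MINICONDA_INSTALLER="${MINICONDA_VARIANT}-latest-Linux-x86_64.sh"

wget "https://repo.continuum.io/miniconda/${MINICONDA_INSTALLER}"
# -b: batch (non-interactive) mode; -p: the install prefix used in this thread.
bash "${MINICONDA_INSTALLER}" -b -p /usr/local/bin/miniconda
```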
- Add README
- Add bootstrap-conda.sh for Miniconda install directories on *nix boxes
- Statically define /root for expected path (fixes failure in resolving $HOME)
- Update gcloud SDK
- Install py4j by default with conda install
- Add conda bin to PATH
- Export PYTHONHASHSEED env var across the cluster (to resolve https://issues.apache.org/jira/browse/SPARK-12100)
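On the PYTHONHASHSEED commit, a minimal sketch of the kind of cluster-wide export involved, assuming the stock Dataproc Spark config paths (the PR's exact mechanism may differ):

```bash
# Hypothetical sketch of the SPARK-12100 workaround: pin PYTHONHASHSEED so
# Python 3's randomized string hashing is consistent between the driver and
# executors (otherwise distributed ops like distinct() can fail).
echo "export PYTHONHASHSEED=0" >> /etc/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
```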
Force-pushed from d711073 to aceb38e
Force-pushed from 88418a6 to 811ab55
Force-pushed from 811ab55 to 132f7dc
Hey, sorry for the delays. Did you have any luck getting remote jobs to run against the conda distribution? I tried a fresh run with your latest updates, and I'm hitting errors on what appears to be the `pip install gcloud` step:

Any ideas what may have changed recently to produce this error?
Hi @dennishuo! Sure, I'll take a look now. Could you give me the command-line sequence you're executing when launching the cluster, so we can be consistent in how we're testing?
I just uploaded the two .sh files to GCS and then ran: |
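The exact command isn't shown above; a typical invocation of that shape, with a hypothetical bucket and a hypothetical name for the second script, would be:

```bash
# Hypothetical example; bucket, zone, and the second script's name are placeholders.
gsutil cp bootstrap-conda.sh install-conda-env.sh gs://my-bucket/init-actions/

gcloud beta dataproc clusters create conda-cluster \
  --zone us-central1-a \
  --initialization-actions \
    gs://my-bucket/init-actions/bootstrap-conda.sh,gs://my-bucket/init-actions/install-conda-env.sh
```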
@dennishuo, thanks. The failure stems from installing the gcloud python client via `pip install gcloud`. While that seems like a nice dependency to have as a default, it's not fundamental, so I've removed it from the script.

Also, in testing, now that we're not using Python 3, things should work for both local and remote jobs, as specified above. See the updated testing info in the README and let me know if you still have any issues. :)

PS. It would be super nice to set up some sort of testing workflow along with a CI service to automate development of these init actions. :)

PPS. Any thoughts on the Python 3 issue I mentioned above and in #25?
Excellent, everything tested cleanly, thanks so much for putting this together! Closed with 2d1ff5d.

Agreed that we'll want to set up some CI/testing service to make it easier to grow community involvement; I'll be sure to ping you if we get such a thing set up.

Thanks for the in-depth notes in #25 - we'll definitely keep track of Python 3 support as a feature for future Dataproc versions, but unfortunately it may be at least several weeks before we can get around to starting an in-depth plan for the upgrade as we work through our existing set of feature plans.
A start for #17.