ENH: Feature/conda init action/develop #18
Conversation
Hit a failure on initialization; $HOME fails to resolve for some reason:
Force-pushed from 497a562 to 6c1144a
Force-pushed from b6ed7d8 to 0c72e78
Hi @dennishuo, for everyone's sanity, I went ahead and cleaned up the commit history in this PR to something more reasonable; hope it helps. :) For reference, I also have a copy of the previous branch here. Please let me know what else I can do, and thanks for your help with this.
Force-pushed from 0c72e78 to 7b5b509
Thanks for the cleanups! How were you testing the scripts as initialization actions? When I create a cluster with the two scripts as initialization actions, and then run the pyspark job to check the paths, I get the following:
That's if I run as a regular (non-root) user.
When running as root I can confirm that the default python is the miniconda python:
@dennishuo, I'm checking on this and will get back to you soon. Thanks for the feedback. :)
FWIW I just tried changing to use the Python 2 version instead of Python 3 and it seemed to fix it:
Force-pushed from e692526 to 04373a8
Hi @dennishuo, thanks again for the feedback. Took me a while longer to get back to you than expected...

To test (besides getting the cluster to launch, the Spark shell to run, and executing a few examples), we need to ensure that the worker nodes (executors) reference the correct (conda) python distro. This can be done with a simple job that resolves the list of distinct paths to the python executable found across each partition in an RDD; what you're doing looks right to me. :) I'm working on updating the README with info on this and am adding some files to support it (e.g., get-sys-exec.py).

With that, addressing a few more topics:

1. Python 3 PYTHONHASHSEED exception

I was able to reproduce the exception you hit in Python 3, which somehow I hadn't hit previously (I've been working mostly in Python 2). A little sleuthing turns up that this is a known bug in PySpark, with the rdd.py module not setting the PYTHONHASHSEED environment variable. The reporter of the jira issue (assuming they have the same name) also posted a fix on his blog: Python 3 on Spark - Return of the PYTHONHASHSEED. I am working on implementing this in this init action and hope to have a fix soon.

2. Different users, different results

Resolving different paths for different users (e.g., root vs. a regular user) could be handled per-user or via global profiles. What are your thoughts / preferences?

3. Remote (Dataproc API) vs. local job submittal

You may have also noticed that the results depend on where you submit the job. Submitting locally on the master references the conda distribution:

```
> spark-submit get-sys-exec.py
...
['/usr/local/bin/miniconda/bin/python']
...
```

However, when submitting a job remotely using the Dataproc API, it references the default python distribution:

```
> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
...
['/usr/bin/python']
...
```

The Dataproc API jobs for sure run under a different shell, and that shell's PATH doesn't pick up the conda distribution.

Thanks much, look forward to your feedback.
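For readers following along, here is a minimal sketch of the kind of path-checking job described above (illustrative only; the actual get-sys-exec.py in this PR may differ):

```python
# Sketch of a job that collects the distinct python executable paths
# seen across the partitions of an RDD.
from pyspark import SparkContext


def executable_path(_):
    # Imported inside the function so it resolves on the executor.
    import sys
    return sys.executable


sc = SparkContext()

# Spread a trivial dataset across several partitions, resolve the python
# executable used for each element, and collect the distinct paths.
paths = sc.parallelize(range(64), 8).map(executable_path).distinct().collect()
print(paths)  # e.g. ['/usr/local/bin/miniconda/bin/python']
```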
Thanks for the summary.

For (1), if you have more trouble getting Python 3 to work, it seems we can at least just move forward with the Python 2 version of miniconda as the default for now, unless there's a pressing need to strongly prefer Python 3 initially.

For (2), I think global profiles would be a good approach; we do a similar thing in the related bdutil scripts, since bdutil installs from tarballs, as opposed to Dataproc's distro installation into actual user paths under /usr/bin.

For (3), I think it'd be nice to investigate a bit more how to get remote (Dataproc API) jobs to run against the conda distribution as well.
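A minimal sketch of the global-profile approach, assuming the Miniconda prefix seen elsewhere in this thread (/usr/local/bin/miniconda); the profile file name is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: put conda on PATH for all users via a global
# profile entry, rather than editing per-user ~/.bashrc files.
CONDA_BIN_PATH="/usr/local/bin/miniconda/bin"  # assumed install prefix

cat << EOF > /etc/profile.d/conda.sh
export PATH=${CONDA_BIN_PATH}:\$PATH
EOF
```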
Force-pushed from e98c562 to d711073
Hi @dennishuo, thanks for the follow-up.

If it sounds good to you, I'll keep this PR moving forward by reverting to having conda install Python 2 by default, and we can update this with Python 3 once we iron out the other issue. We still need to test whether it (correctly) supports remote jobs (i.e., that remote jobs run against the conda distribution). Thanks again for your time.
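The revert itself would be a small change in bootstrap-conda.sh; a hedged sketch, assuming Continuum's installer naming conventions of the time (variable names hypothetical):

```bash
# Hypothetical sketch: default to the Python 2 Miniconda installer.
MINICONDA_VARIANT="Miniconda2"   # switch back to "Miniconda3" for Python 3
MINICONDA_INSTALLER="${MINICONDA_VARIANT}-latest-Linux-x86_64.sh"

wget "https://repo.continuum.io/miniconda/${MINICONDA_INSTALLER}"
# -b: batch (non-interactive) mode; -p: the install prefix used in this thread.
bash "${MINICONDA_INSTALLER}" -b -p /usr/local/bin/miniconda
```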
- Add README
- Add bootstrap-conda.sh for Miniconda install directories on *nix boxes
- Statically define /root for expected path (fixes failure in resolving $HOME)
- Update gcloud SDK
- Install py4j by default with conda install
- Add conda bin to PATH
- Export PYTHONHASHSEED env var across the cluster (to resolve https://issues.apache.org/jira/browse/SPARK-12100)
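On the PYTHONHASHSEED commit, a minimal sketch of the kind of cluster-wide export involved, assuming the stock Dataproc Spark config paths (the PR's exact mechanism may differ):

```bash
# Hypothetical sketch of the SPARK-12100 workaround: pin PYTHONHASHSEED so
# Python 3's randomized string hashing is consistent between the driver and
# executors (otherwise distributed ops like distinct() can fail).
echo "export PYTHONHASHSEED=0" >> /etc/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
```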
Force-pushed from d711073 to aceb38e
Force-pushed from 88418a6 to 811ab55
Force-pushed from 811ab55 to 132f7dc
Hey, sorry for the delays. Did you have any luck getting remote jobs to run against the conda distribution? I tried a fresh run with your latest updates, and I'm hitting errors on what appears to be the `pip install gcloud` step:

Any ideas what may have changed recently to produce this error?
Hi @dennishuo! Sure, I'll take a look now. Could you give me the command-line sequence you're executing when launching the cluster, so we can be consistent in how we're testing?
I just uploaded the two .sh files to GCS and then ran: |
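The exact command isn't shown above; a typical invocation of that shape, with a hypothetical bucket and a hypothetical name for the second script, would be:

```bash
# Hypothetical example; bucket, zone, and the second script's name are placeholders.
gsutil cp bootstrap-conda.sh install-conda-env.sh gs://my-bucket/init-actions/

gcloud beta dataproc clusters create conda-cluster \
  --zone us-central1-a \
  --initialization-actions \
    gs://my-bucket/init-actions/bootstrap-conda.sh,gs://my-bucket/init-actions/install-conda-env.sh
```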
@dennishuo, thanks. The failure stems from installing the gcloud python client via `pip install gcloud`. While that seems like a nice dependency to have as a default, it's not fundamental, so I've removed it from the script.

Also, in testing, now that we're not using Python 3, things should work for both local and remote jobs, as specified above. See the updated testing info in the README and let me know if you still have any issues. :)

PS. It would be super nice to set up some sort of testing workflow along with a CI service to automate development of these init actions. :)

PPS. Any thoughts on the Python 3 issue I mentioned above and in #25?
Excellent, everything tested cleanly, thanks so much for putting this together! Closed with 2d1ff5d.

Agreed that we'll want to set up some CI/testing service to make it easier to grow community involvement; I'll be sure to ping you if we get such a thing set up.

Thanks for the in-depth notes in #25 - we'll definitely keep track of Python 3 support as a feature for future Dataproc versions, but unfortunately it may be at least several weeks before we can get around to starting an in-depth plan for the upgrade as we work through our existing set of feature plans.
A start for #17.