
Bug in dataproc-initialization-actions/jupyter/ causes different version on master and worker #300

Closed
willbowditch opened this issue Jul 24, 2018 · 3 comments

@willbowditch

I'm getting the following error on a basic cluster initialised with the dataproc-initialization-actions/jupyter/ scripts.

Exception: Python in worker has different version 3.6 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I've confirmed this is the case: on the master the Python version is 3.7.0 and on the workers it's 3.6.5.
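One way to confirm which interpreter each node will hand to PySpark is to query it directly on the master and on a worker (`PYSPARK_PYTHON` is the variable named in the error message; falling back to `python3` when it is unset is an assumption here, not something the init script guarantees):

```shell
# Run this on the master and on a worker; the reported versions should
# match in major.minor, but in this issue they are 3.7.x vs 3.6.x.
"${PYSPARK_PYTHON:-python3}" --version
```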

I've tracked down the issue to this segment of code:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/dced28ebd3780307418789ac51bfe8303030cd9a/jupyter/jupyter.sh#L44-L52

conda defaults to upgrading Python to 3.7.0 when installing Jupyter, which causes the version mismatch.

I think either the init script should prevent conda from installing the new Python version, or perhaps it should upgrade the workers to match (I did this manually for now). I'm not familiar with conda (I use pip instead), so I'm unsure whether there is a straightforward fix for this.
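A hypothetical sketch of the first option: capture the minor version already installed on the node and pass it to conda as a pin, so installing Jupyter cannot bump the interpreter. The exact command line below is an assumption, not the init script's code (the sketch only prints the command it would run):

```shell
# Capture the major.minor version of the Python already on this node...
PY_VER=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
# ...and pin it, so `conda install jupyter` keeps that interpreter.
# On a real node you would execute this command instead of printing it.
echo "conda install -y jupyter python=${PY_VER}"
```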

@karth295 (Contributor)

Confirmed your theory:

  1. Create conda cluster
  2. Run conda install jupyter
The following packages will be UPDATED:

    asn1crypto:         0.24.0-py36_0          --> 0.24.0-py37_0         
    certifi:            2018.4.16-py36_0       --> 2018.4.16-py37_0      
    cffi:               1.11.5-py36h9745a5d_0  --> 1.11.5-py37h9745a5d_0 
    chardet:            3.0.4-py36h0f667ec_1   --> 3.0.4-py37_1          
    conda:              4.5.4-py36_0           --> 4.5.8-py37_0          
    cryptography:       2.2.2-py36h14c3975_0   --> 2.2.2-py37h14c3975_0  
    idna:               2.6-py36h82fb2a8_1     --> 2.7-py37_0            
    pip:                10.0.1-py36_0          --> 10.0.1-py37_0         
    pycosat:            0.6.3-py36h0a5515d_0   --> 0.6.3-py37h14c3975_0  
    pycparser:          2.18-py36hf9f622e_1    --> 2.18-py37_1           
    pyopenssl:          18.0.0-py36_0          --> 18.0.0-py37_0         
    pysocks:            1.6.8-py36_0           --> 1.6.8-py37_0          
    python:             3.6.5-hc3d631a_2       --> 3.7.0-hc3d631a_0      
    requests:           2.18.4-py36he2e5f8d_1  --> 2.19.1-py37_0         
    ruamel_yaml:        0.15.37-py36h14c3975_2 --> 0.15.42-py37h14c3975_0
    setuptools:         39.2.0-py36_0          --> 39.2.0-py37_0         
    six:                1.11.0-py36h372c433_1  --> 1.11.0-py37_1         
    sqlite:             3.23.1-he433501_0      --> 3.24.0-h84994c4_0     
    urllib3:            1.22-py36hbe7ace6_0    --> 1.23-py37_0           
    wheel:              0.31.1-py36_0          --> 0.31.1-py37_0         

Indeed Python is one of the packages that gets updated. I like the idea of preventing conda from upgrading other packages (--no-update-dependencies)

@karth295 (Contributor)

A different solution is to pin to a particular version of conda: https://stackoverflow.com/questions/51427175/error-while-running-pyspark-dataproc-job-due-to-python-version

@karth295 (Contributor)

For posterity, --no-update-dependencies does not work in this case.

A third possible solution is to install jupyter on all nodes so that the python environment starts out consistent between master and workers.
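A sketch of that third option, assuming the usual Dataproc init-action pattern of branching on the `dataproc-role` metadata key (the helper path and role name below are the standard Dataproc ones, but treat them as assumptions): drop the master-only guard so the install runs on every node. The sketch falls back to `Worker` and only prints its intent so it can run off-cluster:

```shell
# Dataproc exposes each node's role via instance metadata; init actions
# typically branch on it to do master-only work. Here we install on every
# role so master and workers keep identical conda environments.
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role 2>/dev/null || echo Worker)
# No 'if [[ ${ROLE} == Master ]]' guard around the install step:
echo "would install jupyter on role: ${ROLE}"
```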

karth295 added a commit to karth295/dataproc-initialization-actions that referenced this issue Jul 30, 2018
karth295 added a commit to karth295/dataproc-initialization-actions that referenced this issue Aug 6, 2018
karth295 added a commit that referenced this issue Aug 6, 2018
Also, pin python version to version already installed

Fixes #300