Update desc-pyspark to use Spark3 #79

Open
heather999 opened this issue Mar 29, 2021 · 12 comments


heather999 commented Mar 29, 2021

As discussed on Slack.
We have a new shifter image lsstdesc/desc-python:spark-v3.1.1 based on NERSC's most recent Spark docker image.
I would like to update the desc-pyspark jupyter kernel to use this version.
Additionally we will pull in Stephane's desc-spark scripts and install them along with the typical desc-python installations at NERSC.


plaszczy commented Apr 7, 2021

I just tried the image on interactive nodes:

cori08/~ $ salloc -N 5 -t 30 -C haswell -q debug --image=lsstdesc/desc-python:spark-v3.1.1 --volume='/global/cscratch1/sd/plaszczy/tmpfiles:/tmp:perNodeCache=size=200G'

but I get stuck on:

salloc: Granted job allocation 41422941
salloc: Waiting for resource configuration
salloc: Nodes nid0[1433-1437] are ready for job

in a really bad way (even Ctrl-C or Ctrl-Z can't help me!)

heather999 commented

Thanks for giving this a try, @plaszczy!
I was able to use the interactive queue rather than debug, and I used double quotes:

heatherk@cori12:/global/cscratch1/sd/heatherk> salloc -N 5 -t 30 -C haswell -q interactive --image=lsstdesc/desc-python:spark-v3.1.1 --volume="/global/cscratch1/sd/heatherk/tmpfiles:/tmp:perNodeCache=size=200G"
salloc: Granted job allocation 41429516
salloc: Waiting for resource configuration
salloc: Nodes nid000[09-13] are ready for job
heatherk@nid00009:/global/cscratch1/sd/heatherk> ls

Then I went back and tried the debug queue; it took 5-10 minutes, but it finally started up:

heatherk@cori12:/global/cscratch1/sd/heatherk> salloc -N 5 -t 30 -C haswell -q debug --image=lsstdesc/desc-python:spark-v3.1.1 --volume="/global/cscratch1/sd/heatherk/tmpfiles:/tmp:perNodeCache=size=200G"
salloc: Pending job allocation 41429532
salloc: job 41429532 queued and waiting for resources
salloc: job 41429532 has been allocated resources
salloc: Granted job allocation 41429532
salloc: Waiting for resource configuration
salloc: Nodes nid0[0764,1262-1265] are ready for job
heatherk@nid00764:/global/cscratch1/sd/heatherk> 

Maybe try again and see if it works after a few minutes? The queue may have been busy.


plaszczy commented Apr 7, 2021

Yes, you are right, I can log in now.
But then I don't have access to /root:
ls /root/anaconda3/etc/profile.d/conda.sh
ls: cannot access '/root/anaconda3/etc/profile.d/conda.sh': No such file or directory

conda does not seem to be there:
ls /root/
ansible.hash bin limits.conf limits.conf.2021-04-06@14:42:32

and it does not seem to be the right Python version:
python -V
Python 2.7.17


plaszczy commented Apr 7, 2021

Actually, it should be run within shifter, but:
shifter "source /root/anaconda3/etc/profile.d/conda.sh"
shifter: source /root/anaconda3/etc/profile.d/conda.sh: No such file or directory

If we forget about the setup and just run (within the session):

shifter pyspark
env: ‘ipython’: Permission denied

So it seems that, within the image, the user does not have access rights to the conda setup.

heather999 commented

Ah yes... this is due to how the NERSC Spark docker image is set up, installing under /root. Let me go back to that NERSC ticket with Lisa and ask about it to get their suggestion. If I were setting up this image myself, I would have installed everything in another part of the directory tree to avoid that problem.

heather999 commented

@plaszczy I see Lisa isn't available this week, so I went ahead and updated the docker image to reinstall Anaconda under /opt/desc/py to avoid the problems using /root. For now, this means we have two copies of Anaconda in the same image, but only ours in /opt/desc/py is made available by setting up PATH appropriately.
I've loaded this image lsstdesc/desc-python:spark-v3.1.1 into shifter and I can confirm it starts up and I can activate the conda environment by doing:

shifter -E --image=lsstdesc/desc-python:spark-v3.1.1 /bin/bash
$ source /opt/desc/py/etc/profile.d/conda.sh
$ conda activate base
(base) heatherk@cori03:/global/cfs/cdirs/lsst/gsharing/DC2_ImSim/DR2/repo/rerun/dr2-coadd$ 

Note: to get pyspark to start up properly, I needed to include that -E when starting shifter, to avoid loading my local environment.

Once Lisa is back and I get a new NERSC image we can clean up a bit, but I think this new lsstdesc/desc-python:spark-v3.1.1 might allow for some testing.


plaszczy commented Apr 8, 2021

OK, that's interesting, I could reproduce it.

I avoid using -E, which is very broad; we found with Lisa that using:
export PYTHONUSERBASE=
avoids loading the user's libs.
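
(For what it's worth, here is a quick check, just a sketch, of whether the per-user site-packages is what leaks host packages into the container; run it inside the shifter python:)

import site, sys
# Report where the per-user site-packages would live and whether Python is using it.
print("user site enabled:", site.ENABLE_USER_SITE)
print("user site path:", site.getusersitepackages())
print("on sys.path:", site.getusersitepackages() in sys.path)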

So after the conda source + activate, I ran a basic test in pyspark (an "ang2pix" in Spark):

from pyspark.sql.functions import pandas_udf, PandasUDFType
import numpy as np
import pandas as pd
import healpy as hp

nside=256
nest=False
@pandas_udf('long', PandasUDFType.SCALAR)
def Ang2Pix(ra,dec):
    return pd.Series(hp.ang2pix(nside,np.radians(90-dec),np.radians(ra),nest=nest))

df = spark.createDataFrame([[10.0, 20]], ["ra", "dec"])
df=df.withColumn("ipix",Ang2Pix("ra","dec"))
df.show()

but we are back to the case where pyspark does not find the libs:

...
  return self.loads(obj)
  File "/usr/local/bin/spark-3.1.1/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/bin/spark-3.1.1/python/lib/pyspark.zip/pyspark/cloudpickle/cloudpickle.py", line 562, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'numpy'


heather999 commented Apr 8, 2021

Hi @plaszczy,
It's definitely something in the environment that carries over when we start shifter. PYTHONUSERBASE is already empty when we start the shifter image (whether or not we use the -E flag).
If I do this with -E, things appear to work:

heatherk@cori04:~> shifter -E --image=lsstdesc/desc-python:spark-v3.1.1 /bin/bash
heatherk@cori04:/global/u1/h/heatherk$ source /opt/desc/py/etc/profile.d/conda.sh
heatherk@cori04:/global/u1/h/heatherk$ conda activate base
(base) heatherk@cori04:/global/u1/h/heatherk$ which java
/usr/bin/java
(base) heatherk@cori04:/global/u1/h/heatherk$ which pyspark
/usr/local/bin/spark-3.1.1/bin/pyspark
(base) heatherk@cori04:/global/u1/h/heatherk$ pyspark
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/bin/spark-3.1.1-bin-hadoop2.7/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/08 06:49:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.8.5 (default, Sep  4 2020 07:30:14)
Spark context Web UI available at http://cori04:4040
Spark context available as 'sc' (master = local[*], app id = local-1617889787587).
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import pandas_udf, PandasUDFType

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: import healpy as hp

In [5]: nside=256
   ...: nest=False

In [6]: @pandas_udf('long', PandasUDFType.SCALAR)
   ...: def Ang2Pix(ra,dec):
   ...:     return pd.Series(hp.ang2pix(nside,np.radians(90-dec),np.radians(ra),
   ...: nest=nest))
   ...: 
/usr/local/bin/spark-3.1.1/python/pyspark/sql/pandas/functions.py:389: UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to specify type hints for pandas UDF instead of specifying pandas UDF type which will be deprecated in the future releases. See SPARK-28264 for more details.
  warnings.warn(

In [7]: df = spark.createDataFrame([[10.0, 20]], ["ra", "dec"])

In [8]: df=df.withColumn("ipix",Ang2Pix("ra","dec"))
   ...: df.show()
+----+---+------+                                                               
|  ra|dec|  ipix|
+----+---+------+
|10.0| 20|257564|
+----+---+------+


Let me see if I can figure out a way to get it to work without relying on that -E flag. Julien's setup for the pyspark jupyter kernel might offer me some hints: https://github.com/LSSTDESC/desc-python/blob/master/jupyter-kernels/desc-pyspark/desc-pyspark.sh
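
(Side note on the UserWarning above: Spark 3 prefers Python type hints over PandasUDFType, per SPARK-28264. A minimal, untested sketch of the same UDF written that way:)

from pyspark.sql.functions import pandas_udf
import numpy as np
import pandas as pd
import healpy as hp

nside = 256
nest = False

# Scalar pandas UDF declared with type hints instead of PandasUDFType.SCALAR.
@pandas_udf("long")
def Ang2Pix(ra: pd.Series, dec: pd.Series) -> pd.Series:
    return pd.Series(hp.ang2pix(nside, np.radians(90 - dec), np.radians(ra), nest=nest))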


heather999 commented Apr 13, 2021

Hi @plaszczy and @JulienPeloton, coming back to this and following the NERSC ticket: I have updated the lsstdesc/desc-python:spark-v3.1.1 image to use NERSC's most recent nersc/spark-3.1.1:v2, and pyspark runs interactively if I set JAVA_HOME=/usr as Lisa explained.
I'd like to better understand the use cases for our pyspark shifter image, as it sounds like setting up a Jupyter kernel versus running on a login node might require different setups. Is there a circumstance where we really need to use Cori's Java installation rather than the one available in the image?
Looking at Julien's previous Jupyter kernel setup, some bits will have to be modified, since we now have the desc-python environment in this shifter image and no longer need to point at /global/common/software/lsst; I'm assuming JAVA_HOME would be set to /usr as Lisa noted (a rough sketch of the environment involved is below).
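
Roughly, I'd expect the kernel wrapper to need something like the following before pyspark starts (JAVA_HOME and the Spark location come from this thread; the PYSPARK_PYTHON path is only my guess at where the /opt/desc/py conda base puts its python):

import os
# Assumed environment for a desc-pyspark kernel inside the shifter image.
os.environ["JAVA_HOME"] = "/usr"                          # per Lisa's suggestion
os.environ["SPARK_HOME"] = "/usr/local/bin/spark-3.1.1"   # Spark install seen in the image
os.environ["PYSPARK_PYTHON"] = "/opt/desc/py/bin/python"  # hypothetical conda-base python path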

I think it would help to have some explicit use case I can try out and to see if things work or not.

plaszczy commented

I think I mixed up the tickets, so I'm recopying my message here:

I still can't make it run without the -E shifter option. Here is what I do:

salloc -N 5 -t 40 -C haswell -q interactive --image=lsstdesc/desc-python:spark-v3.1.1 --volume='/global/cscratch1/sd/plaszczy/tmpfiles:/tmp:perNodeCache=size=200G'

then:
module load spark/3.1.1
start-all.sh 
export PYTHONUSERBASE=
export JAVA_HOME=/usr
shifter /bin/bash

source /opt/anaconda3/etc/profile.d/conda.sh
conda activate base
pyspark

Then my basic test fails:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import numpy as np
import pandas as pd
import healpy as hp

nside=256
nest=False
@pandas_udf('long', PandasUDFType.SCALAR)
def Ang2Pix(ra,dec):
    return pd.Series(hp.ang2pix(nside,np.radians(90-dec),np.radians(ra),nest=nest))

df = spark.createDataFrame([[10.0, 20]], ["ra", "dec"])
df=df.withColumn("ipix",Ang2Pix("ra","dec"))

df.show()

Since it is OK with the -E shifter option, I guess something else remains...
We may consider that using -E is the right way to go, however.
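
(One way to check whether the driver and the executors end up on different Pythons, just a sketch to run from the pyspark prompt, using the spark session it provides:)

import sys

def probe(_):
    # Runs on an executor: report its python and whether numpy is importable there.
    import sys, importlib.util
    return [(sys.executable, importlib.util.find_spec("numpy") is not None)]

print("driver python:", sys.executable)
print("executors:", set(spark.sparkContext.parallelize(range(4), 4).mapPartitions(probe).collect()))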

cheers,
stephane

heather999 commented

Hi @plaszczy, I haven't forgotten this; I've just been very busy. I am planning to look at this again tomorrow morning!

plaszczy commented

Thanks, no hurry. If you manage to run it in your own env, we'll have to compare our full lists of env vars (otherwise...?). The other option is to clean everything (-E), but then we must still find a way to pass some env vars through to shifter.
