Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Environment Isolation #4480

Closed
1 of 4 tasks
chrisburr opened this issue Mar 5, 2020 · 11 comments
Closed
1 of 4 tasks

Environment Isolation #4480

chrisburr opened this issue Mar 5, 2020 · 11 comments

Comments

@chrisburr
Copy link
Member

chrisburr commented Mar 5, 2020

Current situation

I've ran tests in the DIRAC certification instance (DIRACOS) and in LHCb's production DIRAC instance (lcgbundle). The results can be split into two categories.

No isolation at all

When submitting with:

  1. JDL with dirac-wms-job-submit (Vanilla DIRAC + DIRACOS)
  2. Python using DIRAC.Interfaces.API (Vanilla DIRAC + DIRACOS)
  3. JDL with dirac-wms-job-submit (LHCbDIRAC + lcgbundle)

The job inherits the full DIRACOS environment. This mainly a problem because PYTHONPATH and LD_LIBRARY_PATH are set and this makes it almost impossible to use non-DIRACOS binaries. For example:

+ /usr/bin/python -c 'import _ssl; print(_ssl.__file__)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /home/condor/dir_32607/qL0MDmhQfTwnAaRUDnOCcW4oyGx30mABFKDmeQpXDmABFKDmImgvpn/DIRAC_FMuoLepilot/diracos/usr/lib64/libcrypto.so.10: version `OPENSSL_1.0.2' not found (required by /usr/lib64/python2.7/lib-dynload/_ssl.so)

PYTHONPATH also causes some weird things. It's less bad now only DIRAC and extensions will be added but it's still going to be weird from a user perspective:

+ /cvmfs/lhcbdev.cern.ch/conda-experimental/bin/python -c 'import DIRAC.Interfaces.API.Dirac'
Python Version 3.7.3 not supported by DIRAC
Supported versions are:
2.7.x
+ /usr/bin/python2.7 -c 'import DIRAC.Interfaces.API.Dirac'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "DIRAC/Interfaces/API/Dirac.py", line 20, in <module>
    from past.builtins import long
ImportError: No module named past.builtins

Some other variables will also cause unexpected issues like PYTHONOPTIMIZE causing assertions in analysis scripts to not trigger:

+ /cvmfs/lhcbdev.cern.ch/conda-experimental/bin/python -c 'assert False; print("All is okay")'
All is okay

Too much isolation

During the BiLD discussion it was mentioned that LHCb cleans the environment. This isn't the case for the vast majority of jobs, including all production jobs however it turns out it is for jobs submitted with LHCbDIRAC.Interfaces.API.DiracLHCb. The environment is cleaned completely and not even $PATH is set:

+ env
PWD=/pool/condor/dir_25753/DIRAC_NQjOZ9pilot/346094363
SHLVL=1
_=/usr/bin/env

Surprisingly this is less broken than I expected:

+ echo 1
1
+ python --help
[help is shown]

Though it is still a strange environment:

+ which python
which: no python in ((null))

It's also fundamentally broken. For example $X509_USER_PROXY isn't set causing authentication to fail:

+ xrdcp root://eoslhcb.cern.ch//eos/lhcb/grid/prod/lhcb/LHCb/Collision16/CHARM_D2HLL_DVNTUPLE.ROOT/00075549/0000/00075549_00000079_1.charm_d2hll_dvntuple.root -
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied

Solution

DIRAC(OS) should be modified to not need $PYTHONPATH/$LD_LIBRARY_PATH by:

  • Adding a trivial patch to python to look for Python libraries in $DIRAC (using sitecustomize). It would be better would be to pip install DIRAC so there is no need for anything special but is a longer-term goal.
  • Setting RPATH in all binaries relative to $ORIGIN therefore it possible to use /path/to/diracos/bin/python directly without any environment being set. There is a nice right up about when LD_LIBRARY_PATH is appropriate here.

Regardless of what we do here I expect this to happen as part of the move to Python 3 but that's a discussion for elsewhere.

1. Clean the environment as much as possible

One solution would be to change the default to clean the environment like how LHCbDIRAC.Interfaces.API.DiracLHCb currently does. If this is done we should add a whitelist of variables to keep. Not only those which are authentication related but also things like TMPDIR. We should also make sure $PATH is set to the result of getconf PATH.

Having $DIRAC set would probably be useful to allow jobs to run source "${DIRAC}/bashrc" if required.

It does however limit sites from being able to set variables to influence jobs. I know this has been used by LHCb sites to let them set OMP_NUM_THREADS=1 to prevent OpenMP from overwhelming the system in single threaded queues. That said, I still think cleaning the environment entirely is the right thing to do.

2. Just make DIRAC set fewer variables

If the dependency on $PYTHONPATH/$LD_LIBRARY_PATH being set is removed then most of the current issues go away automatically.

I dislike it though as it makes the grid less homogeneous. In particular I have had issues with some sites setting silly variable defaults like LD_LIBRARY_PATH=/usr/lib:/usr/lib64 or PYTHONPATH=/usr/lib64/lib/python2.7/site-packages.

3. Snapshot the environment before $DIRAC/bashrc is sourced

This is a more extreme version of 2 which I don't think gains very much and still need the whitelist from 1. to ensure variables like $X509_USER_PROXY are kept. I don't see any benefit from doing this.

Conclusion

In summary, my opinion is that these changes should be made:

  • Make DIRACOS's python build look in $DIRAC for python modules
  • Make DIRACOS set RPATH so LD_LIBRARY_PATH doesn't need to be set
  • Default to a bare environment for jobs submitted with both the Python API and as JDL
  • Make the "bare" environment keep a whitelisted set of variables like $DIRAC and $X509_USER_PROXY
@marianne013
Copy link
Contributor

marianne013 commented Mar 11, 2020

Hi Chris,
Simon and me are trying to come up with a coherent answer to this, but it's proving tricky.
Not sure if this works, but to make this less abstract, linked below is a typical bit of user code:
https://drive.google.com/file/d/1lKStwKlgr2rMSW0hkN4VY60MS3Wfwfrj/view?usp=sharing
So you can see what we are up against.

As for DIRAC(OS) should be modified to not need $PYTHONPATH/$LD_LIBRARY_PATH:
Simon F says it should not really be necessary, but it's fine for us either way.

Daniela

@chrisburr
Copy link
Member Author

Simon F says it should not really be necessary, but it's fine for us either way.

I don't understand what you mean by this? If PYTHONPATH and LD_LIBRARY_PATH weren't needed by DIRACOS I think your example would just work.

@marianne013
Copy link
Contributor

I feel like a secretary here: Simon thinks you could be right and it might work, but without trying it we cannot say.

I was in the middle of a lengthy comment, which might or might not be useful, but I've typed it now, so here it is:
For GridPP Option 1 is going the right direction. We have not been able to come up with a complete way forward, as there are too many options to consider.
Our preferred way to deal with this would be to implement the environment cleaning in the pilot (as a pre-release that we know is incomplete ?) and then we can start seeing what breaks and what we need to add. Simon suggests a white and a blacklist in a different way than suggested in ticket, instead following the pattern that is used by cron (whitelist should override blacklist, e.g. in case someone is only interested in resetting the LD_LIBRARY_PATH) and for the lists to support globbing (so we can keep e.g. OMP_*).
(Here is some horrible bit of pseudo code I just worked through:
https://docs.google.com/document/d/1UnvHjbP8mer49aOkpnWOS3wJhIu1en2TORb8FRQfzNg/edit?usp=sharing
for this)
We can probably agree on a default set of variables, but it should be an option in the configuration system, so we could tune it to whatever we need for our users without trampling over anyone else's setup. (This now looks like a tunable Option 2 to me.)

Option 3 would break in containers and assumes this all works in the first place, which is not always a given, so that would be a no from us.

@chrisburr
Copy link
Member Author

You can now give it a try by passing --dirac-os-version=cburr-set-rpath to dirac-install.py

@marianne013
Copy link
Contributor

Simon and me just did a quick test. It seemed to work for us as expected, but there is still the danger of the user overriding the PATH and breaking it that way. E.g. experiments often set up their own version of python. (Cleaning the environment in the pilot should probably be a separate issue)

If we stick the PATH etc into the wrapper script

lx02:scripts > cat dirac-proxy-info
#!/bin/bash
--> here <--
export DCOMMANDS_PPID=$PPID
exec /usr/bin/env python $DIRAC/DIRAC/FrameworkSystem/scripts/dirac-proxy-info.py "$@"

then this would be perfectly self contained and protected from accidental user interference.
Would you be able to add this ?
Then I think this could be released, and while it's not the whole thing we could start using it and seeing of something else breaks, I would prefer this to "one big release".

@chaen
Copy link
Contributor

chaen commented Apr 2, 2020

sorry but can you remind me what this DCOMMANDS_PPID is for ?

@atsareg
Copy link
Contributor

atsareg commented Apr 2, 2020

This is an environment variable used to define an interactive client session used by the COMDIRAC commands

@marianne013
Copy link
Contributor

Is there any progress ? As I said, this would be a great step in the right direction for us.

Daniela

@fstagni
Copy link
Contributor

fstagni commented Apr 21, 2020

The last version of v7r0 (v7r0p21) is using DIRACOS v1r11 (DIRACGrid/DIRACOS#127)

@fstagni
Copy link
Contributor

fstagni commented Oct 18, 2021

I am tempted to close this task (DIRACOS2, py3 etc), unless there are objections.

@marianne013
Copy link
Contributor

I don't even remember the details of this, but given that I haven't seen any more issues looking like this recently, go ahead.

@fstagni fstagni closed this as completed Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants