Dependency Management Strategies... #741

Closed
petersilva opened this issue Aug 3, 2023 · 16 comments
Labels
bug Something isn't working crasher Crashes entire app. Design Problem hard to fix, affects design. Developer not a problem, more of a note to self for devs about work to do. Discussion_Needed developers should discuss this issue. enhancement New feature or request

Comments

@petersilva
Contributor

petersilva commented Aug 3, 2023

The Problem

Sarracenia uses a lot of other packages to provide functionality. These are called dependencies. In its native environment (Ubuntu Linux), most of these dependencies are easily resolved using the built-in Debian packaging tools (apt-get), but in many other environments it is more complex, as in https://xkcd.com/1987/. Even in environments where dependencies are installed somewhere, it is not always clear which ones are available to a given program.

On redhat-8, for example, there does not seem to be a wide variety of python packages available in the operating system repositories. Rather, only the minimal packages the OS itself needs from python seem to be available. This makes it challenging to install on redhat, as one now has to package many dependencies as well as the main package. The typical approach is to hunt for individual dependencies in different third-party repositories, or rebuild them from source. This is a bit haphazard, and in some cases, like watchdog or dateparser, the package itself has dependencies, and one ends up having to create dozens of python packages.

On redhat, as in many other environments, it seems more practical to use python native packaging rather than the incomplete OS packages, since pip does dependency resolution and can bring in all the dependencies. The result, if done system-wide, is a mix of distro packages and pip-provided packages, which complicates auditing and patching. System administrators may also object to the use of pip packages in the base operating system.

Windows is another example of an environment where pre-existing package availability is unclear. On Windows, the natural distribution format would be a self-extracting EXE, but how plugins would work with such a method is unclear, and all the dependencies need to be packaged within it. People also install python distributions such as ActiveState, Anaconda, or the more traditional cpython, and each of those has its own installation method.

The complications mostly arise from dependencies such as xattr, python3-magic, watchdog, etc., that is, packages that are wrappers around C libraries or use C libraries as part of their implementation. In these cases, pure python packaging often fails, as more environmental support is needed. For example, the python-magic python package requires the C library libmagic1 to be installed. If using OS packages, this is just an additional dependency, no problem, but with pip it will just fail, and the user needs to find the OS package, install that, and then try installing the python package again.
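To illustrate the python-magic case, here is a minimal sketch of checking for the python wrapper and the C library separately. The function name probe_magic is hypothetical and not part of sr3. Note that ctypes.util.find_library searches only the usual system locations, so a distribution that bundles its own copy of the library could still work even when this reports the C library as missing.

```python
import ctypes.util
import importlib.util


def probe_magic():
    """Return (wrapper_present, c_library_present) without importing magic."""
    # find_spec only locates the module; it does not execute it.
    wrapper = importlib.util.find_spec("magic") is not None
    # find_library returns a library name or path, or None when nothing is found.
    c_lib = ctypes.util.find_library("magic") is not None
    return wrapper, c_lib


wrapper, c_lib = probe_magic()
if wrapper and not c_lib:
    print("python wrapper for magic found, but libmagic was not located; "
          "the OS package providing it may need to be installed")
```

Checking the two separately makes it possible to tell the user which half is missing, rather than just reporting a failed import.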

Another complication is that all these different platforms have their own methods of installation, so it is not obvious what advice to give users when a dependency is missing: "pip install? conda install? apt install? yum install?" The package naming conventions vary by distribution, and are different from the module names used to test their presence.
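One possible way to handle the naming mismatch is a small lookup from the module name used for the import test to the package name each installer expects. This is only a sketch: INSTALL_NAMES and install_hints are hypothetical names, and the package names in the table are examples that would need to be verified against each distribution.

```python
# Illustrative table only: these package names are examples and should be
# checked against each distribution before being shown to users.
INSTALL_NAMES = {
    "magic": {"pip": "python-magic", "apt": "python3-magic"},
    "paho.mqtt.client": {"pip": "paho-mqtt", "apt": "python3-paho-mqtt"},
    "watchdog": {"pip": "watchdog", "apt": "python3-watchdog"},
}


def install_hints(module):
    """Return one suggested command per known installer for a missing module."""
    names = INSTALL_NAMES.get(module, {})
    return [f"{tool} install {pkg}" for tool, pkg in sorted(names.items())]


for hint in install_hints("magic"):
    print(hint)
```

A module with no entry simply yields no hints, so the caller can fall back to printing the bare module name.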

Approaches to Dependency Management

Manual Tailoring

For HPC (which runs redhat 8.x), a few dependencies are brought in by EPEL packages and some are built from source, but some had to be left out. When building packages on redhat, the setup.py file is typically hand-edited to work around packages that are not available. After the RPM is generated, it is tested on another system, as a different user, to see whether it runs (the local user doing the build may have pip packages which provide deps not available to others.)

implementation: manual editing of setup.py to remove dependencies.

(Mostly) Silent Disable

Looking at xattr, the import is in a try/except, and if it fails, the storing of metadata in extended file attributes is disabled. There is a loss of functionality, or a different behaviour, on these systems as a result. There is no way to query the system for which degrades are active, and nothing to tell the user what to do to address them, if they want to.

implementation in filemetadata.py:

try:
    import xattr
    supports_extended_attributes = True

except:
    supports_extended_attributes = False

There are also tests in sarracenia/__init__.py for the code to degrade/understand when dependencies are missing:

extras = {
   'amqp' : { 'modules_needed': [ 'amqp' ], 'present': False, 'lament' : 'will not be able to connect to rabbitmq brokers' },
   'appdirs' : { 'modules_needed': [ 'appdirs' ], 'present': False, 'lament' : 'will assume linux file placement under home dir' },
   'ftppoll' : { 'modules_needed': ['dateparser', 'pytz'], 'present': False, 'lament' : 'will not be able to poll with ftp' },
   'humanize' : { 'modules_needed': ['humanize' ], 'present': False, 'lament': 'humans will have to read larger, uglier numbers' },
   'mqtt' : { 'modules_needed': ['paho.mqtt.client'], 'present': False, 'lament': 'will not be able to connect to mqtt brokers' },
   'filetypes' : { 'modules_needed': ['magic'], 'present': False, 'lament': 'will not be able to set content headers' },
   'vip'  : { 'modules_needed': ['netifaces'] , 'present': False, 'lament': 'will not be able to use the vip option for high availability clustering' },
   'watch'  : { 'modules_needed': ['watchdog'] , 'present': False, 'lament': 'cannot watch directories' }
}

for x in extras:

   extras[x]['present']=True
   for y in  extras[x]['modules_needed']:
       try:
           if importlib.util.find_spec( y ):
               #logger.debug( f'found feature {y}, enabled')
               pass
           else:
               logger.debug( f"extra feature {x} needs missing module {y}. Disabled" )
               extras[x]['present']=False
       except:
           logger.debug( f"extra feature {x} needs missing module {y}. Disabled" )
           extras[x]['present']=False
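For reference, the same probing idea can be written as a self-contained sketch. The function names module_available and probe are hypothetical, and the narrower exception handling is a suggestion rather than what the code above does. One thing to keep in mind: for a dotted name such as paho.mqtt.client, find_spec imports the parent packages in order to locate the submodule.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def module_available(name):
    """True if `name` can be located, without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError for a dotted name whose parent
        # package is missing (e.g. paho.mqtt.client without paho installed).
        return False


def probe(features):
    """Set features[x]['present'] from the modules each feature needs."""
    for name, entry in features.items():
        missing = [m for m in entry["modules_needed"] if not module_available(m)]
        entry["present"] = not missing
        if missing:
            logger.debug("feature %s disabled, missing: %s", name, ", ".join(missing))
    return features
```

Collecting the missing modules per feature, rather than only flipping a flag, would also give a report something concrete to show the user.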

Demotion to Extras

Python packaging tools have a concept of extras, sort of the inverse of batteries included. In setup.py, one can declare extras: optional features that become available when additional dependencies are installed:

extras = {
       'amqp' : [ "amqp" ],
       'filetypes': [ "python-magic" ],
       'ftppoll' : ['dateparser' ],
       'mqtt': [ 'paho.mqtt>=1.5.1' ],
       'vip': [ 'netifaces' ],
       'redis': [ 'redis' ]
    }
extras['all'] = list(itertools.chain.from_iterable(extras.values()))
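As a small runnable illustration of how the 'all' entry is built (a sketch with a shortened dictionary, and with de-duplication added as a suggestion in case two extras share a requirement):

```python
import itertools

extras = {
    "amqp": ["amqp"],
    "ftppoll": ["dateparser"],
    "redis": ["redis"],
}

# The right-hand side is evaluated before 'all' is added to the dict, so the
# 'all' entry never contains itself. set() removes duplicates.
extras["all"] = sorted(set(itertools.chain.from_iterable(extras.values())))

print(extras["all"])
```

With setuptools, a dictionary like this is passed to setup() as extras_require=extras, and a user then asks for the optional pieces at install time with something like pip install "<package>[amqp,redis]", where <package> stands for the distribution name.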

Platform Dependent Deps

One can add dependencies that vary depending on the platform we are installing on.

    install_requires=[
        "appdirs", "humanfriendly", "humanize", "jsonpickle", "paramiko",
        "psutil>=5.3.0", "watchdog",
        'xattr ; sys_platform!="win32"', 'python-magic; sys_platform!="win32"',
        'python-magic-bin; sys_platform=="win32"'

    ],

(This is in the v03_issue721_platdep branch.)

What do we do?

So all of the approaches above (and perhaps others?) are used in the code. Someone using an installation will have a subset of functionality available, and sr3 has no way of reporting what is available or not. There is a branch, #738, that provides an example report of modules available using an sr3 extras command.

Should we at least report what is working, and what isn't? An additional problem is that configured plugins may have additional dependencies. The mechanism in the pull request also provides a way for plugins to register those, so they show up in the inventory command.

Is this a reasonable/advisable approach?
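To make the question concrete, here is a rough sketch of what such an inventory with plugin registration could look like. It is not the code from the pull request: the names inventory, register_feature and report, and the 'rejoice' key for the message shown when a feature is present, are all assumptions made for illustration.

```python
# Hypothetical inventory: one entry per optional feature.
inventory = {
    "amqp": {
        "modules_needed": ["amqp"],
        "present": True,  # would normally be set by probing for the modules
        "lament": "cannot connect to rabbitmq brokers",
        "rejoice": "can connect to rabbitmq brokers",
    },
}


def register_feature(name, modules_needed, lament, rejoice, present=False):
    """Hypothetical hook a plugin could call to declare its own dependencies."""
    inventory[name] = {
        "modules_needed": list(modules_needed),
        "present": present,
        "lament": lament,
        "rejoice": rejoice,
    }


def report(features):
    """Format one line per feature as a simple status table."""
    lines = [f"{'Status:':<10} {'feature:':<10} {'python imports:':<20} Description:"]
    for name, entry in sorted(features.items()):
        status = "Installed" if entry["present"] else "Absent"
        text = entry["rejoice"] if entry["present"] else entry["lament"]
        imports = ",".join(entry["modules_needed"])
        lines.append(f"{status:<10} {name:<10} {imports:<20} {text}")
    return "\n".join(lines)


# A plugin declaring a dependency that is not installed on this system.
register_feature("clamd", ["pyclamd"],
                 lament="cannot use clamd to av scan files transferred",
                 rejoice="can use clamd to av scan files transferred")
print(report(inventory))
```

Keeping built-in features and plugin-registered ones in the same structure means a single command can list both.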

@petersilva
Contributor Author

What I am guessing from the above:

  • There is currently no way to query sr3 to understand what degrades are active as a result of missing dependencies. It is probably a good thing to report that somehow.
  • Some missing deps are not the result of extras, but just stuff missing in the environment. Missing features could come from built-in degrades (like xattr) or from plugins (like clamav.py), rather than just from optional extras that were not installed.
  • Perhaps a different name is better: features, so that the sr3 features command would say which degrades are in place on the current system.
  • So far we have not discussed implications for pynsist (building self-contained .exe binaries), where we likely want a batteries-included approach?
  • For packages only used with certain callbacks (e.g. clamav), do we include them in the .exe?

Does anybody have any literature to review on this topic?

@petersilva
Contributor Author

Another, related question: should we make more dependencies optional, and degrade further to allow simpler installation when deps are hard to resolve (so HPC is no longer a special case)? For example, degrade when watchdog is missing, when appdirs is missing, etc.

@petersilva
Contributor Author

Should we make feature enablement an additional switch beyond just presence of a module? Maybe when appdirs is present, people still don't want to use it?
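A sketch of what such a switch might look like, combining module presence with a user preference. The function name feature_enabled and the disabled_by_user set are hypothetical; how the preference would actually be configured is left open.

```python
def feature_enabled(name, features, disabled_by_user=frozenset()):
    """A feature is usable only if its modules are present AND the user has not turned it off."""
    entry = features.get(name)
    present = bool(entry and entry["present"])
    return present and name not in disabled_by_user


features = {
    "appdirs": {"present": True},
    "vip": {"present": False},
}

print(feature_enabled("appdirs", features))               # module present, not disabled
print(feature_enabled("appdirs", features, {"appdirs"}))  # module present, user opted out
```

The design choice here is that a user switch can only turn a feature off; it cannot turn on something whose modules are missing.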

@petersilva
Contributor Author

There is currently a mix of all approaches present in the code, which is a bit incoherent. Might be good to increase consistency.

@petersilva
Contributor Author

Platform-dependent dependencies were tried in the past. At one point it appeared that the deps were evaluated when building the wheel, rather than when installing, so one needed one wheel built on windows and another on linux, but the details of the testing are lost to time. We likely want to validate this again.
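One way to check this after installing a wheel is to look at the requirement strings recorded in the installed metadata: if the conditions are still there as markers (the part after the ';'), they were carried into the package rather than resolved at build time. A minimal sketch using the standard library follows; the function name conditional_requirements is hypothetical, and the distribution name passed to it is whatever name the package was installed under.

```python
from importlib import metadata


def conditional_requirements(dist_name):
    """Return the installed distribution's requirement strings that carry a marker."""
    try:
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        # Nothing installed under that name on this system.
        return []
    # Requirement strings separate the marker from the requirement with ';'.
    return [r for r in requires if ";" in r]


# Example: list the marker-carrying requirements of pip itself, if it is installed.
for req in conditional_requirements("pip"):
    print(req)
```

Running this against the same wheel installed on linux and on windows would show whether a single wheel carries both sets of conditions.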

@petersilva
Contributor Author

  • combined both branches together
  • renamed extras to features
  • refined the output a bit.

If this makes sense to people as an approach, will need to document it.

@petersilva
Contributor Author

petersilva commented Aug 4, 2023

As an example of dependency shenanigans... trying to build a windows executable:

 ./generate-win-installer.sh
Collecting amqp
  Downloading amqp-5.1.1-py3-none-any.whl (50 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.8/50.8 kB 1.2 MB/s eta 0:00:00
Collecting vine>=5.0.0 (from amqp)
  Downloading vine-5.0.0-py2.py3-none-any.whl (9.4 kB)
Saved ./amqp-5.1.1-py3-none-any.whl
Saved ./vine-5.0.0-py2.py3-none-any.whl
Successfully downloaded amqp vine
Collecting appdirs
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Saved ./appdirs-1.4.4-py2.py3-none-any.whl
Successfully downloaded appdirs
ERROR: Could not find a version that satisfies the requirement netifaces (from versions: none)
ERROR: No matching distribution found for netifaces
fractal%

So then I edit the file and remove the netifaces package download. The package needs to include all deps, so this will disable vip, which means the executable binary will lose the ability to work with vips.

so when running on windows with the executable built this way:

fractal% sr3 features
2023-08-04 14:35:58,558 240728 [INFO] sarracenia.flow loadCallbacks flowCallback plugins to load: ['sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'dcpflow', 'log', 'post.message', 'clamav']
2023-08-04 14:35:58,564 240728 [INFO] dcpflow __init__ really I mean hi
2023-08-04 14:35:58,564 240728 [WARNING] sarracenia.config add_option multiple declarations of lrgs_download_redundancy=['Yes', 'on'] choosing last one: on
2023-08-04 14:35:58,564 240728 [INFO] dcpflow __init__  lrgs_download_redundancy is True 
2023-08-04 14:35:58,564 240728 [INFO] sarracenia.flowcb.log __init__ flow initialized with: {'after_post', 'after_accept', 'after_work', 'post', 'on_housekeeping'}
2023-08-04 14:35:58,566 240728 [CRITICAL] sarracenia.flow loadCallbacks flowCallback plugin clamav did not load: 'pyclamd'

Status:    feature:   python imports:      Description: 
Installed  amqp       amqp                 can connect to rabbitmq brokers
Installed  appdirs    appdirs              place configuration and state files appropriately for platform (windows/mac/linux)
Installed  ftppoll    dateparser,pytz      able to poll with ftp
Installed  humanize   humanize             humans numbers that are easier to read.
Absent     mqtt       paho.mqtt.client     cannot connect to mqtt brokers
Installed  filetypes  magic                able to set content headers
Installed  redis      redis,redis_lock     can use redis implementations of retry and nodupe
Absent     vip        netifaces            will not be able to use the vip option for high availability clustering
Installed  watch      watchdog             watch directories
MISSING    clamd      pyclamd              cannot use clamd to av scan files transferred

The vip line would say Absent, with the description: will not be able to use the vip option for high availability clustering.

That is, if it had worked; the pynsist generation just failed later anyway.

@petersilva
Contributor Author

In previous releases, it would complain only at run time, when someone tried to use it, and give no hint as to what was missing. With the features... um... feature, at least the package can be interrogated.

@petersilva
Contributor Author

petersilva commented Aug 4, 2023

Similarly, on redhat, it is not possible to get an OS package for the python watchdog package, so sr3 should report that the watch feature is missing. If a redhat user chooses to install watchdog in their account via pip, they can then confirm with sr3 features that sr3 will use that package.

@petersilva
Contributor Author

petersilva commented Aug 5, 2023

now have built a self-extracting executable and installed it on windows, and the result is:

C:\Users\SilvaP2>sr3 features
INFO: No sr3 configuration found. creating an empty one C:\Users\SilvaP2\AppData\Local\MetPX\sr3
INFO: No sr3 state or log files found. Creating an empty one C:\Users\SilvaP2\AppData\Local\MetPX\sr3\Cache

Status:    feature:   python imports:      Description:
Installed  amqp       amqp                 can connect to rabbitmq brokers
Installed  appdirs    appdirs              place configuration and state files appropriately for platform (windows/mac/linux)
Installed  ftppoll    dateparser,pytz      able to poll with ftp
Installed  humanize   humanize             humans numbers that are easier to read.
Installed  mqtt       paho.mqtt.client     can connect to mqtt brokers
Absent     filetypes  magic                will not be able to set content headers
Absent     redis      redis,redis_lock     cannot use redis implementations of retry and nodupe
Absent     vip        netifaces            will not be able to use the vip option for high availability clustering
Installed  watch      watchdog             watch directories


C:\Users\SilvaP2>

in an experimental build...

https://hpfx.collab.science.gc.ca/~pas037/Sarracenia_Releases/ see release 3.00.42_pre1

@petersilva
Contributor Author

xattr has been folded into the proposed features function from #738, as has paramiko, so consistency for the modules that give us trouble has been improved.

@petersilva
Contributor Author

with the last patches:

fractal% sr3 features
2023-08-07 17:38:54,785 2012828 [INFO] sarracenia.flow loadCallbacks flowCallback plugins to load: ['sarracenia.flowcb.retry.Retry', 'sarracenia.flowcb.housekeeping.resources.Resources', 'dcpflow', 'log', 'post.message']
2023-08-07 17:38:54,791 2012828 [INFO] dcpflow __init__ really I mean hi
2023-08-07 17:38:54,791 2012828 [WARNING] sarracenia.config add_option multiple declarations of lrgs_download_redundancy=['Yes', 'on'] choosing last one: on
2023-08-07 17:38:54,791 2012828 [INFO] dcpflow __init__  lrgs_download_redundancy is True 
2023-08-07 17:38:54,791 2012828 [INFO] sarracenia.flowcb.log __init__ flow initialized with: {'post', 'on_housekeeping', 'after_accept', 'after_work', 'after_post'}

Status:    feature:   python imports:      Description: 
Installed  amqp       amqp                 can connect to rabbitmq brokers
Installed  appdirs    appdirs              place configuration and state files appropriately for platform (windows/mac/linux)
Installed  filetypes  magic                able to set content headers
Installed  ftppoll    dateparser,pytz      able to poll with ftp
Installed  humanize   humanize             humans numbers that are easier to read.
Absent     mqtt       paho.mqtt.client     cannot connect to mqtt brokers
Installed  redis      redis,redis_lock     can use redis implementations of retry and nodupe
Installed  sftp       paramiko             can use sftp or ssh based services
Installed  vip        netifaces            able to use the vip option for high availability clustering
Installed  watch      watchdog             watch directories
Installed  xattr      xattr                on linux, will store file metadata in extended attributes

 state dir: /home/peter/.cache/sr3 
 config dir: /home/peter/.config/sr3 

fractal% 

The functionality now degrades nicely if paramiko is missing. The question is: do we remove the hard dependency in setup.py?

@petersilva
Contributor Author

So now the idea is to go through the list of all dependencies, analyze what functionality we lose without each one, and either degrade nicely or list it as part of an existing feature.

  • humanfriendly ... added to humanize feature.
  • psutil ... made a new process feature, and got sr3 CLI to still work, but degraded.
  • now looking at jsonpickle...

So in the end, when deps are missing: don't crash, just work worse (degrade, and do everything else that you can without the feature). Also: report the breakage via the sr3 features command.

@petersilva
Contributor Author

After an audit of all the dependencies, it seems that they are all used for very specific purposes, and putting an if features['x']['present'] guard around their usage is not that big of a problem. It also makes things work substantially better in the HPC environment, where it is hard to say which functionality is available. It lets us remove things from setup.py (hard dependencies) so that we can provide partial functionality in places where satisfying them all is hard. With the sr3 features mechanism, it also tells the user what is missing.
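As an illustration of the guard pattern, here is a minimal sketch using the humanize module. The show_size function is hypothetical, and the features dictionary is reduced to the one key the guard needs.

```python
import importlib.util

# Reduced stand-in for the real features table: only 'present' is needed here.
features = {
    "humanize": {"present": importlib.util.find_spec("humanize") is not None},
}

# Import the optional module only when it was found.
if features["humanize"]["present"]:
    import humanize


def show_size(nbytes):
    """Readable size when humanize is available, plain byte count otherwise."""
    if features["humanize"]["present"]:
        return humanize.naturalsize(nbytes)
    # Degraded behaviour: still works, just less pretty.
    return f"{nbytes} bytes"


print(show_size(123456))
```

The same shape applies to the other optional modules: the import and every use sit behind the same flag, and the else branch is the degraded behaviour.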

@andreleblanc11
Member

I've gone through the pull request and the issue description/comments.

I like the changes that you've added. In the documentation, I think we should encourage users to run sr3 features right after an install, just so they can make sure that all the dependencies that they want/need are present.

I might've missed it in the PR, but did you add some output for the logs? It could be helpful for analysts to have the sr3 features output in the logs when debug mode is run, or something of the like.

@petersilva
Contributor Author

released as part of 3.0.42


2 participants