
Packages can require patches on dependencies #5476

Merged (9 commits), Sep 30, 2017

Conversation

@tgamblin (Member) commented Sep 26, 2017

Addresses an issue that came up in #4907.

A package can now depend on a special patched version of its dependencies.

  • depends_on() can optionally take a patch directive or a list of them (see the package sketch at the end of this description):

         depends_on(<spec>,
                    patches=patch(..., when=<cond>),    # condition on dependency
                    when=<cond>)                        # condition on this package
         # or
         depends_on(<spec>,
                    patches=[patch(..., when=<cond>),   # condition on dependency
                             patch(..., when=<cond>)],  # condition on dependency
                    when=<cond>)                        # condition on this package
  • The Spec YAML (and therefore the hash) now includes the sha256 of
    each patch applied to a spec.

    • The special patched version will be built separately from a "vanilla"
      version of the same package.
    • This allows packages to maintain patches on their dependencies
      without affecting either the dependency package or its dependents.
      This could previously be accomplished with special variants, but
      having to add variants means the hash of the dependency changes
      frequently when it really doesn't need to. This commit allows the
      hash to change just for dependencies that need patches.
    • Patching dependencies shouldn't be the common case, but some packages
      (qmcpack, hpctoolkit, openspeedshop) do this kind of thing and it
      makes the code structure mirror maintenance responsibilities.
  • This commit means that adding or changing a patch on a
    package will change its hash. This is probably what should happen,
    but we haven't done it so far. (@adamjstewart mentioned some ANL
    folks wanting this)

    • Only applies to patch() directives; package.py files (and their
      patch() functions) are not hashed, but we'd like to do that in the
      future.
  • To implement these extensions to Spack's DSL, some fanciness was added to
    directives.DirectiveMetaMixin. In particular:

    • directives can be nested (called as parameters to other directives) and their
      results won't be enqueued for lazy execution like a normal directive.
    • same for directives called within directives
  • Previously, the patch() directive only took an md5 parameter. Now
    it takes only a sha256 parameter. We restrict the hash used because we
    want to be consistent about which hash is used in the Spec.

    • A side effect of hashing patches is that compressed patches fetched
      from URLs now need two checksums: one for the downloaded archive and
      one for the content of the patch itself. Patches fetched uncompressed
      only need a checksum for the patch. Rationale:
    • we include the content of the patch in the spec hash, as that is
      the checksum we can do consistently for patches included in Spack's
      source and patches fetched remotely, both compressed and
      uncompressed.
    • we still need the checksum of the downloaded archive, because we want
      to verify the download before handing it off to tar, unzip, or
      another decompressor. Not doing so is a security risk and leaves
      users exposed to any arbitrary code execution vulnerabilities in
      compression tools.
  • This also refactors the way dependency metadata is stored on packages.

    • Previously, dependencies and dependency_types were stored as separate
      dicts on Package. This means a package can only depend on another in one
      specific way, which is usually but not always true.

    • Prior code unioned dependency types statically across dependencies on
      the same package.

    • New code stores dependency relationships as their own object (a Dependency,
      sketched after this list), with a spec constraint and a set of dependency
      types and patches per relationship.

      • Dependency types are now more precise
      • There is now room to add more information to dependency relationships (like patches)
    • New Dependency class lives in spack.dependency, along with deptype
      definitions that used to live in spack.spec.

    • Updated nwchem and nauty to use sha256 for patches (@junghans)

    • Lines with long checksums are now ignored by spack flake8

  • Documentation
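A rough sketch of the Dependency relationship object described above (illustrative only; the real class lives in spack.dependency and differs in detail):

    class Dependency(object):
        """One relationship: a package depends on a spec in certain ways."""

        def __init__(self, pkg, spec, type=('build', 'link')):
            self.pkg = pkg          # package class declaring the dependency
            self.spec = spec        # spec constraint on the dependency
            self.type = set(type)   # deptypes; default assumed here
            self.patches = []       # patches this package applies to the dep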

@naromero77: this is nearly done, and should let you continue to maintain your own patches on dependencies for qmcpack.

@jgalarowicz: you may be interested in this for openspeedshop.
@mwkrentel: you may be interested in this for hpctoolkit, too, as I know you maintain a lot of patches on your dependencies. This would allow you to maintain them with the hpctoolkit package without disrupting other packages.
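For concreteness, here is a minimal hypothetical package using the new DSL (every name, URL, version, and checksum below is made up for illustration):

    from spack import *

    class Qsim(Package):
        """Hypothetical package that maintains patches on its dependencies."""
        homepage = "https://example.com/qsim"
        url = "https://example.com/qsim-1.0.tar.gz"

        version('1.0', md5='0123456789abcdef0123456789abcdef')  # placeholder
        variant('mpi', default=True, description='Build with MPI support')

        # One local patch on a dependency: the patch's sha256 enters
        # libelf's spec hash, so this patched libelf is built separately
        # from the vanilla libelf used by other packages.
        depends_on('libelf@0.8.13', patches=patch('qsim-libelf.patch'))

        # A list of patches, each with its own condition on the dependency.
        # The compressed remote patch needs both checksums (rationale above).
        depends_on('mpich', when='+mpi', patches=[
            patch('mpich-hang.patch', when='@3.2'),
            patch('https://example.com/mpich-fix.patch.gz',
                  sha256='<sha256 of the uncompressed patch>',
                  archive_sha256='<sha256 of the downloaded .gz>',
                  when='%gcc'),
        ])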

@tgamblin changed the title from "Features/dependency patching" to "Packages can require patches on dependencies" on Sep 26, 2017
@alalazo (Member) left a comment:

Among the points above I was able to give a close look at:

  1. the refactoring of dependencies metadata
  2. the modifications to the meta-classes and directives

I think 1. is great, and it will let us be quite flexible in the future if we need a dependent to modify other aspects of its dependencies.

For 2., the only concern I have is whether we really need that additional complexity. The meta-class served the purpose of delaying the directive machinery until the package to which we want to attach attributes actually exists. The support for nested directives goes to great lengths to deactivate that same machinery when a nested directive is encountered.

But since the Dependency objects to which we want to attach attributes are already constructed when we need to evaluate patch (see comment below), is it worth generalizing the mechanism in this way, or could we give up using lowercase patch to permit more direct logic in the implementation of depends_on? Am I missing some important details here?

builtin function ``all`` or the string 'all', which result in
a tuple of all dependency types known to Spack.
"""
if deptype in (None, 'all', all):
Member:

Why do we check for None here? From what I can see the only point where:

s = canonical_deptype(None)

is used is within unit tests. Is this a leftover from some refactoring?

Member Author (tgamblin):

None is commonly the default value for deptypes= kwargs. But you're right, I suppose we should find usages and replace it with 'all'. It'll be clearer that way.

Member Author (tgamblin):

fixed in next push


elif isinstance(deptype, string_types):
    if deptype not in all_deptypes:
        raise ValueError('Invalid dependency type: %s' % deptype)
Member:

OT (not strictly part of the code review): I also tend to use built-in exceptions in code, rather than declaring specific ones, mainly on the assumption that Spack is basically an application. Anyhow, I have seen in a couple of places lately (like #5462) that Spack is also being used "as a library" to write scripts.

Do you think it makes sense to start adopting a pattern like:

class SpackValueError(ValueError, SpackError):
    pass

so that people are allowed to decide whether to manage a generic ValueError or an error coming from Spack? Just asking to raise the question and know your opinion on this.

Member Author (tgamblin):

I think when a ValueError occurs, it's generally because there is an error in the program, not something that would need to be handled programmatically or that you'd need an exception handler for. The only place I've seen ValueError even caught is when someone's trying to construct an int from a string or something similar, and generally in that case you just make a really narrow try/catch statement.

For things that the caller would actually want to handle (like network errors, or issues with packages) I think wrapping in a SpackError makes a lot more sense.
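For reference, the pattern @alalazo suggests would look roughly like this (a sketch; the real SpackError lives in spack.error):

    class SpackError(Exception):
        """Stand-in for Spack's base error class."""

    class SpackValueError(ValueError, SpackError):
        pass

    try:
        raise SpackValueError('Invalid dependency type: foo')
    except SpackError:
        # scripts embedding Spack can catch only Spack-originated errors...
        pass
    # ...while other callers can still catch it as a plain ValueError.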

    return (deptype,)

elif isinstance(deptype, (tuple, list)):
    invalid = next((d for d in deptype if d not in all_deptypes), None)
Member:

Minor nitpicking: if there's more than one invalid argument in the list or tuple, the error message will just show the first one, instead of pointing them out all at once.

Member Author (tgamblin):

fixed
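Putting this thread's fixes together (drop the None sentinel, report all invalid entries), canonical_deptype might end up looking something like this (a sketch under those assumptions, not the exact Spack code):

    all_deptypes = ('build', 'link', 'run', 'test')

    def canonical_deptype(deptype='all'):
        """Normalize a deptype argument to a canonical tuple of names."""
        if deptype in ('all', all):  # the builtin ``all`` doubles as a sentinel
            return all_deptypes
        elif isinstance(deptype, str):  # real code uses six.string_types
            if deptype not in all_deptypes:
                raise ValueError('Invalid dependency type: %s' % deptype)
            return (deptype,)
        elif isinstance(deptype, (tuple, list)):
            invalid = [d for d in deptype if d not in all_deptypes]
            if invalid:  # report *all* invalid entries, not just the first
                raise ValueError(
                    'Invalid dependency types: %s' % ', '.join(invalid))
            return tuple(sorted(set(deptype)))
        raise ValueError('Invalid deptype: %s' % str(deptype))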


A dependency is a requirement for a configuration of another package
that satisfies a particular spec. The dependency can have *types*,
which determine *how* that packge configuration is required,
Member:

typo: packge

Member Author (tgamblin):

fixed

Args:
pkg (type): Package that has this dependency
spec (Spec): Spec indicating dependency requirements
patches (list): list of patches to apply to this dependency
Member:

__init__ signature and docs seem to disagree, as there's no patches argument to __init__.

Member Author (tgamblin):

fixed

Member:

This still appears to be out of sync


# wrapped function returns same result as original so
# that we can nest directives
return result
Member:

I see the nastiness here:

  1. nested directives are stripped from the argument list
  2. then the current directive is evaluated with the stripped list, like it was before
  3. the executor is returned so that inner directives like patches=[patches(...)] can be collected by the outer directive

else p for p in patches]
assert all(callable(p) for p in patches)

# this is where we actally add the dependency to this package
Member:

typo : actually

Member Author (tgamblin):

fixed

return dict(
    (name, conds) for name, conds in self.dependencies.items()
    if any(dt in self.dependencies[name][cond].type
           for cond in conds for dt in deptypes))
Member:

"""Get subset of the dependencies that *may* have certain types."""

?

Member Author (tgamblin):

Personally I think the spack info output should be reworked for clarity, but maybe not in this PR. Organizing the dependency data has suddenly become harder. But yes, this comment conveys the message better. Fixed.


# apply patches to the dependency
for execute_patch in patches:
    execute_patch(dependency)
Member:

Am I right in saying that all the nested directive machinery exists to evaluate this single statement:

depends_on(<spec>,
    patches=[patch(..., when=<cond>), ...],    # condition on dependency
    when=<cond>)

while still permitting use of the word "patch" on the rhs of a named argument? Can't we just capture the arguments in a named tuple - or something similar - like:

depends_on(<spec>,
    patches=[Patch(..., when=<cond>), ...],    # named tuple spack.directives.Patch, not spack.patch.Patch
    when=<cond>)

and use a more direct logic within _depends_on? Am I missing some fundamental details here?

# Special-case: keeps variant values unique but ordered.
s.variants['patches'] = MultiValuedVariant('patches', ())
mvar = s.variants['patches']
mvar._value = mvar._original_value = tuple(dedupe(patches))
Member:

Here we are using a MV variant with a reserved name to keep track of patches, and correctly set it after concretization. I am curious why we didn't go with an additional Spec attribute and a modification to Spec.to_node_dict.

@scheibelp (Member):

I'll have some initial comments on this by the end of today

@tgamblin (Member, Author):

@alalazo: ok all the suggestions are fixed and I added docs. @scheibelp FYI.

Comments on @alalazo's more in-depth questions:

Clearing the DirectiveMetaMixin attribute seems absolutely correct to me, as now it may get populated as a side effect of the lazy evaluation above, for instance when depends_on eagerly evaluates the nested patches directive.

There's a little ambiguity in the term "nested" here. clearing the queue handles calls that really are "nested", in the sense that the depends_on directive calls patch. It doesn't handle calls to patches that are in parameters -- those are handled inside the directive decorator. The reason for that is that the parameters are enqueued right before the decorator is called, but truly nested calls are only enqueued in the mixin, after each directive is executed. It's complicated but it seems to work well, and I can't think of any cases where it would do something unexpected.

Am I right in saying that all the nested directive machinery exists to evaluate this single statement:

depends_on(<spec>,
    patches=[patch(..., when=<cond>), ...],    # condition on dependency
    when=<cond>)

while still permitting use of the word "patch" on the rhs of a named argument? Can't we just capture the arguments in a named tuple - or something similar - like:

depends_on(<spec>,
    patches=[Patch(..., when=<cond>), ...],    # named tuple spack.directives.Patch, not spack.patch.Patch
    when=<cond>)

and use a more direct logic within _depends_on? Am I missing some fundamental details here?

I don't think you're missing much:

  1. The call to patch inside _depends_on doesn't actually require a lot of machinery -- just the one statement in DirectiveMetaMixin.
  2. The parameter calls to patch() do require a bit more complexity -- all the code in the directive decorator.

But I think there are good reasons to make them the same. Right now, if you know how to write a patch directive, you also know how to write a patch parameter for depends_on. The packager doesn't have to remember that it's Patch in one scenario and patch in the other, and I think we should try really hard to avoid adding more details and subtle special cases to packagers' lives.

This also means there is only one piece of code to maintain for patch directives, and only one place to make a mistake. That's only partially true since patch now adds patches to either Dependency or Package objects, but I still think it's good that adding a patch to either (which is moderately complex) is done essentially the same way.

Here we are using a MV variant with a reserved name to keep track of patches, and correctly set it after concretization. I am curious why we didn't go with an additional Spec attribute and a modification to Spec.to_node_dict.

So, the basic rationale here is that there is a lot of machinery already in place for multi-value variants: adding them, managing a list of attributes, displaying them in spack find, adding them to the spec YAML, hashing them, etc. etc. I actually think we should figure out a way to consolidate them with compiler flags so that we don't have to maintain a special dict class for that. Compiler flags are almost the same, in that their names are effectively reserved (we should probably disallow adding variants for them, too), but they are separate attributes. I think it'll be much easier to have all that stuff as a unified parameter system, especially when we start doing SAT solves on it.
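For illustration, a patched dependency then shows up in concretized specs roughly like this (hypothetical output; the exact rendering may differ):

    $ spack spec qmcpack
    ...
        ^libelf@0.8.13%gcc@7.1.0 patches=5dce43a8a71c... arch=linux-x86_64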

@scheibelp (Member) left a comment:

This is mostly questions and documentation suggestions. I'm still trying to figure out the directive logic. I'll keep looking at this tomorrow.

@@ -328,3 +332,126 @@ def find_versions_of_archive(archive_urls, list_url=None, list_depth=0):
continue

return versions


def spider_checksums(url_dict, name, stage_function=None, keep_stage=False):
Member:

This appears to have been moved vs. edited. Is that correct? Why is it called spider_checksums? Spidering typically means to me building on an initial set to pull more results

Member Author (tgamblin):

Yep -- just moved. Mainly because it spiders webpages.

Member:

I guess what I'm saying is that all this is doing is taking a list of versions, converting each version to a url, and then fetching from that list of URLs. There is no poking around inside each URL, which is what I normally think of when I think of spidering. I think get_checksums_for_versions would be more accurate and straightforward.

return version_lines


def fetch_remote_package_versions(pkg):
Member:

Recently the protobuf package had to override Package.fetch_remote_versions (see https://github.com/LLNL/spack/pull/5373), what are your thoughts on that?

Member Author (tgamblin):

Oh that's a good catch. We should probably move it back if there are actually packages that need custom logic. I can tweak that.


@property
def name(self):
    """Get the name of the package to be patched.
Member:

This docstring is confusing IMO. It would be fine to say "The name of the dependency package".

@@ -967,7 +969,7 @@ def from_list_url(pkg):
the specified package's version."""
if pkg.list_url:
try:
versions = pkg.fetch_remote_versions()
versions = spack.util.web.fetch_remote_packge_versions(pkg)
Member:

Typo: fetch_remote_pack[a]ge_versions

self.url = path_or_url
self.md5 = kwargs.get('md5')

self.archive_sha256 = None
Member:

It looks like you handled swapping sha256 for md5 for a couple packages. Out of curiosity was that all the packages using UrlPatch?

Member Author (tgamblin):

Yes -- that's the only place the checksum is needed. For patches in the filesystem, we compute the sha256 on demand.
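Computing that checksum on demand is straightforward; a sketch (not necessarily Spack's exact code):

    import hashlib

    def sha256_of_file(path, blocksize=2 ** 20):
        """Stream a file through sha256 and return the hex digest."""
        hasher = hashlib.sha256()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(blocksize), b''):
                hasher.update(block)
        return hasher.hexdigest()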

# caller can return the directive if it should be queued.
def remove_directives(arg):
    directives = DirectiveMetaMixin._directives_to_be_executed
    if isinstance(arg, (list, tuple)):
Member:

This function is always initially called with a collection so IMO it's clearer to replace everything from here to the else with for arg in args:

Member Author (tgamblin):

Not sure I understand what you want here -- args is outside the function scope and if an arg happens to be a collection, we still need to descend into it...

Member:

For the two cases where this function is used

remove_directives(args)
remove_directives(kwargs.values())

in both cases a collection is being passed in. I meant to say: change the name of the parameter in this definition to args and then replace lines 206-210 with for arg in args:.

Member:

Also are there other examples of directives as arguments besides passing patch to depends_on? Either way IMO it would make sense to describe that particular case in the explanatory comment.

if 'archive_sha256' not in kwargs:
    raise PatchDirectiveError(
        "Compressed patches require 'archive_sha256' "
        "and patch 'sha256' attributes: %s" % self.url)
Member:

Why is there a 'sha256' and an 'archive_sha256' attribute? It doesn't seem strictly required. Is it to alert the user that they should be creating a hash of the archive and not the decompressed patch? When there is a sha256 sum for the archive, why is the sha256 of the decompressed patch also required?

@tgamblin (Member, Author) commented Sep 28, 2017:

Yeah, this annoyed me as well, but I do not see a good way around it. Here's why:

  1. This PR makes spec.yaml (and thus the DAG hash) include patches. So, we need a sha256 (or some deterministic hash) for each patch, so that each spec.yaml represents one unique configuration in a consistent way.
  2. We need to add patch hashes to the spec on or before the end of concretization, which is when specs are set in stone.
  3. We need to verify archives before passing them to things like tar or unzip. If we do not verify, someone could potentially replace a download with a malformed tarball or zip file and exploit a vulnerability in the compression tools.

(3) is really why we need the archive_sha256. If we could just expand things safely and check only the patch file, we'd be fine with just the patch file's checksum.
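Concretely, a compressed remote patch therefore carries both hashes (placeholder values):

    patch('https://example.com/fix.patch.gz',
          sha256='<hash of the uncompressed patch contents>',
          archive_sha256='<hash of the .gz archive as downloaded>')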

We could make (2) work for compressed patches by fetching all the necessary patches before or during concretization. But that requires us to do remote operations that may fail (or may need a mirror), and I like the fact that you can currently reason about packages (i.e. concretize a spec) without anything but the spec and a repo. I think we should try to keep things that way.

You could think about approximating (1) by just using the hash of the compressed archive (which is pretty much unique), but there are issues with that. gzip, tar, etc. aren't deterministic, so if someone changes the way a patch is distributed (e.g. uses a different compression mechanism, or re-tars the same patch), the hash changes, and you won't have the same hash for the same patch over time.

Anyway, that's the rationale. I think compressed patches are infrequent (even if nwchem has an awful lot of them) and adding a little pain for compressed patches seems worth it to avoid a bunch of downloads being required before concretize.

Thoughts?

@alalazo?

Member:

If it's easier to implement this way and the majority of patches aren't affected by it, then that seems reasonable. That being said, while all of 1-3 make sense to me, I still don't see why these are both required and/or why both would need to be set. I was assuming that if we have a sha256 and the patch is an archive, the sha256 is for the archive. So I think that's another way of saying I agree with:

You could think about approximating (1) by just using the hash of the compressed archive (which is pretty much unique)

As a side note: I'm unclear why the hash of the compressed archive would be less unique than the hash of the decompressed archive.

Regarding:

there are issues with that. gzip, tar, etc. aren't deterministic, so if someone changes the way a patch is distributed (e.g. uses a different compression mechanism, or re-tars the same patch), the hash changes, and you won't have the same hash for the same patch over time.

I think it's fine to treat this the same as if they had edited the patch contents themselves. It would be no less noisy because the archive's sha256 would be changing as often as this happens anyway (and the package must track the archive hash on account of 3).

"""Get the name of the package to be patched.

This is needed so that a ``Dependency`` acts like a ``Package``
object when passed to a directive.
Member:

When is a dependency object passed to a directive? I see cases where patches are passed to a directive.

Member Author (tgamblin):

I just deleted this part of the comment since I think it's unnecessary with your suggestion to just say "the name of the dependency package".

But Dependency objects are passed to the patch() directive inside _depends_on.


# Nasty, but it's the best way I can think of to avoid
# side effects if directive results are passed as args
remove_directives(args)
Member:

I'm trying to wrap my head around the directives logic and your updates to it. Could you provide an example of what goes wrong when this isn't done?

I feel like there's got to be a way to avoid undoing undesired modifications, but I admit I don't get what's going on here yet.

Member Author (tgamblin):

Sure. Say you do this:

class Foo(Package):
    depends_on('bar', patches=[
        patch('bar-bugfix.patch'),
        patch('https://example.com/bar-feature.patch', sha256='abc123...')])

If you do not remove the results of the two patch calls from the queue here, they'll be lazily executed along with all the other directives in DirectiveMetaMixin.__init__ (the metaclass initializer), because the @directive decorator adds them to DirectiveMetaMixin._directives_to_be_executed. The point of this is to only pass the result of patch() to the depends_on() directive, and let that directive handle it.

Basically the point of all the extra directive machinery is so that only directives at the class definition level get executed. The others are either nested (e.g., like patch() in _depends_on, which is only executed for its side effects) or they're passed as parameters, and we let the callee directive decide what to do with their results. The callee (depends_on) could execute the _execute_patch function objects itself, or it could conceivably enqueue them.
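A heavily simplified sketch of that machinery (illustrative only; the real code lives in spack.directives and also handles truly nested calls in the metaclass):

    class DirectiveMetaMixin(type):
        """Metaclass that queues directive closures until the class exists."""
        _directives_to_be_executed = []

    def _remove_directives(args):
        """Dequeue directive results that were passed as arguments."""
        queue = DirectiveMetaMixin._directives_to_be_executed
        for arg in args:
            if isinstance(arg, (list, tuple)):
                _remove_directives(arg)
            elif arg in queue:
                queue.remove(arg)

    def directive(fn):
        def wrapper(*args, **kwargs):
            # Results of directives used as parameters must not ALSO run
            # at class-definition time, so strip them from the queue...
            _remove_directives(args)
            _remove_directives(list(kwargs.values()))
            # ...then build the _execute_<directive> closure and queue it.
            result = fn(*args, **kwargs)
            DirectiveMetaMixin._directives_to_be_executed.append(result)
            # Returning the closure lets an outer directive (depends_on)
            # collect it and run it against a Dependency object.
            return result
        return wrapper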

Member:

The evaluation of patch is passed to depends_on so that it may call _execute_patch with the Dependency created by _depends_on. So I'm confused by the line

The others are either nested (e.g., like patch() in _depends_on, which is only executed for its side effects) or they're passed as parameters, and we let the callee directive decide what to do with their results.

because patch in depends_on is an example of the latter as far as I can tell from examining the code, but you explicitly use it as an example of the former.

@@ -1044,7 +1090,11 @@ def do_patch(self):
except spack.multimethod.NoSuchMethodError:
    # We are running a multimethod without a default case.
    # If there's no default it means we don't need to patch.
    tty.msg("No patches needed for %s" % self.name)
if not patched:
    # if we didn't apply a patch, AND the patch function
Member:

Suggestion: "if we didn't apply a patch from a patch directive, AND..."

"apply a patch" is general and I had to read this several times to figure out what it meant.

@tgamblin (Member, Author):

Ok @scheibelp: I pushed changes to address your review. I also replied to your questions but the replies are on now-outdated commits so I think you'll have to expand them above.

@tgamblin force-pushed the features/dependency-patching branch 2 times, most recently from 00f2603 to 05ea79d on September 28, 2017 at 21:10
@scheibelp (Member) left a comment:

OK thanks for the responses. I have a few more comments and questions. I'm going to look at this again tomorrow but by this point I think I have a general understanding of the how and why of most of the changes here.


# Ignore any directives executed *within* top-level
# directives by clearing out the queue they're appended to
DirectiveMetaMixin._directives_to_be_executed = []
@scheibelp (Member) commented Sep 28, 2017:

(EDIT: this was intended to add to https://github.com/LLNL/spack/pull/5476/files#r141277358)

The only example I see of this is the call to patch in _depends_on (which only seems to be to support the syntactic sugar depends_on('libelf', patches='foo.patch')). Why not handle this the way the extends directive calls _depends_on vs. depends_on? In other words I can see that the remove_directives logic below would be difficult to get rid of if users can pass directives as arguments, but this case looks like a problem we create for ourselves.



@scheibelp scheibelp mentioned this pull request Sep 29, 2017
@tgamblin force-pushed the features/dependency-patching branch 2 times, most recently from d6882e4 to 7cee03f on September 30, 2017 at 07:39
- Previously, dependencies and dependency_types were stored as separate
  dicts on Package.
  - This means a package can only depend on another in one specific way,
    which is usually but not always true.
  - Prior code unioned dependency types statically across dependencies on
    the same package.

- New code stores dependency relationships as their own object, with a
  spec constraint and a set of dependency types per relationship.
  - Dependency types are now more precise
  - There is now room to add more information to dependency relationships.

- New Dependency class lives in dependency.py, along with deptype
  definitions that used to live in spack.spec.

Move deptype definitions to spack.dependency
- move `spack.cmd.checksum.get_checksums` to `spack.util.web.spider_checksums`

- move `spack.error.NoNetworkError` to `spack.util.web.NoNetworkError` since
  it is only used there.
- Functions returned by directives were all called `_execute`, which made
  reading stack traces hard because you couldn't tell what directive a
  frame came from.
  - renamed them all to `_execute_<directive>`

- Exceptions in directives were only really used in one or two places --
  get rid of the boilerplate init functions and let the callsite specify
  the message.
- A package can depend on a special patched version of its dependencies.

  - The `Spec` YAML (and therefore the hash) now includes the sha256 of
    the patch in the `Spec` YAML, which changes its hash.

  - The special patched version will be built separately from a "vanilla"
    version of the same package.

  - This allows packages to maintain patches on their dependencies
    without affecting either the dependency package or its dependents.
    This could previously be accomplished with special variants, but
    having to add variants means the hash of the dependency changes
    frequently when it really doesn't need to.  This commit allows the
    hash to change *just* for dependencies that need patches.

  - Patching dependencies shouldn't be the common case, but some packages
    (qmcpack, hpctoolkit, openspeedshop) do this kind of thing and it
    makes the code structure mirror maintenance responsibilities.

- Note that this commit means that adding or changing a patch on a
  package will change its hash.  This is probably what *should* happen,
  but we haven't done it so far.

  - Only applies to `patch()` directives; `package.py` files (and their
    `patch()` functions) are not hashed, but we'd like to do that in the
    future.

- The interface looks like this: `depends_on()` can optionally take a
  patch directive or a list of them:

     depends_on(<spec>,
                patches=patch(..., when=<cond>),
                when=<cond>)
     # or
     depends_on(<spec>,
                patches=[patch(..., when=<cond>),
                         patch(..., when=<cond>)],
                when=<cond>)

- Previously, the `patch()` directive only took an `md5` parameter.  Now
  it only takes a `sha256` parameter.  We restrict this because we want
  to be consistent about which hash is used in the `Spec`.

- A side effect of hashing patches is that *compressed* patches fetched
  from URLs now need *two* checksums: one for the downloaded archive and
  one for the content of the patch itself.  Patches fetched uncompressed
  only need a checksum for the patch.  Rationale:

  - we include the content of the *patch* in the spec hash, as that is
    the checksum we can do consistently for patches included in Spack's
    source and patches fetched remotely, both compressed and
    uncompressed.

  - we *still* need the checksum of the downloaded archive, because we want
    to verify the download *before* handing it off to tar, unzip, or
    another decompressor.  Not doing so is a security risk and leaves
    users exposed to any arbitrary code execution vulnerabilities in
    compression tools.
@tgamblin (Member, Author):

Ok this is updated with @scheibelp's suggestions and some bugfixes.

@tgamblin tgamblin merged commit 96d2488 into develop Sep 30, 2017
@tgamblin (Member, Author):

@naromero77: this is merged.

@tgamblin tgamblin deleted the features/dependency-patching branch September 30, 2017 23:32
@naromero77 (Contributor):

@tgamblin thanks

@tgamblin tgamblin mentioned this pull request Oct 1, 2017
alalazo added a commit to epfl-scitas/spack that referenced this pull request Oct 4, 2017
refers spack#5587
closes spack#5593

The implementation introduced in spack#5476 used `dedupe` to set a private
member of a `MultiValuedVariant` object. `dedupe` is meant to give a
stable deduplication of the items in a sequence, and returns an object
whose value depends on the order of items in its argument.

Now the private member is set implicitly using the associated property
setter. The logic changes from stable deduplication to a totally ordered
deduplication (i.e. instead of `dedupe(t)` it uses `sorted(set(t))`).
This should also guarantee that packages that shuffle the same set of
patch directives maintain the same hash.
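The behavioral difference, for illustration (dedupe here stands in for Spack's llnl.util.lang.dedupe):

    def dedupe(seq):
        """Stable deduplication: keep first occurrences, preserve order."""
        seen = set()
        return (x for x in seq if not (x in seen or seen.add(x)))

    t = ('b', 'a', 'b')
    print(tuple(dedupe(t)))       # ('b', 'a') -- depends on input order
    print(tuple(sorted(set(t))))  # ('a', 'b') -- same for any input order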
else:
    mvar = dspec.spec.variants['patches']
    mvar._value = mvar._original_value = tuple(
        dedupe(list(mvar._value) + patches))
@alalazo (Member) commented Oct 4, 2017:

@tgamblin @scheibelp I'll write this comment here just to be sure the information gets across. What we are doing in the line:

mvar._value = mvar._original_value = tuple(dedupe(list(mvar._value) + patches))

is breaking an invariant of the class (the assumption that self._value is totally ordered). This assumption is transparent to the user of the class, and bypassing the property setter here is not very robust. In fact any call to a method that does:

self.value = self.value + other

can potentially reorder self._value (and lead to #5587?). An example of such a method may be for instance MultiValuedVariant.constrain.
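A simplified sketch of the invariant being discussed (hypothetical class, for illustration):

    class MultiValuedVariant(object):
        def __init__(self, name, value):
            self.name = name
            self.value = value  # goes through the setter below

        @property
        def value(self):
            return self._value

        @value.setter
        def value(self, value):
            # The setter maintains the invariant: a sorted, deduplicated tuple.
            self._value = tuple(sorted(set(value)))

    mvar = MultiValuedVariant('patches', ())
    mvar._value = ('b', 'a')           # bypasses the setter: invariant broken
    mvar.value = mvar.value + ('a',)   # setter restores the total order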

@@ -1829,6 +1805,43 @@ def concretize(self):
if s.namespace is None:
    s.namespace = spack.repo.repo_for_pkg(s.name).namespace

# Add any patches from the package to the spec.
patches = []
for cond, patch_list in s.package_class.patches.items():
@alalazo (Member) commented Oct 4, 2017:

@tgamblin @scheibelp There's a thing I don't fully understand about the patch mechanism. The information a packager provides with a directive is stored in a dictionary, so the for loop above iterates in no guaranteed order. This means we are assuming only a weak ordering by constraints. To be more clear, if we have:

patch('foo', when='@1')
patch('another_foo', when='%gcc')
patch('bar', when='@1')
patch('another_bar', when='%gcc')
patch('baz', when='+shared')

right now we'll be ensured that:

  1. foo is applied immediately before bar
  2. another_foo is applied immediately before another_bar

because they share the same cond, but we don't know whether the first patch is foo, another_foo or baz.

Would it make sense to relax this constraint and say that the patches must be commutative? If this is the case, then I think #5596 will solve #5587 after aggregating the two patches for mpfr which depend on their relative order of application.

If that's not the case then I think we should go completely the other direction and ensure that the order of patches:

  1. reflects the order of directives, regardless of the condition
  2. enters the hash mechanism

Note that this is getting a lot trickier than the case before, since we can now inject patches. For example, if in a DAG we have 2 nodes A and B that both want to inject patches on C, we must ensure that A and B are always traversed in the same relative order in any graph where they are present.

Do the considerations above sound reasonable to you?

@scheibelp (Member) commented Oct 4, 2017:

To update this: https://github.com/LLNL/spack/pull/5596 now preserves the weak ordering that currently exists on package patches.

@mathstuf (Contributor) commented Oct 4, 2017:

Note that this has some consequences: there are concretization problems associated with the (implicit?) patches variant. See #5598. Some initial poking has not been sufficient, so something more in-depth is required.

scheibelp pushed a commit that referenced this pull request Oct 4, 2017
fixes #5587

In trying to preserve patch ordering, #5476 made equality inconsistent
for the added 'patches' variant. This commit maintains the original
weak ordering of patch applications while preserving consistency of
comparisons. The ordering DOES NOT enter the hashing mechanism. It's
supposed to be a hotfix, while we think of a cleaner and more-permanent
solution.
@naromero77 (Contributor):

Thanks, this seems to have resolved my issue with PR #4907.

@tgamblin tgamblin added this to the v0.11.0 milestone Nov 12, 2017
@tgamblin tgamblin mentioned this pull request Feb 20, 2018
scheibelp added a commit that referenced this pull request Jun 7, 2018
Fixes #7885

#7193 added the patches_to_apply function to collect patches which are then
applied in Package.do_patch. However this only collects patches that are
associated with the Package object and does not include Spec-related patches
(which are applied by dependents, added in #5476).

Spec.patches already collects patches from the package as well as those applied
by dependents, so the Package.patches_to_apply function isn't necessary. All
uses of Package.patches_to_apply are replaced with Package.spec.patches.

This also updates Package.content_hash to require the associated spec to be
concrete: Spec.patches is only set after concretization. Before this PR, it was
possible for Package.content_hash to be valid before concretizing the associated
Spec if all patches were associated with the Package (vs. being applied by
dependents). This behavior was unreliable though so the change is unlikely to
be disruptive.