
Implement the hackage and nixpkgs meta analyzers #3549

Merged: 10 commits into DependencyTrack:master from mangoiv/nixpkgs-meta-analyzer on Apr 13, 2024

Conversation

@MangoIV (Contributor) commented Mar 13, 2024

Description

This solves part 2 of #3545

Addressed Issue

#3545

Additional Details

this depends on #3546

The Apache HttpClient utils version 4 don't implement Brotli decompression. I redid some of the common HTTP client code to keep this change scoped; a future refactoring could remove the now-extraneous code.
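
For illustration, here is a minimal sketch of handling this manually, assuming Google's org.brotli:dec decoder library is available on the classpath; the helper class and method names are hypothetical, not the code in this PR:

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.brotli.dec.BrotliInputStream;

final class BrotliSupport {

    // Wraps the response body in a BrotliInputStream when the server replied
    // with "Content-Encoding: br", since HttpClient 4.x won't decompress
    // Brotli on its own.
    static InputStream decodedBody(final HttpResponse response) throws IOException {
        final HttpEntity entity = response.getEntity();
        final InputStream raw = entity.getContent();
        final Header encoding = response.getFirstHeader("Content-Encoding");
        if (encoding != null && "br".equalsIgnoreCase(encoding.getValue())) {
            return new BrotliInputStream(raw);
        }
        return raw;
    }
}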

Checklist

  • I have read and understand the contributing guidelines
  • This PR implements an enhancement, and I have provided tests to verify that it works as intended
  • This PR introduces new or alters existing behavior, and I have updated the documentation accordingly

Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
@MangoIV (Contributor Author) commented Mar 13, 2024

what's missing:

  • invoke the refreshing job every now and then
  • make Brotli decompression not error
  • the usual chores.

different than the older http client

Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
@MangoIV force-pushed the mangoiv/nixpkgs-meta-analyzer branch from bf2c25b to c5096b2 on March 14, 2024 12:09
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
@MangoIV force-pushed the mangoiv/nixpkgs-meta-analyzer branch from e96397d to 965fbc3 on March 18, 2024 10:53
@MangoIV (Contributor Author) commented Mar 18, 2024

So if my theory is correct, this is some truly cursed behavior:

  • at runtime, the HTTP client library checks for the presence of the Brotli decoder class
  • if the class is absent, it doesn't attempt to decode Brotli-encoded bodies, but it still strips the content-encoding header and carries on
  • if the class is present, it wraps the response in a decompressing entity and decodes it (a minimal sketch of this detection pattern follows below)
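
A minimal sketch of that kind of classpath probe; the exact class name being checked for is an assumption here, not taken from the HTTP client's source:

final class BrotliProbe {

    // Probe the classpath for the Brotli decoder, mirroring the runtime
    // detection described above.
    static boolean brotliAvailable() {
        try {
            Class.forName("org.brotli.dec.BrotliInputStream");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}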

@MangoIV changed the title from "[wip] implement the nixpkgs meta analyzer" to "[wip] implement the hackage and nixpkgs meta analyzers" on Mar 18, 2024
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
@MangoIV force-pushed the mangoiv/nixpkgs-meta-analyzer branch from a5813d9 to 93573e0 on March 18, 2024 11:00
@MangoIV (Contributor Author) commented Mar 25, 2024

Hi, I would like some input on where to best invoke the update of the nixpkgs meta analyzer singleton. Thanks in advance!

@emil-wire commented:

@nscuro I don't want to pester you and understand you have a lot on your mind. We'd be grateful if you could provide some guidance at your earliest convenience to make this PR a success and land this feature in DT. Thank you!

@nscuro (Member) commented Mar 25, 2024

@MangoIV @emil-wire Hey, will have a look at this ASAP and provide feedback. Thank you for your contributions! :)

@nscuro (Member) commented Mar 25, 2024

@MangoIV Instead of using a singleton of NixpkgsMetaAnalyzer, and updating it every now and then, you could use a read-through cache with "expire-after-write" eviction. The Caffeine cache library is already available in DT.

So the first time a task tries to access nix versions, it will fetch and parse packages.json, and put it in the cache. Other tasks trying to access it are blocked until the cache population finishes. After that, they can be quickly served from said cache. The cache will expire after a given interval (e.g. 1h), and the whole thing repeats.

The benefit of this is that you won't have to deal with concurrency controls such as locking on your own. Also, refreshing happens implicitly, so it doesn't require a dedicated scheduled task.

Because we'd want to refresh the entire package index at once, the entire index would be a single cache entry:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// MetaModel and Component are the existing DependencyTrack model classes.
class NixPkgMetaAnalyzer {

    private static final Cache<String, Map<String, String>> CACHE = Caffeine.newBuilder()
            .expireAfterWrite(60, TimeUnit.MINUTES)
            .maximumSize(1)
            .build();

    MetaModel analyze(Component component) {
        // The first caller populates the single cache entry; concurrent
        // callers block until population finishes, later ones are served
        // directly from the cache.
        Map<String, String> latestVersions = CACHE.get("nixpkgs", cacheKey -> {
            final var versions = new HashMap<String, String>();
            // TODO: Download and parse package index
            return versions;
        });

        // TODO: Do things with latestVersions
        return null; // sketch only: build a MetaModel from latestVersions here
    }
}

More details here: https://github.com/ben-manes/caffeine/wiki/Population#manual

@MangoIV (Contributor Author) commented Mar 25, 2024

Thank you for the answer, that looks very nice!

It looks like a fairly large hammer for the tiny nail I'm trying to drive into the wall, but since it's already used across the codebase, it seems like a feasible option; I'll adjust accordingly.

…xpkgs json

Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
@MangoIV (Contributor Author) commented Apr 8, 2024

package-url/packageurl-java@73b2f51

By the way, this got merged, i.e. theoretically I could do this minor cleanup; however, upstream hasn't done a release yet, so I don't know if it's worth the effort.

@MangoIV (Contributor Author) commented Apr 8, 2024

New tests pass locally; let's see if CI is merciful today 🥺

@MangoIV marked this pull request as ready for review on April 8, 2024 07:28
Signed-off-by: Magnus Viernickel <magnus.viernickel@wire.com>
Coverage summary from Codacy

Coverage variation         Diff coverage
+0.43% (target: -1.00%)    72.22% (target: 70.00%)
Coverage variation details:

Commit                              Coverable lines    Covered lines    Coverage
Common ancestor commit (f30bef6)    21514              15924            74.02%
Head commit (2e71242)               21739 (+225)       16185 (+261)     74.45% (+0.43%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>
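For this PR that works out to 74.45% - 74.02% = +0.43%, matching the table above.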

Diff coverage details:

                        Coverable lines    Covered lines    Diff coverage
Pull request (#3549)    90                 65               72.22%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%
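For this PR that is 65 / 90 × 100% ≈ 72.22%, which clears the 70.00% target.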


@MangoIV (Contributor Author) commented Apr 12, 2024

@nscuro is this how you'd imagined it? What can I do to move this PR forward?

@nscuro added this to the 4.11 milestone on Apr 13, 2024
@nscuro added the "enhancement" (New feature or request) label on Apr 13, 2024
@nscuro (Member) left a comment

Thanks @MangoIV!

As a future enhancement, we'd want to improve the parsing of the nixpkgs catalogue. At the moment, the entire thing is deserialized at once, which causes a large number of memory allocations:

[screenshot: memory allocation profile]

A more efficient way would be to leverage a streaming API such as the one provided by Jackson, which allows us to deserialize only the fields we actually care about. We already make use of that when parsing CVE data from the NVD:

try (final InputStream in = Files.newInputStream(file.toPath());
     final JsonParser jsonParser = objectMapper.createParser(in)) {
    jsonParser.nextToken(); // Position cursor at first token
    // Due to JSON feeds being rather large, do not parse them completely,
    // but "stream" through them. Parsing individual CVE items
    // one-by-one allows for garbage collection to kick in sooner,
    // keeping the overall memory footprint low.
    JsonToken currentToken;
    while (jsonParser.nextToken() != JsonToken.END_OBJECT) {
        final String fieldName = jsonParser.getCurrentName();
        currentToken = jsonParser.nextToken();
        if ("CVE_Items".equals(fieldName)) {
            if (currentToken == JsonToken.START_ARRAY) {
                while (jsonParser.nextToken() != JsonToken.END_ARRAY) {
                    final ObjectNode cveItem = jsonParser.readValueAsTree();
                    parseCveItem(cveItem);
                }
            } else {
                jsonParser.skipChildren();
            }
        } else {
            jsonParser.skipChildren();
        }
    }
} catch (Exception e) {
    // (error handling elided from the original snippet)
}
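
As a hedged sketch (not the implementation merged in this PR), the same streaming technique applied to the nixpkgs index might look like the following; the class name and the field names "packages", "pname", and "version" are assumptions about the packages.json layout:

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

final class NixpkgsIndexParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Streams through packages.json and keeps only pname -> version, so each
    // package object becomes garbage-collectable right after it is read.
    static Map<String, String> parseLatestVersions(final InputStream in) throws IOException {
        final var latestVersions = new HashMap<String, String>();
        try (final JsonParser parser = MAPPER.createParser(in)) {
            while (parser.nextToken() != null) {
                if (parser.currentToken() == JsonToken.FIELD_NAME
                        && "packages".equals(parser.getCurrentName())) {
                    parser.nextToken(); // START_OBJECT of the packages map
                    while (parser.nextToken() == JsonToken.FIELD_NAME) {
                        parser.nextToken(); // START_OBJECT of a single package
                        final ObjectNode pkg = parser.readValueAsTree();
                        if (pkg.hasNonNull("pname") && pkg.hasNonNull("version")) {
                            latestVersions.put(pkg.get("pname").asText(), pkg.get("version").asText());
                        }
                    }
                    break; // nothing else needed from the file
                }
            }
        }
        return latestVersions;
    }
}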

@nscuro changed the title from "[wip] implement the hackage and nixpkgs meta analyzers" to "Implement the hackage and nixpkgs meta analyzers" on Apr 13, 2024
@nscuro merged commit b40ea44 into DependencyTrack:master on Apr 13, 2024
12 checks passed
@MangoIV (Contributor Author) commented Apr 13, 2024

It’s a tradeoff. I would be reluctant to make a guess on whether it’s wise to use streaming here without thorough benchmarks.

nscuro added commits to nscuro/dependency-track that referenced this pull request on May 6, 2024:

This was missed in DependencyTrack#3549

Signed-off-by: nscuro <nscuro@protonmail.com>
@github-actions bot locked as resolved and limited conversation to collaborators on May 14, 2024