Aggregate BOMs cannot handle components with differing dependency trees in different modules #310

Open
knrc opened this issue Mar 15, 2023 · 37 comments

@knrc
Contributor

knrc commented Mar 15, 2023

When processing aggregate BOMs it's possible to encounter projects in which the dependencies resolved for a component differ between modules, for example:

  • during the resolution process, with different sets of transitive dependencies
  • using dependency management
  • using exclusions

An example of exclusion could be represented by the following dependency trees, where dependency_F has managed dependency_B to exclude dependency_E from the dependency graph

com.example.dependency_trees.exclusion:dependency_A:jar:1.0.0
\- com.example.dependency_trees.exclusion:dependency_B:jar:1.0.0:compile
   +- com.example.dependency_trees.exclusion:dependency_C:jar:1.0.0:compile
   |  \- com.example.dependency_trees.exclusion:dependency_D:jar:1.0.0:compile
   \- com.example.dependency_trees.exclusion:dependency_E:jar:1.0.0:compile

and

com.example.dependency_trees.exclusion:dependency_F:jar:1.0.0
\- com.example.dependency_trees.exclusion:dependency_B:jar:1.0.0:compile
   \- com.example.dependency_trees.exclusion:dependency_C:jar:1.0.0:compile
      \- com.example.dependency_trees.exclusion:dependency_D:jar:1.0.0:compile

An example of managing the versions could be represented by the following dependency trees, where dependency_E has managed the version of dependency_C to use version 2.0.0 instead of 1.0.0

com.example.dependency_trees.managed:dependency_A:jar:1.0.0
\- com.example.dependency_trees.managed:dependency_B:jar:1.0.0:compile
   \- com.example.dependency_trees.managed:dependency_C:jar:1.0.0:compile
      \- com.example.dependency_trees.managed:dependency_D:jar:1.0.0:compile

and

com.example.dependency_trees.managed:dependency_E:jar:1.0.0
\- com.example.dependency_trees.managed:dependency_B:jar:1.0.0:compile
   \- com.example.dependency_trees.managed:dependency_C:jar:2.0.0:compile (version managed from 1.0.0)
      \- com.example.dependency_trees.managed:dependency_D:jar:1.0.0:compile

Note this last example is also something that could occur based on the dependency resolution process, since the context of the roots would be different and could resolve to a different set of artifacts.

The aggregate SBOM should be able to represent all the valid dependency hierarchies, which means that each component with an alternative dependency hierarchy would exist multiple times (same purl but differing bom-ref/ref)

In both the above examples we would expect to see two components included for dependency_B with each related to a different reference (bom-ref) and a different dependency hierarchy (ref)
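For illustration only, here is a minimal Python sketch of the expected outcome for the exclusion example. It is not the plugin's code: the dict layout, the `aggregate` helper and the `name#n` bom-refs are all hypothetical. It emits one component per distinct combination of purl and dependency hierarchy, so dependency_B appears twice while the shared dependency_C and dependency_D appear once.

```python
# Illustrative sketch; the dict layout and the "name#n" bom-refs are hypothetical.
GROUP = "com.example.dependency_trees.exclusion"


def purl(name):
    return f"pkg:maven/{GROUP}/{name}@1.0.0?type=jar"


# Direct dependencies of each component as resolved within each module.
tree_a = {
    "dependency_A": ["dependency_B"],
    "dependency_B": ["dependency_C", "dependency_E"],
    "dependency_C": ["dependency_D"],
    "dependency_D": [],
    "dependency_E": [],
}
tree_f = {
    "dependency_F": ["dependency_B"],
    "dependency_B": ["dependency_C"],  # dependency_E is excluded here
    "dependency_C": ["dependency_D"],
    "dependency_D": [],
}


def aggregate(trees):
    """Emit one component per distinct (purl, dependency hierarchy) pair."""
    components = {}    # bom-ref -> purl
    dependencies = {}  # bom-ref -> child bom-refs
    variants = {}      # (purl, child bom-refs) -> bom-ref

    def visit(tree, name):
        child_refs = tuple(visit(tree, child) for child in tree[name])
        key = (purl(name), child_refs)
        if key not in variants:
            seen = sum(1 for known_purl, _ in variants if known_purl == key[0])
            bom_ref = f"{name}#{seen + 1}"  # opaque, only unique within this BOM
            variants[key] = bom_ref
            components[bom_ref] = key[0]
            dependencies[bom_ref] = list(child_refs)
        return variants[key]

    roots = [visit(tree, root) for tree, root in trees]
    return components, dependencies, roots


components, dependencies, roots = aggregate(
    [(tree_a, "dependency_A"), (tree_f, "dependency_F")]
)
for ref in sorted(r for r, p in components.items() if p == purl("dependency_B")):
    print(ref, "->", dependencies[ref])
```

Both dependency_B entries share the same purl; only the bom-ref and the hierarchy hanging off it differ.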

knrc added a commit to knrc/cyclonedx-maven-plugin that referenced this issue Mar 15, 2023
…y hierarchies

Signed-off-by: Kevin Conner <kev.conner@gmail.com>
@knrc
Contributor Author

knrc commented Mar 15, 2023

Note: prior to #306 the dependency tree would only include one of the dependency trees. After this PR is applied the SBOM will contain multiple components for dependency_B, with the same purl but differing bom-refs, and each component will be included in a separate dependency tree.

knrc added a commit to knrc/cyclonedx-maven-plugin that referenced this issue Mar 16, 2023
…y hierarchies

Signed-off-by: Kevin Conner <kev.conner@gmail.com>
hboutemy changed the title from "Aggregate BOMs cannot handle components with differing dependency trees" to "Aggregate BOMs cannot handle components with differing dependency trees in different modules" on Mar 16, 2023
@stevespringett
Member

With regards to duplicate dependency_B expressed in the BOM with varying bom-refs.

My question: is a single quantity of dependency_B delivered, or are multiple occurrences of dependency_B delivered?

If multiple occurrences of dependency_B are delivered in the final build, then expressing dependency_B twice is expected. If a single occurrence of dependency_B is delivered then most use cases will not be expecting that.

Take for example a few SBOM use cases.

  • Vulnerability management:
    • may erroneously increase the total number of vulnerabilities for each duplicate component
  • License compliance:
    • may erroneously increase the amount of potential license compliance issues for each duplicate component
  • Software asset management:
    • will erroneously increase the quantity of assets that are being tracked
  • Vendor management:
    • will erroneously increase the amount of components the organization relies on from a vendor
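As a rough illustration of the counting concern, the following Python sketch uses a hypothetical component list in which dependency_B appears under two bom-refs. A consumer that counts BOM entries sees three components, while one that groups by purl sees two. The field names mirror CycloneDX JSON, but the helper functions are invented for this example.

```python
from collections import Counter

PURL_B = "pkg:maven/com.example/dependency_B@1.0.0?type=jar"
PURL_C = "pkg:maven/com.example/dependency_C@1.0.0?type=jar"

# Hypothetical aggregate BOM content: dependency_B is listed twice.
components = [
    {"bom-ref": "dependency_B#1", "purl": PURL_B},
    {"bom-ref": "dependency_B#2", "purl": PURL_B},
    {"bom-ref": "dependency_C#1", "purl": PURL_C},
]


def count_entries(components):
    """Naive count: every BOM entry is treated as a separate asset."""
    return len(components)


def count_distinct(components):
    """Count each purl once, regardless of how many bom-refs carry it."""
    return len({c["purl"] for c in components})


def duplicated_purls(components):
    """Purls that appear under more than one bom-ref."""
    counts = Counter(c["purl"] for c in components)
    return sorted(p for p, n in counts.items() if n > 1)


print(count_entries(components), count_distinct(components))
print(duplicated_purls(components))
```

A tool that reports per entry would double-count any finding attached to dependency_B unless it groups by purl first.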

I understand the importance of including dependency_B twice and why it is being done. I believe the use cases that benefit most from this are internal ones, or ones where first-party developers need to know the different paths, especially for triage and remediation.

What is the probability that this happens? Does Sonatype or others have any data about how frequently this could occur?

Is it possible to provide a configurable option to enable/disable this behavior? Different stakeholders are going to have different expectations for what should and should not be in a BOM. I believe this case may warrant a configurable option to support the majority of stakeholders.

My two cents.

@knrc
Contributor Author

knrc commented Mar 16, 2023

@stevespringett I'm not sure the question as to whether multiple dependency_B components will be delivered is necessarily relevant, since the SBOMs being generated by the cyclonedx maven plugin are build time related (so more towards supply chain) than runtime related. If these components are consumed by other projects they could easily expose a different dependency hierarchy when consumed, e.g. see #312 for which I have a PR/testcase waiting to be submitted.

As things currently stand the output from the aggregate SBOMs would not be reliable. Each component will only exist once in the SBOM (ref == purl) and each component will then be part of only a single hierarchy, however this hierarchy is not necessarily correct nor consistent since it's currently "first one wins" for dependencies. The only reliable source would be the individual SBOMs for each project. If that's the case then do we even need aggregates? What are aggregates trying to do if not aggregate the individual project SBOMs?
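The "first one wins" effect can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the plugin's logic: the graphs are plain dicts keyed by a ref equal to the component name, and `merge_first_one_wins` and `conflicting_refs` are hypothetical helpers.

```python
def merge_first_one_wins(module_graphs):
    """Merge per-module graphs, keeping the first hierarchy seen for each ref."""
    merged = {}
    for graph in module_graphs:
        for ref, children in graph.items():
            # A later, differing hierarchy for the same ref is silently dropped.
            merged.setdefault(ref, children)
    return merged


def conflicting_refs(module_graphs):
    """Refs whose direct dependencies differ between modules."""
    seen, conflicts = {}, set()
    for graph in module_graphs:
        for ref, children in graph.items():
            if ref in seen and set(seen[ref]) != set(children):
                conflicts.add(ref)
            seen.setdefault(ref, children)
    return sorted(conflicts)


module_a = {"A": ["B"], "B": ["C", "E"], "C": ["D"], "D": [], "E": []}
module_f = {"F": ["B"], "B": ["C"], "C": ["D"], "D": []}

print(merge_first_one_wins([module_a, module_f])["B"])
print(merge_first_one_wins([module_f, module_a])["B"])
print(conflicting_refs([module_a, module_f]))
```

The hierarchy recorded for B depends purely on module order. When module_f is processed first, E is still listed in the merged graph but nothing references it.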

With regards to how often this could occur I don't really have any concrete numbers, but I suspect it is more common than you would possibly expect given the number of ways this could occur

  • dependency version management
  • dependency exclusions
  • dependency resolution with differing sets of artifacts
  • poms specifying the same dependencies in different orders
  • dependency resolution involving provided, test, optional dependencies

This could certainly be made configurable, however if this were to be done we should also cause the build to fail when using the current mechanism when it determines it cannot handle these types of conflicts. I'm not really sure of the utility of this, however if this is a concern then perhaps a better approach would be to move over to this mechanism and bump the version to the next major with documentation stating what the impact would be.

@hboutemy
Contributor

I think this all has to do with "what does an aggregate SBOM mean"?

> My question: is a single quantity of dependency_B delivered, or are multiple occurrences of dependency_B delivered?
> If multiple occurrences of dependency_B are delivered in the final build, then expressing dependency_B twice is expected. If a single occurrence of dependency_B is delivered then most use cases will not be expecting that.

An aggregate SBOM is an aggregation of multiple individual SBOMs, each individual SBOM representing one concrete (Maven) module that is one concrete build delivering a concrete build.

Then to me, we can consider that the aggregation delivers multiple occurrences of dependency_B

@hboutemy
Contributor

@knrc can you share your example pom.xml tree so we can make the example clearer? We need to clearly name the root POM vs the module POMs vs the dependencies.
Then associate the different SBOMs with their pom.xml (with the SBOM for the root pom.xml being the aggregate SBOM, while the others are the individual SBOMs).

@knrc
Contributor Author

knrc commented Mar 20, 2023

@hboutemy The test case in the PR has examples of version management and exclusions, and I'm writing up a blog with examples for the last three in the list mentioned above. I'll hopefully have this completed before our call to discuss.

@stevespringett
Member

stevespringett commented Mar 20, 2023

> I'm not sure the question as to whether multiple dependency_B components will be delivered is necessarily relevant, since the SBOMs being generated by the cyclonedx maven plugin are build time related (so more towards supply chain) than runtime related

It's absolutely related. If you take an application that has a multi-module build, such as Webgoat, the SBOM that I generate at build represents the build that is potentially delivered to customers. Webgoat does not deliver multiple versions of commons-io for example. It delivers only one in the resulting artifact. Having multiple of the same component will be a non-starter for many/most orgs.

If this change is implemented, I will need a way to disable this functionality for use with my own employer. I suspect other Java shops will need the same since they do not want to erroneously include duplicate components. If a workaround is not provided, I will no longer be able to use the plugin myself, and will have to find alternatives.

We need to think about all the use cases that are affected by a change like this. While this change works for some use cases, it doesn't for others.

@stevespringett
Member

stevespringett commented Mar 20, 2023

Since the default as of this time has been to only include a single occurrence of a component, I would like to retain that behavior and add an optional flag that would go down this path. Both paths should retain the complete dependency tree, but the optional flag would include the duplicate components.

@knrc
Contributor Author

knrc commented Mar 20, 2023

> It's absolutely related. If you take an application that has a multi-module build, such as Webgoat, the SBOM that I generate at build represents the build that is potentially delivered to customers. Webgoat does not deliver multiple versions of commons-io for example. It delivers only one in the resulting artifact. Having multiple of the same component will be a non-starter for many/most orgs.

I'm fine with making this configurable, but I think the only way this makes sense is if we then fail the build when we know that it will generate SBOMs that are not consistent. Webgoat is a good project to discuss, since there are certainly inconsistencies in those projects which show up through the aggregated SBOM.

One example is auth-bypass. When it is built it has the following direct dependencies:

    <dependency ref="pkg:maven/org.owasp.webgoat.lesson/auth-bypass@v8.0.0.M15?type=jar">
      <dependency ref="pkg:maven/org.owasp.webgoat/webgoat-container@v8.0.0.M15?type=jar"/>
      <dependency ref="pkg:maven/org.owasp.encoder/encoder@1.2?type=jar"/>
      <dependency ref="pkg:maven/com.thoughtworks.xstream/xstream@1.4.7?type=jar"/>
      <dependency ref="pkg:maven/org.projectlombok/lombok@1.16.20?type=jar"/>
      <dependency ref="pkg:maven/org.apache.commons/commons-exec@1.3?type=jar"/>
    </dependency>

and when consumed by webgoat-server it has the following direct dependencies

    <dependency ref="pkg:maven/org.owasp.webgoat.lesson/auth-bypass@v8.0.0.M15?type=jar">
      <dependency ref="pkg:maven/org.owasp.encoder/encoder@1.2?type=jar"/>
      <dependency ref="pkg:maven/com.thoughtworks.xstream/xstream@1.4.7?type=jar"/>
      <dependency ref="pkg:maven/org.apache.commons/commons-exec@1.3?type=jar"/>
    </dependency>

In this example the versions of the three artifacts above happen to be the same in each, however that is not guaranteed to be the case.
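The difference between the two views can be computed mechanically. The sketch below parses the two fragments quoted above with Python's standard `xml.etree.ElementTree` and compares the direct dependency refs; the `direct_refs` helper and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# auth-bypass as seen when the module itself is built.
AS_BUILT = """
<dependency ref="pkg:maven/org.owasp.webgoat.lesson/auth-bypass@v8.0.0.M15?type=jar">
  <dependency ref="pkg:maven/org.owasp.webgoat/webgoat-container@v8.0.0.M15?type=jar"/>
  <dependency ref="pkg:maven/org.owasp.encoder/encoder@1.2?type=jar"/>
  <dependency ref="pkg:maven/com.thoughtworks.xstream/xstream@1.4.7?type=jar"/>
  <dependency ref="pkg:maven/org.projectlombok/lombok@1.16.20?type=jar"/>
  <dependency ref="pkg:maven/org.apache.commons/commons-exec@1.3?type=jar"/>
</dependency>
"""

# auth-bypass as seen when consumed by webgoat-server.
AS_CONSUMED = """
<dependency ref="pkg:maven/org.owasp.webgoat.lesson/auth-bypass@v8.0.0.M15?type=jar">
  <dependency ref="pkg:maven/org.owasp.encoder/encoder@1.2?type=jar"/>
  <dependency ref="pkg:maven/com.thoughtworks.xstream/xstream@1.4.7?type=jar"/>
  <dependency ref="pkg:maven/org.apache.commons/commons-exec@1.3?type=jar"/>
</dependency>
"""


def direct_refs(xml_text):
    """Return the refs of the direct children of a <dependency> element."""
    root = ET.fromstring(xml_text.strip())
    return {child.get("ref") for child in root}


only_when_built = direct_refs(AS_BUILT) - direct_refs(AS_CONSUMED)
only_when_consumed = direct_refs(AS_CONSUMED) - direct_refs(AS_BUILT)
print(sorted(only_when_built))
print(sorted(only_when_consumed))
```

For these two fragments the difference is webgoat-container and lombok, which appear only in the view of the module's own build.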

> If this change is implemented, I will need a way to disable this functionality for use with my own employer. I suspect other Java shops will need the same since they do not want to erroneously include duplicate components. If a workaround is not provided, I will no longer be able to use the plugin myself, and will have to find alternatives.
>
> We need to think about all the use cases that are affected by a change like this. While this change works for some use cases, it doesn't for others.

Agreed, although I think this is really only an issue when these differences exist and are mis-represented in the aggregate SBOM. In those situations the aggregate SBOM could easily mislead someone into thinking they know everything which has gone into the build when perhaps they don't. I believe the only safe approach we currently have is to rely on the individual BOMs.

We have a call tomorrow with @hboutemy, perhaps this is easier to go through in person.

@jkowalleck
Member

jkowalleck commented Mar 23, 2023

came here because of the initial CycloneDX/specification#197


I am really trying to understand, but I do not get it. Explain it to me as if I were 5 years old, please, and assume I know SBOM but do not know any Java or its ecosystems.

For example here, to me, it looks like you are confusing runtime resolution graphs with actual dependency graphs. Based on #310 (comment) I read that pkg:maven/org.owasp.webgoat/webgoat-container was bundled into the product (foo.jar) because it is a dependency of pkg:maven/org.owasp.webgoat.lesson/auth-bypass. But this dependency is never used at runtime when consumed by webgoat-server, possibly because this server ships its own version of webgoat-container, so that other one is resolved and used at runtime.
But STILL this dependency exists in the jar, and it does not matter who resolves it or if it is used at all. The dependency shows why a component was part of the build result.

@knrc
Contributor Author

knrc commented Mar 23, 2023

> came here because of the initial CycloneDX/specification#197

Hiya Jan

> I am really trying to understand, but I do not get it. Explain it to me as if I were 5 years old, please, and assume I know SBOM but do not know any Java or its ecosystems.

I have a blog post coming up that will help, I'm just about to publish it.

> For example here, to me, it looks like you are confusing runtime resolution graphs with actual dependency graphs.

Not at all, but there is a related issue (#312) in which I would like to offer the option of generating either of those perspectives in the SBOM.

> Based on #310 (comment) I read that pkg:maven/org.owasp.webgoat/webgoat-container was bundled into the product (foo.jar) because it is a dependency of pkg:maven/org.owasp.webgoat.lesson/auth-bypass. But this dependency is never used at runtime when consumed by webgoat-server, possibly because this server ships its own version of webgoat-container, so that other one is resolved and used at runtime. But STILL this dependency exists in the jar, and it does not matter who resolves it or if it is used at all. The dependency shows why a component was part of the build result.

The issue is that the SBOM under discussion is an aggregate SBOM, and is intended to represent all projects within the multi-module project. As this aggregate contains both the auth-bypass and webgoat-server projects I would expect both hierarchies to be present in the SBOM. Unfortunately the dependency graphs seen in the current implementation are worse than just having chosen one or the other, since you can end up with orphaned parts of the graph and incorrect dependency graphs.

@knrc
Contributor Author

knrc commented Mar 23, 2023

@jkowalleck I've finally published the blog post, I hope this helps to explain the current issues with the aggregated SBOM. I included a number of examples in the blog, what I feel are the more common scenarios, and there are two more in the PR tests covering version management/exclusion.

@stevespringett
Member

stevespringett commented Mar 23, 2023

The two recommended approaches in CycloneDX both involve component isolation.

  1. Use component assemblies to specify all the components that are included in each Maven module, establishing a hierarchy within the BOM. Components with the same PURL (but different BOM refs) could be represented independently under each structure and the corresponding dependency tree could accurately reflect that.
  2. Use BOM-Link to externalize and link to each BOM. This accomplishes the same thing as option 1, but uses multiple BOM files.
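A rough Python sketch of what the two options might look like, under stated assumptions: option 1 nests each module's components under the module via the `components` array of a component, and option 2 builds a BOM-Link URN in the `urn:cdx:serialNumber/version#bom-ref` shape. The bom-ref values, the `component`, `all_bom_refs` and `bom_link` helpers, and the example serial number are all made up for illustration.

```python
import json


def component(bom_ref, name, children=()):
    """Build a component entry; nested children form an assembly."""
    entry = {"type": "library", "bom-ref": bom_ref, "name": name, "version": "1.0.0"}
    if children:
        entry["components"] = list(children)
    return entry


# Option 1: dependency_B is isolated under each module that uses it.
bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.4",
    "version": 1,
    "components": [
        component("module-A", "dependency_A",
                  [component("module-A/dependency_B", "dependency_B")]),
        component("module-F", "dependency_F",
                  [component("module-F/dependency_B", "dependency_B")]),
    ],
    "dependencies": [
        {"ref": "module-A", "dependsOn": ["module-A/dependency_B"]},
        {"ref": "module-F", "dependsOn": ["module-F/dependency_B"]},
    ],
}


def all_bom_refs(components):
    """Collect bom-refs from a component list, descending into assemblies."""
    refs = []
    for entry in components:
        refs.append(entry["bom-ref"])
        refs.extend(all_bom_refs(entry.get("components", [])))
    return refs


def bom_link(serial_number, version, bom_ref):
    """Option 2: reference a component that lives in another BOM file."""
    uuid = serial_number.replace("urn:uuid:", "", 1)
    return f"urn:cdx:{uuid}/{version}#{bom_ref}"


print(json.dumps(bom["components"][0], indent=2))
print(all_bom_refs(bom["components"]))
print(bom_link("urn:uuid:3e671687-395b-41f5-a30f-a58921a69b79", 1, "dependency_B"))
```

In both cases each occurrence of dependency_B gets its own anchor, so its dependency subtree can be described separately per module.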

During the call, it was expressed that neither of these options are preferred, yet, these are the two recommended ways of isolating component usage and dependency relationships.

If neither of these two is wanted, then I'm not 100% sure that this is something we need to try to fix. We could simply warn the user that the dependency graph may be incorrect. Perhaps the issue isn't with the specification, but rather with the early assumption that the output from an aggregate build could be accurately represented in a single BOM. I don't think it can without the use of assemblies.

Reading the blog post, I don't think a Merkle tree is the solution. This approach presents multiple problems:

  1. PURL is used in this case, but PURL is only one identifier. The spec also supports GNV, CPE, and SWID. If we truly want to solve this, not just for the Maven plugin but for everything, then multiple identifiers would need to be supported.
  2. The Merkle tree logic would need to be applied to every CycloneDX implementation, many of which we have no control over.
  3. How would we reverse the Merkle tree to understand the precise graph association, and how would we implement that logic in every implementation?

The amount of work involved in providing this functionality, not breaking anything in the process, for the sake of trying to stuff everything in a single BOM without using built-in features such as assemblies, is going to be massive. It's a ton of work that may take a year or more to complete.

I think this is one case where perfection is the enemy of good. And if we truly want perfect, then we need to utilize the existing mechanisms provided (assemblies or BOM-Link) to get to a state of perfection.

@jkowalleck
Member

jkowalleck commented Mar 24, 2023

nothing new, already solved it ;-)

I'll leave here my practical solution for Node.js, where every module is a package with its own module-resolution tree: https://github.com/CycloneDX/cyclonedx-node-npm/tree/main/demo/juice-shop/example-results

docs:

@knrc
Contributor Author

knrc commented Mar 24, 2023

@jkowalleck It looks as if your flat version is the same as what I've implemented in this PR. You also have the same component being duplicated at the top level but with different bom-refs.

@jkowalleck
Member

jkowalleck commented Mar 24, 2023

> [...] have the same component being duplicated at the top level but with different bom-refs.

@knrc yes. because in my case the components are actually duplicated in the file system. they actually exist multiple times in the build artifact. I do not track runtime-resolution but actual components, which affect runtime-resolution.

@knrc
Contributor Author

knrc commented Mar 24, 2023

> Reading the blog post, I don't think a Merkle tree is the solution. This approach presents multiple problems:

I think you are assuming the Merkle tree approach is to be generalized for every provider of SBOMs. Is this the case?

> 1. PURL is used in this case, but PURL is only one identifier. The spec also supports GNV, CPE, and SWID. If we truly want to solve this, not just for the Maven plugin but for everything, then multiple identifiers would need to be supported.

PURL is the identifier used by the maven plugin, and this is only a proposal for the maven plugin. The npm approach is handling the generation the same way, but choosing a different mechanism for generating their unique bom refs.

This is not something that needs to be standardized across every provider, the npm approach shows this is already not the case, and only has to make sense within the context of the SBOM where the reference is relevant.

> 2. The Merkle tree logic would need to be applied to every CycloneDX implementation, many of which we have no control over.

This is not the case, the generation of the bom-ref is specific to the implementation and should be treated as opaque by every consumer. The only aspect they should care about is equality for matching with references elsewhere in the SBOM.
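To make the idea concrete, here is a minimal Python sketch of a Merkle-style bom-ref: a hash of the component's purl combined with the refs of its direct dependencies, so two occurrences get the same ref only when their whole subtree matches. The thread does not spell out the plugin's actual scheme, so the `merkle_ref` function, the use of SHA-256 and the sorting of children are assumptions.

```python
import hashlib


def merkle_ref(purl, child_refs):
    """Hash the purl together with the already-hashed refs of its children."""
    digest = hashlib.sha256(purl.encode("utf-8"))
    for ref in sorted(child_refs):  # sorted so sibling order does not change the ref
        digest.update(b"\x00" + ref.encode("utf-8"))
    return digest.hexdigest()


def purl(name):
    return f"pkg:maven/com.example.dependency_trees.exclusion/{name}@1.0.0?type=jar"


# Build refs bottom-up, leaves first.
ref_d = merkle_ref(purl("dependency_D"), [])
ref_c = merkle_ref(purl("dependency_C"), [ref_d])
ref_e = merkle_ref(purl("dependency_E"), [])

# dependency_B as seen under dependency_A and under dependency_F (E excluded).
ref_b_in_a = merkle_ref(purl("dependency_B"), [ref_c, ref_e])
ref_b_in_f = merkle_ref(purl("dependency_B"), [ref_c])

print(ref_b_in_a == ref_b_in_f)
```

A consumer only ever compares these values for equality against `ref` attributes elsewhere in the same document; nothing needs to be reversed.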

> 3. How would we reverse the Merkle tree to understand the precise graph association, and how would we implement that logic in every implementation?

Why do you need to? The bom-ref is an opaque value, we shouldn't be using this to define any graph association.

> The amount of work involved in providing this functionality, not breaking anything in the process, for the sake of trying to stuff everything in a single BOM without using built-in features such as assemblies, is going to be massive. It's a ton of work that may take a year or more to complete.

It seems npm is already implementing this approach, for an example see this bom.xml.

> I think this is one case where perfection is the enemy of good. And if we truly want perfect, then we need to utilize the existing mechanisms provided (assemblies or BOM-Link) to get to a state of perfection.

It looks as if this approach is already in use elsewhere, I'm still struggling to understand the objection.

@knrc
Contributor Author

knrc commented Mar 24, 2023

> @knrc yes. because in my case the components are actually duplicated in the file system. they actually exist multiple times in the build artifact. I do not track runtime-resolution but actual components, which affect runtime-resolution.

+1

In the maven plugin case these components are not duplicated in the filesystem but do exist multiple times in the build, with different hierarchies, and the current SBOM generation is representing the build time resolution. Consumers of the artifact see a different graph, which is the subject of a different issue (#312).

@jkowalleck
Member

jkowalleck commented Mar 24, 2023

re: #310 (comment)

> PURL is the identifier used by the maven plugin, and this is only a proposal for the maven plugin. The npm approach is handling the generation the same way, but choosing a different mechanism for generating their unique bom refs.
> This is not something that needs to be standardized across every provider, the npm approach shows this is already not the case, and only has to make sense within the context of the SBOM where the reference is relevant.

Correct me if I'm wrong, @stevespringett

Each bom-ref's value must be unique in the CycloneDX document it is defined in.
That is asserted by schema rules - like here: https://github.com/CycloneDX/specification/blob/ccbf7b5781ef534cd62616e3c4221004c7c82a66/schema/bom-1.4.xsd#L2402-L2405

The value of a bom-ref is nothing with meaning outside the CycloneDX document. Its only purpose is to be an anchor that can be referenced (via ref) inside the document or from another document via bom-link (read https://cyclonedx.org/capabilities/bomlink/).
You could use any string for bom-ref value. Usually people derive the value from PURL, in hope it is unique in the context of the BOM document they build.
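These two rules (bom-refs unique within the document, every `ref` pointing at an existing anchor) are easy to check. The Python sketch below is a simplified stand-in for schema validation: it only looks at top-level `components` and `dependencies` of a JSON-like dict, and the `check_refs` helper is hypothetical.

```python
from collections import Counter

PURL_B = "pkg:maven/com.example/dependency_B@1.0.0?type=jar"
PURL_C = "pkg:maven/com.example/dependency_C@1.0.0?type=jar"


def check_refs(bom):
    """Return (duplicate bom-refs, refs that point at no component)."""
    refs = [c["bom-ref"] for c in bom.get("components", [])]
    duplicates = sorted(r for r, n in Counter(refs).items() if n > 1)
    known = set(refs)
    dangling = sorted({
        target
        for dep in bom.get("dependencies", [])
        for target in [dep["ref"], *dep.get("dependsOn", [])]
        if target not in known
    })
    return duplicates, dangling


# Same purl twice is fine as long as the bom-refs differ.
good = {
    "components": [
        {"bom-ref": "dependency_B#1", "purl": PURL_B},
        {"bom-ref": "dependency_B#2", "purl": PURL_B},
        {"bom-ref": "dependency_C#1", "purl": PURL_C},
    ],
    "dependencies": [
        {"ref": "dependency_B#1", "dependsOn": ["dependency_C#1"]},
        {"ref": "dependency_B#2"},
    ],
}

# Deriving the bom-ref from the purl breaks down once the purl repeats.
bad = {
    "components": [
        {"bom-ref": PURL_B, "purl": PURL_B},
        {"bom-ref": PURL_B, "purl": PURL_B},
    ],
    "dependencies": [{"ref": PURL_B, "dependsOn": ["missing"]}],
}

print(check_refs(good))
print(check_refs(bad))
```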

@knrc
Contributor Author

knrc commented Mar 24, 2023

> In the maven plugin case these components are not duplicated in the filesystem but do exist multiple times in the build, with different hierarchies, and the current SBOM generation is representing the build time resolution. Consumers of the artifact see a different graph, which is the subject of a different issue (#312).

Note that if each of these aggregated builds were packaged up (i.e. in a zip/tar etc) then it would be the same situation.

@aloubyansky

Could we clarify the following: could a single SBOM document contain multiple components that share the same purl but have unique bom-refs?

AFAIU, the answer is yes. And if so, there is no problem representing multiple variations in direct dependencies of a component with the same purl in a single SBOM. As to how often this will happen, it depends on what a given project represents.

If a build of a project produces a single and flat (classloading-wise if we are talking about Java) runtime then, I suppose, what counts is what ends up in that single runtime, which would typically be represented as a single module (even if it's a multi-module project) and, I would argue, there shouldn't be any aggregation happening. So the issue raised wouldn't occur in this case.

If the produced runtime is not flat classloading-wise or a project produces multiple root components that could be consumed in any combination by target users then we'd either need a separate SBOM per root component or an aggregate one, in which dependency variations for components with the same purl will be a common case and must be supported. Having an option to "suppress" them in some way will be a major flaw in the tool.

@knrc
Contributor Author

knrc commented Mar 24, 2023

Hiya Alexey

> Could we clarify the following: could a single SBOM document contain multiple components that share the same purl but have unique bom-refs?

Yes, the npm SBOMs are already doing this.

> AFAIU, the answer is yes. And if so, there is no problem representing multiple variations in direct dependencies of a component with the same purl in a single SBOM. As to how often this will happen, it depends on what a given project represents.

The current discussion is more to do with how this is represented, whether we try to de-duplicate and flatten the representation in the SBOM (as this PR does) or generate the hierarchy through the assembly approach leading to the exploded dependency graph.

> If a build of a project produces a single and flat (classloading-wise if we are talking about Java) runtime then, I suppose, what counts is what ends up in that single runtime, which would typically be represented as a single module (even if it's a multi-module project) and, I would argue, there shouldn't be any aggregation happening. So the issue raised wouldn't occur in this case.

In this case it would be a normal SBOM rather than the aggregate one (leaving aside the "build time vs runtime" discussion for now).

> If the produced runtime is not flat classloading-wise or a project produces multiple root components that could be consumed in any combination by target users then we'd either need a separate SBOM per root component or an aggregate one, in which dependency variations for components with the same purl will be a common case and must be supported. Having an option to "suppress" them in some way will be a major flaw in the tool.

+1, which is what this PR is addressing albeit using a flat structure.

The issue with the current cyclonedx maven plugin implementation is the dependency hierarchies for aggregated SBOMs are unreliable, I wrote this up in a blog.

@aloubyansky

@knrc could you elaborate a bit more on (quote from the blog):

> What is really lacking in the CycloneDX specification is a way in which we can easily describe alternative dependency hierarchies for a component

Today this can be done by generating unique bom-refs for each variation. Or did you mean to solve more than that?

Unless the assembly approach is the way to go, what I find truly lacking is a way to identify the roots of the trees (or root components of the project) in an SBOM. We could say some of them could be identified by analyzing dependency trees and finding components that are not depended on by any other component, but that won't always catch all the legitimate root components of a project.
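A small Python sketch of that heuristic and its blind spot, using a made-up graph: `module_G` is a module of the project that `module_A` also depends on, so the "nothing depends on it" test never reports it as a root. The `candidate_roots` helper and the graph contents are hypothetical.

```python
def candidate_roots(dependencies):
    """Refs that no other entry in the dependency graph depends on."""
    depended_on = {child for children in dependencies.values() for child in children}
    return sorted(ref for ref in dependencies if ref not in depended_on)


# Hypothetical aggregate graph: three project modules plus their dependencies.
dependencies = {
    "module_A": ["module_G", "dependency_B#1"],
    "module_F": ["dependency_B#2"],
    "module_G": [],
    "dependency_B#1": ["dependency_C"],
    "dependency_B#2": [],
    "dependency_C": [],
}

print(candidate_roots(dependencies))
```

Only `module_A` and `module_F` are found, even though `module_G` is also a root component of the project.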

When it comes to assemblies vs what @knrc is suggesting here, perhaps what should be considered, besides other things, is how these SBOMs will be consumed. To me it looks like SBOM formats imply specialized tools for humans to analyze their content. For a project of a "decent size", they simply look too complex to read and make sense of them for a human in their original text format. If SBOMs were indeed intended to be analyzed using tools then looking for ways to remove redundancy in content would be a good idea, given that SBOMs often will be pretty huge anyway. From that perspective, the approach @knrc is suggesting would have an advantage over the one based on assemblies, because it will allow representing each variation of direct dependencies (and their subtrees) of a component only once per SBOM document instead of once per occurrence in every dependency tree.

@knrc
Contributor Author

knrc commented Mar 24, 2023

> @knrc could you elaborate a bit more on (quote from the blog):
>
> > What is really lacking in the CycloneDX specification is a way in which we can easily describe alternative dependency hierarchies for a component
>
> Today this can be done by generating unique bom-refs for each variation.

+1, which is what this PR does however it has to duplicate the component in order to do this. The only piece of information which does differ between those components is the bom-ref.

> Or did you mean to solve more than that?

Ideally the component would exist only once and you would be able to relate that component to multiple hierarchies within the same SBOM, the spec currently assumes a 1-1 between a component in the SBOM and its hierarchy. There was a version of the blog which had a proposal, however I removed it to keep the blog focussed on the problems with the current implementation.

> Unless the assembly approach is the way to go, what I find truly lacking is a way to identify the roots of the trees (or root components of the project) in an SBOM. We could say some of them could be identified by analyzing dependency trees and finding components that are not depended on by any other component, but that won't always catch all the legitimate root components of a project.

Identifying those roots is certainly more complex, and I'm not sure the current plugin has sufficient information to determine those even with the assembly approach. There is also a problem with the current implementation erroneously generating orphaned segments of the dependency graph, so you could end up with components which were not depended upon but not necessarily the roots as one might expect.

> When it comes to assemblies vs what @knrc is suggesting here, perhaps what should be considered, besides other things, is how these SBOMs will be consumed. To me it looks like SBOM formats imply specialized tools for humans to analyze their content. For a project of a "decent size", they simply look too complex to read and make sense of them for a human in their original text format. If SBOMs were indeed intended to be analyzed using tools then looking for ways to remove redundancy in content would be a good idea, given that SBOMs often will be pretty huge anyway. From that perspective, the approach @knrc is suggesting would have an advantage over the one based on assemblies, because it will allow representing each variation of direct dependencies (and their subtrees) of a component only once per SBOM document instead of once per occurrence in every dependency tree.

+1 and that's certainly what I was hoping for. As you say, the assembly approach will likely result in a BOM which is too large and unwieldy for human consumption for any sizeable aggregation.

@aloubyansky

Ideally the component would exist only once and you would be able to relate that component to multiple hierarchies within the same SBOM; the spec currently assumes a 1-1 relationship between a component in the SBOM and its hierarchy. There was a version of the blog which had a proposal, however I removed it to keep the blog focussed on the problems with the current implementation.

Yes, that makes sense.

Unless the assembly approach is the way to go, what I find truly lacking is a way to identify the roots of the trees (or root components of the project) in an SBOM. We could say some of them could be identified by analyzing dependency trees and finding components that no other component depends on, but that won't always catch all the legitimate root components of a project.

Identifying those roots is certainly more complex, and I'm not sure the current plugin has sufficient information to determine those even with the assembly approach. There is also a problem with the current implementation erroneously generating orphaned segments of the dependency graph, so you could end up with components which were not depended upon but not necessarily the roots as one might expect.

Right, I was commenting more on the current model (concept) in general though rather than the current impl.

@aloubyansky

While I'm generally in favor of the direction of @knrc's PR, for the next model revision, instead of encoding purls into refs I wanted to suggest exploring a little structural adjustment instead.
Requiring IDs to carry meaning beyond simply satisfying uniqueness may complicate the design and possibly undermine UX and performance, especially if they turn out to be long strings in DBs and maps and used as keys in searches.

As @knrc pointed out in his blog, there is already a notion of a dependency tree node represented by org.cyclonedx.model.Dependency. A dependency tree node references a component but at the same time is a separate entity in the model, and so naturally could have its own ID, properties and metadata. So how about adding an (explicit) ID to org.cyclonedx.model.Dependency and letting each org.cyclonedx.model.Dependency express dependencies on other org.cyclonedx.model.Dependency instances using their IDs?

Here is what it could look like for the following two-module framework:

library-a-module1:1.0 depends on library-a-module2:1.0 (with the exclusion of library-b)
library-a-module2:1.0 depends on library-b:1.0
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "serialNumber": "urn:uuid:3e671687-395b-41f5-a30f-a58921a69b79",
  "version": 1,
  "metadata": {
    "component": {
      "bom-ref": "library-a",
      "type": "framework",
      "name": "A framework",
      "version": "1.0"
    }
  },
  "components": [
    {
      "bom-ref": "pkg:maven/org.acme/library-a-module1@1.0",
      "type": "library",
      "group": "org.acme",
      "name": "library-a-module1",
      "version": "1.0",
      "purl": "pkg:maven/org.acme/library-a-module1@1.0"
    },
    {
      "bom-ref": "pkg:maven/org.acme/library-a-module2@1.0",
      "type": "library",
      "group": "org.acme",
      "name": "library-a-module2",
      "version": "1.0",
      "purl": "pkg:maven/org.acme/library-a-module2@1.0"
    },
    {
      "bom-ref": "pkg:maven/org.acme/library-b@1.0",
      "type": "library",
      "group": "org.acme",
      "name": "library-b",
      "version": "1.0",
      "purl": "pkg:maven/org.acme/library-b@1.0"
    }
  ],
  "dependencies": [
    {
      "id": "library-a-module1#1",
      "ref": "pkg:maven/org.acme/library-a-module1@1.0",
      "dependsOn": [
        "library-a-module2#1"
      ]
    },
    {
      "id": "library-a-module2#1",
      "ref": "pkg:maven/org.acme/library-a-module2@1.0",
      "dependsOn": []
    },
    {
      "id": "library-a-module2#2",
      "ref": "pkg:maven/org.acme/library-a-module2@1.0",
      "dependsOn": [
        "library-b#1"
      ]
    },
    {
      "id": "library-b#1",
      "ref": "pkg:maven/org.acme/library-b@1.0",
      "dependsOn": []
    }
  ]
}

Here we end up with a single component per purl but there could be multiple dependency tree nodes referencing the same component by its purl/bom-ref.
There are two dependency tree nodes for library-a-module2: library-a-module2#1 and library-a-module2#2, each having different direct dependencies (these example IDs are not meant to suggest any specific convention, they could be anything that's unique in the scope of an SBOM document, including a simple integer).
Also values of dependsOn now become IDs of other dependency tree nodes instead of the bom-refs.
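
To make the proposed structure more concrete, here is a minimal sketch of how a consumer might expand one of these dependency tree nodes into a tree of component refs. It assumes the proposed (non-standard) `id` field from the example above. `build_tree` is a hypothetical helper name, not an existing CycloneDX API.

```python
def build_tree(nodes, node_id):
    """Expand a dependency tree node into nested (component_ref, children) tuples.

    Assumes the proposed model: each node has a unique "id", a "ref" pointing
    at a component bom-ref, and "dependsOn" listing the ids of other nodes.
    """
    by_id = {node["id"]: node for node in nodes}

    def expand(current_id, path):
        if current_id in path:
            raise ValueError(f"cycle detected at {current_id}")
        node = by_id[current_id]
        children = [
            expand(child_id, path | {current_id})
            for child_id in node.get("dependsOn", [])
        ]
        return (node["ref"], children)

    return expand(node_id, frozenset())


# The dependency nodes from the example document above
nodes = [
    {"id": "library-a-module1#1", "ref": "pkg:maven/org.acme/library-a-module1@1.0",
     "dependsOn": ["library-a-module2#1"]},
    {"id": "library-a-module2#1", "ref": "pkg:maven/org.acme/library-a-module2@1.0",
     "dependsOn": []},
    {"id": "library-a-module2#2", "ref": "pkg:maven/org.acme/library-a-module2@1.0",
     "dependsOn": ["library-b#1"]},
    {"id": "library-b#1", "ref": "pkg:maven/org.acme/library-b@1.0",
     "dependsOn": []},
]

tree_module1 = build_tree(nodes, "library-a-module1#1")
tree_module2 = build_tree(nodes, "library-a-module2#2")
```

With this data, the tree for library-a-module1 contains library-a-module2 without library-b (the exclusion), while the tree for library-a-module2 itself still contains library-b, even though both refer to the same component.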

@knrc
Contributor Author

knrc commented Mar 24, 2023

Here is what it could look like for the following two-module framework

+1 this is the proposal I originally had on my blog post.

@aloubyansky

AFAIU, currently bom-refs are not required to be explicitly set in the text document and, if absent, would be set to the value of purl. To be able to parse current versions of SBOMs in the future, the id of a dependency could follow a similar rule: if not explicitly set, it would be initialized with the value of the ref. Explicit ids could be required in documents only in cases where a component was found to have different direct dependencies in different dependency trees.
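
The defaulting rule described above could be sketched like this. It is only an illustration of the idea, assuming the proposed `id` field. `normalize_dependencies` is a hypothetical helper, not something the current parsers provide.

```python
def normalize_dependencies(dependencies):
    """Give every dependency node an id, defaulting to its ref when absent.

    Sketch of the proposed rule: documents without explicit ids keep their
    current meaning, because dependsOn values (bom-refs today) then match
    the defaulted ids.
    """
    normalized = []
    seen_ids = set()
    for dep in dependencies:
        node = dict(dep)  # do not mutate the parsed input
        node.setdefault("id", node["ref"])
        if node["id"] in seen_ids:
            # Two nodes for the same component need explicit, distinct ids
            raise ValueError(f"duplicate dependency id: {node['id']}")
        seen_ids.add(node["id"])
        normalized.append(node)
    return normalized


# An entry without an id, as in current documents, and one with an explicit id
legacy = [
    {"ref": "pkg:maven/org.acme/library-b@1.0", "dependsOn": []},
    {"id": "library-a-module2#2", "ref": "pkg:maven/org.acme/library-a-module2@1.0",
     "dependsOn": ["pkg:maven/org.acme/library-b@1.0"]},
]
normalized = normalize_dependencies(legacy)
```

Raising on duplicate ids reflects the point that explicit ids would only be needed when the same component appears with different direct dependencies.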

@jkowalleck
Member

jkowalleck commented Mar 31, 2023

bom-ref is already a linkage. why add another one called id?
Did the topic change? Are you discussing runtime-resolution graphs or dependency-graphs (they are NOT the same, the one is build-time, the other is runtime and mostly unpredictable at the time an SBOM is generated)?

@aloubyansky

Just so I understand it correctly, don't bom-refs exist specifically to record dependencies? Are they used for anything else?

It's true that bom-refs do allow recording dependency trees (runtime, build time, whatever). The issue is that the current design (unless we are meant to use assemblies for this) forces adding multiple "copies" of a component with the same purl under components but with different bom-refs. Adding an id to the dependency will eliminate the need for different bom-refs pointing to the same purl.

@aloubyansky

This happens primarily when aggregating SBOMs.

@jkowalleck
Member

jkowalleck commented Mar 31, 2023

re: #310 (comment)
what does a purl have to do with a bom-ref?

@aloubyansky

Good question. Sorry for my ignorance, I'm still learning. Could you help clarify the meaning and purpose behind both bom-ref and purl?
I thought purl was pretty much meant to identify a component. However, the actual ID in the model is defined by a bom-ref, which defaults to the value of purl unless explicitly overridden. Is there another reason to override the default bom-ref value (which is the purl) besides referring to dependencies?

@jkowalleck
Member

jkowalleck commented Mar 31, 2023

this issue is getting too mixed.
i would suggest converting it to a discussion, so we could use threads to stay focussed and split topics at the same time. @CycloneDX/java-maven-maintainers

@knrc
Contributor Author

knrc commented Mar 31, 2023

@jkowalleck @aloubyansky This was intended to be a discussion about the underlying issue, with any spec changes being discussed elsewhere. Can we please keep to this?

There was an initial proposal I made to Steve about possible spec changes, trying to remain backwardly compatible with the current use. The idea was that we would keep ref as the primary attribute for each dependency, as the linkage between them, but allow an additional (optional) attribute (componentRef was suggested, but Steve wasn't keen on the name) which could then override the reference to the component.

  • ref would remain unique within the dependencies section
  • by default ref would reference the component bom-ref
  • if we needed to include the component multiple times then we would choose unique ref attributes and then override the component relationship using the second attribute
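
A minimal sketch of how a consumer might resolve the component under this proposal follows. The attribute name `componentRef` is the one floated above and was not agreed upon. `resolve_component` and the sample data are hypothetical.

```python
def resolve_component(dependency, components_by_bom_ref):
    """Find the component a dependency entry describes.

    Sketch of the proposal: ref stays unique within the dependencies section
    and points at the component bom-ref by default; the optional second
    attribute (called componentRef here, name not agreed) overrides it.
    """
    target = dependency.get("componentRef", dependency["ref"])
    return components_by_bom_ref[target]


MODULE2 = "pkg:maven/org.acme/library-a-module2@1.0"
components = {MODULE2: {"bom-ref": MODULE2, "name": "library-a-module2"}}

dependencies = [
    # Default case: ref is the component's bom-ref
    {"ref": MODULE2, "dependsOn": []},
    # Second occurrence of the same component: unique ref plus override
    {"ref": "library-a-module2-with-b", "componentRef": MODULE2,
     "dependsOn": ["pkg:maven/org.acme/library-b@1.0"]},
]
names = [resolve_component(d, components)["name"] for d in dependencies]
```

Both entries resolve to the same component, while entries that omit the second attribute keep their current meaning.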

Steve had some other ideas, including looking at how we could use some potential changes already coming through 1.5, but this particular discussion is really about what we can do now and not what the future direction would be.

The suggestions for what we could do now really focussed on composition, but I'm not really a fan of these since they explode the size of the SBOM and make it even more unwieldy than it currently is. I have some things to test out to compare the approaches; unfortunately I've been offline this week because of family in the UK (still here for a few days).

I'll get back to this soon.

@stevespringett
Member

@aloubyansky the proposed solution will break all existing consumption tools.

@aloubyansky

@stevespringett it's not backwards compatible, it could be made forward compatible though. It's natural for standards to evolve with time. That's the reason for specVersion, isn't it? I'd assume forward compatible changes should be acceptable in new versions. Otherwise, how would the model ever evolve?
