Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FlatMerge should deduplicate "equivalent" entries and squash their data #245

Open
wants to merge 297 commits into
base: main
Choose a base branch
from

Conversation

jimklimov
Copy link
Contributor

@jimklimov jimklimov commented Aug 11, 2023

The proposed PR is the result of hard labor of my past month or so, as I learned a lot about C# along the way, to solve the practical problem that after I merged hundreds of component Bom documents into one to craft a "release SBOM", the resulting document was invalid against the CycloneDX schemas, particularly that many entries were mentioned many times. 26K times too many in fact...

Fixes CycloneDX/cyclonedx-cli#219
Fixes CycloneDX/cyclonedx-cli#188
Fixes #82
Follows up from #199 and #216 AFAIK

@fnxpt kindly pointed me to the work posted and merged in #199 for the library - note that related PR CycloneDX/cyclonedx-cli#305 to actually benefit from the change in the CLI tool is still not merged; my practical work builds upon that.

There was quite a bit of back and forth as I was coercing C# to minimize copy-paste in application classes, so all activities are doable by addressing their new base class BomEntity with which they may, but not "must", implement custom Equivalent(), Equal() and MergeWith() methods. So there is a lot of reflection, with cached type, property and method information which had really noticeably sped up the millions of loops I've had executed, and some insane tricks to produce desired types at run-time (eh, suddenly Groovy is a very friendly language!) On the upside, this allowed to remove the several identical implementations of Equals(), Serialize() etc. which popped up in earlier PRs since February.

A relatively compact example MergeWith() implementation can be seen in Hash.cs which on one hand served as a PoC, and on another can notice if someone has several different hash values for the same algorithm.

A major code block of such implementation is in the Component.cs since its "equivalence" is not "equality" of the whole record but just that of the "BomRef" - and numerous other properties tend to differ most in the data sets I had (different scopes, whose priority order of values is defined in spec; different collections of hashes; sometimes other info), where it loops over the class property lists and has a switch/case to choose further handling logic based on their types, names and other criteria.

Finally, the original "faulty" relaxed merging logic which got me into this was redeemed due to its speed: tossing away the first 25K entries that were identical to something already present with a quick and cheap check cost some 3-5 sec on my laptop. Then comes the last pass with the careful MergeWith() to squash the remaining ~1.5K entries into 660 ultimately unique entries which have all the original info coming from different corners of the universe in another 10 sec. Given that there is a lot of loop-in-loop parsing involved, the original take to be careful all the way (with each of the incoming 330 BOM documents) was expensive: script took about 2-5 minutes to complete each attempt in different iterations, which made dev-testing poorly feasible. And debugging (with expression breakpoints to catch the interesting errors) took half an hour just to get to a test case with the original wealth of data...

I currently do no plan to undertake the re-design of logic which went into the Component class to pull it apart into some helper methods (e.g. merge of properties which are lists and the default "un-interesting" property handling) which could be re-used if other classes get to define their non-default merging logic.

Some other ideas were to use the ListMergeHelper for list properties in the classes (as part of their MergeWith() implem), and to consider making the Bom merge a MergeWith() implementation too. This one feels difficult due to numerous (Flat, Hierarchical) merge strategies that are available; maybe subclassing with one implem each (and defined equality/equivalence between them) could be an answer. Comments about this have been left in the code.

Finally, there may be or not be further entity types that would benefit from a diligent MergeWith(), but currently I don't have any on the plate - and my files do not upset the schemas, so...

These are all reasonable exercises for some other day, takers welcome. I'm happy this finally works and the burden is off my shoulders (and a half-week sprint task that took an extra month can get closed) so would rest on laurels and sleep for a week ;)

…y (inherit equality methods, adapt MergeWith())

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…s for different traces

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ity.KnownEntityTypeProperties[]

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ipping

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ipping

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ull) Scope value

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ipping

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…arious Scope values, refresh comments about spec versions involved

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…methodEquals and methodMergeWith via pre-cached BomEntity static Dicts

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…tyListMergeHelperReflection and BomEntityListReflection

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
… in original JSON) NonNullable properties

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…use default Equals() and Equivalent() implementations

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ey for BomEntityListReflection[List<type>]

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…operty

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…property

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
… hit (handled a "tmp" target entry which was equivalent or equal to an incoming "obj" entry)

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ble" helper properties

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…s", skip "NonNullable" helper properties

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…rialize(BomEntity) to rule them all

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…nternal Serialize() implementations

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…mEntity implem)

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ompact(BomEntity) with minimal markup overhead and use it in BomEntity.Equals()

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…tations for BomEntity itself as the handler for derived classes

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…BomEntityWithRefLinkType_StringList

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ithRefLinkType* interfaces

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ter/setter methods

Could not just declare in interface anticipated getter/setter methods (for zero copy-pasta) - that conflicted with the later generated getters/setter actual methods.

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…eriments

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…latmerge-dedup-squash

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
@jimklimov
Copy link
Contributor Author

Not yet the complete intended solution, but a large leap in that direction, which took way longer than I had anticipated (or that colleagues are content with).

Anyhow, recent dozen or so commits adds a way to "walk" BomEntities (including the Bom object which represents the whole document) to map the "bom-ref"s used to name things, and "ref"s or list items pointing back to these names and so entities. This in turn allows to consistently rename objects in the in-memory representation (which would be the way to ensure uniqueness where due and we decide something is not mergeable despite originally having same "bom-ref" names).

As a practical example, the cyclonedx-cli got a command to rename entities using this library call.

@jimklimov
Copy link
Contributor Author

jimklimov commented Oct 13, 2023

Missing DCO is on a revert-commit which is too long ago in history to rebase (had several merges from master since, unraveling the conflicts just to change a commit message via rebase does not seem worthwhile).

…...}

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…...}

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…some messages to formatting strings, wrap longer others

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…lue, as planned

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…tter/setter; avoid initializing to default value du-jour

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…dd pragma for Sonar)

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
… returns dictBackrefs)

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ivate

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…...}

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…lue, as planned - correct fix

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…erformance even if privatized with getter/setter

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
… in (unlikely) case it is null

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…re-initialized or not (and a definitive value wins over compiler default du-jour), because Sonar analyzer wants opposite things

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…WalkResult.bomRoot

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…tainers() wrap

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…BomWalkResult.bomRoot

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…ing info exposure

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
…BomWalkResult.debugPerformance

Signed-off-by: Jim Klimov <jimklimov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants