FlatMerge performance issue #287

Open
jwfx opened this issue Apr 26, 2024 · 16 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

jwfx commented Apr 26, 2024

As of 6.0.0, it looks like JSON serialization is used to establish equality for components. This seems to hurt merging performance greatly; in my case, about 70% of the time is spent in the serializer.

public bool Equals(Component obj)
{
    return Json.Serializer.Serialize(this) == Json.Serializer.Serialize(obj);
}

mtsfoni (Contributor) commented Apr 26, 2024

Do you call CycloneDXUtils.FlatMerge(Bom bom1, Bom bom2)?

I can have a look and maybe change something for version 7. Without knowing the code, I think we could either add an option not to call this, or do an equality check based on reflection, which should still be much faster than JSON serialization.
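
For illustration, a reflection-based deep comparison could look roughly like the sketch below. This is a hypothetical helper, not the library's code, and it glosses over cycles and other special cases:

// Sketch only: compares two objects by walking public readable properties.
using System.Collections;
using System.Reflection;

static class ReflectionEquality
{
    public static bool DeepEquals(object left, object right)
    {
        if (ReferenceEquals(left, right)) return true;
        if (left is null || right is null) return false;

        var type = left.GetType();
        if (type != right.GetType()) return false;

        // Value types (numbers, enums, DateTime, ...) and strings compare directly.
        if (type.IsValueType || left is string) return left.Equals(right);

        // Sequences compare element-wise.
        if (left is IEnumerable leftSeq && right is IEnumerable rightSeq)
        {
            var l = leftSeq.GetEnumerator();
            var r = rightSeq.GetEnumerator();
            while (true)
            {
                bool lNext = l.MoveNext(), rNext = r.MoveNext();
                if (lNext != rNext) return false;   // different lengths
                if (!lNext) return true;            // both exhausted, all equal
                if (!DeepEquals(l.Current, r.Current)) return false;
            }
        }

        // Everything else: recurse over public readable properties.
        foreach (var prop in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (prop.GetIndexParameters().Length > 0 || !prop.CanRead) continue;
            if (!DeepEquals(prop.GetValue(left), prop.GetValue(right))) return false;
        }
        return true;
    }
}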

@mtsfoni mtsfoni added the enhancement New feature or request label Apr 26, 2024
jwfx (Author) commented Apr 26, 2024

> Do you call CycloneDXUtils.FlatMerge(Bom bom1, Bom bom2)?

I call the FlatMerge(IEnumerable<Bom> boms) overload, but I think that is not relevant to the issue.

> I can have a look and maybe change something for version 7. Without knowing the code, I think we could either add an option not to call this, or do an equality check based on reflection, which should still be much faster than JSON serialization.

This merge request seems to try to address the issue; sadly, it is a bit bloated and looks stale:
https://github.com/CycloneDX/cyclonedx-dotnet-library/pull/245/files#diff-0569878561c448041435a7c88dcd816279a2889bcc8e18193a2484fda5c15002

The question really is what constitutes equality for a component.

Personally, I would not go down the reflection road, as it does not win any performance prizes either. Since the model is static, this seems like a good job for a code generator.
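
To illustrate the code-generator idea: a generator would emit plain member-wise equality for each model class, avoiding both JSON and reflection at runtime. The class below is a hypothetical, trimmed stand-in for CycloneDX.Models.Component; the real class has many more properties:

using System;

// What generated equality could look like (illustration only).
public sealed class MyComponent : IEquatable<MyComponent>
{
    public string Name { get; set; }
    public string Version { get; set; }
    public string Purl { get; set; }

    public bool Equals(MyComponent other) =>
        other is not null
        && Name == other.Name
        && Version == other.Version
        && Purl == other.Purl;

    public override bool Equals(object obj) => Equals(obj as MyComponent);

    public override int GetHashCode() => HashCode.Combine(Name, Version, Purl);
}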

@mtsfoni mtsfoni added the help wanted Extra attention is needed label Apr 26, 2024
mtsfoni (Contributor) commented Apr 26, 2024

I will have a look at the pull request.

Generally, I obviously want a solution that is quick to implement but requires little maintenance when the model changes in the future.
For a reflection-based deep comparison, I already have working code at hand.
Labour is the limited resource here.

But I'll mark this as help wanted; maybe somebody wants to build a code generator, as it is a rather "cool" thing to build.

andreas-hilti (Contributor) commented May 25, 2024

Isn't the problem here much more the way in which the merge is performed?
What I mean is that we first merge bom1 with bom2, then with bom3, and so on.

foreach (var bom in boms)
{
    result = FlatMerge(result, bom);
}

Let's focus only on the components.
This implies that we first merge components1 with components2, then with components3, and so on.
Each of these merges is performed using ListMergeHelper which merges 2 lists.
If we merge two lists with N_1 and N_2 entries, respectively, which are all distinct, because of
foreach (var item in list2)
{
    if (!(result.Contains(item)))
    {
        result.Add(item);
    }
}

we compute the serialization for each entry in list1 N_2 times, and for the first entry in list2 N_1 times, for the next N_1+1 times, etc., leading to N_1*N_2 + N_1 + (N_1+1) + ... + (N_1+N_2-1) = 2*N_1*N_2 + N_2*(N_2-1)/2 serialization calls.
The important bit is that this leads to a large number of serialization calls (quadratic complexity in terms of serialization calls).
And you are doing this repeatedly for all lists to be merged.

You can improve it in two steps:

Step 1:
In the ListMergeHelper, precompute the hashes of all entries in both lists (this requires only N_1+N_2 serialization calls). Then construct the merged list using only the hashes, except for hash collisions; in those cases, which should be rare, you can fall back to the full equality check.
This already improves the situation quite a bit (but doesn't solve everything). It mainly helps when you merge a few BOMs with large component lists.

Step 2:
Refactor the merge logic so that all BOMs are merged at once.
In particular, the list merge helper should then take an arbitrary number of lists.
It would then proceed analogously to step 1. This implies that serialize is called only once (!) on each component to be merged, apart from hash collisions (see the sketch below).
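
A sketch combining both steps might look like the following. MultiListMerge is a hypothetical helper, not the library's ListMergeHelper; it keys each element by a hash of its serialized form, computed exactly once, and falls back to comparing the serialized strings on hash collisions:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class MultiListMerge
{
    public static List<T> Merge<T>(IEnumerable<IEnumerable<T>> lists, Func<T, string> serialize)
    {
        var merged = new List<T>();
        // Bucket by hash; only equal hashes trigger a full comparison.
        var seen = new Dictionary<string, List<string>>();

        foreach (var list in lists)
        {
            foreach (var item in list)
            {
                var json = serialize(item); // exactly one call per element
                var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json)));

                if (seen.TryGetValue(key, out var bucket))
                {
                    if (bucket.Contains(json)) continue; // true duplicate: skip
                }
                else
                {
                    seen[key] = bucket = new List<string>();
                }
                bucket.Add(json); // new entry (or rare hash collision): keep it
                merged.Add(item);
            }
        }
        return merged;
    }
}

With this shape, the number of serialization calls grows linearly with the total number of components instead of quadratically.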

andreas-hilti (Contributor) commented

@jwfx Would #300 help in your case? (This is only step 1.) Can you post numbers before/after for your case?

andreas-hilti (Contributor) commented May 25, 2024

Based on this SBOM https://github.com/CycloneDX/bom-examples/blob/0979663521c4623792dc432d09f88bcb85862a62/SBOM/juice-shop/v11.1.2/bom.json
and this snippet:

int n = 10;
var resourceFilename = [...];
var jsonBom = File.ReadAllText(resourceFilename);
var boms = new List<Bom>();
for (int i = 0; i < n; i++) {
    boms.Add(Serializer.Deserialize(jsonBom));
}

var mergedSbom = CycloneDXUtils.FlatMerge(boms);
Assert.Equal(boms[0].Components.Count, mergedSbom.Components.Count);

it reduces the time for the merge on my computer from roughly 35 s to 1 s.
(Of course, this is an exotic case, as we have only duplicates, but it gives an indication.)

jwfx (Author) commented Jun 6, 2024

@andreas-hilti

Thanks for spending time on this, and sorry for the late response; I only managed to check it now.

For my current test case the improvement is roughly 7.5x (2709ms -> 360ms).

So it looks like this was already a significant step in the right direction. 👍

A quick profiler run on your PR showed that 3/4 of the time is still spent in the JSON serializer, so I guess for the final solution there is no way around implementing equality without the JSON detour.

It looks like you are already part of the merging discussion CycloneDX/specification#320, though I could not quite grasp whether the question of what constitutes equality between components has really been answered yet.

mtsfoni (Contributor) commented Jun 6, 2024

Looks like a good example of the 20:80 rule here.

If Andreas fixes the warnings, I think we can merge it and build a patch version.

> A quick profiler run on your PR showed that 3/4 of the time is still spent in the JSON serializer, so I guess for the final solution there is no way around implementing equality without the JSON detour.

Completely on board with this. It's a terrible solution anyway, and it sometimes causes problems when testing, too.

andreas-hilti (Contributor) commented

> It looks like you are already part of the merging discussion CycloneDX/specification#320, though I could not quite grasp whether the question of what constitutes equality between components has really been answered yet.

Yes, the outcome of this discussion is still open. In general, depending on the use case, there might be different notions of equivalent components. The current hash-based implementation can only handle "exact" equality. (However, as a starting point this is fine IMO.)

In addition, I'm not surprised that serialization is still the dominating part. I guess step 2 could give another speedup of the same order of magnitude.

andreas-hilti (Contributor) commented

@jwfx #306 would be an alternative. It improves the situation only if components have names (and name collisions are rare); however, I guess it improves performance significantly in at least 95% of all practical cases. Maybe you can rerun your case.
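
The idea behind it, in a hedged sketch (NameBucketMerge is a hypothetical helper, not #306's actual code): group components by name first, so the expensive serialization-based Equals only runs within a name bucket and is triggered only on name collisions.

using System.Collections.Generic;
using System.Linq;
using CycloneDX.Models;

static class NameBucketMerge
{
    public static List<Component> Merge(IEnumerable<Component> components)
    {
        var merged = new List<Component>();
        var byName = new Dictionary<string, List<Component>>();

        foreach (var component in components)
        {
            var name = component.Name ?? string.Empty;
            if (!byName.TryGetValue(name, out var bucket))
            {
                byName[name] = bucket = new List<Component>();
            }
            // Full (serialization-based) Equals only runs against same-name entries.
            if (bucket.Any(existing => existing.Equals(component)))
            {
                continue; // duplicate: skip
            }
            bucket.Add(component);
            merged.Add(component);
        }
        return merged;
    }
}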

jwfx (Author) commented Jun 16, 2024

@andreas-hilti

In my test scenario, all the components have names, and I just merge the same SBOM with itself multiple times.

edit:
Not fully true: there are multiple different source SBOMs with different names, and I merge the collection of them with each other multiple times.

improve_merge_performance_2:
3.3x (2709ms -> 820ms)

andreas-hilti (Contributor) commented

@jwfx Thanks. It surprises me a bit that the speed-up based on name equality is smaller, as serialization should only be triggered in case of name collisions, i.e. I would expect the JSON serialization to show up much less in the profiling. Do your SBOMs only have a large number of components, or are there other large lists of entities?
(In my test case it reduces the time to <1 s.)

jwfx (Author) commented Jun 16, 2024

55, 51, 87, 57, 82, 14, 40, 96, 60, 34, 66, 126, 140, 1, 47

These are the component counts in the 15 SBOMs I'm merging. To make the consumed time a bit more significant, I merge that list with itself 5 times.

Other than components and dependencies, there is pretty much nothing in there.

jwfx (Author) commented Jun 18, 2024

Another test with production data:

16 SBOMs with 379, 215, 230, 218, 255, 214, 214, 222, 216, 253, 226, 218, 426, 252, 270, 50 components merged into 528 unique components.

NuGet/Branch                       Time
fix/improve_merge_performance      00:00:00.3056746
fix/improve_merge_performance_2    00:00:01.4284269
CycloneDX.Utils 7.0.1              00:00:07.0907321

#300 looks like a good candidate for merging.

mtsfoni (Contributor) commented Jun 18, 2024

As I understand it, we could also merge both #300 and #306?

andreas-hilti (Contributor) commented Jun 18, 2024

> As I understand it, we could also merge both #300 and #306?

We could, but it would not really help in terms of performance (you'd get the performance of #300 for flat merges).
