FlatMerge performance issue #287

Open
jwfx opened this issue Apr 26, 2024 · 16 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

jwfx commented Apr 26, 2024

As of 6.0.0, it looks like JSON serialization is used to establish equality for components. This seems to hurt merging performance greatly; in my case, about 70% of the time is spent in the serializer.

public bool Equals(Component obj)
{
    return Json.Serializer.Serialize(this) == Json.Serializer.Serialize(obj);
}

mtsfoni (Contributor) commented Apr 26, 2024

Do you call CycloneDXUtils.FlatMerge(Bom bom1, Bom bom2)?

I can have a look and maybe change something for version 7. Without knowing the code, I think we could either add an option not to call this, or do an equality check based on reflection, which should still be much faster than JSON serialization.
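
For illustration, a reflection-based deep comparison could look roughly like the sketch below. This is a hypothetical helper, not the library's code, and it glosses over cycles and other special cases:

// Sketch only: compares two objects by walking public readable properties.
using System.Collections;
using System.Reflection;

static class ReflectionEquality
{
    public static bool DeepEquals(object left, object right)
    {
        if (ReferenceEquals(left, right)) return true;
        if (left is null || right is null) return false;

        var type = left.GetType();
        if (type != right.GetType()) return false;

        // Value types (numbers, enums, DateTime, ...) and strings compare directly.
        if (type.IsValueType || left is string) return left.Equals(right);

        // Sequences compare element-wise.
        if (left is IEnumerable leftSeq && right is IEnumerable rightSeq)
        {
            var l = leftSeq.GetEnumerator();
            var r = rightSeq.GetEnumerator();
            while (true)
            {
                bool lNext = l.MoveNext(), rNext = r.MoveNext();
                if (lNext != rNext) return false;   // different lengths
                if (!lNext) return true;            // both exhausted, all equal
                if (!DeepEquals(l.Current, r.Current)) return false;
            }
        }

        // Everything else: recurse over public readable properties.
        foreach (var prop in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (prop.GetIndexParameters().Length > 0 || !prop.CanRead) continue;
            if (!DeepEquals(prop.GetValue(left), prop.GetValue(right))) return false;
        }
        return true;
    }
}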

@mtsfoni mtsfoni added the enhancement New feature or request label Apr 26, 2024
jwfx (Author) commented Apr 26, 2024

> Do you call CycloneDXUtils.FlatMerge(Bom bom1, Bom bom2)?

I call the FlatMerge(IEnumerable<Bom> boms) overload, but I think that is not relevant to the issue.

> I can have a look and maybe change something for version 7. Without knowing the code, I think we could either add an option not to call this, or do an equality check based on reflection, which should still be much faster than JSON serialization.

This merge request seems to try to address the issue; sadly, it is a bit bloated and looks stale:
https://github.com/CycloneDX/cyclonedx-dotnet-library/pull/245/files#diff-0569878561c448041435a7c88dcd816279a2889bcc8e18193a2484fda5c15002

The question really is what constitutes equality for a component.

Personally, I would not go down the reflection road, as it does not win any performance prizes either. Since the model is static, this seems like a good job for a code generator.
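
To illustrate the code-generator idea: a generator would emit plain member-wise equality for each model class, avoiding both JSON and reflection at runtime. The class below is a hypothetical, trimmed stand-in for CycloneDX.Models.Component; the real class has many more properties:

using System;

// What generated equality could look like (illustration only).
public sealed class MyComponent : IEquatable<MyComponent>
{
    public string Name { get; set; }
    public string Version { get; set; }
    public string Purl { get; set; }

    public bool Equals(MyComponent other) =>
        other is not null
        && Name == other.Name
        && Version == other.Version
        && Purl == other.Purl;

    public override bool Equals(object obj) => Equals(obj as MyComponent);

    public override int GetHashCode() => HashCode.Combine(Name, Version, Purl);
}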

@mtsfoni mtsfoni added the help wanted Extra attention is needed label Apr 26, 2024
mtsfoni (Contributor) commented Apr 26, 2024

I will have a look at the pull request.

Generally, I obviously want a solution that is quick to implement but requires little maintenance when the model changes in the future.
For a reflection-based deep comparison, I already have working code at hand.
Labour is the limited resource here.

But I'll mark this as help wanted; maybe somebody wants to build a code generator, as it is a rather "cool" thing to build.

andreas-hilti (Contributor) commented May 25, 2024

Isn't the problem here much more the way in which the merge is performed?
What I mean is that we first merge bom1 with bom2, then with bom3, and so on.

foreach (var bom in boms)
{
    result = FlatMerge(result, bom);
}

Let's focus only on the components.
This implies that we first merge components1 with components2, then with components3, and so on.
Each of these merges is performed using ListMergeHelper which merges 2 lists.
If we merge two lists with N_1 and N_2 entries, respectively, which are all distinct, because of
foreach (var item in list2)
{
    if (!(result.Contains(item)))
    {
        result.Add(item);
    }
}

we compute the serialization for each entry in list1 N_2 times, and for the first entry in list2 N_1 times, for the next N_1+1 times, etc., leading to N_1*N_2 + N_1 + (N_1+1) + ... + (N_1+N_2-1) = 2*N_1*N_2 + N_2*(N_2-1)/2 serialization calls.
The important bit is that this leads to a large number of serialization calls (quadratic complexity in terms of serialization calls).
And you are doing this repeatedly for all lists to be merged.

You can improve it in two steps:

Step 1:
In the ListMergeHelper, precompute the hashes of all entries in both lists (this requires only N_1+N_2 serialization calls). Then construct the merged list using only the hashes, except for hash collisions; in those cases, which should be rare, you can fall back to the full equality check.
This already improves the situation quite a bit (but doesn't solve everything). It mainly helps when you merge a few BOMs with large component lists.

Step 2:
Refactor the merge logic so that all BOMs are merged at once.
In particular, the list merge helper should then take an arbitrary number of lists.
It would then proceed analogously to step 1. This implies that serialize is called only once (!) on each component to be merged, apart from hash collisions (see the sketch below).
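
A sketch combining both steps might look like the following. MultiListMerge is a hypothetical helper, not the library's ListMergeHelper; it keys each element by a hash of its serialized form, computed exactly once, and falls back to comparing the serialized strings on hash collisions:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class MultiListMerge
{
    public static List<T> Merge<T>(IEnumerable<IEnumerable<T>> lists, Func<T, string> serialize)
    {
        var merged = new List<T>();
        // Bucket by hash; only equal hashes trigger a full comparison.
        var seen = new Dictionary<string, List<string>>();

        foreach (var list in lists)
        {
            foreach (var item in list)
            {
                var json = serialize(item); // exactly one call per element
                var key = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json)));

                if (seen.TryGetValue(key, out var bucket))
                {
                    if (bucket.Contains(json)) continue; // true duplicate: skip
                }
                else
                {
                    seen[key] = bucket = new List<string>();
                }
                bucket.Add(json); // new entry (or rare hash collision): keep it
                merged.Add(item);
            }
        }
        return merged;
    }
}

With this shape, the number of serialization calls grows linearly with the total number of components instead of quadratically.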

andreas-hilti (Contributor) commented

@jwfx Would #300 help in your case? (This is only step 1.) Can you post numbers before/after for your case?

andreas-hilti (Contributor) commented May 25, 2024

Based on this SBOM https://github.com/CycloneDX/bom-examples/blob/0979663521c4623792dc432d09f88bcb85862a62/SBOM/juice-shop/v11.1.2/bom.json
and this snippet:

int n = 10;
var resourceFilename = [...];
var jsonBom = File.ReadAllText(resourceFilename);
var boms = new List<Bom>();
for (int i = 0; i < n; i++) {
    boms.Add(Serializer.Deserialize(jsonBom));
}

var mergedSbom = CycloneDXUtils.FlatMerge(boms);
Assert.Equal(boms[0].Components.Count, mergedSbom.Components.Count);

it reduces the time for the merge on my computer from roughly 35 s to 1 s.
(Of course, this is an exotic case, as we have only duplicates, but it gives an indication.)

jwfx (Author) commented Jun 6, 2024

@andreas-hilti

Thanks for spending time on this, and sorry for the late response; I only managed to check it now.

For my current test case the improvement is roughly 7.5x (2709ms -> 360ms).

So it looks like this was already a significant step in the right direction. 👍

A quick profiler run on your PR showed that 3/4 of the time is still spent in the JSON serializer, so I guess for the final solution there is no way around implementing equality without the JSON detour.

It looks like you are already part of the merging discussion CycloneDX/specification#320, though I could not quite grasp whether the question of what constitutes equality between components has really been answered yet.

mtsfoni (Contributor) commented Jun 6, 2024

Looks like a good example of the 20:80 rule here.

If Andreas fixes the warnings, I think we can merge it and build a patch version.

> A quick profiler run on your PR showed that 3/4 of the time is still spent in the JSON serializer, so I guess for the final solution there is no way around implementing equality without the JSON detour.

Completely on board with this. It's a terrible solution anyway, and it sometimes causes problems when testing, too.

andreas-hilti (Contributor) commented

> It looks like you are already part of the merging discussion CycloneDX/specification#320, though I could not quite grasp whether the question of what constitutes equality between components has really been answered yet.

Yes, the outcome of this discussion is still open. In general, depending on the use case, there might be different notions of equivalent components. The current hash-based implementation can only handle "exact" equality. (However, as a starting point this is fine IMO.)

In addition, I'm not surprised that serialization is still the dominating part. I guess step 2 could give another speedup of the same order of magnitude.

andreas-hilti (Contributor) commented

@jwfx #306 would be an alternative. It improves the situation only if components have names (and name collisions are rare); however, I guess it improves performance significantly in at least 95% of all practical cases. Maybe you can rerun your case.
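
The idea behind it, in a hedged sketch (NameBucketMerge is a hypothetical helper, not #306's actual code): group components by name first, so the expensive serialization-based Equals only runs within a name bucket and is triggered only on name collisions.

using System.Collections.Generic;
using System.Linq;
using CycloneDX.Models;

static class NameBucketMerge
{
    public static List<Component> Merge(IEnumerable<Component> components)
    {
        var merged = new List<Component>();
        var byName = new Dictionary<string, List<Component>>();

        foreach (var component in components)
        {
            var name = component.Name ?? string.Empty;
            if (!byName.TryGetValue(name, out var bucket))
            {
                byName[name] = bucket = new List<Component>();
            }
            // Full (serialization-based) Equals only runs against same-name entries.
            if (bucket.Any(existing => existing.Equals(component)))
            {
                continue; // duplicate: skip
            }
            bucket.Add(component);
            merged.Add(component);
        }
        return merged;
    }
}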

jwfx (Author) commented Jun 16, 2024

@andreas-hilti

In my test scenario, all the components have names, and I just merge the same SBOM with itself multiple times.

edit:
Not fully true: there are multiple different source SBOMs with different names, and I merge the collection of them with each other multiple times.

improve_merge_performance_2:
3.3x (2709ms -> 820ms)

andreas-hilti (Contributor) commented

@jwfx Thanks. It surprises me a bit that the speed-up based on name equality is smaller, as serialization should only be triggered in case of name collisions, i.e. I would expect the JSON serialization to show up much less in the profiling. Do your SBOMs only have a large number of components, or are there other large lists of entities?
(In my test case it reduces the time to <1 s.)

jwfx (Author) commented Jun 16, 2024

55, 51, 87, 57, 82, 14, 40, 96, 60, 34, 66, 126, 140, 1, 47

These are the component counts in the 15 SBOMs I'm merging. To make the consumed time a bit more significant, I merge that list with itself 5 times.

Other than components and dependencies, there is pretty much nothing in there.

jwfx (Author) commented Jun 18, 2024

Another test with production data:

16 SBOMs with 379, 215, 230, 218, 255, 214, 214, 222, 216, 253, 226, 218, 426, 252, 270, 50 components merged into 528 unique components.

NuGet/Branch                       Time
fix/improve_merge_performance      00:00:00.3056746
fix/improve_merge_performance_2    00:00:01.4284269
CycloneDX.Utils 7.0.1              00:00:07.0907321

#300 looks like a good candidate for merging.

mtsfoni (Contributor) commented Jun 18, 2024

As I understand it, we could also merge both #300 and #306?

andreas-hilti (Contributor) commented Jun 18, 2024

> As I understand it, we could also merge both #300 and #306?

We could, but it would not really help in terms of performance (you'd get the performance of #300 for flat merges).
