Enumerable.Chunk() could reduce memory allocation without reducing its performance #115079

@willdean

Description

[ In all the following, I'm ignoring arrays, because they're already special-cased ]

Enumerable.Chunk() uses a "list-like" exponential-growth approach to arriving at the size of the buffer it needs to hold a chunk. This optimises the case where the chunk size is much larger than the number of entries in the collection. But when the collection contains at least one whole chunk, there will have been a series of "wasted" reallocations as the first chunk buffer is repeatedly found to be too small and resized.

For example, a chunk size of 512 takes 8 allocations/resizes to arrive at its full size (4, 8, 16, 32, 64, 128, 256, 512), allocating almost twice as much space for the first chunk as it really needed and doing 8x more allocations than may have been necessary.
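To make the waste concrete, here is a small throwaway program (illustrative only) that mimics the growth policy described above - start at 4, double, cap at the chunk size - and totals the elements allocated along the way for a chunk size of 512:

```csharp
using System;

class GrowthWaste
{
    static void Main()
    {
        int size = 512;
        int total = 0;

        // Reproduce the first-chunk growth sequence: 4, 8, 16, ..., size.
        for (int len = Math.Min(size, 4); ; len = (int)Math.Min((uint)size, 2 * (uint)len))
        {
            total += len;
            Console.WriteLine(len);
            if (len == size) break;
        }

        // Prints 1020: almost twice the 512 elements the chunk actually needs.
        Console.WriteLine($"Elements allocated in total: {total}");
    }
}
```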

Whether this matters, and whether it can be improved, depends on the shape of the collection:

  • For collections with a small number of chunks, the 'wasted' allocations for the first chunk can be as numerous as, or more numerous than, the genuine ones.
  • For collections with many chunks, we don't care, because the 'waste' is a small proportion of the genuine allocations.
  • For uncountable collections, we don't know the size up front, so we're stuck with discovering it incrementally.
  • But for countable collections, we could trivially avoid the waste by knowing up front how big the first chunk needs to be.

I have read #67132 and it's clear to me that nobody expects a great deal from Chunk() in the way of performance - it's allocation-heavy and that's that. But that issue does leave the door open in #67132 (comment) for "little tweaks" to make it better, and in the spirit of small tweakery, I propose changing the setup of the first chunk buffer in EnumerableChunkIterator from

```csharp
int arraySize = Math.Min(size, 4);
```

to

```csharp
int arraySize = Math.Min(size, source.TryGetNonEnumeratedCount(out int sourceCount) ? sourceCount : 4);
```
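To show where the change would sit, here is a minimal, self-contained re-implementation of the iterator with the proposed line in place. This is a sketch only - the real EnumerableChunkIterator has more structure (e.g. a separate path once the buffer reaches full size) - so treat the details as illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ChunkSketch
{
    public static IEnumerable<TSource[]> ChunkProposed<TSource>(
        this IEnumerable<TSource> source, int size)
    {
        using IEnumerator<TSource> e = source.GetEnumerator();

        if (!e.MoveNext())
        {
            yield break;
        }

        // Proposed change: size the first buffer exactly when the source can
        // report its count in O(1); otherwise start at 4 as today.
        int arraySize = Math.Min(size,
            source.TryGetNonEnumeratedCount(out int sourceCount) ? sourceCount : 4);

        do
        {
            var array = new TSource[arraySize];
            array[0] = e.Current;
            int i = 1;

            // First chunk only (when arraySize < size): grow the buffer
            // list-style until it reaches the requested chunk size.
            for (; i < size && e.MoveNext(); i++)
            {
                if (i >= array.Length)
                {
                    arraySize = (int)Math.Min((uint)size, 2 * (uint)array.Length);
                    Array.Resize(ref array, arraySize);
                }
                array[i] = e.Current;
            }

            if (i != array.Length)
            {
                Array.Resize(ref array, i); // trim a short final chunk
            }

            yield return array;
        }
        while (e.MoveNext());
    }
}
```

For a countable source, arraySize starts at min(size, sourceCount), so the growth loop never has to resize; for uncountable sources the behaviour is exactly as today.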

I think, from the history, that there was a time when calling TryGetNonEnumeratedCount might have seemed too expensive, but Chunk() now already performs a type check for the array special case, so it seems that people are now OK with doing low-cost O(1) up-front checks if they help performance.

I tend to assume that TryGetNonEnumeratedCount is not expensive (unless a particular IEnumerable<T> misbehaves), and I have not been able to measure any deleterious effect of adding it in any vaguely realistic Chunk() benchmark - I simply can't get it out of the BenchmarkDotNet noise floor - see Data below.
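As a quick illustration of that assumption, TryGetNonEnumeratedCount answers in O(1) for collection types and simply returns false for a plain iterator (a minimal example; Generate is just a stand-in source):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CountCheck
{
    static IEnumerable<int> Generate(int n)
    {
        for (int i = 0; i < n; i++) yield return i;
    }

    static void Main()
    {
        // Collection types report their count without enumeration...
        List<int> list = Generate(1500).ToList();
        Console.WriteLine(list.TryGetNonEnumeratedCount(out int count)); // True (count == 1500)

        // ...while a plain iterator cannot, so such a source would keep
        // the existing 4-element starting buffer.
        Console.WriteLine(Generate(1500).TryGetNonEnumeratedCount(out count)); // False
    }
}
```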

Configuration

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
Intel Core Ultra 9 185H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.203
[Host] : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2

Regression?

No

Data

| Method             | CollectionSize | ChunkSize | Mean         | Gen0   | Allocated |
|--------------------|---------------:|----------:|-------------:|-------:|----------:|
| Enumerable         | 10             | 5         | 99.53 ns     | 0.0039 | 208 B     |
| FastEnumerable     | 10             | 5         | 99.49 ns     | 0.0039 | 208 B     |
| ListCollection     | 10             | 5         | 97.82 ns     | 0.0039 | 208 B     |
| FastListCollection | 10             | 5         | 93.75 ns     | 0.0033 | 176 B     |
| Enumerable         | 10             | 100       | 112.94 ns    | 0.0048 | 256 B     |
| FastEnumerable     | 10             | 100       | 105.72 ns    | 0.0049 | 256 B     |
| ListCollection     | 10             | 100       | 108.35 ns    | 0.0049 | 256 B     |
| FastListCollection | 10             | 100       | 79.46 ns     | 0.0029 | 152 B     |
| Enumerable         | 10             | 512       | 110.42 ns    | 0.0049 | 256 B     |
| FastEnumerable     | 10             | 512       | 104.97 ns    | 0.0049 | 256 B     |
| ListCollection     | 10             | 512       | 105.93 ns    | 0.0049 | 256 B     |
| FastListCollection | 10             | 512       | 76.74 ns     | 0.0029 | 152 B     |
| Enumerable         | 1500           | 5         | 6,419.44 ns  | 0.1831 | 9744 B    |
| FastEnumerable     | 1500           | 5         | 6,486.67 ns  | 0.1831 | 9744 B    |
| ListCollection     | 1500           | 5         | 6,955.54 ns  | 0.1831 | 9744 B    |
| FastListCollection | 1500           | 5         | 6,829.16 ns  | 0.1831 | 9712 B    |
| Enumerable         | 1500           | 100       | 3,462.08 ns  | 0.0420 | 2280 B    |
| FastEnumerable     | 1500           | 100       | 3,586.67 ns  | 0.0381 | 2280 B    |
| ListCollection     | 1500           | 100       | 3,806.44 ns  | 0.0420 | 2280 B    |
| FastListCollection | 1500           | 100       | 3,726.25 ns  | 0.0381 | 2032 B    |
| Enumerable         | 1500           | 512       | 3,593.68 ns  | 0.0534 | 2904 B    |
| FastEnumerable     | 1500           | 512       | 3,350.55 ns  | 0.0534 | 2904 B    |
| ListCollection     | 1500           | 512       | 3,888.40 ns  | 0.0534 | 2904 B    |
| FastListCollection | 1500           | 512       | 3,560.02 ns  | 0.0420 | 2224 B    |
| Enumerable         | 15000          | 5         | 65,909.87 ns | 1.8311 | 96144 B   |
| FastEnumerable     | 15000          | 5         | 65,062.50 ns | 1.8311 | 96144 B   |
| ListCollection     | 15000          | 5         | 65,179.16 ns | 1.8311 | 96144 B   |
| FastListCollection | 15000          | 5         | 62,964.80 ns | 1.8311 | 96112 B   |
| Enumerable         | 15000          | 100       | 32,222.36 ns | 0.3662 | 19560 B   |
| FastEnumerable     | 15000          | 100       | 32,730.25 ns | 0.3662 | 19560 B   |
| ListCollection     | 15000          | 100       | 32,583.27 ns | 0.3662 | 19560 B   |
| FastListCollection | 15000          | 100       | 33,289.45 ns | 0.9155 | 19312 B   |
| Enumerable         | 15000          | 512       | 31,407.68 ns | 0.3052 | 17048 B   |
| FastEnumerable     | 15000          | 512       | 30,778.50 ns | 0.3052 | 17048 B   |
| ListCollection     | 15000          | 512       | 31,196.14 ns | 0.3052 | 17048 B   |
| FastListCollection | 15000          | 512       | 30,826.80 ns | 0.3052 | 16368 B   |

  • Enumerable and ListCollection run the current framework source code, compiled into my test project.
  • FastEnumerable and FastListCollection run a copy of the same code with the modification suggested above.
  • Enumerable directly uses an iterator method that yields bytes.
  • ListCollection is the same iterator pushed through ToList() before the benchmark.
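For context, the benchmarks were shaped roughly like the following sketch (illustrative, not the exact harness; the Fast* variants differ only in calling the locally modified copy of the Chunk iterator):

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class ChunkBenchmarks
{
    [Params(10, 1500, 15000)]
    public int CollectionSize;

    [Params(5, 100, 512)]
    public int ChunkSize;

    private List<byte> _list = null!;

    // An iterator that yields bytes; it is countable only after ToList().
    private IEnumerable<byte> Generate()
    {
        for (int i = 0; i < CollectionSize; i++)
        {
            yield return (byte)i;
        }
    }

    [GlobalSetup]
    public void Setup() => _list = Generate().ToList();

    [Benchmark]
    public int Enumerable()
    {
        int total = 0;
        foreach (byte[] chunk in Generate().Chunk(ChunkSize)) total += chunk.Length;
        return total;
    }

    [Benchmark]
    public int ListCollection()
    {
        int total = 0;
        foreach (byte[] chunk in _list.Chunk(ChunkSize)) total += chunk.Length;
        return total;
    }

    // FastEnumerable / FastListCollection would be identical except for
    // calling the modified copy of the iterator instead of the framework's.
}
```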

Analysis

Everyone has their own idea about how Chunk() is actually used, and I don't know how many real-world scenarios this change would actually improve, but it seems a low-cost way of avoiding a few allocations: the CPU difference is apparently negligible, while the number of allocations is definitely reduced for countable collections (primarily IList<> implementations, I would guess).

Of course, one could add another fast path for IList<>, but that would add considerably more complexity and regression risk than this proposal.
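For comparison, such a fast path might look like the hypothetical sketch below (not proposed here), which copies chunks by index instead of driving an enumerator; matching the enumerator path's exact semantics (e.g. if the list mutates mid-enumeration) is where the extra regression risk would come from:

```csharp
using System;
using System.Collections.Generic;

static class ListChunkSketch
{
    // Hypothetical IList<T> fast path, for illustration only: chunks are
    // copied by index, so the enumerator and the growth loop disappear,
    // but so do their exact semantics.
    public static IEnumerable<TSource[]> ChunkByIndex<TSource>(IList<TSource> list, int size)
    {
        for (int start = 0; start < list.Count; start += size)
        {
            var chunk = new TSource[Math.Min(size, list.Count - start)];
            for (int i = 0; i < chunk.Length; i++)
            {
                chunk[i] = list[start + i];
            }

            yield return chunk;
        }
    }
}
```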
