Description
[ In all the following, I'm ignoring arrays, because they're already special-cased ]
`Enumerable.Chunk()` uses a "list-like" exponential-growth approach to arriving at the size of the buffer it needs to hold a chunk. This optimises the case where the chunk size is much larger than the number of entries in the collection. But in cases where there is at least one whole chunk, there will have been a series of "wasted" reallocations as the first chunk buffer is repeatedly found to be too small and resized.
For example, a chunk size of 512 takes 8 allocations/resizes to arrive at its full size (4, 8, 16, 32, 64, 128, 256, 512), allocating almost twice as much space for the first chunk as it really needed, and doing 8x more allocations than may have been necessary.
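To make the waste concrete, here's a small standalone sketch (illustrative only, not the framework code) of that doubling pattern for a chunk size of 512:

```csharp
// Simulates the buffer-growth pattern described above: start at
// min(chunkSize, 4) and double until the buffer reaches chunkSize.
int chunkSize = 512;
int totalAllocated = 0;
int allocations = 0;
for (int size = Math.Min(chunkSize, 4); ; size = Math.Min(chunkSize, size * 2))
{
    allocations++;
    totalAllocated += size;   // each resize allocates a fresh array
    if (size == chunkSize) break;
}
Console.WriteLine($"{allocations} allocations, {totalAllocated} elements");
// 8 allocations, 1020 elements -- nearly 2x the 512 actually needed
```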
Whether or not this matters, or can be improved, depends:
- For collections with a small number of chunks, the 'wasted' allocations of the first chunk can be as numerous as, or more numerous than, the genuine ones.
- In collections with many chunks, we don't care, because the 'waste' is a small proportion of the genuine allocations.
- In uncountable collections, we don't know the size up front, so we're stuck with discovering it incrementally.
- But in collections which are countable, we could trivially avoid the waste by knowing up front how big the first chunk needs to be.
I have read #67132 and it's clear to me that nobody expects a great deal from `Chunk()` in the way of performance - it's allocation-heavy and that's that. But that issue does leave the door open in #67132 (comment) for "little tweaks" to make it better, and in the spirit of small tweakery, I propose changing the setup of the first chunk buffer in `EnumerableChunkIterator` from

```csharp
int arraySize = Math.Min(size, 4);
```

to

```csharp
int arraySize = Math.Min(size, source.TryGetNonEnumeratedCount(out int sourceCount) ? sourceCount : 4);
```
I think from the history that there was a time when calling `TryGetNonEnumeratedCount` might have seemed too expensive, but there is now another type-check looking for the array special case, so it seems that people are now OK with doing low-cost O(1) up-front checks if they help performance. I tend to assume that `TryGetNonEnumeratedCount` is not expensive (unless a particular `IEnumerable<T>` misbehaves), and have not been able to measure any deleterious effect of adding it to a vaguely realistic `Chunk()` benchmark - I simply can't get it out of the BenchmarkDotNet noise floor - see Data below.
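For illustration, a minimal sketch of why the added check is cheap: `TryGetNonEnumeratedCount` succeeds in O(1) for sources that already know their count, and fails without enumerating for lazy ones, so the fallback of 4 is preserved in the uncountable case:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A lazy source: no count is available without enumerating.
static IEnumerable<int> Lazy() { yield return 1; yield return 2; }

var list = new List<int> { 1, 2, 3 };

// O(1) success for a counted collection:
Console.WriteLine(list.TryGetNonEnumeratedCount(out int c1)); // True, c1 == 3

// Fails (without enumerating) for the lazy iterator:
Console.WriteLine(Lazy().TryGetNonEnumeratedCount(out int c2)); // False
```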
Configuration

```
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
Intel Core Ultra 9 185H, 1 CPU, 22 logical and 16 physical cores
.NET SDK 9.0.203
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
```
Regression?
No
Data
Method | CollectionSize | ChunkSize | Mean | Gen0 | Allocated |
---|---|---|---|---|---|
Enumerable | 10 | 5 | 99.53 ns | 0.0039 | 208 B |
FastEnumerable | 10 | 5 | 99.49 ns | 0.0039 | 208 B |
ListCollection | 10 | 5 | 97.82 ns | 0.0039 | 208 B |
FastListCollection | 10 | 5 | 93.75 ns | 0.0033 | 176 B |
Enumerable | 10 | 100 | 112.94 ns | 0.0048 | 256 B |
FastEnumerable | 10 | 100 | 105.72 ns | 0.0049 | 256 B |
ListCollection | 10 | 100 | 108.35 ns | 0.0049 | 256 B |
FastListCollection | 10 | 100 | 79.46 ns | 0.0029 | 152 B |
Enumerable | 10 | 512 | 110.42 ns | 0.0049 | 256 B |
FastEnumerable | 10 | 512 | 104.97 ns | 0.0049 | 256 B |
ListCollection | 10 | 512 | 105.93 ns | 0.0049 | 256 B |
FastListCollection | 10 | 512 | 76.74 ns | 0.0029 | 152 B |
Enumerable | 1500 | 5 | 6,419.44 ns | 0.1831 | 9744 B |
FastEnumerable | 1500 | 5 | 6,486.67 ns | 0.1831 | 9744 B |
ListCollection | 1500 | 5 | 6,955.54 ns | 0.1831 | 9744 B |
FastListCollection | 1500 | 5 | 6,829.16 ns | 0.1831 | 9712 B |
Enumerable | 1500 | 100 | 3,462.08 ns | 0.0420 | 2280 B |
FastEnumerable | 1500 | 100 | 3,586.67 ns | 0.0381 | 2280 B |
ListCollection | 1500 | 100 | 3,806.44 ns | 0.0420 | 2280 B |
FastListCollection | 1500 | 100 | 3,726.25 ns | 0.0381 | 2032 B |
Enumerable | 1500 | 512 | 3,593.68 ns | 0.0534 | 2904 B |
FastEnumerable | 1500 | 512 | 3,350.55 ns | 0.0534 | 2904 B |
ListCollection | 1500 | 512 | 3,888.40 ns | 0.0534 | 2904 B |
FastListCollection | 1500 | 512 | 3,560.02 ns | 0.0420 | 2224 B |
Enumerable | 15000 | 5 | 65,909.87 ns | 1.8311 | 96144 B |
FastEnumerable | 15000 | 5 | 65,062.50 ns | 1.8311 | 96144 B |
ListCollection | 15000 | 5 | 65,179.16 ns | 1.8311 | 96144 B |
FastListCollection | 15000 | 5 | 62,964.80 ns | 1.8311 | 96112 B |
Enumerable | 15000 | 100 | 32,222.36 ns | 0.3662 | 19560 B |
FastEnumerable | 15000 | 100 | 32,730.25 ns | 0.3662 | 19560 B |
ListCollection | 15000 | 100 | 32,583.27 ns | 0.3662 | 19560 B |
FastListCollection | 15000 | 100 | 33,289.45 ns | 0.9155 | 19312 B |
Enumerable | 15000 | 512 | 31,407.68 ns | 0.3052 | 17048 B |
FastEnumerable | 15000 | 512 | 30,778.50 ns | 0.3052 | 17048 B |
ListCollection | 15000 | 512 | 31,196.14 ns | 0.3052 | 17048 B |
FastListCollection | 15000 | 512 | 30,826.80 ns | 0.3052 | 16368 B |
- `Enumerable` and `ListCollection` are running the current framework source code, but compiled in my test project.
- `FastEnumerable` and `FastListCollection` are a copy of the same code, but with the modification suggested above.
- `Enumerable` directly uses an iterator method that yields bytes; `ListCollection` is the same iterator pushed through `ToList()` before the benchmark.
Analysis
Everyone has their own idea about how `Chunk()` is actually used, and I don't know how many real-world scenarios this change actually improves, but it seems a low-cost way of avoiding a few allocations - the CPU difference is apparently negligible, but the number of allocations is definitely reduced for countable collections (primarily `IList<>`, I would have guessed).
Of course, one could add another fast path for `IList<>`, but that would add considerably more complexity and risk of regression than this proposal.
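For comparison, a hypothetical sketch of what such an `IList<>` fast path might look like - each chunk allocated at exactly the right size and filled by index, with no enumerator and no resizing. This is shown only to illustrate the extra surface area involved; it is not part of this proposal, and it glosses over details the real iterator must handle (e.g. the list mutating mid-enumeration):

```csharp
// Hypothetical IList<T> fast path, for illustration only.
static IEnumerable<T[]> ChunkList<T>(IList<T> source, int size)
{
    for (int start = 0; start < source.Count; start += size)
    {
        // Exact-size allocation: the final chunk may be short.
        var chunk = new T[Math.Min(size, source.Count - start)];
        for (int i = 0; i < chunk.Length; i++)
            chunk[i] = source[start + i];
        yield return chunk;
    }
}
```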