Skip to content

J-Bax/CSharpBenchmarkExamples

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

C# Benchmarks

A collection of random benchmarks in C#.

1M Concurrent Tasks

Inspired by this hackernews post: https://news.ycombinator.com/item?id=36024209 The blog post: https://pkolaczk.github.io/memory-consumption-of-async/

The authors code had some inefficiencies in the C# version. I was curious whether there are big improvements possible without changing the intent of the benchmark.

The problem statement was thus: Let’s launch N concurrent tasks, where each task waits for 10 seconds and then the program exists after all tasks finish. The number of tasks is controlled by the command line argument.

Here is the original code:

List<Task> tasks = new List<Task>();
for (int i = 0; i < numTasks; i++)
{
    Task task = Task.Run(async () =>
    {
        await Task.Delay(TimeSpan.FromSeconds(10));
    });
    tasks.Add(task);
}
await Task.WhenAll(tasks);

Observation 1: Author was not passing an initial size to List.

  • This causes the List to be re-sized and copied each time the internal capacity is exhausted.
  • Causes a large amount of intermediate allocations and GC pressure.
  • Other languages tested have the same inefficiency but this is particularly unfair for managed/garbage collected languages.
  • Depending on how frequently GC can run, those intermediate buffers may stick around for awhile leading to higher peak memory utilization.
  • In an unmanaged language, those intermediate buffers are likely freed as soon as the new buffer is created and copied to
    • The peak memory utilization is likely lower compared to the managed languages.

Let's see what the impact on memory is by passing an initial capacity:

List<Task> tasks = new List<Task>(N);
for (int i = 0; i < N; i++)
{
    Task task = Task.Run(async () =>
    {
        await Task.Delay(TimeSpan.FromMilliseconds(delayMillisec));
    });
    tasks.Add(task);
}
await Task.WhenAll(tasks);
Method Mean Error StdDev Lock Contentions Gen0 Gen1 Gen2 Allocated Alloc Ratio
SetInitialCapacity_List 12.15 s 1.295 s 0.071 s 14070.0000 55000.0000 29000.0000 4000.0000 419.96 MB 0.98
Baseline 12.30 s 3.038 s 0.167 s 14416.0000 55000.0000 29000.0000 4000.0000 428.44 MB 1.00

Surprisingly the result is fairly negligible. The total allocated memory surprised me as well and was a lot higher than I expected. This means that the size of the array itself is negligle compared to the contents of the array. That means that the memory footprint of is pretty high.

  • An array of size 1M is going to use about ~8MB of memory (64-bit/8-byte references * 1M)
  • If the List re-size logic follows exponential growth (1 -> 2 -> 4 -> ... -> ~500K -> ~1M), then the combined size of all intermediate buffers is approximately 1M.
  • This roughly matches the difference we see ~9MB.

To confirm the theory, let's benchmark resizing a list without all of the task stuff.

  • This benchmark adds the same string to a list 1M times, first with an initial capacity on the list of 1M, and second without a capacity.
Method Mean Error StdDev Ratio Gen0 Gen1 Gen2 Allocated Alloc Ratio
SetInitialCapacity_List_String 5.135 ms 1.744 ms 0.0956 ms 0.61 265.6250 265.6250 265.6250 7.63 MB 0.48
Baseline_String 8.401 ms 10.838 ms 0.5940 ms 1.00 453.1250 437.5000 437.5000 16 MB 1.00

Conclusion: Minimal impact on memory usage

  • The lack of initial List capacity doesn't contribute much to the overall peak memory usage.
  • The Task objects themselves contribute significantly more memory.

Observation 2: Delay task is unnecessarily wrapped in Task.Run(...)

  • By default .NET will eagerly execute a Task when it is created.
    • When the first real async operation occurs, usually there is a yield operation that will yield back to the caller.
  • To immediately yield a new Task without running it, you can call Task.Run(...) which creates the Task and schedules it to the Threadpool
    • Typically this is needed for CPU-bound work to avoid blocking the caller.
    • For IO-bound async code, this is typically not necessary.

Task.Delay(...) will immediately yield to the caller, so the Task.Run(...) isn't giving us any benefit here.

Let's try the same benchmark without the Task.Run(...) call:

List<Task> tasks = new List<Task>();
for (int i = 0; i < N; i++)
{
    tasks.Add(Task.Delay(TimeSpan.FromMilliseconds(delayMillisec)));
}
await Task.WhenAll(tasks);
Method Mean Error StdDev Ratio Gen0 Completed Work Items Lock Contentions Gen1 Gen2 Allocated Alloc Ratio
AvoidingTaskRun 10.93 s 1.255 s 0.069 s 0.91 23000.0000 1000001.0000 7336.0000 13000.0000 3000.0000 183.86 MB 0.43
Baseline 12.05 s 0.978 s 0.054 s 1.00 55000.0000 2000046.0000 13132.0000 29000.0000 4000.0000 428.46 MB 1.00

Clearly the result is significant.

  • Memory is roughly halved if we get rid of the Task.
  • Less queueing delay in executing all the tasks with the benchmark completing in ~11s rather than ~12s as before.
    • A perfect result would be ~10s.

Conclusion: Significant impact on memory usage

Observation 3 (from kevingadd): What if a single Task.Delay was used instead of N?

There are two ways I thought of to test this. The first is to create one delay, but continue to use N different tasks. I put N continuations on the delay task to achieve this.

Task delayTask = Task.Delay(TimeSpan.FromMilliseconds(delayMillisec));
List <Task> tasks = new List<Task>();
for (int i = 0; i < N; i++)
{
    tasks.Add(delayTask.ContinueWith((Task _) => { }));
}
await Task.WhenAll(tasks);

The second is to create one delay, with one task and then await on it N times (IMO this doesn't quite match the original intent of "...N concurrent tasks, where each task waits for 10 seconds...") e.g:

Task delayTask = Task.Delay(TimeSpan.FromMilliseconds(delayMillisec));
List <Task> tasks = new List<Task>();
for (int i = 0; i < N; i++)
{
    tasks.Add(delayTask);
}
await Task.WhenAll(tasks);

Lets see what the impact is of both approaches:

Method Mean Error StdDev Ratio Completed Work Items Lock Contentions Gen0 Gen1 Gen2 Allocated Alloc Ratio
AvoidingTaskRun_SharingSameDelay_SameTask 10.03 s 0.074 s 0.004 s 0.83 1.0000 - 1000.0000 1000.0000 1000.0000 39.63 MB 0.09
AvoidingTaskRun_SharingSameDelay_DifferentTasks 10.73 s 1.583 s 0.087 s 0.89 1000001.0000 - 15000.0000 8000.0000 2000.0000 146.45 MB 0.34
Baseline 12.12 s 0.284 s 0.016 s 1.00 2000034.0000 14365.0000 55000.0000 29000.0000 4000.0000 428.38 MB 1.00

Conclusion: Significant impact on memory usage

  • If we allow ourselves to create 1 task and wait on in N times, the improvement is ridiculous, bringing us down to ~39MB or allocations.
  • If we do only a single delay and still create N tasks to wait on it, the improvement is still pretty good.

Observation 4: How do newer .NET runtimes do? .NET 7/.NET 8?

This requires adding a new job to the benchmark config. Here are the results:

Method Runtime Mean Error StdDev Ratio Gen0 Completed Work Items Lock Contentions Gen1 Gen2 Allocated Alloc Ratio
Baseline .NET 6.0 12.12 s 2.904 s 0.159 s 1.00 55000.0000 2000033.0000 14887.0000 29000.0000 4000.0000 428.41 MB 1.00
AvoidingTaskRun .NET 6.0 10.63 s 1.343 s 0.074 s 0.88 23000.0000 1000000.0000 4770.0000 13000.0000 3000.0000 183.85 MB 0.43
Baseline .NET 7.0 11.25 s 0.367 s 0.020 s 1.00 55000.0000 2000002.0000 10811.0000 54000.0000 4000.0000 428.21 MB 1.00
AvoidingTaskRun .NET 7.0 10.65 s 0.234 s 0.013 s 0.95 23000.0000 1000000.0000 4489.0000 22000.0000 3000.0000 183.85 MB 0.43
Baseline .NET 8.0 11.42 s 1.991 s 0.109 s 1.00 53000.0000 2000015.0000 9165.0000 52000.0000 4000.0000 405.34 MB 1.00
AvoidingTaskRun .NET 8.0 10.63 s 0.873 s 0.048 s 0.93 23000.0000 1000000.0000 3882.0000 22000.0000 3000.0000 176.22 MB 0.43

The improvement is minor. The memory usage has a negligible difference from .NET 6 to .NET 7. But look at those lock contentions. It's reduced by about 30% on the baseline example. That reflects improvements in the runtime itself.

Moving to .NET 8 improvements things even more. Allocations are improved a bit on both the baseline and optimized version (without Task.Run(...)). And again the number of lock contentions is reduced.

Conclusion: Minor impact on memory usage. Runtime itself greatly improved from .NET 6 -> .NET 7.

About

Experiments in running tasks

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages