Improve binary endpoint memory usage #189

Daniel-Svensson · 2019-09-27T11:38:46Z

Use the BufferManager to serialize messages into pooled memory.
This should drastically reduce memory usage compared to serializing into a MemoryStream.
For larger messages, LOH pressure and the number of Gen2 GCs should also be reduced.
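
A minimal sketch of the idea, not the code from this pull request: a write-only stream whose backing array is rented from a pool instead of being allocated on each growth step. `PooledWriteStream` is a hypothetical name, and `ArrayPool<byte>` is used as a stand-in for the WCF `BufferManager` so the example only depends on the base class library.

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical sketch: ArrayPool<byte> stands in for the WCF BufferManager.
public sealed class PooledWriteStream : Stream
{
    private readonly ArrayPool<byte> _pool;
    private byte[] _buffer;
    private int _length;

    public PooledWriteStream(ArrayPool<byte> pool, int initialSize)
    {
        _pool = pool;
        _buffer = pool.Rent(initialSize);
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _length;

    public override long Position
    {
        get => _length;
        set => throw new NotSupportedException();
    }

    public override void Write(byte[] source, int offset, int count)
    {
        if (count == 0)
            return;

        EnsureCapacity(_length + count);
        Buffer.BlockCopy(source, offset, _buffer, _length, count);
        _length += count;
    }

    // Grow by renting a larger array, copying the written bytes across,
    // and handing the old array back to the pool for reuse.
    private void EnsureCapacity(int required)
    {
        if (required <= _buffer.Length)
            return;

        byte[] larger = _pool.Rent(Math.Max(required, _buffer.Length * 2));
        Buffer.BlockCopy(_buffer, 0, larger, 0, _length);
        _pool.Return(_buffer);
        _buffer = larger;
    }

    // Copies the written bytes out; an encoder would more likely hand the
    // pooled array on directly to avoid this extra allocation.
    public byte[] ToArray()
    {
        var result = new byte[_length];
        Buffer.BlockCopy(_buffer, 0, result, 0, _length);
        return result;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing && _buffer != null)
        {
            _pool.Return(_buffer);
            _buffer = null;
        }
        base.Dispose(disposing);
    }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

With `ArrayPool<byte>.Shared` as the pool, repeated serializations can reuse the same arrays, which is the effect the description expects from the BufferManager.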

Fixes #177

Add MemoryStream replacement
Stress test
Benchmark
Unit tests (Write based on current "draft" and maybe add some more)

Benchmarks

Before

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
  [Host]    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3815.0
  MediumRun : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3815.0

Job=MediumRun  IterationCount=15  LaunchCount=2  
WarmupCount=10

Method	NumEntities	DomainClient	Mean	Error	StdDev	Median	Gen 0	Gen 1	Gen 2	Allocated
GetCititesUniqueContext	10	WcfBinary	1.453 ms	0.4366 ms	0.6535 ms	0.9640 ms	21.4844	-	-	71.38 KB
GetCititesReuseContext	10	WcfBinary	1.124 ms	0.3025 ms	0.4527 ms	0.9710 ms	19.5313	-	-	62.57 KB
GetCititesUniqueContext	100	WcfBinary	1.933 ms	0.4316 ms	0.6190 ms	1.6856 ms	48.8281	15.6250	-	191.3 KB
GetCititesReuseContext	100	WcfBinary	2.406 ms	0.5691 ms	0.8519 ms	1.9335 ms	46.8750	9.7656	-	163.89 KB
GetCititesUniqueContext	1000	WcfBinary	11.147 ms	1.9772 ms	2.8981 ms	12.3418 ms	367.1875	195.3125	39.0625	1318.2 KB
GetCititesReuseContext	1000	WcfBinary	7.322 ms	1.8208 ms	2.6690 ms	5.6060 ms	273.4375	117.1875	39.0625	1049.08 KB

Final with BinaryWriter

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4018.0
  DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4018.0

Method	NumEntities	DomainClient	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
GetCititesUniqueContext	10	WcfBinary	876.0 us	16.72 us	19.26 us	20.5078	0.9766	-	65.01 KB
GetCititesReuseContext	10	WcfBinary	848.6 us	16.74 us	38.79 us	15.6250	-	-	56.12 KB
GetCititesUniqueContext	100	WcfBinary	1,559.3 us	29.18 us	31.23 us	42.9688	11.7188	-	165.75 KB
GetCititesReuseContext	100	WcfBinary	1,444.0 us	28.37 us	45.00 us	37.1094	5.8594	-	124.98 KB
GetCititesUniqueContext	1000	WcfBinary	6,169.1 us	1,339.30 us	1,187.25 us	343.7500	171.8750	-	1063.03 KB
GetCititesReuseContext	1000	WcfBinary	5,357.5 us	794.58 us	704.37 us	218.7500	93.7500	-	793.62 KB

Only buffermanager stream

These benchmarks are

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4010.0
  DefaultJob : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4010.0

Method	NumEntities	DomainClient	Mean	Error	StdDev	Median	Gen 0	Gen 1	Gen 2	Allocated
GetCititesUniqueContext	10	WcfBinary	935.1 us	18.51 us	37.802 us	930.9 us	21.4844	0.9766	-	67.74 KB
GetCititesReuseContext	10	WcfBinary	866.5 us	17.19 us	39.843 us	859.4 us	18.5547	-	-	58.85 KB
GetCititesUniqueContext	100	WcfBinary	1,640.7 us	10.18 us	9.518 us	1,640.3 us	42.9688	13.6719	-	168.43 KB
GetCititesReuseContext	100	WcfBinary	1,584.1 us	145.51 us	199.172 us	1,545.4 us	39.0625	7.8125	-	136.03 KB
GetCititesUniqueContext	1000	WcfBinary	8,554.4 us	966.66 us	2,850.210 us	6,184.7 us	343.7500	171.8750	-	1066.23 KB
GetCititesReuseContext	1000	WcfBinary	7,328.3 us	839.02 us	2,473.874 us	5,249.5 us	234.3750	101.5625	-	797.26 KB

Daniel-Svensson · 2019-10-02T07:15:15Z

Add a note about looking into the approach used by the binary encoding, which predicts the message size. That should allow us to skip memory copies and further improve performance.

* By using BufferManager, memory pressure and LOH allocations should decrease substantially

Daniel-Svensson · 2019-10-03T11:24:17Z

...ervices.Hosting/Framework/Services/MessageEncoders/PoxBinaryMessageEncodingBindingElement.cs

+                /// that should allow us to skip memory copies and further improve performance.
+                /// 
+                /// We should be able to pool both the stream and the binary writer together with size data
+                using (var stream = new BufferManagerStream(bufferManager, messageOffset, minAllocationSize: 2 * 1024, maxAllocationSize: maxMessageSize))


The current min and max sizes are arbitrary.

One could consider a fixed max size of 64 KB, since that is just below the LOH threshold (85,000 bytes).
But since the memory should be pooled anyway, the arrays will probably end up in Gen2 when everything works as expected.

It might make sense to cap the block size at some reasonably large value, somewhere between 128 KB (the first size that lands in the LOH) and 1 MB, to keep the number of buffer sizes in use bounded and to reduce possible LOH fragmentation.
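
One possible reading of this sizing idea, sketched under assumptions: sizes double from the 2 KB minimum used in the diff, and beyond an assumed 1 MB cap they grow in multiples of the cap, so only a bounded set of sizes is ever requested. `BufferSizing`, `NextSize`, and `LargeBlockCap` are hypothetical names and are not taken from the pull request.

```csharp
using System;

// Hypothetical sketch of a bounded buffer-size heuristic.
public static class BufferSizing
{
    // Same minimum as the minAllocationSize argument in the diff.
    public const int MinAllocationSize = 2 * 1024;

    // Assumed cap, picked from the suggested 128 KB to 1 MB range.
    public const int LargeBlockCap = 1024 * 1024;

    public static int NextSize(int requiredSize, int maxMessageSize)
    {
        if (requiredSize > maxMessageSize)
            throw new ArgumentOutOfRangeException(nameof(requiredSize));

        // Powers of two up to the cap keep the number of distinct sizes small.
        long size = MinAllocationSize;
        while (size < requiredSize && size < LargeBlockCap)
            size *= 2;

        // Beyond the cap, round up to a whole multiple of the cap.
        if (size < requiredSize)
            size = ((long)requiredSize + LargeBlockCap - 1) / LargeBlockCap * LargeBlockCap;

        // Never request more than the configured maximum message size.
        return (int)Math.Min(size, maxMessageSize);
    }
}
```

For example, with a 4 MB maximum message size, a request for 70,000 bytes is rounded up to 128 KB, the first power of two that is large enough.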

src/OpenRiaServices.DomainServices.Hosting/Test/Data/BufferManagerStreamTests.cs

Daniel-Svensson · 2019-10-03T12:13:01Z

...ervices.Hosting/Framework/Services/MessageEncoders/PoxBinaryMessageEncodingBindingElement.cs

+                    if (count == 0)
+                        return;
+
+                    if (Is64BitProcess)


On .NET Framework x64, Buffer.BlockCopy is faster than Buffer.MemoryCopy for sizes above 1024 bytes, but the difference may not be significant enough to change the code.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.18362
Intel Core i5-8250U CPU 1.60GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
  [Host]       : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4010.0
  LegacyJitX86 : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.8.4010.0
  RyuJitX64    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4010.0

Runtime=Clr

Method	Job	Jit	Platform	NumBytes	Mean	Error	StdDev	Median	Ratio	RatioSD
Buffer_BlockCopy	LegacyJitX86	LegacyJit	X86	1024	79.36 ns	1.664 ns	4.081 ns	79.36 ns	1.00	0.00
Buffer_MemoryCopy	LegacyJitX86	LegacyJit	X86	1024	417.77 ns	8.339 ns	20.613 ns	422.18 ns	5.28	0.33
Buffer_BlockCopy	RyuJitX64	RyuJit	X64	1024	72.26 ns	1.563 ns	4.034 ns	72.05 ns	1.00	0.00
Buffer_MemoryCopy	RyuJitX64	RyuJit	X64	1024	72.13 ns	1.504 ns	3.004 ns	71.99 ns	1.01	0.07
Buffer_BlockCopy	LegacyJitX86	LegacyJit	X86	2048	117.32 ns	2.431 ns	6.145 ns	118.53 ns	1.00	0.00
Buffer_MemoryCopy	LegacyJitX86	LegacyJit	X86	2048	787.86 ns	15.789 ns	30.041 ns	786.57 ns	6.75	0.46
Buffer_BlockCopy	RyuJitX64	RyuJit	X64	2048	101.15 ns	2.056 ns	4.200 ns	101.48 ns	1.00	0.00
Buffer_MemoryCopy	RyuJitX64	RyuJit	X64	2048	215.62 ns	22.340 ns	65.870 ns	239.91 ns	1.66	0.55
Buffer_BlockCopy	LegacyJitX86	LegacyJit	X86	8192	448.76 ns	34.963 ns	103.088 ns	453.13 ns	1.00	0.00
Buffer_MemoryCopy	LegacyJitX86	LegacyJit	X86	8192	526.23 ns	48.595 ns	143.283 ns	518.02 ns	1.26	0.55
Buffer_BlockCopy	RyuJitX64	RyuJit	X64	8192	626.81 ns	14.647 ns	43.187 ns	627.00 ns	1.00	0.00
Buffer_MemoryCopy	RyuJitX64	RyuJit	X64	8192	740.20 ns	14.804 ns	20.265 ns	742.05 ns	1.17	0.06

Daniel-Svensson · 2019-10-06T20:13:14Z

...ervices.Hosting/Framework/Services/MessageEncoders/PoxBinaryMessageEncodingBindingElement.cs

+                        // For x86 it is significantly faster to do copying of int's and longs
+                        // or similar in managed code for smaller counts (below 100-200)
+                        // But we expect most copies to be larger since xml writer buffer around 500 bytes
+                        Buffer.BlockCopy(src, srcOffset, dest, destOffset, count);


x86 copy speed

Number of bytes	Buffer_BlockCopy	FastCopy_Long
4	25.402	6.404
40	30.176	22.202
200	65.845	54.938
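
The implementation behind `FastCopy_Long` is not shown in this thread. The sketch below is one assumed way to copy small ranges eight bytes at a time in managed code and fall back to `Buffer.BlockCopy` for larger ones. `CopyHelpers`, `CopyAsLongs`, and the 128-byte threshold are hypothetical; the diff comment only says "below 100-200", and the diff itself calls `Buffer.BlockCopy` directly. It uses `MemoryMarshal.Cast` from `System.Runtime.InteropServices`, which needs the System.Memory package on .NET Framework, and it assumes the source and destination ranges do not overlap.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch: choose a copy strategy based on the byte count.
public static class CopyHelpers
{
    // Assumed cut-over point; the diff comment only says "below 100-200".
    public const int SmallCopyThreshold = 128;

    public static void Copy(byte[] src, int srcOffset, byte[] dest, int destOffset, int count)
    {
        if (count == 0)
            return;

        if (count < SmallCopyThreshold)
            CopyAsLongs(src, srcOffset, dest, destOffset, count);
        else
            Buffer.BlockCopy(src, srcOffset, dest, destOffset, count);
    }

    // Copies whole 8-byte words first, then the remaining 0 to 7 tail bytes.
    public static void CopyAsLongs(byte[] src, int srcOffset, byte[] dest, int destOffset, int count)
    {
        ReadOnlySpan<byte> from = src.AsSpan(srcOffset, count);
        Span<byte> to = dest.AsSpan(destOffset, count);

        int wholeBytes = count / sizeof(long) * sizeof(long);
        ReadOnlySpan<long> fromLongs = MemoryMarshal.Cast<byte, long>(from.Slice(0, wholeBytes));
        Span<long> toLongs = MemoryMarshal.Cast<byte, long>(to.Slice(0, wholeBytes));

        for (int i = 0; i < fromLongs.Length; i++)
            toLongs[i] = fromLongs[i];

        for (int i = wholeBytes; i < count; i++)
            to[i] = from[i];
    }
}
```

Whether such a branch pays off depends on the typical copy size. The diff comment expects most copies to be larger because the XML writer buffers around 500 bytes, which is the stated reason for calling `Buffer.BlockCopy` unconditionally.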

* Also add heuristics for guessing buffer size

Daniel-Svensson added Area-Client Area-Server performance labels Sep 27, 2019

Daniel-Svensson added this to the 5.0 milestone Sep 27, 2019

Daniel-Svensson changed the title ~~Improve binary endpoint memory usage~~ [WIP] Improve binary endpoint memory usage Sep 27, 2019

Daniel-Svensson changed the title ~~[WIP] Improve binary endpoint memory usage~~ Improve binary endpoint memory usage Oct 3, 2019

Daniel-Svensson added 3 commits October 3, 2019 13:42

Add BufferManagerStream to avoid allocations for serialization buffer

80cadfd

* By using BufferManager, memory pressure and LOH allocations should decrease substantially

add draft for tests

80baea8

add perf notes

922f9a5

Daniel-Svensson force-pushed the feature/buffermanagerstream branch from a5fd4b8 to 3055fc8 Compare October 3, 2019 11:42

Daniel-Svensson commented Oct 3, 2019

add some additional comments

c3f6493

Daniel-Svensson force-pushed the feature/buffermanagerstream branch from 3055fc8 to c3f6493 Compare October 3, 2019 15:15

Daniel-Svensson added 3 commits October 3, 2019 19:35

add tests with starting offset for resulting buffer

0c69963

some minor test improvements

7c4586c

Increase buffer size more aggressively

1cfacda

Daniel-Svensson commented Oct 6, 2019

Daniel-Svensson added 4 commits October 18, 2019 20:10

Add BinaryMessageWriter class to allow reuse of XmlWriter

789cd72

* Also add heuristics for guessing buffer size

cleanup namespaces

1b135f9

Fix Min => Max

a474706

variable name change

cf65b6c

Daniel-Svensson merged commit 0edd1fa into OpenRIAServices:master Oct 30, 2019

Daniel-Svensson deleted the feature/buffermanagerstream branch October 30, 2019 13:17

Daniel-Svensson mentioned this pull request Oct 31, 2019

[Discussion] v5.0 planning and feedback #178

Closed

11 tasks
