added NBench support + End2End performance spec for TcpSocketChannel #95

Merged
merged 1 commit into Azure:dev on May 1, 2016

Conversation

Aaronontheweb (Contributor)

Added NBench support to the build scripts, a sample NBench project, and a port of a benchmark I wrote for Helios that measures duplex end-to-end performance using a realistic encoding pipeline.
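
For orientation, a minimal sketch of what an NBench spec of this shape looks like is below. The attribute and measurement names are real NBench 0.2 APIs, but the class, field, and method names here are placeholders, not the actual TcpChannelPerfSpecs source in this PR.

// A minimal sketch only -- class, field, and method names are placeholders.
using NBench;

public class DuplexThroughputSpecSketch
{
    const string InboundCounterName = "inbound ops";
    const string OutboundCounterName = "outbound ops";

    Counter inboundThroughputCounter;
    Counter outboundThroughputCounter;

    [PerfSetup]
    public void SetUp(BenchmarkContext context)
    {
        // NBench rebuilds the fixture for each iteration, so event loops,
        // bootstraps, and allocators created here start cold on every run.
        this.inboundThroughputCounter = context.GetCounter(InboundCounterName);
        this.outboundThroughputCounter = context.GetCounter(OutboundCounterName);
    }

    [PerfBenchmark(Description = "Duplex throughput of a TcpSocketChannel --> TcpServerSocketChannel connection",
        RunMode = RunMode.Iterations, NumberOfIterations = 13, TestMode = TestMode.Measurement)]
    [CounterMeasurement(InboundCounterName)]
    [CounterMeasurement(OutboundCounterName)]
    [GcMeasurement(GcMetric.TotalCollections, GcGeneration.AllGc)]
    [MemoryMeasurement(MemoryMetric.TotalBytesAllocated)]
    public void DuplexThroughput(BenchmarkContext context)
    {
        // Push messages through the client pipeline and block on a reset event
        // until the server has echoed all of them back, incrementing the
        // inbound/outbound counters per message.
    }

    [PerfCleanup]
    public void CleanUp()
    {
        // Shut down the event loop groups and release any remaining buffers.
    }
}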

azurecla commented May 1, 2016

Hi @Aaronontheweb, I'm your friendly neighborhood Azure Pull Request Bot (You can call me AZPRBOT). Thanks for your contribution!
You've already signed the contribution license agreement. Thanks!
We will now validate the agreement and then real humans will evaluate your PR.

TTYL, AZPRBOT;

Aaronontheweb (Contributor, Author)

FYI - this spec touches a bunch of different features all at once (I tried to do something I thought an end-user might do) and could probably stand to be optimized in a couple of areas.

Aaronontheweb (Contributor, Author)

Initial results, which you can view on pull requests going forward by looking inside the "PerfResults" artifacts folder on each TeamCity build:

DotNetty.Transport.Tests.Performance.Sockets.TcpChannelPerfSpecs+TcpChannel_Duplex_Throughput

Measures how quickly and with how much GC overhead a TcpSocketChannel --> TcpServerSocketChannel connection can decode / encode realistic messages
5/1/2016 2:11:51 AM

System Info

NBench=NBench, Version=0.2.1.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

Metric Units Max Average Min StdDev
TotalCollections [Gen0] collections 49.00 47.46 44.00 1.56
TotalCollections [Gen1] collections 4.00 3.15 2.00 0.99
TotalCollections [Gen2] collections 1.00 0.62 0.00 0.51
TotalBytesAllocated bytes 27,154,840.00 23,919,055.38 20,628,272.00 1,976,182.48
[Counter] inbound ops operations 100,000.00 100,000.00 100,000.00 0.00
[Counter] outbound ops operations 100,000.00 100,000.00 100,000.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalCollections [Gen0] collections 31.89 29.80 27.02 1.43
TotalCollections [Gen1] collections 2.60 1.98 1.23 0.62
TotalCollections [Gen2] collections 0.65 0.39 0.00 0.32
TotalBytesAllocated bytes 16,660,557.86 15,018,960.66 12,119,105.95 1,306,064.28
[Counter] inbound ops operations 65,076.98 62,779.78 58,749.98 1,616.19
[Counter] outbound ops operations 65,076.98 62,779.78 58,749.98 1,616.19

Raw Data

TotalCollections [Gen0]

Run # collections collections / s ns / collections
1 47.00 29.98 33,354,048.94
2 48.00 30.28 33,028,789.58
3 46.00 27.02 37,002,786.96
4 47.00 29.21 34,233,634.04
5 46.00 28.40 35,217,239.13
6 49.00 31.79 31,456,200.00
7 49.00 30.06 33,263,016.33
8 47.00 29.59 33,791,874.47
9 49.00 31.89 31,360,032.65
10 47.00 29.75 33,616,906.38
11 49.00 30.92 32,345,485.71
12 44.00 27.89 35,859,540.91
13 49.00 30.68 32,598,748.98

TotalCollections [Gen1]

Run # collections collections / s ns / collections
1 2.00 1.28 783,820,150.00
2 4.00 2.52 396,345,475.00
3 4.00 2.35 425,532,050.00
4 4.00 2.49 402,245,200.00
5 2.00 1.23 809,996,500.00
6 4.00 2.60 385,338,450.00
7 2.00 1.23 814,943,900.00
8 3.00 1.89 529,406,033.33
9 4.00 2.60 384,160,400.00
10 2.00 1.27 789,997,300.00
11 4.00 2.52 396,232,200.00
12 2.00 1.27 788,909,900.00
13 4.00 2.50 399,334,675.00

TotalCollections [Gen2]

Run # collections collections / s ns / collections
1 0.00 0.00 1,567,640,300.00
2 1.00 0.63 1,585,381,900.00
3 1.00 0.59 1,702,128,200.00
4 1.00 0.62 1,608,980,800.00
5 0.00 0.00 1,619,993,000.00
6 1.00 0.65 1,541,353,800.00
7 0.00 0.00 1,629,887,800.00
8 1.00 0.63 1,588,218,100.00
9 1.00 0.65 1,536,641,600.00
10 0.00 0.00 1,579,994,600.00
11 1.00 0.63 1,584,928,800.00
12 0.00 0.00 1,577,819,800.00
13 1.00 0.63 1,597,338,700.00

TotalBytesAllocated

Run # bytes bytes / s ns / bytes
1 24,221,200.00 15,450,738.28 64.72
2 25,680,832.00 16,198,514.69 61.73
3 20,628,272.00 12,119,105.95 82.51
4 24,225,152.00 15,056,209.50 66.42
5 21,819,264.00 13,468,739.68 74.25
6 21,708,192.00 14,083,847.59 71.00
7 27,154,840.00 16,660,557.86 60.02
8 23,396,744.00 14,731,442.74 67.88
9 22,071,608.00 14,363,536.69 69.62
10 23,185,920.00 14,674,683.07 68.14
11 25,863,136.00 16,318,168.99 61.28
12 25,500,864.00 16,162,088.98 61.87
13 25,491,696.00 15,958,854.56 62.66

[Counter] inbound ops

Run # operations operations / s ns / operations
1 100,000.00 63,790.14 15,676.40
2 100,000.00 63,076.28 15,853.82
3 100,000.00 58,749.98 17,021.28
4 100,000.00 62,151.15 16,089.81
5 100,000.00 61,728.66 16,199.93
6 100,000.00 64,878.03 15,413.54
7 100,000.00 61,353.92 16,298.88
8 100,000.00 62,963.64 15,882.18
9 100,000.00 65,076.98 15,366.42
10 100,000.00 63,291.36 15,799.95
11 100,000.00 63,094.32 15,849.29
12 100,000.00 63,378.59 15,778.20
13 100,000.00 62,604.13 15,973.39

[Counter] outbound ops

Run # operations operations / s ns / operations
1 100,000.00 63,790.14 15,676.40
2 100,000.00 63,076.28 15,853.82
3 100,000.00 58,749.98 17,021.28
4 100,000.00 62,151.15 16,089.81
5 100,000.00 61,728.66 16,199.93
6 100,000.00 64,878.03 15,413.54
7 100,000.00 61,353.92 16,298.88
8 100,000.00 62,963.64 15,882.18
9 100,000.00 65,076.98 15,366.42
10 100,000.00 63,291.36 15,799.95
11 100,000.00 63,094.32 15,849.29
12 100,000.00 63,378.59 15,778.20
13 100,000.00 62,604.13 15,973.39

var counterHandler = new CounterHandlerInbound(this._inboundThroughputCounter);
this._signal = new ManualResetEventSlimReadFinishedSignal(this.ResetEvent);

ServerBootstrap sb = new ServerBootstrap().Group(this.ServerGroup, this.WorkerGroup).Channel<TcpServerSocketChannel>()

pls line break the whole thing (including the initializer below) for better readability. You get more LOC as well! 😄
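
For reference, the line-broken shape being asked for would look roughly like this; only the portion quoted above is shown, and the rest of the fluent chain plus the child-channel initializer are deliberately omitted since they are not quoted in this excerpt.

ServerBootstrap sb = new ServerBootstrap()
    .Group(this.ServerGroup, this.WorkerGroup)
    .Channel<TcpServerSocketChannel>()
    // ...one call per line for the rest of the chain, including the
    // child-channel initializer, which is not quoted in this excerpt...
    ;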

Aaronontheweb (Contributor, Author)

Ran locally with the pooled bytebuf allocators at a message count of 1 million, and the reported throughput is nearly identical to what it was before.

I'll re-run it on the build server though, which has so far posted consistently higher numbers than my development machine (which is what you'd expect from hardware that is reserved exclusively for builds and tests).

nayato commented May 1, 2016

"Ran locally with the pooled bytebuf allocators" -- on both client and server? Well, at that scale, with no warm up it might be expected. Pooled allocates buffers as needed right now. Arena based allocator (coming in next few days) might change things a little.

Aaronontheweb (Contributor, Author)

"Ran locally with the pooled bytebuf allocators" -- on both client and server? Well, at that scale, with no warm up it might be expected.

Yes - on both sides. There was a noticeable drop in Gen 0 GC, though. Gen 2 was about the same, and I'll have to pull the numbers off the build server for Gen 1.

Aaronontheweb (Contributor, Author)

To run these locally, btw:

PS> ./build.cmd nbench

Aaronontheweb (Contributor, Author)

with no warm up it might be expected

Ah, that's a good point - NBench automatically destroys the test fixture and recreates it in place, exactly like xUnit does, so we're starting with a fresh IByteBufferAllocator each time.

this.signal = new ManualResetEventSlimReadFinishedSignal(this.ResetEvent);

// reserve up to 10mb of 16kb buffers on both client and server; we're only sending about 700k worth of messages
this.serverBufferAllocator = new PooledByteBufferAllocator(16 * 1024, 10 * 1024 * 1024 / Environment.ProcessorCount);

nayato (Member) commented May 1, 2016

with this config there are only 640 buffers to use (10 MB / 16 KB). Considering the small message size, can you pls adjust to use 1 KB buffers or even 256 bytes?

Aaronontheweb (Contributor, Author)

Let's try 256
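
For concreteness, that adjustment would look roughly like the following, a sketch based on the constructor call quoted above; the client-side field name is assumed here, not taken from the actual spec.

// Same ~10 MB budget as before, but carved into 256-byte buffers
// (roughly 40,960 of them instead of 640 at 16 KB each).
this.serverBufferAllocator = new PooledByteBufferAllocator(256, 10 * 1024 * 1024 / Environment.ProcessorCount);
this.clientBufferAllocator = new PooledByteBufferAllocator(256, 10 * 1024 * 1024 / Environment.ProcessorCount); // client-side field name assumed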

Aaronontheweb (Contributor, Author)

First results from the initial pooled byte buffer settings:

DotNetty.Transport.Tests.Performance.Sockets.TcpChannelPerfSpecs+TcpChannel_Duplex_Throughput

Measures how quickly and with how much GC overhead a TcpSocketChannel --> TcpServerSocketChannel connection can decode / encode realistic messages
5/1/2016 4:20:03 PM

System Info

NBench=NBench, Version=0.2.1.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

Metric Units Max Average Min StdDev
TotalCollections [Gen0] collections 12.00 11.69 11.00 0.48
TotalCollections [Gen1] collections 4.00 3.08 1.00 1.44
TotalCollections [Gen2] collections 1.00 0.69 0.00 0.48
TotalBytesAllocated bytes 24,327,288.00 19,955,723.08 15,863,512.00 3,002,698.00
[Counter] inbound ops operations 100,000.00 100,000.00 100,000.00 0.00
[Counter] outbound ops operations 100,000.00 100,000.00 100,000.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalCollections [Gen0] collections 5.36 5.12 4.58 0.23
TotalCollections [Gen1] collections 1.79 1.35 0.42 0.63
TotalCollections [Gen2] collections 0.45 0.30 0.00 0.21
TotalBytesAllocated bytes 10,875,772.56 8,759,659.30 6,605,354.02 1,427,636.66
[Counter] inbound ops operations 45,157.28 43,828.73 41,597.66 1,203.30
[Counter] outbound ops operations 45,157.28 43,828.73 41,597.66 1,203.30

Aaronontheweb (Contributor, Author)

Definitely a drop in throughput, but also a drop in GC.

@nayato what's the recommended way to warm up a pool allocator? We should shave off that overhead if possible.

nayato commented May 1, 2016

you'd need to iterate through the event loops on the event loop group and execute "take X buffers, then release them" on each of the event loops.
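
A rough sketch of that warm-up idea is below. It assumes GetNext() round-robins across the group's loops so each loop is hit exactly once, and the buffer count and 256-byte size are arbitrary placeholders; this is not a confirmed DotNetty recipe.

using System.Threading;
using DotNetty.Buffers;
using DotNetty.Transport.Channels;

static void WarmUpAllocator(IEventLoopGroup group, IByteBufferAllocator allocator, int eventLoopCount, int buffersPerLoop)
{
    var done = new CountdownEvent(eventLoopCount);
    for (int i = 0; i < eventLoopCount; i++)
    {
        // Assumes GetNext() round-robins, so each loop in the group is visited once.
        group.GetNext().Execute(() =>
        {
            // "take X buffers, then release them" on the loop's own thread,
            // so the released buffers land back in that loop's pool.
            var taken = new IByteBuffer[buffersPerLoop];
            for (int j = 0; j < buffersPerLoop; j++)
            {
                taken[j] = allocator.Buffer(256);
            }
            foreach (IByteBuffer buf in taken)
            {
                buf.Release();
            }
            done.Signal();
        });
    }
    done.Wait(); // block until every loop has primed its pool
}

In the spec this could presumably be called from the PerfSetup step for both the client and server groups, before the measured run starts.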

Aaronontheweb (Contributor, Author)

We'll need to play around with this some more - it seems to be topping out around 45k messages per second no matter what I do today.

TBH, I suspect part of the issue here is that we're using auto-scaling on Azure to provision build agents, and there's no guarantee that we'll get identical underlying hardware beneath the VM each time. It could very well be that the VM we used last night was running on newer hardware than the one we're on right now.

Ran a test locally comparing the pooled byte buffer allocator against the direct one again, and the pooled one is noticeably more performant even without warm-up.

nayato commented May 1, 2016

again, lack of batching when sending is the biggest issue here, I think. I'll have cycles to address it once I'm through with the new byte buffer pool.

nayato commented May 1, 2016

just use Dv2 machines 👍

nayato commented May 1, 2016

@Aaronontheweb, something failed on the build. Worth merging anyhow. Pls break apart IReadFinishedSignal.cs (one file per type) and add license headers to all the new .cs files, and it should be good to go.

Aaronontheweb (Contributor, Author)

@nayato yeah, NBench exceeded 15 minutes of runtime so FAKE timed it out :p - I'm increasing the FAKE timeout and decreasing the run interval for NBench, since I'm breaking the benchmark out into 3 parts with different flush intervals.

Aaronontheweb (Contributor, Author)

Looks like we're using standard D12s. I'll change to a Dv2.

Aaronontheweb (Contributor, Author)

Ok, added all copyright headers and applied last formatting changes.

Aaronontheweb (Contributor, Author)

Removed the additional 10-messages-per-flush and 1000-messages-per-flush options - the run would have taken something like 45 minutes (the PerfCleanup call seems to take a long time to shut everything down in this spec, which I've noticed before in regular unit tests that clean up IEventLoops).

nayato merged commit 401d48f into Azure:dev on May 1, 2016