added NBench support + End2End performance spec for TcpSocketChannel #95

Merged
merged 1 commit into Azure:dev on May 1, 2016

Conversation

Aaronontheweb (Contributor)

Added NBench support to the build scripts, a sample NBench project, and a port of a benchmark I wrote for Helios that measures duplex end-to-end performance using a realistic encoding pipeline.
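
For orientation, a minimal sketch of what an NBench spec of this shape looks like is below. The attribute and measurement names are real NBench 0.2 APIs, but the class, field, and method names here are placeholders, not the actual TcpChannelPerfSpecs source in this PR.

// A minimal sketch only -- class, field, and method names are placeholders.
using NBench;

public class DuplexThroughputSpecSketch
{
    const string InboundCounterName = "inbound ops";
    const string OutboundCounterName = "outbound ops";

    Counter inboundThroughputCounter;
    Counter outboundThroughputCounter;

    [PerfSetup]
    public void SetUp(BenchmarkContext context)
    {
        // NBench rebuilds the fixture for each iteration, so event loops,
        // bootstraps, and allocators created here start cold on every run.
        this.inboundThroughputCounter = context.GetCounter(InboundCounterName);
        this.outboundThroughputCounter = context.GetCounter(OutboundCounterName);
    }

    [PerfBenchmark(Description = "Duplex throughput of a TcpSocketChannel --> TcpServerSocketChannel connection",
        RunMode = RunMode.Iterations, NumberOfIterations = 13, TestMode = TestMode.Measurement)]
    [CounterMeasurement(InboundCounterName)]
    [CounterMeasurement(OutboundCounterName)]
    [GcMeasurement(GcMetric.TotalCollections, GcGeneration.AllGc)]
    [MemoryMeasurement(MemoryMetric.TotalBytesAllocated)]
    public void DuplexThroughput(BenchmarkContext context)
    {
        // Push messages through the client pipeline and block on a reset event
        // until the server has echoed all of them back, incrementing the
        // inbound/outbound counters per message.
    }

    [PerfCleanup]
    public void CleanUp()
    {
        // Shut down the event loop groups and release any remaining buffers.
    }
}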

azurecla commented May 1, 2016

Hi @Aaronontheweb, I'm your friendly neighborhood Azure Pull Request Bot (You can call me AZPRBOT). Thanks for your contribution!
You've already signed the contribution license agreement. Thanks!
We will now validate the agreement and then real humans will evaluate your PR.

TTYL, AZPRBOT;

Aaronontheweb (Contributor, Author)

FYI - this spec touches a bunch of different features all at once (I tried to do something I thought an end-user might do) and could probably stand to be optimized in a couple of areas.

Aaronontheweb (Contributor, Author)

Initial results, which you can view on pull requests going forward by looking inside the "PerfResults" artifacts folder on each TeamCity build:

DotNetty.Transport.Tests.Performance.Sockets.TcpChannelPerfSpecs+TcpChannel_Duplex_Throughput

Measures how quickly and with how much GC overhead a TcpSocketChannel --> TcpServerSocketChannel connection can decode / encode realistic messages
5/1/2016 2:11:51 AM

System Info

NBench=NBench, Version=0.2.1.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

Metric Units Max Average Min StdDev
TotalCollections [Gen0] collections 49.00 47.46 44.00 1.56
TotalCollections [Gen1] collections 4.00 3.15 2.00 0.99
TotalCollections [Gen2] collections 1.00 0.62 0.00 0.51
TotalBytesAllocated bytes 27,154,840.00 23,919,055.38 20,628,272.00 1,976,182.48
[Counter] inbound ops operations 100,000.00 100,000.00 100,000.00 0.00
[Counter] outbound ops operations 100,000.00 100,000.00 100,000.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalCollections [Gen0] collections 31.89 29.80 27.02 1.43
TotalCollections [Gen1] collections 2.60 1.98 1.23 0.62
TotalCollections [Gen2] collections 0.65 0.39 0.00 0.32
TotalBytesAllocated bytes 16,660,557.86 15,018,960.66 12,119,105.95 1,306,064.28
[Counter] inbound ops operations 65,076.98 62,779.78 58,749.98 1,616.19
[Counter] outbound ops operations 65,076.98 62,779.78 58,749.98 1,616.19

Raw Data

TotalCollections [Gen0]

Run # collections collections / s ns / collections
1 47.00 29.98 33,354,048.94
2 48.00 30.28 33,028,789.58
3 46.00 27.02 37,002,786.96
4 47.00 29.21 34,233,634.04
5 46.00 28.40 35,217,239.13
6 49.00 31.79 31,456,200.00
7 49.00 30.06 33,263,016.33
8 47.00 29.59 33,791,874.47
9 49.00 31.89 31,360,032.65
10 47.00 29.75 33,616,906.38
11 49.00 30.92 32,345,485.71
12 44.00 27.89 35,859,540.91
13 49.00 30.68 32,598,748.98

TotalCollections [Gen1]

Run # collections collections / s ns / collections
1 2.00 1.28 783,820,150.00
2 4.00 2.52 396,345,475.00
3 4.00 2.35 425,532,050.00
4 4.00 2.49 402,245,200.00
5 2.00 1.23 809,996,500.00
6 4.00 2.60 385,338,450.00
7 2.00 1.23 814,943,900.00
8 3.00 1.89 529,406,033.33
9 4.00 2.60 384,160,400.00
10 2.00 1.27 789,997,300.00
11 4.00 2.52 396,232,200.00
12 2.00 1.27 788,909,900.00
13 4.00 2.50 399,334,675.00

TotalCollections [Gen2]

Run # collections collections / s ns / collections
1 0.00 0.00 1,567,640,300.00
2 1.00 0.63 1,585,381,900.00
3 1.00 0.59 1,702,128,200.00
4 1.00 0.62 1,608,980,800.00
5 0.00 0.00 1,619,993,000.00
6 1.00 0.65 1,541,353,800.00
7 0.00 0.00 1,629,887,800.00
8 1.00 0.63 1,588,218,100.00
9 1.00 0.65 1,536,641,600.00
10 0.00 0.00 1,579,994,600.00
11 1.00 0.63 1,584,928,800.00
12 0.00 0.00 1,577,819,800.00
13 1.00 0.63 1,597,338,700.00

TotalBytesAllocated

Run # bytes bytes / s ns / bytes
1 24,221,200.00 15,450,738.28 64.72
2 25,680,832.00 16,198,514.69 61.73
3 20,628,272.00 12,119,105.95 82.51
4 24,225,152.00 15,056,209.50 66.42
5 21,819,264.00 13,468,739.68 74.25
6 21,708,192.00 14,083,847.59 71.00
7 27,154,840.00 16,660,557.86 60.02
8 23,396,744.00 14,731,442.74 67.88
9 22,071,608.00 14,363,536.69 69.62
10 23,185,920.00 14,674,683.07 68.14
11 25,863,136.00 16,318,168.99 61.28
12 25,500,864.00 16,162,088.98 61.87
13 25,491,696.00 15,958,854.56 62.66

[Counter] inbound ops

Run # operations operations / s ns / operations
1 100,000.00 63,790.14 15,676.40
2 100,000.00 63,076.28 15,853.82
3 100,000.00 58,749.98 17,021.28
4 100,000.00 62,151.15 16,089.81
5 100,000.00 61,728.66 16,199.93
6 100,000.00 64,878.03 15,413.54
7 100,000.00 61,353.92 16,298.88
8 100,000.00 62,963.64 15,882.18
9 100,000.00 65,076.98 15,366.42
10 100,000.00 63,291.36 15,799.95
11 100,000.00 63,094.32 15,849.29
12 100,000.00 63,378.59 15,778.20
13 100,000.00 62,604.13 15,973.39

[Counter] outbound ops

Run # operations operations / s ns / operations
1 100,000.00 63,790.14 15,676.40
2 100,000.00 63,076.28 15,853.82
3 100,000.00 58,749.98 17,021.28
4 100,000.00 62,151.15 16,089.81
5 100,000.00 61,728.66 16,199.93
6 100,000.00 64,878.03 15,413.54
7 100,000.00 61,353.92 16,298.88
8 100,000.00 62,963.64 15,882.18
9 100,000.00 65,076.98 15,366.42
10 100,000.00 63,291.36 15,799.95
11 100,000.00 63,094.32 15,849.29
12 100,000.00 63,378.59 15,778.20
13 100,000.00 62,604.13 15,973.39

var counterHandler = new CounterHandlerInbound(this._inboundThroughputCounter);
this._signal = new ManualResetEventSlimReadFinishedSignal(this.ResetEvent);

ServerBootstrap sb = new ServerBootstrap().Group(this.ServerGroup, this.WorkerGroup).Channel<TcpServerSocketChannel>()

pls line break the whole thing (including the initializer below) for better readability. You get more LOC as well! 😄
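
For reference, the line-broken shape being asked for would look roughly like this; only the portion quoted above is shown, and the rest of the fluent chain plus the child-channel initializer are deliberately omitted since they are not quoted in this excerpt.

ServerBootstrap sb = new ServerBootstrap()
    .Group(this.ServerGroup, this.WorkerGroup)
    .Channel<TcpServerSocketChannel>()
    // ...one call per line for the rest of the chain, including the
    // child-channel initializer, which is not quoted in this excerpt...
    ;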

Aaronontheweb (Contributor, Author)

Ran locally with the pooled bytebuf allocators at a message count of 1 million, and the reported throughput is nearly identical to what it was before.

I'll re-run it on the build server though, which has so far posted consistently higher numbers than my development machine (which is what you'd expect from hardware that is reserved exclusively for builds and tests).

nayato commented May 1, 2016

"Ran locally with the pooled bytebuf allocators" -- on both client and server? Well, at that scale, with no warm up it might be expected. Pooled allocates buffers as needed right now. Arena based allocator (coming in next few days) might change things a little.

Aaronontheweb (Contributor, Author)

"Ran locally with the pooled bytebuf allocators" -- on both client and server? Well, at that scale, with no warm up it might be expected.

Yes - on both sides. There was a noticeable drop in Gen 0 GC, though. Gen 2 was about the same, and I'll have to pull the numbers off the build server for Gen 1.

Aaronontheweb (Contributor, Author)

To run these locally, btw:

PS> ./build.cmd nbench

Aaronontheweb (Contributor, Author)

with no warm up it might be expected

Ah, that's a good point - NBench automatically destroys the test fixture and recreates it in place, exactly like xUnit does, so we're starting with a fresh IByteBufferAllocator each time.

this.signal = new ManualResetEventSlimReadFinishedSignal(this.ResetEvent);

// reserve up to 10mb of 16kb buffers on both client and server; we're only sending about 700k worth of messages
this.serverBufferAllocator = new PooledByteBufferAllocator(16 * 1024, 10 * 1024 * 1024 / Environment.ProcessorCount);

nayato (Member) commented May 1, 2016

with this config there are only 640 buffers to use (10 MB / 16 KB). Considering the small message size, can you pls adjust to use 1 KB buffers or even 256 bytes?

Aaronontheweb (Contributor, Author)

Let's try 256
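
For concreteness, that adjustment would look roughly like the following, a sketch based on the constructor call quoted above; the client-side field name is assumed here, not taken from the actual spec.

// Same ~10 MB budget as before, but carved into 256-byte buffers
// (roughly 40,960 of them instead of 640 at 16 KB each).
this.serverBufferAllocator = new PooledByteBufferAllocator(256, 10 * 1024 * 1024 / Environment.ProcessorCount);
this.clientBufferAllocator = new PooledByteBufferAllocator(256, 10 * 1024 * 1024 / Environment.ProcessorCount); // client-side field name assumed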

Aaronontheweb (Contributor, Author)

First results from the initial pooled byte buffer settings:

DotNetty.Transport.Tests.Performance.Sockets.TcpChannelPerfSpecs+TcpChannel_Duplex_Throughput

Measures how quickly and with how much GC overhead a TcpSocketChannel --> TcpServerSocketChannel connection can decode / encode realistic messages
5/1/2016 4:20:03 PM

System Info

NBench=NBench, Version=0.2.1.0, Culture=neutral, PublicKeyToken=null
OS=Microsoft Windows NT 6.2.9200.0
ProcessorCount=4
CLR=4.0.30319.42000,IsMono=False,MaxGcGeneration=2
WorkerThreads=32767, IOThreads=4

NBench Settings

RunMode=Iterations, TestMode=Measurement
NumberOfIterations=13, MaximumRunTime=00:00:01

Data


Totals

Metric Units Max Average Min StdDev
TotalCollections [Gen0] collections 12.00 11.69 11.00 0.48
TotalCollections [Gen1] collections 4.00 3.08 1.00 1.44
TotalCollections [Gen2] collections 1.00 0.69 0.00 0.48
TotalBytesAllocated bytes 24,327,288.00 19,955,723.08 15,863,512.00 3,002,698.00
[Counter] inbound ops operations 100,000.00 100,000.00 100,000.00 0.00
[Counter] outbound ops operations 100,000.00 100,000.00 100,000.00 0.00

Per-second Totals

Metric Units / s Max / s Average / s Min / s StdDev / s
TotalCollections [Gen0] collections 5.36 5.12 4.58 0.23
TotalCollections [Gen1] collections 1.79 1.35 0.42 0.63
TotalCollections [Gen2] collections 0.45 0.30 0.00 0.21
TotalBytesAllocated bytes 10,875,772.56 8,759,659.30 6,605,354.02 1,427,636.66
[Counter] inbound ops operations 45,157.28 43,828.73 41,597.66 1,203.30
[Counter] outbound ops operations 45,157.28 43,828.73 41,597.66 1,203.30

Aaronontheweb (Contributor, Author)

Definitely a drop in throughput, but also a drop in GC.

@nayato what's the recommended way to warm up a pool allocator? We should shave off that overhead if possible.

nayato commented May 1, 2016

you'd need to iterate through the event loops on the event loop group and execute "take X buffers, then release them" on each of the event loops.
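
A rough sketch of that warm-up idea is below. It assumes GetNext() round-robins across the group's loops so each loop is hit exactly once, and the buffer count and 256-byte size are arbitrary placeholders; this is not a confirmed DotNetty recipe.

using System.Threading;
using DotNetty.Buffers;
using DotNetty.Transport.Channels;

static void WarmUpAllocator(IEventLoopGroup group, IByteBufferAllocator allocator, int eventLoopCount, int buffersPerLoop)
{
    var done = new CountdownEvent(eventLoopCount);
    for (int i = 0; i < eventLoopCount; i++)
    {
        // Assumes GetNext() round-robins, so each loop in the group is visited once.
        group.GetNext().Execute(() =>
        {
            // "take X buffers, then release them" on the loop's own thread,
            // so the released buffers land back in that loop's pool.
            var taken = new IByteBuffer[buffersPerLoop];
            for (int j = 0; j < buffersPerLoop; j++)
            {
                taken[j] = allocator.Buffer(256);
            }
            foreach (IByteBuffer buf in taken)
            {
                buf.Release();
            }
            done.Signal();
        });
    }
    done.Wait(); // block until every loop has primed its pool
}

In the spec this could presumably be called from the PerfSetup step for both the client and server groups, before the measured run starts.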

Aaronontheweb (Contributor, Author)

We'll need to play around with this some more - it seems to be topping out around 45k messages per second no matter what I do today.

TBH, I suspect part of the issue here is that we're using auto-scaling on Azure to provision build agents, and there's no guarantee that we'll get identical underlying hardware beneath the VM each time. It could very well be that the VM we used last night was running on newer hardware than the one we're on right now.

Ran a test locally comparing the pooled byte buffer allocator against the direct one again, and the pooled one is noticeably more performant even without warm-up.

nayato commented May 1, 2016

again, lack of batching when sending is the biggest issue here, I think. I'll have cycles to address it once I'm through with the new byte buffer pool.

nayato commented May 1, 2016

just use Dv2 machines 👍

nayato commented May 1, 2016

@Aaronontheweb, something failed on the build. Worth merging anyhow. Pls break apart IReadFinishedSignal.cs (one file per type) and add license headers to all the new .cs files, and it should be good to go.

Aaronontheweb (Contributor, Author)

@nayato yeah, NBench exceeded 15 minutes of runtime so FAKE timed it out :p - I'm increasing the FAKE timeout and decreasing the run interval for NBench, since I'm breaking the benchmark out into 3 parts with different flush intervals.

Aaronontheweb (Contributor, Author)

Looks like we're using standard D12s. I'll change to a Dv2.

Aaronontheweb (Contributor, Author)

Ok, added all copyright headers and applied last formatting changes.

Aaronontheweb (Contributor, Author)

Removed the additional 10-messages-per-flush and 1000-messages-per-flush options - the run would have taken something like 45 minutes (the PerfCleanup call seems to take a long time to shut everything down in this spec, which I've noticed before in regular unit tests that clean up IEventLoops).

nayato merged commit 401d48f into Azure:dev on May 1, 2016