Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NBitcoin doesn't connect to the Bitcoin P2P network only on OSX under 8GB memory #345

Closed
nopara73 opened this issue Feb 22, 2018 · 38 comments

Comments

@nopara73
Copy link
Contributor

nopara73 commented Feb 22, 2018

Update. The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with or over 8GB.

I created a repository that reproduces this issue with a few lines of code in the Program.cs.
Repository: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug

I also added travis CI integration that shows it runs on Linux, but it doesn't run on OSX. (It runs on Windows, too.)

Travis builds: https://travis-ci.org/nopara73/NBitcoinBitcoinP2pOsxBug/builds/344928253

@lontivero
Copy link
Contributor

@NicolasDorier I would like to fix this but i need some help. The point is that i cannot configure the @nopara73's test application in order to get the logs, it seems System.Diagnostics classes cannot be configured in dotnet core 2.0 (take a look at this and this). Troubleshooting without logs is a bit hard, i am blind dealing with this.

Do you have any recomendation? I started writting another logging mechanism to replace System.Diagnostics classes but it doesn't look right. (Just as a side note: dotnet core doesn't provide a console listener neither - i can be worng)

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 23, 2018

As alternative strategies I have a few things I can think of:

  • Travis: 1. Fork NBitcoin, 2. Remove tests, 3. Add test for this 4. Add osx travis build to your forked repo.
  • Mac: If you have a mac then you can do it similarly, but without travis and with Visual Studio debugger. This would be ideal.

@NicolasDorier
Copy link
Collaborator

I think NBitcoin should use ILogger from .NETStandard. I am unsure when they started doing it.
This ILogger should be injected into the NodeConnectionParameters. You just need to make sure that builds which does not support ILogger still compile though.

@NicolasDorier
Copy link
Collaborator

I will take a look to improve logging on NBitcoin.

@NicolasDorier
Copy link
Collaborator

@nopara73 right now nbitcoin is using AppVeyor. I am not against adding travis support with running all tests on mac os as well.

@NicolasDorier
Copy link
Collaborator

Add a way to extend log right now...

@NicolasDorier
Copy link
Collaborator

@lontivero can you use this f1dcd9a I added an extension point for logs for environment without TraceSource.

I need to eventually drop TraceSource completely.

@nopara73
Copy link
Contributor Author

@dev0tion if I remember well you are using OSX. Did you ever experienced it?
@dangershony @bokobza @Aprogiena ? This is a critical problem, how come we never encountered this before? Even downgrading NBitcoin fails: nopara73/NBitcoinBitcoinP2pOsxBug#1

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 26, 2018

Oh as I see Stratis excludes these tests from CI. Is it possible it never even worked? https://github.com/stratisproject/StratisBitcoinFullNode/blob/master/build.sh

# exclude integration tests
if [[ "$testProject" == *"Integration.Tests"* ]] || [[ "$testProject" == *"IntegrationTests"* ]] || [[ "$testProject" == *"NBitcoin.Tests"* ]] ; then
    continue
fi

@NicolasDorier
Copy link
Collaborator

NicolasDorier commented Feb 27, 2018

@nopara73 I fixed a bug in the netcore version. You can now inject a logger through TraceSourceFactory.CreateTraceSourceFactory at the start of your app. Can you do this ?

By default it is this one, but you can make your own to print into console logs of NBitcoin.

class NullTraceSource : TraceSource
	{
		string n;
		public NullTraceSource(string n)
		{
			this.n = n;
		}
		public Switch Switch
		{
			get;
			set;
		} = new Switch();

		public void TraceEvent(TraceEventType traceEventType, int eventId, string msg, object[] args)
		{
			
		}

		public void TraceEvent(TraceEventType traceEventType, int eventId, string msg)
		{
			
		}

		public void TraceInformation(string msg)
		{
			
		}

		public void TraceTransfer(int eventId, string p2, Guid activity)
		{
			
		}
	}

@nopara73
Copy link
Contributor Author

Logging was @lontivero's request. @lontivero Can you progress with this logging? I am not sure it helps me.

I'm trying to go on another route, trying to get an OSX virtualmachine, if possible, so I'll be able to debug into the code.

@bokobza
Copy link

bokobza commented Feb 27, 2018

@nopara73 yeah we exclude integration tests from the builds on Travis as it's sometimes unreliable. Better run it on a real Mac and see what you get 😰

@knocte
Copy link
Contributor

knocte commented Feb 27, 2018

as it's sometimes unreliable

By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail). The ideal thing to do in this situation is finding which of the tests are flaky and disable them, not disable the whole test suite.

@nopara73
Copy link
Contributor Author

By "unreliable" I guess you mean that some tests are flaky (sometimes pass, sometimes fail).

Raw code should never fail. But code that communicates with a server can fail sometimes. Code that communicates with a peer to peer network is even more unreliable.

Anyway I got an OSX virtualmachine set up, so let's see.

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 27, 2018

TLDR: The AddressManager fails on OSX High Sierra under 8GB memory. It doesn't fail with 8GB.

@NicolasDorier @lontivero Just to further complicate the issue: it works fine with my virtual machine: OSX High Sierra, the same what the CI uses. (OS X 10.12.6) Any idea what's going on?

Update

I lowered the requirements to match exactly of the CI. And it failed this time. I'll try to repeat it see if I get consistent results and report back:
8GB memory -> 4GB memory
4 CPU -> 2 CPU

Increased the CPU count and memory back: it works again (tested 2 times.)
I decreased the CPU count and memory to CI levels and it failed again.

I think we can say it's consistent now. Let's figure out if it's CPU or memory.
I elevated the CPU count (without memory) and it failed.
I set back the CPU count to 2 and elevated the memory to 8GB and it worked.
I lowered the memory to 5GB and it failed.
6GB failed.
7GB failed.
8GB worked again.

@knocte
Copy link
Contributor

knocte commented Feb 27, 2018

Raw code should never fail.

I guess by Raw code you mean unit tests. If yes, I agree.

But code that communicates with a server

An integration test (instead of unit), which should be placed in a different test suite.

can fail sometimes

If the server is not part of the test fixture setup, yes. Ideally, integration tests should set up and bootstrap all the elements they will need themselves. When this doesn't happen, it's when flaky tests start to appear.

@nopara73
Copy link
Contributor Author

The summary of my testing of under what conditions this bug is present: OSX under 8GB memory.

@nopara73 nopara73 changed the title NBitcoin doesn't connect to the Bitcoin P2P network only on OSX NBitcoin doesn't connect to the Bitcoin P2P network only on OSX under 8GB memory Feb 27, 2018
@NicolasDorier
Copy link
Collaborator

what @nopara73 could you check logs ?

@nopara73
Copy link
Contributor Author

@NicolasDorier I cannot find the logs. Where should I look for them?

@NicolasDorier
Copy link
Collaborator

inject TraceSourceFactory.CreateTraceSourceFactory with an instance of Func<TraceSource>. (see class NullTraceSource to see an empty implementation)

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 27, 2018

Meh, I don't get DI. Could you make a PR? I guess it's a one liner: https://github.com/nopara73/NBitcoinBitcoinP2pOsxBug/blob/master/NBitcoinP2pOsxBug/Program.cs

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 27, 2018

@NicolasDorier @lontivero Here you go I've found the problem, not sure about the solution yet.

An exception is thrown here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L561

var node = Node.Connect(network, addr.Endpoint, param2);

Which is catched and swallowed here: https://github.com/MetacoSA/NBitcoin/blob/master/NBitcoin/Protocol/Node.cs#L569-L572

catch(SocketException)
{
	parameters.ConnectCancellation.WaitHandle.WaitOne(500);
}

With this, a node.Connect() function never returns.

The exception is: "no socket buffer space available."

@knocte
Copy link
Contributor

knocte commented Feb 27, 2018

It seems that this catch block is swallowing too much. The SocketException class has an ErrorCode property that can make error handling more specific. The commit that introduced the call to WaitOne(500) should have made sure it only did it for the specific errorCode that it was expecting, instead of all error codes. Any chance this errorCode can be reproduced again?

@nopara73
Copy link
Contributor Author

nopara73 commented Feb 27, 2018

@knocte The full exception, this exception is thrown by a million times:

System.Net.Sockets.SocketException (0x80004005): No buffer space available
   at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
   at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue, Boolean silent)
   at System.Net.Sockets.Socket.SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, Int32 optionValue)
   at System.Net.Sockets.Socket.set_ReceiveBufferSize(Int32 value)
   at NBitcoin.Protocol.Node..ctor(NetworkAddress peer, Network network, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 667
   at NBitcoin.Protocol.Node.Connect(Network network, IPEndPoint endpoint, NodeConnectionParameters parameters) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 634
   at NBitcoin.Protocol.Node.Connect(Network network, NodeConnectionParameters parameters, IPEndPoint[] connectedEndpoints, Func`2 getGroup) in /Users/user/Documents/NBitcoinBitcoinP2pOsxBug/NBitcoin/Protocol/Node.cs:line 561

@nopara73
Copy link
Contributor Author

@wintercooled tried it on 4GB Sierra High and it worked for him. Maybe this problem is specific for virtual machines?

@lontivero
Copy link
Contributor

In my short experience with OSX i could see that it imposes more strict limits in many aspects. I had to deal with ulimit (max opened files) days ago. In this case the problem seems to be the sockets send/receive buffers.

I've changed hardcoded smaller buffer size values and the @nopara73 test now pass green in travis.

I must make the test with the appropiate NodeConnectionParameters values and try again. In case it works (it has to) i will let you know then, @NicolasDorier can dicatate the way to go (probably reducing the default values in NodeConnectionParameters

@lontivero
Copy link
Contributor

I can make the tests pass simply changing the NodeConnectionParameters parameters (below my change):

-             var connectionParameters = new NodeConnectionParameters();
+            var connectionParameters = new NodeConnectionParameters()
+            {
+                ReceiveBufferSize = 100 * 5000,
+                SendBufferSize = 100 * 1000
+            };

The article Mac OSX Tuning says that

Starting with OSX 10.5, Apple added "self tuning TCP", and increased the default maximum socket buffer to 4MB

The article also says that the default new values for Settings for OSX Yosemite/Mavericks/Sierra are:

net.inet.tcp.win_scale_factor: 3 (or 5 in newer versions)
net.inet.tcp.autorcvbufmax: 1048576 (or 2097152 in newer versions )
net.inet.tcp.autosndbufmax: 1048576

in NBitcoin, the default values are hardcoded as follow:

ReceiveBufferSize = 1000 * 5000;  
SendBufferSize = 1000 * 1000;

IMO, even when NBitcoin works perfectly well and can be parametrized easily, I think those defaults values should be modify in a way that osx programmers don't have to deal with this. I'm not sure what the best strategy is (changing the values to fit the known max limits or let the default values or other)

This is not a blocker for us.

@knocte
Copy link
Contributor

knocte commented Feb 28, 2018

I think those defaults values should be modify in a way that osx programmers don't have to deal with this

This is a good suggestion.
Either way, the catch block should be modified to rethrow this exception in this case, so that it stops failing silently and the user can adjust these values manually in case he needs to, to make it work.

@NicolasDorier
Copy link
Collaborator

@lontivero can you check that setting both to 1048576 works fine?
Also is there an exception thrown if it is above ?

@NicolasDorier
Copy link
Collaborator

Made a commit which should fix it, can you try just to be sure ?

@NicolasDorier
Copy link
Collaborator

Can you also check if MacOS is throwing exception if it is above that I can catch?

@NicolasDorier
Copy link
Collaborator

@lontivero last question, can you also try

socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);

with size being above the threshold?

@lontivero
Copy link
Contributor

@NicolasDorier it works with the the documented max values. Here you can see my commit and the travis CI build is still green

About you last question, I am working with linux right now (this env is new for me) and i don't have ILSpy but I am 99.99% sure that socket.ReceiveBufferSize = an_int_value is the same that socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.SendBuffer, size);

@NicolasDorier
Copy link
Collaborator

I am wondering if an exception is thrown so I can attempt to cap the value and retry at Node.cs level in case it is higher.

@lontivero
Copy link
Contributor

I have no way to test it. I just believe that what @nopara73 said it accurate and it is swallowing the exceptions.

And yes, I was right, ReceiveBufferSize is implemented using SetSocketOption so, there is no need to test it. Here the source: https://referencesource.microsoft.com/#System/net/System/Net/Sockets/Socket.cs,635

@NicolasDorier
Copy link
Collaborator

Thanks for your testing, I am publishing new NBitcoin version now.

@NicolasDorier
Copy link
Collaborator

Ok pushed 4.0.0.57 should be available in 10-60 minutes on nuget.

@NicolasDorier
Copy link
Collaborator

Closing this, reopen if problem persist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants