
Hackney checkout_failure which was working before 1.16.0 #689

Open · SoniCoder opened this issue Jun 14, 2021 · 15 comments

SoniCoder commented Jun 14, 2021

2021-06-14 13:47:38.633 [warning] <0.1002.0>@butler_client_hackney:request:121 HTTP Request unsuccessful: Reason checkout_failure, Retrying after 4000 Request: {"http://192.168.5.107:8080/api-gateway/auth-service/platform-auth/oauth/token",[{"Authorization","Basic ABCD"},{"Cache-Control","no-cache"}],"application/x-www-form-urlencoded","grant_type=client_credentials&client_id=butlerx&client_secret=butlerx"} Method: post

This error goes away when I use: {use_default_pool, false}
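
For reference, a minimal sketch of that workaround, assuming it is applied through hackney's application environment at startup (the equivalent entry could also live in sys.config):

%% Sketch: disable hackney's default pool application-wide.
ok = application:set_env(hackney, use_default_pool, false).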

benoitc (Owner) commented Jun 16, 2021

Did you try the latest master?

SoniCoder (Author) commented Jun 17, 2021 via email

benoitc (Owner) commented Jun 17, 2021

Can you send a trace of the request? Can you share a snippet? Also, how many connections are in the pool?

mattbaker commented Jul 7, 2021

We started seeing this too with a connection pool size set to 400. We've been running f876781 since it was merged into master.

Unfortunately I don't have much useful information to share yet, but if we come up with something useful I will post it here. So far the only interesting thing was noticing that the message queue of the hackney_pool process started growing quickly at a linear rate, so something was stopping it from responding to incoming messages. We saw memory usage on the node increase as well, but that might just be the message queue growing.

[image: y axis is message queue length, x is time]

The only other piece of vague data we have is that this might correlate with very slow hosts responding to us. It's not consistent, though: it only happens in select cases, and not on all nodes.

If we learn more we will let you know!

Thanks for all your hard work on this library @benoitc, I know being a sole maintainer is a very hard job, hopefully I will have something helpful after we try to reproduce the issue.
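
One quick way to spot this kind of mailbox growth from a remote shell, assuming recon is available on the node, is to list the processes with the largest message queues:

1> recon:proc_count(message_queue_len, 5).

If the hackney_pool process (or the hackney_connections process mentioned later in this thread) shows up near the top, it is falling behind on incoming messages.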

@SoniCoder (Author)

I was facing this issue with the default hackney settings

@mattbaker

Yeah, unfortunately we need the connection pool.

benoitc (Owner) commented Jul 7, 2021

@mattbaker Can you share a trace and a snippet of the code doing the request?

@mattbaker

Yeah, I'll work on finding an example; there's some abstraction that makes a simple copy-paste example hard.

leonardb (Contributor) commented Sep 21, 2021

@SoniCoder @mattbaker If you've not already solved this, you may want to check whether you have somehow pinned an older version of unicode_util_compat which does not contain the string_compat module. Unfortunately, xref on your top-level project won't find that.

I was running into exactly this issue when upgrading hackney on a project, and after debugging I discovered I had a manual include of an older version, which was causing exactly this checkout_failure.

I temporarily added an extra trace point in the try/catch in hackney_pool:checkout/4, and the stack trace showed the missing module.

[hackney trace 60 <0.2606.0> 2021:09:21 15:44:07 4274] checkout failure 
   Content: [{module,hackney_pool},
             {line,71},
             {error,undef,
                 [{string_compat,to_lower,["google.com"],[]},
                  {hackney_connection,new,1,
                      [{file,
                           "....hackney_connection.erl"},
                       {line,32}]},
                  {hackney_pool,do_checkout,5,
                      [{file,
                           "....hackney_pool.erl"},
                       {line,97}]},
                  {hackney_pool,'-checkout/4-fun-0-',6,
                      [{file,
                           "....hackney_pool.erl"},
                       {line,69}]}]}]
[hackney trace 80 <0.2551.0> 2021:09:21 15:44:07 4274] connect error ~p 
   Content: [{module,hackney_connect},{line,233},{error,checkout_failure}]
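
A quick sanity check from a shell on the running node (just a sketch; per the above, string_compat is expected to come from unicode_util_compat):

1> code:which(string_compat).

If this returns non_existing, the module hackney needs is missing from the code path and checkout will fail as shown in the trace.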

@mattbaker

I don't :/ Thanks though! We ended up switching to Finch

mjaow commented Nov 11, 2021

We also saw checkout_failure when traffic goes through a proxy.

wizardone commented Jan 18, 2022

Just bumping this up. I started encountering this after an Elixir upgrade.
Currently running on Elixir 1.10.4 and using httpoison 1.7.0 and hackney 1.17.4.
I found some vague options online and tried playing with downgrading/upgrading versions, but could not get it to work.
P.S. To clarify: I'm using httpoison, but I would assume the error bubbles up from hackney.

mruoss commented Mar 13, 2022

Seeing this now after upgrading to OTP 24.3. Maybe related to erlang/otp#5783?

ankhers commented Oct 26, 2022

I have been looking into this for the last couple of weeks for a project at work. We are using 1.18.1, which is currently the latest release on Hex.

The issue seems to be related to the hackney_connections gen_server. I have measured the mailbox size (erlang:process_info(Pid, message_queue_len)) at over 500k messages. Killing the process and having it restart seems to temporarily "fix" the issue until it comes back again.

I can start a new ticket if needed, but can you, @benoitc, describe what exactly this process is meant for? I can't seem to find any reference to the id elsewhere in the code other than creating and setting it in the #connection record. Is this something planned for later? As it stands, this happens before the pool gets involved, so this gen_server acts as a bottleneck for every request that attempts to go through hackney.

We are compiling against OTP 25.1.2.

I would be more than happy to run any tests in order to help find and fix this issue. The problem is that I do not know how to trigger the issue.
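
For reference, a sketch of the temporary workaround described above, run from a remote shell; it assumes the process is registered locally as hackney_connections and that its supervisor will restart it:

1> exit(erlang:whereis(hackney_connections), kill).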

ankhers commented Oct 28, 2022

I believe we fixed our issue.

We had a function that looked something like this.

foo(A, B) ->
  case parse_string(A, B) of
    {ok, Val} -> Val;
    {error, {_, reason}} ->
      hackney:request ...
      
      parse_string(?DEFAULT_VALUE, B)
  end.

Essentially, when the function returned the error variant, we were not matching on the specific error we actually cared about before retrying. So when we received a different error, we ended up calling this function infinitely, and because of that we were sending an unbounded number of requests to hackney. Once multiple processes were calling this function, it filled up the hackney_connections mailbox.
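
For illustration, a hedged sketch of the kind of fix this implies; the error atom and refresh URL are hypothetical, and the point is only that the retry clause matches the one error worth retrying while everything else is returned to the caller:

foo(A, B) ->
  case parse_string(A, B) of
    {ok, Val} ->
      Val;
    {error, {_, expired_token}} ->
      %% hypothetical: refresh via hackney, then retry once with the default value
      {ok, _Status, _Headers, _Ref} =
        hackney:request(post, <<"http://example.com/refresh">>, [], <<>>, []),
      parse_string(?DEFAULT_VALUE, B);
    {error, _Other} = Error ->
      %% any other error is returned to the caller instead of looping
      Error
  end.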


For anyone who may be having this issue, the following is how I was able to find out where it was coming from.

To start, I needed to crawl through hackney's code to figure out what was actually causing the checkout_failure. In my case, I found out it was because the hackney_connections gen_server was timing out. After this, I took a look at that process's mailbox using the following code in a remote shell on the node that was having issues.

1> Pid = erlang:whereis(hackney_connections).
2> erlang:process_info(Pid, message_queue_len).

Here I found out the process had over 500k messages in the mailbox. So I took a look into the mailbox to see what those messages were.

3> {messages, Messages} = erlang:process_info(Pid, messages).

That dumped out the mailbox at that particular point in time. I then used something like the following (this may not be exact; I don't have the code on my machine and may be getting the format slightly wrong):

4> Domains = lists:foldl(
     fun({'$gen_call', _, {lookup, Domain, _, _}}, Acc) ->
           Count = maps:get(Domain, Acc, 0),
           maps:put(Domain, Count + 1, Acc);
        (_Other, Acc) ->
           Acc
     end, #{}, Messages).

That gave me a map with the domains as keys and the number of requests in the mailbox for each domain as values.

At this point I had an idea of where the requests were going.

So I wrote a little gen_server that constantly checked the hackney_connections message queue length, and when it went over 1000 messages, it triggered recon_trace:calls({hackney, request, return_trace}, 100). That showed me what was being sent through hackney, and it was fairly trivial to find where in our code the requests were coming from, which is where we noticed the recursive function mentioned above.
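
For completeness, a minimal sketch of such a watchdog; the module name, threshold, and poll interval are made up, it assumes recon is available on the node, and it traces hackney:request with a plain '_' match rather than a return trace to keep the example simple:

-module(hackney_mailbox_watch).  %% hypothetical module name
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(THRESHOLD, 1000).  %% messages
-define(INTERVAL, 1000).   %% ms between checks

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    erlang:send_after(?INTERVAL, self(), check),
    {ok, #{tracing => false}}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

handle_info(check, State = #{tracing := Tracing}) ->
    NewState =
        case erlang:whereis(hackney_connections) of
            undefined ->
                State;
            Pid ->
                case erlang:process_info(Pid, message_queue_len) of
                    {message_queue_len, Len} when Len > ?THRESHOLD, not Tracing ->
                        %% mailbox is backed up: trace the next 100 hackney:request calls
                        recon_trace:calls({hackney, request, '_'}, 100),
                        State#{tracing := true};
                    _ ->
                        State
                end
        end,
    erlang:send_after(?INTERVAL, self(), check),
    {noreply, NewState}.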

I hope this can help someone else find issues in the future.
