Permalink
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
431 lines (286 sloc) 10.9 KB
<!DOCTYPE html>
<html>
<head>
<title>Title</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Abel);
@import url(https://fonts.googleapis.com/css?family=Ubuntu);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'sans-serif'; background-color: black;}
h1, h2, h3 {
font-family: 'Abel';
font-weight: normal;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
.small-caps { font-variant: small-caps }
.small { font-size: 70%; }
/* Thanks to http://www.partage-it.com/animez-vos-presentations-remark-js/ (in French) */
.remark-fading { z-index: 9; }
.remark-slide-container {transition: opacity 0.5s ease-out; opacity: 0;}
.remark-visible {transition: opacity 0.5s ease-out; opacity: 1;}
.ably > img { width: 80%; }
.lmars > img { width: 40%; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Heisenbug
*/ˈhaɪzənˌbʌɡ/ (n)* A bug that disappears or alters its behavior when one attempts to probe or isolate it.
*- The Jargon File*
## Simon Woolf .small[&nbsp;&nbsp; simon@ably.io &nbsp;&nbsp; github.com/simonwoolf]
--
--
???
Hi. I'm Simon, and this is a talk about a bug.
---
# .small-caps[It started] with a 500
```sh
$ curl http://eu-west-1-a-dev-translator-realtime.ably.io/nonexistantpath -v
< HTTP/1.1 500 Internal Server Error
* Server Cowboy is not blacklisted
< server: Cowboy
```
???
It started with a 500 error when getting an unknown path, on a staging server. Which shouldn't happen; I have a catchall cowboy handler that serves a nice 404.
First step: check the logs for that machine. Here's the logfile:
---
# The logfile:
```sh
```
???
Uh. It's empty.
This shouldn't be possible. Anything which causes a 500 usually prints an error log and a crash trace or two.
Maybe there's some sort of stdout buffering that only happens in production mode?
Maybe if I do the same thing a bunch more times I'll get something. And:
---
# woah
```erlang
=INFO REPORT====
application: logger
exited: shutdown
type: permanent
=ERROR REPORT====
Error in process <0.438.0> with exit value:
{[{reason,undef},
{mfa,{'Elixir.AppName.Web.CatchAll',handle,2}},
{stacktrace,
[{'Elixir.Inspect.List',inspect,
[[],
#{'__struct__' => 'Elixir.Inspect.Opts',
base => decimal,
binaries => infer,
char_lists => infer,
limit => 50,
pretty => false,
safe => true,
structs => true,
width => 80}],
[]},
{'Elixir.Kernel',inspect,2,[{file,"lib/kernel.ex"},{line,1566}]},
{'Elixir.AppName.Web.CatchAll',handle,2,
[{file,"lib/web/catch_all.ex"},{line,11}]},
{cowboy_handler,handler_handle,4,
[{file,"src/cowboy_handler.erl"},{line,111}]},
{cowboy_protocol,execute,4,
[{file,"src/cowboy_protocol.erl"},{line,442}]}]},
{req,
[{socket,#Port<0.8619>},
{transport,ranch_tcp},
{connection,keepalive},
{pid,<0.438.0>},
{method,<<"GET">>},
{version,'HTTP/1.1'},
{peer,{{172,17,0,1},40454}},
{host,<<"172.17.0.3">>},
{host_info,undefined},
{port,6100},
{path,<<"/badpagr">>},
{path_info,[<<"badpagr">>]},
{qs,<<>>},
{qs_vals,undefined},
{bindings,[]},
{headers,
[{<<"user-agent">>,<<"curl/7.35.0">>},
{<<"host">>,<<"172.17.0.3:6100">>},
{<<"accept">>,<<"*/*">>}]},
{p_headers,[]},
{cookies,undefined},
{meta,[]},
{body_state,waiting},
{buffer,<<>>},
{multipart,undefined},
{resp_compress,false},
{resp_state,waiting},
{resp_headers,[]},
{resp_body,<<>>},
{onresponse,undefined}]},
{state,nil}],
[{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}
```
???
Success! ...sortof.
What's going on here?
"application: logger,
exited: shutdown".
the logger's crashed!
I guess that explains why logging wasn't working too well.
OK, so, the module it crashed in my cowboy 404 handler.
And the reason is:
"undef" - a function's undefined, which was "inspect"
being called with two arguments, an empty list -- the thing being inspected - and some options.
wat?
Let's have a look at the line it crashed on:
---
# the offending code (?)
```elixir
Logger.warn "Unknown path: #{:cowboy_req.path(request) |> elem(0)},
params: #{:cowboy_req.qs_vals(request) |> elem(0) |> inspect}"
```
???
Here it is. The stacktrace pointed to the second line, so the
expression that crashed was:
--
```elixir
:cowboy_req.qs_vals(request) |> elem(0) |> inspect`
```
???
which for this request was equivalent to
--
```elixir
inspect([])
```
???
inspecting an empty list.
What's going on here? Inspect generally works, the things I log on startup inspect stuff. Just... not lists?
And only happening in production, too (well - a staging cluster, not really production, but as far a mix is concerned, it's production). Never seen it on my local machine, never seen it in ci.
So what's different between prod and locally? I think I remember hearing someone talking about elixir 'consolidating protocols' in production. I don't really know what that means, but I think Inspect is a protocol, right? so lets try compiling and running locally with MIX_ENV=prod, and...
---
class: center, middle
# 😞
???
No luck.
OK, time to get serious!
---
class: center, middle
# 😠
???
(that's me being serious)
Over some time and with help from the Elixir community, we worked out that:
---
# Debugging results
```elixir
:erlang.function_exported(Elixir.Inspect.List, :inspect, 2)
# => true locally
# => false in production
```
???
Elixir.Inspect.List doesn't seem to be available as a function, but:
--
- The `Elixir.Inspect.List.beam` file is present and correct
???
The `Elixir.Inspect.List.beam` file is there in the slug -- I downloaded the slug and checked -- and it seems to have been compiled correctly. And
---
# Debugging results
- Sometimes, `Code.ensure_loaded?(Elixir.Inspect.List)` returns `true`. In this case:
- Subsequent `function_exported(Elixir.Inspect.List, :inspect, 2)` calls return `true`
- all subsequent `inspect` calls work correctly (whether List, Map, whatever). (After a successful `inspect` of a type, the relevant `function_exported` calls start returning true)
- Sometimes, `Code.ensure_loaded?(Elixir.Inspect.List)` returns `false`. In this case:
- all `function_exported(Elixir.Inspect.List, :inspect, 2)` calls continue to return `false`
- all subsequent `inspect` calls fail (or if done by Logger, crash logger)
???
Code.ensure_loaded -- which loads a module if it isn't already -- just seems to behave completely randomly. And to be clear, this isn't just between builds, it's between deploys of a single build -- the same slug, the same compiled artifacts.
---
# Debugging results
???
[IF RUNNING OUT OF TIME]
I'm running out of time, so I'll skip some more debugging blind paths -- the short version is it seemed to depend on how soon after app startup inspect with an array was first called. And get to:
[ELSE]
After some more experimentation, it looked like this problem was only ever happening if the very first call to `inspect` for some datatype was done *from a cowboy handler process*.
So I could mitigate the problem by calling this:
--
```elixir
Logger.debug "Some data structures: \
#{inspect {%{}, [], MapSet.new, 0.1, 1..2, ~r//, :a, <<1>>}}"
```
???
on app startup -- inspecting every datatype I could think of -- so that the first call to `inspect` with an array didn't happen in a cowboy handler. Which seemed to indicate that this was a bug in Erlang's code loader, which lead to all sorts of scary labels:
--
![scary labels](./scary-labels.png)
???
on a github Elixir issue.
Eventually worked out that this was a bit of a red herring -- the reason that the code being in a cowboy handler made a difference was just because if it ran on app startup, it was running earlier. Being in a cowboy handler was irrelevant.
I don't want to keep you all in suspense, so I'll jump to:
---
# The answer
???
The answer!
---
# The answer (tldr: it was our fault)
???
Firstly I should say it wasn't me who solved it -- this guy:
---
# The answer
.lmars[![Lewis Marshall](./lmars.jpg)]
### Lewis Marshall, who found the answer
???
Lewis, our head of infrastructure, who's awesome, solved it.
A quick note about our deployment process:
we were mounting complied slugs with overlayfs, which means any filesystem writes by the application don't change the actual slug. The upper (mutable) layer was in /tmp (in a directory unique to each docker container instance).
Unrelatedly, we had a maintenance script that runs *immediately after a deploy completes*, that among other things:
---
# The answer
```ruby
if File.exists?('/tmp')
logger.info 'Cleaning up temp files older than 2 days'
warn 'Could not delete old temp files' unless system(
'sudo find /tmp -type f -mtime +2 -exec rm -f {} \;'
)
end
```
???
cleans up any files in /tmp with an mtime (last-modified time) of older than 2 day ago.
You see where this is going.
Since the things in /tmp are overlayfs upper directories, this cleanup script didn't change the actual slug, so whenever I looked at that, it was all correct. But it did mark files as 'deleted' in overlayfs, so they were hidden from the docker container.
Result: sadness, depending mostly on whether a given BEAM file had already been loaded by the erlang vm at the time the cleanup script ran.
(So the problem only showed up with inspect because all the other files and protocols were loaded and exercised earlier on, on app startup -- but, as it happens, not Inspect.List).
---
# Solution
--
```diff
def slug_volume(slug_path)
# use an overlayfs mount so that the container cannot change files in the
# underlying slug directory
- tmp = Dir.mktmpdir("slug-")
- run "sudo mount -t overlayfs \
- -o lowerdir=#{slug_path},upperdir=#{tmp} overlayfs #{tmp}"
+ FileUtils.mkdir_p "/slugs/mnt"
+ mnt = Dir.mktmpdir("slug-", "/slugs/mnt")
+ run "sudo mount -t overlayfs \
+ -o lowerdir=#{slug_path},upperdir=#{mnt} overlayfs #{mnt}"
{
- host: tmp,
+ host: mnt,
guest: "/app"
}
end
```
???
Solution: move the overlayfs upper directories out of /tmp.
---
class: center, middle
# Thank you
### Simon Woolf
### simon@ably.io &nbsp;&nbsp;&nbsp; github.com/simonwoolf
.ably[![Default-aligned image](./ably-banner.png)]
</textarea>
<script src="remark-latest.min.js">
</script>
<script>
var slideshow = remark.create({highlightStyle: 'github'});
</script>
</body>
</html>