Prevent analytics backlog from blowing up system memory #233

Closed · GUI opened this issue May 22, 2015 · 2 comments

Comments

@GUI
Member

GUI commented May 22, 2015

We hit a situation earlier today where analytics logging stopped (which is really a separate issue that ideally shouldn't have happened). When this occurs, we begin queuing the analytics log data for future processing, so we don't actually lose any analytics data. The downside of this queueing approach is that it happens in Redis on the web servers, which means it's all kept in memory, so Redis's memory begins to grow indefinitely. Aside from the unbounded memory growth, which is bad on its own, this also eventually degrades overall system performance, mainly due to Redis's disk flushing: once the in-memory dataset gets big enough, the periodic flushing to disk starts to hurt the system simply in terms of disk I/O and the amount of data that needs to be written.

I believe I've gotten things behaving again for tonight by getting the analytics database back in action, so the queue can continue processing, and also by disabling disk flushing on Redis for now, which greatly reduces the disk churn and system load (and lets us actually make some headway on the backlog). However, there are some underlying issues here with how we buffer the analytics data in the event of failures that would ideally be revisited. I have some thoughts on better, simpler approaches than what we're doing now; I mainly just wanted to get this issue filed before forgetting.
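For reference, the "disable disk flushing" part of that mitigation amounts to something along these lines (a sketch only; the exact persistence settings on our servers aren't captured in this issue, and this should be reverted once the backlog is drained):

    # Stop RDB snapshotting at runtime so the growing queue isn't
    # repeatedly rewritten to disk:
    redis-cli CONFIG SET save ""
    # If AOF persistence is enabled, it can be turned off the same way:
    redis-cli CONFIG SET appendonly no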

@GUI
Member Author

GUI commented May 27, 2015

I had to do a few other things to get things back in action (see #234), but the fact that log data gets queued up in memory remains a more fundamental issue that would be good to address.

In terms of solutions, I've actually prototyped something using Heka as our log processor. Heka largely plays the same role as our current log processor, except it's more generic and should be much better in nearly all respects. We can send log data to it, and it will forward it in batches to ElasticSearch (see the ElasticSearch output). It supports retries and, importantly, it also buffers to disk (rather than memory) in the event of failures. It should also be more performant. Heka is similar in scope to something like LogStash or fluentd, but there are a couple of reasons I'd prefer Heka: it's lighter weight and it allows for customization via Lua scripts, which seems quite handy for where we're possibly headed.
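To sketch what I mean (an illustrative config only; the server address, message matcher, index name, and buffer sizes below are placeholders, and option names may vary slightly between Heka versions), the ElasticSearch output with disk-backed buffering would look roughly like:

    [ESJsonEncoder]
    es_index_from_timestamp = true
    index = "api-umbrella-logs-%{%Y.%m.%d}"

    [ElasticSearchOutput]
    message_matcher = "Type == 'log'"
    server = "http://127.0.0.1:9200"
    encoder = "ESJsonEncoder"
    flush_interval = 1000
    flush_count = 100
    use_buffering = true

    [ElasticSearchOutput.buffering]
    max_file_size = 536870912    # 512 MB per buffer file on disk
    max_buffer_size = 4294967296 # 4 GB total disk buffer
    full_action = "block"        # apply backpressure instead of dropping

The key point is the buffering section: if ElasticSearch is down, the backlog accumulates in bounded files on disk rather than growing without limit in Redis memory.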

This Heka prototype is currently implemented in the experimental Lua/OpenResty branch of API Umbrella (see NREL/api-umbrella#86 for context, and NREL/api-umbrella-gatekeeper@04e5673 for the logging implementation). How we currently log analytics is unfortunately one of the nastier and more over-complicated bits of code we have, so I mainly implemented this Heka experiment on the Lua fork, since it offered a much cleaner fresh start (and I hadn't implemented logging in that branch yet). There are several characteristics of the experimental Lua branch that make this much easier, but if need be, we could probably port this Heka implementation to our current stack if the Lua branch doesn't pan out soon enough (although I'm hopeful the Lua branch might actually become a feasible option sometime soon).

@GUI
Member Author

GUI commented Nov 27, 2015

Fixed by the Lua rollout.

GUI closed this as completed Nov 27, 2015