Memory Leak #165
#80 is so last week, #165 is the new crippling crit. This is what's causing the server to go down from time to time. The RAM fills up very slowly; you can watch it in htop increasing bit by bit. Once it hits about 4GB, #80 kicks in and crashes the server. It's racer-db-mongo promises not being fulfilled. I don't know if that's due to Mongo overload causing stray promise events, or the server being too slow (from the memory build-up) to fulfill the Mongo request. But it's actually a "good thing" that #80 crashes the server at that point, because then it restarts and we have fresh resources. (Obviously we need to fix it, but just sayin').
When you start Habit in production mode, you get:
Warning: connection.session() MemoryStore is not
designed for a production environment, as it will leak
memory, and will not scale past a single process.
Which is weird because, as you can see in server/index.coffee, we're not using MemoryStore but MongoStore. I'll explore this as a culprit first in any case.
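For reference, here's a hypothetical sketch (not our actual server/index.coffee) of what wiring up MongoStore instead of the default MemoryStore looks like with Express 3 and connect-mongo; the package versions, options, and Mongo URL are assumptions, so check them against package.json:

```javascript
// Sketch only: session middleware backed by Mongo rather than MemoryStore.
// The secret and URL below are placeholders, not our production values.
var express = require('express');
var MongoStore = require('connect-mongo')(express);

var app = express();
app.use(express.cookieParser());
app.use(express.session({
  secret: 'keyboard cat',   // placeholder
  store: new MongoStore({ url: 'mongodb://localhost/habitrpg-sessions' })
}));
```

The point of the swap: MemoryStore keeps every session in process memory forever, which is itself a slow leak under real traffic.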
My progress was somehow completely wiped out, and now I have to start over again. I'm not sure if this counts as a memory leak though.
I think I got it guys! lefnire/derby-auth@8aff1c7 . I'll keep an eye on it and re-open if it's an issue again
There's still a leak somewhere. It's not as horrific as the one that was fixed, but still requires we restart the server periodically, which then triggers #413
Is there no way to run the test server in some kind of debug mode or profiler? It would be good to test this with a structured approach and identify all the stuff once and for all. I think we can easily get enough people on there to make it feel real to the server. The question is if something like that exists :x
At the risk of sounding stupid: after quickly googling for a JavaScript profiler I found this? Might not be applicable to a server though :-/ But I guess it would be worth looking into something like this?
I know one of @lefnire's priorities is a testing suite, and although you did some good research, Google Chrome essentially has Firebug built in. (Though the last time I used Firebug was years ago and it didn't have a profiler, which I'm fairly sure Chrome doesn't have either, so I could be talking nonsense.)
Firebug / the Chrome console will help us find issues on the client, but our issue is on the server. Two excellent approaches for profiling server code are node-webkit-agent and node-memwatch. We may also want to look into Nodetime, a commercial profiler we'd set up on the live server. These are cool tools, but to be frank I couldn't get them to tell me where our problem is. I think I need to work with them some more to figure out how to read their output properly.
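In the meantime, a crude stdlib-only check (not one of the profilers above, just `process.memoryUsage()`) can at least tell you whether a suspect code path retains memory: sample the heap before and after hammering it, and compare against a known-clean path. The `leaky`/`clean` functions below are contrived stand-ins, not HabitRPG code:

```javascript
// Measure heap growth (in MB) caused by running fn `iterations` times.
// Running node with --expose-gc makes the numbers much more stable.
function heapGrowthMB(fn, iterations) {
  if (global.gc) global.gc();
  var before = process.memoryUsage().heapUsed;
  for (var i = 0; i < iterations; i++) fn();
  if (global.gc) global.gc();
  return (process.memoryUsage().heapUsed - before) / (1024 * 1024);
}

// Contrived examples: `leaky` retains every string in a long-lived array,
// `clean` lets its string become garbage immediately.
var retained = [];
function leaky() { retained.push(new Array(1000).join('x')); }
function clean() { var s = new Array(1000).join('x'); return s.length; }

console.log('leaky grows more than clean:',
  heapGrowthMB(leaky, 20000) > heapGrowthMB(clean, 20000)); // expected: true
```

It won't tell you *which* objects leak (that's what the heap-snapshot diff is for), but it's a quick yes/no on a given route or function.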
The only thing I can think of is the party query subscriptions, but even when I run profiling locally in a solo group (in which case party subscriptions are bypassed), I see a leak. I'm playing with these profilers to figure it out, but I'm not sure what to do with them - they tell me things like "String objects take up 1GB RAM" - ok... strings... thanks.
The other thing is that Racer uses MemoryStore session middleware, which in fact throws a warning on launch about a memory leak. It ties in closely with racer, so it might be difficult to gut their session store in favor of another. I hit errors just trying to replace it with a Redis store or Mongo store, but it might be worth another shot.
I'm kinda hazy on this, but I know Android runs a garbage collector in the background that hunts down unused variables and frees up memory. It's possible that what the memory profiler is trying to tell you is that string variables are being created and not erased. You either need to create your own garbage cleaner (which has a danger of nuking user data) or ensure that each and every variable you have is gracefully destroyed after being used.
Do you not have something that records variables being created as you're operating? Maybe from there you could identify a string variable group that is getting out of control.
If you want me to give an example of the above then give me a shout.
thanks @jasonmanners for all the help looking into these issues btw!
@jasonmanners is onto something:
It's not a bug or anything; it's how we are accessing the userId.
In my tests, even without auth, if I even access model.session.userId I would get a leak.
I also noticed a piece of code in the racer session that inits a _userId variable on the model, which grabs it from session.userId or session.auth.userId.
So instead of grabbing the userId with model.session.userId, grab it with model.get('_userId').
This may not solve the issue, but I think accessing model.session in the client routes causes a leak; this is just speculation at the moment.
I would say test it out using model.get('_userId') and see if you get the same application behavior while checking on the memory usage.
It may not solve it completely, but it may help. Here is a gist with the change I am talking about: https://gist.github.com/jasonmanners/d079ff582df691fb0f06
Let me know if that works; it's late and this just came to me, so I may be missing something. But in my tests it does not leak, so here's to hoping.
Edit: After reading through your code on habitrpg it has occurred to me that you may already know this.
That being said, even the or calls that include model.session.userId in app/party may cause this. Not sure, but something to consider.
I'll give changing that a shot, though it may be difficult. store accessControl callbacks require this.session to perform their checks. Is it just on the client routes that have these issues (aka, get('/') within app/index.js), and not in other locations (standard express routes, passport callbacks, store access callbacks)?
So I will put this on GitHub after a few more tests, but it does seem that accessing model.session.userId in any route, server side or client side, does lead to a leak.
That being said you can get the userId in both places using model.get('_userId'), on the server it would be req.getModel().get('_userId').
Using session.userId in the accessControl callbacks does not cause a leak. It seems that accessing the session within the store (this.session) is fine; it's only when you access model.session.userId that the problem occurs, from my very rough tests. Still investigating though.
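To make the access-pattern change concrete, here's a toy stub (NOT racer's real internals, just the shape described above, where racer's session code copies session.userId or session.auth.userId onto the model as `_userId`):

```javascript
// StubModel mirrors the two access paths discussed above; the field names
// _userId / session.auth.userId come from the comments, the rest is made up.
function StubModel(session) {
  this.session = session;   // model.session.userId: the path reported to leak
  this._data = {
    _userId: session.userId || (session.auth && session.auth.userId)
  };
}
StubModel.prototype.get = function (path) { return this._data[path]; };

var model = new StubModel({ auth: { userId: 'abc123' }, userId: 'abc123' });

// Before (reported to leak): model.session.userId
// After (suggested fix):     model.get('_userId')
// Server side:               req.getModel().get('_userId')
console.log(model.get('_userId')); // 'abc123'
```

Both paths return the same value; the claim is only that reading through `model.session` keeps something alive that `model.get('_userId')` does not.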
Looking into https://github.com/lefnire/derby-auth/blob/master/index.js#L316 (mongo re-init)
Looks like the most likely culprit at this point is model hanging around on the server, not being garbage-collected. Maybe in a closure somewhere. @jasonmanners
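To illustrate the suspicion (a contrived example, not actual HabitRPG code): any long-lived reference that closes over `model` pins it, and everything it loaded, out of reach of the garbage collector:

```javascript
// Contrived illustration: a module-level array lives for the whole process,
// so callbacks pushed onto it that close over `model` retain every request's
// model (and its subscribed data tree) indefinitely.
var pending = [];                                   // process-lifetime scope

function handleRequest(model) {
  pending.push(function () { return model; });      // leaks: model never freed
}

// Simulate a few "requests"; each model stays reachable via its closure.
for (var i = 0; i < 3; i++) {
  handleRequest({ data: { splits: 'big subscribed doc tree #' + i } });
}
console.log(pending.length); // 3 models retained
```

If something like this exists in our server routes (or in a require'd helper that holds module-level state), it would explain the steady growth: one retained model per request, never released.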
Instructions for testing:
- Check out the `memleak` branch, `npm install` (it adds node-webkit-agent)
- `node server.js` - follow the instructions it gives
- Load localhost:3000, take a memory heap snapshot in webkit-agent
- Refresh localhost:3000, take another heap snapshot
- Compare the diff between the snapshots
It'll show you which objects are still around and which are large. It's typically something about `model: data.splits`, which is a dead giveaway that model is sticking around somewhere on the server. Read through the previous comments on this thread, and the linkbacks - almost certainly the memory leak is model not getting garbage-collected on the server. My theory is the require()s in src/app/index.coffee outside the get() & ready() functions?
One thing that will help is updating to the latest SLaks/racer#master, but our custom update function is incompatible with that new racer code. I need to merge this properly - this will be a job for me most likely, just thinking out loud.
This seems to be the core issue at the moment. Plenty of bug reports stem from it, and it's the main thing that's preventing fluent operation right now. I'd like to put a reward of 500 mBTC (current value ~$64) on the problem, to be awarded to the person or people that solve it.
For discussion, please visit http://community.habitrpg.com/content/considering-putting-reward-memory-leak-issue
Can someone post the results of `grep mongod /var/log/messages` from the failing servers?
Is the memory leak rate proportional to the login rate, or UNIQUE USER login rate?
Just a suggestion that might work, even though its a bit crude:
If the profiler is still saying it's strings, you might try the following.
Wait till the server has grown big and then take a memory dump.
Then either use something along the lines of:
`gcore PID | strings | sort | uniq -c | sort -n`
Note: I don't have gcore on this PC, so I couldn't verify that the syntax is 100% correct; it works for /proc/kcore though.
OR
Use your favourite blob editor.
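For what it's worth, a hedged correction of the pipeline above: gdb's gcore writes a core *file* rather than streaming to stdout, so the string tally would run on that file (exact flags may vary by system; check your man page). The counting stage itself, demonstrated on a stand-in dump file:

```shell
# Assumed gcore usage (verify locally; gcore here is the gdb wrapper):
#   gcore -o /tmp/core <PID>                      # writes /tmp/core.<PID>
#   strings /tmp/core.<PID> | sort | uniq -c | sort -rn | head -20
# The tally stage, shown on a stand-in text file:
printf 'leaky-string\nleaky-string\nleaky-string\nother\n' > /tmp/dump.txt
sort /tmp/dump.txt | uniq -c | sort -rn | head -1
```

The most-repeated strings bubble to the top, which might hint at what's piling up in the heap.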
Turns out this was Racer after all. When I deployed the 0.5 upgrade, the memory leak went away completely. Unfortunately the 0.5 upgrade didn't work out, and we need to rewrite Habit in Angular. This means this is an unfixable bug until we migrate.
Update btw: there's no fixing this. It's a Racer 0.3 issue; I've 100% removed all possible model persistence from server routes. Just gotta hang tight till the Angular rewrite.

There's a memory leak, necessitating a server restart, which causes #322