Memory Leak #165

Closed
lefnire opened this Issue Jan 22, 2013 · 36 comments


@lefnire
HabitRPG member

There's a memory leak, necessitating a server restart, which causes #322

@lefnire
HabitRPG member

#80 is so last week, #165 is the new crippling crit. This is what's causing the server to go down from time to time. The RAM fills up very slowly; you can watch it in htop increasing bit by bit. Once it hits about 4GB, #80 kicks in and crashes the server. It's racer-db-mongo promises not being fulfilled; I don't know if that's due to Mongo overload causing stray promise events, or the server being too slow (due to the memory build-up) to fulfill the Mongo request. But it's actually a "good thing" that #80 crashes the server at that point, because then it restarts and we have fresh resources. (Obviously we need to fix it, but just sayin'.)

When you start Habit in production mode, you get:

Warning: connection.session() MemoryStore is not
designed for a production environment, as it will leak
memory, and will not scale past a single process.

Which is weird because, as you can see in server/index.coffee, we're not using MemoryStore but MongoStore. I'll explore this as a culprit first in any case.

@lefnire
HabitRPG member

possible lead: derbyjs/racer#72

@lefnire lefnire referenced this issue in derbyjs/racer Jan 30, 2013
Closed

data.splits in treeLookup leaks memory #72

@cirby1995

My progress has somehow been completely wiped out, and now I have to start over again. I'm not sure if this counts as a memory leak though.

@TrevorJin

My progress is gone as well.

@lefnire
HabitRPG member

Is that #268 you're both experiencing?

@StanLindsey

Uh oh - One of my dailys seems to have replaced all my other dailys bar three:

[screenshot: dailys]

@StanLindsey

That ended up fixing itself by the by.

@lefnire
HabitRPG member

@switz the memory leak

@lefnire
HabitRPG member

I think I got it guys! lefnire/derby-auth@8aff1c7 . I'll keep an eye on it and re-open if it's an issue again

@lefnire lefnire closed this Feb 13, 2013
@lefnire
HabitRPG member

There's still a leak somewhere. It's not as horrific as the one that was fixed, but it still requires that we restart the server periodically, which then triggers #413

@lefnire lefnire reopened this Feb 18, 2013
@Pandoro

Is there no way to run the test server in some kind of debug mode or profiler? It would be good to test this with a structured approach and identify all the issues once and for all. I think we can easily get enough people on there to make the load feel real to the server. The question is whether something like that exists :x

At the risk of sounding stupid: after quickly googling for a JavaScript profiler I found this? It might not be applicable to a server :-/ But I guess it would be worth looking into something like this?

@StanLindsey

I know one of @lefnire's priorities is a testing suite, and although you did some good research, Google Chrome essentially has Firebug built in. (Though the last time I used Firebug was years ago and it didn't have a profiler, which I'm fairly sure Chrome does, so I could be talking uselessness.)

@lefnire
HabitRPG member

Firebug / Chrome Console will help us find issues on the client; our issue's on the server. Two excellent approaches for profiling server code are node-webkit-agent and node-memwatch. We may also want to look into Nodetime, a commercial profiler we'd set up to run on the live server. These are cool tools, but to be frank I couldn't get them to tell me where our problem is. I think I need to work with them some more to figure out how to read their output properly.

@safarijacksa

What other suspects do you have for the memory leak other than #332 and #413?

@lefnire
HabitRPG member

The only thing I can think of is the party query subscriptions, but even when I run profiling locally in a solo group (in which case party subscriptions are bypassed), I see a leak. I'm playing with these profilers to figure it out, but I'm not sure what to do with them - they tell me things like "String objects take up 1GB RAM" - ok... strings... thanks.

@lefnire
HabitRPG member

The other thing is that Racer uses MemoryStore session middleware, which in fact throws a warning on launch about a memory leak. It ties in closely with Racer, so it might be difficult to gut their session store in favor of another. I hit errors just trying to replace it with a Redis store or a Mongo store, but it might be worth another shot.

@safarijacksa

I'm kinda hazy on this, but I know Android runs a garbage collector in the background that hunts down unused variables and frees up memory. It's possible that what the memory profiler is trying to tell you is that string variables are being created and never erased. You either need to create your own garbage cleaner (which has a danger of nuking user data) or ensure that each and every variable you have is gracefully destroyed after being used.

Do you not have something that records variables being created while the app is operating? Maybe from there you could identify a string variable group that is getting out of control.

If you want me to give an example of the above then give me a shout.

@lefnire
HabitRPG member

thanks @jasonmanners for all the help looking into these issues btw!

@lefnire
HabitRPG member

@jasonmanners is onto something:

It's not a bug or anything; it's how we are accessing the userId.

In my tests, even without auth, if I so much as access model.session.userId I get a leak.

I also noticed a piece of code in the racer session that inits a _userId variable on the model, which it grabs from session.userId or session.auth.userId.

So instead of grabbing the userId with model.session.userId, grab it with model.get('_userId').

This may not solve the issue, but I think accessing model.session on the client routes causes a leak; this is just speculation at the moment.

I would say test it out using model.get('_userId') and see if you get the same application behavior while checking on the memory usage. It may not solve it completely, but it may help.

Here is a gist with the change I am talking about: https://gist.github.com/jasonmanners/d079ff582df691fb0f06

Let me know if that works; it's late and this just came to me, so I may be missing something. But in my tests it does not leak, so here's to hoping.

Edit: After reading through your code on habitrpg, it has occurred to me that you may already know this. That being said, even the or calls that include model.session.userId in app/party may cause this; not sure, but something to consider.

@lefnire
HabitRPG member

I'll give changing that a shot, though it may be difficult: store accessControl callbacks require this.session to perform their checks. Is it just the client routes that have these issues (aka get('/') within app/index.js), and not other locations (standard express routes, passport callbacks, store access callbacks)?

@lefnire
HabitRPG member

So I will put this on GitHub after a few more tests, but it does seem that accessing model.session.userId in any route, server-side or client-side, does lead to a leak.

That being said, you can get the userId in both places using model.get('_userId'); on the server it would be req.getModel().get('_userId').

Using session.userId in the accessControl callbacks does not cause a leak. It seems that if you access the session within the store (this.session) it's fine; it's only when you access model.session.userId that the problem occurs, from my very rough tests. Still investigating though.

@lefnire lefnire was assigned Feb 22, 2013
@lefnire
HabitRPG member

Wasn't it, check out derbyjs/racer#72 (comment)

@lefnire
HabitRPG member

Looks like the most likely culprit at this point is the model hanging around on the server, not being garbage-collected. Maybe in a closure somewhere. @jasonmanners

@lefnire
HabitRPG member

derbyjs/racer#100, going to test today!

@lefnire
HabitRPG member

Instructions for testing:

  • Check out the memleak branch, npm install (it adds node-webkit-agent)
  • node server.js - follow the instructions it prints
  • Load localhost:3000, take a memory heap snapshot in webkit-agent
  • Refresh localhost:3000, take another heap snapshot
  • Compare the diff between the snapshots

It'll show you which large objects are still around. It's typically something about model: data.splits, which is a dead giveaway that the model is sticking around somewhere on the server. Read through the previous comments on this thread, and the linkbacks - almost certainly the memory leak is the model not getting garbage-collected on the server. My theory is the require()s in src/app/index.coffee outside the get() & ready() functions?

@lefnire
HabitRPG member

One thing that will help is updating to the latest SLaks/racer#master, but our custom update function is incompatible with that new racer code. I need to merge this properly; this will most likely be a job for me, just thinking out loud.

@Mqrius

This seems to be the core issue at the moment. Plenty of bug reports stem from it, and it's the main thing preventing smooth operation right now. I'd like to put a reward of 500 mBTC (current value ~$64) on the problem, to be awarded to the person or people who solve it.

For discussion, please visit http://community.habitrpg.com/content/considering-putting-reward-memory-leak-issue

@zenware

Can someone post the results of grep mongod /var/log/messages from the failing servers?

@CoachHuff

Is the memory leak rate proportional to the login rate, or UNIQUE USER login rate?

@qantourisc

Just a suggestion that might work, even though it's a bit crude:

If the profiler is still saying it's strings, you might try the following. Wait till the server has grown big, then take a memory dump.

Then use something along the lines of:

gcore $PID && strings core.$PID | sort | uniq -c | sort -n

Note: I don't have gcore on this PC so I couldn't verify the syntax is 100% correct (gcore writes a core.<pid> file rather than printing to stdout); the strings pipeline works for /proc/kcore though.

OR use your favourite blob editor.

@lefnire
HabitRPG member

Turns out this was Racer after all. When I deployed the 0.5 upgrade, the memory leak went away completely. Unfortunately the 0.5 upgrade didn't work out, and we need to rewrite Habit in Angular. This means the bug is unfixable until we migrate.

@lefnire
HabitRPG member

Closing; naught that we can do until we migrate.

@lefnire lefnire closed this Jul 23, 2013
@lefnire
HabitRPG member

Update btw: there's no fixing this. It's a Racer 0.3 issue; I've 100% removed all possible model persistence from server routes. Just gotta hang tight till the Angular rewrite.

@khipkin khipkin referenced this issue Jun 30, 2016
Closed

500.html is out of date #7745
