Server down/stalled... #944

Closed
Martii opened this Issue Apr 5, 2016 · 59 comments

6 participants
@Martii
Member

Martii commented Apr 5, 2016

I'm unable to get into the VPS to restart it and it's spinning in a web browser. NOTE: This is purely a VPS issue with our provider and not the project nor the node configuration.

Messaged @sizzlemctwizzle Cc: @jonleibowitz


Last script update on local pro at 2016-04-05T12:07:05.214Z

Refs:

@Martii Martii added the expedite label Apr 5, 2016

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Apr 6, 2016

Martii
Some dep updates
* Reinstate *toobusy-js*... at least one of their timers has been fixed on shutdown. See OpenUserJS#354, OpenUserJS#353, OpenUserJS#352 and base issue of OpenUserJS#345 ... loosely related to OpenUserJS#249 and attempt to address OpenUserJS#944 with a work-around... VPS should be faster than our old one so perhaps the timers don't make as much of a difference. Start with our old default lag value... this may introduce too many 503's again but hopefully not
* Retested delete op
* Bug fixes, tests, and docs updates... please read their CHANGELOGS
* Shutdown the server on SIGINT
* Modify db closure to not have dependents
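
For readers unfamiliar with the approach in this commit: a lag-based busy check measures how late a repeating timer fires and answers 503 once the smoothed lag passes a threshold. The following is a minimal, self-contained sketch of that idea only. It is not the *toobusy-js* implementation nor this project's code; `LagMonitor`, `busyMiddleware`, the 70 ms threshold, and the 500 ms interval are all assumptions made for illustration.

```javascript
'use strict';

// Hypothetical helper: tracks how late a repeating timer fires.
class LagMonitor {
  constructor(maxLagMs = 70, intervalMs = 500) {
    this.maxLagMs = maxLagMs;     // assumed threshold, not the project's value
    this.intervalMs = intervalMs; // assumed sampling interval
    this.lagMs = 0;
    this.last = 0;
    this.timer = null;
  }

  // Record one timer tick; `now` is a timestamp in milliseconds.
  tick(now) {
    const late = Math.max(0, now - this.last - this.intervalMs);
    // Smooth the reading so one slow tick does not dominate.
    this.lagMs = (late + this.lagMs * 2) / 3;
    this.last = now;
  }

  start() {
    this.last = Date.now();
    this.timer = setInterval(() => this.tick(Date.now()), this.intervalMs);
    // Do not let the monitor's own timer keep the process alive.
    this.timer.unref();
  }

  // Clearing the timer on shutdown lets the process exit cleanly.
  shutdown() {
    if (this.timer) clearInterval(this.timer);
    this.timer = null;
  }

  isBusy() {
    return this.lagMs > this.maxLagMs;
  }
}

// Express-style middleware: answer 503 right away when lagging.
function busyMiddleware(monitor) {
  return function (req, res, next) {
    if (monitor.isBusy()) {
      res.statusCode = 503;
      res.end('We are busy... try again later');
      return;
    }
    next();
  };
}
```

With something like this in place, a SIGINT handler along the lines of `process.on('SIGINT', () => { monitor.shutdown(); server.close(); })` would stop the timer before closing the server, where `server` stands for whatever HTTP server object the app holds.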

@Martii Martii referenced this issue Apr 6, 2016

Merged

Some dep updates #945

@Martii

Member Author

Martii commented Apr 6, 2016

Found a break in the system... either a hiccup on whatever is causing this... or my distro update on laptop which may have a compatible client to connect to the latest Debian... or sizzle... although I did try Windows VM (virtual machine) and PM (physical machine), Debian VM, ArchLinux VM, ArchLinux PM, and other Linux PM too and those failed... so not entirely sure. (too many inet issues today everywhere)

I have already seen a 503 with toobusy-js on login that is probably GH's issue (still guessing here)... leaving this open for a more detailed investigation over the next few days. Apologies for this unscheduled outage... definitely out of my control at this time.

Btw dist-upgrade yielded no further updates. :\

@Martii

Member Author

Martii commented Apr 6, 2016

Looks like it's still down.

@Martii

Member Author

Martii commented Apr 6, 2016

No news yet.

@Martii Martii added bug HOST labels Apr 6, 2016

@Martii

Member Author

Martii commented Apr 6, 2016

PENDING!... got 5 "too busy's" on login... but I now have access... we'll see how long this stays up on the VPS.

@Martii

Member Author

Martii commented Apr 6, 2016

One server restart detected... investigating.

@lelinhtinh


lelinhtinh commented Apr 6, 2016

503 ...

@Martii

Member Author

Martii commented Apr 6, 2016

I know... it's going to take a bit to resolve this... something is chewing up memory and causing the VPS to crash... this was happening before the 503 addition of toobusy-js... I'm probably going to take the server down, do some recompilations, and see if that helps. That is why it's in PENDING status right now.

Patience please. :)

@Martii Martii self-assigned this Apr 6, 2016

@Martii

Member Author

Martii commented Apr 6, 2016

Downgrading node didn't help... it seems that malloc (or whichever low-level lib is being used) isn't freeing up memory in the distro/VPS.

I'm going to try disabling script minification just to be sure... with an environment variable to be added... don't worry I'll have it pass-through to the unminified so it doesn't break scripts.
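
A switch like the one described here could look roughly like the sketch below. The variable name `DISABLE_SCRIPT_MINIFICATION`, both function names, and the fallback on a minifier error are assumptions for illustration; the thread does not say what the actual environment variable is called or how the project wires it in.

```javascript
'use strict';

// Hypothetical flag name; the thread does not name the real variable.
function minificationDisabled(env) {
  return env.DISABLE_SCRIPT_MINIFICATION === 'true';
}

// `minify` is any function that takes source text and returns minified text.
function sourceToServe(source, minify, env) {
  if (minificationDisabled(env)) {
    return source; // pass through the unminified script unchanged
  }
  try {
    return minify(source);
  } catch (err) {
    // Assumption: a failed minification falls back to the original source
    // so that the script still installs.
    return source;
  }
}
```

In an app this would typically be called with `process.env` as the last argument, so the flag can be flipped without a code change.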

@Martii

Member Author

Martii commented Apr 6, 2016

Still losing memory, although more slowly, with script minification disabled... i.e. the VPS is going to crash again... watching it right now go down, down, up, down, down, down, up, down, down, down, etc... until eventually there is zero free memory.

So systematically ruled out our project and reaffirmed this is a distro/VPS issue. :\

@Martii

Member Author

Martii commented Apr 6, 2016

And there it goes. :\

@Martii

Member Author

Martii commented Apr 6, 2016

Have to AFK for a few hours... will be back to try some other things as soon as I can. :\ Leaving the site OFFLINE for the moment.

@Martii

Member Author

Martii commented Apr 7, 2016

@sizzlemctwizzle and anyone watching,
So I've put up a constant 503 on all routes at the moment... it's not very pretty but it will at least let everyone know that "we're busy... try again later" (better than nothing). This is hard-coded into the app.js with a manual FORCE_BUSY='true' in the env and not here on GH dev yet... still running some tests to see if this portion stays up. So far we are around a constant ~6% memory usage... will monitor this for a few hours... sleep in between... waking up... seeing if Debian has an update that fixes this before I make more reports, etc.

I've tried many different versions of node and all result in the same issue with this kernel image on the VPS... i.e. memory gets eaten up. Using the precompiled *node*s, the server lasts for less than 5 minutes... with a manual build from *node* source I can sometimes get about 45 minutes of uptime. NEITHER OPTION IS SUITABLE as I can't babysit the server that constantly.

I've also looked into backing out/rolling back the last dist-upgrade and of course the old packages aren't available on the official repos... so that will fail.

Only three options are left that I can think of...

  1. Run the kernel recovery and, assuming it works, see if that helps. There is exactly one snapshot in total, and it is of this bad VM... so twiddling can be undone now. Eventually there will be a decent snapshot that we can roll back to.
  2. Recreate the VM from scratch and see if it still has this memory leak... if it does, switch distros in a new VM. Some of this is beyond my access, and @sizzlemctwizzle did some configuration that I'm not aware of (yet?).
  3. Wait......................... (as for this option... adding tracking upstream... I'll have to create an issue on Debian first, then nodejs Cc: @mikeal ... after slumber though)

Just a sidenote... all script sources are intact as far as I can see in local pro, i.e. this is not a DB issue. (also made the HOST label here on GH as you might have noticed already)

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Apr 7, 2016

Martii
Create a BUSY landing page
* `BUSY_LAG` environment var so this can be twiddled with later
* `FORCE_BUSY` environment var to indicate technical difficulties with styling
* `FORCE_BUSY_ABSOLUTE` environment var to indicate technical difficulties with no UI
* Change the messages to suit

**NOTE**
This is also to test to see where the memory leak is happening... *mu2* isn't leaking since the hard-coded 503 has been in place and the average memory usage is around ~6%

Applies to OpenUserJS#944, OpenUserJS#249 and OpenUserJS#37
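
To illustrate how the three variables named in this commit message might interact, here is a small sketch. Only the variable names `BUSY_LAG`, `FORCE_BUSY`, and `FORCE_BUSY_ABSOLUTE` and the `'true'` string convention come from the thread; the function names, the 70 ms fallback, the precedence order, and the choice to style lag-triggered 503s are assumptions.

```javascript
'use strict';

// Read the three variables from an environment-like object.
function busyConfig(env) {
  const lag = parseInt(env.BUSY_LAG, 10);
  return {
    maxLagMs: Number.isNaN(lag) ? 70 : lag, // assumed fallback value
    forceBusy: env.FORCE_BUSY === 'true',
    forceBusyAbsolute: env.FORCE_BUSY_ABSOLUTE === 'true'
  };
}

// Decide how a request is answered:
//   'absolute' -> bare 503 with no UI
//   'styled'   -> 503 landing page with styling
//   'normal'   -> continue to the regular routes
function busyMode(config, isLagging) {
  if (config.forceBusyAbsolute) return 'absolute';
  if (config.forceBusy || isLagging) return 'styled';
  return 'normal';
}
```

In practice `busyConfig(process.env)` would be read once at startup, and `busyMode` consulted per request.
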
@Martii

Member Author

Martii commented Apr 7, 2016

~6.5% peak memory usage with styling applied to 503's

Manually enabling /about routes to test stability


Reinstalled all deps, and their deps, and so on... no dist-upgrade available.

@Martii

Member Author

Martii commented Apr 7, 2016

~6.5% nominal and ~15% peak memory usage with /about routes ... no leaks detected

Manually enabling /users route to test stability
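
The thread does not say how the nominal and peak figures were collected. One possible way to get comparable numbers from inside a Node process is sketched below; treating "nominal" as the median of the samples is an assumption, as are the function names.

```javascript
'use strict';
const os = require('os');

// Resident memory of this process as a percentage of total system memory.
function memoryPercent() {
  return (process.memoryUsage().rss / os.totalmem()) * 100;
}

// Summarize a list of percentage samples.
// "nominal" is taken to be the median (an assumption); "peak" is the maximum.
function summarize(samples) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const nominal = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return { nominal, peak: sorted[sorted.length - 1] };
}
```

Sampling `memoryPercent()` on an interval while exercising one route at a time, then calling `summarize` on the collected values, would give one nominal/peak pair per route.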

@Martii

Member Author

Martii commented Apr 7, 2016

~7.1% nominal and ~8.6% peak memory usage with /users route ... slightly slower to release memory on /users/username/comments ... this will be cumulative during testing.

Manually enabling /forum route to test stability

@Martii

Member Author

Martii commented Apr 7, 2016

~6.4% nominal and ~7.4% peak memory usage with /forum route

Manually enabling all other discussion routes, except the issue discussions under the /scripts route, to test stability

@Martii

Member Author

Martii commented Apr 7, 2016

~7.1% nominal and ~8.8% peak memory usage for global discussions

Manually enabling /group route excluding api search to test stability

@Martii

Member Author

Martii commented Apr 7, 2016

~7.5% nominal and ~8.9% peak memory usage for /groups route

Manually enabling /libs route excluding general / route ... this doesn't include script installations just yet but does show Source Code tab ... to test stability

@Martii

Member Author

Martii commented Apr 7, 2016

~7.5% nominal and ~7.7% peak memory usage for /libs route

Manually enabling /scripts route excluding general / route ... this also doesn't include script installations just yet but does show Source Code tab... to test stability
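
The route-by-route procedure in the comments above amounts to an allowlist: routes already re-enabled pass through, and everything else keeps getting the 503. A minimal sketch of that gate follows. The names `enabledPrefixes`, `isEnabled`, and `gate` are hypothetical, and the thread describes the real change as hard-coded in app.js rather than structured this way.

```javascript
'use strict';

// Hypothetical allowlist of route prefixes re-enabled so far.
const enabledPrefixes = ['/about', '/users', '/forum'];

// True when `path` is one of the prefixes or sits underneath one.
function isEnabled(path, prefixes) {
  return prefixes.some(p => path === p || path.startsWith(p + '/'));
}

// Express-style middleware: enabled routes continue, the rest get 503.
function gate(prefixes) {
  return function (req, res, next) {
    if (isEnabled(req.path, prefixes)) {
      next();
      return;
    }
    res.statusCode = 503;
    res.end('We are busy... try again later');
  };
}
```

Adding one prefix at a time and watching memory between additions narrows down which route, if any, triggers the leak.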

Martii added a commit that referenced this issue Apr 25, 2017

Rework *express-brute* instances to manage .meta.js vs .user.js requests (#1095)

* Set back to no free retries since .meta.js and .user.js requests are handled with separate instances. AOM/.user.js engines should always send the accept header regardless of FAIL state. .meta.js currently has a shorter wait period as it's less intensive.

Reapplies to #944

Auto-merge
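
The idea of handling .meta.js and .user.js requests with separate limiter instances can be sketched without any dependency as below. This is not the *express-brute* API nor the project's configuration; `WaitLimiter`, `limiterFor`, and both wait values are made up for illustration, with the .meta.js wait shorter only because the commit says it is less intensive.

```javascript
'use strict';

// Minimal per-key limiter: with no free retries, every request from a key
// must wait `waitMs` after that key's previous accepted request.
class WaitLimiter {
  constructor(waitMs) {
    this.waitMs = waitMs;
    this.nextAllowed = new Map(); // key -> earliest allowed timestamp (ms)
  }

  // Returns 0 when the request is allowed, otherwise the ms left to wait.
  check(key, now) {
    const allowedAt = this.nextAllowed.get(key) || 0;
    if (now < allowedAt) return allowedAt - now;
    this.nextAllowed.set(key, now + this.waitMs);
    return 0;
  }
}

// Separate instances, so .meta.js traffic never counts against .user.js.
// Both wait values are assumptions.
const limiters = {
  meta: new WaitLimiter(1000),
  user: new WaitLimiter(5000)
};

// Pick the limiter by the requested file's extension.
function limiterFor(path) {
  return path.endsWith('.meta.js') ? limiters.meta : limiters.user;
}
```

Because each instance keeps its own map, an update check against the metadata file does not delay a following install request from the same client.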

Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue May 3, 2017

Martii
Implement maxLag for testing
* Rework/revisit *toobusy-js* a little to improve middleware performance... does slow start up time a bit
* Some immediate returns added instead of just fallthrough

**NOTES**
* Discovered VPS memory size change **decrease** :\

Applies to OpenUserJS#944 and OpenUserJS#430

Martii added a commit that referenced this issue May 3, 2017

Implement maxLag for testing (#1101)
* Rework/revisit *toobusy-js* a little to improve middleware performance... does slow start up time a bit
* Some immediate returns added instead of just fallthrough

**NOTES**
* Discovered VPS memory size change **decrease** :\

Applies to #944 and #430

Auto-merge

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Oct 30, 2017

Some misc near parallel UI changes
* Change some icons around
* Add a tooltip
* Move default licensing arbitration to view instead of code... allows visual detection

Post OpenUserJS#1204 OpenUserJS#191 Loosely related to OpenUserJS#116, and OpenUserJS#944 via OpenUserJS#970

Martii added a commit that referenced this issue Oct 30, 2017

Some misc near parallel UI changes (#1208)
* Change some icons around
* Add a tooltip
* Move default licensing arbitration to view instead of code... allows visual detection

Post #1204 #191 Loosely related to #116, and #944 via #970

Auto-merge

This was referenced Jun 23, 2018

Martii added a commit that referenced this issue Jun 23, 2018

A dep update (#1441)
* *express-brute-mongo* has at some point returned to the dead, with the upstream project archived, so moving to our git for maintenance.

Applies to #944

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Dec 15, 2018

Repair lockdown meta url notice on Source Code page
* Link in the FAQ for this

Post OpenUserJS#944 OpenUserJS#970 OpenUserJS#389 ... missed somewhere around OpenUserJS#976 to OpenUserJS#1208 *(vaguely recall this was on the script homepage originally and moved to source code page)*. Needed for OpenUserJS#1548 to calm network traffic issues which appear to be global with Level3. Over 17,000 sites are down according to pingdom.

Martii added a commit that referenced this issue Dec 15, 2018

Repair lockdown meta url notice on Source Code page (#1549)
* Link in the FAQ for this

Post #944 #970 #389 ... missed somewhere around #976 to #1208 *(vaguely recall this was on the script homepage originally and moved to source code page)*. Needed for #1548 to calm network traffic issues which appear to be global with Level3. Over 17,000 sites are down according to pingdom.

Auto-merge

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Dec 20, 2018

Some dep exchanges
* Give these functionally similar new deps to our current protection a try for DoS detections.
* Slightly broader coverage but as is, is more detectable
* Use statics until potential other values are determined

NOTES:
* `RetryAfter` may become hidden again after some more tests and possible changes. We'll see.
* Still need to tweak the max value. If you are running the current max scripts then guess what... ~"You get to wait still".
* May add the sweet factor back in but for now same managed
* Tested on dev and local pro

Applies to post OpenUserJS#944 and will give it a try for OpenUserJS#1548

Martii added a commit that referenced this issue Dec 20, 2018

Some dep exchanges (#1559)
* Give these functionally similar new deps to our current protection a try for DoS detections.
* Slightly broader coverage but as is, is more detectable
* Use statics until potential other values are determined

NOTES:
* `RetryAfter` may become hidden again after some more tests and possible changes. We'll see.
* Still need to tweak the max value. If you are running the current max scripts then guess what... ~"You get to wait still".
* May add the sweet factor back in but for now same managed
* Tested on dev and local pro

Applies to post #944 and will give it a try for #1548

Auto-merge
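
The commit does not name the new dependencies, so the sketch below only illustrates the general technique of counting requests per client and telling a rejected client how long to wait via a `Retry-After` header. The class and function names, the fixed-window strategy, and the 429 status are assumptions and not the project's actual settings.

```javascript
'use strict';

// Fixed-window counter: at most `max` requests per key in each window.
class FixedWindowLimiter {
  constructor(max, windowMs) {
    this.max = max;
    this.windowMs = windowMs;
    this.windows = new Map(); // key -> { start, count }
  }

  // Returns { allowed, retryAfter } where retryAfter is in whole seconds.
  hit(key, now) {
    let w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      w = { start: now, count: 0 };
      this.windows.set(key, w);
    }
    w.count += 1;
    if (w.count <= this.max) return { allowed: true, retryAfter: 0 };
    const msLeft = w.start + this.windowMs - now;
    return { allowed: false, retryAfter: Math.ceil(msLeft / 1000) };
  }
}

// Express-style responder for a rejected request.
function reject(res, retryAfter) {
  res.statusCode = 429; // assumed status; the thread itself mostly mentions 503
  res.setHeader('Retry-After', String(retryAfter));
  res.end('Too Many Requests');
}
```

Hiding the header, as the notes in the commit consider, would simply mean skipping the `setHeader` call while keeping the same counting logic.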

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Jan 11, 2019

Change limits... stricter
* We'll try this "pretty" message for lists for the moment. Visitor in Germany seems to have given up for the time being.
* Limit a few more missed routes to indicate the seriousness of this!
* Per seat is affected behind proxies/vpn's *(same as brute limitations)*. Don't abuse the privilege please.

Applies to OpenUserJS#1548 OpenUserJS#944 and post OpenUserJS#1559

Martii added a commit that referenced this issue Jan 11, 2019

Change limits... stricter (#1569)
* We'll try this "pretty" message for lists for the moment. Visitor in Germany seems to have given up for the time being.
* Limit a few more missed routes to indicate the seriousness of this!
* Per seat is affected behind proxies/vpn's *(same as brute limitations)*. Don't abuse the privilege please.

Applies to #1548 #944 and post #1559 

Auto-merge

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Jan 13, 2019

Martii added a commit that referenced this issue Jan 13, 2019

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Jan 14, 2019

Restrict a few more routes
* These may be unlikely but could be hotlinked/bookmarked

Applies to OpenUserJS#1548 OpenUserJS#944 and post OpenUserJS#1559 OpenUserJS#1569

Martii added a commit that referenced this issue Jan 14, 2019

Restrict a few more routes (#1571)
* These may be unlikely but could be hotlinked/bookmarked

Applies to #1548 #944 and post #1559 #1569 

Auto-merge

Martii added a commit to Martii/OpenUserJS.org that referenced this issue Jan 15, 2019

Taper off on notices
* Useful for those ignoring
* Update responses to match status code elsewhere. This is handled differently in some browsers.

Applies to OpenUserJS#1548 OpenUserJS#944 and post OpenUserJS#1559 OpenUserJS#1569

Martii added a commit that referenced this issue Jan 15, 2019

Taper off on notices (#1572)
* Useful for those ignoring
* Update responses to match status code elsewhere. This is handled differently in some browsers.

Applies to #1548 #944 and post #1559 #1569

Auto-merge