[Heads up] Cloud currently down... #1211

Closed
TheLogFather opened this issue Oct 6, 2016 · 109 comments

@TheLogFather

commented Oct 6, 2016

No connection for https://scratch.mit.edu/projects/21501896/

Other projects are similarly not seeing the cloud values arrive after "connecting to server" box.

Also, cannot connect via API (so custom cloud clients are not working).

However, values are visible through varserver URL, and cloud logs look ok.

@TheLogFather

Author

commented Oct 6, 2016

UPDATE: cloud values for a project do arrive... eventually...
Taking a minute or two, though.

@TheLogFather

Author

commented Oct 6, 2016

Ah, fixed... great stuff! :)

@colbygk

Member

commented Oct 6, 2016

I have restarted the varserver for http.

@colbygk colbygk closed this Oct 6, 2016

@TheLogFather

Author

commented Oct 7, 2016

Cloud has been going through some rough patches today, on and off since ~1500UT.

It's currently really iffy – particularly new changes getting reported back by the varserver URL.
See here for some real-time info: https://scratch.mit.edu/projects/119629398/

@TheLogFather

Author

commented Oct 7, 2016

Looking good now...
(Was there a problem with a cloud server process? Or just high traffic?)

@TheLogFather

Author

commented Oct 9, 2016

Cloud is down again – for the last few hours (since ~1500UT)...
(If it happens again, should I just keep commenting on this, or open a new topic?)

@TheLogFather

Author

commented Oct 9, 2016

OK, looks like it's come back again just now. :)

@TheLogFather

Author

commented Oct 16, 2016

Cloud down again for last few hours. Taking a minute or so again for projects to retrieve cloud values.
(Was also down for a few hours around midnight last night, GMT).

@TheLogFather

Author

commented Oct 16, 2016

Cloud was back ~1647UT, but was looking pretty dodgy from ~1730UT, basically unusable from ~1800UT, and failed soon after for a bit, until ~1840UT.

It was iffy again from ~1900UT, failing completely at ~1930UT, and is still down now...

@TheLogFather

Author

commented Oct 17, 2016

Cloud came back at 0223UT, and it's been basically OK since.

However, I'm currently seeing frequent latency spikes for the latest value getting reported back by the varserver URL (up to 1.5s delay about every 10-20 seconds).

I've seen this sort of behaviour in the past when cloud has been on the verge of failing, so it may be that it will be down again within the next hour or two...

@TheLogFather

Author

commented Oct 17, 2016

I think it's getting worse, so perhaps it'll be even less than an hour...

Yup, hardly ever seeing a varserver delay less than a second now, and sometimes even >3s. :(

@colbygk

Member

commented Oct 17, 2016

Looking at the servers (the software), they appear to be making normal progress and we may be running up against limits of the hardware in terms of where they are running. Further analysis is being tracked in another repo.

@colbygk colbygk reopened this Oct 17, 2016

@colbygk

Member

commented Oct 17, 2016

FYI, last month we turned on a global throttling mechanism for certain classes of requests that arrive via http. By far the most common case caught by the throttler is too many requests being sent too quickly to the cloud vars server. Those requests receive a 429 response.
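
For anyone writing a custom cloud client, one way to cope with a 429 is to back off and retry rather than re-sending immediately. The following is only a minimal sketch: `fetch` is a hypothetical callable returning `(status_code, body)`, and the attempt limit and delays are illustrative assumptions, not values taken from the Scratch servers.

```python
import time
from typing import Callable, Tuple


def get_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,          # assumed limit, not a server-documented value
    base_delay: float = 0.5,        # assumed first wait in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns a non-429 status or attempts run out.

    fetch is a hypothetical stand-in for whatever HTTP request the client makes.
    """
    delay = base_delay
    status, body = fetch()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(delay)   # wait before retrying a throttled request
        delay *= 2     # double the wait each time (exponential backoff)
        status, body = fetch()
    return status, body
```

Passing `sleep` in as a parameter keeps the function easy to exercise without real waiting; a real client would leave the default.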

@TheLogFather

Author

commented Oct 17, 2016

It looks like cloud is responding somewhat more sensibly now, after going down for about a minute at 1624UT.

@TheLogFather

Author

commented Oct 17, 2016

BTW, what I'm looking at is not in a project, but in a cloud client. Every two seconds it updates a cloudvar value, and then checks with the project's varserver URL to see if the new value is 'visible' there.

It checks the varserver URL about every 0.15s (up to 5s max) after setting the new value, until it finds the new value has appeared. (Is that too often? It fairly often finds the value is updated by the first test, which comes back at ~0.2s. If not, it's nearly always there by the second test, which comes back ~0.35s. Occasionally it sees a 'lag spike' of a bit longer [maybe 0.8ish]. If cloud is getting dodgy, as above, then it sees these spikes more regularly.)
UPDATE: Actually, I'm seeing long delays of 1-3s already...
UPDATE2: Oh... cloud was 'dead' for a few secs @1648UT, and now it's looking better again.
(Did you do something?)
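
The check loop described in this comment can be sketched roughly as follows. This is an illustration only, not the actual client: `read_value` is a hypothetical callable standing in for a request to the project's varserver URL, and the function returns how long the new value took to become visible.

```python
import time
from typing import Callable, Optional


def wait_until_visible(
    read_value: Callable[[], str],   # hypothetical: fetches the current value
    expected: str,
    interval: float = 0.15,          # gap between checks, as described above
    timeout: float = 5.0,            # give up after this many seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until read_value() reports expected, or None on timeout."""
    start = clock()
    while clock() - start < timeout:
        sleep(interval)
        if read_value() == expected:
            return clock() - start
    return None
```

The measured latency includes the time each request takes, which is why a first check sent at 0.15s can come back at around 0.2s.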

@colbygk

Member

commented Oct 17, 2016

Every 0.15s for 5 seconds is not too often

@TheLogFather

Author

commented Oct 17, 2016

OK, that's good to know – though it's very rare that it goes up to 5s at that rate.

Looking at the stats, if cloud is behaving well then the cloud client is typically doing three (occasionally a few more) cloudvar sets every 2 secs, one of which is followed 0.15s later by a single varserver URL check, sometimes followed by another 0.15s later if the first didn't show the update. (And every few minutes or so there's a 'lag spike' and it takes a few more checks, separated by 0.15s.)

I guess the downside of the way this works is that if cloud is already not behaving well for some reason then it'll end up sending more of those varserver requests, which is probably not ideal for cloud. :/

Given above, I've made my client now increase the delay between each new varserver request by 0.05s after the first two checks.
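
One possible reading of that change, as a sketch: the first two checks keep the 0.15s gap and every later gap grows by 0.05s, stopping once the total wait would pass 5s. The function name and the exact cut-off rule are assumptions made for illustration, not the client's real code.

```python
from typing import Iterator


def check_delays(
    base: float = 0.15,       # gap used for the first checks
    step: float = 0.05,       # amount added to each later gap
    flat_checks: int = 2,     # how many checks keep the base gap
    max_total: float = 5.0,   # stop once the summed gaps would exceed this
) -> Iterator[float]:
    """Yield the wait before each check until max_total would be exceeded."""
    total = 0.0
    n = 0
    while True:
        extra = max(0, n - flat_checks + 1)   # 0, 0, 1, 2, 3, ...
        delay = base + step * extra
        if total + delay > max_total:
            return
        total += delay
        n += 1
        yield delay
```

Under these assumptions the schedule makes 12 checks within the 5s window, compared with 33 at a fixed 0.15s gap, so a struggling server sees far fewer requests.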

BTW: I'm currently getting long delays of one to four secs again for the varserver URL to notice updates.

@TheLogFather

Author

commented Oct 17, 2016

Oops, now it looks like cloud has (just about) gone... :(
(Often taking 20-30s for a project to collect its cloudvars after project loads.)
UPDATE: cloud is back! (@2032UT) :)

@TheLogFather

Author

commented Oct 18, 2016

Cloud is currently looking iffy, as far as varserver URLs are concerned – getting delays of one to four seconds before updates are seen through varserver URL.

However, the behaviour of the rest of cloud appears not too bad (projects are getting their values within a second or two after first loading, and the round-trip speed is at about a second, which is higher than the usual ~250ms for me, but still basically workable for most projects).

@thisandagain thisandagain modified the milestones: Backlog, October 20 Oct 18, 2016

@jwzimmer

Member

commented Oct 18, 2016

Another user reported running into something that seems like it was caused by this issue in freshdesk no. 59768:

I have been on Scratch many years on my normal account, and have used them many times. On this test account, I test my multiplayer games on my normal account and I make animations. Today I was going to make a voting thing, and I know how to make them. I did everything, then at the top it said: "Cloud variables only store numbers; see FAQ" bla bla bla... But the variable was not created. For the split second it has some loading thing, so it is an error after making it. Also, I tried on my other internet browser, Microsoft Edge, and that did not work either. I don't know if the servers have come down, or what.

@TheLogFather

Author

commented Oct 18, 2016

Looks like cloud is totally down at the moment – cloudvars not arriving at projects even several minutes after project loads. Also unable to create a viable cloud session via API.

@colbygk

Member

commented Oct 18, 2016

It does appear that the tcp version might be stuck, while the http server is processing ~475 successful requests per second at the moment. Restarting tcp.

@encloinc

commented Feb 22, 2017

Any updates on this?

@CatsAreFluffy

commented May 17, 2017

Bump.

@PolyEdge

commented Jun 3, 2017

Any idea on what's happening with the TCP interface?

@encloinc

commented Jun 3, 2017

Nobody knows. I'm sick of fallback to be honest, it's too slow.

@mooshoe

commented Jun 3, 2017

@ModernFeelGames

commented Jun 4, 2017

dude, that'd be sick bro

@PolyEdge

commented Jun 8, 2017

Last time they said anything was in like October, and they said the new system would come in January/February this year.

@PolyEdge

commented Jun 8, 2017

Also I'm hoping if they even finish this, they'll keep the format the same so that I don't have to rewrite the cloud part of scratchapi yet again 📦

@towerofnix

commented Jun 8, 2017

@PolyEdge Last time they said anything was about two hours ago, here:

We're going to have a new cloud data infrastructure Very Soon™ which will use websockets, so keep an eye on cloud data in the next couple of months!

@PolyEdge

commented Jun 8, 2017

@liam4 Very Soon™ 👍

@PolyEdge

commented Jun 8, 2017

Also that ninja was insane

@encloinc

commented Aug 18, 2017

3 months later

@jwzimmer

Member

commented Aug 22, 2017

The new cloud data work has been deployed as a soft launch (accessible by specific URLs) on Production.

If anyone would like to help test (which would be greatly appreciated!), here's what you need to know:

  • Make sure the URL of the project you are editing ends in ?newcloud
    • If you load a project, the URL needs to end in (random project number for example) /projects/171781902/?newcloud
    • If you are going back and forth between the editor & project page, the URL still needs to have ?newcloud in it, like /?newcloud#editor or /?newcloud#player
    • If you refresh the page or create a new project, the ?newcloud part needs to be re-added
  • Cloud data in the "old"/ existing system on Production is not available in the new soft launch version, but it will be added when the system is rolled out globally (rather than as a soft launch)
    • When you are testing a project with e.g. a high score on Production, that high score won't be available at the ?newcloud version of the project
    • Data you create during the soft launch of the new cloud data system will be erased when the new work is deployed globally
  • The cloud monitor page is accessible at (random project number for example) /cloudmonitor/171781902/soft/
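
For testers who script their checks, the URL rules in the list above could be applied with a small helper. This is a sketch only: the function names are made up for illustration, and the handling of any other query parameters is an assumption, since the list only covers the bare `?newcloud` form.

```python
from urllib.parse import urlsplit, urlunsplit


def with_newcloud(url: str) -> str:
    """Return url with ?newcloud in the query, keeping any #editor/#player part."""
    parts = urlsplit(url)
    params = parts.query.split("&") if parts.query else []
    if "newcloud" in params:
        return url                      # already a soft-launch URL
    params.append("newcloud")           # assumption: append to existing params
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "&".join(params), parts.fragment)
    )


def cloud_monitor_path(project_id: int) -> str:
    """Path of the soft-launch cloud monitor page for a project."""
    return f"/cloudmonitor/{project_id}/soft/"
```

For example, `with_newcloud("https://scratch.mit.edu/projects/171781902/#editor")` gives `https://scratch.mit.edu/projects/171781902/?newcloud#editor`, which matches the `/?newcloud#editor` form listed above.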

Please report any bugs you find here! Thank you.

@CosmicWebServices

commented Aug 22, 2017

Made a small test for it... works great! https://scratch.mit.edu/projects/171867511/?newcloud

@joker314

Contributor

commented Aug 22, 2017

Could be just me, but this is my console output.

(screenshot of console output)

It continuously reconnects before giving an error of "WebSocket is already in CLOSING or CLOSED state.", then reconnects again, then gives the same error message, over and over again.

However, the cloud log correctly updates the variable. Meaning that despite the errors everything is working well.

So, just error messages -- no actual problem.

I'm using a Windows 10 computer, with my browser being Google Chrome, version 60.0.3112.101

@griffpatch

commented Aug 22, 2017

@thisandagain

Member

commented Aug 23, 2017

@joker314 Thanks for reporting! We'll take a look.

/cc @jwzimmer @colbygk

@thisandagain

Member

commented Aug 23, 2017

@griffpatch The system is still throttled (otherwise we simply would not be able to scale and maintain it) but much more liberally than what is in place right now. The goal is to get everyone moved over to this new websocket-based system in the short term and then we'll continue to evaluate and add new features on top of this new infrastructure after Scratch 3.0 is released.

@joker314

Contributor

commented Aug 24, 2017

(Bug persists when all extensions are disabled)

@CosmicWebServices

commented Aug 24, 2017

On Firefox (latest stable) I get
Websocket closed, code:1000 reason: project_base.js:1:5740
07:31:14 PM | wrn | Connection closed to cloud server Object { reconnect: 1649.0207351744175, attempts: 3 } 171867511
07:31:16 PM | inf | Attempt reconnect to cloud server. 171867511
07:31:20 PM | inf | Successfully connected to cloud data server

Every 5 seconds or so

@prail

commented Aug 25, 2017

@thisandagain You say that cloud is still throttled very well below the rates required for realtime multiplayer. I have a project here: https://scratch.mit.edu/projects/171930267/?newcloud that works quite well with some very basic multiplayer functionality. (Movement and costume changes.) Has this changed since you posted last?

@jwzimmer

Member

commented Aug 25, 2017

@CosmicWebServices Thanks for reporting that. Did you notice any problems in the functionality of your project when you saw the error? If you have a link to the project that occurred in as well as steps that cause it to happen, that would be helpful, too. 👍

@CosmicWebServices

commented Aug 27, 2017

@jwzimmer No, not really. The link is already above (the one I posted).

Suggestion: https://scratch.mit.edu/discuss/topic/274278/?page=2#post-2799178

@colbygk

Member

commented Sep 11, 2017

Clouddata has now been migrated to the new websockets based platform and ?newcloud is no longer required to have a project use it.

@griffpatch

commented Sep 22, 2017

@griffpatch

commented Sep 22, 2017

@thisandagain

Member

commented Sep 22, 2017

Thanks @griffpatch. I'm going to move this over to another issue.

@colbygk

Member

commented Sep 22, 2017

@griffpatch Thanks for working on cloud data projects!

Last night (21 Sep 2017), I deployed some changes to the new cloud data service that have improved the connectivity issues.

Could you point me at a project where you're seeing the high score and data integrity issues?

@griffpatch

commented Sep 22, 2017
