This repository has been archived by the owner on Sep 3, 2019. It is now read-only.

[Heads up] Cloud currently down... #1211

Closed
TheLogFather opened this issue Oct 6, 2016 · 109 comments

@TheLogFather

TheLogFather commented Oct 6, 2016

No connection for https://scratch.mit.edu/projects/21501896/

Other projects are similarly not seeing the cloud values arrive after "connecting to server" box.

Also, cannot connect via API (so custom cloud clients are not working).

However, values are visible through varserver URL, and cloud logs look ok.

@TheLogFather
Author

UPDATE: cloud values for a project do arrive... eventually...
Taking a minute or two, though.

@TheLogFather
Author

Ah, fixed... great stuff! :)

@colbygk

colbygk commented Oct 6, 2016

I have restarted the varserver for http.

@colbygk colbygk closed this as completed Oct 6, 2016
@TheLogFather
Author

Cloud has been going through some rough patches today, on and off since ~1500UT.

It's currently really iffy – particularly new changes getting reported back by the varserver URL.
See here for some real-time info: https://scratch.mit.edu/projects/119629398/

@TheLogFather
Author

Looking good now...
(Was there a problem with a cloud server process? Or just high traffic?)

@TheLogFather
Author

TheLogFather commented Oct 9, 2016

Cloud is down again – for the last few hours (since ~1500UT)...
(If it happens again, should I just keep commenting on this, or open a new topic?)

@TheLogFather
Author

OK, looks like it's come back again just now. :)

@TheLogFather
Author

Cloud down again for last few hours. Taking a minute or so again for projects to retrieve cloud values.
(Was also down for a few hours around midnight last night, GMT).

@TheLogFather
Author

Cloud was back ~1647UT, but was looking pretty dodgy from ~1730UT, basically unusable from ~1800UT, and failed soon after for a bit, until ~1840UT.

It was iffy again from ~1900UT, failing completely at ~1930UT, and is still down now...

@TheLogFather
Author

TheLogFather commented Oct 17, 2016

Cloud came back at 0223UT, and it's been basically OK since.

However, I'm currently seeing frequent latency spikes for the latest value getting reported back by the varserver URL (up to 1.5s delay about every 10-20 seconds).

I've seen this sort of behaviour in the past when cloud has been on the verge of failing, so it may be that it will be down again within the next hour or two...

@TheLogFather
Author

TheLogFather commented Oct 17, 2016

I think it's getting worse, so perhaps it'll be even less than an hour...

Yup, hardly ever seeing a varserver delay less than a second now, and sometimes even >3s. :(

@colbygk

colbygk commented Oct 17, 2016

The server processes (the software) appear to be making normal progress, so we may be running up against the limits of the hardware they run on. Further analysis is being tracked in another repo.

@colbygk colbygk reopened this Oct 17, 2016
@colbygk

colbygk commented Oct 17, 2016

FYI, last month we turned on a global throttling mechanism for certain classes of requests that arrive via http. By far the most common case caught by the throttler is too many requests being sent too quickly to the cloud vars server. Those requests get a 429 response.
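
For client authors, one common way to cope with a 429 is to back off and retry. The following is only a sketch under assumptions: `fetch` is a hypothetical callable that performs one request and returns `(status, body)`, and the attempt count and delays are illustrative values, not limits published for the cloud vars server.

```python
import time
from typing import Callable, Optional, Tuple


def get_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it stops returning 429, doubling the wait each time.

    fetch is a hypothetical callable returning (http_status, body).
    Returns the body of the first non-429 response, or None if every
    attempt was throttled.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            sleep(delay)  # throttled: wait before trying again
            delay *= 2    # exponential backoff
    return None
```

Passing `sleep` in as a parameter keeps the sketch easy to exercise without real waiting; a real client would also want to treat other error statuses separately.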

@TheLogFather
Author

It looks like cloud is responding somewhat more sensibly now, after going down for about a minute at 1624UT.

@TheLogFather
Author

TheLogFather commented Oct 17, 2016

BTW, what I'm looking at is not in a project, but in a cloud client. Every two seconds it updates a cloudvar value, and then checks with the project's varserver URL to see if the new value is 'visible' there.

It checks the varserver URL about every 0.15s (up to 5s max) after setting the new value, until it finds the new value has appeared. (Is that too often? It fairly often finds the value is updated by the first test, which comes back at ~0.2s. If not, it's nearly always there by the second test, which comes back at ~0.35s. Occasionally it sees a 'lag spike' of a bit longer [maybe ~0.8s]. If cloud is getting dodgy, as above, then it sees these spikes more regularly.)
UPDATE: Actually, I'm seeing long delays of 1-3s already...
UPDATE2: Oh... cloud was 'dead' for a few secs @1648UT, and now it's looking better again.
(Did you do something?)
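
The polling loop described in this comment could look roughly like the sketch below. It is not the actual client code: `read_value` is a hypothetical callable standing in for one request to the project's varserver URL, and only the 0.15s interval and 5s cap come from the description above.

```python
import time
from typing import Callable, Optional


def wait_until_visible(
    read_value: Callable[[], str],
    expected: str,
    interval: float = 0.15,
    timeout: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll read_value() until it returns expected.

    Returns the observed lag in seconds, or None if the value has not
    appeared by the time the timeout is reached.
    """
    start = clock()
    while True:
        sleep(interval)                # wait before each check
        if read_value() == expected:
            return clock() - start     # lag until the update became visible
        if clock() - start >= timeout:
            return None                # gave up: counts as a failure
```

The returned lag is the number being quoted in this thread (for example ~0.2s on a good day, several seconds when cloud is struggling).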

@colbygk

colbygk commented Oct 17, 2016

Every 0.15s for 5 seconds is not too often.

@TheLogFather
Author

OK, that's good to know – though it's very rare that it goes up to 5s at that rate.

Looking at the stats, if cloud is behaving well then the cloud client is typically doing three (occasionally a few more) cloudvar sets every 2 secs, one of which is followed 0.15s later by a single varserver URL check, sometimes followed by another 0.15s later if the first didn't show the update. (And every few minutes or so there's a 'lag spike' and it takes a few more checks, separated by 0.15s.)

I guess the downside of the way this works is that if cloud is already not behaving well for some reason then it'll end up sending more of those varserver requests, which is probably not ideal for cloud. :/

Given above, I've made my client now increase the delay between each new varserver request by 0.05s after the first two checks.
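
One reading of that change, as a sketch: the first two checks keep the 0.15s spacing and every later check waits 0.05s longer than the one before. The function name and parameters are illustrative, not taken from the actual client.

```python
from typing import Iterator


def poll_delays(
    base: float = 0.15, step: float = 0.05, flat_checks: int = 2
) -> Iterator[float]:
    """Yield the wait before each successive varserver check.

    The first flat_checks waits are base seconds; after that each wait
    grows by step seconds.
    """
    check = 0
    while True:
        extra = max(0, check - flat_checks + 1)  # 0 for the first two checks
        yield round(base + step * extra, 2)
        check += 1
```

Taking the first five values gives waits of 0.15, 0.15, 0.20, 0.25 and 0.30 seconds, so a slow cloud server sees progressively fewer requests per second from the client.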

BTW: I'm currently getting long delays of one to four secs again for the varserver URL to notice updates.

@TheLogFather
Author

TheLogFather commented Oct 17, 2016

Oops, now it looks like cloud has (just about) gone... :(
(Often taking 20-30s for a project to collect its cloudvars after project loads.)
UPDATE: cloud is back! (@2032UT) :)

@TheLogFather
Author

Cloud is currently looking iffy, as far as varserver URLs are concerned – getting delays of one to four seconds before updates are seen through varserver URL.

However, the behaviour of the rest of cloud appears not too bad (projects are getting their values within a second or two after first loading, and the round-trip speed is at about a second, which is higher than the usual ~250ms for me, but still basically workable for most projects).

@thisandagain thisandagain modified the milestones: Backlog, October 20 Oct 18, 2016
@jwzimmer-zz

Another user reported running into something that seems like it was caused by this issue in freshdesk no. 59768:

I have been on Scratch many years on my normal account, and have used them many times. On this test account, I test my multiplayer games on my normal account and I make animations. Today I was going to make a voting thing, and I know how to make them. I did everything, then at the top it said: "Cloud variables only store numbers; see FAQ" bla bla bla... But the variable was not created. For the split second it has some loading thing, so it is an error after making it. Also, I tried on my other internet browser, Microsoft Edge, and that did not work either. I don't know if the servers have come down, or what.

@TheLogFather
Author

Looks like cloud is totally down at the moment – cloudvars not arriving at projects even several minutes after project loads. Also unable to create a viable cloud session via API.

@colbygk

colbygk commented Oct 18, 2016

It does appear that the tcp version might be stuck, while the http server is processing ~475 successful requests per second at the moment. Restarting tcp.

@encloinc

Any updates on this?

@CatsAreFluffy

Bump.

@PolyEdge

PolyEdge commented Jun 3, 2017

Any idea on what's happening with the TCP interface?

@encloinc

encloinc commented Jun 3, 2017

Nobody knows. I'm sick of the fallback, to be honest; it's too slow.

@jacobduba

jacobduba commented Jun 3, 2017 via email

@JayTeeJayArgh

dude, that'd be sick bro

@PolyEdge

PolyEdge commented Jun 8, 2017

Last time they said anything was in, like, October, and they said the new system would come in January/February this year.

@PolyEdge

PolyEdge commented Jun 8, 2017

Also I'm hoping if they even finish this, they'll keep the format the same so that I don't have to rewrite the cloud part of scratchapi yet again 📦

@towerofnix

@PolyEdge Last time they said anything was about two hours ago, here:

We're going to have a new cloud data infrastructure Very Soon™ which will use websockets, so keep an eye on cloud data in the next couple of months!

@PolyEdge

PolyEdge commented Jun 8, 2017

@liam4 Very Soon™ 👍

@PolyEdge

PolyEdge commented Jun 8, 2017

Also that ninja was insane

@encloinc

3 months later

@jwzimmer-zz

The new cloud data work has been deployed as a soft launch (accessible by specific URLs) on Production.

If anyone would like to help test (which would be greatly appreciated!), here's what you need to know:

  • Make sure the URL of the project you are editing ends in ?newcloud
    • If you load a project, the URL needs to end in (random project number for example) /projects/171781902/?newcloud
    • If you are going back and forth between the editor & project page, the URL still needs to have ?newcloud in it, like /?newcloud#editor or /?newcloud#player
    • If you refresh the page or create a new project, the ?newcloud part needs to be re-added
  • Cloud data in the "old"/ existing system on Production is not available in the new soft launch version, but it will be added when the system is rolled out globally (rather than as a soft launch)
    • When you are testing a project with e.g. a high score on Production, that high score won't be available at the ?newcloud version of the project
    • Data you create during the soft launch of the new cloud data system will be erased when the new work is deployed globally
  • The cloud monitor page is accessible at (random project number for example) /cloudmonitor/171781902/soft/
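
For anyone scripting their tests, the URL rules in the list above can be sketched as follows. The helper names are made up for illustration; only the `?newcloud` flag, the `#editor`/`#player` fragments and the `/cloudmonitor/<id>/soft/` path come from the list.

```python
from urllib.parse import urlsplit, urlunsplit


def with_newcloud(url: str) -> str:
    """Return url with the newcloud flag in its query string.

    Any #editor or #player fragment is kept after the query, and the
    flag is not added twice.
    """
    parts = urlsplit(url)
    flags = [q for q in parts.query.split("&") if q]
    if "newcloud" not in flags:
        flags.append("newcloud")
    return urlunsplit(parts._replace(query="&".join(flags)))


def cloud_monitor_path(project_id: int) -> str:
    """Path of the soft-launch cloud monitor page for a project."""
    return f"/cloudmonitor/{project_id}/soft/"
```

For example, `with_newcloud("https://scratch.mit.edu/projects/171781902/#editor")` gives `https://scratch.mit.edu/projects/171781902/?newcloud#editor`.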

Please report any bugs you find here! Thank you.

@CosmicWebServices

Made a small test for it... works great! https://scratch.mit.edu/projects/171867511/?newcloud

@joker314
Contributor

Could be just me, but this is my console output.

[image: screenshot of console output]

It continuously reconnects before giving an error of "WebSocket is already in CLOSING or CLOSED state.", then reconnects again, then gives the same error message, over and over again.

However, the cloud log correctly updates the variable. Meaning that despite the errors everything is working well.

So, just error messages -- no actual problem.

I'm using a Windows 10 computer, with my browser being Google Chrome, version 60.0.3112.101

@griffpatch

griffpatch commented Aug 22, 2017 via email

@thisandagain
Contributor

@joker314 Thanks for reporting! We'll take a look.

/cc @jwzimmer @colbygk

@thisandagain
Contributor

thisandagain commented Aug 23, 2017

@griffpatch The system is still throttled (otherwise we simply would not be able to scale and maintain it) but much more liberally than what is in place right now. The goal is to get everyone moved over to this new websocket-based system in the short term and then we'll continue to evaluate and add new features on top of this new infrastructure after Scratch 3.0 is released.

@joker314
Contributor

(Bug persists when all extensions are disabled)

@CosmicWebServices

On Firefox (latest stable) I get
Websocket closed, code:1000 reason: project_base.js:1:5740
07:31:14 PM | wrn | Connection closed to cloud server Object { reconnect: 1649.0207351744175, attempts: 3 } 171867511
07:31:16 PM | inf | Attempt reconnect to cloud server. 171867511
07:31:20 PM | inf | Successfully connected to cloud data server

Every 5 seconds or so
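
To put a number on how long each drop lasts, console lines in the format above can be paired up. This is a rough sketch that assumes the `HH:MM:SS AM/PM | level | message` layout shown in the paste; it ignores lines without a timestamp and does not handle a wrap past midnight.

```python
import re
from datetime import datetime
from typing import List

# Matches lines such as "07:31:14 PM | wrn | Connection closed to cloud server ..."
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2} [AP]M) \| (\w+) \| (.*)$")


def reconnect_gaps(log_lines: List[str]) -> List[float]:
    """Seconds between each 'Connection closed' and the next successful connect."""
    gaps: List[float] = []
    closed_at = None
    for line in log_lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # e.g. the untimestamped "Websocket closed" line
        stamp = datetime.strptime(match.group(1), "%I:%M:%S %p")
        message = match.group(3)
        if message.startswith("Connection closed"):
            closed_at = stamp
        elif message.startswith("Successfully connected") and closed_at is not None:
            gaps.append((stamp - closed_at).total_seconds())
            closed_at = None
    return gaps
```

Fed the four lines pasted above, it reports a single gap of 6 seconds between the close at 07:31:14 PM and the successful reconnect at 07:31:20 PM.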

@prail

prail commented Aug 25, 2017

@thisandagain You say that cloud is still throttled very well below the rates required for realtime multiplayer. I have a project here: https://scratch.mit.edu/projects/171930267/?newcloud that works quite well with some very basic multiplayer functionality. (Movement and costume changes.) Has this changed since you posted last?

@jwzimmer-zz

@CosmicWebServices Thanks for reporting that. Did you notice any problems in the functionality of your project when you saw the error? If you have a link to the project that occurred in as well as steps that cause it to happen, that would be helpful, too. 👍

@CosmicWebServices

@jwzimmer No, not really. The link is already above (the one I posted).

Suggestion: https://scratch.mit.edu/discuss/topic/274278/?page=2#post-2799178

@colbygk

colbygk commented Sep 11, 2017

Cloud data has now been migrated to the new websockets-based platform, and ?newcloud is no longer required for a project to use it.

@griffpatch

griffpatch commented Sep 22, 2017 via email

@griffpatch

griffpatch commented Sep 22, 2017 via email

@thisandagain
Contributor

Thanks @griffpatch. I'm going to move this over to another issue.

@colbygk

colbygk commented Sep 22, 2017

@griffpatch Thanks for working on cloud data projects!

Last night (21 Sep 2017), I deployed some changes to the new cloud data service that have improved connectivity.

Could you point me at a project where you're seeing the high score and data integrity issues?

@griffpatch

griffpatch commented Sep 22, 2017 via email
