
Autoflagging broken; Metasmoke runs into Stack API rate limits #622

Closed
tripleee opened this issue Apr 8, 2019 · 22 comments

Comments

@tripleee
Member

tripleee commented Apr 8, 2019

I am creating this issue for visibility, but currently don't have a good idea of what exactly broke or how to approach the problem.

Chat transcript: https://chat.stackexchange.com/transcript/message/49820653#49820653

Please feel free to update this ticket description into a proper bug report.

@makyen
Contributor

makyen commented Apr 8, 2019

Commit a3e1fea resolved the problem that resulted in not obeying the backoff that was (probably) supplied by the SE API.

MS has made very similar accesses every day at that time for the last two years. Thus, there's a related issue: the SE API triggers for this type of thing appear to have been made stricter. In addition, the criteria SE uses to determine when to unblock an IP also appear to have changed. In the past, IIRC, it was at most 24 hours.

@undo has sent a tweet to Nick_Craver, who will hopefully resolve the issue on SE's side.

@tripleee
Member Author

tripleee commented Apr 11, 2019

In Tavern on the Meta, Nick Craver writes:

hey sorry we've been slammed - I've confirmed what's happening (the bot was legit banned, temporarily... which is somehow not releasing). We're digging into what behavior changed in HAProxy here... it wasn't intentional in our upgrade process at all.

@tripleee
Member Author

Allegedly resolved now.

@angussidney
Member

angussidney commented Apr 13, 2019

The issue has arisen again, sometime between 21:36:39 UTC and 02:36:42 UTC (suspiciously traversing the 02:00 UTC mark again).

angussidney reopened this Apr 13, 2019
@makyen
Contributor

makyen commented Apr 13, 2019

Just for the record:
After fixing the previous block, in Charcoal HQ, Nick Craver wrote:

If it happens again, please ping me - we just found that there is a bug, but need to work with HAProxy to resolve it, we've been slammed and couldn't find out what we did to cause it and...I don't think we did, we just found an odd bug with stick-tables.

Nick Craver has already been pinged for the current block.

@tripleee
Member Author

Seems to be fixed now, can we reclose this?

@makyen
Contributor

makyen commented Apr 15, 2019

While we're no longer blocked, we haven't identified the reason we became blocked the second time. The last time the problem was looked at (i.e. after the first block), it was believed to be fixed such that we wouldn't be blocked again. If possible, we should get info from SE as to which endpoint we were hitting without obeying backoff. That would be one way for us to narrow down where the problem is. Alternately, we might get confirmation that we were inaccurately blocked the second time.

Otherwise, we're going to need to take a look at what might have been running during the time-span in which we know the block occurred. Of note was that we were blocked 7 days after the first block and the time of day was similar.

On SE's side, the only thing that's been stated as broken is that IP addresses aren't removed from the list of blocked IPs in the time-frames which are expected (i.e. IP addresses get added correctly, but not removed). Thus, at this time, there's nothing indicating we were blocked inappropriately (i.e. it currently appears that MS was still interacting with the SE API in a way SE doesn't want).

@Undo1
Member

Undo1 commented Apr 15, 2019

Likely the diamond check is respecting backoff, but only barely. Then some other process makes a request or two and HAProxy doesn't like it.

@ArtOfCode-
Member

I don't think this is a backoff issue - if you violate a backoff, you get a 400 and a JSON error; you don't get a 429 and an HTML TMR (too many requests) page. I think this is just a bug on SE's end.

@Undo1
Member

Undo1 commented Apr 15, 2019

I think violating a backoff enough gets you the HAProxy thing, though. So "not violating the backoff" should be a good strategy to avoid the SE-side bug.

@makyen
Contributor

makyen commented Apr 15, 2019

For the SE API, backoff is on a per-method basis (see the next to last paragraph here). A quick search indicates the only place the /users/{ids}/associated method is used on MS is in update_moderator_sites. Is update_moderator_sites being called concurrently from multiple processes? If so, then, yeah, that's going to be bad.

Reducing calls to that endpoint

If that endpoint/method remains as a problem, we can reduce the number of requests we make, to some extent. That endpoint accepts a list of up to 100 IDs. Currently, we're requesting a single user at a time. We could reduce the number of requests we make by requesting multiple users at once. How many requests this would actually save depends on the average number of SE sites on which each MS user has accounts. If all users are signed up for all SE sites, it would only reduce the number of requests by 1/8. If all users were only signed up for one SE site, then we'd reduce it by a factor of 100. Obviously, it will be somewhere in the middle. If we assume, on average, our users are signed up for 33 SE sites, then it would save 2/3 of the requests.
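The batching idea can be sketched roughly as follows (the ID values here are illustrative, not metasmoke's actual data; the SE API accepts semicolon-separated ID vectors of up to 100 IDs):

```ruby
# Hypothetical sketch: batch user IDs into groups of up to 100 per
# /users/{ids}/associated request, instead of one request per user.
user_ids = (1..562).to_a # 562 users, per the count mentioned below
batches = user_ids.each_slice(100).map { |ids| ids.join(';') }
puts batches.size # 6 requests (before response paging) instead of 562
```

Note that the response is still paged at 100 items, which is why the actual savings depend on how many associated accounts each user has.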

Can make fewer SE API calls if we use /users/moderators

I expect that when the code was written we had significantly fewer users than there were SE sites. Now that there are substantially more users than SE sites, it would be more efficient for us to use the /users/moderators endpoint. That endpoint provides a list of users who are moderators on a per-site basis. We'd have to make a request for each SE site, but doing so would reduce the number of requests we currently make by about 2/3 (i.e. we currently make at least 1 request per user and have 562 users). It would also result in the same number of requests, regardless of the number of users we have in the future.
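As a rough sketch of that approach (the site list here is illustrative; in practice it would come from the API's site list):

```ruby
# Hypothetical sketch: one /users/moderators request per site, instead of
# one /users/{ids}/associated request per user.
sites = %w[stackoverflow superuser serverfault]
urls = sites.map do |site|
  "https://api.stackexchange.com/2.2/users/moderators?site=#{site}&pagesize=100"
end
```

This keeps the request count proportional to the number of sites rather than the number of users, which is the property described above.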

Generally obeying backoff

Briefly looking through the /app/models/user.rb file, it appears that the two other requests which are made in that file (to other endpoints) don't check for backoff. It might be a good idea for us to go through the code and make sure that backoff is being checked for and obeyed for all requests to the SE API.

It would be reasonably easy to implement backoff delays in a way similar to what's currently being done for /users/{ids}/associated. When backoff exists, that implementation just sleeps the process for the number of seconds defined in backoff after the request. Such an implementation would at least cover the backoff issue for a single process not making too many requests. Obviously, that implementation won't prevent concurrent processes from making requests. In addition, it, potentially, will add delays where a delay is not actually needed (i.e. a delay is not actually needed unless another request is going to be made to the same endpoint).
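A minimal sketch of that sleep-on-backoff approach (an assumption for illustration, not metasmoke's actual implementation):

```ruby
require 'json'

# Inspect an SE API response body and return how many seconds to sleep
# before the next request to the same method; 0 when no backoff was supplied.
def backoff_seconds(body)
  JSON.parse(body).fetch('backoff', 0)
end

backoff_seconds('{"items":[],"backoff":10}') # => 10
backoff_seconds('{"items":[]}')              # => 0
```

A caller would `sleep backoff_seconds(response.body)` after each request. Tracking the backoff expiry per method (e.g. in a hash keyed by endpoint) instead of sleeping immediately would avoid the unnecessary-delay problem noted above, at the cost of a bit more bookkeeping.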

@Undo1
Member

Undo1 commented Apr 15, 2019

We could just hit https://stackexchange.com/about/moderators instead and get 'em all in one quick request. Needs a bit more parsing, but probably cheaper for everyone involved.

@makyen
Contributor

makyen commented Apr 16, 2019

That's a good point. While SE has clearly said they don't want pages scraped that a user isn't actually looking at, loading and parsing a single page appears to be quite a bit better than a couple hundred (more or less) SE API requests.

@Undo1
Member

Undo1 commented Apr 16, 2019

@makyen Got a link handy to where they said that? Sounds familiar, but I can't place it.

@makyen
Contributor

makyen commented Apr 16, 2019

@Undo1 I'm not finding an actual reference for that statement. I could be wrong about it, but it's the attitude which I've internalized from reading Metas, etc. I mostly associate it with discussions of getting information from deleted questions and answers. I certainly thought it originated from at least one SE employee.

The one actual reference I was able to find in a short time is: Etiquette of Screen-scraping Stack Overflow? That post was written prior to the existence of the SE API, but has been updated since the SE API was released. Essentially, the answer says that scraping is OK, if you do it responsibly and follow their guidelines.

@tripleee
Member Author

How often do we really need to update the mod information? Could we cache it between elections (with an option for manual updates in between, of course)?

Sent with GitHawk

@angussidney
Member

It's important to keep the mod information up to date so that we never accidentally cast an autoflag using a moderator account after an election. Only checking for mod info right after elections would reduce the number of API calls required, but is there an easy way to monitor elections network-wide programmatically?

@Undo1
Member

Undo1 commented Apr 17, 2019

Not really, and you still have the problem of people getting called up later or appointed as pro tems.

@tripleee
Member Author

I'm thinking more along the lines of reducing the API calls if we have multiple metasmoke restarts in rapid succession. Maybe cache mod information for something like 12 hours?

Sent with GitHawk

@Undo1
Member

Undo1 commented Apr 17, 2019

It's a cron job; restarts have no effect.

@makyen
Contributor

makyen commented Apr 22, 2019

We're blocked again. The first flag that was blocked was at 02:01:57 UTC.

@Undo1
Member

Undo1 commented May 28, 2019

Declaring this 'fixed', since we haven't seen it in a long while. Likely a combination of SE-side fixes and our rate changes for diamond checks.

Undo1 closed this as completed May 28, 2019
5 participants