
Autoflagging broken; Metasmoke runs into Stack API rate limits #622

Closed
tripleee opened this issue Apr 8, 2019 · 22 comments

Comments

@tripleee
Member

tripleee commented Apr 8, 2019

I am creating this issue for visibility, but currently don't have a good idea of what exactly broke or how to approach the problem.

Chat transcript: https://chat.stackexchange.com/transcript/message/49820653#49820653

Please feel free to update this ticket description into a proper bug report.

@makyen
Contributor

makyen commented Apr 8, 2019

Commit a3e1fea resolved the problem that resulted in not obeying the backoff that was (probably) supplied by the SE API.

MS has made very similar accesses every day at that time for the last two years. Thus, there's a related issue: the SE API triggers for this type of thing appear to have been made stricter. In addition, the criteria SE uses to determine when to unblock an IP also appear to have changed. In the past, IIRC, it was at most 24 hours.

@undo has sent a tweet to Nick_Craver, who will hopefully resolve the issue on SE's side.

@tripleee
Member Author

tripleee commented Apr 11, 2019

In Tavern on the Meta, Nick Craver writes:

hey sorry we've been slammed - I've confirmed what's happening (the bot was legit banned, temporarily... which is somehow not releasing). We're digging into what behavior changed in HAProxy here... it wasn't intentional in our upgrade process at all.

@tripleee
Member Author

Allegedly resolved now.

@angussidney
Member

angussidney commented Apr 13, 2019

The issue has arisen again, sometime between 21:36:39 UTC and 02:36:42 UTC (suspiciously traversing the 02:00 UTC mark again).

angussidney reopened this Apr 13, 2019
@makyen
Contributor

makyen commented Apr 13, 2019

Just for the record:
After fixing the previous block, in Charcoal HQ, Nick Craver wrote:

If it happens again, please ping me - we just found that there is a bug, but need to work with HAProxy to resolve it, we've been slammed and couldn't find out what we did to cause it and...I don't think we did, we just found an odd bug with stick-tables.

Nick Craver has already been pinged for the current block.

@tripleee
Member Author

Seems to be fixed now, can we reclose this?

@makyen
Contributor

makyen commented Apr 15, 2019

While we're no longer blocked, we haven't identified the reason we became blocked the second time. The last time the problem was looked at (i.e. after the first block), it was believed to be fixed such that we wouldn't be blocked again. If possible, we should get info from SE as to which endpoint we were hitting without obeying backoff. That would be one way for us to narrow down where the problem is. Alternately, we might get confirmation that we were inaccurately blocked the second time.

Otherwise, we're going to need to take a look at what might have been running during the time-span in which we know the block occurred. Of note was that we were blocked 7 days after the first block and the time of day was similar.

On SE's side, the only thing that's been stated as broken is that IP addresses aren't removed from the list of blocked IPs in the time-frames which are expected (i.e. IP addresses get added correctly, but not removed). Thus, at this time, there's nothing indicating we were blocked inappropriately (i.e. it currently appears that MS was still interacting with the SE API in a way SE doesn't want).

@Undo1
Member

Undo1 commented Apr 15, 2019

Likely the diamond check is respecting backoff, but only barely. Then some other process makes a request or two and HAProxy doesn't like it.

@ArtOfCode-
Member

I don't think this is a backoff issue - if you violate a backoff, you get a 400 and a JSON error; you don't get a 429 and an HTML TMR (too many requests) page. I think this is just a bug on SE's end.

@Undo1
Member

Undo1 commented Apr 15, 2019

I think violating a backoff enough gets you the HAProxy thing, though. So "not violating the backoff" should be a good strategy to avoid the SE-side bug.

@makyen
Contributor

makyen commented Apr 15, 2019

For the SE API, backoff is on a per-method basis (see the next to last paragraph here). A quick search indicates the only place the /users/{ids}/associated method is used on MS is in update_moderator_sites. Is update_moderator_sites being called concurrently from multiple processes? If so, then, yeah, that's going to be bad.

Reducing calls to that endpoint

If that endpoint/method remains as a problem, we can reduce the number of requests we make, to some extent. That endpoint accepts a list of up to 100 IDs. Currently, we're requesting a single user at a time. We could reduce the number of requests we make by requesting multiple users at once. How many requests this would actually save depends on the average number of SE sites on which each MS user has accounts. If all users are signed up for all SE sites, it would only reduce the number of requests by 1/8. If all users were only signed up for one SE site, then we'd reduce it by a factor of 100. Obviously, it will be somewhere in the middle. If we assume, on average, our users are signed up for 33 SE sites, then it would save 2/3 of the requests.
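The batching idea can be sketched roughly as follows (the ID values here are illustrative, not metasmoke's actual data; the SE API accepts semicolon-separated ID vectors of up to 100 IDs):

```ruby
# Hypothetical sketch: batch user IDs into groups of up to 100 per
# /users/{ids}/associated request, instead of one request per user.
user_ids = (1..562).to_a # 562 users, per the count mentioned below
batches = user_ids.each_slice(100).map { |ids| ids.join(';') }
puts batches.size # 6 requests (before response paging) instead of 562
```

Note that the response is still paged at 100 items, which is why the actual savings depend on how many associated accounts each user has.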

Can make fewer SE API calls if we use /users/moderators

I expect that when the code was written we had significantly fewer users than there were SE sites. Now that there are substantially more users than SE sites, it would be more efficient for us to use the /users/moderators endpoint. That endpoint provides a list of users who are moderators on a per-site basis. We'd have to make a request for each SE site, but doing so would reduce the number of requests we currently make by about 2/3 (i.e. we currently make at least 1 request per user and have 562 users). It would also result in the same number of requests, regardless of the number of users we have in the future.
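As a rough sketch of that approach (the site list here is illustrative; in practice it would come from the API's site list):

```ruby
# Hypothetical sketch: one /users/moderators request per site, instead of
# one /users/{ids}/associated request per user.
sites = %w[stackoverflow superuser serverfault]
urls = sites.map do |site|
  "https://api.stackexchange.com/2.2/users/moderators?site=#{site}&pagesize=100"
end
```

This keeps the request count proportional to the number of sites rather than the number of users, which is the property described above.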

Generally obeying backoff

Briefly looking through the /app/models/user.rb file, it appears that the two other requests which are made in that file (to other endpoints) don't check for backoff. It might be a good idea for us to go through the code and make sure that backoff is being checked for and obeyed for all requests to the SE API.

It would be reasonably easy to implement backoff delays in a way similar to what's currently being done for /users/{ids}/associated. When backoff exists, that implementation just sleeps the process for the number of seconds defined in backoff after the request. Such an implementation would at least cover the backoff issue for a single process not making too many requests. Obviously, that implementation won't prevent concurrent processes from making requests. In addition, it, potentially, will add delays where a delay is not actually needed (i.e. a delay is not actually needed unless another request is going to be made to the same endpoint).
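A minimal sketch of that sleep-on-backoff approach (an assumption for illustration, not metasmoke's actual implementation):

```ruby
require 'json'

# Inspect an SE API response body and return how many seconds to sleep
# before the next request to the same method; 0 when no backoff was supplied.
def backoff_seconds(body)
  JSON.parse(body).fetch('backoff', 0)
end

backoff_seconds('{"items":[],"backoff":10}') # => 10
backoff_seconds('{"items":[]}')              # => 0
```

A caller would `sleep backoff_seconds(response.body)` after each request. Tracking the backoff expiry per method (e.g. in a hash keyed by endpoint) instead of sleeping immediately would avoid the unnecessary-delay problem noted above, at the cost of a bit more bookkeeping.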

@Undo1
Member

Undo1 commented Apr 15, 2019

We could just hit https://stackexchange.com/about/moderators instead and get 'em all in one quick request. Needs a bit more parsing, but probably cheaper for everyone involved.

@makyen
Contributor

makyen commented Apr 16, 2019

That's a good point. While SE has clearly said they don't want pages scraped that a user isn't actually looking at, loading and parsing a single page appears to be quite a bit better than a couple hundred (more or less) SE API requests.

@Undo1
Member

Undo1 commented Apr 16, 2019

@makyen Got a link handy to where they said that? Sounds familiar, but I can't place it.

@makyen
Contributor

makyen commented Apr 16, 2019

@Undo1 I'm not finding an actual reference for that statement. I could be wrong about it, but it's the attitude which I've internalized from reading Metas, etc. I mostly associate it with discussions of getting information from deleted questions and answers. I certainly thought it originated from at least one SE employee.

The one actual reference I was able to find in a short time is: Etiquette of Screen-scraping Stack Overflow? That post was written prior to the existence of the SE API, but has been updated since the SE API was released. Essentially, the answer says that scraping is OK, if you do it responsibly and follow their guidelines.

@tripleee
Member Author

How often do we really need to update the mod information? Could we cache it between elections (with an option for manual updates in between, of course)?

Sent with GitHawk

@angussidney
Member

It's important to keep the mod information up to date so that we never accidentally cast an autoflag using a moderator account after an election. Only checking for mod info right after elections would reduce the number of API calls required, but is there an easy way to monitor elections network-wide programmatically?

@Undo1
Member

Undo1 commented Apr 17, 2019

Not really, and you still have the problem of people getting called up later or appointed as pro tems.

@tripleee
Member Author

I'm thinking more along the lines of reducing the API calls if we have multiple metasmoke restarts in rapid succession. Maybe cache mod information for something like 12 hours?

Sent with GitHawk

@Undo1
Member

Undo1 commented Apr 17, 2019

It's a cron job; restarts have no effect.

@makyen
Contributor

makyen commented Apr 22, 2019

We're blocked again. The first flag that was blocked was at 02:01:57 UTC.

@Undo1
Member

Undo1 commented May 28, 2019

Declaring this 'fixed', since we haven't seen it in a long while. Likely a combination of SE-side fixes and our rate changes for diamond checks.

Undo1 closed this as completed May 28, 2019
5 participants