
Rewrite and address random 404s #23

Merged
merged 10 commits into from
Apr 6, 2021

Conversation

@Naramsim (Member) commented Mar 30, 2021

This PR:

  • drops TypeScript, which wasn't properly used and was more of a hassle than a help.
  • drops Yarn in favor of NPM; Yarn was partially incompatible with Firebase and with the engines field of our package.json.
  • drops Axios in favor of Got.
  • with Got, enables an 8-second GET timeout and a single retry in case the requested index.json is not found or times out.
  • caches 404s for 1h instead of 24h.
  • returns different, more suitable HTTP status codes.

During my analysis I think I found out why we get random 404 errors. The logs of our cloud function show that now and then the function times out (10s), and I suspect Cloudflare then caches the empty response. So I came up with a solution that handles the timeouts with retry logic: the function will always return a response within at most 8+8 seconds.
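The retry behaviour can be sketched as a small standalone helper. This is a simplified illustration, not the PR's actual code (which relies on Got's built-in timeout and retry options); `request` is a hypothetical async function that performs the GET:

```javascript
// Simplified sketch of the retry logic described above. The real PR uses
// Got's built-in options; here the idea is shown as a standalone helper.
// `request` is any async function that performs the GET (hypothetical name).
async function withOneRetry(request) {
  try {
    return await request();
  } catch (firstError) {
    // One retry on any failure (e.g. an 8s timeout). If the retry also
    // fails, the error propagates and the function can answer with a 504,
    // which keeps the total wait bounded by roughly 8 + 8 seconds.
    return await request();
  }
}
```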

External changes:

  • Disable the Cache Everything feature of Cloudflare, since our Cloudflare Cache TTL is already set to respect headers. This way, I hope, we will cache our 200s for 1d and our 404s for 1h, while other possible errors (504) will not be cached and will thus be retried later by someone else instead of being served from the cache.
  • Increase the GCP function timeout from the current 10s to 20s. The important thing is that the new limit is higher than 8+8s (one timed-out request plus one timed-out retry).
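With Cloudflare set to respect origin headers, the per-status cache lifetimes come down to which Cache-Control header the function sends. A minimal sketch, assuming an Express-style `res` object as used by Firebase HTTPS functions (not the PR's actual code):

```javascript
// Sketch of per-status Cache-Control headers, assuming an Express-style
// `res` object as used by Firebase HTTPS functions. Lifetimes follow the
// plan above: 200s cached for 1 day, 404s for 1 hour, errors not cached.
function sendWithCache(res, status, body) {
  if (status === 200) {
    res.set('Cache-Control', 'public, max-age=86400'); // 1 day
  } else if (status === 404) {
    res.set('Cache-Control', 'public, max-age=3600'); // 1 hour
  } else {
    // 5xx (e.g. a 504 after a double timeout) should not be cached, so
    // the next caller triggers a fresh attempt instead of a cached error.
    res.set('Cache-Control', 'no-store');
  }
  return res.status(status).send(body);
}
```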

https://staging.pokeapi.co/api/v2/

@Naramsim Naramsim requested a review from a team April 2, 2021 20:36
@Kronopt (Member) commented Apr 5, 2021

Hey @Naramsim.
I looked at the PR, but I'm not familiar enough with Firebase and Cloudflare to give much insight.

I can, however, comment on your analysis.

Looking at the logs of our cloud function it happens that now and then the function times out (10s) and thus, I think, Cloudflare caches the empty response.

We should probably look into why these timeouts are happening.
Other than that, the timeout retry logic is nice, but why are we caching results != 200 anyway? Shouldn't we just be caching 200s and ignoring everything else?

@phalt (Member) left a comment

Looks good to me based on my familiarity with Firebase.

- checkout
- run: yarn --ignore-engines --cwd functions
# TODO: write some tests and run them here
# test:

Just drop this?

@Naramsim (Member, Author) commented Apr 6, 2021

@Kronopt we need to cache calls for non-existent objects.

For example, the call to https://pokeapi.co/api/v2/pokemon/cat is a valid call and should return a 404.


About the timeouts, I struggle to find a reason. It's just my bet that they are the problem causing the / issue, but maybe they're not. In any case, those timeouts do happen, and the most logical explanation is that they happen inside Axios when requesting the index.json file from our Object Storage. I find it hard to believe, though: it's strange that either GCP or Axios would time out this often.

Also, the code doesn't seem to contain any loops or deadlocks, so I don't think it's a problem with the code itself.

I also remember that this issue appeared at a random point in time, not after a specific change in the code.

@Naramsim (Member, Author) commented Apr 6, 2021

I'll merge this then. I'll keep an eye on the logs and if needed I'll roll back.

@Naramsim Naramsim merged commit a8b88cc into master Apr 6, 2021
@Naramsim Naramsim deleted the staging branch April 6, 2021 18:16
@Naramsim Naramsim restored the staging branch April 6, 2021 18:16
@Naramsim (Member, Author) commented Apr 7, 2021

Quick update: since yesterday (12h ago) we returned 10 504s and 0 500s. We got countless timeouts that were successfully retried. The good news is that the 504s weren't cached, so the next call returned the proper object.
