
Performance Issue #21

Closed

timbuchwaldt opened this issue Aug 17, 2016 · 17 comments · Fixed by #44

Comments

@timbuchwaldt

timbuchwaldt commented Aug 17, 2016

As the readme doesn't mention it: django-rest-knox has a major performance issue.

https://github.com/James1345/django-rest-knox/blob/master/knox/auth.py#L50

This line causes major performance hits once one reaches significant user numbers, as each API request leads to all tokens being sent to the Django instances. Once you reach a few tens of thousands of users and more than a handful of requests per second, the application spends more time receiving/parsing SQL and guessing tokens than doing actual work.
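The pattern in question looks roughly like this (a paraphrase for illustration, not a verbatim copy of knox/auth.py; `token_matches` stands in for knox's salt-and-hash comparison):

```python
# Paraphrase of the hot path being criticized: every request iterates
# over the ENTIRE token table, hashing and comparing row by row.
def authenticate_credentials(token, AuthToken, token_matches):
    for auth_token in AuthToken.objects.all():  # fetches ALL tokens, O(N)
        if token_matches(auth_token, token):    # hash + compare per row
            return auth_token.user, auth_token
    return None
```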

One way to change this would be the following: split the token into two parts:

  • Part one is an identifier stored in its own database column, e.g. a random string
  • Part two is the token knox currently uses

This would mean querying for only a single row (the one matching part one) and then comparing a single hash, as sketched below.
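A minimal sketch of the idea, assuming an 8-character prefix and a SHA-512 digest (both illustrative choices, not necessarily what knox would ship):

```python
# Sketch of the two-part token scheme; names and parameters are assumptions.
import binascii
import hashlib
import hmac
import os

TOKEN_KEY_LENGTH = 8  # assumed length of the indexed prefix ("part one")

def create_token():
    """Return (token_for_client, token_key_to_index, digest_to_store)."""
    token = binascii.hexlify(os.urandom(32)).decode()
    digest = hashlib.sha512(token.encode()).hexdigest()
    return token, token[:TOKEN_KEY_LENGTH], digest

def authenticate(token, candidates_for_key):
    """candidates_for_key(token_key) -> stored digests for that indexed key."""
    expected = hashlib.sha512(token.encode()).hexdigest()
    # The indexed prefix narrows the lookup to (almost always) one row;
    # only those rows' digests are compared, in constant time.
    for digest in candidates_for_key(token[:TOKEN_KEY_LENGTH]):
        if hmac.compare_digest(expected, digest):
            return True
    return False
```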

Fun fact: while load testing I watched my servers handle more than 100 Mbit/s of SQL traffic due to this bug.

@jerzyk

jerzyk commented Aug 26, 2016

@timbuchwaldt agreed, this is a real issue, especially when your user base is growing and those users are opening multiple sessions

I was thinking about how to limit the subset of data fetched from the database, e.g. by filtering only active tokens, or by extending the model to include e.g. the IP address and filtering all tokens by a specific IP in the first place

this is similar to your approach, but then we lose one piece of functionality - self-cleaning of the token table

finally, what we implemented here is pretty simple: we use the cache to resolve tokens:

  • when a user accesses the site for the first time, their token is not in the cache, so there is a call to the database; when successful, the cache entry is set
  • subsequent calls get the data from the cache itself, which is very fast as it is a direct key access

this is a very simple solution and saves us a lot of hassle
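A rough sketch of that flow (illustrative, assuming a shared memcached/redis cache configured as Django's default cache; `lookup_in_db` stands for the existing database resolution and the timeout is an assumption):

```python
# Sketch of cache-backed token resolution (not the actual production code).
from django.core.cache import cache

CACHE_TIMEOUT = 3600  # assumption: entries live for one hour

def resolve_token(token, lookup_in_db):
    """lookup_in_db(token) is the existing (slow) database resolution."""
    result = cache.get(token)
    if result is None:
        result = lookup_in_db(token)   # slow path, first request only
        if result is not None:
            cache.set(token, result, CACHE_TIMEOUT)
    return result                      # subsequent calls: direct key access
```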

one more place where performance could be improved is setting the cache after a successful login (at token creation), so the cache set is not delayed until the first token request, but we have not implemented that.

there was one more idea - to hash the keys in the cache itself, e.g. take the md5 or sha-xxx of the token and use that as the cache key - but in our situation it would not bring any additional security and would add some load, delaying the whole process.

@timbuchwaldt
Author

Limiting to just a subset of the data is basically what I proposed (and @Raphaa implemented in a fork): just add a second column containing part of the token. Limiting by IP has obvious problems (e.g. taking your device to another place).

Self-cleaning seems like a nice property, but it could easily be implemented as a task run by cron, Celery beat or similar systems. Heck, even running DELETE FROM tokens WHERE created_at < DATEADD(day, -30, GETDATE()) on a percentage of requests would do the trick. But passing around all the data on every request is just slow.
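For illustration, such a periodic cleanup could be as small as this (a sketch assuming knox's AuthToken model with its expires field; the Celery task wiring is hypothetical):

```python
# Hypothetical periodic cleanup task; not part of knox itself.
from celery import shared_task
from django.utils import timezone

@shared_task
def cleanup_expired_tokens():
    from knox.models import AuthToken  # assumes the model exposes `expires`
    deleted, _ = AuthToken.objects.filter(expires__lt=timezone.now()).delete()
    return deleted  # schedule e.g. daily via Celery beat or cron
```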

Regarding caching: say one has 50 Django instances running - a user might turn up on any of the 50, so the cache would be empty most of the time.

@jerzyk

jerzyk commented Aug 26, 2016

Self-cleaning seems like a nice property, but it could easily be implemented as a task run by cron, Celery beat or similar systems. Heck, even running DELETE FROM tokens WHERE created_at < DATEADD(day, -30, GETDATE()) on a percentage of requests would do the trick. But passing around all the data on every request is just slow.

sure, but this is another deployment task to remember

Regarding caching: say one has 50 Django instances running - a user might turn up on any of the 50, so the cache would be empty most of the time.

wait, what? are you talking about a local cache? that shouldn't be used in production... with memcached/redis there is no issue at all

@jasjukaitis
Contributor

Using memcached/redis is just sellotaping over the problem. The issue (broken by design) would still be there, and it would also require an additional deployment task, just like cleaning the database. Self-cleaning on every request is also bad design for my taste. Deploying Django is always more than copy-pasting via FTP, so I don't know why that should be unreasonable.

@James1345
Member

@timbuchwaldt Yes, I agree - no debate needed. Will add it to the dev todos.

@tumbak

tumbak commented Oct 30, 2016

Are there any updates on this? We are having major performance issues, and so far we've pinpointed the issue to this exact case.

I am willing to write a pull request for this, just need to make sure it aligns with the developers' suggested solution.

@mheppner

@James1345 do you have any updates for this? Can @Raphaa submit a pull request of his fork?

@rootvar

rootvar commented Dec 1, 2016

As much as I appreciate the work that has been done on this app, it's not production-ready because of this bug.

@tumbak's solution seems to work fine. One suggestion: you commented out the expired-token deletion but didn't add any check in its place in the same method. It still needs this line imo:
if auth_token.expires < timezone.now():

Having a management command is helpful, but it is not obvious to novice users (and easy to forget in general). If we keep deletion where it was originally, what would the overhead be? The probability of a duplicate token_slice is very small, so filtering by it and then deleting expired tokens doesn't seem to present a performance issue.
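In context, the suggested check would look roughly like this (a sketch; exception handling simplified, and knox's expires field may be nullable):

```python
# Sketch of the expiry check rootvar suggests keeping in the auth path.
from django.utils import timezone
from rest_framework import exceptions

def enforce_expiry(auth_token):
    if auth_token.expires is not None and auth_token.expires < timezone.now():
        auth_token.delete()  # keep the self-cleaning behaviour on access
        raise exceptions.AuthenticationFailed('Token has expired.')
```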

@jasjukaitis
Contributor

I've added my performance fix in two pull requests (two versions): #28 and #29

@belugame
Collaborator

belugame commented Dec 4, 2016

@Raphaa Thanks a lot. Just to be sure: for a production system, updating to the #28 version means that all current tokens become invalid, right? Should we then possibly add a data migration to delete all tokens first? Or some way of making devs aware that this will break current tokens.

Update: my uncertainty is whether this should then be considered a backwards-incompatible release.

@James1345
Member

@belugame @Raphaa I wouldn't consider invalidating all tokens a backwards-incompatible fix, provided the migration deletes them all, as that has the same effect as a logout or an expiring token. Provided the mechanism for getting/submitting tokens is unchanged, it would be transparent to REST clients connecting to an API that uses knox authentication, and the migration should be trivial for the maintainer of the API to apply.
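Such a data migration could be as simple as the following sketch (hypothetical migration name and dependency; knox's actual migration history may differ):

```python
# Hypothetical data migration that wipes all tokens on upgrade.
from django.db import migrations

def delete_all_tokens(apps, schema_editor):
    AuthToken = apps.get_model('knox', 'AuthToken')
    AuthToken.objects.all().delete()  # clients simply log in again

class Migration(migrations.Migration):
    dependencies = [('knox', '0001_initial')]
    operations = [
        migrations.RunPython(delete_all_tokens, migrations.RunPython.noop),
    ]
```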

@jasjukaitis
Contributor

jasjukaitis commented Dec 8, 2016 via email

@bf0

bf0 commented Dec 25, 2016

@Raphaa

I'm trying to estimate the at-scale performance with your changes; could I possibly have your input? Say for 10 million stored tokens: we retrieve all tokens matching the first 8 characters (expense 1), then iterate over them to find a match (expense 2). How big might these expenses be? I'm not sure about some of the math.

I was starting to use knox, saw the authenticate_credentials() method and knew it wouldn't scale. Now I'm determining whether I should continue / whether your updates would resolve the problem. Intuitively your fix should resolve the issue; I'm just double-checking.

@rootvar

rootvar commented Dec 25, 2016

@bf0

not @Raphaa, but it should be something like 62^8 possible prefixes (62 alphanumeric characters to the 8th power)

@bf0

bf0 commented Dec 25, 2016

@rootvar Thanks. Hmm, how can I translate that into the average number of tokens sharing the same first 8 characters, in order to estimate the time taken for the db lookup (expense 1)?

edit - oh, I see what you're saying: the chance of sharing the same first 8 characters should be ~1:62^8, i.e. very rarely will there be > 1 token in the query, if I have that right?
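As a back-of-envelope check (my own arithmetic, under the 62-character-alphabet assumption above): with N stored tokens, the expected number of other tokens sharing a given 8-character prefix is roughly (N - 1) / 62^8:

```python
# Back-of-envelope for the 10-million-token scenario.
N = 10_000_000                # stored tokens
keyspace = 62 ** 8            # possible 8-char alphanumeric prefixes
print(keyspace)               # 218340105584896 (~2.2e14)
print((N - 1) / keyspace)     # ~4.6e-08 expected extra rows per lookup
```

So a prefix query almost always returns exactly one row.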

@jasjukaitis
Contributor

I can't give any numbers, but it won't often be > 1. I'd say other things in your app are more expensive. ;)

@belugame
Collaborator

@Raphaa I can't merge your pull request at the moment as the tests are failing - could you look into it?
