
Browser ID in k-anonymity submissions #1000

Open
martinthomson opened this issue Jan 22, 2024 · 3 comments

Comments

@martinthomson

The explainer talks about having a low-entropy identifier for each browser. This appears to be so that Join requests from the same browser can be linked. That is, a browser can refresh a Join it previously made without contributing toward the threshold again.

This might also reduce the total state exposure of the k-anonymity server. If $j$ bits of data are used, the server only needs to keep $2^j$ items for each hash. That's still an awful lot, but there is now a cap.

Concentrating on the first one, I think that this is better addressed by having the client simply indicate when it is refreshing an existing item. The client could provide an approximate (noised) estimate of when the last entry was added, or of how long it has been since the item was last added. My sense is that you could tolerate a lot of noise on this estimate and still have a functioning system that doesn't involve a persistent client identifier, even a low-entropy one. The time granularity only needs to be the period ($p$).
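A noised refresh signal could be as simple as a "periods since last join" value with random jitter; here's a minimal sketch, assuming uniform noise of a couple of periods (the function name, noise width, and noise distribution are all illustrative, not from the explainer):

```python
import random

def noised_refresh_age(true_age_periods: int, noise_periods: int = 2) -> int:
    """Report approximately how many periods ago this item was last
    joined. The added jitter means the value cannot serve as a stable
    per-client identifier, yet the server can still move a count from
    roughly the right bucket."""
    noisy = true_age_periods + random.randint(-noise_periods, noise_periods)
    return max(0, noisy)  # clamp: an age cannot be negative
```

Because the server only needs period-level granularity, even a few periods of error just moves a count between adjacent buckets, which barely perturbs the windowed total.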

The state-limiting thing is not really that big of a deal. A server can maintain a count for each period in the tracked window ($w$). Then the server only has to maintain $\frac{w}{p}=720$ numbers for each hash, which is plumb between $j=9$ and $j=10$ if you are just counting state¹. Refreshes just mean moving one count from a previous period to the latest, as opposed to adding to the latest.
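The per-period counting scheme could be sketched like this (the class, window size, and method names are illustrative, not any actual server's API):

```python
from collections import deque

class HashCounter:
    """Join counts for one hash, bucketed by period. Keeps w/p buckets
    (720 in the example above) rather than per-client state."""

    def __init__(self, window_periods: int = 720):
        # Oldest bucket is at index 0, the current period at index -1.
        self.buckets = deque([0] * window_periods, maxlen=window_periods)

    def advance_period(self) -> None:
        # The oldest bucket falls out of the window; a new one opens.
        self.buckets.append(0)

    def join(self) -> None:
        # A fresh join counts toward the current period.
        self.buckets[-1] += 1

    def refresh(self, periods_ago: int) -> None:
        # A refresh moves one count from an earlier period to the
        # latest, instead of adding a new one to the total.
        if periods_ago > 0 and self.buckets[-1 - periods_ago] > 0:
            self.buckets[-1 - periods_ago] -= 1
        self.buckets[-1] += 1

    def total(self) -> int:
        return sum(self.buckets)
```

With this shape, a refresh leaves the windowed total unchanged, which is exactly the property that makes a per-client identifier unnecessary for linkage.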

There is some clever rate-limiting token stuff that can be used to prevent a client from refreshing too often, if you really want to go there. You could also issue unlinkable cookies that are bound to a specific period. I don't think that you want to do any of that, but I'm mentioning it just in case.

Footnotes

  1. If we're talking about state, you don't need to keep 720 values if you have far more than the threshold in recent buckets, so more popular hashes can be even cheaper to track².

  2. I don't know what that does for timing side-channel leaks, but if the hash really is popular, it shouldn't matter because it will return an over-threshold result. I'd be more concerned about leaks distinguishing zero from one.

@palenica

The $j$-bit identifier is intended as a measure to prevent malicious clients from invoking Join on the same object repeatedly. (Honest clients are capable of rate-limiting themselves.) The $j$-bit id is cryptographically linked via private state tokens to a full-entropy identifier (a Google account credential) that is difficult for an attacker to forge.

We chose the $j$-bit design for Chrome's initial implementation as it could be built largely out of existing components. Going forward, we are exploring the use of "Anonymous Counting Tokens" as a replacement for the $j$-bit identifier.

Happy to discuss the relative merits of these or other anti-abuse techniques.

@martinthomson
Author

I did not realize that you would want to authenticate this identifier. That seems entirely unnecessary to me.

For a non-abusive client (determined by means unspecified, but you might assume the use of your PST design), what risk is there of a client decrementing a counter from a previous period? You are already extending them the power to increment the current count, so if they simultaneously decrement a previous one, what advantage have they gained?

The difference is between sustaining the count above the threshold (something they are already able to do) and doing so for a perhaps shorter duration¹. I see no value in the additional complication of binding identity to tokens.

I can see some value in rate limiting additions from the same client, maybe. Maybe you don't want a single client to be able to dump multiple values in a bucket if it is able to obtain a source of tokens. However, cryptographic protection is a mighty large hammer to apply to this tiny nut. Recall also that your double-spend protection needs to span every bucket in the system, as we have some hard constraints on the problem: clients need lots of tokens to be available, as they might view a great many advertisements; and the threshold is very low. An attacker with access to just over 50 clients could reasonably fill a very large number of buckets without violating any form of cryptographic Sybil protection you might deploy. To the extent that a service like this is useful (I have serious reservations), I tend to think that solutions are better sought elsewhere.

Footnotes

  1. Imagine you have buckets $a < b < c$, where $c$ is the current period and the other two are earlier. If the count of joins in $a+b+c$ is over the threshold, but $b+c$ is not, then you have two cases. In the first, the client adds to $c$. If that causes $b+c$ to exceed the threshold, the client has changed the outcome once $a$ is excluded by the passing of time. In the second, the client moves a join from bucket $b$ to $c$, which does not change the outcome when only $a$ is excluded. Once $b$ is also excluded, the two cases are identical.
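The two cases in that footnote can be checked with concrete numbers (the threshold and bucket counts here are made up purely for illustration):

```python
THRESHOLD = 30  # illustrative k, not a real parameter value

def over(counts) -> bool:
    """Is the windowed join count over the threshold?"""
    return sum(counts) > THRESHOLD

# Buckets a (oldest), b, c (current): a+b+c = 32 is over, b+c = 30 is not.
a, b, c = 2, 20, 10

# Case 1: the client adds a fresh join to c. Once a ages out of the
# window, the outcome has changed from under- to over-threshold.
assert over([b, c + 1])

# Case 2: the client moves its earlier join from b to c. Once a ages
# out, nothing has changed: b+c is still 30.
assert not over([b - 1, c + 1])

# Once b also ages out, both cases leave c at 11, so they are identical.
```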

@palenica

I think of k-anonymity as a (very imperfect) anti-abuse mechanism. If all users of the Protected Audience API were nice enough to show the same ad to many people and promptly delete data associated with any ad that fails to reach a minimum impression threshold, we would not need enforcement.
There are costs and benefits to any enforcement action, and it's fair to ask whether the benefits outweigh the costs. Having chosen to enforce, we'd want to make the enforcement mechanism costly to subvert. Otherwise the enforcement will be imposing a cost on "honest" API users, while doing little to deter "abusive" behavior.

I'd say our design was guided by the following criteria:

  • Callers of the Join API should be user agents (devices) representing actual human users.
  • There should be a limit on how many Join calls can be made by a caller per day. This limit can't be too low, given the number of ads a typical user encounters during their day.
  • There should be a much stricter limit (ideally 1) on how many times a caller can be counted towards satisfying the minimum impression requirement for a given ad.
  • The actual logic that determines whether an ad is eligible to serve should approximate reasonably well a simple-to-state rule ("at least K distinct users saw or would have seen this ad over the last D days").
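The rule in that last criterion can be stated directly in code; this sketch assumes illustrative values of K and D and a hypothetical per-user record of most recent (would-be) impressions:

```python
from datetime import date, timedelta

K = 50  # illustrative: minimum distinct users
D = 30  # illustrative: lookback window in days

def eligible(impressions: dict, today: date) -> bool:
    """The ideal rule the enforcement logic should approximate:
    at least K distinct users saw (or would have seen) the ad in
    the last D days. `impressions` maps user_id -> most recent
    impression date (a hypothetical data shape, for illustration)."""
    cutoff = today - timedelta(days=D)
    recent_users = {u for u, d in impressions.items() if d >= cutoff}
    return len(recent_users) >= K
```

The enforcement mechanisms discussed in this thread (j-bit ids, per-period counts, anonymous tokens) are all approximations of this rule under differing privacy and abuse constraints.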

We could have made different design choices, but this is what felt like a reasonable tradeoff among engineering complexity, privacy, and ability to resist abuse. You rightly point out that our design is still vulnerable to attacks by a fleet of $\geq K$ authenticated devices (unless we start looking for signs of abusive behavior of individual Google accounts).
