New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark cookie-sharing pixels as tracking #2147
Conversation
Add new heuristic to detect cookie syncing via tracking pixels. This includes one of the tracking methods used by google analytics. Call pixelCookieSyncAccounting from heuristicBlockingAccounting.
ba28512
to
e99466f
Compare
Change it from a "longest common substring" finder to an "all common substrings longer than X" finder. Rename some variables and update comments.
Since details.initiator isn't available in firefox (and isn't reliable in chrome), create a new data structure for tabURLs that saves the full URLs of first-party pages indexed by tab IDs. Check common substrings to make sure they are not substrings or superstrings of the tab URL.
Just tested this on the top 50 Majestic sites. This heuristic triggers a lot -- the scan found 66 potential pixel-tracking actions, compared to 121 third-party cookies, 16 supercookies, and 1 fingerprint. There are a lot of false positives, but most of the "trackers" actually look like trackers. Here are some examples that look legit:
And some likely false positives:
Edit: After looking at all 66, I'd estimate 13 of the tracking strings were false alarms (most of them are pasted above), and the other 53 were legit. This might be a good time to improve the cookie entropy heuristic (https://github.com/EFForg/privacybadger/blob/master/src/js/heuristicblocking.js#L222). |
Finished a scan of 1,200 sites (it errored out before it finished 2k). Since we started scanning, badger-sett has learned to block 1389 domains at least once. With the new heuristic, the scan learns to block 64 domains that have never been blocked before. Here's a snapshot of the most prominent new tracking domains:
At a glance, most of these do look like trackers. Once a clean scan of 2k completes, maybe we can comb through what the badger learns and try to work around false positives. |
Try to better estimate entropy of a string by guessing which group of characters it was created from.
Add list of common query substrings derived from actual requests, and filter those out before estimating the entropy of a string.
ecf0db0
to
66871d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, this is definitely going to be a multi-stage review.
My first concern is adequate performance. We've got a quadruple loop now in the common case (no cookie syncing). Can we remove any nesting? Can we precompute/cache anything?
When it comes to the heuristic, I'd rather catch fewer things now (to avoid false positives, improve performance, simplify the code maybe), and widen the net later.
Looking forward to the fixes and super excited about releasing this!
@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) { | |||
/* | |||
* Estimate the max possible entropy of str using min and max | |||
* char codes observed in the string. | |||
* Tends to overestimate in many cases, e.g. hexadecimals. | |||
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
* Sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
*/ | |||
function estimateMaxEntropy(str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes beneficial for the local storage ("supercookie") case as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a question whether these changes are beneficial for/applicable to the local storage ("supercookie") case as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they should be. These changes are all meant to tighten up the way the heuristic worked before -- entropy = (size of character set) * (number of characters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add unit tests to exercise estimateMaxEntropy
code paths.
Only get cookies from the containing frame's origin; get cookies after checking whether a request's type is "image"; pass more arguments to accounting function to reduce util calls; better comments; consistent use of "share" instead of "sync".
09bc72f
to
7eb885b
Compare
Move cookie sharing detection logic out of heuristicBlockingAccounting, and call it directly from the listener.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It may be best to keep using tabOrigins
/tabUrls
for now, unfortunately.
@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) { | |||
/* | |||
* Estimate the max possible entropy of str using min and max | |||
* char codes observed in the string. | |||
* Tends to overestimate in many cases, e.g. hexadecimals. | |||
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
* Sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
*/ | |||
function estimateMaxEntropy(str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a question whether these changes are beneficial for/applicable to the local storage ("supercookie") case as well.
Merge pixel-sharing listener back into heuristicBlockingAccounting. Add back tabOrigins and tabURLs data structures. Clean up comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Short circuit when there aren't any cookies or cookie values are too short. Don't check for cookie sharing inside onBeforeSendHeaders synchfonous listener.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Some more feedback.
src/js/heuristicblocking.js
Outdated
* @param cookies are the result of chrome.cookies.getAll() | ||
* @returns {*} | ||
*/ | ||
pixelCookieShareAccounting: function (details, cookies) { |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a functional test for pixel cookie sharing detection. Like our localStorage tracking tests but for cookie sharing.
// Adapted from https://gist.github.com/jaewook77/cd1e3aa9449d7ea4fb4f | ||
// Find all common substrings more than 8 characters long, using DYNAMIC | ||
// PROGRAMMING | ||
function findCommonSubstrings(str1, str2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add unit tests to exercise findCommonSubstrings
code paths.
@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) { | |||
/* | |||
* Estimate the max possible entropy of str using min and max | |||
* char codes observed in the string. | |||
* Tends to overestimate in many cases, e.g. hexadecimals. | |||
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
* Sensitive to case, e.g. bad1dea is different than BAD1DEA | |||
*/ | |||
function estimateMaxEntropy(str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add unit tests to exercise estimateMaxEntropy
code paths.
rename variables to snake_case; remove _extractArgs and inline URL arg parsing because we don't capture POST requests anyway; conserve variables and remove redundant definitions.
b69319c
to
9147d2d
Compare
If we've seen a particular origin tracking on a particular first party before, stop checking for tracking actions.
9147d2d
to
e71e272
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright! Let's merge to master so we can get a good week's worth of badger-sett runs. We'll follow up with tests and any other performance/bug fixes with separate PRs.
Detect cookie sharing via image (pixel) querystrings.
And top-level document/main frame images. Following up on #2147
Add new heuristic to detect cookie sharing via tracking pixels.
Logic works like this:
Ever time there is a request for an "image" resource to a third party, PB inspects all the query args in the request, and compares them against all the cookies on the page belonging to other domains. If any query argument has a significantly long substring in common with any cookie values, PB marks that request as tracking.
In practice, several third-party analytics services work like this. For example, https://nytimes.com uses Google Analytics. When a user visits for the first time, it sets a first-party cookie like
_gid=GA1.3.1327319847.1535103025
, and then loads a tracking pixel from www.google-analytics.com with a request including the argument?_gid=1327319847.1535103025
. bluekai.com is another service that appears to do the same thing.This will probably have some false positives, so we should try to figure out how to exclude common substrings from the entropy estimation. Right now I'm ignoring all substrings that have a part of the page URL in them -- otherwise we'd mark requests like
example-cdn.com/resource?site=nytimes.com
as tracking.Also: do we need to limit this to "pixels," or can we apply this to every kind of request?
I think this addresses most of #367. Closes #340, closes #2088. Part of #2114.