Mark cookie-sharing pixels as tracking #2147

Merged
merged 15 commits into from Jun 19, 2019

@bcyphers bcyphers commented Aug 24, 2018

Add new heuristic to detect cookie sharing via tracking pixels.

Logic works like this:
Every time there is a request for an "image" resource to a third party, PB inspects all the query args in the request and compares them against all the cookies on the page belonging to other domains. If any query argument shares a sufficiently long substring with any cookie value, PB marks that request as tracking.

In practice, several third-party analytics services work like this. For example, https://nytimes.com uses Google Analytics. When a user visits for the first time, it sets a first-party cookie like _gid=GA1.3.1327319847.1535103025, and then loads a tracking pixel from www.google-analytics.com with a request including the argument ?_gid=1327319847.1535103025. bluekai.com is another service that appears to do the same thing.
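
A minimal sketch of the check described above, using the Google Analytics example. This is not the code in this PR: the function names, the `/collect` path, and the threshold of 8 characters are assumptions made for illustration.

```javascript
// Hypothetical threshold for what counts as a "long" shared substring.
const MIN_SHARED_LENGTH = 8;

// Returns true if `a` and `b` share any substring of at least `minLen` characters.
function sharesLongSubstring(a, b, minLen) {
  for (let i = 0; i + minLen <= a.length; i++) {
    if (b.includes(a.slice(i, i + minLen))) {
      return true;
    }
  }
  return false;
}

// requestUrl: full URL of a third-party image request.
// cookieValues: values of the cookies visible on the first-party page.
function looksLikeCookieSharing(requestUrl, cookieValues) {
  const params = new URL(requestUrl).searchParams;
  for (const [, value] of params) {
    for (const cookieValue of cookieValues) {
      if (sharesLongSubstring(value, cookieValue, MIN_SHARED_LENGTH)) {
        return true;
      }
    }
  }
  return false;
}

// The _gid query argument repeats most of the first-party cookie value.
const flagged = looksLikeCookieSharing(
  "https://www.google-analytics.com/collect?_gid=1327319847.1535103025",
  ["GA1.3.1327319847.1535103025"]
);
console.log(flagged);
```

The sliding-window comparison keeps the sketch short; it trades speed for simplicity and only answers yes or no rather than returning the shared strings.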

This will probably have some false positives, so we should try to figure out how to exclude common substrings from the entropy estimation. Right now I'm ignoring all substrings that have a part of the page URL in them -- otherwise we'd mark requests like example-cdn.com/resource?site=nytimes.com as tracking.
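
One possible reading of the URL exclusion, sketched below: drop any candidate string that is contained in the page URL or that contains it. The helper name and the exact rule are assumptions, not the PR's implementation.

```javascript
// Hypothetical helper: drop candidate strings that overlap with the page URL,
// either as a piece of it or as a string that contains all of it.
function excludeUrlFragments(candidates, tabUrl) {
  return candidates.filter(
    (s) => !tabUrl.includes(s) && !s.includes(tabUrl)
  );
}

// "nytimes.com" is part of the page URL, so only the identifier survives.
const kept = excludeUrlFragments(
  ["nytimes.com", "1327319847.1535103025"],
  "https://www.nytimes.com/"
);
console.log(kept);
```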

Also: do we need to limit this to "pixels," or can we apply this to every kind of request?

I think this addresses most of #367. Closes #340, closes #2088. Part of #2114.

@bcyphers bcyphers changed the title Mark cookie-syncing tracking pixels as tracking Mark cookie-syncing pixels as tracking Aug 24, 2018
Add new heuristic to detect cookie syncing via tracking pixels. This
includes one of the tracking methods used by google analytics. Call
pixelCookieSyncAccounting from heuristicBlockingAccounting.
@bcyphers bcyphers added enhancement heuristic Badger's core learning-what-to-block functionality labels Aug 24, 2018
@bcyphers bcyphers changed the title Mark cookie-syncing pixels as tracking Mark cookie-sharing pixels as tracking Aug 24, 2018
Change it from a "longest common substring" finder to an "all common
substrings longer than X" finder. Rename some variables and update
comments.
Since details.initiator isn't available in firefox (and isn't reliable
in chrome), create a new data structure for tabURLs that saves the full
URLs of first-party pages indexed by tab IDs. Check common substrings to
make sure they are not substrings or superstrings of the tab URL.

bcyphers commented Aug 29, 2018

Just tested this on the top 50 Majestic sites. This heuristic triggers a lot -- the scan found 66 potential pixel-tracking actions, compared to 121 third-party cookies, 16 supercookies, and 1 fingerprint.

There are a lot of false positives, but most of the "trackers" actually look like trackers. Here are some examples that look legit:

| Tracker domain | First-party domain | Tracking string |
| --- | --- | --- |
| google-analytics.com | linkedin.com | 418299633.1535508492 |
| facebook.com | adobe.com | W4YARwAAAK13z0nI |
| hexagon-analytics.com | flickr.com | 82d07aeb-ce29-4e6d-a35f-76ce3545fa38 |
| doubleclick.net | linkedin.com | 418299633.1535508492 |
| rlcdn.com | sourceforge.net | 6472222808240991374 |
| media.net | nytimes.com | 6537396857404675838 |
| crsspxl.com | sourceforge.net | 78cf5b85-f0e3-4500-807f-471419b49974 |
| crsspxl.com | sourceforge.net | 6537396857404675838 |
| chartbeat.net | nytimes.com | ZQtbgC42A0aDnCOORiLzMjV-iNK |

And some likely false positives:

| Tracker domain | First-party domain | Tracking string |
| --- | --- | --- |
| keywee.co | nytimes.com | 1152x574 |
| pinterest.com | nytimes.com | :false, |
| myvisualiq.net | soundcloud.com | https%3A%2F%2F |
| everesttech.net | adobe.com | www.facebook.com |
| wp.com | wordpress.com | logged-out-homepage |
| wp.com | wordpress.com | x86_64 |
| yahoo.com | flickr.com | &b=3&s= |
| githubapp.com | github.com | 1152x574 |

Edit: After looking at all 66, I'd estimate 13 of the tracking strings were false alarms (most of them are pasted above), and the other 53 were legit. This might be a good time to improve the cookie entropy heuristic (https://github.com/EFForg/privacybadger/blob/master/src/js/heuristicblocking.js#L222).


bcyphers commented Aug 29, 2018

Finished a scan of 1,200 sites (it errored out before it finished 2k).

Since we started scanning, badger-sett has learned to block 1389 domains at least once. With the new heuristic, the scan learns to block 64 domains that have never been blocked before.

Here's a snapshot of the most prominent new tracking domains:

| Tracker | Times caught tracking |
| --- | --- |
| google-analytics.com | 72 |
| bluekai.com | 54 |
| chartbeat.net | 53 |
| crwdcntrl.net | 46 |
| visualwebsiteoptimizer.com | 20 |
| alexametrics.com | 18 |
| pippio.com | 15 |
| mfadsrvr.com | 15 |
| nexac.com | 14 |
| ssix.io | 13 |
| bouncex.net | 12 |
| trustx.org | 10 |

At a glance, most of these do look like trackers. Once a clean scan of 2k completes, maybe we can comb through what the badger learns and try to work around false positives.

Try to better estimate entropy of a string by guessing which group of
characters it was created from.
Add list of common query substrings derived from actual requests, and
filter those out before estimating the entropy of a string.

@ghostwords ghostwords left a comment

Apologies, this is definitely going to be a multi-stage review.

My first concern is adequate performance. We've got a quadruple loop now in the common case (no cookie syncing). Can we remove any nesting? Can we precompute/cache anything?

When it comes to the heuristic, I'd rather catch fewer things now (to avoid false positives, improve performance, simplify the code maybe), and widen the net later.

Looking forward to the fixes and super excited about releasing this!

@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) {
/*
* Estimate the max possible entropy of str using min and max
* char codes observed in the string.
* Tends to overestimate in many cases, e.g. hexadecimals.
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA
* Sensitive to case, e.g. bad1dea is different than BAD1DEA
*/
function estimateMaxEntropy(str) {
Member

Are these changes beneficial for the local storage ("supercookie") case as well?

Member

Still a question whether these changes are beneficial for/applicable to the local storage ("supercookie") case as well.

Contributor Author

Yes, they should be. These changes are all meant to tighten up the way the heuristic worked before -- entropy = (size of character set) * (number of characters).
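
A rough sketch of an estimate along these lines, not the code under review. It assumes the "size of character set" term is measured in bits per character (log2 of a guessed alphabet size); the character classes and function names below are hypothetical.

```javascript
// Guess the size of the alphabet a string was drawn from.
// The character classes are illustrative assumptions.
function guessAlphabetSize(str) {
  if (/^[0-9]+$/.test(str)) return 10;
  if (/^[0-9a-f]+$/.test(str) || /^[0-9A-F]+$/.test(str)) return 16;
  if (/^[a-z]+$/.test(str) || /^[A-Z]+$/.test(str)) return 26;
  if (/^[0-9a-z]+$/.test(str) || /^[0-9A-Z]+$/.test(str)) return 36;
  if (/^[0-9a-zA-Z]+$/.test(str)) return 62;
  return 95; // fall back to printable ASCII
}

// Upper bound on entropy in bits: (bits per character) * (number of characters).
function estimateMaxEntropyBits(str) {
  return str.length * Math.log2(guessAlphabetSize(str));
}

// Seven hex characters at 4 bits each, versus a decimal identifier.
console.log(estimateMaxEntropyBits("bad1dea"));
console.log(estimateMaxEntropyBits("418299633"));
```

Guessing the narrowest matching class is what keeps hexadecimal strings from being scored as if they used the full alphanumeric range.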

Member

Should add unit tests to exercise estimateMaxEntropy code paths.

Only get cookies from the containing frame's origin; get cookies after
checking whether a request's type is "image"; pass more arguments to
accounting function to reduce util calls; better comments; consistent
use of "share" instead of "sync".
@bcyphers bcyphers force-pushed the pixel-tracking-heuristic branch 2 times, most recently from 09bc72f to 7eb885b Compare June 8, 2019 00:08
Move cookie sharing detection logic out of heuristicBlockingAccounting,
and call it directly from the listener.

@ghostwords ghostwords left a comment

Thanks! It may be best to keep using tabOrigins/tabUrls for now, unfortunately.

@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) {
/*
* Estimate the max possible entropy of str using min and max
* char codes observed in the string.
* Tends to overestimate in many cases, e.g. hexadecimals.
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA
* Sensitive to case, e.g. bad1dea is different than BAD1DEA
*/
function estimateMaxEntropy(str) {
Member

Still a question whether these changes are beneficial for/applicable to the local storage ("supercookie") case as well.

Merge pixel-sharing listener back into heuristicBlockingAccounting. Add
back tabOrigins and tabURLs data structures. Clean up comments.

@ghostwords ghostwords left a comment

Thank you!

Short circuit when there aren't any cookies or cookie values are too
short. Don't check for cookie sharing inside the synchronous
onBeforeSendHeaders listener.
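
The short circuit could look roughly like the sketch below. The minimum length, the function names, and the assumption that cookies arrive as objects with a string `value` field (as `chrome.cookies.getAll()` results do) are illustrative, not taken from this commit.

```javascript
// Hypothetical minimum length for a cookie value to be worth comparing.
const MIN_COOKIE_VALUE_LENGTH = 8;

// Keep only cookies whose values are long enough to carry an identifier.
function cookiesWorthChecking(cookies) {
  return cookies.filter(
    (c) => typeof c.value === "string" && c.value.length >= MIN_COOKIE_VALUE_LENGTH
  );
}

// Decide cheaply whether the expensive substring comparison should run at all.
function shouldCheckForSharing(requestType, cookies) {
  if (requestType !== "image") {
    return false;
  }
  return cookiesWorthChecking(cookies).length > 0;
}

console.log(shouldCheckForSharing("image", [
  { name: "_gid", value: "GA1.3.1327319847.1535103025" },
]));
```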

@ghostwords ghostwords left a comment

Looking good! Some more feedback.

* @param cookies are the result of chrome.cookies.getAll()
* @returns {*}
*/
pixelCookieShareAccounting: function (details, cookies) {

This comment was marked as resolved.

Member

We should have a functional test for pixel cookie sharing detection. Like our localStorage tracking tests but for cookie sharing.

// Adapted from https://gist.github.com/jaewook77/cd1e3aa9449d7ea4fb4f
// Find all common substrings more than 8 characters long, using DYNAMIC
// PROGRAMMING
function findCommonSubstrings(str1, str2) {
Member

Should add unit tests to exercise findCommonSubstrings code paths.
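
As a starting point for such tests, here is a self-contained sketch of a dynamic-programming "all common substrings of at least N characters" finder. It is not the function from this PR: the name `commonSubstringsSketch`, the explicit `minLength` parameter, and the choice to report only maximal runs are assumptions.

```javascript
// Return every maximal common substring of str1 and str2 that is at least
// minLength characters long. prev[j] holds the length of the common run
// ending at str1[i - 2] and str2[j - 1]; curr[j] extends it by one character.
function commonSubstringsSketch(str1, str2, minLength) {
  const results = new Set();
  let prev = new Array(str2.length + 1).fill(0);
  for (let i = 1; i <= str1.length; i++) {
    const curr = new Array(str2.length + 1).fill(0);
    for (let j = 1; j <= str2.length; j++) {
      if (str1[i - 1] === str2[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        // The run is maximal if either string ends here or the next characters differ.
        const runEnds =
          i === str1.length || j === str2.length || str1[i] !== str2[j];
        if (runEnds && curr[j] >= minLength) {
          results.add(str1.slice(i - curr[j], i));
        }
      }
    }
    prev = curr;
  }
  return [...results];
}

// Cookie value and query argument from the Google Analytics example.
const shared = commonSubstringsSketch(
  "GA1.3.1327319847.1535103025",
  "1327319847.1535103025",
  8
);
console.log(shared);
```

Keeping only two rows of the table bounds memory by the length of the second string, which matters if this runs on every image request.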

@@ -108,8 +108,7 @@ function explodeSubdomains(fqdn, all) {
/*
* Estimate the max possible entropy of str using min and max
* char codes observed in the string.
* Tends to overestimate in many cases, e.g. hexadecimals.
* Also, sensitive to case, e.g. bad1dea is different than BAD1DEA
* Sensitive to case, e.g. bad1dea is different than BAD1DEA
*/
function estimateMaxEntropy(str) {
Member

Should add unit tests to exercise estimateMaxEntropy code paths.

rename variables to snake_case; remove _extractArgs and inline URL arg
parsing because we don't capture POST requests anyway; conserve
variables and remove redundant definitions.
If we've seen a particular origin tracking on a particular first party
before, stop checking for tracking actions.
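
The "stop checking once seen" idea can be sketched as a small cache keyed by tracker origin and first-party origin. The data structure and function names here are hypothetical and only illustrate the approach.

```javascript
// Hypothetical cache: tracker origin -> Set of first-party origins already flagged.
const seenPairs = new Map();

function alreadySeen(tracker, firstParty) {
  const parties = seenPairs.get(tracker);
  return parties !== undefined && parties.has(firstParty);
}

function remember(tracker, firstParty) {
  if (!seenPairs.has(tracker)) {
    seenPairs.set(tracker, new Set());
  }
  seenPairs.get(tracker).add(firstParty);
}

// detect is any (possibly expensive) function that returns true when tracking is found.
function checkOnce(tracker, firstParty, detect) {
  if (alreadySeen(tracker, firstParty)) {
    return true; // known pair, skip the expensive work
  }
  if (detect()) {
    remember(tracker, firstParty);
    return true;
  }
  return false;
}

// The detector runs on the first call only; the second call hits the cache.
let detectorRuns = 0;
const detector = () => {
  detectorRuns += 1;
  return true;
};
checkOnce("google-analytics.com", "nytimes.com", detector);
checkOnce("google-analytics.com", "nytimes.com", detector);
console.log(detectorRuns);
```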

@ghostwords ghostwords left a comment

Alright! Let's merge to master so we can get a good week's worth of badger-sett runs. We'll follow up with tests and any other performance/bug fixes with separate PRs.

Successfully merging this pull request may close these issues.

Identify cookie syncing as third-party tracking Not blocking Google Analytics (GA)