# Cookie-less Tracking Using Canvas Fingerprinting: 
How websites make it harder for users to control their privacy preferences.  

For Mozilla's Overscripted Challenge.  
**Analyses Category:** Tracking and Privacy

**Authors of this notebook:** [Maya Filipp](https://github.com/mordax) (email: maya@mordax.io) and [Chaya Danzinger](https://github.com/biskit1) (email: chayadanz@gmail.com)

 **Mozilla's Principle 4:**
 __Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.__
 

Individuals cannot be promised this fundamental right when companies and websites track them with techniques they have no practical way to guard against. Cookies can be removed through a clear, well-known mechanism; canvas fingerprints cannot. It therefore falls to browsers to help identify such non-obvious techniques and warn users about them.

This analysis examines the given dataset to determine how prevalent cookie-less tracking is, specifically exploring the use of canvas fingerprinting to uniquely identify visitors. A few external resources, referenced further on, were used to cross-check some of our analyses.

Fingerprinting has become [increasingly popular](https://webtransparency.cs.princeton.edu/webcensus/index.html), given the ease with which websites can access a person's browser and system settings and configuration. To generate a unique identifier, a website collects the values of as many browser and system settings as possible and computes an identifier for that browser from them. Once generated, a fingerprint can be used by websites and trackers for many purposes, including tracking users for business reasons (analytics, marketing, advertising) and following a user across the web until they can be [de-anonymized](https://robertheaton.com/2017/10/17/we-see-you-democratizing-de-anonymization/).
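
To make the idea of "computing an identifier from settings" concrete, here is a minimal sketch. It is not taken from any real tracker: the attribute names, their values and the choice of SHA-256 are all assumptions for illustration.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser/system attributes into a short, stable identifier."""
    # Serialize with sorted keys so the same attributes always give the same string.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute values, for illustration only.
visitor_a = {"user_agent": "Firefox/61.0", "screen": "1920x1080", "timezone": "UTC-5", "fonts": 212}
visitor_b = dict(visitor_a, fonts=198)  # identical except for the installed font count

id_a = fingerprint(visitor_a)
id_b = fingerprint(visitor_b)
print(id_a, id_b, id_a == id_b)
```

A single differing attribute is enough to produce a different identifier, which is why trackers collect as many settings as they can.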

## Getting Started

> Using the same import statements as the hello_world.ipynb

In [2]:
import pandas as pd
import tldextract

DATA_DIR = '../data/'  # Adjust to the path of the directory containing the parquet files
PARQUET_FILE = DATA_DIR


def extract_domain(url):
    """Use tldextract to return the base domain from a url"""
    try:
        extracted = tldextract.extract(url)
        return '{}.{}'.format(extracted.domain, extracted.suffix)
    except Exception:
        # Missing or malformed URLs are marked instead of aborting the whole apply()
        return 'ERROR'


In [3]:
import findspark
findspark.init('../spark-2.3.1-bin-hadoop2.7')  # Adjust for the location where you installed spark

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext(appName="Overscripted")
spark = SparkSession(sc)

In [4]:
import requests
from io import StringIO

from pyspark.sql.functions import udf
from pyspark.sql.types import *
from urllib.parse import urlparse

In [5]:
def parse_base_url(url):
  """ Extract the base part of a URL (netloc, up until the first '/'). """
  return urlparse(url).netloc

In [6]:
pdf = pd.read_parquet(PARQUET_FILE, engine='pyarrow')
len(pdf)

563430

In [7]:
pdf['location_domain'] = pdf.location.apply(extract_domain)
pdf['script_domain'] = pdf.script_url.apply(extract_domain)
pdf['location_base_url'] = pdf.location.apply(parse_base_url)


In [8]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
ddf = dd.read_parquet(PARQUET_FILE, engine='pyarrow')
ddf['location_domain'] = ddf.location.apply(extract_domain, meta=('x', 'str'))
ddf['location_base_url'] = ddf.location.apply(parse_base_url, meta=('x', 'str'))
ddf['script_domain'] = ddf.script_url.apply(extract_domain, meta=('x', 'str'))

### Canvas Fingerprinting

Just by looking at the [symbol counts](https://github.com/mozilla/Overscripted-Data-Analysis-Challenge/blob/master/data_prep/symbol_counts.csv), we could see that calls to the `canvas` element's properties recurred very frequently. Further research (outlined in [The Web Never Forgets](https://securehomes.esat.kuleuven.be/~gacar/persistent/the_web_never_forgets.pdf)) confirmed a common basic flow of operations used to fingerprint the canvas, which we applied to this dataset to test for fingerprinting. Although websites regularly use the canvas API for innocent rendering purposes, looking for this flow of operations helped identify cases in which it was being used for fingerprinting.

#### The Flow of Operations

Reading canvas properties alone isn't enough to uniquely identify a browser or user, since many users share the same settings, browser, region, and so on. To capture what is unique about how a machine renders the canvas, a script will usually call `fillText` (after setting further properties), followed by a `toDataURL` call, for the reasons described below.
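
As a rough illustration of this flow, the sketch below flags scripts that draw text and later read the canvas back. It works on plain Python dicts rather than the dataset itself; the records and the integer `time_stamp` values are made up, and only the column names (`script_url`, `symbol`, `time_stamp`) mirror the dataset.

```python
TEXT_CALLS = {"CanvasRenderingContext2D.fillText", "CanvasRenderingContext2D.strokeText"}
EXPORT_CALL = "HTMLCanvasElement.toDataURL"


def scripts_following_flow(calls):
    """Return script URLs that draw text to a canvas and later read it back."""
    drew_text = set()
    flagged = set()
    # Walk the calls in chronological order so "later" really means later.
    for call in sorted(calls, key=lambda c: c["time_stamp"]):
        script = call["script_url"]
        if call["symbol"] in TEXT_CALLS:
            drew_text.add(script)
        elif call["symbol"] == EXPORT_CALL and script in drew_text:
            flagged.add(script)
    return flagged


# Hypothetical records, for illustration only.
calls = [
    {"script_url": "https://a.example/fp.js", "symbol": "CanvasRenderingContext2D.fillText", "time_stamp": 1},
    {"script_url": "https://a.example/fp.js", "symbol": "HTMLCanvasElement.toDataURL", "time_stamp": 2},
    {"script_url": "https://b.example/chart.js", "symbol": "CanvasRenderingContext2D.fillText", "time_stamp": 3},
    {"script_url": "https://c.example/thumb.js", "symbol": "HTMLCanvasElement.toDataURL", "time_stamp": 4},
]
flagged = scripts_following_flow(calls)
print(flagged)
```

Only the first script both draws text and exports the canvas afterwards; a chart that only draws and a thumbnailer that only exports are left alone.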

In [9]:
fillTexts = ddf[ddf.symbol == 'CanvasRenderingContext2D.fillText']
toData = ddf[ddf.symbol == 'HTMLCanvasElement.toDataURL']
strokeText = ddf[ddf.symbol == 'CanvasRenderingContext2D.strokeText']
measureText = ddf[ddf.symbol == 'CanvasRenderingContext2D.measureText']
dataImage = ddf[ddf.symbol == 'CanvasRenderingContext2D.putImageData']
getContext = ddf[ddf.symbol == 'HTMLCanvasElement.getContext']
with ProgressBar():
    fillTexts = fillTexts.compute()
    toData = toData.compute()
    strokeText = strokeText.compute()
    measureText = measureText.compute()
    dataImage = dataImage.compute()
    getContext = getContext.compute()

[########################################] | 100% Completed | 34.0s
[########################################] | 100% Completed | 32.5s
[########################################] | 100% Completed | 32.2s
[########################################] | 100% Completed | 32.0s
[########################################] | 100% Completed | 33.4s
[########################################] | 100% Completed | 32.0s


A telltale sign of canvas fingerprinting is a site calling `toDataURL` together with `fillText`, `strokeText`, `getContext`, `measureText` or `putImageData`. The sections below identify the locations that use at least one of these combinations.
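
One way to express "has at least one of the combinations" is with sets of symbols per location. This is a sketch on hypothetical `(location_domain, symbol)` pairs, not the code used to produce the results in this notebook.

```python
EXPORT_CALL = "HTMLCanvasElement.toDataURL"
PARTNER_CALLS = [
    "CanvasRenderingContext2D.fillText",
    "CanvasRenderingContext2D.strokeText",
    "HTMLCanvasElement.getContext",
    "CanvasRenderingContext2D.measureText",
    "CanvasRenderingContext2D.putImageData",
]


def locations_with_combination(rows, required_symbols):
    """Return location domains whose recorded calls include every required symbol."""
    seen = {}
    for location_domain, symbol in rows:
        seen.setdefault(location_domain, set()).add(symbol)
    required = set(required_symbols)
    return {loc for loc, symbols in seen.items() if required <= symbols}


# Hypothetical rows, for illustration only.
rows = [
    ("site-a.example", "CanvasRenderingContext2D.fillText"),
    ("site-a.example", "HTMLCanvasElement.toDataURL"),
    ("site-b.example", "CanvasRenderingContext2D.measureText"),
    ("site-c.example", "HTMLCanvasElement.getContext"),
    ("site-c.example", "HTMLCanvasElement.toDataURL"),
]

suspects = set()
for partner in PARTNER_CALLS:
    suspects |= locations_with_combination(rows, [partner, EXPORT_CALL])
print(sorted(suspects))
```

Unlike an ordered check, this ignores which call came first, so it is a coarser filter that is best treated as a first pass.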


#### fillText and toDataURL combined websites

Some websites use both canvas `fillText()` and `toDataURL()`. By drawing text on the canvas and calling `toDataURL()` to read back how this particular computer renders it, the site can tag you as a unique visitor. The `fillText()` argument is usually a pangram plus a Unicode symbol (see below), which exercises as many glyphs as possible and so yields the most distinctive fingerprint. Interestingly, the websites that pass a large set of font names as arguments to `fillText()` may be using a different technique instead of a pangram. [The Web Never Forgets](https://securehomes.esat.kuleuven.be/~gacar/persistent/the_web_never_forgets.pdf) notes that if the requested image is not in a lossy format, the call is most likely a canvas fingerprint. `image/webp` is passed in some of the calls; WebP can be used as a lossless format.

**CanvasRenderingContext2D.measureText()** was not discussed in the articles; it was a discovery of our own. `measureText()` returns a `TextMetrics` object whose `width` is the calculated width of a segment of inline text in CSS pixels, taking the context's current font into account. (Width is the most widely supported measurement; several others could be used if browser support becomes ubiquitous.) By passing a unique string and reading its width, a script can use `measureText()` as a replacement for a pangram in fingerprinting. In one of our datasets we found Stripe.network using this technique 58 times, along with `fillText` and `toDataURL`.
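
The lossy-format heuristic can be sketched as a small classifier over the first argument of `toDataURL`. The function name and the category labels are our own, and the input is assumed to be a string or `None`. `toDataURL` defaults to PNG when no type is given; WebP is labelled ambiguous because it can be encoded either way.

```python
LOSSY_TYPES = {"image/jpeg"}
AMBIGUOUS_TYPES = {"image/webp"}  # WebP supports both lossy and lossless encoding


def classify_export_format(argument_0):
    """Classify the MIME type passed to toDataURL as lossless, lossy, ambiguous or unknown."""
    # toDataURL falls back to PNG when no type is given (empty or missing argument).
    mime = (argument_0 or "image/png").strip().lower()
    if mime == "image/png":
        return "lossless"
    if mime in LOSSY_TYPES:
        return "lossy"
    if mime in AMBIGUOUS_TYPES:
        return "ambiguous"
    return "unknown"


for arg in [None, "image/png", "image/jpeg", "image/webp"]:
    print(repr(arg), classify_export_format(arg))
```

Calls classified as lossless are the ones worth a closer look under this heuristic; lossy exports are more likely to be ordinary image handling.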


In [11]:
measureText.location_domain.count()

2073

In [21]:
d1 = pd.DataFrame(fillTexts.groupby(['location_domain', 'script_domain', 'argument_0', 'argument_1', 'symbol']).size())
d2 = pd.DataFrame(toData.groupby(['location_domain','script_domain', 'argument_0', 'argument_1', 'symbol']).size())
d3 = pd.DataFrame(measureText.groupby(['location_domain','script_domain', 'argument_0', 'argument_1', 'symbol']).size())
pd.concat([d1, d2, d3], axis=1).head(10)

| location_domain | script_domain | argument_0 | argument_1 | symbol | fillText calls | toDataURL calls | measureText calls |
|---|---|---|---|---|---|---|---|
| 0daydown.com | 0daydown.com |  |  | HTMLCanvasElement.toDataURL |  | 4.0 |  |
| 0daydown.com | 0daydown.com | 🇺​🇳 | 0.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 0daydown.com | 0daydown.com | 🇺🇳 | 0.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 0daydown.com | 0daydown.com | 🧚​♂️ | 0.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 0daydown.com | 0daydown.com | 🧚‍♂️ | 0.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 163.com | 127.net |  |  | HTMLCanvasElement.toDataURL |  | 2.0 |  |
| 163.com | 127.net | mwC nkbafjord phsgly exvt zqiu, á½ tphst/:/uhbgtic.mo/levva | 2.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 163.com | 127.net | mwC nkbafjord phsgly exvt zqiu, á½ tphst/:/uhbgtic.mo/levva | 4.0 | CanvasRenderingContext2D.fillText | 1.0 |  |  |
| 17173.com | 17173cdn.com | image/png |  | HTMLCanvasElement.toDataURL |  | 1.0 |  |
| 2mdn.net | 2mdn.net | 32 |  | CanvasRenderingContext2D.measureText |  |  | 1.0 |


In [20]:
pd.concat([fillTexts, toData, measureText]).head(10)

|  | argument_0 | argument_1 | argument_2 | argument_3 | argument_4 | argument_5 | argument_6 | argument_7 | argument_8 | arguments | ... | symbol | time_stamp | value | value_1000 | value_len | valid | errors | location_domain | location_base_url | script_domain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1758 | 🇺🇳 | 0 | 0 |  |  |  |  |  |  | {"0":"🇺🇳","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 15:27:18.970 |  |  | 0 | True |  | weblogssl.com | www.weblogssl.com | weblogssl.com |
| 1762 | 🇺​🇳 | 0 | 0 |  |  |  |  |  |  | {"0":"🇺​🇳","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 15:27:18.975 |  |  | 0 | True |  | weblogssl.com | www.weblogssl.com | weblogssl.com |
| 1768 | 🧚‍♂️ | 0 | 0 |  |  |  |  |  |  | {"0":"🧚‍♂️","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 15:27:18.978 |  |  | 0 | True |  | weblogssl.com | www.weblogssl.com | weblogssl.com |
| 1772 | 🧚​♂️ | 0 | 0 |  |  |  |  |  |  | {"0":"🧚​♂️","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 15:27:18.981 |  |  | 0 | True |  | weblogssl.com | www.weblogssl.com | weblogssl.com |
| 2300 | http://valve.github.io | 2 | 15 |  |  |  |  |  |  | {"0":"http://valve.github.io","1":2,"2":15} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 23:05:22.355 |  |  | 0 | True |  | zalando.de | www.zalando.de | metrigo.com |
| 2302 | http://valve.github.io | 4 | 17 |  |  |  |  |  |  | {"0":"http://valve.github.io","1":4,"2":17} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 23:05:22.360 |  |  | 0 | True |  | zalando.de | www.zalando.de | metrigo.com |
| 2851 | Cwm fjordbank glyphs vext quiz, 😃 | 2 | 15 |  |  |  |  |  |  | {"0":"Cwm fjordbank glyphs vext quiz, 😃","1":2... | ... | CanvasRenderingContext2D.fillText | 2017-12-16 17:58:43.069 |  |  | 0 | True |  | epwk.com | www.epwk.com | bshare.cn |
| 2854 | Cwm fjordbank glyphs vext quiz, 😃 | 4 | 45 |  |  |  |  |  |  | {"0":"Cwm fjordbank glyphs vext quiz, 😃","1":4... | ... | CanvasRenderingContext2D.fillText | 2017-12-16 17:58:43.072 |  |  | 0 | True |  | epwk.com | www.epwk.com | bshare.cn |
| 12655 | 🇺🇳 | 0 | 0 |  |  |  |  |  |  | {"0":"🇺🇳","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 14:40:50.903 |  |  | 0 | True |  | redditblog.com | redditblog.com | redditblog.com |
| 12659 | 🇺​🇳 | 0 | 0 |  |  |  |  |  |  | {"0":"🇺​🇳","1":0,"2":0} | ... | CanvasRenderingContext2D.fillText | 2017-12-16 14:40:50.909 |  |  | 0 | True |  | redditblog.com | redditblog.com | redditblog.com |


#### Pangrams and unique render identifiers
In the output we can see that some of the arguments passed to `fillText` are pangrams, for example `Cwm fjordbank glyphs vext quiz, 😃`. One argument we caught while investigating was `<@nv45. F1n63r,Pr1n71n6!`. If you look closely, it says "Canvas. Finger,Printing!" in leetspeak. Pangrams are used to observe how a computer renders every letter, and a Unicode character is usually added as well. This is outlined in [The Web Never Forgets](https://securehomes.esat.kuleuven.be/~gacar/persistent/the_web_never_forgets.pdf).
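
A quick way to test whether a `fillText` argument looks like this kind of probe string is to check for a pangram plus a non-ASCII character. The helper names below are ours; the two sample strings come from the output above.

```python
import string


def is_pangram(text):
    """True if text contains every ASCII letter a-z at least once (case-insensitive)."""
    letters = {ch for ch in text.lower() if ch in string.ascii_lowercase}
    return len(letters) == 26


def has_non_ascii(text):
    """True if text contains a character outside ASCII, such as an emoji."""
    return any(ord(ch) > 127 for ch in text)


probe = "Cwm fjordbank glyphs vext quiz, \U0001F603"
plain = "http://valve.github.io"
print(is_pangram(probe), has_non_ascii(probe))
print(is_pangram(plain), has_non_ascii(plain))
```

The pangram from the dataset uses each of the 26 letters exactly once, so it covers the whole alphabet with very little drawing.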

Out of sheer curiosity, we also tested passing the result of an `eval()` call to canvas `fillText()`: the evaluated value is drawn on the canvas, and the function itself does nothing to guard against this. A script could use an evaluated expression to produce a unique code for the visitor, or someone could use it as part of a malicious XSS attack.


What sort of arguments are being passed to fillTexts?

In [22]:
# What's being written to canvas
pd.DataFrame(fillTexts.argument_0.value_counts()).head()

| argument_0 value | count |
|---|---|
| [emoji, mis-encoded in source] | 154 |
| Cwm fjordbank glyphs vext quiz, 😃 | 70 |
|  | 70 |
| 🇺🇳 | 53 |
| 201706 | 50 |


In [23]:
pd.DataFrame(strokeText.groupby(['location_domain', 'script_domain', 'argument_0']).size()) #Pulling up StrokeText

| location_domain | script_domain | argument_0 | count |
|---|---|---|---|
| bhg.com | b2c.com | t | 1 |
| gravitytales.com | securepaths.com | 180 | 1 |
| gravitytales.com | securepaths.com | 934 | 1 |
| scitation.org | b2c.com | t | 1 |
| zedo.com | fqtag.com | 180 | 1 |
| zedo.com | fqtag.com | 934 | 1 |


"The Web Never Forgets" mentions `CanvasRenderingContext2D.strokeText()` being used in the same way as `fillText()`, although `fillText()` appears to be more popular.

In [25]:
# Combine getContext and toDataURL to see whether the context is used as a canvas fingerprinting method instead.
d1 = pd.DataFrame(getContext.groupby(['location_domain','script_domain', 'argument_0', 'argument_1', 'symbol']).size())
d2 = pd.DataFrame(toData.groupby(['location_domain','script_domain', 'argument_0', 'argument_1', 'symbol']).size())
pd.concat([d1, d2], axis=1).head(10)

| location_domain | script_domain | argument_0 | argument_1 | symbol | getContext calls | toDataURL calls |
|---|---|---|---|---|---|---|
| 0daydown.com | 0daydown.com |  |  | HTMLCanvasElement.toDataURL |  | 4.0 |
| 0daydown.com | 0daydown.com | 2d |  | HTMLCanvasElement.getContext | 1.0 |  |
| 104china.com | 104china.com | 2d |  | HTMLCanvasElement.getContext | 2.0 |  |
| 11street.my | 11street.my | 2d |  | HTMLCanvasElement.getContext | 15.0 |  |
| 163.com | 127.net |  |  | HTMLCanvasElement.toDataURL |  | 2.0 |
| 163.com | 127.net | 2d |  | HTMLCanvasElement.getContext | 2.0 |  |
| 163.com | 127.net | webgl |  | HTMLCanvasElement.getContext | 1.0 |  |
| 17173.com | 17173cdn.com | 2d |  | HTMLCanvasElement.getContext | 5.0 |  |
| 17173.com | 17173cdn.com | image/png |  | HTMLCanvasElement.toDataURL |  | 1.0 |
| 2mdn.net | 2mdn.net | 2d |  | HTMLCanvasElement.getContext | 33.0 |  |


Above we can see calls to `HTMLCanvasElement.getContext()`, which returns the drawing context for the canvas (for example `2d` or `webgl`) and can be used to check rendering. When it is paired with `toDataURL`, the script is most likely using the canvas as a fingerprint. See [canvas](https://cseweb.ucsd.edu/~hovav/dist/canvas.pdf), the paper that explains how the context is used for fingerprinting, and [webgl](https://browserleaks.com/webgl) for WebGL fingerprinting.

# Canvas Fingerprinting Check

Princeton University ran a project looking into various fingerprinting methods and compiled a long list of script URLs that use canvas fingerprinting, available [here](https://webtransparency.cs.princeton.edu/webcensus/canvas_scripts.html). We compared the data in the Princeton .tsv against the script URLs we suspected of canvas fingerprinting. The URLs that kept appearing in our data pulls matched those gathered by Princeton, which both shows what kinds of canvas attributes are being used and backs up our analysis.

In [27]:
def get_cfp_sites():
    """Loads a list of canvas fingerprinting script providers discovered as part of the Princeton WebTAP project,
    which listed sites within the Alexa top 100,000 that show signs of canvas fingerprinting scripts.
    """
    cfp_csv_raw = requests.get("https://webtransparency.cs.princeton.edu/webcensus/canvas_fingerprinting.tsv")
    # cfp_csv_raw
    cfp_csv = pd.read_csv(StringIO(cfp_csv_raw.text), sep="\t",  names=["site", "fp_domain"])
    cfp_csv['fp_cut_domain'] = cfp_csv.fp_domain.apply(extract_domain)
    return list(cfp_csv.fp_cut_domain.unique())

In [28]:
#Gets list of script domains
cfp_sites = get_cfp_sites()

How many unique domains are there?

In [29]:
len(cfp_sites)

373

In [34]:
cfp_sites[0:10]

['doubleverify.com',
 'lijit.com',
 'adbox.lk',
 'aa.com.ve',
 'seewhy.com',
 'adf.ly',
 'addthis.com',
 'bling99.com',
 'playsport.cc',
 'gazeta.pl']
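
Cross-referencing our suspected script domains with a list like this comes down to a set intersection. The sketch below uses placeholder domain names, not results from the dataset, and the function name is our own.

```python
def cross_reference(suspected_domains, known_domains):
    """Split suspected domains into those confirmed by a known list and the rest."""
    suspected = set(suspected_domains)
    known = set(known_domains)
    return {
        "confirmed": sorted(suspected & known),
        "unlisted": sorted(suspected - known),
    }


# Placeholder domains, for illustration only.
suspected = ["tracker-a.example", "tracker-b.example", "widget.example"]
known = ["tracker-a.example", "tracker-b.example", "other.example"]

result = cross_reference(suspected, known)
print(result)
```

Domains that end up under `unlisted` are not necessarily innocent; they may simply be missing from the reference list, so they deserve a manual look.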

Next, we find instances in the full dataset where the script_url belongs to one of the known canvas fingerprinting script providers.

First, extract the main page URLs and script URLs from the dataset.


In [36]:
df = spark.read.parquet(PARQUET_FILE) #Adjust to PARQUET_FILE
df.count() 

563430

How many distinct script calls, that is, distinct (location, script_url) pairs, are there in the full dataset?

In [38]:
df_urls = df.select("location", "script_url").distinct()
n_rows = df_urls.count()
n_rows

29688

Add additional columns for the extracted URL components that we will use in the analysis.

In [39]:
def parse_base_url(url):
  """ Extract the base part of a URL (netloc, up until the first '/'). """
  return urlparse(url).netloc
udf_parse_base_url = udf(parse_base_url, StringType())

def parse_url_scheme(url):
  """ Extract the scheme (protocol) from a URL. """
  return urlparse(url).scheme
udf_parse_url_scheme = udf(parse_url_scheme, StringType())

def parse_suffix(url):
  """ Extract the suffix (TLD) from a URL. """
  return url.split(".")[-1]
udf_parse_suffix = udf(parse_suffix, StringType())
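
To see what these helpers extract, here is `urlparse` applied to a made-up URL. It also shows one assumption worth checking: `netloc` keeps the port, so splitting it on `.` yields a suffix with the port attached, while `hostname` does not.

```python
from urllib.parse import urlparse

url = "https://www.example.com:8080/some/page.html?x=1"  # hypothetical URL
parts = urlparse(url)

print(parts.scheme)                      # https
print(parts.netloc)                      # www.example.com:8080
print(parts.hostname)                    # www.example.com
print(parts.netloc.split(".")[-1])       # com:8080
print(parts.hostname.split(".")[-1])     # com
```

If the dataset contains URLs with explicit ports, deriving the suffix from `hostname` rather than `netloc` would avoid counting `com:8080` as a separate suffix.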

In [40]:
# total distinct script calls with added columns
udf_parse_domain = udf(extract_domain, StringType())
df_urls = df_urls.withColumn("base_location_url", udf_parse_base_url(df.location))\
  .withColumn("base_script_url", udf_parse_base_url(df.script_url))\
  .withColumn("location_scheme", udf_parse_url_scheme(df.location))\
  .withColumn("script_scheme", udf_parse_url_scheme(df.script_url))\
  .withColumn("location_domain", udf_parse_domain(df.location))\
  .withColumn("script_domain", udf_parse_domain(df.script_url))
df_urls = df_urls.withColumn("location_suffix", udf_parse_suffix(df_urls.base_location_url))

## Canvas fingerprinting scripts

Find the subset that correspond to canvas fingerprinting scripts.

In [42]:
CFP_REGEX = "|".join(cfp_sites)

In [43]:
CFP_REGEX

'doubleverify.com|lijit.com|adbox.lk|aa.com.ve|seewhy.com|adf.ly|addthis.com|bling99.com|playsport.cc|gazeta.pl|banker.bg|spankbang.com|cloudfront.net|shopjapan.co.jp|cdnetworks.com|bitmedia.io|nt.vc|eyenewton.ru|eyereturn.com|watcheezy.com|pxi.pub|wemark.com|emop.be|groupon.ch|groupon.co.uk|groupon.be|groupon.pt|pof.de|yandex.ru|cntntflow.hu|poll-maker.com|watcheezy.net|runningbare.com.au|fraudmetrix.cn|cloudcrm.co|mileroticos.com|uol.com.br|imusicaradios.com.br|constantcontact.com|pet360.com|ratepay.com|trustedform.com|imedia.cz|sa-mp.im|metrigo.com|cdnetworks.co.jp|free-dollar.com|pof.com.mx|qualoperadora.net|groupon.cl|cformanalytics.com|mindedgeonline.com|job1001.com|machinio.com|pardisgame.net|y-track.com|adlibr.com|straitstimes.com|groupon.it|kbmg.com|jogging-point.de|vimeoo.net|kf2.pl|eternalcrusade.com|news.com.au|ozelders.com|namebrightstatic.com|grouponnz.co.nz|revmob.com|amazonaws.com|websosanh.com|laaptu.com|ml.com|emarsys.net|groupon.se|163.com|worldota.net|rastclick.com|

In [45]:
sites_using_canvas_fingerprinting =  df_urls.filter(df_urls.base_script_url.rlike(CFP_REGEX))
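
One caveat with joining raw domains into a pattern: `.` matches any character and the pattern is unanchored, so short entries can match unrelated hosts. The sketch below uses Python's `re` module to illustrate the difference with two entries from the list; the stricter pattern is our own suggestion, and we have not re-run the Spark `rlike` filter with it, so the counts in this notebook come from the simpler pattern.

```python
import re

domains = ["ml.com", "adf.ly"]  # two entries from the list above

# Pattern as built in the notebook: dots unescaped, no anchoring.
loose = re.compile("|".join(domains))
# Stricter variant: escaped dots, matching the whole host or a subdomain of it.
strict = re.compile(r"(^|\.)(" + "|".join(re.escape(d) for d in domains) + r")$")

hosts = ["www.ml.com", "html.com", "mlxcom.org", "cdn.adf.ly"]
for host in hosts:
    print(host, bool(loose.search(host)), bool(strict.search(host)))
```

With the loose pattern all four hosts match, including `html.com` and `mlxcom.org`; the strict pattern keeps only `www.ml.com` and `cdn.adf.ly`.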

Overall, how many distinct calls are made to canvas fingerprinting scripts?

In [47]:
sites_using_canvas_fingerprinting.count()

2105

Above is the number of distinct calls to canvas fingerprinting scripts (one location might call two different scripts).

Of the 29688 distinct calls, 2105 were made to known canvas fingerprinting script URLs.

How many distinct base URLs are there among the sites in the dataset? (One location might make two different calls and therefore be listed twice among the 29688 distinct calls, so how many distinct locations are making calls?)

In [48]:
df_urls.dropDuplicates(['base_location_url']).count()

3405

The 29688 distinct calls come from 3405 distinct locations (base URLs).

And how many of **those** use canvas fingerprinting (CFP)?


In [49]:
sites_using_canvas_fingerprinting.dropDuplicates(['base_location_url']).count()


701

Of the 3405 distinct locations, 701 make at least one call to a known canvas fingerprinting script, which is quite a significant finding.

Overall stats

    Total unique (page, script) calls in the dataset: 29688
    Total unique base locations (netloc): 3405

    Total unique calls to CFP providers: 2105
    Total unique base locations using CFP: 701
    % of calls that are to a CFP provider: 7% (2105/29688)
    % of sites(base url) that uses a CFP provider: 21% (701/3405) 
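
The percentages above can be reproduced from the four counts reported in this notebook. The variable names are ours.

```python
total_calls = 29688      # distinct (location, script_url) pairs
cfp_calls = 2105         # pairs whose script matches a known CFP provider
total_locations = 3405   # distinct base locations
cfp_locations = 701      # base locations calling a known CFP script

pct_calls = 100 * cfp_calls / total_calls
pct_locations = 100 * cfp_locations / total_locations
print(f"{pct_calls:.1f}% of calls, {pct_locations:.1f}% of locations")
```

To one decimal place this gives 7.1% of calls and 20.6% of locations, which round to the 7% and 21% quoted above.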
    