## Detecting canvas fingerprinting scripts
- This notebook demonstrates canvas fingerprinting detection using streaming analysis
- Canvas fingerprinting detection code is taken from https://github.com/sensor-js/OpenWPM-mobile
- For background on canvas fingerprinting see our [CCS'14](https://securehomes.esat.kuleuven.be/~gacar/persistent/) and [CCS'16](https://webtransparency.cs.princeton.edu/webcensus/) studies.



In [2]:
import re
import json
import sqlite3
import pandas as pd
from _collections import defaultdict
from tqdm import tqdm

In [3]:
# import some analysis utilities from https://github.com/englehardt/crawl_utils
import sys
sys.path.append('./crawl_utils/')
import domain_utils as du
import analysis_utils as au

pd.set_option("display.max_colwidth",500)
pd.set_option("display.max_rows",500)


In [4]:
# use the sample sqlite
DB = '/home/marleensteinhoff/UNi/Projektseminar/Datenanalyse/sample_2018-06_1m_stateless_census_crawl.sqlite'

### Load JavaScript calls

In [5]:
con = sqlite3.connect(DB)

con.row_factory = sqlite3.Row
cur = con.cursor()
js = pd.read_sql_query("SELECT * FROM javascript", con)

print("Number of javascript calls", len(js))

Number of javascript calls 501207


In [6]:
# Add the helper column
js['script_ps1'] = js['script_url'].apply(lambda x: du.get_ps_plus_1(x) if x is not None else None)
js.head(3)

Unnamed: 0,id,crawl_id,visit_id,script_url,script_line,script_col,func_name,script_loc_eval,document_url,top_level_url,call_stack,symbol,operation,value,arguments,time_stamp,script_ps1
0,1,7,7,https://www.google.co.in/?gws_rd=ssl,1,3641,,,https://www.google.co.in/?gws_rd=ssl,https://www.google.co.in/?gws_rd=ssl,,window.navigator.userAgent,get,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,,2018-06-27T14:19:39.880Z,google.co.in
1,2,7,7,https://www.google.co.in/?gws_rd=ssl,1,3731,,,https://www.google.co.in/?gws_rd=ssl,https://www.google.co.in/?gws_rd=ssl,,window.navigator.userAgent,get,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,,2018-06-27T14:19:39.880Z,google.co.in
2,3,7,7,https://www.google.co.in/?gws_rd=ssl,1,3732,,,https://www.google.co.in/?gws_rd=ssl,https://www.google.co.in/?gws_rd=ssl,,window.navigator.userAgent,get,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0,,2018-06-27T14:19:39.882Z,google.co.in


### Breakdown of instrumented function calls

In [7]:
js[js.operation == "call"].symbol.value_counts().head(10)

window.Storage.getItem                  46851
window.Storage.setItem                  18104
window.Storage.removeItem               13812
CanvasRenderingContext2D.fill            7258
CanvasRenderingContext2D.save            7074
CanvasRenderingContext2D.restore         7070
HTMLCanvasElement.getContext             4208
window.Storage.key                       3689
CanvasRenderingContext2D.measureText     3103
CanvasRenderingContext2D.stroke          2393
Name: symbol, dtype: int64

### Canvas API calls
- Print the most common arguments to `CanvasRenderingContext2D.fillText`,
which is used to draw a text onto canvas.
- `Cwm fjordbank glyphs vext quiz` is a [perfect pangram](https://en.wikipedia.org/wiki/Pangram#Short_pangrams)
that [we found](https://securehomes.esat.kuleuven.be/~gacar/persistent/#canvas-results) to be commonly used by canvas fingerprinters.

In [8]:
js[(js.operation == "call") &
   (js.symbol == "CanvasRenderingContext2D.fillText")
  ].arguments.value_counts().head(10)

{"0":"Cwm fjordbank glyphs vext quiz, 😃","1":4,"2":45}    74
{"0":"Cwm fjordbank glyphs vext quiz, 😃","1":2,"2":15}    74
{"0":"!image!","1":4,"2":17}                              39
{"0":"!image!","1":2,"2":15}                              39
{"0":"!H71JCaj)]# 1@#","1":4,"2":8}                       19
{"0":"Soft Ruddy Foothold 2","1":2,"2":2}                 19
{"0":"🇺​🇳","1":0,"2":0}                                   18
{"0":"🇺🇳","1":0,"2":0}                                    18
{"0":"🕴​♀️","1":0,"2":0}                                  14
{"0":"🕴‍♀️","1":0,"2":0}                                  14
Name: arguments, dtype: int64

### Streaming analysis to detect canvas fingerprinting
- To detect potential canvas fingerprinters we seek for [a set of conditions](http://randomwalker.info/publications/OpenWPM_1_million_site_tracking_measurement.pdf#page=12)* to be present. (We use a slightly different set of conditions to reduce false negatives.)


In [9]:
CANVAS_READ_FUNCS = [
    "HTMLCanvasElement.toDataURL",
    "CanvasRenderingContext2D.getImageData"
    ]

CANVAS_WRITE_FUNCS = [
    "CanvasRenderingContext2D.fillText",
    "CanvasRenderingContext2D.strokeText"
    ]

"""
Criteria 3 from Englehardt & Narayanan, 2016
"3. The script should not call the save, restore, or addEventListener
methods of the rendering context."

`addEventListener` is only called for HTMLCanvasElement, so we use that.
"""
CANVAS_FP_DO_NOT_CALL_LIST = ["CanvasRenderingContext2D.save",
                              "CanvasRenderingContext2D.restore",
                              "HTMLCanvasElement.addEventListener"]

In [10]:
MIN_CANVAS_TEXT_LEN = 10
MIN_CANVAS_IMAGE_WIDTH = 16
MIN_CANVAS_IMAGE_HEIGHT = 16


def get_canvas_text(arguments):
    """Return the string that is written onto canvas from function arguments."""
    if not arguments:
        return ""
    canvas_write_args = json.loads(arguments)
    try:
        # cast numbers etc. to a unicode string
        return unicode(canvas_write_args["0"])
    except Exception:
        return ""


def are_get_image_data_dimensions_too_small(arguments):
    """Check if the retrieved pixel data is larger than min. dimensions."""
    # https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/getImageData#Parameters  # noqa
    get_image_data_args = json.loads(arguments)
    sw = int(get_image_data_args["2"])
    sh = int(get_image_data_args["3"])
    return (sw < MIN_CANVAS_IMAGE_WIDTH) or (sh < MIN_CANVAS_IMAGE_HEIGHT)



In [11]:
def get_canvas_fingerprinters(canvas_reads, canvas_writes, canvas_styles,
                              canvas_banned_calls, canvas_texts):
    canvas_fingerprinters = set()
    for script_address, visit_ids in canvas_reads.items():
        if script_address in canvas_fingerprinters:
            continue
        canvas_rw_visits = visit_ids.\
            intersection(canvas_writes[script_address])
        if not canvas_rw_visits:
            continue
        # we can remove the following, we don't use the style/color condition
        for canvas_rw_visit in canvas_rw_visits:
            # check if the script has made a call to save, restore or
            # addEventListener of the Canvas API. We exclude scripts making
            # these calls to eliminate false positives
            if canvas_rw_visit in canvas_banned_calls[script_address]:
                print ("Excluding potential canvas FP script", script_address,
                       "visit#", canvas_rw_visit,
                       canvas_texts[(script_address, canvas_rw_visit)])
                continue
            canvas_fingerprinters.add(script_address)
            #print ("Canvas fingerprinter", script_address, "visit#",
            #       canvas_rw_visit,
            #       canvas_texts[(script_address, canvas_rw_visit)])
            break

    return canvas_fingerprinters

#### Start streaming analysis

In [12]:
query = """SELECT sv.site_url, sv.visit_id,
    js.script_url, js.operation, js.arguments, js.symbol, js.value
    FROM javascript as js LEFT JOIN site_visits as sv
    ON sv.visit_id = js.visit_id WHERE
    js.script_url <> ''
    """

canvas_reads = defaultdict(set)
canvas_writes = defaultdict(set)
canvas_texts = defaultdict(set)
canvas_banned_calls = defaultdict(set)
canvas_styles = defaultdict(lambda: defaultdict(set))

for row in tqdm(cur.execute(query)):
    # visit_id, script_url, operation, arguments, symbol, value = row[0:6]
    visit_id = row["visit_id"]
    site_url = row["site_url"]
    script_url = row["script_url"]
    operation = row["operation"]
    arguments = row["arguments"]
    symbol = row["symbol"]
    value = row["value"]

    # Exclude relative URLs, data urls, blobs
    if not (script_url.startswith("http://")
            or script_url.startswith("https://")):
        continue
    if symbol in CANVAS_READ_FUNCS and operation == "call":
        if (symbol == "CanvasRenderingContext2D.getImageData" and
                are_get_image_data_dimensions_too_small(arguments)):
            continue
        canvas_reads[script_url].add(visit_id)
    elif symbol in CANVAS_WRITE_FUNCS:
        text = get_canvas_text(arguments)
        # Python miscalculates the length of unicode strings that contain
        # surrogate pairs such as emojis. This make strings look longer
        # than they really are, and is causing false positives.
        # For instance length of "🏴󠁧", which is written onto canvas by
        # Wordpress to check emoji support, is returned as 13.
        # We ignore non-ascii characters to prevent false positives.
        # Perhaps a good idea to log such cases to prevent real fingerprinting
        # scripts to slip in.
        if len(text.encode('ascii', 'ignore')) >= MIN_CANVAS_TEXT_LEN:
            canvas_writes[script_url].add(visit_id)
            # the following is used to debug false positives
            canvas_texts[(script_url, visit_id)].add(text)
    elif symbol == "CanvasRenderingContext2D.fillStyle" and\
            operation == "call":
        canvas_styles[script_url][visit_id].add(value)
    elif operation == "call" and symbol in CANVAS_FP_DO_NOT_CALL_LIST:
        canvas_banned_calls[script_url].add(visit_id)



500911it [00:03, 153928.63it/s]


In [13]:
canvas_fingerprinters = get_canvas_fingerprinters(canvas_reads,
                                                  canvas_writes,
                                                  canvas_styles,
                                                  canvas_banned_calls,
                                                  canvas_texts)

In [14]:
# Mark canvas fingerprinting scripts in the dataframe
js["canvas_fp"] = js["script_url"].map(lambda x: x in canvas_fingerprinters)
# Extract first arguments of function calls as a separate column
js["arg0"] = js["arguments"].map(lambda x: json.loads(x)["0"] if x else "")

In [15]:
## List canvas fingerprinting scripts

In [16]:
js[(js.canvas_fp) &
   (js.operation == "call") &
   (js.symbol == "CanvasRenderingContext2D.fillText")
  ].rename({"arg0": "canvas_text"}, axis='columns')[["top_level_url", "script_ps1", "canvas_text"]].\
        drop_duplicates()

Unnamed: 0,top_level_url,script_ps1,canvas_text
