GitHub - citp/email_tracking: Code and data release for our PETS 2018 paper: "I never signed up for this! Privacy implications of email tracking".

I never signed up for this! Privacy implications of email tracking

This is a public code and data release for the research paper "I never signed up for this! Privacy implications of email tracking", which will appear at PETS 2018. Portions of the code for this project borrow heavily from Jeffrey's undergraduate senior thesis, available here.

Authors: Steven Englehardt (@englehardt), Jeffrey Han (@itdelatrisu), and Arvind Narayanan (@randomwalker)

Paper: available here.

Components

Core components:

crawler_emails/ - A web crawler, built on OpenWPM, to simulate email views and link clicks.
crawler_mailinglists/ - A web crawler, built on OpenWPM, to find and submit mailing list sign-ups.
email-tracking-tester/ - A tool to test the privacy properties of a mail client.
mailserver/ - The mail server used to collect our corpus of emails.
analysis/ - Coming soon

Code Usage

Additional documentation is available in the README of each component subdirectory.

System Requirements

The framework is fully tested only on Ubuntu 16.04, and requires Java and Python runtime environments.
The processes (described below) can be run on separate machines. The mail server is OS-independent, but the web crawlers only run on Linux.
Depending on the number of registered sites, the mail server might store anywhere from a few hundred megabytes to tens of gigabytes of data on disk per month.

Processes

The system consists of three long-running processes:

The mail server, which receives, stores, and analyzes incoming mail.

```
$ cd mailserver
$ mvn clean package
$ java -jar target/mailserver.jar
```

The mailing list crawler, which crawls a list of input sites and searches for mailing lists.
```
$ cd crawler_mailinglists
$ python crawl_mailinglist_signup.py
```
The email crawler, which renders emails in a simulated webmail environment and visits links from those emails.
```
$ cd crawler_emails
$ python crawl_*.py
```

SMTP Configuration

Running the mail server requires a domain name with MX records pointing to the server. Additionally, if running the mailing list crawler from machines other than the mail server's machine, host records (A, CNAME) must also be set.
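
For illustration, the records described above could be written in zone-file syntax roughly as follows. This is a minimal sketch, not part of this repository: `zone_records` is a hypothetical helper, and the domain `example.org`, the host name `mail.example.org`, the address `192.0.2.10`, and the priority `10` are all placeholders to replace with your own values.

```python
def zone_records(domain, mail_host, ip_address, priority=10):
    """Return zone-file lines for an MX record and the A record it targets.

    All arguments are placeholders supplied by the caller; nothing here is
    read from the mail server's configuration.
    """
    return [
        # MX record: mail addressed to `domain` is delivered to `mail_host`.
        f"{domain}. IN MX {priority} {mail_host}.",
        # A record: resolves `mail_host` to the server's IP address.
        f"{mail_host}. IN A {ip_address}",
    ]


for line in zone_records("example.org", "mail.example.org", "192.0.2.10"):
    print(line)
```

The exact syntax for entering these records depends on your DNS provider; many providers expose the same fields through a web form instead of a zone file.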

Data

The following data used in the analysis is available for download:

Mailbox

Includes email metadata (subject, sender, etc.) and email body content.

Download link: mailbox.tar.bz2

Contents:

email_inbox.sqlite
- users table -- Email address registration records. Maps email address to registration site and time.
- inbox table -- Subject, sender, delivery time, and other metadata for each email.
mail/ -- Directory of raw .eml files saved by the mail server. Use the inbox table of the email_inbox.sqlite database to navigate.
html/ -- HTML bodies parsed from the corresponding raw email bodies. These are the HTML emails loaded by the crawlers.
html_after_filtering/ -- HTML bodies after filtering tracking tags using EasyList and EasyPrivacy. See Section 7 of the paper.
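
Column names are not documented here, so a reasonable first step is to read them from the database itself. The sketch below assumes only the two table names listed above (`users` and `inbox`) and that the archive has been extracted locally; `describe_tables` is a hypothetical helper, not part of this repository.

```python
import sqlite3


def describe_tables(db_path, tables=("users", "inbox")):
    """Map each table name to its list of column names.

    The table names come from the contents list above; the columns are
    read from the database rather than assumed.
    """
    schema = {}
    conn = sqlite3.connect(db_path)
    try:
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the name.
            rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [row[1] for row in rows]
    finally:
        conn.close()
    return schema
```

For example, `describe_tables("email_inbox.sqlite")` returns a dictionary keyed by table name, which can then guide queries that join inbox rows to the files under mail/ and html/.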

Email view crawl

Crawl data generated by opening the HTML email bodies given in the html/ directory of the mailbox using a simulated webmail client. This is the primary dataset used for the results in Section 4.

Download link: 2017-05-17_email_tracking_view_crawl.sqlite.bz2
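
The crawl databases are distributed as bzip2-compressed SQLite files. As a minimal sketch, they could be unpacked and inspected with the Python standard library as shown below. The helper names `decompress_bz2` and `list_tables` are hypothetical, and no table names are assumed; the code only reports what the database contains.

```python
import bz2
import shutil
import sqlite3


def decompress_bz2(src_path, dst_path):
    """Stream-decompress a .bz2 file to dst_path without loading it into memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

A typical use would be `list_tables(decompress_bz2("2017-05-17_email_tracking_view_crawl.sqlite.bz2", "view_crawl.sqlite"))`, where the output file name is arbitrary.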

Filtered email view crawl

Crawl data generated by opening the HTML email bodies given in the html_after_filtering/ directory of the mailbox using a simulated webmail client. This is the primary dataset used for the results in the "Server-side email content filtering" subsection of Section 7.

Download link: 2017-05-28_email_tracking_filtered_view_crawl.sqlite.bz2

Email click crawl

Crawl data generated by visiting a sample of links extracted from the HTML email bodies of each email in the html/ directory of the mailbox. This is the primary dataset used for the results in Section 5.

Download link: 2017-05-17_email_tracking_click_crawl.sqlite.bz2

Mailing list sign-up success rate crawl

Crawl data generated by running our mailing list sign-up procedure on the top sites, instrumenting the resulting pages to compute the overall level of successful sign-ups. This is the primary dataset used for the results in the "Form submission measurement" subsection of Section 3.

Download link: 2017-08-13_signup_success_measurement.sqlite.bz2

Funding

This project was funded by NSF Grant CNS 1526353, a research grant from Mozilla, and Amazon AWS Cloud Credits for Research.
