A sett or set is a badger's den which usually consists of a network of tunnels and numerous entrances. Setts incorporate larger chambers used for sleeping or rearing young.
This script is designed to raise young Privacy Badgers by teaching them a little
about the trackers on popular sites, thus preparing them to fight trackers out
in the wild.
crawler.py visits the top 2,000 sites of the Majestic Million
with a fresh version of Privacy Badger installed and saves the
snitch_map it learns in
Prerequisites: have Docker installed. Make sure your user is part of the
docker group so that you can build and run Docker images without
sudo. You can add yourself to the group with:
$ sudo usermod -aG docker $USER
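Group changes only take effect in new login sessions, so it can help to confirm membership before running a scan. The following is a minimal sketch, not part of this repository; the helper name user_in_group is made up for illustration, and it relies on the Unix-only grp module.

```python
import grp
import os


def user_in_group(group_name, group_names=None):
    """Return True if group_name is among the given (or current) group names.

    group_names is an optional list of names, mainly useful for testing.
    When omitted, the groups of the current process are looked up.
    """
    if group_names is None:
        group_names = []
        for gid in os.getgroups():
            try:
                group_names.append(grp.getgrgid(gid).gr_name)
            except KeyError:
                # A group ID without a name entry; skip it.
                continue
    return group_name in group_names


if __name__ == "__main__":
    if user_in_group("docker"):
        print("Current session is in the docker group.")
    else:
        print("Not in the docker group yet; try logging out and back in.")
```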
Clone the repository
$ git clone https://github.com/efforg/badger-sett
Run a scan
$ ./runscan.sh
This will run a scan with the latest version of Privacy Badger's master branch and won't commit the results.
To run the script with a different branch of Privacy Badger, set the PB_BRANCH environment variable:
$ PB_BRANCH=my-feature-branch ./runscan.sh
You can also pass arguments to
crawler.py, the Python script that does the actual crawl. Any arguments passed to
runscan.sh will be forwarded to
crawler.py. To control the number of sites that the crawler visits, use the
--n-sites argument (the default is 2000). For example:
$ ./runscan.sh --n-sites 10
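To illustrate what happens to a forwarded argument, here is a minimal sketch of how a script like crawler.py might parse --n-sites with argparse. This is an assumption for illustration only: the real parser in crawler.py may be structured differently, and build_parser and parse are hypothetical names. Only the flag name and its default of 2000 come from the text above.

```python
import argparse


def build_parser():
    """Build a hypothetical parser that accepts the --n-sites flag."""
    parser = argparse.ArgumentParser(
        description="Illustrative crawler argument parser (not the real one)"
    )
    parser.add_argument(
        "--n-sites",
        type=int,
        default=2000,
        help="number of sites to visit (default: 2000)",
    )
    return parser


def parse(argv):
    """Parse a list of command-line tokens, as forwarded by a wrapper script."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse(["--n-sites", "10"])
    print("Would visit", args.n_sites, "sites")
```

argparse converts the dash in --n-sites to an underscore, so the value is read as args.n_sites.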
Monitor the scan
To have the scan print verbose output about which sites it's visiting, use the
If you don't use that argument, all output will still be logged to
docker-out/log.txt, beginning after the script outputs "Running scan in Docker..."
To set up the script to run periodically and automatically update the repository with its results:
Create a new ssh key with
ssh-keygen. Give it a name unique to the repository.
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/USER/.ssh/id_rsa): /home/USER/.ssh/id_rsa_badger_sett
Add the new key as a deploy key with R/W access to the repo on GitHub. See https://developer.github.com/v3/guides/managing-deploy-keys/
Add an SSH host alias for GitHub that uses the new key pair. Create or open
~/.ssh/config and add the following:
Host github-badger-sett
  HostName github.com
  User git
  IdentityFile /home/USER/.ssh/id_rsa_badger_sett
Configure git to connect to the remote over SSH. Edit
[remote "origin"]
  url = ssh://git@github-badger-sett:/efforg/badger-sett
This will have
git connect to the remote using the new SSH key by default.
Create a cron job to call
runscan.sh once a day. Set the environment variable
RUN_BY_CRON=1 to turn off TTY forwarding to
docker run (which would break the script in cron), and set
GIT_PUSH=1 to have the script automatically commit and push
results.json when the scan finishes. Here's an example crontab entry:
0 0 * * * RUN_BY_CRON=1 GIT_PUSH=1 /home/USER/badger-sett/runscan.sh
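The entry above has three parts: five schedule fields (minute, hour, day of month, month, day of week), the environment variable assignments, and the command. "0 0 * * *" means minute 0 of hour 0, every day. The sketch below splits the example line into those parts to make the layout explicit. It is illustrative only; parse_cron_line is a made-up helper, and it handles only simple entries like this one, not the full crontab syntax.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split a simple crontab entry into (schedule, env, command).

    Assumes five whitespace-separated schedule fields, then zero or more
    NAME=value assignments, then the command itself.
    """
    tokens = line.split()
    if len(tokens) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, tokens[:5]))
    rest = tokens[5:]
    env = {}
    # Leading NAME=value tokens are environment assignments for the command.
    while rest and "=" in rest[0] and not rest[0].startswith("/"):
        name, value = rest[0].split("=", 1)
        env[name] = value
        rest = rest[1:]
    return schedule, env, " ".join(rest)


schedule, env, command = parse_cron_line(
    "0 0 * * * RUN_BY_CRON=1 GIT_PUSH=1 /home/USER/badger-sett/runscan.sh"
)
```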
If everything has been set up correctly, the script should push a new version of
results.json after each crawl. Soon, whenever you
make a new version of Privacy Badger, it will pull the latest version of the crawler's data and ship it with the new version of the extension.
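Once a crawl has produced results.json, a quick summary can help sanity-check it. The sketch below assumes that the file contains a top-level "snitch_map" object mapping each tracker domain to the list of sites it was seen on; that layout is an assumption and should be checked against an actual results file. The helper name top_trackers and the sample domains are invented for illustration.

```python
import json
from collections import Counter


def top_trackers(results, n=3):
    """Return the n tracker domains seen on the most distinct sites.

    Assumes results["snitch_map"] maps tracker domain -> list of sites.
    """
    snitch_map = results.get("snitch_map", {})
    counts = Counter(
        {tracker: len(set(sites)) for tracker, sites in snitch_map.items()}
    )
    return counts.most_common(n)


# Made-up sample data in the assumed layout, not real crawl output.
sample = json.loads(
    """
    {
      "snitch_map": {
        "tracker.example": ["a.example", "b.example", "c.example"],
        "ads.example": ["a.example"]
      }
    }
    """
)

if __name__ == "__main__":
    for tracker, site_count in top_trackers(sample):
        print(tracker, site_count)
```

To run it against a real crawl, load the file with json.load(open("results.json")) in place of the sample.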