An example crawler that fetches ads.txt files for a list of URLs or domains and saves the parsed records to a SQLite DB table.
```
Usage: adstxt_crawler.py [options]

Options:
  -h, --help            show this help message and exit
  -t FILE, --targets=FILE
                        list of domains to crawl ads.txt from
  -d FILE, --database=FILE
                        database to dump crawled data into
  -v, --verbose         increase verbosity (specify multiple times for more)
```
The targets file can be a list of domains, URLs, etc. For each line, the crawler will extract the full hostname, validate it, and issue a request to http://HOSTNAME/ads.txt
```
$ cat target_domains.txt
#https://chicagotribune.com
#http://latimes.com/sports
#washingtonpost.com
#http://nytimes.com/index.html
localhosttribune.com
```
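For illustration, here is a minimal sketch of that per-line processing (Python 3 shown for brevity; the function name is hypothetical and this is not the crawler's actual code):

```python
# Minimal sketch of the per-line processing described above.
# fetch_ads_txt is an illustrative name, not the crawler's actual API.
import requests
from urllib.parse import urlparse

def fetch_ads_txt(line):
    line = line.strip()
    if not line or line.startswith('#'):
        return None                      # assume blank and #-commented lines are skipped
    # Lines may be bare domains or full URLs; normalize before parsing.
    if '//' not in line:
        line = 'http://' + line
    hostname = urlparse(line).hostname   # extract and validate the hostname
    if not hostname or '.' not in hostname:
        return None
    resp = requests.get('http://%s/ads.txt' % hostname, timeout=10)
    return resp.text if resp.status_code == 200 else None
```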
The project depends on the following libraries and programs being installed:
- Python 2 or later
- Python HTTP Requests library (pip install requests)
Execute this command to create the DB table:
```
$ sqlite3 adstxt.db < adstxt_crawler.sql
```
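The authoritative schema lives in adstxt_crawler.sql. As a rough sketch of what such a table might hold, you could create an equivalent table from Python; the column names below are assumptions based on the ads.txt field definitions, not necessarily the project's actual schema:

```python
# Rough sketch only: the real schema is defined in adstxt_crawler.sql.
# Columns are assumed from the ads.txt specification fields
# (exchange domain, seller account ID, account type, optional cert authority ID).
import sqlite3

conn = sqlite3.connect('adstxt.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS adstxt (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        site_domain TEXT,
        exchange_domain TEXT,
        seller_account_id TEXT,
        account_type TEXT,
        tag_id TEXT
    )
""")
conn.commit()
conn.close()
```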
Typical usage is to pass the filename of the target URLs and the filename of the SQLite DB:
```
$ ./adstxt_crawler.py -t target_domains.txt -d adstxt.db
Wrote 3 records from 1 URLs to adstxt.db
```
Each run writes a sequence of log entries to adstxt_crawler.log.
You can examine the DB records created as follows:
$echo "select * from adstxt;" | sqlite3 adstxt.db
You can clear the DB records as follows:
$echo "delete from adstxt;" | sqlite3 adstxt.db
This is an example prototype crawler and would be suitable only for very modest production usage. It lacks many of the niceties of a production crawler, such as parallel HTTP download and parsing of the data files, stateful recovery from target servers being down, usage of a real production DB server, etc.
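As one example of the kind of extension mentioned above, parallel HTTP download could be sketched with Python's standard concurrent.futures module; this is an illustration, not a feature of this project:

```python
# Illustrative only: a parallel-download sketch, not part of this project.
# fetch_ads_txt is the hypothetical helper sketched earlier.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(lines, max_workers=8):
    # Fetch ads.txt for many targets concurrently; results are returned
    # in the same order as the input lines.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_ads_txt, lines))
```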
The initial author is Neal Richter (firstname.lastname@example.org).

The open source license used is the 2-clause BSD license.