This is a small Scrapy app for scraping new articles off the front pages of news sites.
The app is deployed with Docker Swarm. The monitoring stack is based on the wonderful work of vegasbrianc/prometheus and consists of Prometheus, Node Exporter, Grafana, cAdvisor, and Alertmanager for monitoring the containers. Note that the cAdvisor image used is built for ARM. If you wish to deploy this on a non-ARM architecture, change this image to Google's official cAdvisor image.
Build the cAdvisor image corresponding to your architecture (check with uname -m).

For arm64v8 (aarch64), build the image on the ARM machine:

docker build -t emillime/cadvisor:v0.36.0-arm64 ./cadvisor_build/arm64

Or, if you use buildx to cross-compile (assuming you have installed QEMU and set up a builder):

cd art_docker
docker buildx build --platform linux/arm64 -t emillime/cadvisor:v0.36.0-arm64 --push cadvisor_build/arm64/.

For armv7 (32-bit), build the image on the ARM machine:

docker build -t emillime/cadvisor:v0.36.0-arm32v7 ./cadvisor_build/arm32v7

Or, with buildx:

cd art_docker
docker buildx build --platform linux/arm/v7 -t emillime/cadvisor:v0.36.0-arm32v7 --push cadvisor_build/arm32v7/.

Then change the image in docker-stack.yml accordingly.
- Create the AWS credentials secret (requires Docker Swarm on the host). This is needed to access the S3 bucket when uploading scraped data:
docker secret create aws_creds $HOME/.aws/credentials
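For reference, the sketch below shows roughly how a container could use the mounted secret with boto3. This is not code from this repo; the bucket name and file path are placeholders, and the only assumption taken from Docker itself is that Swarm mounts the secret at /run/secrets/aws_creds by default.

```python
# Sketch only: point boto3 at the credentials file mounted by Docker Swarm,
# then upload a scraped .jsonl file to S3. Bucket and paths are placeholders.
import os

import boto3

os.environ.setdefault("AWS_SHARED_CREDENTIALS_FILE", "/run/secrets/aws_creds")

s3 = boto3.client("s3")
s3.upload_file("data/articles.jsonl", "my-scraped-data-bucket", "articles.jsonl")
```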
The repo contains some custom Prometheus exporters.
- A Scrapy Prometheus exporter (artscraper/prometheus.py), which publishes the internal spider stats to Prometheus. It is a slight modification of an existing exporter. It publishes all internal Scrapy stats in the Spider.stats.get_stats() dictionary, including stats added by your own custom middleware (a minimal illustrative sketch is given after this list).
- Using Node Exporter's textfile collector, we publish the total file size of the scraped data (assuming you save the data on the host), along with the CPU temperature.
- This is done with the scripts export_scraped_file_size.py and temp.sh. To achieve atomic writes, I have the following line in my host crontab, where PATH is the full path to the art_docker repo:
* * * * * python3 PATH/export_scraped_file_size.py -e jsonl -d PATH/data/ -o PATH/.metrics/file_sizes.prom.$$ && mv PATH/.metrics/file_sizes.prom.$$ /node_exporter/textfile_collector/file_sizes.prom
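To give a feel for how such a stats exporter can hook into Scrapy, here is a minimal sketch. It is illustrative only, not the actual artscraper/prometheus.py: it uses prometheus_client, and the PROMETHEUS_PORT setting name is made up for the example.

```python
# Illustrative sketch only, not the actual artscraper/prometheus.py: a Scrapy
# extension that exposes the internal stats dictionary as Prometheus gauges.
from prometheus_client import Gauge, start_http_server
from scrapy import signals


class PrometheusStatsExporter:
    def __init__(self, crawler, port):
        self.crawler = crawler
        self.gauges = {}
        start_http_server(port)  # serve /metrics for Prometheus to scrape
        crawler.signals.connect(self.update, signal=signals.item_scraped)
        crawler.signals.connect(self.update, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        # PROMETHEUS_PORT is a hypothetical setting name used for this sketch.
        return cls(crawler, crawler.settings.getint("PROMETHEUS_PORT", 9410))

    def update(self, spider, **kwargs):
        # Publish every numeric entry of the stats dictionary, including
        # stats added by custom middleware.
        for key, value in self.crawler.stats.get_stats().items():
            if not isinstance(value, (int, float)):
                continue
            name = "scrapy_" + key.replace("/", "_").replace(".", "_").lower()
            if name not in self.gauges:
                self.gauges[name] = Gauge(name, key, ["spider"])
            self.gauges[name].labels(spider=spider.name).set(value)
```

A class like this would be enabled through the EXTENSIONS setting in the Scrapy settings module.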
Consult the Makefile: build the scraper image with the appropriate tags, then deploy docker-stack.yml.
A rudimentary web API has been written to orchestrate the different spiders. It should block you from running more than one spider at a time.
Swagger documentation is available at IP-TO-HOST:5666/docs.
To launch a spider, run
curl -XGET IP-TO-HOST:5666/start/SPIDER_NAME
and stop a running spider with
curl -XGET IP-TO-HOST:5666/stop
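The crawl_master code itself is not reproduced here, but a minimal version of such an API could look roughly like the sketch below. FastAPI and launching spiders via scrapy crawl are assumptions made for the example, not necessarily how the repo implements it.

```python
# Rough sketch of a single-spider orchestration API; not the repo's crawl_master.
# Assumes FastAPI and that spiders are started with "scrapy crawl <name>".
import subprocess
from typing import Optional

from fastapi import FastAPI, HTTPException

app = FastAPI()  # run with e.g.: uvicorn crawl_api_sketch:app --port 5666
current: Optional[subprocess.Popen] = None


@app.get("/start/{spider_name}")
def start(spider_name: str):
    global current
    if current is not None and current.poll() is None:
        # Refuse to start a second spider while one is still running.
        raise HTTPException(status_code=409, detail="A spider is already running")
    current = subprocess.Popen(["scrapy", "crawl", spider_name])
    return {"status": "started", "spider": spider_name}


@app.get("/stop")
def stop():
    global current
    if current is None or current.poll() is not None:
        raise HTTPException(status_code=404, detail="No spider is running")
    current.terminate()
    return {"status": "stopping"}
```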
Depending on the site, it should be easy to extend the generic spider class to support novel news sites.
You need to figure out the CSS selectors for the relevant fields and gather a set of start-page links to act as entry points for scraping. Sometimes that is all there is to it.
On some sites, different sections require custom logic. For example, a blog/discussion section may need specific CSS selectors for some fields and for the URLs to new articles. For custom fields, use predicate_loader_pairs to supply pairs (pred, loader), where pred(response) is a predicate that decides whether custom logic is needed based on the response, and loader(itemloader, response) overwrites the CSS-selector logic on the item loader. See the current spiders, and the sketch below, for examples.
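As an illustration, a new spider could look roughly like the following; the base-class name, import path, and selector attribute names are assumptions made for the example, so consult the existing spiders for the real API.

```python
# Illustrative sketch of extending the generic spider; the base-class name,
# import path, and attribute names are assumptions, not the repo's exact API.
from artscraper.spiders import GenericNewsSpider  # assumed import path


def is_blog_section(response):
    # Predicate: decide from the response whether custom logic is needed.
    return "/blog/" in response.url


def load_blog_fields(loader, response):
    # Loader: overwrite the default CSS-selector logic on the item loader.
    loader.replace_css("title", "h1.blog-title::text")
    loader.replace_css("body", "div.blog-post p::text")
    return loader


class ExampleNewsSpider(GenericNewsSpider):
    name = "example_news"
    # Start pages acting as entry points for scraping.
    start_urls = ["https://example-news-site.dk/"]

    # Default CSS selectors for ordinary article pages (illustrative names).
    title_css = "h1.article-title::text"
    body_css = "article p::text"

    # (pred, loader) pairs: when pred(response) is true, loader(itemloader,
    # response) replaces the default selectors for that response.
    predicate_loader_pairs = [(is_blog_section, load_blog_fields)]
```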
- [ ] Add a scraped-field counter to each run, and log it. Just note whether each field is found/empty/None.
- [x] Upload scraped data to S3 directly.
- [x] Use AWS Lambda to zip the scraped .jsonl files every week.
- [x] crawl_master needs a lot of work.
- [x] Refactor all spiders to a common base class.
- [ ] Bring all scraped data to a common format, and resave it to S3.
- [ ] Document the VisitedFilter middleware.
- [ ] Document the Prometheus exporter middleware.
- [x] Figure out how to operate on the full list of fields and on each scraped field (see the Scrapy ItemLoader docs). Use this to apply TakeFirst and remove dupes where appropriate.
- [x] Use w3lib to remove tags. Compare speed with html2text.
- [ ] Content duplicate detection. Several Ritzau articles appear on all sites.
- [ ] Use AWS Lambda to calculate streaming stats about the scraped data. Either make a weekly report or use the Pushgateway to Prometheus for Grafana integration.