
Scrapy Datadog Extension


scrapy-datadog-extension is a Scrapy extension that sends metrics from your spider executions (Scrapy stats) to Datadog.

Installation

There is no public pre-packaged version yet. If you want to use it, you will have to clone the project and make it installable easily from your requirements.txt.

Configuration

First, you will need to add the extension to the EXTENSIONS dict located in your settings.py file. For example:

EXTENSIONS = {
    'scrapy-datadog-extension': 1,
}

Then you need to provide the following variables, directly from the Scrapinghub settings of your jobs:

  • DATADOG_API_KEY: Your Datadog API key.
  • DATADOG_APP_KEY: Your Datadog APP key.
  • DATADOG_CUSTOM_TAGS: List of tags to bind to every metric.
  • DATADOG_CUSTOM_METRICS: Subset of metrics to send to Datadog.
  • DATADOG_METRICS_PREFIX: The prefix to apply to all of your metrics, e.g. kp.
  • DATADOG_HOST_NAME: The hostname you want your metrics to be associated with, e.g. app.scrapinghub.com.
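Put together, a settings.py fragment could look like the sketch below. All values are illustrative placeholders (only the kp prefix and app.scrapinghub.com hostname come from the examples above), not defaults shipped by the extension:

```python
# settings.py -- illustrative placeholder values only
DATADOG_API_KEY = "your-datadog-api-key"         # placeholder, keep it secret
DATADOG_APP_KEY = "your-datadog-app-key"         # placeholder, keep it secret
DATADOG_CUSTOM_TAGS = ["env:production"]         # tags bound to every metric
DATADOG_CUSTOM_METRICS = ["item_scraped_count"]  # hypothetical subset of stats
DATADOG_METRICS_PREFIX = "kp"                    # prefix from the example above
DATADOG_HOST_NAME = "app.scrapinghub.com"        # hostname from the example above
```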

Sometimes one might need to set tags at runtime, for example to compute them from the spider arguments. To support this scenario, just set a tags attribute on your spider with a list of statsd-compatible keys (i.e. ["foo", ...] or ["foo:bar", ...]). Note that all metrics will then be tagged accordingly.
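As a sketch of that scenario (the spider and its country argument are hypothetical, and a plain class stands in for scrapy.Spider to keep the example dependency-free), a spider could derive its tags from an argument like this:

```python
class ExampleSpider:  # stands in for scrapy.Spider in this sketch
    name = "example"

    def __init__(self, country=None):
        # statsd-compatible tags: bare keys ("foo") or key:value pairs ("foo:bar")
        self.tags = ["source:runtime"]
        if country:
            self.tags.append(f"country:{country}")
```

Running the job with a country=fr spider argument would then tag all of its metrics with country:fr in addition to the configured tags.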

How it works

Basically, on the spider_closed signal, this extension collects the Scrapy stats associated with a given project/spider/job and extracts the variables listed in a stats_to_collect list. Custom variables are also added:

  • elapsed_time: a simple computation of finish_time - start_time.
  • done: a simple counter, acting like a ping to indicate that a job runs regularly.

At the end, we have a list of metrics with associated tags (to enable better filtering in Datadog):

  • project: The Scrapinghub project ID.
  • spider_name: The Scrapinghub spider name as defined in the spider class.

Then, everything is sent to Datadog, using the Datadog API.
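As a rough sketch of that flow (the function name and exact metric names below are assumptions, not the extension's actual API), building the payload could look like this:

```python
def build_job_metrics(stats, prefix, tags):
    """Sketch: turn a finished job's Scrapy stats into Datadog metric payloads.

    `stats` holds datetime objects under "start_time" and "finish_time",
    as Scrapy records them; everything else here is illustrative.
    """
    # elapsed_time: finish_time - start_time, expressed in seconds
    elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
    return [
        {"metric": f"{prefix}.elapsed_time", "points": elapsed, "tags": tags},
        # done: ping-style counter showing the job ran
        {"metric": f"{prefix}.done", "points": 1, "tags": tags},
    ]
```

A payload shaped like this can then be submitted with api.Metric.send(...) from the datadog Python package, after calling initialize(api_key=..., app_key=...).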

Known issues

  • Sometimes, when spider_closed is executed right after the job completes, some Scrapy stats are missing, so we send an incomplete list of metrics, which prevents us from relying 100% on this extension.

By the way, we're hiring across the world 👇


Join our engineering team to help us build data-intensive projects! We are looking for people who love their craft and are the best at it.



This code is MIT licensed.
Designed & built by Kpler engineers with a
💻 and some 🍣.