generated from MuckRock/documentcloud-scraper-cron-addon
-
Notifications
You must be signed in to change notification settings - Fork 2
/
config.yaml
83 lines (81 loc) · 2.93 KB
/
config.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
title: Scraper
description: "Scrape and optionally crawl a given site for documents to upload to DocumentCloud."
instructions: |
You may specify a project to scrape the documents into as well as an access level.
Scraper can alert you by email or Slack notification when given keywords appear in
documents if you specify keywords to monitor. For Slack notifications, you must provide a webhook.
The crawl depth is a parameter that tells the Scraper how many clicks deep away from the
site you specify in order to continue looking for documents.
If the PDFs are directly linked on the site you provide
(1 click to get to the PDF), 0 is the crawl depth you should use.
If the site you provide a link to contains multiple links to other pages that have PDFs linked to those pages,
your crawl depth would be 1. A crawl depth of 2 is the maximum supported.
The Scraper Add-On now supports Google Drive links.
It will upload the first 30 Google Drive documents it sees per run.
Scraper will upload the first 100 regular documents it sees per run.
The Scraper keeps track of which documents it has seen and already uploaded.
type: object
properties:
site:
title: Site
type: string
format: uri
description: The URL of the site to start scraping
project:
title: Project
type: string
description: >-
The DocumentCloud project title or ID of the project the documents should
be uploaded to. If the project title does not exist, it will be created.
keywords:
title: Keywords
type: string
description: Keywords to search and notify on (comma separated)
filecoin:
title: Push to IPFS/Filecoin
type: boolean
description: >-
WARNING: This will push all scraped files to IPFS and Filecoin.
There is no way to remove files from these storage systems.
access_level:
title: Access Level
type: string
description: Access level of documents scraped.
default: public
enum:
- public
- organization
- private
#filetypes:
# title: File Types
# type: string
# description: File extensions to be uploaded to DocumentCloud (comma separated)
# default: ".pdf,.docx,.xlsx,.pptx,.doc,.xls,.ppt"
crawl_depth:
title: Crawl Depth
type: integer
description: Recursively scrape same-domain links found on the page (Must be between 0 and 2)
default: 0
minimum: 0
maximum: 2
notify_all:
title: Notify on all new documents
type: boolean
slack_webhook:
title: Slack Webhook
type: string
format: uri
description: Enter a slack webhook to enable Slack notifications
required:
- site
- project
categories:
- monitor
custom_disabled_email_footer: |
Read about some of the reasons why the Scraper Add-On fails in our guide: https://muckrock.notion.site/Common-reasons-the-Scraper-Add-On-fails-ecd09d1b2e254fee9ff01fad293b61c3?pvs=74
eventOptions:
name: site
events:
- hourly
- daily
- weekly