# config.yaml

title: Scraper
description: "Scrape and optionally crawl a given site for documents to upload to DocumentCloud."
instructions: |
  Specify the project to scrape the documents into and, optionally, an access level.
  If you provide keywords to monitor, Scraper can alert you by email or Slack
  when those keywords appear in documents. Slack notifications require a webhook.

  The crawl depth tells the Scraper how many clicks away from the site you
  specify it should keep looking for documents.
  If the PDFs are linked directly on the site you provide
  (1 click to get to the PDF), use a crawl depth of 0.
  If the site you provide links to other pages, and the PDFs are linked on those pages,
  use a crawl depth of 1. The maximum supported crawl depth is 2.

  The Scraper Add-On now supports Google Drive links. 
  It will upload the first 30 Google Drive documents it sees per run. 
  Scraper will upload the first 100 regular documents it sees per run. 
  The Scraper keeps track of which documents it has seen and already uploaded.
type: object
properties:
  site:
    title: Site
    type: string
    format: uri
    description: The URL of the site to start scraping
  project:
    title: Project
    type: string
    description: >-
      The title or ID of the DocumentCloud project the documents should be
      uploaded to. If no project with that title exists, it will be created.
  keywords:
    title: Keywords
    type: string
    description: Keywords to search for and send notifications about (comma-separated)
  filecoin:
    title: Push to IPFS/Filecoin
    type: boolean
    description: >-
      WARNING: This will push all scraped files to IPFS and Filecoin.  
      There is no way to remove files from these storage systems.
  access_level:
    title: Access Level
    type: string
    description: Access level of documents scraped.
    default: public
    enum:
      - public
      - organization
      - private
  #filetypes:
  #  title: File Types
  #  type: string
  #  description: File extensions to be uploaded to DocumentCloud (comma separated)
  #  default: ".pdf,.docx,.xlsx,.pptx,.doc,.xls,.ppt"
  crawl_depth:
    title: Crawl Depth
    type: integer
    description: Recursively scrape same-domain links found on the page (must be between 0 and 2)
    default: 0
    minimum: 0
    maximum: 2
  notify_all:
    title: Notify on all new documents
    type: boolean
  slack_webhook:
    title: Slack Webhook
    type: string
    format: uri
    description: Enter a Slack webhook URL to enable Slack notifications
required:
  - site
  - project
categories:
  - monitor
custom_disabled_email_footer: |
  Read about some of the reasons why the Scraper Add-On fails in our guide: https://muckrock.notion.site/Common-reasons-the-Scraper-Add-On-fails-ecd09d1b2e254fee9ff01fad293b61c3?pvs=74
eventOptions:
  name: site
  events:
    - hourly
    - daily
    - weekly
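# Illustrative only: a sketch of input values that should satisfy the schema
# above. The URL, project name, and keywords are made-up placeholders, and the
# exact way inputs are submitted to the Add-On is not shown here.
#
#   site: https://example.com/reports    # required; must be a URI
#   project: Example Reports             # required; title or ID
#   keywords: budget,audit               # optional; comma-separated
#   access_level: private                # one of public, organization, private
#   crawl_depth: 1                       # integer from 0 to 2; PDFs are one page away from the site
#   notify_all: false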