This repository has been archived by the owner. It is now read-only.
Scan/Trim/Extract Pipeline for State Coronavirus Site
Python
Type Name Latest commit message Commit time
src Merge branch 'master' into dev Mar 25, 2020
tests added full_page option to screen capture Mar 25, 2020
.gitignore ignore credential files Mar 25, 2020
LICENSE Initial commit Mar 13, 2020
README.md converted to absolute paths Mar 22, 2020
data_pipeline.ini read defaults from .ini file Mar 22, 2020
notes.txt changed require_utc check Mar 21, 2020

README.md

corona19-data-pipeline

Scan/Trim/Extract Pipeline for Coronavirus Site

  • The code now expects to be run from the root directory of the repo.
  • This also applies when running it from an IDE such as VS Code.

Scanner

  1. Gets the data from the URLs listed in a Google Sheet.
  2. Pulls the raw HTML.
  3. Creates a clean version without the markup.
  4. Pushes it into a GitHub repo.
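
Step 3 above (creating a clean version without the markup) could look roughly like the sketch below. It uses only the standard library's `html.parser`; the names `strip_markup` and `_TextExtractor` are hypothetical and not taken from this repo, and the actual pipeline may clean pages differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def strip_markup(raw_html: str) -> str:
    """Return the text content of raw_html, one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

Calling `strip_markup(raw_html)` on the page pulled in step 2 gives a text-only version that produces smaller diffs when committed than the raw HTML would.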

Backup To S3

  1. Pulls an image for each page.
  2. Pushes it to an S3 bucket.
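
As a rough illustration of step 2, the sketch below uploads one image through an S3 client object. The key layout (`screenshots/<date>/<hash>.png`) and the function names `make_key` and `backup_image` are assumptions made for this example, not the repo's actual scheme. The client is passed in, so any object with a boto3-style `put_object` method works.

```python
import hashlib
from datetime import datetime, timezone


def make_key(url: str, captured_at: datetime) -> str:
    """Build an object key from the capture date and a short hash of the URL.

    The key layout is an assumption for illustration only.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"screenshots/{captured_at:%Y-%m-%d}/{digest}.png"


def backup_image(s3_client, bucket: str, url: str, png_bytes: bytes,
                 captured_at: datetime = None) -> str:
    """Upload png_bytes to the bucket and return the object key used."""
    captured_at = captured_at or datetime.now(timezone.utc)
    key = make_key(url, captured_at)
    # put_object is the boto3 S3 client call for uploading a single object.
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=png_bytes,
        ContentType="image/png",
    )
    return key
```

With boto3 installed and credentials configured, the client would typically come from `boto3.client("s3")`. Passing it in as an argument also makes the function easy to test with a stub.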

Specialized_Capture

  1. Fires up a captive browser.
  2. For a list of URLs, takes a screenshot of each.
  3. If the screenshots change, pushes them into git.
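
The change check in step 3 could be sketched as below: compare a SHA-256 digest of the new screenshot with the file already on disk, and overwrite only when they differ. The function name `save_if_changed` is hypothetical, and the browser capture itself is left out. This sketch assumes the screenshot is already available as bytes.

```python
import hashlib
from pathlib import Path


def save_if_changed(path: Path, new_bytes: bytes) -> bool:
    """Write new_bytes to path only if the content differs.

    Returns True when the file was created or updated, False when the
    existing file already has identical content.
    """
    new_digest = hashlib.sha256(new_bytes).hexdigest()
    if path.exists():
        old_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_bytes)
    return True
```

Only the files for which this returns `True` would then need to be committed and pushed. A byte-level comparison treats any pixel difference as a change, so pages with timestamps or rotating banners would be flagged on every run.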