Parsing newspapers every day to get a corpus
This program scrapes web pages twice a day and dumps the resulting text into a directory with the following layout:
out/<newspaper name>/2018-04-09_21:00.txt
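As an illustration of that layout, here is a minimal Python sketch that builds the path for one scrape. It is not the program's own code: the function name output_path and the newspaper name "example-paper" are hypothetical, and the timestamp pattern is inferred from the single example above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def output_path(output_dir: str, newspaper: str, scraped_at: datetime) -> PurePosixPath:
    """Build <output_dir>/<newspaper>/<YYYY-MM-DD_HH:MM>.txt for one scrape."""
    # Pattern inferred from the example file name 2018-04-09_21:00.txt.
    stamp = scraped_at.strftime("%Y-%m-%d_%H:%M")
    return PurePosixPath(output_dir) / newspaper / f"{stamp}.txt"


# "example-paper" is a hypothetical newspaper name.
print(output_path("out", "example-paper", datetime(2018, 4, 9, 21, 0)))
# prints: out/example-paper/2018-04-09_21:00.txt
```

Note that the colon in the timestamp is not a valid character in Windows file names, so this layout assumes a Unix-like file system.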
You can specify as many sites to scrape as you want in the configuration file.
    Launeparser scrapes newspapers

    Usage:
      launeparser [command]

    Available Commands:
      help        Help about any command
      scrape      Instantly scrape
      start       Start the server and scraping
      version     Show build and version

    Flags:
      -h, --help                help for launeparser
          --log.format string   one of text or json (default "text")
          --log.level string    one of debug, info, warn, error or fatal (default "info")
          --log.line            enable filename and line in logs
          --output string       output directory (default "out")

    Use "launeparser [command] --help" for more information about a command.
    server:
      host: 127.0.0.1
      port: 8012
      debug: true
    log:
      level: debug
      format: text
      line: true
    newspapers:
      - url: http://...
        name: ...
The server section is optional, as is the log section, since both have sane defaults. The server section is also completely unused when running the launeparser scrape command.
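To show what falling back to defaults can look like for the log section, here is a small Python sketch. It is an illustration only, not how launeparser itself loads its configuration. The level and format defaults come from the help text above. Treating line as off by default is an assumption, and the helper name with_defaults and the example newspaper entry are hypothetical. Server defaults are left out because the source does not state them.

```python
import copy

# "level" and "format" defaults are taken from the help text;
# "line" being off by default is an assumption.
LOG_DEFAULTS = {"level": "info", "format": "text", "line": False}


def with_defaults(config: dict) -> dict:
    """Return a copy of config whose log section has every missing key filled in."""
    merged = copy.deepcopy(config)
    log = merged.get("log") or {}
    # Keys present in the user's config win over the defaults.
    merged["log"] = {**LOG_DEFAULTS, **log}
    return merged


# A configuration that only lists newspapers (hypothetical entry).
minimal = {"newspapers": [{"url": "http://example.org", "name": "example"}]}
print(with_defaults(minimal)["log"])
```

With this approach a configuration file that contains only the newspapers list is enough, and any log key that is set explicitly overrides its default.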