Web scraping for "El universal" news

Web scraping for different news sites. It retrieves the latest news from each site and stores it in a new folder in your working directory.

Output

  • First stage output (Extract) -> csv: Each day's news is stored in a new .csv file, in a separate folder with the following structure: output/[site]/[day]/__consolidated_news.csv

  • Second stage output (Transform) -> pkl: Some natural language processing is performed using the NLTK library. The output is stored in a consolidated pickle file: output/transform.pkl

  • Third stage output (Load) -> sqlite: The information from the previous stages is stored and updated in a general database: output/newspaper.db

How to use it

Note: It is common for news sites to change parts of their HTML structure. If you find that the code is not working for you, check the query expressions used in the configuration.yaml.
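For orientation, such query expressions typically look something like the fragment below. This is a hypothetical example; the key names and selectors here are illustrative only, so check the repository's actual configuration.yaml for the real structure.

```yaml
el_universal:
  url: https://www.eluniversal.com.mx  # hypothetical site entry
  queries:
    homepage_article_links: 'a.article-link'  # selector names are illustrative
    article_title: '.titulo h1'
    article_body: '.contenido p'
```

When a site redesign breaks the scraper, updating these selectors is usually enough; the pipeline code itself rarely needs to change.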

Run the following command from your terminal:

python run_pipeline.py --config_file config.yaml

Each stage can also be run individually from news_scraping/[stage]/main.py.

Requirements

A requirements.txt file is included so you can install all the needed packages.
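A typical setup, using the standard pip workflow (a virtual environment is optional but recommended):

```shell
# create and activate an isolated environment (optional)
python -m venv venv
source venv/bin/activate

# install the pinned dependencies from the repository
pip install -r requirements.txt
```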
