Web scraping for "El universal" news
A web scraping pipeline for different news sites. It retrieves the latest news from each site and stores it in a new folder in your working directory.
- First stage output (Extract) -> csv: each day's news is stored in a new .csv file, inside a separate folder with the following structure: output/[site]/[day]/__consolidated_news.csv
- Second stage output (Transform) -> pkl: some natural language processing is performed with the NLTK library, and the result is stored in a consolidated pickle file: output/transform.pkl
- Third stage output (Load) -> sqlite: the information from the previous stages is stored and updated in a general database: output/newspaper.db
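
For orientation, here is a minimal sketch of how the three stages above might fit together. The function names, column names, and table schema are assumptions for illustration only, not the project's actual code:

```python
# Illustrative sketch of the Extract -> Transform -> Load flow; names are placeholders.
import os
import sqlite3
from datetime import date

import nltk
import pandas as pd

nltk.download("punkt", quiet=True)      # tokenizer models used by word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases


def extract(site: str, articles: list[dict]) -> str:
    """Write the day's articles to output/[site]/[day]/__consolidated_news.csv."""
    day_dir = os.path.join("output", site, date.today().isoformat())
    os.makedirs(day_dir, exist_ok=True)
    csv_path = os.path.join(day_dir, "__consolidated_news.csv")
    pd.DataFrame(articles).to_csv(csv_path, index=False)
    return csv_path


def transform(csv_path: str) -> str:
    """Tokenize article bodies with NLTK and pickle the enriched frame."""
    df = pd.read_csv(csv_path)
    df["tokens"] = df["body"].apply(nltk.word_tokenize)
    df["n_tokens"] = df["tokens"].str.len()
    pkl_path = os.path.join("output", "transform.pkl")
    df.to_pickle(pkl_path)
    return pkl_path


def load(pkl_path: str) -> None:
    """Append the transformed articles to the consolidated SQLite database."""
    df = pd.read_pickle(pkl_path).drop(columns=["tokens"])  # lists don't map to a SQL column
    with sqlite3.connect(os.path.join("output", "newspaper.db")) as conn:
        df.to_sql("articles", conn, if_exists="append", index=False)


if __name__ == "__main__":
    path = extract("eluniversal", [{"title": "Example", "body": "Example body text."}])
    load(transform(path))
```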
Note: it is common for news sites to change parts of their HTML structure. If you find that the code is not working for you, check the query expressions defined in config.yaml.
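
The real config.yaml in this project may use different keys, or XPath instead of CSS selectors, so treat the following as a hypothetical sketch of the idea: the selectors live in the config, and only those strings need updating when a site's markup changes. It uses PyYAML, requests, and BeautifulSoup:

```python
# Hypothetical selector-driven scraping; the keys and selectors are placeholders.
import requests
import yaml
from bs4 import BeautifulSoup

CONFIG = yaml.safe_load("""
eluniversal:
  url: https://www.eluniversal.com.mx
  queries:
    homepage_links: "a.story-link"   # CSS selector for article links (assumed)
    article_title: "h1.title"        # CSS selector for the headline (assumed)
""")

site = CONFIG["eluniversal"]
html = requests.get(site["url"], timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# If the site changes its markup, only the selector strings above need updating.
links = [a.get("href") for a in soup.select(site["queries"]["homepage_links"])]
print(f"Found {len(links)} candidate article links")
```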
Run the following command from your terminal:

```
python run_pipeline.py --config_file config.yaml
```
Each stage can also be run individually via news_scraping/[stage]/main.py.
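
For instance, to run only the extract stage (assuming each stage's main.py accepts the same --config_file flag as the pipeline runner, which may not be the case):

```
python news_scraping/extract/main.py --config_file config.yaml
```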
A requirements.txt is included so you can install all the required packages:
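
```
pip install -r requirements.txt
```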