Web scraping for "El universal" news
A web scraping pipeline for different news sites. It retrieves the latest news from each site and stores it in a new folder in your working directory.
- First stage output (Extract) -> csv: each day's news is stored in a new .csv file, inside a separate folder with the following structure: output/[site]/[day]/__consolidated_news.csv
- Second stage output (Transform) -> pkl: some natural language processing is performed with the NLTK library, and the result is stored in a consolidated pickle file: output/transform.pkl
- Third stage output (Load) -> sqlite: the information from the previous stages is stored and updated in a general database: output/newspaper.db
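
For orientation, here is a minimal sketch of how the three stages above might fit together. The function names, column names, and table schema are assumptions for illustration only, not the project's actual code:

```python
# Illustrative sketch of the Extract -> Transform -> Load flow; names are placeholders.
import os
import sqlite3
from datetime import date

import nltk
import pandas as pd

nltk.download("punkt", quiet=True)      # tokenizer models used by word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases


def extract(site: str, articles: list[dict]) -> str:
    """Write the day's articles to output/[site]/[day]/__consolidated_news.csv."""
    day_dir = os.path.join("output", site, date.today().isoformat())
    os.makedirs(day_dir, exist_ok=True)
    csv_path = os.path.join(day_dir, "__consolidated_news.csv")
    pd.DataFrame(articles).to_csv(csv_path, index=False)
    return csv_path


def transform(csv_path: str) -> str:
    """Tokenize article bodies with NLTK and pickle the enriched frame."""
    df = pd.read_csv(csv_path)
    df["tokens"] = df["body"].apply(nltk.word_tokenize)
    df["n_tokens"] = df["tokens"].str.len()
    pkl_path = os.path.join("output", "transform.pkl")
    df.to_pickle(pkl_path)
    return pkl_path


def load(pkl_path: str) -> None:
    """Append the transformed articles to the consolidated SQLite database."""
    df = pd.read_pickle(pkl_path).drop(columns=["tokens"])  # lists don't map to a SQL column
    with sqlite3.connect(os.path.join("output", "newspaper.db")) as conn:
        df.to_sql("articles", conn, if_exists="append", index=False)


if __name__ == "__main__":
    path = extract("eluniversal", [{"title": "Example", "body": "Example body text."}])
    load(transform(path))
```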
Note: it is common for news sites to change parts of their HTML structure. If you find that the code is not working for you, check the query expressions defined in config.yaml.
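
The real config.yaml in this project may use different keys, or XPath instead of CSS selectors, so treat the following as a hypothetical sketch of the idea: the selectors live in the config, and only those strings need updating when a site's markup changes. It uses PyYAML, requests, and BeautifulSoup:

```python
# Hypothetical selector-driven scraping; the keys and selectors are placeholders.
import requests
import yaml
from bs4 import BeautifulSoup

CONFIG = yaml.safe_load("""
eluniversal:
  url: https://www.eluniversal.com.mx
  queries:
    homepage_links: "a.story-link"   # CSS selector for article links (assumed)
    article_title: "h1.title"        # CSS selector for the headline (assumed)
""")

site = CONFIG["eluniversal"]
html = requests.get(site["url"], timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# If the site changes its markup, only the selector strings above need updating.
links = [a.get("href") for a in soup.select(site["queries"]["homepage_links"])]
print(f"Found {len(links)} candidate article links")
```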
Run the following command from your terminal:

```
python run_pipeline.py --config_file config.yaml
```
Each stage can also be run individually via news_scraping/[stage]/main.py.
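
For instance, to run only the extract stage (assuming each stage's main.py accepts the same --config_file flag as the pipeline runner, which may not be the case):

```
python news_scraping/extract/main.py --config_file config.yaml
```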
A requirements.txt is included so you can install all the required packages:
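
```
pip install -r requirements.txt
```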