## Table of Contents

- [About the Project](#about-the-project)
- [Getting Started](#getting-started)
- [Endpoints](#endpoints)
- [License](#license)
- [Contact](#contact)
## About the Project

A Python application that crawls a dynamic web page whose content is generated by asynchronous JavaScript calls after the page loads. The Scrapy spider drives a Selenium WebDriver internally to handle those asynchronous JS calls (see the sketch after the list below) and covers:
- collecting NBS articles across multiple pages;
- validating the collected data;
- saving the collected, valid articles to an SQLite database;
- exposing the collected data through endpoints built with the FastAPI framework.
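As a rough illustration of how a Scrapy spider can drive Selenium internally, here is a minimal sketch. The start URL, CSS selectors, and wait condition are placeholders, not the project's actual values:

```python
# Sketch of a Scrapy spider that renders a JS-heavy page with Selenium.
# The URL and selectors below are hypothetical; only the spider name
# "article" comes from the project (scrapy crawl article).
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = ["https://example.com/press/articles"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def parse(self, response):
        # Open the page in a real browser so the asynchronous JavaScript
        # calls can fire and render the article list.
        self.driver.get(response.url)
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".article"))
        )
        # Hand the fully rendered DOM back to Scrapy's selectors.
        rendered = scrapy.Selector(text=self.driver.page_source)
        for node in rendered.css(".article"):
            yield {
                "title": node.css("h2::text").get(),
                "date": node.css(".date::text").get(),
            }

    def closed(self, reason):
        self.driver.quit()
```

Waiting for a concrete element to appear, rather than sleeping for a fixed interval, keeps the crawl robust when the asynchronous calls take a variable amount of time.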
## Getting Started

- Clone the repo
  ```sh
  git clone https://github.com/TanyaAng/Articles_API.git
  ```
- Install the Python dependencies
  ```sh
  pip install -r requirements.txt
  ```
- Make nbs_articles your working directory
- Run the spider from the terminal (serving the API afterwards is sketched below):
  ```sh
  (venv) ..\nbs_articles> scrapy crawl article
  ```
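Once the spider has populated the database, the API can be served locally. Assuming the FastAPI app object is defined in main.py (the actual module name may differ):

```sh
(venv) ..\nbs_articles> uvicorn main:app --reload
```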
## Endpoints

| Endpoint | HTTP Method | Description |
| --- | --- | --- |
| /articles/ | GET | get all crawled articles and their properties |
| /articles/?label={label} | GET | get the list of articles with the given label |
| /articles/?date={date} | GET | get the list of articles published on the given date |
| /article/{article_id} | GET | get a single article |
| /article/{article_id} | DELETE | delete a single article |
| /article/{article_id} | PUT | update a single article |
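For illustration, here is a minimal sketch of how the filtered GET /articles/ endpoint from the table above could be implemented; the database file, table name, and column names are assumptions:

```python
# Sketch of the GET /articles/ endpoint with optional label/date filters.
# DB_PATH, the "articles" table, and its columns are hypothetical names.
import sqlite3
from typing import Optional

from fastapi import FastAPI

app = FastAPI()
DB_PATH = "articles.db"  # assumed database file


@app.get("/articles/")
def list_articles(label: Optional[str] = None, date: Optional[str] = None):
    query = "SELECT id, title, label, date FROM articles"
    filters = []
    params = []
    if label is not None:
        filters.append("label = ?")
        params.append(label)
    if date is not None:
        filters.append("date = ?")
        params.append(date)
    if filters:
        query += " WHERE " + " AND ".join(filters)
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row  # return rows as dict-like objects
        rows = conn.execute(query, params).fetchall()
    return [dict(row) for row in rows]
```

Building the WHERE clause from parameterized fragments keeps the optional label and date filters composable without exposing the query to SQL injection.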
## License

MIT License
## Contact

Tanya Angelova - LinkedIn - t.j.angelova@gmail.com

Project Link: https://github.com/TanyaAng/Articles_API