Crawl data from Shopee and visualize it
- Create a `db-config.yml` in `dags/config`:
  ```yaml
  mongo:
    admin:
      url:
      username:
      password:
    read_only:
      url:
      username:
      password:
    read_and_write:
      url:
      username:
      password:
  postgre:
    admin:
      host:
      database:
      user:
      password:
      port:
      url:
  ```
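These fields are placeholders to fill in with your own credentials. As a rough illustration of how they could be consumed, a script might load them like this (a sketch only: the helper names and connection calls below are assumptions, not the repo's actual code; assumes PyYAML, pymongo, and psycopg2 are installed):

```python
# Hypothetical helper: load dags/config/db-config.yml and open the two
# database connections the pipeline uses. Not the repo's actual code.
import yaml
import psycopg2
from pymongo import MongoClient

def load_db_config(path="dags/config/db-config.yml"):
    with open(path) as f:
        return yaml.safe_load(f)

def connect_mongo(cfg, role="read_and_write"):
    # Each mongo role block (admin / read_only / read_and_write) carries
    # its own url, username, and password.
    creds = cfg["mongo"][role]
    return MongoClient(creds["url"], username=creds["username"],
                       password=creds["password"])

def connect_postgres(cfg):
    # The postgre.admin block maps directly onto psycopg2.connect arguments.
    pg = cfg["postgre"]["admin"]
    return psycopg2.connect(host=pg["host"], dbname=pg["database"],
                            user=pg["user"], password=pg["password"],
                            port=pg["port"])
```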
- Modify `config.yml` or `config-with-ariflow.yml` in `dags/config` for the crawler:
  ```yaml
  links: # one link per category to crawl
    - https://shopee.vn/dien-thoai-phu-kien-cat.84
    - https://shopee.vn/Th%E1%BB%9Di-Trang-Nam-cat.78
    - https://shopee.vn/M%C3%A1y-t%C3%ADnh-Laptop-cat.13030
  pages: 80 # number of pages to crawl in each category
  ```
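The `links`/`pages` pair above is all the crawler needs to enumerate listing pages. A rough sketch of that loop (the function name and the zero-based `?page=` query parameter are assumptions, not taken from the repo):

```python
import yaml

def iter_listing_pages(config_path="dags/config/config.yml"):
    # Yield one URL per (category, page) combination from the crawler config.
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    for link in cfg["links"]:
        for page in range(cfg["pages"]):
            # Assumption: Shopee category listings paginate via a
            # zero-based ?page= query parameter.
            yield f"{link}?page={page}"
```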
- Use one of the two ways below: Docker with Airflow, or locally with a thread pool.
- Create a `.env` in the base directory containing Airflow's config (an example with typical values follows the template):
  ```env
  AIRFLOW_IMAGE_NAME=
  AIRFLOW__CORE__EXECUTOR=CeleryExecutor
  # Database
  AIRFLOW__CORE__SQL_ALCHEMY_CONN=
  AIRFLOW__CELERY__RESULT_BACKEND=
  AIRFLOW__CELERY__BROKER_URL=
  # Airflow user/group IDs
  AIRFLOW_UID=
  AIRFLOW_GID=
  # Airflow webserver account
  _AIRFLOW_WWW_USER_USERNAME=
  _AIRFLOW_WWW_USER_PASSWORD=
  # Airflow init flags
  _AIRFLOW_DB_UPGRADE=
  _AIRFLOW_WWW_USER_CREATE=
  # Postgres config
  POSTGRES_USER=
  POSTGRES_PASSWORD=
  POSTGRES_DB=
  ```
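As a reference point, the stock Airflow docker-compose setup commonly uses values along these lines (the image tag and credentials here are placeholders; adjust them to your environment):

```env
AIRFLOW_IMAGE_NAME=apache/airflow:2.2.3
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
AIRFLOW_UID=50000
AIRFLOW_GID=0
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
_AIRFLOW_DB_UPGRADE=true
_AIRFLOW_WWW_USER_CREATE=true
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
```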
- Create the database for the scheduler:
  ```bash
  docker compose up [-d] airflow-init
  ```
- Start all built-in services (postgres, redis, flower, webserver, scheduler, worker):
  ```bash
  docker compose up [-d]
  ```
*Note:*
  - Add `-d` to the command to run it in the background.
  - The stack runs quite a few services, so a minimum of 8 GB of RAM is required.
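  - With the stock Airflow compose file, the webserver is usually exposed at http://localhost:8080 and Flower at http://localhost:5555 (an assumption if the compose file was customized); `docker compose ps` shows each service's status.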
- Install all needed packages:
  ```bash
  pip install -r requirements.txt  # use pip3 if that is your Python 3 alias
  ```
- Start ETL 1:
  ```bash
  python local.py --etl 1  # use python3 if needed
  ```
- Start ETL 2:
  ```bash
  python local.py --etl 2  # use python3 if needed
  ```
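`local.py` is the entry point for the thread-pool mode; a plausible sketch of how its `--etl` flag might dispatch between the two stages (the stage functions and pool size below are hypothetical, not the repo's actual code):

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

def run_etl1():
    # Hypothetical stage 1: crawl the configured Shopee categories
    # concurrently with a thread pool and stage the raw items.
    with ThreadPoolExecutor(max_workers=8) as pool:
        ...  # e.g. pool.submit(crawl_category, link, pages) per link

def run_etl2():
    # Hypothetical stage 2: transform the staged items and load them
    # into the database for visualization.
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run one ETL stage locally")
    parser.add_argument("--etl", type=int, choices=[1, 2], required=True,
                        help="which ETL stage to run")
    args = parser.parse_args()
    (run_etl1 if args.etl == 1 else run_etl2)()
```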