News crawler project using Celery, BeautifulSoup4, and mecab-ko.
- Distributer <Celery worker>: orchestrates jobs.
- Harvester <Celery worker>: harvests text content from URLs.
- Extractor <Celery worker>: extracts keywords from articles.
- Aggregator <Celery worker>: aggregates analyzed data.
- Broker <RabbitMQ>: Celery message broker.
- Result Backend <Redis>: Celery task results backend.
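The broker/backend split above is ordinary Celery configuration. A minimal `celeryconfig.py` sketch might look like the following; the hostnames and ports are assumptions (typical docker-compose service names), not values taken from this project.

```python
# celeryconfig.py -- sketch of wiring the RabbitMQ broker and Redis
# result backend described above. Hostnames/ports are assumptions.
broker_url = "amqp://guest:guest@rabbitmq:5672//"
result_backend = "redis://redis:6379/0"

# JSON keeps task payloads portable across worker types.
task_serializer = "json"
result_serializer = "json"
accept_content = ["json"]
```

A Celery app would load this with `app.config_from_object("celeryconfig")`.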
- Harvest<*>: gather parsed content from URLs.
- Distribute chain: demultiplex result_items to each chain task.
- Extract<*>: extract nouns from news articles.
- Aggregate words: aggregate and make a BoW (bag of words) of articles.
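The Extract and Aggregate steps can be sketched without a running tagger. The snippet below filters noun tokens out of tagger output that has already been split into (surface, POS) pairs, then counts them into a bag of words. The `NN*` tag prefix matches how mecab-ko-dic labels nouns (NNG, NNP, ...); the sample tokens and helper names are illustrative, not this project's actual API.

```python
from collections import Counter

def extract_nouns(tokens):
    # Keep surfaces whose POS tag starts with "NN" (mecab-ko-dic nouns).
    return [surface for surface, pos in tokens if pos.startswith("NN")]

def aggregate_words(noun_lists):
    # Bag-of-words over all articles' noun lists.
    bow = Counter()
    for nouns in noun_lists:
        bow.update(nouns)
    return bow

# (surface, POS) pairs as they might come out of a mecab-ko parse.
article1 = [("뉴스", "NNG"), ("를", "JKO"), ("기자", "NNG")]
article2 = [("뉴스", "NNG"), ("속보", "NNG")]
bow = aggregate_words([extract_nouns(article1), extract_nouns(article2)])
# bow["뉴스"] == 2; the particle "를" (JKO) is dropped.
```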
harvest<link> --> distribute_chain -+
                                    |
  +---------------------------------+
  |
  +-> {harvest<content> -> extract<nouns>} -+
  +-> {harvest<content> -> extract<nouns>} -+
  :                                         |
      +-------------------------------------+
      |
      +-> aggregate_words
- python 3.6.5
- apt package dependencies (apt-pkgs.txt)
- pip requirements (requirements.txt)
# start containers
$ docker-compose up -d
# execute test workflow
$ python main.py
✔ Wait for workflow group tasks.. - Done (1 tasks / 2.12s)
✔ Wait for Chain tasks ready.. - Done (10 tasks / 0.00s)
⠙ Wait for Terminal tasks ready.. - {'PENDING': 500}
...
- :15672 : RabbitMQ management
- :5555 : Celery Flower
# run lint
$ docker-compose -f docker-compose.yml -f docker-compose.test.yml up lint
# run unittest
$ docker-compose -f docker-compose.yml -f docker-compose.test.yml up pytest
# lint
$ flake8
# unittest
$ pytest