Decentralized Crawler - GoogleMap Cafe info

Description

A decentralized crawler that collects cafe data (shop information) from Google Maps.

Motivation

I developed this program because I am responsible for crawling data from the internet. It is part of a side project currently called "Coffee Google".
The "Coffee Google" project needs to integrate data from different sources, so it has to crawl data from the internet. However, a typical crawler is too slow at fetching the target data, especially one built with the Selenium framework.

Skills

This project is divided into 2 parts. One is the crawler; the other is the remaining logic (such as the relationships among multiple actors, the message-sending mechanism, and the Kafka producer and consumer).

Environment

For Developing

OS: macOS (Current Version: 10.14.5)

For Running

OS: macOS (Current Version: 10.14.5), Windows (Current Version: Windows 10)

Crawler

Language: Python
Version: 3.7 or later
Framework: Requests, Selenium

All other logic implementations

Language: Scala
Version: 2.12
Framework: Spark (version: 2.4.5), AKKA (version: 2.4.20)
Message Queue Server: Kafka (version: 2.5.0)
Database: Cassandra (Datastax driver-core version: 3.6.0, Spark connector version: 2.5.0)

AKKA Actors Tree Relationship

Everything in this project depends on one thing: AKKA, which turns the crawler code into a decentralized system. However, the project currently runs only in stand-alone mode. Below is the AKKA actor tree architecture of this project.

The actor tree looks like this:
Master (King) ---> Worker Leaders (Paladins) ---> Workers (Soldiers)

Master (King)
The boss of all actors. It builds and manages the worker leaders, and it loads the pre-data during the actor's pre-start phase.
- Import the pre-data.
- Build the worker leaders and send them a task so that they stand by (waiting for the pre-data).
Worker Leaders (Paladins)
Build workers and distribute jobs to them. There are 3 kinds of worker leaders:
- Crawl Paladin
  Build and manage crawl soldiers. This project divides the target crawl content into 4 parts, and 4 different paladins are each responsible for one of them.
  - Cafe basic information
  - Cafe services
  - Cafe comments
  - Cafe images
- Pre-Data Producer Paladin
  Receive the pre-data from the King (master) and distribute it to the workers.
- Data-Saver Paladin
  Receive all data from the crawl soldiers and save it to the database or to files.
  This Paladin is independent: it has no Soldiers built under it, because its main job is to integrate all the data sent by all "Crawl Soldiers" at any time (immediately). In other words, it guarantees that the multiple actors share a single connector session to the database.
Workers (Soldiers)
Receive tasks and do the actual work. There are 3 kinds of workers:
- Crawl Soldier
  Receive tasks (from Search Soldiers) and crawl data.
- Pre-Data Producer Soldier
  Receive a task together with pre-data and write (produce) it to the Kafka broker.
  This soldier is a Kafka producer.
- Search Soldier
  Activated by the King first, it keeps sniffing (consuming) the target Kafka topic. When it gets a message, the Search Soldier sends it to every soldier responsible for a different content part. Otherwise, it keeps listening.
  This soldier is a Kafka consumer.
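
The King, Paladin and Soldier hierarchy described above can be sketched in a few lines. This is only a minimal, synchronous illustration in plain Python, not the project's AKKA code: the class names, the `distribute` and `run` methods, the round-robin policy, and the fake "crawl" job are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# The 4 content parts named in this README.
CONTENT_PARTS = ["basic information", "services", "comments", "images"]


class Soldier:
    """Worker: performs one simple job on each task it receives."""

    def __init__(self, name: str, job: Callable[[str], str]) -> None:
        self.name = name
        self.job = job

    def receive(self, task: str) -> str:
        return self.job(task)


class Paladin:
    """Worker leader: builds soldiers and distributes tasks among them."""

    def __init__(self, part: str, soldier_count: int) -> None:
        self.part = part
        # Hypothetical job: each soldier only "crawls" this paladin's part.
        self.soldiers = [
            Soldier(f"{part}-soldier-{i}", lambda cafe, p=part: f"{p} of {cafe}")
            for i in range(soldier_count)
        ]

    def distribute(self, tasks: List[str]) -> List[str]:
        # Round-robin (an assumption): task i goes to soldier i mod count.
        return [
            self.soldiers[i % len(self.soldiers)].receive(task)
            for i, task in enumerate(tasks)
        ]


class King:
    """Master: holds the pre-data and builds one paladin per content part."""

    def __init__(self, pre_data: List[str], soldiers_per_paladin: int = 2) -> None:
        self.pre_data = pre_data
        self.paladins: Dict[str, Paladin] = {
            part: Paladin(part, soldiers_per_paladin) for part in CONTENT_PARTS
        }

    def run(self) -> Dict[str, List[str]]:
        return {
            part: paladin.distribute(self.pre_data)
            for part, paladin in self.paladins.items()
        }


results = King(["cafe-a", "cafe-b", "cafe-c"]).run()
```

In a real actor system each `receive` call would be an asynchronous message rather than a direct method call; the sketch only shows who builds whom and who hands work to whom.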

AKKA with Kafka Relationship

Kafka is one of the greatest software products! The Kafka broker plays a very important role in this project: the "Distributor". The relationship between AKKA and Kafka is described below.

The project makes the King the manager of all actors, and the Paladins both managers and "Distributors" (a Paladin is also a distributor, but a slightly different one from Kafka; more on this later). Some soldiers built by the Paladins take on Kafka roles: the "Pre-Data Soldiers" are Kafka producers and the "Search Soldier" is a Kafka consumer.
First, the "Search Soldier" keeps consuming messages even when the topic is empty. When it gets something, it immediately sends it to the "Crawl Soldier" as an AKKA actor message. The "Pre-Data Soldier", for its part, receives the task content (pre-data) and produces it to the target topic that the "Search Soldier" is sniffing, until the data is exhausted.
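
The message flow above (Pre-Data Soldier produces, Search Soldier consumes and forwards, Crawl Soldier crawls) can be illustrated with a small sketch. It deliberately uses Python's standard `queue.Queue` as a stand-in for both the Kafka topic and the actor mailbox, so no Kafka or AKKA API is involved. The function names and the `STOP` sentinel are assumptions added so that the example terminates; the README describes a consumer that keeps listening.

```python
import queue
import threading

topic = queue.Queue()        # stand-in for the Kafka topic
crawl_inbox = queue.Queue()  # stand-in for the Crawl Soldier's mailbox
STOP = object()              # sentinel (assumption) so the sketch can end


def pre_data_soldier(pre_data):
    """Producer: write every pre-data item to the topic, then signal the end."""
    for item in pre_data:
        topic.put(item)
    topic.put(STOP)


def search_soldier():
    """Consumer: wait on the topic and forward each message immediately."""
    while True:
        message = topic.get()  # blocks while the topic is empty
        crawl_inbox.put(message)
        if message is STOP:
            break


def crawl_soldier(crawled):
    """Crawler: handle each forwarded message (here it only records it)."""
    while True:
        message = crawl_inbox.get()
        if message is STOP:
            break
        crawled.append(f"crawled {message}")


crawled = []
threads = [
    threading.Thread(target=search_soldier),
    threading.Thread(target=crawl_soldier, args=(crawled,)),
    threading.Thread(target=pre_data_soldier, args=(["cafe-a", "cafe-b"],)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The consumer is started before the producer on purpose, mirroring the description that the Search Soldier is already listening when the pre-data arrives.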

More on the "Distributor"

For Kafka, "Distributor" means distributing messages to Kafka consumers. For the AKKA Paladin role, it means either distributing data to multiple Soldier actors (Pre-Data Paladin) or building multiple AKKA actors to do the work (all Paladins except the Data-Saver Paladin). What gets distributed differs between the two.
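
As an illustration of the Paladin side of this distinction, here is one way a Pre-Data Paladin could split its data among several soldiers. The README does not say how the split is actually done, so the helper name `split_for_soldiers` and the deal-out strategy are purely hypothetical.

```python
from typing import List, Sequence


def split_for_soldiers(pre_data: Sequence[str], soldier_count: int) -> List[List[str]]:
    """Deal items out one by one so chunk sizes differ by at most one."""
    if soldier_count < 1:
        raise ValueError("soldier_count must be at least 1")
    # Soldier i receives items i, i + soldier_count, i + 2 * soldier_count, ...
    return [list(pre_data[i::soldier_count]) for i in range(soldier_count)]


chunks = split_for_soldiers(["a", "b", "c", "d", "e"], 2)
```

Each chunk would then be handed to one Pre-Data Soldier, which produces its items to the topic.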

Benefits

Re-Balance

Kafka automatically distributes the partitions of a topic among the consumers in a consumer group. Developers can run as many consumers as they want for the target messages without doing anything else.
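
To make the idea concrete, the sketch below recomputes a partition assignment when a consumer joins. It is a simplified round-robin model written for this README, not Kafka's own assignor code; the function name `assign_partitions` and the consumer names are assumptions.

```python
from typing import Dict, List


def assign_partitions(partition_count: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partition numbers over the consumers in round-robin order."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    if not members:
        return assignment
    for partition in range(partition_count):
        assignment[members[partition % len(members)]].append(partition)
    return assignment


# Two Search Soldiers share 4 partitions...
before = assign_partitions(4, ["search-1", "search-2"])
# ...and when a third one joins, the partitions are spread again.
after = assign_partitions(4, ["search-1", "search-2", "search-3"])
```

Every partition has exactly one owner both before and after the change, which is the property the crawler relies on.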

Data is Unique

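Within a consumer group, each partition is assigned to only one consumer at a time, so the same message is not handed to two consumers of that group. This is one of the most important features of Kafka, and it greatly reduces development complexity.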

Simpler, Single-Purpose Development

In this project, every type of Soldier (worker) is responsible for ONE SIMPLE job. For example, the "Search Soldier" (Kafka consumer) sniffs all messages of the target topic, and the "Pre-Data Soldier" (Kafka producer) produces the messages (crawl pre-data) assigned to it by its Paladin. The "Crawl Soldier" can concentrate on crawling data because the "Search Soldier" sniffs the pre-data messages and sends them to it. Each of them has a SIMPLE and CLEAR job to do.

Crawler Flow Chart

Here is the flow chart of the GoogleMap cafe crawler. It helps developers understand how the whole program runs and how the software is architected.
