Onigumo

About

Onigumo is yet another web-crawler. It “crawls” websites or webapps, storing their data in a structured form suitable for further machine processing.

Architecture

Onigumo is composed of three sequentially interconnected components: the Operator, the Downloader, and the Parser.

The flowchart below illustrates the flow of data between those parts:

```mermaid
flowchart LR
    start([START])               -->         onigumo_operator[OPERATOR]
    onigumo_operator   -- <hash>.urls ---> onigumo_downloader[DOWNLOADER]
    onigumo_downloader -- <hash>.raw  ---> onigumo_parser[PARSER]
    onigumo_parser     -- <hash>.json ---> onigumo_operator

    onigumo_operator          <-.->        spider_operator[OPERATOR]
    onigumo_parser            <-.->        spider_parser[PARSER]

    onigumo_operator           -->         spider_materialization[MATERIALIZER]

    subgraph "Onigumo (kernel)"
        onigumo_operator
        onigumo_downloader
        onigumo_parser
    end

    subgraph "Spider (application)"
        spider_operator
        spider_parser
        spider_materialization
    end
```

Operator

The Operator determines which URL addresses the Downloader should fetch. A Spider supplies those URLs, extracting them from the structured data produced by the Parser.

The Operator’s job is to:

  1. initialize a Spider,
  2. extract new URLs from structured data,
  3. insert those URLs into the Downloader queue.
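The steps above can be sketched as follows. This is a minimal Python sketch, not Onigumo's actual implementation: the directory layout (`<hash>.json` structured files, `<hash>.urls` queue entries, per the flowchart) is taken from the diagram, while the function names, the `"urls"` key, and the SHA-256 hashing are illustrative assumptions:

```python
import hashlib
import json
from pathlib import Path

def url_hash(url: str) -> str:
    """Stable identifier for a URL, used as the file-name stem (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def operate(structured_dir: Path, queue_dir: Path) -> int:
    """Extract new URLs from parsed <hash>.json files and enqueue them
    as <hash>.urls files for the Downloader. Returns how many were enqueued."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    enqueued = 0
    for parsed in structured_dir.glob("*.json"):
        data = json.loads(parsed.read_text(encoding="utf-8"))
        for url in data.get("urls", []):          # step 2: extract new URLs
            target = queue_dir / f"{url_hash(url)}.urls"
            if not target.exists():               # skip URLs already queued
                target.write_text(url, encoding="utf-8")  # step 3: enqueue
                enqueued += 1
    return enqueued
```

Keying the queue files by a hash of the URL makes re-enqueueing the same URL a cheap no-op, which keeps the loop idempotent.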

Downloader

The Downloader fetches and saves the contents and metadata from the unprocessed URL addresses.

The Downloader’s job is to:

  1. read URLs for download,
  2. skip URLs that have already been downloaded,
  3. fetch each URL's contents along with its metadata,
  4. save the downloaded data.
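A rough Python sketch of that loop follows. The `<hash>.urls` and `<hash>.raw` file names come from the flowchart; everything else — the `download` function, the side-by-side `.meta.json` metadata file, and the use of `urllib` — is an assumption made for illustration:

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlopen

def download(queue_dir: Path, raw_dir: Path) -> int:
    """Fetch every queued <hash>.urls entry that has no matching
    <hash>.raw yet; save the body and minimal metadata side by side."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    fetched = 0
    for entry in queue_dir.glob("*.urls"):        # step 1: read URLs
        raw_path = raw_dir / f"{entry.stem}.raw"
        if raw_path.exists():                     # step 2: skip downloaded
            continue
        url = entry.read_text(encoding="utf-8").strip()
        with urlopen(url) as response:            # step 3: fetch + metadata
            body = response.read()
            meta = {"url": url,
                    "status": getattr(response, "status", None),
                    "headers": dict(response.headers)}
        raw_path.write_bytes(body)                # step 4: save the data
        (raw_dir / f"{entry.stem}.meta.json").write_text(
            json.dumps(meta), encoding="utf-8")
        fetched += 1
    return fetched
```

Because the output file name is derived from the queue entry's hash, rerunning the Downloader after a crash naturally resumes where it left off.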

Parser

The Parser processes the downloaded contents and metadata into a structured form.

The Parser's job is to:

  1. check which downloaded URLs are ready for processing,
  2. process the contents and metadata of the downloaded URLs into a structured form,
  3. save the structured data.
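As a sketch of those steps in Python: the `<hash>.raw` input and `<hash>.json` output names follow the flowchart, while the choice of extracting outgoing links (so the Operator can enqueue them) and the `parse` function itself are assumptions for this example:

```python
import json
from pathlib import Path
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def parse(raw_dir: Path, structured_dir: Path) -> int:
    """Turn each <hash>.raw file without a matching <hash>.json into
    structured data (here: the page's outgoing links)."""
    structured_dir.mkdir(parents=True, exist_ok=True)
    parsed = 0
    for raw in raw_dir.glob("*.raw"):
        out = structured_dir / f"{raw.stem}.json"
        if out.exists():                  # step 1: skip already-parsed files
            continue
        extractor = LinkExtractor()       # step 2: content -> structured form
        extractor.feed(raw.read_text(encoding="utf-8", errors="replace"))
        out.write_text(json.dumps({"urls": extractor.links}),
                       encoding="utf-8")  # step 3: save structured data
        parsed += 1
    return parsed
```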

Applications (spiders)

A spider extracts the needed information from the structured form of the data.

The nature of the output data and information depends on the user's needs as well as on the shape of the web content. It is impossible to build a universal spider that satisfies every requirement arising from the combination of the two. For this reason, you have to write your own spider.
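As a rough illustration of what a user-written spider might look like — the class name, the `start_urls` attribute, and the `extract` hook below are invented for this sketch and are not Onigumo's actual spider API:

```python
class JobAdSpider:
    """A hypothetical spider that pulls job titles out of the
    structured data produced by the Parser."""

    # Seed URLs this spider would hand to the Operator (assumed mechanism).
    start_urls = ["http://example.com/jobs"]

    def extract(self, structured: dict) -> list:
        """Keep only the information this particular application cares about."""
        return [item["title"] for item in structured.get("jobs", [])]
```

The kernel stays generic; only the `extract` logic (and the seed URLs) changes from one application to the next.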

Materializer

Usage

Credits

© Glutexo, nappex 2019 – 2022

Licensed under the MIT license.
