Mermoz ✈️

Web crawler for multi-threaded computers.

📢 Dear webmasters, did we crawl your website? All the information you need is in the Webmasters section below 📢

Build

  • First, clone this repository, which already contains the required urlfactory dependency:
$ git clone https://github.com/QwantResearch/mermoz.git
  • Go to the root directory of Mermoz and check that you have all the required dependencies.

  • Finally, compile the code:

$ make

Launch

After running make, the binary is located in build/:

$ cd build
$ ./mermoz --settings file --seeds file

The settings file has the following format:

fetchers [num fetchers]
parsers [num parsers]
user-agent [user-agent]
max-ram [GB]

The seeds file lists the starting URLs, one per line:

url1
[urls...]
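
As an illustration of the two formats above, the following sketch writes a sample settings file and a sample seeds file. The file names (settings.txt, seeds.txt) and every value are assumptions chosen for the example, not recommended defaults. In particular, the user agent is kept free of spaces because this README does not say whether the settings parser accepts spaces in values.

```shell
# Sample settings file: one "key value" pair per line.
# All values below are illustrative assumptions.
cat > settings.txt <<'EOF'
fetchers 4
parsers 4
user-agent ExampleBot/0.1
max-ram 2
EOF

# Sample seeds file: one URL per line (placeholder domains).
cat > seeds.txt <<'EOF'
https://example.com/
https://example.org/
EOF

# The crawler could then be launched from build/ with:
#   ./mermoz --settings settings.txt --seeds seeds.txt
```

Adjust the number of fetchers and parsers to the number of hardware threads available, and max-ram (in GB) to the memory you are willing to give the crawler.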

Webmasters 💻

If you see us crawling your website, we announce ourselves with the following user agent:

Mozilla/5.0 (compatible; Qwantify/Mermoz/0.1; +https://www.qwant.com/; +https://www.github.com/QwantResearch/mermoz)

Our requests come from IP addresses of the form 194.187.171.xxx.
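
To check whether Mermoz visited your site, you can search your server's access log for the user agent token. This is a minimal sketch: the function names are hypothetical, and the second function assumes that each log line starts with the client IP, as in the common and combined log formats.

```shell
# Count log lines whose user agent mentions Mermoz.
# $1 is the path to an access log.
mermoz_hits() {
  grep -c 'Qwantify/Mermoz' "$1"
}

# Count the Mermoz lines that also start with the announced IP prefix.
# Assumes the client IP is the first field on each line.
mermoz_hits_from_range() {
  grep 'Qwantify/Mermoz' "$1" | grep -c '^194\.187\.171\.'
}

# Example usage (the log path is an assumption, adapt it to your server):
#   mermoz_hits /var/log/nginx/access.log
#   mermoz_hits_from_range /var/log/nginx/access.log
```

A request that carries the user agent but does not come from the announced address range is probably not ours.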

⚠️ Something wrong? ⚠️

If our crawler caused a problem when we visited your website, we sincerely apologize for the inconvenience. Mermoz is a research project still under development, and it has plenty of room for improvement.

Please tell us what went wrong by sending a message to: mermoz [at] qwantresearch [dot] com

Dependencies

This list is more or less like a memo:

Contributing

Please read CONTRIBUTING.md first, then propose your changes. You can also fix issues or add the features listed in TODO.md.

For any questions, comments, or collaborations, please use: mermoz [at] qwantresearch [dot] com.
