Web crawler for multi-threaded computers.
- First, clone the current repository, which already contains
urlfactory, a required dependency:
$ git clone https://github.com/QwantResearch/mermoz.git
Go to the root directory of
Mermoz and check that you have all the needed dependencies.
Finally, compiling the code is really easy:
After running make, the binary is located within the build directory:
$ cd build
$ ./mermoz --settings file --seeds file
The settings file has the following format:
fetchers [num fetchers]
parsers [num parsers]
user-agent [user-agent]
max-ram [GB]
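To make the format concrete, here is a minimal Python sketch that checks a settings file before it is handed to the crawler. It is not part of Mermoz. It assumes one whitespace-separated key/value pair per line and integer values for fetchers, parsers and max-ram; the example values and the parse_settings helper are hypothetical, and only the four key names come from the format above.

```python
# Hypothetical example values; only the four key names come from the README.
EXAMPLE_SETTINGS = """\
fetchers 4
parsers 2
user-agent Mozilla/5.0 (compatible; ExampleBot/0.1)
max-ram 8
"""

EXPECTED_KEYS = {"fetchers", "parsers", "user-agent", "max-ram"}


def parse_settings(text):
    """Parse 'key value' lines into a dict (assumed layout: one pair per line)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so a user-agent
        # that contains spaces stays in one piece.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"missing value for key: {parts[0]!r}")
        key, value = parts
        if key not in EXPECTED_KEYS:
            raise ValueError(f"unknown key: {key!r}")
        settings[key] = value
    missing = EXPECTED_KEYS - settings.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumption: the numeric fields are whole numbers (max-ram in GB).
    for key in ("fetchers", "parsers", "max-ram"):
        settings[key] = int(settings[key])
    return settings


if __name__ == "__main__":
    print(parse_settings(EXAMPLE_SETTINGS))
```

If Mermoz accepts a different layout (for example several pairs on one line), the splitting step would need to change accordingly.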
You may see us crawling your website. We announce ourselves as:
Mozilla/5.0 (compatible; Qwantify/Mermoz/0.1; +https://www.qwant.com/; +https://www.github.com/QwantResearch/mermoz)
with the following IPs
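For site operators who want to spot Mermoz in their access logs, a rough Python sketch of a user-agent check follows. The is_mermoz helper and the choice to match on the "Qwantify/Mermoz" token are assumptions made for illustration, not something Mermoz provides. A user-agent string can be spoofed, so a match is a hint rather than proof.

```python
# Token taken from the announced user-agent string; matching on it
# (rather than on the full string) is an assumption so that a version
# bump such as 0.1 -> 0.2 would still match.
MERMOZ_TOKEN = "Qwantify/Mermoz"


def is_mermoz(user_agent):
    """Return True if the User-Agent header value mentions the Mermoz token."""
    if not user_agent:
        return False
    return MERMOZ_TOKEN.lower() in user_agent.lower()


if __name__ == "__main__":
    announced = (
        "Mozilla/5.0 (compatible; Qwantify/Mermoz/0.1; "
        "+https://www.qwant.com/; +https://www.github.com/QwantResearch/mermoz)"
    )
    print(is_mermoz(announced))
    print(is_mermoz("Mozilla/5.0 (X11; Linux x86_64)"))
```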
⚠️ Something wrong? ⚠️
If something went wrong when we visited your website because of our crawler, we are sincerely sorry for the inconvenience. Mermoz is a research project that is still under development and has much room for improvement.
Please tell us what went wrong by sending a message to this address: mermoz [at] qwantresearch [dot] com
This list is more or less like a memo:
- urlfactory: all the needed tools for URLs and
- gumbo: fast and reliable HTML5 parser,
For any questions, comments, or collaborations, please use: mermoz [at] qwantresearch [dot] com.