
SpideR

NOTE: A lot of functionality is still missing. This will be rectified eventually.

A web crawler engine that gathers data, sorts it and outputs it to file.
The goal is for it to be fully customizable and extensible, with possibilities for scripting the behaviour of the spiders. The emphasis is on speed and ease of embedding.

Requirements:
libcurl, the C++ wrapper curlpp, and gumbo-parser.
Downloads:
https://curl.haxx.se/libcurl/
http://www.curlpp.org/
https://github.com/google/gumbo-parser
On Arch Linux, curl is available in the core repos, while libcurlpp and gumbo-git can be found in the AUR; see the sketch below.
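For example (a sketch only; the exact AUR package names and the choice of AUR helper are assumptions, so verify them against the AUR before installing):

sudo pacman -S curl
yay -S libcurlpp gumbo-git    # or any other AUR helper; package names may differ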

Installation: Compile the sources. On Linux there is a makefile that should work; a minimal build is sketched below.
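A minimal build and first run might look like this (the name of the produced binary is an assumption; check the makefile for the actual output name):

make
./spider    # binary name is an assumption, see the makefile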

Settings:
Settings are configured through the Settings.json file. An example file is shown after the list below.
The settings currently available are:
textspeed : int - Does nothing at the moment.
depth : int - Determines how many levels of discovered links the crawler follows. Default is 1.
debug : bool - Setting this to 1 sets the verbose flag for the connection, but note that the verbose output currently ends up in the parsing.
type : unchanged|small|firstcapital|fullcapital - Format in which the gathered words are stored.
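A minimal Settings.json might look like the following. The values are purely illustrative, and whether debug expects 0/1 or true/false depends on how the file is parsed, so treat this as a sketch rather than a reference:

{
    "textspeed": 1,
    "depth": 2,
    "debug": 0,
    "type": "small"
}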

Use:
At the moment there are only three commands: help, connect and quit.
connect url - Attempts to connect to the specified site and gathers words and URLs according to the current settings.
Example:

connect www.google.com

Output goes to Output.txt.
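A short session might look like this (console output is omitted and will vary):

help
connect www.google.com
quit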
