- The project crawls a webpage and collects as much information as possible, such as external links, email addresses, etc. It works like the web crawlers used by search engines, but is scoped to a specific domain and URL.
- It is a project for WOC.
- reqwest : For making HTTP requests.
- select : A library to extract useful data from HTML documents, suitable for web scraping.
- clap : A command-line argument parser for Rust.
- Tokio : A runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
- Futures : A library providing the foundations for asynchronous programming in Rust.
- Serde : A framework for serializing and deserializing Rust data structures efficiently and generically.
- Mime : Support MIME (HTTP Media Types) as strong types in Rust.
- trust-dns-resolver : A DNS resolver written in Rust.
- thirtyfour : A Selenium / WebDriver library for Rust, for automated website UI testing.
- url : A URL library for Rust.
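To give a feel for how the crawler's core step works, here is a minimal, std-only sketch of extracting links from an HTML page. In the real project, reqwest fetches the page and the select crate parses the HTML robustly; the `extract_links` helper below is a hypothetical stand-in that naively scans for `href="..."` attributes, purely for illustration.

```rust
// Hypothetical sketch of the link-extraction step of a crawler.
// The actual project uses reqwest (fetching) and select (HTML parsing);
// this std-only version just scans for href="..." to show the idea.
fn extract_links(html: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = html;
    // Repeatedly locate the next href attribute and copy out its value.
    while let Some(start) = rest.find("href=\"") {
        rest = &rest[start + 6..];
        match rest.find('"') {
            Some(end) => {
                links.push(rest[..end].to_string());
                rest = &rest[end + 1..];
            }
            None => break, // unterminated attribute: stop scanning
        }
    }
    links
}

fn main() {
    let html = r#"<a href="https://example.com">site</a> <a href="mailto:me@example.com">mail</a>"#;
    for link in extract_links(html) {
        println!("{}", link);
    }
    // prints:
    // https://example.com
    // mailto:me@example.com
}
```

A real crawler would then filter these links against the whitelist/blacklist, deduplicate them, and feed them back into the crawl queue up to the configured depth.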
```
webcrawler 1.0
Ayush Singh <ayushsingh1325@gmail.com>

USAGE:
    webcrawler [FLAGS] [OPTIONS] <url>

ARGS:
    <url>    Seed URL for the crawler

FLAGS:
    -h, --help        Prints help information
        --selenium    Takes screenshots using Selenium; a screenshot is taken
                      when a word from the wordlist is found in the page
        --verbose     Outputs the links to standard output
    -V, --version     Prints version information

OPTIONS:
    -b, --blacklist <blacklist>            Path to a file listing domains not to be crawled
    -d, --depth <depth>                    Numeric depth for the crawl
    -o, --output-folder <output-folder>    Path to the output folder
    -s, --search-words <search-words>      Path to a file listing words to search for in the page
        --task-limit <task-limit>          Limits the number of parallel tasks [default: 1000]
    -t, --timeout <timeout>                Timeout for HTTP requests [default: 10]
    -w, --whitelist <whitelist>            Path to a file listing domains to be crawled
```