
Webcrawler

  • The project crawls a webpage and collects as much information as possible, such as external links, email addresses, etc. It works like the web crawlers used by search engines, but is scoped to a specific domain and URL.

  • It is a project for WOC.

Dependencies

  • reqwest : For making HTTP requests.
  • select : A library to extract useful data from HTML documents, suitable for web scraping.
  • clap : A command-line argument parser for Rust.
  • tokio : A runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
  • futures : A library providing the foundations for asynchronous programming in Rust.
  • serde : A framework for serializing and deserializing Rust data structures efficiently and generically.
  • mime : Support for MIME (HTTP media types) as strong types in Rust.
  • trust-dns-resolver : A DNS resolver written in Rust.
  • thirtyfour : A Selenium / WebDriver library for Rust, used for automated website UI testing and screenshots.
  • url : A URL parsing library for Rust.
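
To illustrate how some of these crates fit together, here is a minimal sketch (an illustrative example, not code from this repository): it fetches a seed page with reqwest, extracts anchor links with select, and resolves them against the seed URL with url, inside a tokio runtime. The seed address is a placeholder, and tokio's "macros" and "rt-multi-thread" features are assumed.

// Hypothetical sketch, not this project's actual crawler code.
use select::document::Document;
use select::predicate::Name;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder seed URL; the real crawler takes this as its <url> argument.
    let seed = Url::parse("https://example.com")?;

    // Fetch the page body over HTTP.
    let body = reqwest::get(seed.as_str()).await?.text().await?;

    // Extract href attributes from anchor tags and resolve relative links
    // against the seed URL.
    for href in Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|node| node.attr("href"))
    {
        if let Ok(link) = seed.join(href) {
            println!("{}", link);
        }
    }
    Ok(())
}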

Usage

webcrawler 1.0
Ayush Singh <ayushsingh1325@gmail.com>

USAGE:
    webcrawler [FLAGS] [OPTIONS] <url>

ARGS:
    <url>    Seed url for crawler

FLAGS:
    -h, --help        Prints help information
        --selenium    Flag for taking screenshots using Selenium. Takes screenshot if a word from
                      wordlist is found in the page
        --verbose     Output the link to standard output
    -V, --version     Prints version information

OPTIONS:
    -b, --blacklist <blacklist>            Path of file containing list of domains not to be crawled
    -d, --depth <depth>                    Gives numeric depth for crawl
    -o, --output-folder <output-folder>    Path to the output folder
    -s, --search-words <search-words>      Path to file containing words to search for in the page
        --task-limit <task-limit>          Limits the number of parallel tasks [default: 1000]
    -t, --timeout <timeout>                Timeout for HTTP requests [default: 10]
    -w, --whitelist <whitelist>            Path of file containing list of domains to be crawled
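
For example, a verbose crawl of depth 2 that writes its results to a local folder might look like this (the seed URL and output path below are placeholders):

    webcrawler --verbose -d 2 -o ./output https://example.com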

Resources