Webcrawler

  • The project crawls web pages and collects as much information as possible, such as external links, email addresses, etc. It works like the web crawlers used by search engines, but is scoped to a specific domain and URL (see the sketch after this list).
  • It is a project for WOC.
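
A minimal sketch of the core idea, assuming the reqwest, select, and Tokio crates listed under Dependencies. The seed URL and the extraction logic here are illustrative, not the project's actual code:

    use select::document::Document;
    use select::predicate::Name;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical seed URL; the real crawler takes it as the <url> argument.
        let seed = "https://example.com";
        // Fetch the page body over HTTP.
        let body = reqwest::get(seed).await?.text().await?;
        // Parse the HTML and print every anchor's href. A full crawler would
        // also collect mailto: links, filter by domain, and queue the links
        // it finds for further crawling.
        let doc = Document::from(body.as_str());
        for href in doc.find(Name("a")).filter_map(|node| node.attr("href")) {
            println!("{}", href);
        }
        Ok(())
    }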

Dependencies

  • reqwest : For making HTTP requests.
  • select : A library to extract useful data from HTML documents, suitable for web scraping.
  • clap : Command Line Argument Parser for Rust.
  • Tokio : A runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
  • Futures : A library providing the foundations for asynchronous programming in Rust.
  • Serde : A framework for serializing and deserializing Rust data structures efficiently and generically.
  • Mime : Support MIME (HTTP Media Types) as strong types in Rust.
  • trust-dns-resolver : A DNS resolver written in Rust.
  • thirtyfour : A Selenium / WebDriver library for Rust, for automated website UI testing.
  • url : URL library for Rust.
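
In a Cargo.toml these dependencies would appear roughly as follows; the version numbers here are assumptions for illustration, not taken from the project:

    [dependencies]
    reqwest = "0.11"
    select = "0.5"
    clap = "2"
    tokio = { version = "1", features = ["full"] }
    futures = "0.3"
    serde = { version = "1", features = ["derive"] }
    mime = "0.3"
    trust-dns-resolver = "0.20"
    thirtyfour = "0.27"
    url = "2"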

Usage

webcrawler 1.0
Ayush Singh <ayushsingh1325@gmail.com>

USAGE:
    webcrawler [FLAGS] [OPTIONS] <url>

ARGS:
    <url>    Seed url for crawler

FLAGS:
    -h, --help        Prints help information
        --selenium    Flag for taking screenshots using Selenium. Takes screenshot if a word from
                      wordlist is found in the page
        --verbose     Output the link to standard output
    -V, --version     Prints version information

OPTIONS:
    -b, --blacklist <blacklist>            Path of file containing list of domains not to be crawled
    -d, --depth <depth>                    Gives numeric depth for crawl
    -o, --output-folder <output-folder>    Path to the output folder
    -s, --search-words <search-words>      Path to file containing words to search for in the page
        --task-limit <task-limit>          Limits the number of parallel tasks [default: 1000]
    -t, --timeout <timeout>                Timeout for http requests [default: 10]
    -w, --whitelist <whitelist>            Path of file containing list of domains to be crawled
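
For example, a crawl of a hypothetical seed URL, limited to a depth of 2 and writing results to ./output, might look like:

    webcrawler --depth 2 --output-folder ./output --verbose https://example.com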

Resources
