
Webcrawler

  • The project crawls a webpage and collects as much information as possible, such as external links, email addresses, etc. It works like the web crawlers used by search engines, but is scoped to a specific domain and URL.

  • It is a project for WOC.

Dependencies

  • reqwest : For making HTTP requests.
  • select : A library to extract useful data from HTML documents, suitable for web scraping.
  • clap : A command-line argument parser for Rust.
  • tokio : A runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
  • futures : A library providing the foundations for asynchronous programming in Rust.
  • serde : A framework for serializing and deserializing Rust data structures efficiently and generically.
  • mime : Support for MIME (HTTP media types) as strong types in Rust.
  • trust-dns-resolver : A DNS resolver written in Rust.
  • thirtyfour : A Selenium / WebDriver library for Rust, used for automated website UI testing and screenshots.
  • url : A URL parsing library for Rust.
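
To illustrate how some of these crates fit together, here is a minimal sketch (an illustrative example, not code from this repository): it fetches a seed page with reqwest, extracts anchor links with select, and resolves them against the seed URL with url, inside a tokio runtime. The seed address is a placeholder, and tokio's "macros" and "rt-multi-thread" features are assumed.

// Hypothetical sketch, not this project's actual crawler code.
use select::document::Document;
use select::predicate::Name;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder seed URL; the real crawler takes this as its <url> argument.
    let seed = Url::parse("https://example.com")?;

    // Fetch the page body over HTTP.
    let body = reqwest::get(seed.as_str()).await?.text().await?;

    // Extract href attributes from anchor tags and resolve relative links
    // against the seed URL.
    for href in Document::from(body.as_str())
        .find(Name("a"))
        .filter_map(|node| node.attr("href"))
    {
        if let Ok(link) = seed.join(href) {
            println!("{}", link);
        }
    }
    Ok(())
}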

Usage

webcrawler 1.0
Ayush Singh <ayushsingh1325@gmail.com>

USAGE:
    webcrawler [FLAGS] [OPTIONS] <url>

ARGS:
    <url>    Seed url for crawler

FLAGS:
    -h, --help        Prints help information
        --selenium    Flag for taking screenshots using Selenium. Takes screenshot if a word from
                      wordlist is found in the page
        --verbose     Output the link to standard output
    -V, --version     Prints version information

OPTIONS:
    -b, --blacklist <blacklist>            Path of file containing list of domains not to be crawled
    -d, --depth <depth>                    Gives numeric depth for crawl
    -o, --output-folder <output-folder>    Path to the output folder
    -s, --search-words <search-words>      Path to file containing words to search for in the page
        --task-limit <task-limit>          Limits the number of parallel tasks [default: 1000]
    -t, --timeout <timeout>                Timeout for HTTP requests [default: 10]
    -w, --whitelist <whitelist>            Path of file containing list of domains to be crawled
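
For example, a verbose crawl of depth 2 that writes its results to a local folder might look like this (the seed URL and output path below are placeholders):

    webcrawler --verbose -d 2 -o ./output https://example.com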

Resources