Webcrawler

  • The project crawls web pages and collects as much information as possible, such as external links, email addresses, etc. It works like the web crawlers used by search engines, but is scoped to a specific domain and URL (see the sketch after this list).
  • It is a project for WOC.
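
A minimal sketch of the core idea, assuming the reqwest, select, and Tokio crates listed under Dependencies. The seed URL and the extraction logic here are illustrative, not the project's actual code:

    use select::document::Document;
    use select::predicate::Name;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical seed URL; the real crawler takes it as the <url> argument.
        let seed = "https://example.com";
        // Fetch the page body over HTTP.
        let body = reqwest::get(seed).await?.text().await?;
        // Parse the HTML and print every anchor's href. A full crawler would
        // also collect mailto: links, filter by domain, and queue the links
        // it finds for further crawling.
        let doc = Document::from(body.as_str());
        for href in doc.find(Name("a")).filter_map(|node| node.attr("href")) {
            println!("{}", href);
        }
        Ok(())
    }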

Dependencies

  • reqwest : For making HTTP requests.
  • select : A library to extract useful data from HTML documents, suitable for web scraping.
  • clap : Command Line Argument Parser for Rust.
  • Tokio : A runtime for writing reliable, asynchronous, and slim applications with the Rust programming language.
  • Futures : A library providing the foundations for asynchronous programming in Rust.
  • Serde : A framework for serializing and deserializing Rust data structures efficiently and generically.
  • Mime : Support MIME (HTTP Media Types) as strong types in Rust.
  • trust-dns-resolver : A DNS resolver written in Rust.
  • thirtyfour : A Selenium / WebDriver library for Rust, for automated website UI testing.
  • url : URL library for Rust.
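
In a Cargo.toml these dependencies would appear roughly as follows; the version numbers here are assumptions for illustration, not taken from the project:

    [dependencies]
    reqwest = "0.11"
    select = "0.5"
    clap = "2"
    tokio = { version = "1", features = ["full"] }
    futures = "0.3"
    serde = { version = "1", features = ["derive"] }
    mime = "0.3"
    trust-dns-resolver = "0.20"
    thirtyfour = "0.27"
    url = "2"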

Usage

webcrawler 1.0
Ayush Singh <ayushsingh1325@gmail.com>

USAGE:
    webcrawler [FLAGS] [OPTIONS] <url>

ARGS:
    <url>    Seed url for crawler

FLAGS:
    -h, --help        Prints help information
        --selenium    Flag for taking screenshots using Selenium. Takes screenshot if a word from
                      wordlist is found in the page
        --verbose     Output the link to standard output
    -V, --version     Prints version information

OPTIONS:
    -b, --blacklist <blacklist>            Path of file containing list of domains not to be crawled
    -d, --depth <depth>                    Gives numeric depth for crawl
    -o, --output-folder <output-folder>    Path to the output folder
    -s, --search-words <search-words>      Path to file containing words to search for in the page
        --task-limit <task-limit>          Limits the number of parallel tasks [default: 1000]
    -t, --timeout <timeout>                Timeout for http requests [default: 10]
    -w, --whitelist <whitelist>            Path of file containing list of domains to be crawled
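
For example, a crawl of a hypothetical seed URL, limited to a depth of 2 and writing results to ./output, might look like:

    webcrawler --depth 2 --output-folder ./output --verbose https://example.com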

Resources
