Crawlarr

Crawlarr is a web crawler built with the Go programming language. Given a base URL, it searches the page's HTML for anchor tags (<a>), follows each discovered link, and repeats the process on every subsequent page until it runs out of links or reaches a user-defined maximum depth. Crawling is concurrent, which significantly increases speed.
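
To make the idea concrete, below is a minimal, hypothetical sketch of such a crawl loop in Go. It is illustrative only and not Crawlarr's actual code: it assumes the golang.org/x/net/html tokenizer for anchor extraction and uses naive one-goroutine-per-link concurrency, with no match-type filtering, politeness delay, or output file.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
        "sync"

        "golang.org/x/net/html"
    )

    type crawler struct {
        mu       sync.Mutex
        seen     map[string]bool
        wg       sync.WaitGroup
        maxDepth int
    }

    // crawl fetches one page, prints every link it finds, then follows
    // each unvisited link in its own goroutine until maxDepth is reached.
    func (c *crawler) crawl(pageURL string, depth int) {
        defer c.wg.Done()
        if depth > c.maxDepth {
            return
        }
        c.mu.Lock()
        visited := c.seen[pageURL]
        c.seen[pageURL] = true
        c.mu.Unlock()
        if visited {
            return
        }

        resp, err := http.Get(pageURL)
        if err != nil {
            return
        }
        defer resp.Body.Close()

        base, err := url.Parse(pageURL)
        if err != nil {
            return
        }
        for _, link := range extractLinks(base, resp.Body) {
            fmt.Println(link)
            c.wg.Add(1)
            go c.crawl(link, depth+1) // one goroutine per discovered link
        }
    }

    // extractLinks scans the HTML for <a> tags and returns each href,
    // resolved against the page URL so relative links become absolute.
    func extractLinks(base *url.URL, body io.Reader) []string {
        var links []string
        z := html.NewTokenizer(body)
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return links // io.EOF or a parse error ends the scan
            }
            if tt != html.StartTagToken && tt != html.SelfClosingTagToken {
                continue
            }
            name, hasAttr := z.TagName()
            if string(name) != "a" || !hasAttr {
                continue
            }
            for {
                key, val, more := z.TagAttr()
                if string(key) == "href" {
                    if u, err := base.Parse(string(val)); err == nil {
                        links = append(links, u.String())
                    }
                }
                if !more {
                    break
                }
            }
        }
    }

    func main() {
        c := &crawler{seen: make(map[string]bool), maxDepth: 2}
        c.wg.Add(1)
        go c.crawl("http://example.com/", 0)
        c.wg.Wait()
    }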

See the changelog for the latest updates.

Table of contents

  • Installation
    • Compiling from source
  • Usage
  • Configuring Crawlarr
  • License

Usage

  • With a binary:
    • Make the binary executable with chmod +x crawlarr.
    • Start the tool with ./crawlarr.
  • Running from source:
    • Start the tool with go run ./cmd/crawlarr/main.go, or with cd ./cmd/crawlarr/ && go run .

Find the results in links.txt.

Configuring Crawlarr

The config can be found at the root of the project.

  • Open the config in your favorite editor.
  • Enable the features you want to use. See Config details for in-depth explanations.

Config details

Item        Values    Meaning
debug       boolean   Enable debug logs
baseUrl     text      The URL to start crawling from
matchType   text      Matching type for URLs (see Matching types below)
depthLimit  number    Maximum crawling depth
delay       number    Delay in ms between crawls
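
For illustration, a configuration using these options might look like the following. This assumes a JSON-style config file; check the file at the project root for its actual name and format.

    {
        "debug": false,
        "baseUrl": "http://example.com/",
        "matchType": "SAME_HOST",
        "depthLimit": 3,
        "delay": 250
    }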

Matching types

  • SAME_BASE:
    Match the same base URL, e.g.:

    baseUrl: "http://example.com/this-page/"
    + valid match: http://example.com/this-page/random-page/
    - discarded match: http://example.com/another-page/
    - discarded match: http://test.example.com/
    - discarded match: http://random.site/a-third-page/
  • SAME_HOST:
    Match the same host, e.g.:

    baseUrl: "http://example.com/this-page/"
    + valid match: http://example.com/this-page/random-page/
    + valid match: http://example.com/another-page/
    - discarded match: http://test.example.com/
    - discarded match: http://random.site/another-page/
  • SAME_ORIGIN:
    Match the same origin, e.g.:

    baseUrl: "http://example.com/this-page/"
    + valid match: http://example.com/this-page/random-page/
    + valid match: http://example.com/another-page/
    + valid match: http://test.example.com/
    - discarded match: http://random.site/another-page/
  • DANGEROUS_NO_MATCH_TYPE_ONLY_ENABLE_IF_YOU_KNOW_WHAT_YOURE_DOING:
    Match any URL (this can crawl far beyond the starting site), e.g.:

    baseUrl: "http://example.com/this-page/"
    + valid match: http://example.com/this-page/random-page/
    + valid match: http://example.com/another-page/
    + valid match: http://test.example.com/
    + valid match: http://random.site/another-page/
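
As a sketch of how these four rules could be implemented (hypothetical, not Crawlarr's actual code), the behaviors map naturally onto Go's standard net/url package:

    package matchsketch

    import (
        "net/url"
        "strings"
    )

    // matches reports whether candidate should be crawled, given the
    // configured matchType and the parsed base URL.
    func matches(matchType string, base, candidate *url.URL) bool {
        switch matchType {
        case "SAME_BASE":
            // The candidate must start with the full base URL.
            return strings.HasPrefix(candidate.String(), base.String())
        case "SAME_HOST":
            // Hosts must be identical; subdomains do not match.
            return candidate.Hostname() == base.Hostname()
        case "SAME_ORIGIN":
            // The base host and any of its subdomains match.
            return candidate.Hostname() == base.Hostname() ||
                strings.HasSuffix(candidate.Hostname(), "."+base.Hostname())
        default:
            // The "dangerous" mode accepts every URL.
            return true
        }
    }

With baseUrl http://example.com/this-page/, SAME_ORIGIN accepts http://test.example.com/ because its host ends in .example.com, while SAME_HOST rejects it because the hosts differ.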

License

Copyright (c) 2023 LockBlock-dev

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with Crawlarr. If not, see https://www.gnu.org/licenses/.
