A fast Web API scraper written in C++ and built on Boost ASIO


Abrade

Abrade is a coroutine-based web scraper for checking the existence (a HEAD request) or fetching the contents (a GET request) of web resources whose URLs follow a sequential, numerical pattern.

Check out the blog post at http://lospi.net for usage and examples.

> abrade -h
Usage: abrade host pattern:
  --host arg                            host name (eg example.com)
  --pattern arg (=/)                    format of URL (eg ?mynum={1:5}&myhex=0x
                                        {hhhh}). See documentation for
                                        formatting of patterns.
  --agent arg (=Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0)
                                        User-agent string (default: Firefox 47)
  --out arg                             output path. dir if contents enabled.
                                        (default: HOSTNAME)
  --err arg                             error path (file). (default:
                                        HOSTNAME-err.log)
  --proxy arg                           SOCKS5 proxy address:port. (default:
                                        none)
  --screen arg                          omits 200-level response if contents
                                        contains screen (default: none)
  -d [ --stdin ]                        read from stdin (default: no)
  -t [ --tls ]                          use tls/ssl (default: no)
  -s [ --sensitive ]                    complain about rude TCP teardowns
                                        (default: no)
  -o [ --tor ]                          use local proxy at 127.0.0.1:9050
                                        (default: no)
  -r [ --verify ]                       verify ssl (default: no)
  -l [ --leadzero ]                     output leading zeros in URL (default:
                                        no)
  -e [ --telescoping ]                  do not telescope the pattern (default:
                                        no)
  -f [ --found ]                        print when resource found (default:
                                        no). implied by verbose
  -v [ --verbose ]                      prints gratuitous output to console
                                        (default: no)
  -c [ --contents ]                     read full contents (default: no)
  --test                                no network requests, just write
                                        generated URIs to console (default: no)
  -p [ --optimize ]                     Optimize number of simultaneous
                                        requests (default: no)
  -i [ --init ] arg (=1000)             Initial number of simultaneous requests
  --min arg (=1)                        Minimum number of simultaneous requests
  --max arg (=25000)                    Maximum number of simultaneous requests
  --ssize arg (=50)                     Size of velocity sliding window
  --sint arg (=1000)                    Size of sampling interval
  -h [ --help ]                         produce help message
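
The --pattern example in the help text above, ?mynum={1:5}, suggests a numeric range placeholder. As a rough illustration of the idea only, the sketch below expands a single {lo:hi} placeholder into one URL per integer. The function name expand_range, the inclusive bounds, and the plain decimal formatting are assumptions made for this example; Abrade's actual pattern grammar (including the {hhhh} hex form and telescoping) is covered by its documentation and is not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Abrade: expands the first "{lo:hi}"
// placeholder in `pattern` into one URL per integer in [lo, hi].
// Assumptions: the range is inclusive, counts upward, and is decimal.
std::vector<std::string> expand_range(const std::string& pattern) {
  const std::size_t open = pattern.find('{');
  const std::size_t colon = pattern.find(':', open);
  const std::size_t close = pattern.find('}', open);
  if (open == std::string::npos || colon == std::string::npos ||
      close == std::string::npos || colon > close) {
    return {pattern};  // no range placeholder: treat the pattern as one URL
  }

  // std::stoul throws std::invalid_argument if a bound is not numeric.
  const unsigned long lo =
      std::stoul(pattern.substr(open + 1, colon - open - 1));
  const unsigned long hi =
      std::stoul(pattern.substr(colon + 1, close - colon - 1));

  const std::string prefix = pattern.substr(0, open);
  const std::string suffix = pattern.substr(close + 1);

  std::vector<std::string> urls;
  for (unsigned long n = lo; n <= hi; ++n) {
    urls.push_back(prefix + std::to_string(n) + suffix);
  }
  return urls;
}
```

Under these assumptions, expand_range("/?mynum={1:3}") yields /?mynum=1, /?mynum=2, and /?mynum=3. The --test option listed above writes Abrade's own generated URIs to the console without making network requests, which is the reliable way to check a real pattern.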

v0.2

You can now pipe URLs to Abrade via the --stdin option:

echo /anything/a/b/c?d=123 | abrade httpbin.org --stdin --contents --verbose

You must omit the pattern positional argument to pipe from stdin.

You can also use the --screen option to detect error landing pages that still return a 200-level response. Any such response whose contents contain the screen string is screened out and is not written to disk during a --contents scrape.
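
The screening rule described above can be sketched as a small predicate. This is only an illustration based on the help text ("omits 200-level response if contents contains screen"); the function name is_screened and the case-sensitive substring match are assumptions, not Abrade's actual implementation.

```cpp
#include <string>

// Hypothetical helper, not Abrade's own code: returns true when a response
// should be dropped because it looks like an error page served with a
// 200-level status.
// Assumption: "contains" means a plain, case-sensitive substring match.
bool is_screened(int status, const std::string& body,
                 const std::string& screen) {
  const bool is_success = status >= 200 && status < 300;
  // An empty screen string means screening is disabled (default: none).
  return is_success && !screen.empty() &&
         body.find(screen) != std::string::npos;
}
```

For instance, with the screen string "not found", a 200 response whose body is "<h1>Page not found</h1>" would be dropped, while a 200 response without that text would be kept.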

Linux ELF

Windows EXE

Docker Image

docker pull jlospinoso/abrade:v0.2.0

or

docker pull quay.io/jlospinoso/abrade:v0.2.0

v0.1

Linux ELF

Windows EXE

Docker Image

docker pull jlospinoso/abrade:v0.1.0

or

docker pull quay.io/jlospinoso/abrade:v0.1.0

Building Abrade

  1. Abrade uses cmake, so you'll need to install it.
  2. Clone abrade.
  3. Navigate to the checked out directory.
  4. Make a build subdirectory.
  5. Navigate to the build directory.
  6. Invoke cmake.
  7. Use make (*nix) or Visual Studio (Windows) to build the project.

For example, on *nix:

git clone git@github.com:JLospinoso/abrade.git
cd abrade
mkdir build
cd build
cmake ..
make

On Windows, open the abrade.sln file that cmake generates in Visual Studio and build the solution.