• Extract text from HTML

    Python 24 8 MIT Updated Sep 20, 2018
  • A project that attempts to automatically log in to a website given a single seed

    Python 57 21 Apache-2.0 Updated Sep 4, 2018
  • scikit-learn inspired API for CRFsuite

    Python 199 58 Updated Aug 2, 2018
  • Formasaurus tells you the type of an HTML form and its fields using machine learning

    HTML 42 18 Updated Jul 2, 2018
  • A library for debugging/inspecting machine learning classifiers and explaining their predictions

    Jupyter Notebook 901 125 MIT 1 issue needs help Updated Jun 5, 2018
  • Scrapy middleware for autologin

    Python 17 8 Updated May 29, 2018
  • A generic crawler

    Python 38 13 Updated May 29, 2018
  • Broad crawler for domain discovery

    Python 4 2 MIT Updated May 29, 2018
  • Simple heuristic for measuring web page similarity (& data set)

    HTML 20 3 Updated May 29, 2018
  • Headless Horseman Page Classifier service

    Python 3 3 MIT Updated May 29, 2018
  • Adaptive crawler which uses Reinforcement Learning methods

    Jupyter Notebook 77 20 Updated May 29, 2018
  • A collection of example Lua scripts and JS utilities

    JavaScript 3 Updated May 29, 2018
  • Use multiple proxies with Scrapy

    Python 181 37 MIT Updated May 30, 2018
  • Log TensorBoard events without touching TensorFlow

    Python 505 35 MIT Updated May 30, 2018
  • Detect and classify pagination links

    HTML 33 8 Updated May 29, 2018
  • A component that tries to avoid downloading duplicate content

    Python 12 8 MIT Updated May 29, 2018
  • This is the facade for installation and access to the individual components

    Shell 6 2 Apache-2.0 Updated May 29, 2018
  • A simple tool to add a new user with OpenSSH keys.

    Python 1 1 MIT Updated May 29, 2018
  • Splash + HAProxy + Docker Compose

    Python 68 19 MIT Updated May 29, 2018
  • Scrapy middleware that allows crawling only new content

    Python 43 13 MIT Updated May 29, 2018
  • Scrapy extension which writes crawled items to Kafka

    Python 11 4 MIT Updated May 29, 2018
  • A BK-tree-based approach to storing and querying strings by Levenshtein distance.

    C 2 5 MIT Updated May 29, 2018
  • A classifier for detecting soft 404 pages

    Jupyter Notebook 29 5 Updated May 29, 2018
  • Extract the difference between two HTML pages

    HTML 18 3 MIT Updated May 29, 2018
  • Web Crawling UI and HTTP API, based on Scrapy and Tornado

    Python 112 55 Updated May 29, 2018
  • Agnostic Database Migrations

    Python 14 6 MIT Updated May 29, 2018
  • Scrapy middleware that reads proxy config from settings

    Python 3 5 MIT Updated May 29, 2018
  • Annotate parts of web pages in the browser

    Python 4 MIT Updated May 29, 2018
  • Show summary of a large number of URLs in a Jupyter Notebook

    Python 9 5 MIT Updated May 29, 2018
  • Broad crawl of onion sites in search for captchas

    Python 2 1 MIT Updated May 29, 2018