web-crawler

Here are 862 public repositories matching this topic...

apify / crawlee

Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

nodejs javascript npm crawler scraper automation typescript web-crawler headless scraping crawling web-scraping web-crawling headless-chrome apify puppeteer playwright

Updated Apr 30, 2024
TypeScript

crawlab-team / crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

go docker platform crawler spider web-crawler scrapy webcrawler scrapyd-ui webspider crawling-tasks crawlab spiders-management

Updated Apr 19, 2024
Go

ssssssss-team / spider-flow

A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.

crawler spider web-crawler jsoup xpath webcrawler webspider web-spider spider-flow

Updated Jun 14, 2023
Java

BruceDone / awesome-crawler

A collection of awesome web crawlers and spiders in different languages

crawler scraper awesome spider web-crawler web-scraper node-crawler

Updated Apr 8, 2024

apache / nutch

Apache Nutch is an extensible and scalable web crawler

java hadoop web-crawler nutch crawling apache

Updated Apr 30, 2024
Java

mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown

markdown crawler data scraper ai html-to-markdown web-crawler scraping rag llm

Updated May 1, 2024
TypeScript

sjdirect / abot

Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Updated Jun 1, 2023
C#

xianhu / PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

python crawler multi-threading spider multiprocessing web-crawler proxies python-spider web-spider

Updated Jun 10, 2022
Python

apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

java crawler web-crawler distributed apache-storm stormcrawler

Updated May 2, 2024
HTML

MarginaliaSearch / MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

search-engine web-crawler indexer language-processing no-ai-used internet-search no-cloud self-hostable small-web alt-search

Updated May 1, 2024
HTML

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

ruby crawler scraper web spider web-crawler web-scraper web-scraping web-spider spider-links

Updated Jan 25, 2024
Ruby

platonai / PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

crawler data-science data-mining scraper web-crawler scraping web-scraping web-mining web-automation rpa web-sql

Updated Apr 30, 2024
Kotlin

Algebra-FUN / WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

web-crawler selenium weread book-downloader

Updated Sep 19, 2023
Python

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler web-crawler crawling warc web-archiving webrecorder wacz

Updated May 2, 2024
TypeScript

hyunwoongko / kochat

Open-source Korean chatbot framework

deep-learning web-crawler chatbot korean deeplearning sentence-classification korean-chatbot sequance-tagging

Updated May 22, 2023
Python

VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.

web-crawler web-scraping hacktoberfest web-spider focused-crawler domain-specific-search web-search

Updated Aug 24, 2023
Java

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

search search-engine distributed-systems information-retrieval big-data spark solr web-crawler nutch tika

Updated Mar 30, 2023
Java

brendonboshell / supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

sitemap crawler robot web-crawler distributed-crawler

Updated Dec 30, 2022
JavaScript

rivermont / spidy

The simple, easy-to-use command-line web crawler.

python crawler web-crawler crawling python3 web-spider

Updated Oct 9, 2023
Python

infinilabs / crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

lightweight elasticsearch crawler spider web-crawler scraping crawling web-scraping web-spider

Updated May 19, 2021
Go
