MemoryCrawler is a lightweight, in-memory web crawler and search engine built in pure Java. It crawls web pages, indexes their content, and supports fast keyword search through a custom inverted index. No external database is required: everything runs in memory for speed and simplicity.
- Multi-threaded web crawling
- In-memory inverted index for fast search
- Simple query engine with ranking
- Extensible architecture
- Uses only core Java and jsoup for HTML parsing
+-------------------+
|    Main (CLI)     |
+-------------------+
          |
          v
+-------------------+
|      Crawler      |--- Fetcher (HTTP)
|                   |--- Parser (HTML)
+-------------------+
          |
          v
+-------------------+
|      Indexer      |
+-------------------+
          |
          v
+-------------------+
|   InvertedIndex   |
+-------------------+
          |
          v
+-------------------+
|    QueryEngine    |--- Ranker
+-------------------+
- Main: Entry point, manages crawling and search loop.
- Crawler: Manages threads, URL queue, and visited set.
- Fetcher: Downloads HTML using jsoup.
- Parser: Extracts text and links from HTML.
- Indexer: Tokenizes and adds content to the inverted index.
- InvertedIndex: Maps words to documents for fast lookup.
- QueryEngine: Processes search queries and ranks results.
- Ranker: Orders results by relevance.
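
To make the Indexer, InvertedIndex, QueryEngine, and Ranker roles above concrete, here is a minimal sketch of how they could fit together. It is not MemoryCrawler's actual code: the class name `InvertedIndexSketch`, its methods, the tokenization rule, and the term-count scoring are all assumptions chosen for illustration, and the real Ranker may use a different relevance measure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names and scoring are illustrative, not the project's API.
class InvertedIndexSketch {
    // word -> (document URL -> number of occurrences of the word in that document)
    private final Map<String, Map<String, Integer>> postings = new ConcurrentHashMap<>();

    // Indexer role: lowercase the text, split on non-alphanumerics, count each word.
    void addDocument(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new ConcurrentHashMap<>())
                    .merge(url, 1, Integer::sum);
        }
    }

    // QueryEngine + Ranker roles: sum the term counts per document,
    // then order by score (highest first), breaking ties by URL.
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            Map<String, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term is empty or was never indexed
            }
            docs.forEach((url, count) -> scores.merge(url, count, Integer::sum));
        }
        List<String> results = new ArrayList<>(scores.keySet());
        results.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return results;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Placeholder URLs and text, standing in for crawled pages.
        index.addDocument("https://example.com/a", "Java crawler written in Java");
        index.addDocument("https://example.com/b", "Search engine basics");
        System.out.println(index.search("java search"));
    }
}
```

Storing per-document counts in the postings map keeps lookup to one hash access per query term, which is what makes an inverted index fast compared with scanning every page.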
- Concurrency: ThreadPoolExecutor, BlockingQueue, ConcurrentHashMap
- Collections: Set, List, Map
- Streams & Lambdas: For filtering and ranking
- Exception Handling: Robust error management
- OOP Principles: Encapsulation, modularity, extensibility
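
As an illustration of how the concurrency pieces listed above can cooperate, the sketch below runs a multi-threaded crawl over a small in-memory link graph. It is a simplified stand-in, not the project's Crawler: `CrawlFrontierSketch`, the `linkGraph` map (which replaces the real Fetcher and Parser so the example needs no network), and the `AtomicInteger` used to detect completion are all assumptions made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the real Crawler's structure may differ.
class CrawlFrontierSketch {
    private final Map<String, List<String>> linkGraph;                 // page -> outgoing links
    private final Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe visited set
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final AtomicInteger pending = new AtomicInteger();         // queued or in-progress URLs

    CrawlFrontierSketch(Map<String, List<String>> linkGraph) {
        this.linkGraph = linkGraph;
    }

    Set<String> crawl(String seed, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        enqueue(seed);
        for (int i = 0; i < threads; i++) {
            pool.execute(this::workLoop);
        }
        pool.shutdown(); // no new tasks; workers exit once pending reaches zero
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return visited;
    }

    // visited.add is atomic, so each URL is queued at most once across all threads.
    private void enqueue(String url) {
        if (visited.add(url)) {
            pending.incrementAndGet();
            frontier.add(url);
        }
    }

    private void workLoop() {
        try {
            while (pending.get() > 0) {
                String url = frontier.poll(50, TimeUnit.MILLISECONDS);
                if (url == null) {
                    continue; // queue momentarily empty; re-check pending
                }
                try {
                    // Stand-in for fetch + parse: look up the page's links.
                    for (String link : linkGraph.getOrDefault(url, List.of())) {
                        enqueue(link);
                    }
                } finally {
                    // Decrement only after children are queued, so pending
                    // cannot hit zero while work remains.
                    pending.decrementAndGet();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(new CrawlFrontierSketch(graph).crawl("a", 4));
    }
}
```

The sketch uses `List.of` and `Map.of`, so it assumes Java 9 or later. The key design point is that deduplication happens at enqueue time through the concurrent set, which prevents two threads from fetching the same URL even when link cycles exist.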
- Clone the repo
- Download the jsoup JAR and place it in lib/
- Compile and run (see below)
mvn clean package
java -cp target/web-crawler-1.0-SNAPSHOT.jar:lib/jsoup.jar com.searchengine.main.Main