MemoryCrawler: In-Memory Java Web Crawler & Search Engine

Overview

MemoryCrawler is a lightweight, in-memory web crawler and search engine written in Java, with jsoup as its only external library. It crawls web pages, indexes their content, and supports fast keyword search over a custom inverted index. No external database is required: everything runs in memory for speed and simplicity.

Features

Multi-threaded web crawling
In-memory inverted index for fast search
Simple query engine with ranking
Extensible architecture
Uses only core Java and jsoup for HTML parsing

Architecture

+-------------------+
|   Main (CLI)      |
+-------------------+
         |
         v
+-------------------+
|    Crawler        |--- Fetcher (HTTP)
|                   |--- Parser (HTML)
+-------------------+
         |
         v
+-------------------+
|    Indexer        |
+-------------------+
         |
         v
+-------------------+
| InvertedIndex     |
+-------------------+
         |
         v
+-------------------+
|  QueryEngine      |--- Ranker
+-------------------+

Main: Entry point, manages crawling and search loop.
Crawler: Manages threads, URL queue, and visited set.
Fetcher: Downloads HTML using jsoup.
Parser: Extracts text and links from HTML.
Indexer: Tokenizes and adds content to the inverted index.
InvertedIndex: Maps words to documents for fast lookup.
QueryEngine: Processes search queries and ranks results.
Ranker: Orders results by relevance.
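
To make the Indexer, InvertedIndex, QueryEngine, and Ranker roles above more concrete, here is a minimal sketch of how they could fit together. It is not the project's actual code: the class name InvertedIndexSketch, the methods add and search, the tokenization rule, and the term-count scoring are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; names and scoring are assumptions, not the project's code.
class InvertedIndexSketch {
    // term -> (document URL -> number of times the term occurs in that document)
    private final Map<String, Map<String, Integer>> postings = new ConcurrentHashMap<>();

    // Indexer step: lowercase the text, split on non-alphanumeric runs, count each term.
    void add(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new ConcurrentHashMap<>())
                    .merge(url, 1, Integer::sum);
        }
    }

    // QueryEngine + Ranker step: sum the term counts per document, highest score first.
    // Ties are broken by URL so the ordering is deterministic.
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) continue;
            postings.getOrDefault(term, Map.of())
                    .forEach((url, count) -> scores.merge(url, count, Integer::sum));
        }
        List<String> results = new ArrayList<>(scores.keySet());
        results.sort(Comparator.comparing((String url) -> scores.get(url)).reversed()
                .thenComparing(Comparator.naturalOrder()));
        return results;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("https://example.com/a", "Java web crawler written in Java");
        index.add("https://example.com/b", "A search engine with an inverted index");
        index.add("https://example.com/c", "Java search basics");
        // Prints: [https://example.com/a, https://example.com/c, https://example.com/b]
        System.out.println(index.search("java search"));
    }
}
```

Raw term counts are the simplest possible relevance score; a real ranker might weight rare terms more heavily (for example with TF-IDF), which would only change the scoring inside search.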

Core Java Concepts Used

Concurrency: ThreadPoolExecutor, BlockingQueue, ConcurrentHashMap
Collections: Set, List, Map
Streams & Lambdas: For filtering and ranking
Exception Handling: Robust error management
OOP Principles: Encapsulation, modularity, extensibility
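
As a rough illustration of how these concurrency pieces can combine in a crawler, the sketch below runs a thread pool over a shared visited set. It is an assumption-laden example rather than the project's implementation: the class CrawlFrontierSketch, its crawl method, and the linkExtractor function (a stand-in for the Fetcher and Parser, so no network access is needed) are hypothetical. It uses Executors.newFixedThreadPool, which is backed by a ThreadPoolExecutor with an internal blocking work queue, and a concurrent set obtained from ConcurrentHashMap.newKeySet().

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustrative sketch only; names are assumptions, not the project's code.
class CrawlFrontierSketch {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger(); // tasks queued or running
    private final CountDownLatch done = new CountDownLatch(1);
    private final ExecutorService pool;
    private final Function<String, List<String>> linkExtractor; // stand-in for Fetcher + Parser

    CrawlFrontierSketch(int threads, Function<String, List<String>> linkExtractor) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.linkExtractor = linkExtractor;
    }

    // Crawls everything reachable from the seed, then returns the visited URLs.
    Set<String> crawl(String seed) {
        submit(seed);
        try {
            done.await(); // released when the last outstanding task finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    private void submit(String url) {
        // add() on the concurrent set is atomic, so each URL is scheduled at most once.
        if (!visited.add(url)) return;
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                // Child links are counted in `pending` before this task is counted out.
                linkExtractor.apply(url).forEach(this::submit);
            } finally {
                if (pending.decrementAndGet() == 0) done.countDown();
            }
        });
    }

    public static void main(String[] args) {
        // A tiny in-memory link graph replaces real HTTP fetching for this demo.
        Map<String, List<String>> links = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.<String>of());
        CrawlFrontierSketch crawler =
                new CrawlFrontierSketch(4, url -> links.getOrDefault(url, List.<String>of()));
        // Prints: [/, /a, /b, /c]
        System.out.println(new TreeSet<>(crawler.crawl("/")));
    }
}
```

The pending counter is one simple way to detect that the crawl has finished without polling; a depth limit or page cap would be a natural addition before pointing a crawler like this at real sites.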

Getting Started

Clone the repo
Download the jsoup JAR and place it in lib/
Compile and run (see Build & Run)

Build & Run

mvn clean package
java -cp target/web-crawler-1.0-SNAPSHOT.jar:lib/jsoup.jar com.searchengine.main.Main

The classpath separator above (:) is for Linux and macOS; on Windows, use ; instead.

Sk-singla/WebCrawler-SearchEngine-Java
