Sk-singla/WebCrawler-SearchEngine-Java

MemoryCrawler: In-Memory Java Web Crawler & Search Engine

Overview

MemoryCrawler is a lightweight, in-memory web crawler and search engine written in Java, with jsoup as its only library dependency. It crawls web pages, indexes their content, and answers keyword queries from a custom inverted index. No external database is required: the index lives entirely in memory, which keeps the design simple and lookups fast.

Features

  • Multi-threaded web crawling
  • In-memory inverted index for fast search
  • Simple query engine with ranking
  • Extensible architecture
  • Uses only core Java and jsoup for HTML parsing

Architecture

+-------------------+
|   Main (CLI)      |
+-------------------+
         |
         v
+-------------------+
|    Crawler        |--- Fetcher (HTTP)
|                   |--- Parser (HTML)
+-------------------+
         |
         v
+-------------------+
|    Indexer        |
+-------------------+
         |
         v
+-------------------+
| InvertedIndex     |
+-------------------+
         |
         v
+-------------------+
|  QueryEngine      |--- Ranker
+-------------------+

  • Main: Entry point, manages crawling and search loop.
  • Crawler: Manages threads, URL queue, and visited set.
  • Fetcher: Downloads HTML using jsoup.
  • Parser: Extracts text and links from HTML.
  • Indexer: Tokenizes and adds content to the inverted index.
  • InvertedIndex: Maps words to documents for fast lookup.
  • QueryEngine: Processes search queries and ranks results.
  • Ranker: Orders results by relevance.
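
To make the Indexer, InvertedIndex, QueryEngine, and Ranker roles above concrete, here is a minimal sketch of how such a pipeline could fit together. It is not the repository's code: the class name InvertedIndexSketch, its method names, the tokenization rule (lowercase alphanumeric runs), and the ranking rule (summed term frequency, ties broken by URL) are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch; names and ranking rule are assumptions, not the project's API. */
final class InvertedIndexSketch {
    // term -> (document URL -> number of times the term occurs in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    /** Splits text into lowercase alphanumeric tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Indexer step: records every token of the document in the postings map. */
    void addDocument(String url, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(url, 1, Integer::sum);
        }
    }

    /** Query + rank step: URLs matching any query term, highest summed term frequency first. */
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : new LinkedHashSet<>(tokenize(query))) { // ignore repeated query terms
            Map<String, Integer> docs = postings.getOrDefault(term, Collections.emptyMap());
            docs.forEach((url, tf) -> scores.merge(url, tf, Integer::sum));
        }
        return scores.entrySet().stream()
            .sorted((a, b) -> {
                int byScore = Integer.compare(b.getValue(), a.getValue());
                return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
            })
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("https://example.com/a", "Java web crawler in Java");
        index.addDocument("https://example.com/b", "Search engine written in Java");
        // prints [https://example.com/a, https://example.com/b]
        System.out.println(index.search("java"));
    }
}
```

A plain HashMap is enough when a single thread does the indexing. If several crawler threads write to the index at once, a ConcurrentHashMap (listed under the concepts below) would be the natural replacement.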

Core Java Concepts Used

  • Concurrency: ThreadPoolExecutor, BlockingQueue, ConcurrentHashMap
  • Collections: Set, List, Map
  • Streams & Lambdas: For filtering and ranking
  • Exception Handling: Robust error management
  • OOP Principles: Encapsulation, modularity, extensibility
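
As an illustration of how the concurrency pieces listed above might combine, the sketch below crawls an in-memory link graph with a fixed thread pool and a concurrent visited set. It is a sketch under stated assumptions, not the project's Crawler: the class name CrawlSketch and the linksOf function (a stand-in for fetching a page and parsing its links) are hypothetical, and a Phaser is used to detect completion, which the README does not mention. A real crawler would also cap the number of pages.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

/** Hypothetical sketch; names are illustrative and not taken from the repository. */
final class CrawlSketch {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Phaser inFlight = new Phaser(1); // the one initial party is the calling thread
    private final ExecutorService pool;
    private final Function<String, List<String>> linksOf; // stand-in for fetch + parse

    CrawlSketch(int threads, Function<String, List<String>> linksOf) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.linksOf = linksOf;
    }

    /** Schedules a URL unless it was already seen; add() on the concurrent set is atomic. */
    private void submit(String url) {
        if (!visited.add(url)) return;
        inFlight.register(); // count the task before it starts running
        pool.execute(() -> {
            try {
                for (String link : linksOf.apply(url)) submit(link);
            } catch (RuntimeException e) {
                // A page that fails is skipped here; a real crawler would log it.
            } finally {
                inFlight.arriveAndDeregister(); // this task is done
            }
        });
    }

    /** Crawls everything reachable from the seed and returns the visited URLs. Single use. */
    Set<String> crawl(String seed) {
        submit(seed);
        inFlight.arriveAndAwaitAdvance(); // blocks until every scheduled task has finished
        pool.shutdown();
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("a", "c"));
        graph.put("c", Arrays.asList("d"));
        CrawlSketch crawler =
            new CrawlSketch(4, url -> graph.getOrDefault(url, Collections.emptyList()));
        // prints [a, b, c, d]
        System.out.println(new TreeSet<>(crawler.crawl("a")));
    }
}
```

Each task registers with the Phaser before it is queued and deregisters when it finishes, so the calling thread wakes up only when no page is pending or in progress. This avoids the usual difficulty of deciding when a shared URL queue is empty for good rather than just momentarily.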

Getting Started

  1. Clone the repository.
  2. Download the jsoup JAR and place it in lib/ as jsoup.jar, the name the run command expects.
  3. Compile and run with the commands under Build & Run.

Build & Run

mvn clean package
java -cp target/web-crawler-1.0-SNAPSHOT.jar:lib/jsoup.jar com.searchengine.main.Main

The classpath above uses the Unix separator (:). On Windows, use a semicolon (;) instead.
