JSearch

JSearch is a Java-based search engine that indexes and searches documents in a directory tree. It ranks results with the Vector Space Model and TF-IDF weighting, and it also supports full-text search. Crawling, indexing, and searching are all implemented from scratch without external libraries.

Screenshots

CLI

DESKTOP GUI

Features

Efficient Multithreading: Uses multithreading to speed up indexing and searching. Crawling and indexing 2 GB of data takes about 15 seconds.
Full Text Search: Performs full-text search against the inverted index file. Finding exactly which file contains a 200-word query within 2 GB of data takes about 1 second.
Custom Algorithms: Implements various custom algorithms for indexing and searching.
TF-IDF Weighting: Enhances search accuracy with TF-IDF (Term Frequency-Inverse Document Frequency).
Keyword Boosting: Adjusts search relevance based on term frequency and document frequency.
Cosine Similarity: Ranks search results using cosine similarity.
Configurable Ignored Directories: Allows users to specify directories to exclude from indexing.
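
To make the TF-IDF and cosine similarity features above more concrete, here is a minimal sketch of how such a ranking can work. It is not JSearch's actual code: the class name `TfIdfSketch`, the method names, and the choice of raw term counts with `log(N / df)` as the IDF are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and weighting variant are assumptions, not JSearch's API.
class TfIdfSketch {

    // Builds one sparse TF-IDF vector (term -> weight) per tokenized document.
    // TF is the raw term count; IDF is log(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum); // count each term once per document
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term frequency
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                e.setValue(e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity between two sparse vectors; returns 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("graph", "traverse", "graph"),
            List.of("tree", "traverse"),
            List.of("sort", "array"));
        List<Map<String, Double>> vectors = tfIdf(docs);
        // Documents 0 and 1 share "traverse", so their similarity is positive.
        System.out.println(cosine(vectors.get(0), vectors.get(1)));
    }
}
```

In a search setting, the query would be turned into a vector the same way and compared against every document vector, with the highest cosine values ranked first.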

Installation

Clone the repository:

git clone https://github.com/yourusername/jsearch.git
cd jsearch

Build the project: Ensure you have Maven and npm installed. Then run:
```
mvn clean install
cd frontend
npm install
```

Usage

Getting Started

Run the Application:
```
mvn clean compile exec:java
```
Enter the Search Directory: When prompted, enter the path to the directory you want to index. You can press a to index the entire system instead (note: this can take a long time, depending on how much data there is).
Configure Ignored Directories: By default, certain directories (e.g., node_modules, .git) are ignored. You can add additional directories or choose not to ignore any.
Start Indexing: The application will index the specified directory. This may take up to 10 minutes depending on the size.
Search for Queries: Once indexing is complete, you can enter your search queries. Both the full-text search and the vector space model search will be executed, and the results are displayed in order of relevance.

Example

# Start the application
$ mvn clean compile exec:java

# Follow the prompts:
# Enter the search directory path (Press a for the whole computer):
/home/user/documents

# Enter directories to ignore (Press n to ignore none):
node_modules .git

# Enter your search query (Press q to quit):
How do I traverse a Graph?

# Results will be listed here

Code Overview

Main Components

App.java: The entry point of the application. Handles user inputs and orchestrates indexing and searching.
FileCrawler.java: Crawls through directories and files to build the index using multithreading.
Searcher.java: Manages the search logic and interfaces with different search models.
VectorSpaceModel.java: Implements the Vector Space Model for search using TF-IDF, cosine similarity, and boolean logic.
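
As a rough illustration of the multithreaded crawl described for FileCrawler.java, the sketch below walks a directory tree, skips ignored directory names, and tokenizes each file on a worker thread. It is an assumption-laden sketch, not the project's implementation: `CrawlerSketch`, `crawl`, and `indexFile` are hypothetical names, the index here is a simple term-to-files map, and it relies only on the JDK's `java.nio.file` and `java.util.concurrent` packages.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; JSearch's real FileCrawler may be organized differently.
class CrawlerSketch {

    // Walks root, skipping directories whose name is in ignored, and indexes
    // every regular file on a fixed-size thread pool. Returns term -> files.
    static Map<String, Set<Path>> crawl(Path root, Set<String> ignored, int threads)
            throws IOException, InterruptedException {
        Map<String, Set<Path>> index = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                    Path name = dir.getFileName();
                    boolean skip = name != null && ignored.contains(name.toString());
                    return skip ? FileVisitResult.SKIP_SUBTREE : FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    if (attrs.isRegularFile()) {
                        pool.execute(() -> indexFile(file, index));
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    return FileVisitResult.CONTINUE; // unreadable entry: skip it
                }
            });
        } finally {
            pool.shutdown(); // no new tasks; queued ones still run
        }
        pool.awaitTermination(1, TimeUnit.HOURS);
        return index;
    }

    // Lowercases the file content, splits on non-word characters, and records each token.
    static void indexFile(Path file, Map<String, Set<Path>> index) {
        try {
            for (String token : Files.readString(file).toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> ConcurrentHashMap.newKeySet()).add(file);
                }
            }
        } catch (IOException e) {
            // Binary or unreadable file: leave it out of the index.
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        Map<String, Set<Path>> index = crawl(root, Set.of("node_modules", ".git"), 4);
        System.out.println(index.size() + " distinct terms indexed");
    }
}
```

The directory walk itself stays single-threaded here and only the file reading and tokenizing is parallelized, which keeps the sketch short; a production crawler could split the walk across threads as well.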

Key Methods

computeTfIdfWeights: Calculates the TF-IDF weights for terms.
calculateCosineSimilarity: Computes cosine similarity between query and documents.
getKeywordBoosts: Calculates boosts for keywords based on their occurrence.
performBooleanSearch: Processes boolean search queries and returns relevant documents.
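
The boolean search step can be pictured as set operations over an inverted index. The following sketch assumes a simplified index that maps each term to the set of document IDs containing it; `BooleanSearchSketch`, `matchAll`, and `matchAny` are hypothetical names and do not reflect the actual signature of performBooleanSearch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of boolean AND / OR over an inverted index (term -> document IDs).
class BooleanSearchSketch {

    // Documents that contain every term (AND). Unknown terms yield an empty result.
    static Set<Integer> matchAll(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(postings); // start from the first term's documents
            } else {
                result.retainAll(postings);       // intersect with each further term
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Documents that contain at least one term (OR).
    static Set<Integer> matchAny(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(index.getOrDefault(term, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = Map.of(
            "graph", Set.of(1, 2, 3),
            "traverse", Set.of(2, 3, 4));
        System.out.println(matchAll(index, List.of("graph", "traverse"))); // [2, 3]
        System.out.println(matchAny(index, List.of("graph", "traverse"))); // [1, 2, 3, 4]
    }
}
```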

Customization

Ignored Directories

You can modify the default ignored directories in App.java:

private static void addDefaultIgnoredDirectories() {
  ignoredDirectories.addAll(Arrays.asList("node_modules", "target", ".git", "rbenv", ".idea", ".rspec", ".steam", ".gradle", "words.txt", "cache", "logs", "build", "dist", "bin", "obj", "out", "vendor", "tmp", "temp", "examples", "samples"));
}
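
For context, one plausible way such a set of names could be applied while crawling is to compare each component of a path against it. This is a sketch under that assumption; `IgnoreSketch` and `isIgnored` are hypothetical and the real check in JSearch may differ (for example, it might match only directory names).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

// Hypothetical helper; not part of JSearch.
class IgnoreSketch {

    // True if any component of the path equals an ignored name exactly.
    // Because every component is checked, file names such as "words.txt" match too.
    static boolean isIgnored(Path path, Set<String> ignored) {
        for (Path part : path) {
            if (ignored.contains(part.toString())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> ignored = Set.of("node_modules", ".git", "words.txt");
        System.out.println(isIgnored(Paths.get("project", "node_modules", "x.js"), ignored)); // true
        System.out.println(isIgnored(Paths.get("project", "src", "App.java"), ignored));      // false
    }
}
```

Exact matching means a directory named `node_modules_backup` would still be indexed; prefix or glob matching would be a different design choice.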

Boost Logic

The keyword boosting logic is implemented in VectorSpaceModel.java:

private Map<String, Double> getKeywordBoosts(List<String> query, Map<String, Map<Integer, List<List<Integer>>>> queryIndexes) {
  Map<String, Double> keywordBoosts = new HashMap<>();
  // Number of distinct files each query term appears in.
  Map<String, Integer> keywordFoundInDifferentFilesCount = new HashMap<>();
  // Sum of the per-file entry counts for each query term.
  Map<String, Integer> totalOccurrenceOfWord = new HashMap<>();

  for (String term : query) {
    // Skip terms that are missing from the index; calling size() on a null lookup would throw a NullPointerException.
    Map<Integer, List<List<Integer>>> postings = queryIndexes.get(term);
    if (postings == null) continue;

    keywordFoundInDifferentFilesCount.put(term, postings.size());

    int totalOccurrence = 0;
    for (List<List<Integer>> occurrences : postings.values()) {
      totalOccurrence += occurrences.size();
    }
    totalOccurrenceOfWord.put(term, totalOccurrence);
  }

  // Implement your boosting logic here using keywordFoundInDifferentFilesCount and totalOccurrenceOfWord

  return keywordBoosts;
}
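
The placeholder comment in the method above leaves the actual boost formula open. As one illustrative possibility only, the sketch below turns the two counts into a boost that favors terms found in few files but used often within them. The formula `1 + ln(1 + occurrences / files) / files`, the class name `BoostSketch`, and the method `boosts` are all assumptions, not JSearch's logic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical boosting heuristic; the formula is an illustration, not the project's.
class BoostSketch {

    // fileCounts: term -> number of files containing it.
    // totalOccurrences: term -> total occurrences across those files.
    static Map<String, Double> boosts(Map<String, Integer> fileCounts,
                                      Map<String, Integer> totalOccurrences) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : fileCounts.entrySet()) {
            int files = e.getValue();
            if (files <= 0) {
                continue; // term not found anywhere: no boost entry
            }
            int total = totalOccurrences.getOrDefault(e.getKey(), 0);
            double density = (double) total / files; // average occurrences per file
            // Dividing by files shrinks the boost for terms spread over many files.
            result.put(e.getKey(), 1.0 + Math.log1p(density) / files);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> b = boosts(
            Map.of("graph", 2, "the", 100),
            Map.of("graph", 10, "the", 500));
        // "graph" is rarer than "the", so it receives the larger boost.
        System.out.println(b.get("graph") > b.get("the")); // true
    }
}
```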

Contributing

Contributions are welcome! Please fork the repository and submit pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

Special thanks to my advisor Prof. Dr. Bekir Taner Dincer for all his guidance.

TODO List

Implement advanced keyword boosting based on additional heuristics.
Optimize the multithreading implementation for faster indexing.
Enhance the user interface for better user experience.
Include more detailed logging and error handling.
