Architecture

SeekStorm is an open-source, sub-millisecond full-text search library & multi-tenancy server implemented in Rust.

Scalability and performance are the two fundamental design goals.

Index size and latency grow linearly with the number of indexed documents, while the RAM consumption remains constant, ensuring scalability.

Index

The index is based on an inverted index. It can be kept either in RAM or in memory-mapped files. In both cases it is fully persistent on disk. Because the index file format is identical for RAM and memory-mapped mode, the access mode of an existing index can be switched at any time.

  • Ram: no disk access at search time for minimal latency, even after a cold start, at the cost of longer index load time and higher RAM consumption, as the whole index is preloaded into RAM.
  • Mmap: disk access via mmap at search time, for minimal RAM consumption, high scalability, and minimal index load time. With Mmap, disk access is cached by the OS, and that cache persists between program starts until reboot.

index.bin : contains posting lists with document IDs and term positions. Posting lists are compressed with roaring bitmaps. Term positions of each field are delta-compressed and VINT-encoded (variable-length integers).
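To make delta compression and VINT encoding of term positions concrete, here is a minimal sketch. It assumes a LEB128-style byte layout (7 data bits per byte, high bit as continuation flag) and ascending positions. The function names and the exact byte layout are illustrative assumptions, not SeekStorm's actual on-disk format.

```rust
/// Append `value` as a variable-length integer: 7 data bits per byte,
/// high bit set on every byte except the last (assumed LEB128-style layout).
fn write_vint(mut value: u32, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Read one variable-length integer starting at `*pos`, advancing `*pos`.
/// Returns None on truncated or over-long input.
fn read_vint(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut value: u32 = 0;
    let mut shift: u32 = 0;
    loop {
        if shift > 28 {
            return None; // more than 5 bytes cannot be a valid u32
        }
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        value |= u32::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Delta-compress ascending term positions, then VINT-encode each gap.
/// Hypothetical helper: the input must be sorted in ascending order.
fn encode_positions(positions: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut previous = 0;
    for &position in positions {
        write_vint(position - previous, &mut out);
        previous = position;
    }
    out
}

/// Inverse of `encode_positions`: decode the gaps and rebuild absolute positions.
fn decode_positions(bytes: &[u8]) -> Option<Vec<u32>> {
    let mut positions = Vec::new();
    let mut pos = 0;
    let mut previous: u32 = 0;
    while pos < bytes.len() {
        let delta = read_vint(bytes, &mut pos)?;
        previous = previous.checked_add(delta)?;
        positions.push(previous);
    }
    Some(positions)
}
```

Because neighboring positions of a term are usually close together, most gaps fit in a single byte, which is where the space saving comes from.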

index.json : contains index metadata such as similarity (e.g. Bm25), access type (e.g. Ram/Mmap), and tokenizer (e.g. AsciiAlphabetic).

SeekStorm server index directory structure

First hierarchy level: API keys
Second hierarchy level: Indices per API key

seekstorm_index/  
├─ 0/  
│  ├─ 0  
│  ├─ 1  
│  ├─ 2  
├─ 1/  
│  ├─ 0  
│  ├─ 1  
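
The two-level layout can be sketched as a path computation. This is only an illustration of the hierarchy shown above: `index_dir` is a hypothetical helper, not part of the SeekStorm API, and it assumes that directory names are the decimal IDs of the API key and the index.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: build `<root>/<api_key_id>/<index_id>`,
/// mirroring the two hierarchy levels of the server index directory.
fn index_dir(root: &Path, api_key_id: u64, index_id: u64) -> PathBuf {
    root.join(api_key_id.to_string()).join(index_id.to_string())
}
```

Under these assumptions, `index_dir(Path::new("seekstorm_index"), 1, 0)` yields the path `seekstorm_index/1/0`, i.e. the first index of the second API key in the tree above.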

You can manually delete, copy, or back up and restore both API key directories and index directories (shut down the server first, then restart it afterwards).

Search

  • DaaT (Document-at-a-Time) intersection and union:
    • avoids the long intermediate result lists that TaaT (Term-at-a-Time) writes to RAM
    • allows streaming to enable scalability for huge indexes
  • SIMD vector processing hardware support for intersection and union of posting lists compressed with roaring bitmaps
  • Galloping intersection
  • Improved Block-max WAND
  • Bigram indexing of frequent terms
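
As an illustration of the galloping intersection named in the list above, the following is a minimal sketch over plain, uncompressed posting lists. It assumes each list holds strictly ascending document IDs. The function names are hypothetical, and the sketch does not reflect SeekStorm's actual implementation, which works on compressed posting lists with SIMD support.

```rust
/// Return the first index `i >= start` with `list[i] >= target`,
/// or `list.len()` if there is none. `start` must be `<= list.len()`.
fn gallop_to(list: &[u32], start: usize, target: u32) -> usize {
    let mut step = 1;
    let mut lo = start;
    let mut hi = start;
    // Exponential phase: double the step until we reach an element
    // that is >= target or run past the end of the list.
    while hi < list.len() && list[hi] < target {
        lo = hi + 1;
        hi += step;
        step *= 2;
    }
    let hi = hi.min(list.len());
    // Binary phase: the answer lies in lo..=hi.
    lo + list[lo..hi].partition_point(|&doc_id| doc_id < target)
}

/// Intersect two ascending doc ID lists by iterating the shorter one
/// and galloping through the longer one.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    let mut result = Vec::new();
    let mut pos = 0;
    for &doc_id in short {
        pos = gallop_to(long, pos, doc_id);
        if pos == long.len() {
            break; // the longer list is exhausted, no further matches
        }
        if long[pos] == doc_id {
            result.push(doc_id);
            pos += 1;
        }
    }
    result
}
```

Galloping pays off when one posting list is much shorter than the other: each lookup costs time logarithmic in the distance skipped, instead of stepping through every element of the longer list.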

Database schema

Every document can contain an arbitrary number of fields of different types.

Every field can be searched and filtered individually, or all fields can be searched together globally.

schema.json : contains the definition of fields, their field types, and whether they are stored and/or indexed.

Document store

The documents are stored in JSON format and compressed with Zstandard.

The index schema defines which fields of the documents are stored in the document store and can be part of the returned search results.

docstore.bin : contains the compressed documents

Limits

There are no limits on the number of

  • indices
  • documents
  • fields
  • field length
  • terms

beyond limited hardware resources.