Skip to content

jackdoe/paxx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

simple inverted index search engine

good for short corpuses, such as city names, food ingredients etc, and especially powerfull if you have some external popularity to be used in combination for the scoring, wouldnt recommend to be used for searching long corpuses (such as news papers), where the term frequency is of much importance.

example usage:

const {
  Index,
  AND,
  OR,
  TERM,
  CONSTANT,
  DISMAX,
  analyzers
} = require("paxx");

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keyword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

// iterate over all the matches
ix.forEach(
  new OR(
    ...ix.terms("name", "creme"),
    new AND(
      // matches on k9 because it splits k and 9
      ...ix.terms("name", "9k hell"),
    ),
    new AND(...ix.terms("name", "ja"), new CONSTANT(1, new OR(...ix.terms("type", "user")))),
    new DISMAX(...ix.terms("name", "doe"), ...ix.terms("type", "user"))
  ),
  // callback called with the document and its IDF score
  function(doc, score) {
    console.log({ doc, score });
  }
);

outputs:

{ doc: { name: 'john Crème Brulée', type: 'user' },
  score: 2.6931471805599454 }
{ doc: { name: 'hello world k777bb k9 bzz', type: 'user' },
  score: 7.673976433571672 }
{ doc: { name: 'doe' }, score: 2.6931471805599454 }

queries

NB: the queries is statefull, and can not be reused, you must create one query per request

  • TERM

new TERM(numberOfDocumentsInIndex, postingsList)
e.g.

let t = new Term(5, [1,2,34])

the term is the most primitive query, it uses binary search to advance its position. (same as lucens's TermQuery)

its score is (unnormalized) IDF as follows:

this.idf = 1 + Math.log(nDocumentsInIndex / (postingsList.length + 1));

tf is not used or stored.

  • AND
new AND(queryA, queryB, queryC)

returns (queryA AND queryB AND queryC) boolean query (similar to lucene's Bool MUST), its score is the sum of the scores of the matching subqueries (unnormalized)

  • OR
new OR(queryA, queryB, queryC)

returns (queryA OR queryB OR queryC) boolean query (similar to lucene's Bool SHOULD), its score is the sum of the scores of the matching subqueries (unnormalized)

  • DISMAX
new DISMAX(tiebreaker, queryA, queryB, queryC)

e.g.
let q = new DISMAX(0.1, queryA, queryB, queryC)

returns (queryA DISMAX queryB DISMAX queryC) boolean OR query (similar to lucene's DisMax), its score is the max of the matching subqueries plus the tiebreaker multiplier by the rest of the scores.

  • CONSTANT
new CONSTANT(boost, query)

e.g.
let q = new CONSTANT(0.1, query)

returns a constant score query, that will score with whatever boost you give it, in the example used new CONSTANT(0.1...) it will score with 0.1

TOP N

let limit = 2 // -1 for all matches, sorted by idf score
let matches = ix.topN(
  new DISMAX(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  limit
)

outputs: 
[ { name: 'hello world k777bb k9 bzz', type: 'user' } ]

more examples

ix.forEach(
  new OR(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

ix.forEach(
  new DISMAX(
    // tiebreaker
    0.5,
    // variable argument list of queries
    ...ix.terms("name", "hello"),
    new CONSTANT(1000, new OR(...ix.terms("name", "world")))
  ),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

Index

create inverted index (which is a handy way to store the postings lists and create term queries)

to create an index you need to pass per-field analyzer, e.g. for the 'name' field you could use autocomplete analyzer, but for the 'type' field you could use ID analyzer (noop)

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

create an array term queries out of a field: ix.terms("field", "token") e.g. ix.terms("name","john"), you can wrap those queries in AND/OR/DISMAX etc

analyzers

analyzer is a group of tokenizers and normalizers

  • Autocomplete (analyzers.autocomplete)
tokenize at index: whitespace, edge
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits
  • Basic (analyzers.basic)
tokenize at index: whitespace
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits
  • Keyword (analyzers.keyword)
tokenize at index: noop
tokenize at search: noop
normalize: noop
  • SOUNDEX (analyzers.soundex)
tokenize at index: whitespace, soundex
tokenize at search: whitespace, soundex
normalize: lowercase, unaccent, spaceBetweenDigits

tokenizers

a tokenizer takes a string and produces tokens from that string, at the moment those are available:

  • whitespace: 'a b c' -> ['a','b','c']
  • noop: 'a b c' -> ['a b c']
  • soundex: 'halo hello' -> ['H400', 'H400']
  • edge: 'hello' -> ['h','he','hel','hell','hello']

any object that has apply([string]) -> [string] function can be used as tokenizer

normalizers

normalizes apply transformation to the string, used both at search and index time

  • lowercase: 'ABC' -> 'abc'
  • unaccent: 'Crème' - 'Creme'
  • removeNonAlphanumeric: 'a/b/c' -> 'a b c'
  • space between digits: k9 -> 'k 9'

any object that has apply(string) -> string function can be used as normalizer

About

simple js search thing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published