GitHub

simple inverted index search engine

good for short corpuses, such as city names, food ingredients etc, and especially powerfull if you have some external popularity to be used in combination for the scoring, wouldnt recommend to be used for searching long corpuses (such as news papers), where the term frequency is of much importance.

example usage:

const {
  Index,
  AND,
  OR,
  TERM,
  CONSTANT,
  DISMAX,
  analyzers
} = require("paxx");

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keyword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

// iterate over all the matches
ix.forEach(
  new OR(
    ...ix.terms("name", "creme"),
    new AND(
      // matches on k9 because it splits k and 9
      ...ix.terms("name", "9k hell"),
    ),
    new AND(...ix.terms("name", "ja"), new CONSTANT(1, new OR(...ix.terms("type", "user")))),
    new DISMAX(...ix.terms("name", "doe"), ...ix.terms("type", "user"))
  ),
  // callback called with the document and its IDF score
  function(doc, score) {
    console.log({ doc, score });
  }
);

outputs:

{ doc: { name: 'john Crème Brulée', type: 'user' },
  score: 2.6931471805599454 }
{ doc: { name: 'hello world k777bb k9 bzz', type: 'user' },
  score: 7.673976433571672 }
{ doc: { name: 'doe' }, score: 2.6931471805599454 }

queries

NB: the queries is statefull, and can not be reused, you must create one query per request

TERM


new TERM(numberOfDocumentsInIndex, postingsList)
e.g.

let t = new Term(5, [1,2,34])

the term is the most primitive query, it uses binary search to advance its position. (same as lucens's TermQuery)

its score is (unnormalized) IDF as follows:

this.idf = 1 + Math.log(nDocumentsInIndex / (postingsList.length + 1));

tf is not used or stored.

AND

new AND(queryA, queryB, queryC)

returns (queryA AND queryB AND queryC) boolean query (similar to lucene's Bool MUST), its score is the sum of the scores of the matching subqueries (unnormalized)

OR

new OR(queryA, queryB, queryC)

returns (queryA OR queryB OR queryC) boolean query (similar to lucene's Bool SHOULD), its score is the sum of the scores of the matching subqueries (unnormalized)

DISMAX

new DISMAX(tiebreaker, queryA, queryB, queryC)

e.g.
let q = new DISMAX(0.1, queryA, queryB, queryC)

returns (queryA DISMAX queryB DISMAX queryC) boolean OR query (similar to lucene's DisMax), its score is the max of the matching subqueries plus the tiebreaker multiplier by the rest of the scores.

CONSTANT

new CONSTANT(boost, query)

e.g.
let q = new CONSTANT(0.1, query)

returns a constant score query, that will score with whatever boost you give it, in the example used new CONSTANT(0.1...) it will score with 0.1

TOP N

let limit = 2 // -1 for all matches, sorted by idf score
let matches = ix.topN(
  new DISMAX(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  limit
)

outputs: 
[ { name: 'hello world k777bb k9 bzz', type: 'user' } ]

more examples

ix.forEach(
  new OR(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

ix.forEach(
  new DISMAX(
    // tiebreaker
    0.5,
    // variable argument list of queries
    ...ix.terms("name", "hello"),
    new CONSTANT(1000, new OR(...ix.terms("name", "world")))
  ),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

Index

create inverted index (which is a handy way to store the postings lists and create term queries)

to create an index you need to pass per-field analyzer, e.g. for the 'name' field you could use autocomplete analyzer, but for the 'type' field you could use ID analyzer (noop)

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

create an array term queries out of a field: ix.terms("field", "token") e.g. ix.terms("name","john"), you can wrap those queries in AND/OR/DISMAX etc

analyzers

analyzer is a group of tokenizers and normalizers

Autocomplete (analyzers.autocomplete)

tokenize at index: whitespace, edge
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits

Basic (analyzers.basic)

tokenize at index: whitespace
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits

Keyword (analyzers.keyword)

tokenize at index: noop
tokenize at search: noop
normalize: noop

SOUNDEX (analyzers.soundex)

tokenize at index: whitespace, soundex
tokenize at search: whitespace, soundex
normalize: lowercase, unaccent, spaceBetweenDigits

tokenizers

a tokenizer takes a string and produces tokens from that string, at the moment those are available:

whitespace: 'a b c' -> ['a','b','c']
noop: 'a b c' -> ['a b c']
soundex: 'halo hello' -> ['H400', 'H400']
edge: 'hello' -> ['h','he','hel','hell','hello']

any object that has apply([string]) -> [string] function can be used as tokenizer

normalizers

normalizes apply transformation to the string, used both at search and index time

lowercase: 'ABC' -> 'abc'
unaccent: 'Crème' - 'Creme'
removeNonAlphanumeric: 'a/b/c' -> 'a b c'
space between digits: k9 -> 'k 9'

any object that has apply(string) -> string function can be used as normalizer

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
README.md		README.md
index.js		index.js
index.test.js		index.test.js
package.json		package.json
unicode.js		unicode.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

index.js

index.js

index.test.js

index.test.js

package.json

package.json

unicode.js

unicode.js

Repository files navigation

queries

TOP N

more examples

Index

analyzers

tokenizers

normalizers

About

Releases

Packages

Contributors 2

Languages

jackdoe/paxx

Folders and files

Latest commit

History

Repository files navigation

queries

TOP N

more examples

Index

analyzers

tokenizers

normalizers

About

Resources

Stars

Watchers

Forks

Languages