ngrams.js

A Node.js library for creating n-grams, skip-grams, bag of words, bag of n-grams, bag of skip-grams. Also has a simple function to sanitise input text and split it into words for use in these methods.

Usage

Put ngrams.js in your node_modules folder and require it:

let __ngrams = require("ngrams");

Input

These methods take an array of words to turn into n-grams, skip-grams, etc.

"use strict";
let __ngrams = require("ngrams");

// Create 2-grams
let ngrams = __ngrams.Ngrams(["these", "are", "some", "words"], 2);
console.log(ngrams);

Output:

[ 'these are', 'are some', 'some words' ]

"these are some words".split(" ") is nice and simple, but often our source text is not so simple. You may choose to use some sort of NLP tokeniser. This library provides a rudimentary method, SanitiseToWords, to get rid of things like double spaces, punctuation, and so on in a string before splitting it into words.

console.log(__ngrams.SanitiseToWords("   Turning and turning in the widening gyre\r\n    The falcon cannot hear the falconer;\r\n    Things fall apart; the centre cannot hold;\r\n    Mere anarchy is loosed upon the world   "));

Output:

[ 'Turning', 'and', 'turning', 'in', 'the', 'widening', 'gyre', 'The', 'falcon', 'cannot', 'hear', 'the', 'falconer', 'Things', 'fall', 'apart', 'the', 'centre', 'cannot', 'hold', 'Mere', 'anarchy', 'is', 'loosed', 'upon', 'the', 'world' ]

See documentation below for more information.

Methods

🐚 Ngrams

Create n-grams from an array of words.

Parameter	Type	Description
words	`ARRAY [INTEGER] = STRING`	An array of words e.g. `["these", "are", "words"]`
n	`INTEGER`	Size of the n-grams, e.g. `2` will create bigrams `["these are", "are words"]`

Returns an array of n-grams of size n (ARRAY [INTEGER] = STRING).

let words = __ngrams.SanitiseToWords("   Turning and turning in the widening gyre\r\n    The falcon cannot hear the falconer;\r\n    Things fall apart; the centre cannot hold;\r\n    Mere anarchy is loosed upon the world   ");
let ngrams = __ngrams.Ngrams(words, 5);
console.log(ngrams);

Output (truncated):

[ 'Turning and turning in the',
  'and turning in the widening',
  'turning in the widening gyre',
...

🐚 SkipGrams

Create skip-grams from an array of words.

Parameter	Type	Description
words	`ARRAY [INTEGER] = STRING`	An array of words e.g. `["these", "are", "words"]`
size	`INTEGER`	Size of the n-grams e.g. `2`: `"these are", "are words"`
distance	`INTEGER`	Distance to skip to create skip-grams, e.g. `5` will create skip-grams using the base word (or n-gram) and n-grams from the 5 following words.
sortForDuplicates	`INTEGER`	Pass `__ngrams.SORT_NGRAMS` or `__ngrams.DONT_SORT_NGRAMS`. Sorting n-grams alphabetically can help flag up duplicates e.g. when creating a bag of words/n-grams/skip-grams. If you only care about pairing n-grams by proximity but not by direction, use `__ngrams.DONT_SORT_NGRAMS`.

Returns an array of n-grams found near one another within distance words (ARRAY [INTEGER][INTEGER] = STRING).

let words = __ngrams.SanitiseToWords("   Turning and turning in the widening gyre\r\n    The falcon cannot hear the falconer;\r\n    Things fall apart; the centre cannot hold;\r\n    Mere anarchy is loosed upon the world   ");

let skipGrams = __ngrams.SkipGrams(words, 1, 2, __ngrams.DONT_SORT_NGRAMS);
console.log(skipGrams);

Output (truncated):

[ [ 'Turning', 'and' ],
  [ 'Turning', 'turning' ],
  [ 'and', 'turning' ],
  [ 'and', 'in' ],
  [ 'turning', 'in' ],
  [ 'turning', 'the' ],
...

You can choose instead to pass __ngrams.SORT_NGRAMS and this will make direction irrelevant (e.g. it will be easier to sport ["Turning", "and"] and ["and", "turning"] as the same words because they are now sorted to ["Turning", "and"] and ["turning", "and"]. Using method BagOfSkipGrams (passing __ngrams.CASE_INSENSITIVE) would then remove one of these as a duplicate.

let skipGrams = __ngrams.SkipGrams(words, 1, 2, __ngrams.SORT_NGRAMS);
console.log(skipGrams);

Output (truncated):

[ [ 'and', 'Turning' ],
  [ 'turning', 'Turning' ],
  [ 'and', 'turning' ],
  [ 'and', 'in' ],
  [ 'in', 'turning' ],
  [ 'the', 'turning' ],
...

🐚 BagOfNgrams

Generates n-grams and removes duplicates. Can be case sensitive or insensitive by passing __ngrams.CASE_SENSITIVE or __ngrams.CASE_INSENSITIVE.

Parameter	Type	Description
words	`ARRAY [INTEGER] = STRING`	An array of words e.g. `["these", "are", "words"]`.
n	`INTEGER`	Size of the n-grams e.g. `2` creates bigrams `["these are", "are words"]`
caseSensitivity	`INTEGER`	Pass `__ngrams.CASE_SENSITIVE` or `__ngrams.CASE_INSENSITIVE`. Case insensitive calls will ignore differences in case when removing duplicates e.g. `"Turning"`, `"turning"`, `"TURNING"` will all be seen as identical and reduced to just `"Turning"`.

Returns an array of n-grams with duplicates removed (ARRAY [INTEGER] = STRING).

let words = __ngrams.SanitiseToWords("   Turning and turning in the widening gyre\r\n    The falcon cannot hear the falconer;\r\n    Things fall apart; the centre cannot hold;\r\n    Mere anarchy is loosed upon the world   ");

let bagOfWords = __ngrams.BagOfNgrams(words, 1, __ngrams.CASE_INSENSITIVE);
console.log(bagOfWords);

Output:

[ 'Turning',
  'and',
  'in',
  'the',
  'widening',
  'gyre',
  'falcon',
  'cannot',
  'hear',
  'falconer',
  'Things',
  'fall',
  'apart',
  'centre',
  'hold',
  'Mere',
  'anarchy',
  'is',
  'loosed',
  'upon',
  'world' ]

🐚 BagOfWords

This is just a wrapper function for readability that calls BagOfNgrams with an n-gram size (n) of 1.

Parameter	Type	Description
words	`ARRAY [INTEGER] = STRING`	An array of words e.g. `["these", "are", "words"]`.
caseSensitivity	`INTEGER`	Pass `__ngrams.CASE_SENSITIVE` or `__ngrams.CASE_INSENSITIVE`. Case insensitive calls will ignore differences in case when removing duplicates e.g. `"Turning"`, `"turning"`, `"TURNING"` will all be seen as identical and reduces to just `"Turning"`.

Returns an array of words with duplicates removed (ARRAY [INTEGER] = STRING).

let bagOfWords = __ngrams.BagOfWords(words, __ngrams.CASE_INSENSITIVE);

🐚 BagOfSkipGrams

Generates skip-grams and removes duplicates. Can ignore direction by passing __ngrams.SORT_NGRAMS. Can be case insensitive by passing __ngrams.CASE_INSENSITIVE.

Parameter	Type	Description
words	`ARRAY [INTEGER] = STRING`	An array of words e.g. `["these", "are", "words"]`
size	`INTEGER`	Size of the n-grams e.g. `2`: `"these are", "are words"`
distance	`INTEGER`	Distance to skip to create skip-grams, e.g. `5` will create skip-grams using the base word (or n-gram) and n-grams from the 5 following words.
sortForDuplicates	`INTEGER`	Pass `__ngrams.SORT_NGRAMS` or `__ngrams.DONT_SORT_NGRAMS`. Sorting n-grams alphabetically can help flag up duplicates e.g. when creating a bag of words/n-grams/skip-grams. If you only care about pairing n-grams by proximity but not by direction, use `__ngrams.DONT_SORT_NGRAMS`.
caseSensitivity	`INTEGER`	Pass `__ngrams.CASE_SENSITIVE` or `__ngrams.CASE_INSENSITIVE`. Case insensitive calls will ignore differences in case when removing duplicates e.g. `"Turning"`, `"turning"`, `"TURNING"` will all be seen as identical and reduces to just `"Turning"`.

Returns an array of arrays of paired n-grams ARRAY [INTEGER][INTEGER] = STRING.

Case sensitive, direction sensitive:

let words = "Something and SOMETHING and something and something".split(" ");

let skipGrams = __ngrams.BagOfSkipGrams(words, 2, 2, __ngrams.DONT_SORT_NGRAMS, __ngrams.CASE_SENSITIVE);
console.log(skipGrams);

Output:

[ [ 'Something and', 'and SOMETHING' ],
  [ 'Something and', 'SOMETHING and' ],
  [ 'and SOMETHING', 'SOMETHING and' ],
  [ 'and SOMETHING', 'and something' ],
  [ 'SOMETHING and', 'and something' ],
  [ 'SOMETHING and', 'something and' ],
  [ 'and something', 'something and' ],
  [ 'and something', 'and something' ],
  [ 'something and', 'and something' ] ]

Case sensitive, direction insensitive __ngrams.BagOfSkipGrams(words, 2, 2, __ngrams.SORT_NGRAMS, __ngrams.CASE_SENSITIVE):

[ [ 'and SOMETHING', 'Something and' ],
  [ 'Something and', 'SOMETHING and' ],
  [ 'and SOMETHING', 'SOMETHING and' ],
  [ 'and something', 'and SOMETHING' ],
  [ 'and something', 'SOMETHING and' ],
  [ 'something and', 'SOMETHING and' ],
  [ 'and something', 'something and' ],
  [ 'and something', 'and something' ] ]

Case insensitive, direction insensitive __ngrams.BagOfSkipGrams(words, 2, 2, __ngrams.SORT_NGRAMS, __ngrams.CASE_INSENSITIVE):

[ [ 'and SOMETHING', 'Something and' ],
  [ 'Something and', 'SOMETHING and' ],
  [ 'and something', 'and SOMETHING' ] ]

Case insensitive, direction sensitive __ngrams.BagOfSkipGrams(words, 2, 2, __ngrams.DONT_SORT_NGRAMS, __ngrams.CASE_INSENSITIVE):

[ [ 'Something and', 'and SOMETHING' ],
  [ 'Something and', 'SOMETHING and' ],
  [ 'and SOMETHING', 'SOMETHING and' ],
  [ 'and SOMETHING', 'and something' ] ]

🐚 ConcatSkipGrams

Pass skip-grams through this method if you would prefer a simpler array of strings where skip-grams have been concatenated into a single string.

Parameter	Type	Description
skipGrams	`ARRAY [INTEGER][INTEGER] = STRING`	An array of skip-grams created using `SkipGrams` or `BagOfSkipGrams` which you want to simplify into `ARRAY [INTEGER] = STRING` by joining each n-gram pair into one string.

Example:

let skipGrams = __ngrams.SkipGrams(["these", "are", "some", "words"], 2, 2, __ngrams.DONT_SORT_NGRAMS);

skipGrams = __ngrams.ConcatSkipGrams(skipGrams);
console.log(skipGrams);

Output:

[ 'these are are some',
  'these are some words',
  'are some some words' ]

🐚 SanitiseToWords

A rudimentary method that attempts to refine messy text into an array of words.

Parameter	Type	Description
text	`STRING`	The source text you want to split into words.

Note that this is mainly only good for English-language text - it does not support accented characters etc.

Its approach is to replace anything outwith a small list of allowable characters with a space, avoiding any double spacing, and then split by those spaces.

This works quite well for many English-language texts - with the occasional mistake.

However, you may prefer to roll your own sanitisation/splitting/tokenisation method based more closely on your source text(s). You might like to use this 'natural' library's tokenizers.

let words = __ngrams.SanitiseToWords("   Turning and turning in the widening gyre\r\n    The falcon cannot hear the falconer;\r\n    Things fall apart; the centre cannot hold;\r\n    Mere anarchy is loosed upon the world   ");
console.log(words);

Output:

[ 'Turning', 'and', 'turning', 'in', 'the', 'widening', 'gyre', 
	'The', 'falcon', 'cannot', 'hear', 'the', 'falconer', 
	'Things', 'fall', 'apart', 'the', 'centre', 'cannot', 'hold', 
	'Mere', 'anarchy', 'is', 'loosed', 'upon', 'the', 'world' ]

Interesting/Useful Links

natural - an NLP library for Node.js
n-gram - Wikipedia

Acknowledgements

CSV to Markdown Table Generator - Donat Studios
Emoji cheat sheet - WebpageFX

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
dist		dist
test		test
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ngrams.js

Usage

Input

Methods

🐚 Ngrams

🐚 SkipGrams

🐚 BagOfNgrams

🐚 BagOfWords

🐚 BagOfSkipGrams

🐚 ConcatSkipGrams

🐚 SanitiseToWords

Interesting/Useful Links

Acknowledgements

About

Releases

Packages

Languages

License

DanielJohnBenton/ngrams.js

Folders and files

Latest commit

History

Repository files navigation

ngrams.js

Usage

Input

Methods

🐚 Ngrams

🐚 SkipGrams

🐚 BagOfNgrams

🐚 BagOfWords

🐚 BagOfSkipGrams

🐚 ConcatSkipGrams

🐚 SanitiseToWords

Interesting/Useful Links

Acknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages