Skip to content

NTag/words-splitter

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

Words Splitter

Split a French sentence without any spaces nor accents, into words.

Example: jepenserestersurparisouprendreletgvpourmarseilleje pense rester sur Paris ou prendre le TGV pour Marseille

Useful for domains/URLs, which can't contain spaces. Example: https://github.com/ntag/secho.fr or https://desurlsansespaces.sécho.fr

Installation

npm i words-splitter

Usage

const { addWord, splitSentence, loadDictionary } = require('words-splitter');

const main = async () => {
  await loadDictionary(); // Take a few seconds to load all French words with their probability

  addWord('MonMOTÀmoi'); // If you want, you can add your own words to the dictionary
  // "monmotamoi" will now be recognized as a word: "MonMOTÀmoi"
  // previously, "monmotamoi" was converted to "mon mot à moi"

  console.log(splitSentence('faire10hdetrainaveclasncf'));
  console.log(splitSentence('jemappellemathildeetvous'));
  console.log(splitSentence('monnomestthomas'));
  console.log(splitSentence('oupartezvousenvacances'));
  console.log(splitSentence('jhesiteentrenancyetalenconetvous'));
  console.log(splitSentence('jepenserestersurparisouprendreletgvpourmarseille'));

  console.log(splitSentence('quepensezvousdemonmotamoi'));

  // faire 10h de train avec la SNCF
  // je m'appelle Mathilde et vous
  // mon nom est Thomas
  // ou partez-vous en vacances
  // j'hésite entre Nancy et Alençon et vous
  // je pense rester sur Paris ou prendre le TGV pour Marseille

  // que pensez-vous de MonMOTÀmoi
};

main();

Acknowledgement

Algorithm from https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words. List of words created from Wikipedia French articles.

Building the dictionary

Download frwiki-latest-pages-articles.xml from https://dumps.wikimedia.org/frwiki/latest/, extract it and move it in the current folder. Execute node --max_old_space_size=4096 create-dictionary.js. It will output two files, the one to use is words-fr-new.txt. The process may also crash at some point because of a too high memory usage 🤷‍♂️. Then, execute node minimize-dictionary.js to slightly reduce the size of the dictionary. It will output a words-fr-new-min.txt which is the final dictionary.

About

Split a French sentence without spaces nor accents into words

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published