Word segmentation

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like "thisisatest". Humans can immediately identify that the correct phrase should be "this is a test".
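To make the idea concrete, here is a minimal sketch of segmentation by dynamic programming over unigram probabilities. It is not this package's implementation: the `toyCounts` table, `toyTotal`, and the `score` and `segment` functions are hypothetical names invented for the illustration, the counts are made up, and the code assumes ASCII input.

```go
package main

import (
	"fmt"
	"math"
)

// toyCounts is a made-up unigram table used only for this illustration.
var toyCounts = map[string]float64{
	"this": 50, "is": 80, "a": 100, "test": 20, "at": 30, "his": 10,
}

// toyTotal is the assumed total number of words in the toy corpus.
const toyTotal = 1000.0

// score returns the log10 probability of a word. Unknown words get a
// penalty that grows with their length, so long unknown chunks lose.
func score(word string) float64 {
	if c, ok := toyCounts[word]; ok {
		return math.Log10(c / toyTotal)
	}
	return math.Log10(10 / (toyTotal * math.Pow(10, float64(len(word)))))
}

// segment returns the highest-scoring split of text into words.
// best[i] holds the best score for text[:i] and prev[i] the index
// where the last word of that best split starts.
func segment(text string) []string {
	n := len(text)
	best := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		for j := 0; j < i; j++ {
			if s := best[j] + score(text[j:i]); s > best[i] {
				best[i], prev[i] = s, j
			}
		}
	}
	// Walk the back-pointers from the end to rebuild the words in order.
	var words []string
	for i := n; i > 0; i = prev[i] {
		words = append([]string{text[prev[i]:i]}, words...)
	}
	return words
}

func main() {
	fmt.Println(segment("thisisatest")) // prints [this is a test]
}
```

With a real corpus the same idea applies, only with far larger tables and, typically, bigram counts to take the previous word into account.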

Source and credits

This package is heavily inspired by the Python module grantjenks/wordsegment.

Copyright (c) 2015 by Grant Jenks under the Apache 2 license

The package is based on code from the chapter Natural Language Corpus Data by Peter Norvig from the book Beautiful Data (Segaran and Hammerbacher, 2009).

Copyright (c) 2008-2009 by Peter Norvig

Getting started

You can grab this package with the following command:

go get


To use the default English corpus:

package main

import (
    "fmt"
    // Also import this repository's corpus and wordsegmentation packages.
)

func main() {
    // Load the default English corpus, which is built from the bundled TSV files
    englishCorpus := corpus.NewEnglishCorpus()
    fmt.Println(wordsegmentation.Segment(englishCorpus, "thisisatest"))
}

Unigrams and bigrams

Information: an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Here, a unigram is a single word and a bigram is a pair of consecutive words.

This package ships with an English corpus by default that is ready to use. Data files are derived from the Google web trillion word corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
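Because every entry in the data is lowercased with punctuation removed, input text presumably needs the same normalization before lookup. The following is a minimal sketch of such a step; the `clean` function is a hypothetical stand-in and may differ from what the package's own `Clean` method does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// clean lowercases s and drops every rune that is not a letter or a
// digit. It is an illustrative stand-in, not the package's Clean method.
func clean(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(clean("This is a test!")) // prints thisisatest
}
```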

Using a custom corpus

To use a custom corpus, implement the Corpus interface and pass your implementation to the Segment function.

The interface is as follows:

// The Corpus interface gives access to the bigrams,
// the unigrams, the total number of words in the corpus
// and a function to clean a string.
type Corpus interface {
    Bigrams() *models.Bigrams
    Unigrams() *models.Unigrams
    Total() float64
    Clean(string) string
}

Take a look at the English corpus source code to help you start!
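As a rough starting point, a custom corpus could look like the sketch below. The real `models.Unigrams` and `models.Bigrams` definitions are not shown in this README, so the sketch declares simplified stand-in map types and a local copy of the interface; `tinyCorpus`, `newTinyCorpus`, and the counts are hypothetical. Adapting it to the real package means swapping in the actual `models` types.

```go
package main

import (
	"fmt"
	"strings"
)

// Unigrams and Bigrams are simplified stand-ins for the types in the
// models package; the real definitions may differ.
type Unigrams map[string]float64
type Bigrams map[string]float64

// Corpus mirrors the shape of the interface above with the stand-in types.
type Corpus interface {
	Bigrams() *Bigrams
	Unigrams() *Unigrams
	Total() float64
	Clean(string) string
}

// tinyCorpus is a hypothetical in-memory corpus with made-up counts.
type tinyCorpus struct {
	unigrams Unigrams
	bigrams  Bigrams
	total    float64
}

// Compile-time check that *tinyCorpus satisfies Corpus.
var _ Corpus = (*tinyCorpus)(nil)

// newTinyCorpus builds the corpus and uses the sum of the unigram
// counts as the total, which is an assumption made for this sketch.
func newTinyCorpus() *tinyCorpus {
	u := Unigrams{"hello": 3, "world": 2}
	total := 0.0
	for _, count := range u {
		total += count
	}
	return &tinyCorpus{unigrams: u, bigrams: Bigrams{"hello world": 2}, total: total}
}

func (c *tinyCorpus) Bigrams() *Bigrams   { return &c.bigrams }
func (c *tinyCorpus) Unigrams() *Unigrams { return &c.unigrams }
func (c *tinyCorpus) Total() float64      { return c.total }

// Clean trims surrounding whitespace and lowercases the input.
func (c *tinyCorpus) Clean(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

func main() {
	c := newTinyCorpus()
	fmt.Println(c.Total(), c.Clean("  Hello "))
}
```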


The documentation of this package can be found on GoDoc. The packages are:

  • corpus - the default English corpus
  • helpers - small functions to get the length of a string, remove special characters from a string, and get the minimum of two integers
  • models - the various objects used (Unigrams, Bigrams, Arrangement, Candidate, Possibility)
  • parsers - parsers to read tab-separated files into Unigrams and Bigrams
  • segment - the 'main' package
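To illustrate what reading tab-separated data into counts can involve, here is a minimal sketch. It assumes a two-column `term<TAB>count` layout, which this README does not confirm, and the `parseCounts` and `parseString` functions are hypothetical names that are not part of the parsers package.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseCounts reads "term<TAB>count" lines into a map. The two-column
// layout is an assumption made for this sketch.
func parseCounts(r io.Reader) (map[string]float64, error) {
	counts := make(map[string]float64)
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" {
			continue // skip blank lines
		}
		fields := strings.Split(line, "\t")
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		n, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad count in %q: %w", line, err)
		}
		counts[fields[0]] = n
	}
	return counts, scanner.Err()
}

// parseString is a small convenience wrapper for in-memory data.
func parseString(s string) (map[string]float64, error) {
	return parseCounts(strings.NewReader(s))
}

func main() {
	counts, err := parseString("the\t100\nof the\t40\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["the"], counts["of the"]) // prints 100 40
}
```

The same function works for unigrams and bigrams under this assumed layout, since a bigram is simply a term that contains a space.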








