gstlib: a Generalized Suffix Tree Library in Scala
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src Taking into account the special case of empty (generalized) suffix tree Feb 11, 2018
.gitignore Initial commit Aug 1, 2017
CHANGELOG.md Release of version 0.1.2 Aug 24, 2017
LICENSE.txt
README.md
build.sbt

README.md

gstlib: Generalized Suffix Tree Library in Scala

gstlib is a library that implements a generalized suffix tree datastructure for sequences of items in the Scala programming language.

Latest version: 0.1.2

Features:

  • efficient building of suffix trees and generalized suffix trees;
  • efficient search of a pattern in (generalized) suffix trees;
  • efficient computation of longest common subsequence of two sequences;
  • linear-time solution to the multiple common substring problem.

It implements the following algorithms from the book "Algorithm on Strings, Trees, and Sequences: Computer Science and Computational Biology" by D. Gusfield:

  • Ukkonen's algorithm for building generalized suffix tree;
  • constant-time lowest common ancestor (LCA) retrieval (Schieber and Vishkin approach) via a linear-time preprocessing of the tree;
  • a linear-time solution to the multiple common substring problem (Lucas Hui approach).

Getting gstlib

gstlib is available for Scala 2.11 and 2.12.

Usage with SBT, adding a dependency to the latest version of the gstlib to your sbt build definition file:

libraryDependencies += "com.github.guillaumedd" %% "gstlib" % "0.1.2"

Usage

Suffix Tree

import gstlib.SuffixTree

/*
 * Building the suffix tree
 */
val str = "String to be searched"
val stree = SuffixTree(str)

/*
 * Searching for a pattern
 */
val pattern = "to be"
stree.contains(pattern) // res0: Boolean = true

val indexes = stree.find(pattern) // indexes: List[Int] = List(7)
val first = indexes.head // indexes: List[Int] = List(7)

stree.sequence.substring(first, first + pattern.length)
// res1: String = to be

/*
 * Traversing the suffixes
 */
for (suffix <- stree.suffixes()) {
  println(suffix)
}

/*
 * It is possible to build suffix tree for a wide variety of sequences!
 */
val sentence = str.split(" ") // sentence: Array[String] = Array(String, to, be, searched)
val streeSentence = SuffixTree(sentence)

val patternSentence = Array("be", "searched")

streeSentence.contains(patternSentence) // res3: Boolean = true

val indexesSentence = streeSentence.find(patternSentence) // indexesSentence: List[Int] = List(2)
val firstSentence = indexesSentence.head // firstSentence: Int = 2

streeSentence.sequence.slice(firstSentence, firstSentence + patternSentence.length)
// res4: Array[String] = Array(be, searched)

streeSentence.contains(Array("hello", "world", "!")) // res5: Boolean = false

Generalized Suffix Tree

import gstlib.GeneralizedSuffixTree

/*
 * Building a generalized suffix tree
 *
 * Here:
 * - a sequence is an array of string, and
 * - an item is a string
 */
val utterances = List(
  Array("Hello", "world", "!"),
  Array("How", "are", "you", "today", "?"),
  Array("How", "are", "you", "doing", "?"),
  Array("are", "you", "there", "?"),
  Array("What", "a", "beautiful", "world", "!")
)

val gstree = GeneralizedSuffixTree(utterances: _*)

/*
 * Searching the generalized suffix tree
 */
val pattern = Array("are", "you")
gstree.find(pattern)
// res0: List[(gstlib.GeneralizedSuffixTreeBuilder.SequenceID, Int)] = List((2,1), (1,1), (3,0))
// (2,1) in "How", "are", "you", "doing", "?"
// (1,1) in "How", "are", "you", "today", "?"
// (3,0) in "are", "you", "there", "?"

/*
 * Computing the multiple common subsequence
 */
for ((freq, subseq) <- gstree.bulkMultipleCommonSubsequence()) {
  val subseqStr = subseq.mkString(" ")
  println(s"The sequence '$subseqStr' is appearing in $freq sequences")
}
// Output:
// The sequence 'are you' is appearing in 3 sequences
// The sequence '!' is appearing in 2 sequences
// The sequence '?' is appearing in 3 sequences
// The sequence 'How are you' is appearing in 2 sequences
// The sequence 'you' is appearing in 3 sequences
// The sequence 'world !' is appearing in 2 sequences

/*
 * Computing the longest common subsequences between a pattern and a
 * generalized suffix tree
 */
val pattern2 = Array("well", ",", "today", "?", "where", "are", "you", "?")
for {
  (start, end, positions) <- gstree.findLongestCommonSubsequences(pattern2)
} {
  val subsequence = pattern2.slice(start, end).mkString(" ")
  println(s"Subsequence '$subsequence' appears in:")
  for {
    (seqID, startPos) <- positions
  } {
    println(s"\t- sequence $seqID: ${gstree.getSequenceBy(seqID).mkString(" ")} at starting position $startPos")
  }
}
// Output:
//Subsequence 'today ?' appears in:
//	- sequence 1: How are you today ? at starting position 3
//Subsequence 'are you' appears in:
//	- sequence 2: How are you doing ? at starting position 1
//	- sequence 1: How are you today ? at starting position 1
//	- sequence 3: are you there ? at starting position 0

/*
 * It is possible to build generalized suffix tree for a wide variety
 * of sequence!
 *
 * Here:
 * - a sequence is a string, and
 * - an item is a character
 */
val gstree2 = GeneralizedSuffixTree("ABCDEF", "CDE", "EFGHIJK", "LMNOPQRST", "STUVWXYZ")
gstree2.contains("E") // res2: Boolean = true
gstree2.find("E") // res3: List[(gstlib.GeneralizedSuffixTreeBuilder.SequenceID, Int)] = List((2,0), (0,4), (1,2))
// (2,0) in EFGHIJK
// (0,4) in ABCDEF
// (1,2) in CDE

/*
 * Building a generalized suffix tree via a Builder
 */
val builder = GeneralizedSuffixTree.newBuilder[Char, String]
builder += "foo"
builder += "bar"

val gstreeFooBar = builder.result()

gstreeFooBar.contains("hello") // res7: Boolean = false
gstreeFooBar.contains("oo") // res8: Boolean = true
gstreeFooBar.contains("ba") // res9: Boolean = true

Utils: Longest Common Subsequence

import gstlib.GeneralizedSuffixTree.longestSubsequence

longestSubsequence("abcd", "efgh") // res0: Option[String] = None

longestSubsequence("abcd", "cdefgh") // res1: Option[String] = Some(cd)

Benchmarks

This project includes benchmarks to assess the efficiency of some algorithm (building of tree, solving of the multiple common subsequence problem). Benchmarks can be executed via SBT:

$ sbt test:run

Files to benchmark the code come from the repository: schmidda/ukkonen-suffixtree.

Contributors

  • Guillaume Dubuisson Duplessis (2016-2017)
  • Vincent Letard (2016)

Usage for Research Purposes

If you use this library for research purposes, please make reference to this library by citing the following paper:

  • Dubuisson Duplessis, G.; Charras, F.; Letard, V.; Ligozat, A.-L.; Rosset, S., Utterance Retrieval based on Recurrent Surface Text Patterns, 39th European Conference on Information Retrieval (ECIR), 2017, pp. 199--211 [More DOI]

License

CECILL-B - see the LICENSE file.