Skip to content

Ritu-Kundu/lpsf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lspf: Tool to compute Longest Successor and/or Previous Factor Array

This is an implementation of the linear-time algorithm by Crochemore and Ilie to find the Longest Previous Factor (LPF) Array using Suffix Array as published in the following:

Maxime Crochemore, Lucian Ilie, Computing Longest Previous Factor in linear time and applications, Information Processing Letters, Volume 106, Issue 2, 2008, Pages 75-80, ISSN 0020-0190, doi url

In addition to the LPF Array of a string, this implementation produces the corresponding Reference Array which indicates the index of the leftmost occurrence of the longest previous factor at each position. This is in contrast to the above publication where the algorithm gives an index of any of the previous occurrences. Compressed Suffix Tree (SDSL Library) has been used to achieve that.

Moreover, the Longest Successor Factor (LSF) Array is also produced which is implemented as a mirror image of the algorithm to create LPF Array. Accordingly, Reference Array in this case provides the index of the rightmost occurrent of the longest successor factor at each position.

The arrays are verified against those produced by the brute-force algorithm. Consequently, verification stage is qudaratic-time even though computation is linear-time. Of course, verification can be disabled by simple modifications in the main method (File: lspfuser.cpp in src folder).

Configuration:

There are three modes of operations--DETAILS, RANDOM, EXHAUSTIV--which respectively correspond to testing either the user-specified strings or random strings or all possible binary strings up to a certain length. Specifically,

  • DETAILS: Check one or more strings by adding to the list <text_list>.

  • RANDOM: Test <cNum_Strings> number of random strings from the alphabet <cAlphabet> of lengths varying up to <cMax_len>.

  • EXHAUSTIVE: Test all binary strings of length up to <max_len>.

Choosing and Setting-up the mode:

The configuration macros and variables have been defined in the file lspfuser.cpp in src folder. Please choose the suitable operating mode by uncommenting/commenting the corresponding macro and set-up the corresponding variables for that mode with the values of your choice. By default, DETAILS is selected with two toy strings: "abbaabbbaaabab", "ananas", and "aabbabaabbaababbabab".

After changing (as needed) the mode and configuration variables, the source-code can be compiled.

Compilation:

Please follow the instructions given in file INSTALL.md

Output:

  • For each string, the following results are displayed on the standard output (usually terminal):

    • In DETAILS mode,
      • The string itself.
      • All the arrays: Length (denoted <lspf_l>) and Reference (denoted <lspf_r>) arrays for both - LPF and LSF.
      • Verification results give the number of errors (which should be 0). An error is a mismatch at any position in any of the arrays produced by the algorithm and those by brute force. For each error, its location/index (denoted <i>), value calculated by the algorithm (denoted <calc>), value calculated by the algorithm (denoted <naive>) are printed and number of errors is increased by 1.
    • In RANDOM and EXHAUSTIVE modes, only the text and the verification results are printed.
  • If there is an error in one of the strings, tool will terminate and will not test the remaining strings.

In addition, a <stat.txt> file is generated in the working (i.e. tool) directory. Each row/line stores the stats-record for each string tested. A stats-record comprise of the length (denoted <n>), time taken for computation of the LPF arrays (denoted <lpf-time> in microseconds), time taken for computation of the LSF arrays (denoted <lsf-time> in microseconds); each tab-separated. Time taken does not include the pre-processing stage (construction and manipulation of the suffix-tree).

Usage:

Run the tool by the following command from inside the tool-directory.

./bin/luf 

Definitions:

Given a string $w$, we define the following:

  • The longest previous factor of $w$ starting at position $i$, is the longest factor of $w$ that occurs at $i$ and has at least one other occurrence in the prefix $w[1 .. i-1]$. The longest previous factor (LPF) array (denoted by LPF_l) gives for each position $i$ in $w$, the length of the longest factor starting both at position $i$ and at another position $i &gt; j$.

  • The LPF-Reference Array, denoted by LPF_r, specifies for each position $i$ of $w$, the reference of the longest previous factor at $i$. The reference of $i$ is defined as the position $j$ of the first occurrence of the $w[i .. i+ LSF_l[i]-1]$ in $w$.

  • The longest successor factor of $w$ starting at position $i$, is the longest factor of $w$ that occurs at $i$ and has at least one other occurrence in the suffix $w[i+1 .. n]$. The longest successor factor (LSF) array (denoted by LSF_l) gives for each position $i$ in $w$, the length of the longest factor starting both at position $i$ and at another position $j&gt;i$.

  • The LSF-Reference Array, denoted by LSF_r, specifies for each position $i$ of $w$, the reference of the longest successor factor at $i$. The reference of $i$ is defined as the position $j$ of the last occurrence of the $w[i .. i+ LSF_l[i]-1]$ in $w$.

More formally, for a string $w$ of length $n$, the arrays (one-based index) can be specified as follows: lpfl-array

lpfr-array

lsfl-array

lsfr-array

Examples:

example

External Tools:

SDSL Library is used for compressed suffix tree (whose underlying data structure is compressed suffix array).

About

Longest Previous/Successor Arrays

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published