# Suffix Array Algorithms Analysis


This notebook benchmarks three algorithms that build the [suffix array](https://en.wikipedia.org/wiki/Suffix_array) data structure, listed below with their time complexities.

1. **Radix Sort** - *O(n log^2 n)*
2. **Counting Sort** - *O(n log n)*
3. **DC3 Algorithm** - *O(kn)*
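
For orientation, the suffix array of a text is the list of suffix start positions, ordered lexicographically by suffix. The sketch below is not taken from `rico.backend.strings.suffix`; the names `naive_suffix_array` and `doubling_suffix_array` are hypothetical. The first function is a slow reference that is handy for checking results. The second shows prefix doubling with a comparison sort per round, one common way to arrive at an *O(n log^2 n)* bound. Whether the `radix sort` option of `SuffixArray` works this way is an assumption, not something this notebook states.

```python
def naive_suffix_array(text):
    """Slow reference: sort start positions by comparing whole suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def doubling_suffix_array(text):
    """Prefix doubling: order suffixes by their first 2, 4, 8, ... characters."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # initial rank: the first character
    order = list(range(n))
    k = 1
    while True:
        # Key for suffix i: rank of its first k characters, then the rank of
        # the following k characters (-1 if the suffix ends before that, so
        # shorter suffixes sort first).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        order.sort(key=key)
        # Re-rank so that suffixes with equal keys share the same rank.
        new_rank = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if rank[order[-1]] == n - 1:  # all ranks distinct: fully sorted
            return order
        k *= 2
```

On `'banana'` both functions return `[5, 3, 1, 0, 4, 2]`, which corresponds to the suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana`.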

Conclusion summary:
- Radix Sort beats DC3 and Counting Sort on small random text (length < 1000; tested here with lengths 1 to 100) over a large alphabet (26 letters).
- Counting Sort, despite its better theoretical complexity, is slower than Radix Sort in most test cases.
- Counting Sort beats Radix Sort only when the number of distinct letters is very small (in these tests, only with a single letter).
- DC3 beats Counting Sort in all test cases.
- DC3 beats Radix Sort on all large text (length > 1000).


In [15]:
cd ../../../

B:\Rico\Dev\GitHub\WebRico\src


In [641]:
from rico.backend.strings.suffix import SuffixArray

In [677]:
# Testing
trial_text = []

import cProfile, os, pstats
from random import randint

def statistics(algorithm):
    return os.path.join('rico', 'backend', 'notebook', 'suffix array ' + algorithm + ' stats')

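def generate_text(trials=10, text_length=1000, letters=26):
    # Fill the global trial_text with `trials` random strings.
    # text_length is either a fixed int or an inclusive (min, max) range.
    print 'Trials:', trials
    print 'Text length:', text_length
    print 'Letters:', letters
    print ''
    global trial_text
    if isinstance(text_length, int):
        text_length = (text_length, text_length)
    # Alphabets of up to 26 symbols map to 'A'..'Z'; larger ones start at chr(1).
    offset = ord('A') if letters <= 26 else 1
    trial_text = [
        ''.join(chr(randint(0, letters - 1) + offset)
                for _ in xrange(randint(text_length[0], text_length[1])))
        for _ in xrange(trials)
    ]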

results = {}
def perform_algorithm(algorithm):
    global results
    results[algorithm] = []
    for text in trial_text:
        results[algorithm].append(SuffixArray(text, algo=algorithm))

def profiler(algorithm):
    cProfile.run('perform_algorithm("%s")' % algorithm, statistics(algorithm))
    print 'Algorithm:', algorithm
    pstats.Stats(statistics(algorithm)).strip_dirs().sort_stats(-1).print_stats()

def test(trials=10, text_length=10000, letters=5):
    generate_text(trials=trials, text_length=text_length, letters=letters)
    # Time each algorithm on the same set of texts (%time is an IPython magic).
    for algorithm in SuffixArray.algorithms:
        %time perform_algorithm(algorithm)
        print '(%s)\n' % algorithm
    # Compare the results of every pair of algorithms.
    L = list(results)
    for i in xrange(len(L)):
        for j in xrange(i + 1, len(L)):
            algo1, algo2 = L[i], L[j]
            print '(%s == %s)?' % (algo1, algo2), results[algo1] == results[algo2]
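
The `test` function above relies on the `%time` magic, so it only runs inside IPython or Jupyter. A minimal sketch of a plain-Python replacement follows, assuming a single wall-clock measurement per call is enough. The helper name `time_call` is hypothetical and not part of the notebook.

```python
from timeit import default_timer


def time_call(func, *args, **kwargs):
    """Call func once and return (result, elapsed wall-clock seconds)."""
    start = default_timer()
    result = func(*args, **kwargs)
    return result, default_timer() - start
```

Inside the loop over `SuffixArray.algorithms`, `time_call(perform_algorithm, algorithm)` would then stand in for the `%time` line, with the elapsed seconds printed by hand.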

In [682]:
test(trials=1000, text_length=(1, 100), letters=26)

Trials: 1000
Text length: (1, 100)
Letters: 26

Wall time: 233 ms
(radix sort)

Wall time: 459 ms
(counting sort)

Wall time: 340 ms
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [683]:
test(trials=5, text_length=100000, letters=1)

Trials: 5
Text length: 100000
Letters: 1

Wall time: 12.3 s
(radix sort)

Wall time: 10.3 s
(counting sort)

Wall time: 4.32 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [688]:
test(trials=5, text_length=100000, letters=2)

Trials: 5
Text length: 100000
Letters: 2

Wall time: 10.6 s
(radix sort)

Wall time: 18.1 s
(counting sort)

Wall time: 4.99 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [684]:
test(trials=100, text_length=(1000, 5000), letters=5)

Trials: 100
Text length: (1000, 5000)
Letters: 5

Wall time: 3.57 s
(radix sort)

Wall time: 4.47 s
(counting sort)

Wall time: 1.68 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [685]:
test(trials=100, text_length=(1000, 5000), letters=26)

Trials: 100
Text length: (1000, 5000)
Letters: 26

Wall time: 2.77 s
(radix sort)

Wall time: 4.81 s
(counting sort)

Wall time: 1.69 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [686]:
test(trials=10, text_length=50000, letters=50)

Trials: 10
Text length: 50000
Letters: 50

Wall time: 5.72 s
(radix sort)

Wall time: 14 s
(counting sort)

Wall time: 3.12 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True


In [687]:
test(trials=1, text_length=1000000, letters=10)

Trials: 1
Text length: 1000000
Letters: 10

Wall time: 17.8 s
(radix sort)

Wall time: 44.6 s
(counting sort)

Wall time: 10.2 s
(dc3)

(radix sort == counting sort)? True
(radix sort == dc3)? True
(counting sort == dc3)? True
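
As a cross-check on the conclusion summary at the top, the wall times reported above can be tabulated and compared directly. The snippet below only copies the numbers from the test outputs in this notebook. The names `wall_times` and `fastest` are hypothetical helpers introduced for illustration.

```python
# Wall times in seconds, copied from the test outputs above.
# Each entry: (trials x text length, letters) -> (radix sort, counting sort, dc3)
names = ("radix sort", "counting sort", "dc3")
wall_times = [
    ("1000 x 1-100, 26 letters",    (0.233, 0.459, 0.340)),
    ("5 x 100000, 1 letter",        (12.3, 10.3, 4.32)),
    ("5 x 100000, 2 letters",       (10.6, 18.1, 4.99)),
    ("100 x 1000-5000, 5 letters",  (3.57, 4.47, 1.68)),
    ("100 x 1000-5000, 26 letters", (2.77, 4.81, 1.69)),
    ("10 x 50000, 50 letters",      (5.72, 14.0, 3.12)),
    ("1 x 1000000, 10 letters",     (17.8, 44.6, 10.2)),
]


def fastest(times):
    """Name of the algorithm with the smallest wall time."""
    return names[times.index(min(times))]


for label, times in wall_times:
    radix, counting, dc3 = times
    print("%-28s fastest: %-13s dc3 vs radix: %.2fx, dc3 vs counting: %.2fx"
          % (label, fastest(times), radix / dc3, counting / dc3))
```

In these runs Radix Sort is fastest only in the short-text test, DC3 is fastest everywhere else, and Counting Sort beats Radix Sort only in the single-letter test.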
