# Chapter 25: Advanced String Structures

> *"Strings are not just sequences—they are worlds unto themselves. With advanced structures, we can navigate these worlds with unprecedented speed and insight."* — Anonymous

---

## 25.1 Introduction

In Chapter 24, we explored fundamental string algorithms for pattern matching. Now we delve into **advanced string data structures** that enable powerful queries beyond simple search: finding longest repeated substrings, counting distinct substrings, constructing suffix arrays, and more. These structures are the backbone of modern bioinformatics, text indexing, and data compression.

### 25.1.1 Why Advanced String Structures?

```
┌─────────────────────────────────────────────────────────────────────┐
│                    IMPORTANCE OF ADVANCED STRING STRUCTURES          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. INDEXING: Preprocess text to answer many queries efficiently    │
│     (e.g., search for any pattern in O(|P|) time).                  │
│                                                                      │
│  2. REPEATED PATTERNS: Find longest repeated substring, which is    │
│     useful in plagiarism detection and data compression.            │
│                                                                      │
│  3. BIOINFORMATICS: DNA sequences are billions of characters long;  │
│     suffix arrays/trees enable genome alignment and assembly.       │
│                                                                      │
│  4. DATA COMPRESSION: Burrows-Wheeler Transform is the core of      │
│     bzip2 and modern compression tools.                             │
│                                                                      │
│  5. LINGUISTICS: Analyze word patterns, palindromes, and more.      │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 25.2 Suffix Trees

A **suffix tree** is a compressed trie of all suffixes of a string. It can be built in O(n) time for a constant-size alphabet and answers many string queries in O(m) time, where m is the length of the query pattern.

### 25.2.1 Definition and Properties

For a string S of length n, a suffix tree is a rooted tree with:
- Exactly n leaves, each labeled with a suffix start index.
- Each internal node has at least two children.
- Each edge is labeled with a non-empty substring of S.
- No two edges out of a node can have labels starting with the same character.
- The concatenation of edge labels from root to leaf gives a suffix.

**Example:** For S = `banana$` (with sentinel), the suffix tree has 7 leaves and four internal nodes: the root and the nodes for the branching repeated substrings `a`, `ana`, and `na`.

**Properties:**
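- Space: O(n) nodes (at most n - 1 internal nodes, since each has at least two children), provided edge labels are stored as index pairs into S rather than as copied substrings.
- Construction: O(n) time (Ukkonen's algorithm) for a constant-size alphabet.
- Queries:
  - **Pattern presence:** O(m) by following characters from the root.
  - **Number of occurrences:** O(m) if each node stores the number of leaves below it; otherwise O(m + occ) to traverse the subtree.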
  - **Longest repeated substring:** Find deepest internal node (by string depth).
  - **Longest common substring** of two strings: Build combined tree with markers.

### 25.2.2 Ukkonen's Algorithm (Conceptual)

Ukkonen's algorithm builds the suffix tree online, adding characters one by one. It uses **active point** and **suffix links** to achieve linear time.

**Key Concepts:**
- **Implicit suffix tree:** After processing first i characters, we have tree for S[0:i] (without sentinel).
- **Active point:** A location (node, edge, length) where the next suffix extension starts.
- **Suffix link:** Pointer from a node representing a string `xα` to the node representing `α` (useful for fast traversal).
- **Extension rules:**
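  - Rule 1: The extended suffix is already in the tree – do nothing (all shorter suffixes are then present too, so the phase can end).
  - Rule 2: The path ends at a leaf – extend the leaf edge by the new character.
  - Rule 3: The path ends at an internal node or in the middle of an edge, and no continuation with the new character exists – create a new leaf, first splitting the edge if the path ends mid-edge.
  - Note that Gusfield's textbook lists the same three cases in a different order.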

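Ukkonen uses a **remainder** counter to track how many suffixes still have to be inserted in the current phase. The algorithm is intricate, so we only outline the steps:

1. Initialize the tree with a root node, place the active point at the root, and set remainder to 0.
2. For each character in the string:
   - Increment remainder and reset the pointer to the last internal node created in this phase (it is used to set suffix links).
   - While remainder > 0, try to insert the pending suffix at the active point by applying the rules; stop the phase as soon as Rule 1 applies.
   - When Rule 3 splits an edge and creates an internal node, set the suffix link of the internal node created earlier in this phase (if any) to the new node.
   - After each insertion, decrement remainder and update the active point, following a suffix link when one is available.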

After adding sentinel, we have the full suffix tree.

**Implementation note:** A full implementation is lengthy; many libraries exist. For learning, we can implement a simpler O(n²) suffix tree to understand the structure, but production uses Ukkonen.
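To make the structure concrete, here is a minimal sketch of the simpler quadratic approach. It builds an *uncompressed* suffix trie (one node per character, no edge compression and no suffix links), so it is not Ukkonen's algorithm and is only suitable for short strings. The class name `SuffixTrie`, the `"leaf"` marker key, and the default sentinel are illustrative choices, not part of any library.

```python
class SuffixTrie:
    """Uncompressed trie of all suffixes: O(n^2) time and space."""

    def __init__(self, text, sentinel="$"):
        # Assumption: `sentinel` does not occur in `text`.
        self.text = text + sentinel
        self.root = {}
        for start in range(len(self.text)):
            node = self.root
            for ch in self.text[start:]:
                node = node.setdefault(ch, {})
            # Every suffix ends at its own node; remember where it started.
            # The key "leaf" cannot clash with single-character edge labels.
            node["leaf"] = start

    def _walk(self, pattern):
        """Follow `pattern` from the root; return the node reached or None."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return None
            node = node[ch]
        return node

    def contains(self, pattern):
        return self._walk(pattern) is not None

    def occurrences(self, pattern):
        """Start indices of all occurrences, read off the leaves below the match."""
        node = self._walk(pattern)
        if node is None:
            return []
        found, stack = [], [node]
        while stack:
            cur = stack.pop()
            for key, child in cur.items():
                if key == "leaf":
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)


trie = SuffixTrie("banana")
print(trie.contains("nan"))       # True
print(trie.occurrences("ana"))    # [1, 3]
```

A real suffix tree merges each chain of single-child nodes into one edge, which is what brings the node count down from O(n²) to O(n).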

### 25.2.3 Applications

- **Exact string matching:** Find all occurrences of pattern P in O(|P| + occ).
- **Longest repeated substring:** Find the deepest internal node (string depth).
- **Longest common substring** of two strings: Build suffix tree for T1$T2#, find deepest node with leaves from both strings.
- **Palindromes:** Maximal palindromes can be found by building a generalized suffix tree of S and its reverse and answering longest-common-extension queries with LCA.
- **DNA assembly:** Suffix trees are used in genome assembly algorithms.

---

## 25.3 Suffix Arrays

A **suffix array** is a sorted array of starting indices of all suffixes of a string. It requires less memory than a suffix tree and can be augmented with an LCP (Longest Common Prefix) array to achieve similar power.

### 25.3.1 Definition

For a string S of length n, the suffix array SA is an array of integers 0..n-1 such that S[SA[0]:] < S[SA[1]:] < ... < S[SA[n-1]:] lexicographically.

**Example:** S = `banana$`

Suffixes by starting index:
- 0: `banana$`
- 1: `anana$`
- 2: `nana$`
- 3: `ana$`
- 4: `na$`
- 5: `a$`
- 6: `$`

Sorted lexicographically: `$`, `a$`, `ana$`, `anana$`, `banana$`, `na$`, `nana$`

SA = [6, 5, 3, 1, 0, 4, 2]

### 25.3.2 Construction Algorithms

#### 25.3.2.1 Naive O(n² log n) (for understanding)

```python
def suffix_array_naive(s):
    suffixes = [(s[i:], i) for i in range(len(s))]
    suffixes.sort()
    return [idx for _, idx in suffixes]
```

#### 25.3.2.2 Prefix Doubling (O(n log n))

This is the most common method. It sorts suffixes by their first 1,2,4,8,... characters using the previous rank.

```python
def suffix_array_doubling(s):
    n = len(s)
    k = 1
    # Initial ranks based on single character
    rank = [ord(c) for c in s]
    tmp = [0] * n
    sa = list(range(n))
    
    while k < n:
        # Sort by (rank[i], rank[i+k] if i+k < n else -1)
        sa.sort(key=lambda x: (rank[x], rank[x+k] if x+k < n else -1))
        # Assign new ranks
        tmp[sa[0]] = 0
        for i in range(1, n):
            prev = sa[i-1]
            curr = sa[i]
            prev_key = (rank[prev], rank[prev+k] if prev+k < n else -1)
            curr_key = (rank[curr], rank[curr+k] if curr+k < n else -1)
            tmp[curr] = tmp[prev] + (prev_key != curr_key)
        rank, tmp = tmp, rank
        if rank[sa[-1]] == n-1:
            break
        k <<= 1
    return sa
```

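**Time:** O(n log² n) as written, because each of the O(log n) rounds runs an O(n log n) comparison sort. Replacing the sort with a radix (counting) sort on the rank pairs brings each round down to O(n), for O(n log n) overall.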

#### 25.3.2.3 SA-IS (O(n))

The SA-IS algorithm (Suffix Array Induced Sorting) is a linear-time suffix array construction algorithm. It is intricate and rarely implemented by hand, but it is the basis of many fast libraries. We do not implement it here; the key idea is to classify each suffix as S-type (smaller than the suffix that follows it) or L-type (larger), recursively sort the suffixes that start at LMS (leftmost S-type) positions, and induce the order of the rest from them.
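Although the full algorithm is out of scope, its first step is easy to show. The sketch below only classifies suffixes as S-type or L-type and lists the LMS positions; it does not sort anything. The function name `classify_suffix_types` is hypothetical, and the input is assumed to end with a unique sentinel that is smaller than every other character.

```python
def classify_suffix_types(s):
    """Return (types, lms_positions) for a sentinel-terminated string s."""
    n = len(s)
    types = ["S"] * n  # the sentinel suffix is S-type by convention
    # Scan right to left: a suffix is L-type if it is larger than its successor.
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    # An LMS position is an S-type position whose left neighbour is L-type.
    lms = [i for i in range(1, n) if types[i] == "S" and types[i - 1] == "L"]
    return "".join(types), lms


types, lms = classify_suffix_types("banana$")
print(types)  # LSLSLLS
print(lms)    # [1, 3, 6]
```

Only the suffixes at the LMS positions are sorted recursively; the remaining work is done by the linear induction passes.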

### 25.3.3 LCP Array (Kasai's Algorithm)

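The LCP array stores the length of the longest common prefix between suffixes that are adjacent in the suffix array. In the implementation below, `lcp[k]` is the LCP of the suffixes `SA[k]` and `SA[k+1]`, so the array has n - 1 entries. Together with the suffix array, it supports queries such as longest repeated substring and counting distinct substrings.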

```python
def kasai_lcp(s, sa):
    n = len(s)
    rank = [0] * n
    for i, idx in enumerate(sa):
        rank[idx] = i
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i+h] == s[j+h]:
                h += 1
            lcp[rank[i] - 1] = h
            if h > 0:
                h -= 1
    return lcp
```

**Time:** O(n)
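When implementing Kasai's algorithm, it helps to have a slow but obviously correct reference to compare against. The following sketch computes the same array by direct comparison in O(n²) time, using the same indexing convention (entry k compares the suffixes at `sa[k]` and `sa[k+1]`). The helper name `lcp_bruteforce` and the naive suffix array construction are assumptions made for this illustration.

```python
def lcp_bruteforce(s, sa):
    """Reference LCP array: entry k is the common prefix length of
    the suffixes starting at sa[k] and sa[k + 1]."""
    def common(i, j):
        h = 0
        while i + h < len(s) and j + h < len(s) and s[i + h] == s[j + h]:
            h += 1
        return h

    return [common(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


s = "banana$"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
print(sa)                     # [6, 5, 3, 1, 0, 4, 2]
print(lcp_bruteforce(s, sa))  # [0, 1, 3, 0, 0, 2]
```

The value 3 sits between `ana$` and `anana$`, which share the prefix `ana`.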

### 25.3.4 Applications

- **Pattern matching:** Binary search on suffix array to find range of suffixes starting with P. Time O(|P| log n).
- **Longest repeated substring:** Maximum value in LCP array.
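- **Number of distinct substrings:** The suffix at SA[i] contributes n - SA[i] prefixes, of which LCP(SA[i-1], SA[i]) were already counted for the previous suffix. Summing over all suffixes gives n(n+1)/2 minus the sum of the LCP array. (If S ends with a sentinel, this total also counts the substrings that contain the sentinel.)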
- **Longest common substring** of two strings: Concatenate with unique separators, build suffix array + LCP, find max LCP between suffixes from different strings.
- **Burrows-Wheeler Transform:** BWT can be derived from suffix array as BWT[i] = s[SA[i] - 1] (with wrap-around).
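The first three applications can be sketched in a few lines. This is an illustration under simplifying assumptions: the suffix array and LCP array are built naively so that the block runs on its own, the string has no sentinel, and the function names are hypothetical. The LCP convention is that `lcp[k]` compares the suffixes at `sa[k]` and `sa[k+1]`.

```python
def build_sa(s):
    # Naive construction, fine for small inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp(s, sa):
    # Naive adjacent-suffix comparison; Kasai's algorithm does this in O(n).
    def common(i, j):
        h = 0
        while i + h < len(s) and j + h < len(s) and s[i + h] == s[j + h]:
            h += 1
        return h
    return [common(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


def find_occurrences(s, sa, p):
    """All start positions of p in s, via two binary searches over sa."""
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= p
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > p
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])


def longest_repeated_substring(s, sa, lcp):
    if not lcp:
        return ""
    k = max(range(len(lcp)), key=lcp.__getitem__)
    return s[sa[k]:sa[k] + lcp[k]]


def count_distinct_substrings(s, lcp):
    n = len(s)
    return n * (n + 1) // 2 - sum(lcp)


s = "banana"
sa = build_sa(s)
lcp = build_lcp(s, sa)
print(find_occurrences(s, sa, "ana"))          # [1, 3]
print(longest_repeated_substring(s, sa, lcp))  # ana
print(count_distinct_substrings(s, lcp))       # 15
```

Each binary search step compares up to |P| characters, which is where the O(|P| log n) bound comes from.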

---

## 25.4 Burrows-Wheeler Transform (BWT)

The Burrows-Wheeler Transform is a reversible permutation that groups identical characters together, aiding compression. It is the basis of the bzip2 compression tool and the FM-index.

### 25.4.1 Forward BWT

Given a string S of length n (with a unique sentinel $ smaller than all characters), form all rotations, sort them lexicographically, and take the last column.

**Example:** S = `banana$`

Rotations (sorted):
- `$banana`
- `a$banan`
- `ana$ban`
- `anana$b`
- `banana$`
- `na$bana`
- `nana$ba`

Last column: `annb$aa` (the characters at the end of each rotation, read top to bottom)

**Implementation using suffix array:**
The BWT can be obtained directly from the suffix array:
- For each suffix starting at SA[i], the character just before it (cyclic) is BWT[i] = S[SA[i] - 1] (with S[-1] interpreted as last character, i.e., cyclic).

```python
def bwt_from_sa(s, sa):
    n = len(s)
    return ''.join(s[(sa[i] - 1) % n] for i in range(n))
```
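As a sanity check, the rotation-based definition and the suffix-array shortcut can be compared directly. The sketch below restates both in a self-contained form; the function names are illustrative, the suffix array is built naively, and the input is assumed to end with a unique sentinel that sorts before every other character (without it, rotation order and suffix order can differ).

```python
def bwt_by_rotations(s):
    """Textbook definition: sort all rotations, read the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def bwt_by_suffix_array(s):
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    # s[i - 1] with i == 0 wraps around to the last character (the sentinel).
    return "".join(s[i - 1] for i in sa)


print(bwt_by_rotations("banana$"))     # annb$aa
print(bwt_by_suffix_array("banana$"))  # annb$aa
```

The runs of equal characters in the output (`nn`, `aa`) are what later compression stages exploit.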

### 25.4.2 Inverse BWT

Reversing the BWT relies on the **last-to-first (LF) mapping**: the k-th occurrence of a character in the last column L corresponds to the k-th occurrence of that character in the first column F, where F is simply L sorted. Following this mapping from a row moves one character backwards in the original text.

**Steps:**
1. Compute the first column F by sorting L (in practice, only the start position of each character's block in F is needed).
2. For each position i in L, compute its rank (how many times L[i] has appeared in L[0..i-1]) and set `next[i]` to the start of L[i]'s block in F plus that rank.
3. Start at the row whose last character is the sentinel (that row is the original string). Repeatedly record L[row] and jump to `next[row]`; this yields the string from its last character to its first, so reverse the result at the end.

```python
from collections import Counter

def inverse_bwt(l, sentinel='$'):
    """Invert a BWT string; `sentinel` is the unique end marker used by the forward transform."""
    n = len(l)
    if n == 0:
        return ''
    # Start position of each character's block in the first column F
    cnt = Counter(l)
    start = {}
    total = 0
    for c in sorted(cnt):
        start[c] = total
        total += cnt[c]
    # LF mapping: next_idx[i] = start[l[i]] + (occurrences of l[i] in l[:i])
    seen = {c: 0 for c in cnt}
    next_idx = [0] * n
    for i, ch in enumerate(l):
        next_idx[i] = start[ch] + seen[ch]
        seen[ch] += 1
    # The row that ends with the sentinel is the original string itself
    row = l.index(sentinel)
    res = []
    for _ in range(n):
        res.append(l[row])   # characters come out last-to-first
        row = next_idx[row]
    return ''.join(reversed(res))
```

### 25.4.3 FM-Index (Concept)

The FM-index (Full-text index in Minute space) combines the BWT with additional data structures (wavelet trees, rank/select) to support count and locate queries in sublinear time. It is widely used in bioinformatics (e.g., Bowtie, BWA).

**Key components:**
- **BWT string L**.
- **Occurrence array** `occ(c, i)` = number of occurrences of character c in L[0..i-1] (can be stored as wavelet tree for O(log Σ) access).
- **C array** `C[c]` = number of characters in text less than c (like start positions in F).

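**Query for pattern P** (backward search, using a half-open range [l, r) of suffix-array rows and the `occ` definition above):
- Start with l = 0 and r = n, which covers the whole suffix array.
- For each character c in P from last to first:
  - l = C[c] + occ(c, l)
  - r = C[c] + occ(c, r)
  - If l ≥ r, P does not occur; stop.
- The resulting [l, r) is the range of suffixes starting with P, so the number of occurrences is r - l.
- Locating positions requires additional sampled suffix array values.

With constant-time `occ`, the FM-index counts occurrences in O(|P|) time. Locating costs extra LF steps per occurrence until a sampled suffix array entry is reached, which gives roughly O(|P| + occ log n) when every (log n)-th entry is sampled, all with a small memory footprint.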
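The count query can be sketched with plain Python tables. This is a teaching version under stated assumptions: `occ` is stored as an uncompressed prefix-count table (real FM-indexes use wavelet trees or sampled counts), the suffix array is built naively, the range is half-open with `occ[c][i]` counting occurrences of `c` in `bwt[0:i]`, and the names `build_fm_index` and `fm_count` are hypothetical.

```python
def build_fm_index(text):
    """text must end with a unique sentinel that sorts before all other characters."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[0:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    return C, occ, len(bwt)


def fm_count(index, pattern):
    """Number of occurrences of pattern, by backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range [lo, hi) of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana$")
print(fm_count(index, "ana"))  # 2
print(fm_count(index, "nan"))  # 1
print(fm_count(index, "ab"))   # 0
```

Note that the search never touches the original text or the suffix array, only `C` and `occ`; that is what makes the compressed variants possible.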

---

## 25.5 Palindromic Trees (Eertree)

A **palindromic tree** (also called Eertree) is a data structure for storing all distinct palindromic substrings of a string. It can be built in O(n) time and supports queries like number of distinct palindromes, longest palindrome suffix, and counting occurrences.

### 25.5.1 Structure

- **Two roots:** 
  - `root0` with length -1 (odd length root, representing imaginary palindrome of length -1).
  - `root1` with length 0 (even length root, representing empty string).
- Each node represents a palindrome string.
- Each node stores:
  - `len`: length of palindrome.
  - `link`: suffix link to the longest proper palindromic suffix.
  - `next`: dictionary mapping character to child node (the palindrome formed by adding character to both ends).
  - `count`: number of occurrences (can be updated during construction).

### 25.5.2 Construction

We process characters one by one, maintaining a pointer `last` to the longest suffix palindrome ending at current position.

```python
class Eertree:
    def __init__(self):
        self.nodes = []
        # Node: (length, link, next, count)
        self.nodes.append({'len': -1, 'link': 0, 'next': {}, 'count': 0})  # root -1
        self.nodes.append({'len': 0, 'link': 0, 'next': {}, 'count': 0})   # root 0
        self.s = ['']  # characters, 1-indexed
        self.last = 1  # last node (longest suffix palindrome)

    def get_link(self, node, pos):
        # Follow suffix links until we can add character at position pos
        while True:
            cur_len = self.nodes[node]['len']
            if pos - cur_len - 1 >= 0 and self.s[pos - cur_len - 1] == self.s[pos]:
                break
            node = self.nodes[node]['link']
        return node

    def add_char(self, ch):
        pos = len(self.s)
        self.s.append(ch)
        # Find node to extend from
        curr = self.get_link(self.last, pos)
        # Check if palindrome already exists
        if ch in self.nodes[curr]['next']:
            self.last = self.nodes[curr]['next'][ch]
            self.nodes[self.last]['count'] += 1
            return False
        # Create new node
        new_len = self.nodes[curr]['len'] + 2
        self.nodes.append({'len': new_len, 'link': 0, 'next': {}, 'count': 1})
        new_node = len(self.nodes) - 1
        self.nodes[curr]['next'][ch] = new_node
        # Set suffix link
        if new_len == 1:
            self.nodes[new_node]['link'] = 1  # link to empty string
        else:
            link_node = self.get_link(self.nodes[curr]['link'], pos)
            self.nodes[new_node]['link'] = self.nodes[link_node]['next'][ch]
        self.last = new_node
        return True

    def build(self, s):
        for ch in s:
            self.add_char(ch)
        # Propagate counts along suffix links
        for i in range(len(self.nodes)-1, 1, -1):
            self.nodes[self.nodes[i]['link']]['count'] += self.nodes[i]['count']
```

### 25.5.3 Applications

- **Counting distinct palindromic substrings:** Number of nodes minus 2 (excluding roots).
- **Longest palindrome substring:** Maximum node length.
- **Number of occurrences of each palindrome:** `node['count']` after propagation.
- **Palindrome factorization:** Split a string into the fewest palindromic substrings (the minimum number of palindromic concatenations), using DP over positions with the eertree supplying the palindromic suffixes at each position.
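To see the first three applications on a concrete input, here is a compact, self-contained sketch of the same construction using parallel lists instead of node dictionaries. The function name `palindrome_counts` and the extra `end` list (which remembers where each palindrome first ended, so its text can be recovered) are additions for this example. Indices are 0-based here.

```python
def palindrome_counts(s):
    """Map each distinct palindromic substring of s to its number of occurrences."""
    length = [-1, 0]  # node 0: imaginary root, node 1: empty string
    link = [0, 0]
    nxt = [{}, {}]
    count = [0, 0]
    end = [0, 0]      # exclusive end index of the first occurrence
    last = 1

    def find(node, i):
        # Climb suffix links until s[i] can be wrapped around the palindrome.
        while True:
            j = i - length[node] - 1
            if j >= 0 and s[j] == s[i]:
                return node
            node = link[node]

    for i, ch in enumerate(s):
        cur = find(last, i)
        if ch not in nxt[cur]:
            new_len = length[cur] + 2
            # Compute the suffix link before registering the new node.
            new_link = 1 if new_len == 1 else nxt[find(link[cur], i)][ch]
            length.append(new_len)
            link.append(new_link)
            nxt.append({})
            count.append(0)
            end.append(i + 1)
            nxt[cur][ch] = len(length) - 1
        last = nxt[cur][ch]
        count[last] += 1

    # Nodes are created after their suffix links, so go from newest to oldest.
    for v in range(len(length) - 1, 1, -1):
        count[link[v]] += count[v]
    return {s[end[v] - length[v]:end[v]]: count[v] for v in range(2, len(length))}


result = palindrome_counts("abacaba")
print(len(result))                   # 7 distinct palindromes
print(max(result, key=len))          # abacaba
print(result["a"], result["aba"])    # 4 2
```

Summing all the counts gives the total number of palindromic substrings with multiplicity, which is 12 for this input.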

---

## 25.6 Rolling Hash Variants

Rolling hashes are used for efficient string comparison and substring searching. To avoid collisions, we often use multiple hashes or larger moduli.

### 25.6.1 Polynomial Rolling Hash

The classic polynomial hash:

H(S) = (S[0] * b^(n-1) + S[1] * b^(n-2) + ... + S[n-1]) mod M

Where b is a base (e.g., 131, 137, 91138233) and M is a large prime (e.g., 10^9+7, 10^9+9).

**Properties:**
- Can compute hash of any substring in O(1) using prefix hashes.
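- Collisions are possible. For a randomly chosen base, two distinct strings of length at most n collide with probability at most about n/M, so a large M keeps the risk low. With a fixed, publicly known base and modulus, adversarial inputs can be crafted to collide.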
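The O(1) substring hash follows from the prefix hashes: subtracting the prefix before the substring, shifted by the right power of the base, leaves exactly the substring's polynomial. A minimal single-modulus sketch follows; the constants and function names are illustrative choices (2^61 - 1 is a Mersenne prime, but any large prime works).

```python
MOD = (1 << 61) - 1  # a large prime modulus (assumed choice)
BASE = 131


def build_prefix_hashes(s):
    """prefix[i] is the hash of s[:i]; power[i] is BASE**i mod MOD."""
    prefix = [0] * (len(s) + 1)
    power = [1] * (len(s) + 1)
    for i, ch in enumerate(s):
        prefix[i + 1] = (prefix[i] * BASE + ord(ch)) % MOD
        power[i + 1] = power[i] * BASE % MOD
    return prefix, power


def substring_hash(prefix, power, l, r):
    """Hash of s[l:r] in O(1)."""
    return (prefix[r] - prefix[l] * power[r - l]) % MOD


prefix, power = build_prefix_hashes("abracadabra")
# Both occurrences of "abra" hash to the same value.
print(substring_hash(prefix, power, 0, 4) == substring_hash(prefix, power, 7, 11))  # True
print(substring_hash(prefix, power, 0, 4) == substring_hash(prefix, power, 1, 5))   # False
```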

### 25.6.2 Double Hashing

To virtually eliminate collisions, use two different moduli (e.g., 10^9+7 and 10^9+9) and possibly two bases. The hash is a pair (h1, h2). The chance of both colliding is negligible.

```python
class DoubleHash:
    def __init__(self, s, base1=131, base2=137, mod1=10**9+7, mod2=10**9+9):
        self.base1 = base1
        self.base2 = base2
        self.mod1 = mod1
        self.mod2 = mod2
        self.n = len(s)
        self.pref1 = [0] * (self.n + 1)
        self.pref2 = [0] * (self.n + 1)
        self.pow1 = [1] * (self.n + 1)
        self.pow2 = [1] * (self.n + 1)
        for i, ch in enumerate(s):
            self.pref1[i+1] = (self.pref1[i] * base1 + ord(ch)) % mod1
            self.pref2[i+1] = (self.pref2[i] * base2 + ord(ch)) % mod2
            self.pow1[i+1] = (self.pow1[i] * base1) % mod1
            self.pow2[i+1] = (self.pow2[i] * base2) % mod2

    def get_hash(self, l, r):
        # [l, r) 0-indexed
        h1 = (self.pref1[r] - self.pref1[l] * self.pow1[r-l]) % self.mod1
        h2 = (self.pref2[r] - self.pref2[l] * self.pow2[r-l]) % self.mod2
        return (h1, h2)
```

### 25.6.3 Applications

- **Rabin-Karp** (as seen in Chapter 24).
- **Longest common prefix** via binary search on hash equality.
- **Checking string equality** in O(1) after preprocessing.
- **Palindrome detection:** Compare hash of substring with hash of reverse substring.
- **Anagram detection:** A polynomial hash depends on character order, so hash an order-independent signature instead, such as the sorted string or the vector of character counts.
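Two of these applications, LCP by binary search and palindrome detection, can be sketched as follows. For brevity this uses a single modulus rather than the double hash; the class `PolyHash` and both helper functions are hypothetical names, and every answer is correct only up to hash collisions.

```python
class PolyHash:
    MOD = (1 << 61) - 1  # assumed large prime modulus
    BASE = 131

    def __init__(self, s):
        self.prefix = [0] * (len(s) + 1)
        self.power = [1] * (len(s) + 1)
        for i, ch in enumerate(s):
            self.prefix[i + 1] = (self.prefix[i] * self.BASE + ord(ch)) % self.MOD
            self.power[i + 1] = self.power[i] * self.BASE % self.MOD

    def get(self, l, r):
        """Hash of s[l:r]."""
        return (self.prefix[r] - self.prefix[l] * self.power[r - l]) % self.MOD


def lcp_by_hash(h, i, j, n):
    """Longest common prefix of s[i:] and s[j:], by binary search on its length."""
    lo, hi = 0, n - max(i, j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if h.get(i, i + mid) == h.get(j, j + mid):
            lo = mid       # prefixes of length mid match, try longer
        else:
            hi = mid - 1   # mismatch somewhere within the first mid characters
    return lo


def is_palindrome(fwd, rev, l, r, n):
    """True if s[l:r] equals its reverse; rev hashes the reversed string."""
    # s[l:r] read backwards is reversed_s[n - r:n - l].
    return fwd.get(l, r) == rev.get(n - r, n - l)


s = "abracadabra"
n = len(s)
fwd, rev = PolyHash(s), PolyHash(s[::-1])
print(lcp_by_hash(fwd, 0, 7, n))        # 4  ("abra")
print(is_palindrome(fwd, rev, 3, 6, n)) # True  ("aca")
print(is_palindrome(fwd, rev, 0, 4, n)) # False ("abra")
```

The binary search is valid because prefix equality is monotone: if the first k characters match, so do the first k - 1.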

---

## 25.7 Summary

```
┌──────────────────────┬──────────────┬────────────────────────────────────┐
│ Structure            │ Construction │ Query Capabilities                 │
├──────────────────────┼──────────────┼────────────────────────────────────┤
│ Suffix Tree          │ O(n)         │ Pattern search O(m), LRS, LCS      │
│ Suffix Array + LCP   │ O(n log n)   │ Pattern search O(m log n), LRS     │
│ BWT + FM-index       │ O(n)         │ Count O(m), locate O(m+occ log n)  │
│ Eertree              │ O(n)         │ Palindrome queries                 │
│ Rolling Hash         │ O(n)         │ Substring compare O(1)             │
└──────────────────────┴──────────────┴────────────────────────────────────┘
```

---

## 25.8 Practice Problems

### Suffix Arrays
1. **Longest Common Prefix** (LeetCode 14) – a warm-up on prefixes; for suffix-array practice, try SPOJ SUBST1 (distinct substrings).
2. **Last Substring in Lexicographical Order** (LeetCode 1163) – the answer is the largest suffix, i.e., the last entry of the suffix array.
3. **Number of Distinct Substrings in a String** (LeetCode 1698) – suffix array + LCP.
4. **Longest Repeating Substring** (LeetCode 1062) – suffix array + LCP.

### Suffix Trees
5. **Implement a suffix tree** (optional, advanced).
6. **Find all occurrences of a pattern** – simulate using suffix array.

### BWT
7. **BWT Compression** – implement forward/inverse.
8. **FM-index** – implement basic count query.

### Eertree
9. **Palindromic Substrings** (LeetCode 647) – the problem counts every occurrence, so sum the propagated eertree counts rather than the number of nodes.
10. **Palindrome Pairs** (LeetCode 336) – usually solved with a trie; an eertree can supply the palindromic-suffix checks.

### Rolling Hash
11. **Longest Chunked Palindrome Decomposition** (LeetCode 1147) – rolling hash.
12. **Repeated DNA Sequences** (LeetCode 187) – rolling hash.

---

## 25.9 Further Reading

1. **"Algorithms on Strings, Trees, and Sequences"** by Dan Gusfield – the definitive reference.
2. **"Flexible Pattern Matching in Strings"** by Gonzalo Navarro – covers suffix arrays, BWT, FM-index.
3. **Original Papers**:
   - Ukkonen, E. (1995) – "On-line construction of suffix trees"
   - Manber, U., & Myers, G. (1993) – "Suffix arrays: a new method for on-line string searches"
   - Burrows, M., & Wheeler, D. J. (1994) – "A block-sorting lossless data compression algorithm"
   - Ferragina, P., & Manzini, G. (2000) – "Opportunistic data structures with applications"
   - Rubinchik, M., & Shur, A. M. (2015) – "Eertree: An efficient data structure for processing palindromes in strings"

---

> **Coming in Chapter 26**: **Bit Manipulation** – We'll explore the world of bits, bitwise operators, and clever bit-level tricks.

---

**End of Chapter 25**

