# Chapter 24: String Algorithms

> *"Strings are the DNA of text processing. From pattern matching to data compression, algorithms that operate on strings form the backbone of search engines, bioinformatics, and natural language processing."* — Anonymous

---

## 24.1 Introduction to String Algorithms

String algorithms deal with manipulating, searching, and analyzing sequences of characters. They are fundamental in computer science, with applications ranging from text editing to bioinformatics.

### 24.1.1 Why String Algorithms Matter

```
┌─────────────────────────────────────────────────────────────────────┐
│                   IMPORTANCE OF STRING ALGORITHMS                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. SEARCH ENGINES: Finding occurrences of keywords in documents    │
│                                                                     │
│  2. BIOINFORMATICS: DNA sequence alignment, pattern discovery       │
│                                                                     │
│  3. NATURAL LANGUAGE PROCESSING: Tokenization, stemming, search     │
│                                                                     │
│  4. DATA COMPRESSION: Burrows-Wheeler transform in bzip2            │
│                                                                     │
│  5. PLAGIARISM DETECTION: Finding similar text passages             │
│                                                                     │
│  6. SPELL CHECKERS: Dictionary lookup, approximate matching         │
│                                                                     │
│  7. INTRUSION DETECTION: Pattern matching in network traffic        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### 24.1.2 Basic Definitions

- **String S:** A sequence of characters from an alphabet Σ.
- **Length:** |S|, number of characters.
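- **Substring:** S[i:j], the contiguous characters from index i (inclusive) to j (exclusive).
- **Subsequence:** Characters of S taken in their original order, not necessarily contiguous.
- **Prefix:** S[0:k] for 0 ≤ k ≤ |S|.
- **Suffix:** S[k:|S|] for 0 ≤ k ≤ |S|.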
- **Pattern P:** String we are searching for (length m).
- **Text T:** String we are searching in (length n).
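
The definitions above map directly onto Python slicing. The following is a small illustrative sketch; the helper name `is_subsequence` is an assumption introduced here, not something defined elsewhere in the chapter.

```python
s = "banana"

# Substring: contiguous slice S[i:j] (i inclusive, j exclusive).
middle = s[1:4]  # "ana"

# Prefixes S[0:k] and suffixes S[k:|S|] for 0 <= k <= |S|.
prefixes = [s[:k] for k in range(len(s) + 1)]
suffixes = [s[k:] for k in range(len(s) + 1)]


def is_subsequence(candidate, s):
    """Return True if `candidate` can be obtained from `s` by deleting characters."""
    it = iter(s)
    # `ch in it` advances the iterator until it finds ch (or runs out),
    # so characters must appear in the same relative order.
    return all(ch in it for ch in candidate)
```

For instance, "bnn" is a subsequence of "banana" but not a substring, because its characters are not contiguous in the original.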

---

## 24.2 String Matching

String matching (or pattern matching) is the problem of finding all occurrences of a pattern P in a text T.

### 24.2.1 Naive Pattern Matching

The simplest approach: slide the pattern over the text and compare character by character.

```python
def naive_string_matching(text, pattern):
    n = len(text)
    m = len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        if text[i:i+m] == pattern:
            occurrences.append(i)
    return occurrences
```

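**Time Complexity:** O((n-m+1)*m) in the worst case, which is O(nm). For example, with text "aaaaaaaaa" and pattern "aaaab", every window matches the first four characters before failing on the fifth, so each of the n-m+1 windows costs m comparisons.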
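
To make the worst case concrete, here is a sketch of the same naive scan with an explicit inner loop that counts character comparisons. The function name `naive_match_counting` is hypothetical and only serves this illustration.

```python
def naive_match_counting(text, pattern):
    """Naive matching with an explicit inner loop; also returns the comparison count."""
    n, m = len(text), len(pattern)
    occurrences, comparisons = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break  # mismatch: abandon this window
            j += 1
        if j == m:
            occurrences.append(i)
    return occurrences, comparisons
```

On the text of nine `a` characters with pattern "aaaab" there are five windows and each one performs five comparisons, which matches the (n-m+1)*m bound exactly.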

### 24.2.2 Rabin-Karp Algorithm (Rolling Hash)

Rabin-Karp uses hashing to quickly filter out positions that cannot match. It computes the hash of the pattern and the hash of each length-m substring of the text. Only when hashes match do we verify character-by-character.

**Rolling Hash:** Given a hash of substring T[i:i+m], we can compute hash of T[i+1:i+1+m] in O(1) using a rolling function.

Common rolling hash: base `b` and modulus `M` (e.g., 256 and a large prime).
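
As a minimal sketch of the O(1) update, the snippet below checks that rolling the hash forward gives the same value as recomputing it from scratch. The helper names `direct_hash` and `roll` are assumptions for illustration, and the small modulus 101 is chosen only to keep the numbers readable.

```python
def direct_hash(s, base=256, mod=101):
    """Polynomial hash computed from scratch with Horner's rule."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def roll(h, old_ch, new_ch, high, base=256, mod=101):
    """Slide the window one step: drop old_ch on the left, append new_ch on the right.

    `high` must equal base**(m-1) % mod for window length m.
    """
    return ((h - ord(old_ch) * high) * base + ord(new_ch)) % mod


text, m = "abracadabra", 4
high = pow(256, m - 1, 101)
h = direct_hash(text[:m])
for i in range(1, len(text) - m + 1):
    h = roll(h, text[i - 1], text[i + m - 1], high)
    assert h == direct_hash(text[i:i + m])  # rolled value agrees with recomputation
```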

```python
def rabin_karp(text, pattern):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    base = 256
    prime = 101  # small prime for illustration; prefer a large prime (e.g., 10**9 + 7) to reduce collisions
    
    # Compute hash for pattern and first window of text
    h_pattern = 0
    h_text = 0
    h = 1
    # The value of h would be pow(base, m-1) % prime
    for i in range(m-1):
        h = (h * base) % prime
    
    for i in range(m):
        h_pattern = (h_pattern * base + ord(pattern[i])) % prime
        h_text = (h_text * base + ord(text[i])) % prime
    
    occurrences = []
    for i in range(n - m + 1):
        if h_pattern == h_text:
            # Verify characters
            if text[i:i+m] == pattern:
                occurrences.append(i)
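        # Roll the hash: drop text[i], append text[i+m].
        # Python's % yields a non-negative result for a positive modulus,
        # so no negative-value correction is needed.
        if i < n - m:
            h_text = ((h_text - ord(text[i]) * h) * base + ord(text[i+m])) % prime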
    return occurrences
```

**Time Complexity:** Average O(n + m) with a good hash function; worst-case O(nm) if many hash collisions.

### 24.2.3 Knuth-Morris-Pratt (KMP) Algorithm

KMP avoids re-examining characters by preprocessing the pattern to build a **prefix function** (also called failure function). The prefix function π[i] is the length of the longest proper prefix of pattern[0..i] that is also a suffix of that substring.

**Prefix Function Computation:**

```python
def compute_prefix_function(pattern):
    m = len(pattern)
    pi = [0] * m
    for i in range(1, m):
        j = pi[i-1]
        while j > 0 and pattern[i] != pattern[j]:
            j = pi[j-1]
        if pattern[i] == pattern[j]:
            j += 1
        pi[i] = j
    return pi
```
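
For checking the linear-time routine on small inputs, a brute-force reference that applies the definition literally can be handy. This is only a sketch; `prefix_function_bruteforce` is a name introduced here.

```python
def prefix_function_bruteforce(pattern):
    """Cubic-time reference: for each prefix, try border lengths from longest to shortest."""
    pi = []
    for i in range(len(pattern)):
        sub = pattern[: i + 1]
        best = 0
        for k in range(i, 0, -1):  # proper prefix lengths i, i-1, ..., 1
            if sub[:k] == sub[-k:]:
                best = k
                break
        pi.append(best)
    return pi
```

For "abacaba" the values are `[0, 0, 1, 0, 1, 2, 3]`: the full string has the border "aba" of length 3, while "abac" has no non-empty border.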

**KMP Search:**

```python
def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    occurrences = []
    j = 0  # number of characters matched in pattern
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = pi[j-1]
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            occurrences.append(i - m + 1)
            j = pi[j-1]
    return occurrences
```

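**Time Complexity:** O(n + m). The counter `j` increases by at most one per character, and every fallback `j = pi[j-1]` decreases it, so the total number of fallback steps is bounded by the number of increments. The same argument applies to building the prefix function.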

### 24.2.4 Z-Algorithm

The Z-algorithm computes, for each position i in a string S, the **Z-value** Z[i]: the length of the longest substring starting at S[i] that is also a prefix of S. For pattern matching, concatenate P + '$' + T, where '$' stands for any separator that occurs in neither P nor T, and report the positions whose Z-value equals |P|.

**Z-Array Computation:**

```python
def z_algorithm(s):
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i <= r:
            z[i] = min(r - i + 1, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l = i
            r = i + z[i] - 1
    return z
```
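
A quadratic reference that follows the definition directly is useful for sanity-checking the linear version on short strings. The name `z_bruteforce` is hypothetical, and like the routine above it leaves position 0 at zero.

```python
def z_bruteforce(s):
    """Quadratic reference: extend each match against the prefix character by character."""
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z
```

For "aabxaab" this gives `[0, 1, 0, 0, 3, 1, 0]`: the suffix starting at index 4 ("aab") matches the first three characters of the string.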

**Pattern Matching with Z:**

```python
def z_pattern_match(text, pattern):
    s = pattern + '$' + text
    z = z_algorithm(s)
    m = len(pattern)
    occurrences = []
    for i in range(m + 1, len(s)):
        if z[i] == m:
            occurrences.append(i - m - 1)
    return occurrences
```

**Time Complexity:** O(n + m)

### 24.2.5 Boyer-Moore and Horspool

Boyer-Moore is a highly efficient string matching algorithm that skips sections of the text using two heuristics: **bad character rule** and **good suffix rule**. The simplified **Horspool algorithm** uses only the bad character rule.

**Horspool Algorithm:**

Preprocess the pattern to build a shift table: for each character c, the shift is the distance from the rightmost occurrence of c among the first m-1 pattern characters to the last pattern position. Characters that do not occur there shift by the full pattern length m.

```python
def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
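    if m == 0 or m > n:  # an empty pattern would produce a zero shift and never advance
        return []
    # Build the shift table from all pattern characters except the last;
    # later occurrences overwrite earlier ones, so the rightmost one wins.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    # Characters absent from the table use the default shift m (see shift.get below)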
    occurrences = []
    i = 0
    while i <= n - m:
        if text[i:i+m] == pattern:
            occurrences.append(i)
        # Determine shift
        c = text[i + m - 1]  # last character of current window
        i += shift.get(c, m)
    return occurrences
```

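**Time Complexity:** Worst-case O(nm), for example when both pattern and text consist of a single repeated character. On typical text with a reasonably large alphabet many alignments are skipped, so the number of windows examined is usually well below n and the algorithm is often very fast in practice.

The full **Boyer-Moore** algorithm additionally applies the good suffix rule and advances by the larger of the two suggested shifts.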
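
To see what the Horspool shift table looks like, the sketch below builds it for a single pattern. The function name `horspool_shift_table` is an assumption for illustration; it follows the same rule as the table built inside the search routine above.

```python
def horspool_shift_table(pattern):
    """Shift for each character among the first m-1 positions (rightmost occurrence wins)."""
    m = len(pattern)
    # Later keys overwrite earlier ones, so repeated characters keep their smallest shift.
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


table = horspool_shift_table("BARBER")
# Any character not in the table shifts by the full pattern length.
default_shift = len("BARBER")
```

For "BARBER" the table is `{'B': 2, 'A': 4, 'R': 3, 'E': 1}`. The first `B` would give a shift of 5, but the later `B` at index 3 overwrites it with 2; the final `R` is excluded, so `R` keeps the shift 3 from index 2.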

### 24.2.6 Aho-Corasick Algorithm (Multi-pattern Matching)

Aho-Corasick builds a trie of all patterns and adds failure links (like KMP) to allow simultaneous matching of multiple patterns in one pass over the text.

**Structure:**

- **Trie** of patterns.
- **Failure links:** When a mismatch occurs at node u, follow failure link to longest proper suffix of the current path that is also a prefix of some pattern.
- **Output:** Each node stores which patterns end at that node.

**Implementation sketch:**

```python
class AhoCorasick:
    def __init__(self):
        self.trie = [{}]   # list of dicts mapping char -> next state
        self.fail = [0]     # failure links
        self.output = [[]]  # patterns ending at state

    def add_pattern(self, pattern, idx):
        state = 0
        for ch in pattern:
            if ch not in self.trie[state]:
                self.trie[state][ch] = len(self.trie)
                self.trie.append({})
                self.fail.append(0)
                self.output.append([])
            state = self.trie[state][ch]
        self.output[state].append(idx)

    def build(self):
        from collections import deque
        q = deque()
        # set fail for depth 1 nodes to root
        for ch, next_state in self.trie[0].items():
            self.fail[next_state] = 0
            q.append(next_state)
        while q:
            r = q.popleft()
            for ch, u in self.trie[r].items():
                q.append(u)
                f = self.fail[r]
                while f and ch not in self.trie[f]:
                    f = self.fail[f]
                self.fail[u] = self.trie[f][ch] if ch in self.trie[f] else 0
                self.output[u].extend(self.output[self.fail[u]])

    def search(self, text):
        state = 0
        matches = []
        for i, ch in enumerate(text):
            while state and ch not in self.trie[state]:
                state = self.fail[state]
            if ch in self.trie[state]:
                state = self.trie[state][ch]
            else:
                state = 0
            for pat_idx in self.output[state]:
                matches.append((i, pat_idx))
        return matches
```

**Time Complexity:** O(n + total pattern length + number of matches) – linear in total input.
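
When testing an automaton like the one above, a brute-force reference is convenient. The sketch below (with the hypothetical name `naive_multi_match`) reports `(end_index, pattern_index)` pairs in the same shape as the `search` method, assuming non-empty patterns; the order of pairs sharing an end index may differ.

```python
def naive_multi_match(text, patterns):
    """Quadratic reference: check every pattern at every end position."""
    matches = []
    for end in range(len(text)):
        for idx, pat in enumerate(patterns):
            start = end - len(pat) + 1
            # Skip empty patterns and windows that would start before the text.
            if pat and start >= 0 and text[start:end + 1] == pat:
                matches.append((end, idx))
    return matches
```

With the text "ushers" and patterns `["he", "she", "his", "hers"]`, both "he" and "she" end at index 3 and "hers" ends at index 5.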

---

## 24.3 Advanced String Structures

### 24.3.1 Suffix Trees

A **suffix tree** is a compressed trie of all suffixes of a string. It can be built in O(n) time using Ukkonen's algorithm and supports many string queries in O(m) time.

**Properties:**
- Each edge labeled with a substring.
- Each internal node has at least two children (unless root).
- Leaves correspond to suffixes.
- Total nodes ≤ 2n.

**Ukkonen's Algorithm** is complex; we give a high-level description:

- Build the tree online, one character at a time: after processing a prefix of S, the (implicit) tree contains every suffix of that prefix.
- Uses active point and suffix links to achieve linear time.
- Key operations: extension, split, and suffix link traversal.

**Applications:**
- **Exact string matching:** Find all occurrences of pattern P in O(|P| + occ) time.
- **Longest repeated substring:** Find the internal node with the greatest string depth.
- **Longest common substring** of two strings: Build combined tree with markers.
- **Palindromes, substring occurrences count, etc.**

Due to space, we don't implement full Ukkonen here; many libraries exist.
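
To convey the idea without Ukkonen's machinery, here is a sketch of an uncompressed suffix trie built from nested dictionaries. It takes O(n²) time and space, so it is only a teaching aid and not a suffix tree; the names `build_suffix_trie` and `trie_contains` are assumptions introduced here.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts (quadratic size)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def trie_contains(root, pattern):
    """Substring test in O(|pattern|) dictionary lookups."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Every substring is a prefix of some suffix, which is why walking the pattern from the root answers the membership query. A suffix tree keeps the same query shape while compressing chains of single-child nodes into labeled edges.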

### 24.3.2 Suffix Arrays

A **suffix array** is an array of starting indices of all suffixes of a string, sorted lexicographically. It requires O(n) space and can be built in O(n log n) or O(n) time. Combined with an LCP array, it can answer many queries efficiently.

**Construction (naive, O(n² log n) by sorting the suffixes directly):**

```python
def build_suffix_array(s):
    """Naive O(n^2 log n) for small strings; better algorithms exist."""
    n = len(s)
    suffixes = [(s[i:], i) for i in range(n)]
    suffixes.sort()
    return [idx for _, idx in suffixes]
```

**Manber-Myers (O(n log n)):** Uses doubling technique: sort by first 2^k characters.
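
A compact sketch of the doubling idea follows. It relies on Python's comparison sort in each round, so it runs in O(n log² n) rather than the O(n log n) obtained with radix sorting; the function name `suffix_array_doubling` is hypothetical.

```python
def suffix_array_doubling(s):
    """Prefix doubling: after each round, suffixes are ordered by their first 2k characters."""
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial ranks: order by first character
    k = 1
    while True:
        # Key: rank of the first k characters, then rank of the following k
        # characters (-1 when the suffix is too short to have them).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):
            # Equal keys share a rank; a different key starts the next rank.
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa
```

For "banana" this returns `[5, 3, 1, 0, 4, 2]`, corresponding to the sorted suffixes "a", "ana", "anana", "banana", "na", "nana".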

**SA-IS (O(n)):** Linear-time algorithm by Nong, Zhang, and Chan (2009) – complex.

**LCP Array (Kasai's algorithm, O(n)):**

```python
def build_lcp(s, sa):
    n = len(s)
    rank = [0] * n
    for i, idx in enumerate(sa):
        rank[idx] = i
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i+h] == s[j+h]:
                h += 1
            lcp[rank[i] - 1] = h
            if h > 0:
                h -= 1
    return lcp
```

**Applications:**
- **Pattern matching:** Binary search on suffix array O(m log n).
- **Longest repeated substring:** Max LCP value.
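- **Number of distinct substrings:** n(n+1)/2 minus the sum of all LCP values (each suffix sa[i] contributes n - sa[i] substrings, minus those it shares with its predecessor in sorted order).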
- **Burrows-Wheeler transform:** Built from suffix array.
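
The pattern-matching application can be sketched as two binary searches over the suffix array. The function name `find_occurrences` is an assumption, and the array is built naively here purely to keep the example self-contained.

```python
def find_occurrences(text, pattern):
    """Return sorted start positions of pattern in text via binary search on the suffix array."""
    n, m = len(text), len(pattern)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive construction for illustration

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

All suffixes that begin with the pattern form one contiguous block of the sorted array, so the two bounds delimit exactly the occurrences. Each comparison costs up to m character checks, giving the O(m log n) query bound.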

### 24.3.3 Burrows-Wheeler Transform (BWT)

BWT is a reversible permutation of a string used in data compression (bzip2). It tends to group identical characters together.

**Forward BWT:**
1. Append sentinel character $ (smaller than all others).
2. Form all rotations.
3. Sort rotations lexicographically.
4. Last column of the sorted rotations is the BWT.

```python
def bwt(s):
    s = s + '$'
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()
    bwt_result = ''.join(rot[-1] for rot in rotations)
    return bwt_result
```

**Inverse BWT:** Uses the last column to reconstruct original string via the "LF mapping" property.
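
A minimal sketch of the inversion via LF mapping is shown below, assuming the sentinel `$` does not occur in the input and sorts before every other character used. The forward transform is repeated so the block runs on its own; `inverse_bwt` is a name introduced here.

```python
def bwt(s):
    """Forward transform (same rotation-sorting approach as above)."""
    s = s + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)


def inverse_bwt(last):
    """Recover the original string (without the sentinel) from the last column."""
    n = len(last)
    # A stable sort of positions by character yields the first column:
    # order[i] is the index in `last` holding the i-th first-column character.
    order = sorted(range(n), key=lambda i: last[i])
    # LF mapping: the character at last[j] sits at row lf[j] of the first column.
    lf = [0] * n
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    # Row 0 is the rotation starting with '$', so its last character is the
    # final character of the original string. Walk backwards from there.
    row, out = 0, []
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return ''.join(reversed(out))
```

For "banana" the transform is "annb$aa", and the inverse walk emits the characters in reverse order before the final reversal restores "banana".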

**Applications:**
- Compression (run-length encoding after BWT).
- FM-index (self-index for compressed full-text search).

### 24.3.4 FM-Index

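The FM-index is a compressed full-text index based on the Burrows-Wheeler Transform. Using backward search with rank queries over the BWT (typically supported by wavelet trees or sampled occurrence tables), it answers count queries in time proportional to the pattern length rather than the text length; locate queries additionally use a sampled suffix array. It is used in bioinformatics read aligners (e.g., Bowtie, BWA).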
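
The counting query can be sketched with backward search. The version below is deliberately uncompressed: rank queries are answered by scanning the BWT, so it illustrates the logic rather than the performance of a real FM-index. The names `fm_count`, `C`, and `occ` are assumptions, and the pattern is assumed to be non-empty and free of the sentinel `$`.

```python
from collections import Counter


def fm_count(text, pattern):
    """Count occurrences of pattern in text using backward search over the BWT."""
    s = text + '$'
    rows = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = ''.join(s[i - 1] for i in rows)  # BWT: character preceding each rotation

    # C[c] = number of characters in s that are strictly smaller than c.
    counts = Counter(s)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        """Occurrences of c in last[:i] (naive linear-time rank)."""
        return last[:i].count(c)

    lo, hi = 0, len(s)  # half-open range of rows matching the suffix processed so far
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character narrows the row range with two rank queries, which is where the dependence on pattern length alone comes from once rank is answered quickly.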

### 24.3.5 Palindromic Trees (Eertree)

A **palindromic tree** (or Eertree) is a data structure for storing all distinct palindromic substrings of a string. It can be built in O(n) time. With one extra pass that propagates counts along suffix links, it also yields the number of occurrences of each palindrome, and it can be used to solve many palindrome-related problems.

**Structure:**
- Two roots: root -1 (length -1, odd root) and root 0 (length 0, even root).
- Each node represents a palindrome.
- Edges represent adding a character to both ends of current palindrome.
- Suffix links point to the longest proper palindromic suffix.

**Eertree construction** is involved; we present a simplified implementation (following standard references):

```python
class Eertree:
    def __init__(self):
        self.nodes = []
        # Node: length, link, next (dict char->node),
        # count (times this node was the longest palindromic suffix)
        self.nodes.append({'len': -1, 'link': 0, 'next': {}, 'count': 0})  # root -1
        self.nodes.append({'len': 0, 'link': 0, 'next': {}, 'count': 0})   # root 0
        self.s = ['']  # characters (1-indexed)
        self.last = 1  # last node (longest suffix palindrome)
    
    def get_link(self, node, pos):
        # Follow suffix links until we can add char at position pos
        while True:
            cur_len = self.nodes[node]['len']
            if pos - cur_len - 1 >= 0 and self.s[pos - cur_len - 1] == self.s[pos]:
                break
            node = self.nodes[node]['link']
        return node
    
    def add_char(self, ch, pos):
        self.s.append(ch)
        # Find node to extend from
        curr = self.get_link(self.last, pos)
        # Check if palindrome already exists
        if ch in self.nodes[curr]['next']:
            self.last = self.nodes[curr]['next'][ch]
            self.nodes[self.last]['count'] += 1
            return False
        # Create new node
        new_len = self.nodes[curr]['len'] + 2
        self.nodes.append({'len': new_len, 'link': 0, 'next': {}, 'count': 1})
        new_node = len(self.nodes) - 1
        self.nodes[curr]['next'][ch] = new_node
        # Set suffix link
        if new_len == 1:
            self.nodes[new_node]['link'] = 1  # link to empty string node
        else:
            link_node = self.get_link(self.nodes[curr]['link'], pos)
            self.nodes[new_node]['link'] = self.nodes[link_node]['next'][ch]
        self.last = new_node
        return True
```

**Applications:** Counting distinct palindromes, longest palindrome, palindrome factorization.
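
A brute-force reference is useful when validating an Eertree on short strings. Assuming the class above is fed characters with positions 1..n, the number of non-root nodes should equal the size of the set returned here; the function name `distinct_palindromes_bruteforce` is hypothetical.

```python
def distinct_palindromes_bruteforce(s):
    """Collect every distinct non-empty palindromic substring (cubic time)."""
    found = set()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            if sub == sub[::-1]:
                found.add(sub)
    return found
```

For "abacaba" the set contains seven palindromes: "a", "b", "c", "aba", "aca", "bacab", and "abacaba". A string of length n never has more than n distinct non-empty palindromic substrings, which is why the tree stays linear in size.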

---

## 24.4 Rolling Hash Variants

Rolling hashes are used in string matching (Rabin-Karp) and many other contexts. To make collisions unlikely, we often use **polynomial hashing** with large moduli, and **double hashing** (two independent hashes compared as a pair) when a single modulus is not enough.

**Polynomial rolling hash:**

H(S) = (S[0] * b^(m-1) + S[1] * b^(m-2) + ... + S[m-1]) mod M

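Where b is a base (e.g., 131, 137, 91138233) and M is a large prime (e.g., 10^9+7, 10^9+9). Double hashing uses two different moduli, which makes accidental collisions extremely unlikely, although adversarially chosen inputs can still defeat fixed, publicly known parameters.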

**Implementation of double rolling hash:**

```python
class DoubleRollingHash:
    def __init__(self, s, base1=131, base2=137, mod1=10**9+7, mod2=10**9+9):
        self.base1 = base1
        self.base2 = base2
        self.mod1 = mod1
        self.mod2 = mod2
        self.n = len(s)
        self.pref1 = [0] * (self.n + 1)
        self.pref2 = [0] * (self.n + 1)
        self.pow1 = [1] * (self.n + 1)
        self.pow2 = [1] * (self.n + 1)
        for i in range(1, self.n + 1):
            self.pref1[i] = (self.pref1[i-1] * base1 + ord(s[i-1])) % mod1
            self.pref2[i] = (self.pref2[i-1] * base2 + ord(s[i-1])) % mod2
            self.pow1[i] = (self.pow1[i-1] * base1) % mod1
            self.pow2[i] = (self.pow2[i-1] * base2) % mod2

    def get_hash(self, l, r):  # [l, r) 0-indexed
        h1 = (self.pref1[r] - self.pref1[l] * self.pow1[r-l]) % self.mod1
        h2 = (self.pref2[r] - self.pref2[l] * self.pow2[r-l]) % self.mod2
        return (h1, h2)
```

**Applications:**
- String matching (Rabin-Karp)
- Longest common prefix via binary search + hash
- Checking string equality in O(1) after preprocessing
- Palindrome detection in O(1) with forward and reverse hashes
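
The "longest common prefix via binary search + hash" application can be sketched as follows. To stay short, this version uses a single modulus (the Mersenne prime 2^61 - 1) instead of the double hash shown above; the names `PrefixHash` and `lcp_by_hash` are assumptions introduced for the example, and the result is correct only up to hash collisions.

```python
class PrefixHash:
    """Prefix hashes of s, allowing O(1) hash of any substring s[l:r]."""

    def __init__(self, s, base=131, mod=(1 << 61) - 1):
        self.mod = mod
        self.pref = [0] * (len(s) + 1)
        self.pw = [1] * (len(s) + 1)
        for i, ch in enumerate(s):
            self.pref[i + 1] = (self.pref[i] * base + ord(ch)) % mod
            self.pw[i + 1] = (self.pw[i] * base) % mod

    def get(self, l, r):
        """Hash of s[l:r] (half-open, 0-indexed)."""
        return (self.pref[r] - self.pref[l] * self.pw[r - l]) % self.mod


def lcp_by_hash(h, i, j, n):
    """Longest common prefix of s[i:] and s[j:], found by binary search on the length."""
    lo, hi = 0, n - max(i, j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if h.get(i, i + mid) == h.get(j, j + mid):
            lo = mid      # prefixes of length mid agree (up to collisions)
        else:
            hi = mid - 1  # they differ somewhere within the first mid characters
    return lo
```

The binary search works because equality of length-k prefixes is monotone: if the first k characters agree, so do the first k-1. Each probe costs O(1), so one LCP query takes O(log n).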

---

## 24.5 Summary

```
┌──────────────────────┬──────────────┬────────────────────────────────┐
│ Algorithm/Structure  │ Preprocess   │ Query                          │
├──────────────────────┼──────────────┼────────────────────────────────┤
│ Naive                │ None         │ O(nm)                          │
│ Rabin-Karp           │ O(m)         │ O(n+m) avg; O(nm) worst        │
│ KMP                  │ O(m)         │ O(n)                           │
│ Z                    │ None         │ O(n+m) (Z-array of P$T)        │
│ Boyer-Moore/Horspool │ O(m+|Σ|)     │ Sublinear avg; O(nm) worst     │
│ Aho-Corasick         │ O(total m)   │ O(n + matches)                 │
│ Suffix Tree          │ O(n)         │ O(m + occ)                     │
│ Suffix Array + LCP   │ O(n log n)   │ O(m log n)                     │
│ BWT                  │ O(n)         │ N/A (used for compression)     │
│ Eertree              │ O(n)         │ Palindrome queries             │
└──────────────────────┴──────────────┴────────────────────────────────┘
```

---

## 24.6 Practice Problems

### Pattern Matching
1. **Implement strStr()** (LeetCode 28) – use KMP or Rabin-Karp.
2. **Repeated String Match** (LeetCode 686)
3. **Shortest Palindrome** (LeetCode 214) – KMP on reversed string.
4. **Find All Anagrams in a String** (LeetCode 438) – sliding window + hash.
5. **Longest Happy Prefix** (LeetCode 1392) – KMP prefix function.

### Suffix Arrays / Trees
6. **Longest Repeating Substring** (LeetCode 1062) – use suffix array + LCP.
7. **Longest Common Substring** (SPOJ LCS) – suffix array.
8. **Number of Distinct Substrings** (LeetCode 1698) – use suffix array/tree.
9. **Shortest Unique Substring** (not on LC) – suffix array.

### Advanced
10. **Palindrome Pairs** (LeetCode 336) – use trie or rolling hash.
11. **Count Different Palindromic Subsequences** (LeetCode 730) – DP + set.
12. **Word Break II** (LeetCode 140) – DP with trie for efficiency.
13. **Stream of Characters** (LeetCode 1032) – Aho-Corasick.

---

## 24.7 Further Reading

1. **"Algorithms on Strings, Trees, and Sequences"** by Dan Gusfield – comprehensive.
2. **"Introduction to Algorithms" (CLRS)** – Chapter 32 (String Matching).
3. **"The Algorithm Design Manual"** by Steven Skiena – see the catalog chapter on set and string problems.
4. **"Flexible Pattern Matching in Strings"** by Navarro & Raffinot.
5. **Original Papers**:
   - Knuth, Morris, Pratt (1977) – "Fast pattern matching in strings"
   - Boyer, Moore (1977) – "A fast string searching algorithm"
   - Aho, Corasick (1975) – "Efficient string matching: an aid to bibliographic search"
   - Ukkonen (1995) – "On-line construction of suffix trees"
   - Burrows, Wheeler (1994) – "A block-sorting lossless data compression algorithm"

---

> **Coming in Chapter 25**: **Advanced String Structures** – More on suffix arrays, suffix trees, and their applications.

**End of Chapter 24**

<div style='width:100%; display:flex; justify-content:space-between; align-items:center; margin: 1em 0;'>
  <a href='../6. algorithmic_paradigms_techniques/23. dynamic_programming.ipynb' style='font-weight:bold; font-size:1.05em;'>&larr; Previous</a>
  <a href='../TOC.md' style='font-weight:bold; font-size:1.05em; text-align:center;'>Table of Contents</a>
  <a href='25. advanced_string_structures.ipynb' style='font-weight:bold; font-size:1.05em;'>Next &rarr;</a>
</div>
