# Chapter 23: String Algorithms

## String Operations

Several operations on strings come up repeatedly:

### Substrings

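A **substring** of a string S is a contiguous sequence of characters taken from S, written S[i...j]

Every string is a substring of itself, but not a **proper substring**: a proper substring of S is any substring other than S itself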
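As a small illustration of these definitions, the sketch below enumerates the non-empty substrings of a string and tests for a proper substring. The function names `substrings` and `is_proper_substring` are hypothetical and chosen only for this example.

```python
def substrings(s):
    """Return the set of all non-empty substrings of s."""
    n = len(s)
    # s[i:j] is the contiguous run of characters s[i], ..., s[j-1].
    return {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def is_proper_substring(p, s):
    """True if p is a substring of s and p is not s itself."""
    return p in s and p != s


print(sorted(substrings("abc")))          # ['a', 'ab', 'abc', 'b', 'bc', 'c']
print(is_proper_substring("ab", "abc"))   # True
print(is_proper_substring("abc", "abc"))  # False
```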

### Pattern Matching

Given a text string T of length n and a pattern string P of length m, determine whether P is a substring of T. A match is an index i such that the substring of T starting at i equals P, that is, P = T[i...i+m-1]

### Brute-force Pattern Matching

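In general, a brute-force approach enumerates all possible configurations of the inputs and picks the best of them. For pattern matching, this means testing every possible placement of P against T: for each of the n - m + 1 starting indices i, compare P with T[i...i+m-1] character by character, stopping at the first mismatch. The approach is simple but inefficient, taking $O(nm)$ time in the worst case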
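A minimal sketch of brute-force pattern matching, which tries each starting index of T in turn. The name `find_brute` and the convention of returning -1 when there is no match are assumptions made for this example.

```python
def find_brute(T, P):
    """Return the lowest index of T at which P begins, or -1 if none."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):           # every possible placement of P
        k = 0
        while k < m and T[i + k] == P[k]:
            k += 1                       # characters agree so far
        if k == m:                       # reached the end of P: match at i
            return i
    return -1


print(find_brute("abacaabaccabacabaabb", "abacab"))  # 10
```

The nested loops make the $O(nm)$ bound visible: the outer loop runs at most n - m + 1 times and the inner loop at most m times per placement.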

## Boyer-Moore Algorithm

A pattern matching algorithm that improves running times by using the following two heuristics:

- **looking-glass heuristic**: when testing a possible placement of P against T, begin the comparisons from the end of P and move backward to the front of P
- **character-jump heuristic**: during the testing of a possible placement of P against T, a mismatch of text character T[i] = c with the pattern character P[j] is handled as follows:
  - if c is not in P: shift P completely past T[i]
  - else: shift P until the last occurrence of c in P gets aligned with T[i]; if that occurrence lies to the right of the mismatch, shift P by one position instead

With only these two heuristics, the worst-case running time is $O(nm + |\Sigma|)$, where $\Sigma$ is the alphabet
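The two heuristics can be sketched as follows. This is a simplified version that uses only the character-jump rule, based on the last occurrence of each character in P. The names `find_boyer_moore` and `last` are assumptions for this example, and the table is stored as a dictionary rather than an array indexed by the alphabet.

```python
def find_boyer_moore(T, P):
    """Return the lowest index of T at which P begins, or -1 if none."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    # last[c] is the index of the last occurrence of c in P.
    last = {}
    for k, ch in enumerate(P):
        last[ch] = k
    i = m - 1                            # index into T
    k = m - 1                            # index into P (looking-glass: start at the end)
    while i < n:
        if T[i] == P[k]:
            if k == 0:
                return i                 # whole pattern matched, starting at i
            i -= 1                       # keep comparing, moving backward
            k -= 1
        else:
            j = last.get(T[i], -1)       # -1 if T[i] does not occur in P
            i += m - min(k, j + 1)       # character-jump shift
            k = m - 1                    # restart at the end of P
    return -1


print(find_boyer_moore("abacaabaccabacabaabb", "abacab"))  # 10
```

When `T[i]` does not occur in P, `j` is -1 and the shift moves P completely past `T[i]`; when the last occurrence lies to the right of the mismatch, `min(k, j + 1)` equals `k` and P moves by one position.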

## Knuth-Morris-Pratt Algorithm

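Pre-processes the pattern string P to compute a **failure function** f that indicates the proper shift of P so that, to the largest extent possible, previously performed comparisons can be reused. Specifically, f(k) is the length of the longest proper prefix of P[0...k] that is also a suffix of P[0...k]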

This algorithm runs in $O(n+m)$ time
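A minimal sketch of the failure function and the matching loop that uses it. The names `compute_failure` and `find_kmp` are assumptions for this example; `fail[k]` holds the length of the longest proper prefix of `P[0..k]` that is also a suffix of it.

```python
def compute_failure(P):
    """Return the KMP failure table for pattern P."""
    m = len(P)
    fail = [0] * m
    j, k = 1, 0
    while j < m:
        if P[j] == P[k]:                 # k + 1 characters match a prefix of P
            fail[j] = k + 1
            j += 1
            k += 1
        elif k > 0:                      # fall back to a shorter prefix
            k = fail[k - 1]
        else:                            # no prefix matches ending at j
            j += 1
    return fail


def find_kmp(T, P):
    """Return the lowest index of T at which P begins, or -1 if none."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    fail = compute_failure(P)
    j = k = 0                            # j indexes T, k indexes P
    while j < n:
        if T[j] == P[k]:
            if k == m - 1:
                return j - m + 1         # match complete
            j += 1
            k += 1
        elif k > 0:
            k = fail[k - 1]              # reuse the comparisons already made
        else:
            j += 1
    return -1


print(compute_failure("abacab"))                     # [0, 0, 1, 0, 1, 2]
print(find_kmp("abacaabaccabacabaabb", "abacab"))    # 10
```

The index `j` into T never moves backward, which is what keeps the scan linear in n once the table has been built in time linear in m.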

## Tries

**Trie**: tree-based data structure used for storing strings in order to support fast pattern matching and prefix matching

### Standard Tries

Let S be a set of s strings from alphabet $\Sigma$, such that no string in S is a prefix of another string. A **standard trie** for S is an ordered tree T with the following properties:

- each node of T, except the root, is labeled with a character of $\Sigma$
- the ordering of the children of an internal node of T is determined by a canonical ordering of the alphabet $\Sigma$
- T has s external nodes, each associated with a string of S, such that the concatenation of the labels of the nodes on the path from the root to an external node v of T yields the string of S associated with v

A standard trie storing a collection S of s strings of total length n from an alphabet of size d has the following properties:

- every internal node of T has at most d children
- T has s external nodes
- the height of T is equal to the length of the longest string in S
- the number of nodes of T is $O(n)$
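One possible way to sketch a standard trie in Python is shown below. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `starts_with`, `node_count`) are hypothetical. Children are kept in a dictionary, so the canonical ordering of the alphabet is applied only when the children are traversed in sorted order. The `word` field marks the node that ends a stored string.

```python
class TrieNode:
    def __init__(self):
        self.children = {}               # character -> TrieNode
        self.word = None                 # the stored string ending here, if any


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.word == word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def node_count(self):
        """Count all nodes, including the root."""
        count, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            count += 1
            stack.extend(node.children.values())
        return count


t = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(t.contains("bull"), t.contains("bu"), t.starts_with("bu"))  # True False True
print(t.node_count())                                             # 22
```

The eight strings have total length 31, and the trie has 22 nodes because shared prefixes such as "be" and "sto" are stored once.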

### Compressed Tries

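A **compressed trie** is similar to a standard trie, but every internal node has at least two children. An internal node v of T is **redundant** if v has one child and is not the root. Each chain of redundant nodes is merged into a single edge labeled with a substring, so the number of nodes is $O(s)$ rather than $O(n)$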
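A minimal sketch of the compression step, assuming the trie is represented as nested dictionaries in which each key labels a child. The names `build_trie` and `compress` are hypothetical, and the sketch relies on the earlier assumption that no stored string is a prefix of another.

```python
def build_trie(words):
    """Build a standard trie as nested dicts: label -> subtrie."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a copy of the trie with every chain of redundant nodes merged."""
    out = {}
    for label, child in node.items():
        # A child with exactly one child of its own is redundant:
        # absorb that single grandchild's label into this label.
        while len(child) == 1:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
print(compress(build_trie(words)))
```

For these eight words the result has the labels `b`, `e`, `ar`, `ll`, `id`, `u`, `ll`, `y` under the first branch and `s`, `ell`, `to`, `ck`, `p` under the second, so every internal node has at least two children.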

### Suffix Tries

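A **suffix trie** for a string X is a trie whose collection S consists of all the suffixes of X. Storing all suffixes explicitly would take $O(n^2)$ space, with n being the length of X, but a compressed trie in which each node stores a pair of indices into X instead of a substring takes only $O(n)$ space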

Pattern matching queries on X can be completed in $O(dm)$ time, where d is the alphabet size and m is the length of the pattern

Suffix tries may be constructed in $O(dn)$ time
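To illustrate how a pattern query works, the sketch below builds an uncompressed suffix trie by inserting every suffix one at a time. This naive construction takes quadratic time and space in the length of X, so it does not achieve the bounds stated above. The names `build_suffix_trie` and `occurs` are hypothetical.

```python
def build_suffix_trie(X):
    """Naively build a trie (nested dicts) of all suffixes of X."""
    root = {}
    for i in range(len(X)):
        node = root
        for ch in X[i:]:                 # insert the suffix starting at i
            node = node.setdefault(ch, {})
    return root


def occurs(root, P):
    """True if P is a substring of the string the trie was built from."""
    node = root
    for ch in P:                         # one step down the trie per character
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(occurs(trie, "nimi"))   # True
print(occurs(trie, "mize"))   # True
print(occurs(trie, "zim"))    # False
```

The query works because every substring of X is a prefix of some suffix of X, so P occurs in X exactly when P can be traced from the root.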