| 1 | +# Z-Algorithm String Search |
| 2 | + |
| 3 | +Goal: Write a simple linear-time string matching algorithm in Swift that returns the indexes of all the occurrences of a given pattern. |
| 4 | + |
| 5 | +In other words, we want to implement an `indexesOf(pattern: String)` extension on `String` that returns an array `[Int]` holding the index of every occurrence of the search pattern, or `nil` if the pattern could not be found inside the string. |
| 6 | + |
| 7 | +For example: |
| 8 | + |
| 9 | +```swift |
| 10 | +let str = "Hello, playground!" |
| 11 | +str.indexesOf(pattern: "ground") // Output: [11] |
| 12 | + |
| 13 | +let traffic = "🚗🚙🚌🚕🚑🚐🚗🚒🚚🚎🚛🚐🏎🚜🚗🏍🚒🚲🚕🚓🚌🚑" |
| 14 | +traffic.indexesOf(pattern: "🚑") // Output: [4, 21] |
| 15 | +``` |
| 16 | + |
| 17 | +Many string search algorithms use a pre-processing function to compute a table that is used in a later stage. This table saves time during the pattern search stage because it lets the algorithm skip unneeded character comparisons. The [Z-Algorithm]() is one of these functions. It was born as a pattern pre-processing function (this is its role in the [Knuth-Morris-Pratt algorithm](../Knuth-Morris-Pratt/) and others) but, as we will show here, it can also be used on its own as a string search algorithm. |
| 18 | + |
| 19 | +### Z-Algorithm as pattern pre-processor |
| 20 | + |
| 21 | +As we said, the Z-Algorithm is first and foremost an algorithm that processes a pattern in order to compute a skip-comparisons table. |
| 22 | +Running the Z-Algorithm over a pattern `P` produces an array of integers (called `Z` in the literature) in which each element `Z[i]` is the length of the longest substring of `P` that starts at `i` and matches a prefix of `P`. In simpler words, `Z[i]` records the length of the longest prefix of `P[i...|P|]` that is also a prefix of `P`. As an example, consider `P = "ffgtrhghhffgtggfredg"`. We have `Z[5] = 0 (f...h...)`, `Z[9] = 4 (ffgtr...ffgtg...)` and `Z[15] = 1 (ff...fr...)`. |
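
To check values like these by hand, the definition can be translated directly into code. The following is a minimal sketch, not the linear-time algorithm described in this article: `naiveZArray` is a hypothetical helper name, it runs in quadratic time, and it leaves `Z[0]` at `0`, which is an assumption chosen to match the function shown later.

```swift
// Hypothetical helper (not part of the original code): computes the Z-array
// straight from the definition by comparing each suffix with the prefix.
// Runs in O(n^2) time, so it is only meant for checking small examples.
func naiveZArray(_ text: String) -> [Int] {
    let chars = Array(text)
    var z = [Int](repeating: 0, count: chars.count)

    // Nothing to compare for empty or single-character strings.
    guard chars.count > 1 else { return z }

    for i in 1 ..< chars.count {
        var length = 0
        // Extend the match between the suffix starting at i and the prefix.
        while i + length < chars.count && chars[i + length] == chars[length] {
            length += 1
        }
        z[i] = length
    }
    return z
}

let z = naiveZArray("ffgtrhghhffgtggfredg")
print(z[5], z[9], z[15]) // prints "0 4 1"
```

A brute-force version like this is also handy as a reference when testing a faster implementation on short inputs.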
| 23 | + |
| 24 | +But how do we compute `Z`? Before we describe the algorithm we must introduce the concept of a Z-box. A Z-box is a pair `(left, right)` used during the computation that records the substring of maximal length that also occurs as a prefix of `P`. The two indices `left` and `right` are, respectively, the left-end index and the right-end index of this substring. |
| 25 | +The definition of the Z-Algorithm is inductive: it computes the elements of the array for every position `k` in the pattern, starting from `k = 1`. The following values (`Z[k + 1]`, `Z[k + 2]`, ...) are computed after `Z[k]`. The idea behind the algorithm is that previously computed values can speed up the computation of `Z[k + 1]` by avoiding character comparisons that were already done. Consider this example: suppose we are at iteration `k = 100`, so we are analyzing position `100` of the pattern. All the values between `Z[1]` and `Z[99]` were correctly computed, and `left = 70` and `right = 120`. This means that there is a substring of length `51`, starting at position `70` and ending at position `120`, that matches the prefix of the pattern/string we are considering. Reasoning on it a little, we can say that the substring of length `21` starting at position `100` matches the substring of length `21` starting at position `30` of the pattern (because we are inside a substring that matches a prefix of the pattern). So we can use `Z[30]` to compute `Z[100]` without repeating those character comparisons. |
| 26 | +This is a simple description of the idea behind the algorithm. There are a few cases in which the pre-computed values cannot be applied directly and some comparisons still have to be made. |
| 27 | + |
| 28 | +Here is the code of the function that computes the Z-array: |
| 29 | +```swift |
| 31 | +func ZetaAlgorithm(ptrn: String) -> [Int]? { |
| 32 | + |
| 33 | +    let pattern = Array(ptrn) // Array of Characters, for constant-time indexing |
| 34 | +    let patternLength: Int = pattern.count |
| 35 | + |
| 36 | +    guard patternLength > 0 else { |
| 37 | +        return nil |
| 38 | +    } |
| 39 | + |
| 40 | +    var zeta: [Int] = [Int](repeating: 0, count: patternLength) |
| 41 | + |
| 42 | +    var left: Int = 0 |
| 43 | +    var right: Int = 0 |
| 44 | +    var k_1: Int = 0 |
| 45 | +    var betaLength: Int = 0 |
| 46 | +    var textIndex: Int = 0 |
| 47 | +    var patternIndex: Int = 0 |
| 48 | + |
| 49 | +    for k in 1 ..< patternLength { |
| 50 | +        if k > right {  // Outside a Z-box: compare the characters until mismatch |
| 51 | +            patternIndex = 0 |
| 52 | + |
| 53 | +            while k + patternIndex < patternLength && |
| 54 | +                pattern[k + patternIndex] == pattern[patternIndex] { |
| 55 | +                patternIndex = patternIndex + 1 |
| 56 | +            } |
| 57 | + |
| 58 | +            zeta[k] = patternIndex |
| 59 | + |
| 60 | +            if zeta[k] > 0 { |
| 61 | +                left = k |
| 62 | +                right = k + zeta[k] - 1 |
| 63 | +            } |
| 64 | +        } else {  // Inside a Z-box |
| 65 | +            k_1 = k - left + 1          // 1-based position of k inside the Z-box |
| 66 | +            betaLength = right - k + 1  // Characters left in the Z-box from k |
| 67 | + |
| 68 | +            if zeta[k_1 - 1] < betaLength { // Entirely inside a Z-box: we can use the values computed before |
| 69 | +                zeta[k] = zeta[k_1 - 1] |
| 70 | +            } else { // Not entirely inside a Z-box: we must proceed with comparisons too |
| 71 | +                textIndex = betaLength |
| 72 | +                patternIndex = right + 1 |
| 73 | + |
| 74 | +                while patternIndex < patternLength && pattern[textIndex] == pattern[patternIndex] { |
| 75 | +                    textIndex = textIndex + 1 |
| 76 | +                    patternIndex = patternIndex + 1 |
| 77 | +                } |
| 78 | + |
| 79 | +                zeta[k] = patternIndex - k |
| 80 | +                left = k |
| 81 | +                right = patternIndex - 1 |
| 82 | +            } |
| 83 | +        } |
| 84 | +    } |
| 85 | +    return zeta |
| 86 | +} |
| 87 | +``` |
| 88 | + |
| 89 | +Let's walk through an example using the code above. Consider the string `P = "abababbb"`. The algorithm begins with `k = 1` and `left = right = 0`. No Z-box is "active", so because `k > right` we start with the character comparison between `P[1]` and `P[0]`. |
| 90 | + |
| 91 | + |
| 92 | +           01234567 |
| 93 | +    k:      x |
| 94 | +           abababbb |
| 95 | +           x |
| 96 | +    Z:     00000000 |
| 97 | +    left:  0 |
| 98 | +    right: 0 |
| 99 | + |
| 100 | +We have a mismatch at the first comparison, so the substring starting at `P[1]` does not match a prefix of `P`. We set `Z[1] = 0` and leave `left` and `right` untouched. We begin another iteration with `k = 2`: we have `2 > 0`, so again we start comparing characters, `P[2]` with `P[0]`. This time the characters match, so we continue the comparisons until a mismatch occurs. It happens at position `6`. Four characters matched, so we set `Z[2] = 4`, `left = k = 2` and `right = k + Z[k] - 1 = 5`. We have our first Z-box: the substring `"abab"` (notice that it matches a prefix of `P`) starting at position `left = 2`. |
| 101 | + |
| 102 | +           01234567 |
| 103 | +    k:       x |
| 104 | +           abababbb |
| 105 | +           x |
| 106 | +    Z:     00400000 |
| 107 | +    left:  2 |
| 108 | +    right: 5 |
| 109 | + |
| 110 | +We then proceed with `k = 3`. We have `3 <= 5`, so we are inside the Z-box found previously and therefore inside a copy of a prefix of `P`. We can look for a position that has a previously computed value. We calculate `k_1 = k - left = 1`, the index of the prefix character that corresponds to `P[k]` (the code stores this value as `k - left + 1` and reads `zeta[k_1 - 1]`, which is the same element). We check `Z[1] = 0`, and since `0 < (right - k + 1 = 3)` the match lies entirely inside the Z-box. We can use the previously computed value, so we set `Z[3] = Z[1] = 0`; `left` and `right` remain unchanged. |
| 111 | +At iteration `k = 4` we again execute the `else` branch of the outer `if`. In the inner `if` we have `k_1 = 2` and `(Z[2] = 4) >= (right - k + 1 = 2)`. So the substring `P[k...right]` matches the prefix of `P` for `right - k + 1 = 2` characters, but we do not yet know about the characters that follow. We must then compare the characters starting at `right + 1 = 6` with those starting at `right - k + 1 = 2`. We have `P[6] != P[2]`, so we set `Z[k] = 6 - 4 = 2`, `left = 4` and `right = 5`. |
| 112 | + |
| 113 | +           01234567 |
| 114 | +    k:         x |
| 115 | +           abababbb |
| 116 | +             x |
| 117 | +    Z:     00402000 |
| 118 | +    left:  4 |
| 119 | +    right: 5 |
| 120 | + |
| 121 | +At iteration `k = 5` we have `k <= right` and then `(Z[k_1] = 0) < (right - k + 1 = 1)`, so we set `Z[k] = 0`. In iterations `6` and `7` we execute the first branch of the outer `if` but we only find mismatches, so the algorithm terminates returning the Z-array `Z = [0, 0, 4, 0, 2, 0, 0, 0]`. |
| 122 | + |
| 123 | +The Z-Algorithm runs in linear time. More specifically, the Z-Algorithm for a string `P` of size `n` has a running time of `O(n)`. |
| 124 | + |
| 125 | +The implementation of Z-Algorithm as string pre-processor is contained in the [ZAlgorithm.swift](./ZAlgorithm.swift) file. |
| 126 | + |
| 127 | +### Z-Algorithm as string search algorithm |
| 128 | + |
| 129 | +The Z-Algorithm discussed above leads to the simplest linear-time string matching algorithm. To obtain it, we simply concatenate the pattern `P` and the text `T` into a string `S = P$T`, where `$` is a character that appears in neither `P` nor `T`. Then we run the algorithm on `S` to obtain the Z-array. All we have to do now is scan the Z-array looking for elements equal to `n` (the pattern length). Each such element marks an occurrence of the pattern in the text. |
| 130 | + |
| 131 | +```swift |
| 132 | +extension String { |
| 133 | + |
| 134 | +    func indexesOf(pattern: String) -> [Int]? { |
| 135 | +        let patternLength: Int = pattern.count |
| 136 | +        /* Let's calculate the Z-Algorithm on the concatenation of pattern, separator and text */ |
| 137 | +        let zetaArray = ZetaAlgorithm(ptrn: pattern + "💲" + self) |
| 138 | + |
| 139 | +        guard patternLength > 0, let zeta = zetaArray else { // An empty pattern has no occurrences to report |
| 140 | +            return nil |
| 141 | +        } |
| 142 | + |
| 143 | +        var indexes: [Int] = [Int]() |
| 144 | + |
| 145 | +        /* Scan the zeta array to find matched patterns */ |
| 146 | +        for i in 0 ..< zeta.count { |
| 147 | +            if zeta[i] == patternLength { |
| 148 | +                indexes.append(i - patternLength - 1) // Skip the pattern and the separator |
| 149 | +            } |
| 150 | +        } |
| 151 | + |
| 152 | +        guard !indexes.isEmpty else { |
| 153 | +            return nil |
| 154 | +        } |
| 155 | + |
| 156 | +        return indexes |
| 157 | +    } |
| 158 | +} |
| 159 | +``` |
| 160 | + |
| 161 | +Let's look at an example. Let `P = "CATA"` and `T = "GAGAACATACATGACCAT"` be the pattern and the text. We concatenate them with the character `$` and get the string `S = "CATA$GAGAACATACATGACCAT"`. After computing the Z-Algorithm on `S` we obtain: |
| 162 | + |
| 163 | +                 1         2 |
| 164 | +       01234567890123456789012 |
| 165 | +       CATA$GAGAACATACATGACCAT |
| 166 | +    Z  00000000004000300001300 |
| 167 | +                 ^ |
| 168 | + |
| 169 | +We scan the Z-array and at position `10` we find `Z[10] = 4 = n`. So we can report a match occurring at text position `10 - n - 1 = 5`. |
| 170 | + |
| 171 | +As said before, the complexity of this algorithm is linear. Defining `n` and `m` as pattern and text lengths, the final complexity we obtain is `O(n + m + 1) = O(n + m)`. |
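
The approach relies on the separator appearing in neither `P` nor `T`. A fixed character such as `"💲"` cannot guarantee that for arbitrary input, and a collision can cause matches to be missed. One possible way around this, sketched below under stated assumptions, is to run the Z computation over an array of `Character?` and use `nil` as the separator, since `nil` can never equal a real character. The names `zArray` and `occurrences(of:in:)` are hypothetical and not part of the original code, and `zArray` is a compact formulation with a half-open Z-box rather than the function shown earlier.

```swift
// Hypothetical generic variant (illustrative name): Z-array over any array of
// Equatable elements, tracking the half-open Z-box [left, right).
func zArray<Element: Equatable>(_ s: [Element]) -> [Int] {
    var z = [Int](repeating: 0, count: s.count)
    var left = 0
    var right = 0

    for k in stride(from: 1, to: s.count, by: 1) {
        if k < right {
            // Inside a Z-box: reuse the value of the mirrored prefix position.
            z[k] = min(right - k, z[k - left])
        }
        // Extend the match with explicit comparisons.
        while k + z[k] < s.count && s[z[k]] == s[k + z[k]] {
            z[k] += 1
        }
        // Remember the Z-box that reaches furthest to the right.
        if k + z[k] > right {
            left = k
            right = k + z[k]
        }
    }
    return z
}

// Assumed helper: `nil` plays the role of `$`, so it cannot collide with input.
func occurrences(of pattern: String, in text: String) -> [Int] {
    let patternLength = pattern.count
    guard patternLength > 0 else { return [] }

    var combined: [Character?] = []
    combined.append(contentsOf: pattern.map { Optional($0) })
    combined.append(nil)
    combined.append(contentsOf: text.map { Optional($0) })

    let z = zArray(combined)
    var indexes: [Int] = []
    for i in (patternLength + 1) ..< combined.count where z[i] == patternLength {
        indexes.append(i - patternLength - 1) // Skip the pattern and the separator
    }
    return indexes
}

print(occurrences(of: "CATA", in: "GAGAACATACATGACCAT")) // prints "[5]"
print(occurrences(of: "a💲a", in: "a💲a💲a"))             // prints "[0, 2]"
```

This sketch returns an empty array instead of `nil` when nothing is found; that is a design choice of the example, not of the `indexesOf(pattern:)` extension above.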
| 172 | + |
| 173 | + |
| 174 | +Credits: This code is based on the book ["Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology"]() by Dan Gusfield, Cambridge University Press, 1997. |
| 175 | + |
| 176 | +*Written for Swift Algorithm Club by Matteo Dunnhofer* |