Commit a172311

added Z-Algorithm README
1 parent e3967df commit a172311

1 file changed

+176
-0
lines changed

Z-Algorithm/README.markdown

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# Z-Algorithm String Search

Goal: Write a simple linear-time string matching algorithm in Swift that returns the indexes of all the occurrences of a given pattern.

In other words, we want to implement an `indexesOf(pattern: String)` extension on `String` that returns an array `[Int]` of integers representing the indexes of all occurrences of the search pattern, or `nil` if the pattern could not be found inside the string.

For example:

```swift
let str = "Hello, playground!"
str.indexesOf(pattern: "ground")   // Output: [11]

let traffic = "🚗🚙🚌🚕🚑🚐🚗🚒🚚🚎🚛🚐🏎🚜🚗🏍🚒🚲🚕🚓🚌🚑"
traffic.indexesOf(pattern: "🚑")   // Output: [4, 21]
```

Many string search algorithms use a pre-processing function to compute a table that is used in a later stage. This table can save time during the pattern search stage because it lets the algorithm avoid unneeded character comparisons. The [Z-Algorithm]() is one of these functions. It was born as a pattern pre-processing function (this is its role in the [Knuth-Morris-Pratt algorithm](../Knuth-Morris-Pratt/) and others) but, as we will show here, it can also be used as a standalone string search algorithm.

### Z-Algorithm as pattern pre-processor

As we said, the Z-Algorithm is first and foremost an algorithm that processes a pattern in order to calculate a table used to skip comparisons.
The computation of the Z-Algorithm over a pattern `P` produces an array of integers (called `Z` in the literature) in which each element `Z[i]` represents the length of the longest substring of `P` that starts at `i` and matches a prefix of `P`. In simpler words, `Z[i]` records the length of the longest prefix of `P[i...|P|]` that matches a prefix of `P`. As an example, let's consider `P = "ffgtrhghhffgtggfredg"`. We have that `Z[5] = 0 (f...h...)`, `Z[9] = 4 (ffgtr...ffgtg...)` and `Z[15] = 1 (ff...fr...)`.
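
The definition above can be checked directly with a brute-force computation. The following is only a sketch for illustration: `naiveZ` is a hypothetical helper name, it runs in quadratic time, and it is not the linear-time algorithm described in this document. Like the code in this document, it leaves `Z[0]` at `0`.

```swift
/// Brute-force Z-array: for every start index, count how many characters
/// match the prefix of the string. Quadratic time, for illustration only.
func naiveZ(_ text: String) -> [Int] {
    let p = Array(text)                       // [Character], indexable with Int
    var z = [Int](repeating: 0, count: p.count)
    guard p.count > 1 else { return z }
    for i in 1..<p.count {
        var length = 0
        while i + length < p.count && p[i + length] == p[length] {
            length += 1
        }
        z[i] = length
    }
    return z
}

let z = naiveZ("ffgtrhghhffgtggfredg")
print(z[5], z[9], z[15])   // prints "0 4 1"
```

The three printed values are the ones quoted in the example above.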

But how do we compute `Z`? Before we describe the algorithm we must introduce the concept of a Z-box. A Z-box is a pair `(left, right)` used during the computation that records a substring that also occurs as a prefix of `P`; among the ones found so far, the algorithm keeps the one that reaches farthest to the right. The two indices `left` and `right` represent, respectively, the left-end index and the right-end index of this substring.

The definition of the Z-Algorithm is inductive: it computes the elements of the array for every position `k` in the pattern, starting from `k = 1`. The following values (`Z[k + 1]`, `Z[k + 2]`, ...) are computed after `Z[k]`. The idea behind the algorithm is that previously computed values can speed up the computation of `Z[k + 1]`, avoiding some character comparisons that were already done before. Consider this example: suppose we are at iteration `k = 100`, so we are analyzing position `100` of the pattern. All the values between `Z[1]` and `Z[99]` were correctly computed, and `left = 70` and `right = 120`. This means that there is a substring of length `51` starting at position `70` and ending at position `120` that matches the prefix of the pattern/string we are considering. Reasoning on it a little, we can say that the substring of length `21` starting at position `100` matches the substring of length `21` starting at position `30` of the pattern (because we are inside a substring that matches a prefix of the pattern). So we can use `Z[30]` to compute `Z[100]` without additional character comparisons.

This is a simple description of the idea behind the algorithm. There are a few cases to manage in which the pre-computed values cannot be applied directly and some comparisons have to be made.
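
The index arithmetic in the `k = 100` example can be written down as a tiny sketch. The helper name `mirror(of:left:right:)` is an assumption made for this illustration and does not appear in the implementation.

```swift
/// Hypothetical helper: for a position k inside the Z-box (left, right),
/// returns the index of the matching character in the prefix and the
/// number of characters of the Z-box that remain starting from k.
func mirror(of k: Int, left: Int, right: Int) -> (prefixIndex: Int, remaining: Int) {
    precondition(left <= k && k <= right, "k must lie inside the Z-box")
    return (prefixIndex: k - left, remaining: right - k + 1)
}

let m = mirror(of: 100, left: 70, right: 120)
print(m.prefixIndex, m.remaining)   // prints "30 21"

// The same property on a small string: "abababbb" has the Z-box (2, 5),
// so every character inside the box equals its mirror in the prefix.
let p = Array("abababbb")
for k in 2...5 {
    assert(p[k] == p[mirror(of: k, left: 2, right: 5).prefixIndex])
}
```

When `Z[prefixIndex]` is smaller than `remaining`, the value can be copied as is; otherwise the match may continue past `right` and further comparisons are needed.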

Here is the code of the function that computes the Z-array:

```swift
func ZetaAlgorithm(ptrn: String) -> [Int]? {

    // Work on an array of Characters so that positions can be indexed with Int
    let pattern = Array(ptrn)
    let patternLength: Int = pattern.count

    guard patternLength > 0 else {
        return nil
    }

    var zeta: [Int] = [Int](repeating: 0, count: patternLength)

    var left: Int = 0          // left end of the current Z-box
    var right: Int = 0         // right end of the current Z-box
    var k_1: Int = 0           // 1-based position in the prefix that mirrors k
    var betaLength: Int = 0    // characters of the Z-box that remain from k
    var textIndex: Int = 0
    var patternIndex: Int = 0

    for k in 1 ..< patternLength {
        if k > right {  // Outside a Z-box: compare the characters until mismatch
            patternIndex = 0

            while k + patternIndex < patternLength &&
                pattern[k + patternIndex] == pattern[patternIndex] {
                patternIndex = patternIndex + 1
            }

            zeta[k] = patternIndex

            if zeta[k] > 0 {
                left = k
                right = k + zeta[k] - 1
            }
        } else {  // Inside a Z-box
            k_1 = k - left + 1
            betaLength = right - k + 1

            if zeta[k_1 - 1] < betaLength { // Entirely inside a Z-box: we can use the values computed before
                zeta[k] = zeta[k_1 - 1]
            } else { // Not entirely inside a Z-box: we must proceed with comparisons too
                textIndex = betaLength
                patternIndex = right + 1

                while patternIndex < patternLength && pattern[textIndex] == pattern[patternIndex] {
                    textIndex = textIndex + 1
                    patternIndex = patternIndex + 1
                }

                zeta[k] = patternIndex - k
                left = k
                right = patternIndex - 1
            }
        }
    }
    return zeta
}
```

Let's work through an example with the code above. Consider the string `P = "abababbb"`. The algorithm begins with `k = 1` and `left = right = 0`. No Z-box is "active", and because `k > right` we start with the character comparisons between `P[1]` and `P[0]`.

          01234567
    k:     x
          abababbb
          x
    Z:    00000000
    left:  0
    right: 0

We have a mismatch at the first comparison, so the substring starting at `P[1]` does not match a prefix of `P`. We set `Z[1] = 0` and leave `left` and `right` untouched. We begin another iteration with `k = 2`: we have `2 > 0`, so again we start comparing characters, `P[2]` with `P[0]`. This time the characters match, so we continue the comparisons until a mismatch occurs. It happens at position `6`. Four characters matched, so we set `Z[2] = 4`, `left = k = 2` and `right = k + Z[k] - 1 = 5`. We have our first Z-box, the substring `"abab"` (notice that it matches a prefix of `P`) starting at position `left = 2`.

          01234567
    k:      x
          abababbb
          x
    Z:    00400000
    left:  2
    right: 5

We then proceed with `k = 3`. We have `3 <= 5`, so we are inside the Z-box previously found and inside a prefix of `P`. We can therefore look for a position that has a previously computed value. We calculate the prefix index `k - left = 1` (the code stores `k_1 = k - left + 1` and reads `Z[k_1 - 1]`), which is the index of the prefix character that corresponds to `P[k]`. We check `Z[1] = 0`; since `0 < (right - k + 1 = 3)`, we are entirely inside the Z-box. We can use the previously computed value, so we set `Z[3] = Z[1] = 0`, while `left` and `right` remain unchanged.

At iteration `k = 4` we again execute the `else` branch of the outer `if`. In the inner `if` the prefix index is `k - left = 2` and `(Z[2] = 4) >= (right - k + 1 = 2)`. So the substring `P[k...right]` matches the prefix of `P` for `right - k + 1 = 2` characters, but we do not yet know whether the match continues past the Z-box. We must therefore compare the characters starting at `right + 1 = 6` with those starting at `right - k + 1 = 2`. We have `P[6] != P[2]`, so we set `Z[k] = 6 - 4 = 2`, `left = 4` and `right = 5`.

          01234567
    k:        x
          abababbb
          x
    Z:    00402000
    left:  4
    right: 5

At iteration `k = 5` we have `k <= right` and `(Z[k - left] = Z[1] = 0) < (right - k + 1 = 1)`, so we set `Z[k] = 0`. In iterations `6` and `7` we execute the first branch of the outer `if` but only find mismatches, so the algorithm terminates, returning the Z-array `Z = [0, 0, 4, 0, 2, 0, 0, 0]`.
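
The Z-Algorithm runs in linear time. More specifically, the Z-Algorithm for a string `P` of size `n` has a running time of `O(n)`: every successful comparison moves `right` forward, and every iteration stops after at most one mismatch, so the total number of character comparisons is at most about `2n`.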
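
To make the linear bound more tangible, here is a minimal sketch that counts character comparisons. It uses a compact formulation of the same idea (a half-open Z-box `left..<right`) rather than the `ZetaAlgorithm` function above, and the name `zArrayCountingComparisons` is an assumption made for this illustration.

```swift
/// Compact Z-array computation that also counts character comparisons.
/// The Z-box is the half-open range left..<right (right is exclusive).
func zArrayCountingComparisons(_ text: String) -> (z: [Int], comparisons: Int) {
    let s = Array(text)
    let n = s.count
    var z = [Int](repeating: 0, count: n)
    var comparisons = 0
    guard n > 1 else { return (z, comparisons) }

    var left = 0
    var right = 0
    for k in 1..<n {
        if k < right {
            // Inside the Z-box: reuse the mirrored value, capped at the box end.
            z[k] = min(right - k, z[k - left])
        }
        // Extend the match with explicit comparisons.
        while k + z[k] < n {
            comparisons += 1
            if s[z[k]] != s[k + z[k]] { break }
            z[k] += 1
        }
        // Keep the box that reaches farthest to the right.
        if k + z[k] > right {
            left = k
            right = k + z[k]
        }
    }
    return (z, comparisons)
}

let result = zArrayCountingComparisons("abababbb")
print(result.z)                          // prints "[0, 0, 4, 0, 2, 0, 0, 0]"
print(result.comparisons <= 2 * 8)       // prints "true"
```

The Z-array matches the one obtained in the walkthrough, and the comparison count stays within twice the string length.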

The implementation of the Z-Algorithm as a string pre-processor is contained in the [ZAlgorithm.swift](./ZAlgorithm.swift) file.

### Z-Algorithm as string search algorithm

The Z-Algorithm discussed above leads to the simplest linear-time string matching algorithm. To obtain it, we simply concatenate the pattern `P` and the text `T` into a string `S = P$T`, where `$` is a character that appears in neither `P` nor `T`. Then we run the algorithm on `S`, obtaining the Z-array. All we have to do now is scan the Z-array looking for elements equal to `n` (the pattern length). When we find such a value we can report an occurrence.

```swift
extension String {

    func indexesOf(pattern: String) -> [Int]? {
        let patternLength: Int = pattern.count

        // An empty pattern has no meaningful occurrences
        guard patternLength > 0 else {
            return nil
        }

        /* Let's calculate the Z-Algorithm on the concatenation of pattern and text.
           The separator "💲" must not occur in the pattern or in the text. */
        guard let zeta = ZetaAlgorithm(ptrn: pattern + "💲" + self) else {
            return nil
        }

        var indexes: [Int] = [Int]()

        /* Scan the zeta array to find matched patterns */
        for i in 0 ..< zeta.count {
            if zeta[i] == patternLength {
                // Subtract the pattern and the separator to get the index in the text
                indexes.append(i - patternLength - 1)
            }
        }

        guard !indexes.isEmpty else {
            return nil
        }

        return indexes
    }
}
```

Let's work through an example. Let `P = "CATA"` and `T = "GAGAACATACATGACCAT"` be the pattern and the text. Let's concatenate them with the character `$`. We have the string `S = "CATA$GAGAACATACATGACCAT"`. After computing the Z-Algorithm on `S` we obtain:

                1         2
      01234567890123456789012
      CATA$GAGAACATACATGACCAT
    Z 00000000004000300001300
                ^

We scan the Z-array and at position `10` we find `Z[10] = 4 = n`. So we can report a match occurring at text position `10 - n - 1 = 5`.
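
The worked example can be reproduced with a short, self-contained sketch. It also illustrates one possible way to avoid choosing a separator character at all: the concatenation is built as an array of `Character?` and `nil` plays the role of `$`, since `nil` can never equal a real character. The names `zArray` and `occurrences(of:in:)` are hypothetical, and `zArray` is a compact formulation of the Z computation, not the `ZetaAlgorithm` function of this document.

```swift
/// Z-array of any array of Equatable elements (compact formulation,
/// with the Z-box kept as the half-open range left..<right).
func zArray<T: Equatable>(_ s: [T]) -> [Int] {
    let n = s.count
    var z = [Int](repeating: 0, count: n)
    guard n > 1 else { return z }
    var left = 0
    var right = 0
    for k in 1..<n {
        if k < right { z[k] = min(right - k, z[k - left]) }
        while k + z[k] < n && s[z[k]] == s[k + z[k]] { z[k] += 1 }
        if k + z[k] > right {
            left = k
            right = k + z[k]
        }
    }
    return z
}

/// Hypothetical helper: start indexes of `pattern` inside `text`.
func occurrences(of pattern: String, in text: String) -> [Int] {
    let n = pattern.count
    guard n > 0 else { return [] }

    // Build S = P$T, where `nil` is the separator.
    var s: [Character?] = []
    for c in pattern { s.append(c) }
    s.append(nil)
    for c in text { s.append(c) }

    let z = zArray(s)
    // Z[i] == n marks an occurrence; shift by n + 1 to get the text index.
    return z.indices.filter { z[$0] == n }.map { $0 - n - 1 }
}

print(occurrences(of: "CATA", in: "GAGAACATACATGACCAT"))   // prints "[5]"
```

Unlike `indexesOf(pattern:)`, this sketch returns an empty array instead of `nil` when nothing is found; that is a choice made here for brevity only.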

As said before, the complexity of this algorithm is linear. With `n` and `m` as the pattern and text lengths, the final complexity is `O(n + m + 1) = O(n + m)`.

Credits: This code is based on the handbook ["Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology"]() by Dan Gusfield, Cambridge University Press, 1997.

*Written for Swift Algorithm Club by Matteo Dunnhofer*
