diff --git a/src/main/java/algorithms/patternFinding/KMP.java b/src/main/java/algorithms/patternFinding/KMP.java
index 0bb5d30b..f55ca1a3 100644
--- a/src/main/java/algorithms/patternFinding/KMP.java
+++ b/src/main/java/algorithms/patternFinding/KMP.java
@@ -6,111 +6,112 @@
/**
* Implementation of KMP.
*
- * Illustration of getPrefixIndices: with pattern ABCABCNOABCABCA
- * Here we make a distinction between position and index. The position is basically 1-indexed.
- * Note the return indices are still 0-indexed of the pattern string.
+ * Illustration of getPrefixTable: with pattern ABCABCNOABCABCA
+ * We consider 1-indexed positions. Position 0 will be useful later as a trick to signal that there are no prefix matches.
* Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- * Pattern: A B C A B C N O A B C A B C A ...
- * Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ...
- * Read: ^ an indexing trick; consider 1-indexed characters for clarity and simplicity in the main algor
- * Read: ^ 'A' is the first character of the pattern string,
- * there is no prefix ending before its index, 0, that can be matched with.
- * Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
- * Read: ^ Can be matched with an earlier 'A'. So we store 1.
- * Prefix is the substring from idx 0 to 1 (exclusive). Note consider prefix from 0-indexed.
- * Realise 1 can also be interpreted as the index of the next character to match against!
- * Read: ^ ^ Similarly, continue matching
- * Read: ^ ^ No matches, so 0
- * Read: ^ ^ ^ ^ ^ ^ Match with prefix until position 6!
- * Read: ^ where the magic happens, we can't match 'N'
- * at position 7 with 'A' at position 15, but
- * we know ABC of position 1-3 (or index 0-2)
- * exists and can 'restart' from there.
- *
- *
+ * Pattern: A B C A B C N O A B C A B C A ...
+ * Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ... CAN BE READ AS NUM OF CHARS MATCHED
+ * Read: ^ -1 can be interpreted as an invalid number of chars matched, but it is exploited for simplicity in the main algor.
+ * Read: ^ 'A' is the first character of the pattern, so there is no prefix ending before it to match against.
+ * Read: ^ ^ 'B' and 'C' cannot be matched with any prefix; the candidates are just 'A' and 'AB' respectively.
+ * Read: ^ can be matched with an earlier prefix, 'A'. So we store 1, the number of chars matched.
+ * Realise 1 can also be interpreted as the index of the next character to match against!
+ * Read: ^ ^ Similarly, continue matching
+ * Read: ^ ^ No matches, so 0
+ * Read: ^ ^ ^ ^ ^ ^ Match with prefix, "ABCABC", until the 6th char
+ * of the pattern string.
+ * Read: ^ where the magic happens, we can't match 'N'
+ * at position 7 with 'A' at position 15, but
+ * we know "ABC" exists as an earlier sub-pattern
+ * from the 1st to 3rd char, so we resume matching
+ * from the 4th char onwards.
*
* Illustration of main logic:
* Pattern: ABABAB
* String : ABABCABABABAB
*
- * A B A B C A B A B A B A B
- * Read: ^ to ^ Continue matching where possible, leading to Pattern[0:4] matched.
- * unable to match Pattern[4]. But notice that last two characters of String[0:4]
- * form a sub-pattern with Pattern[0:2] Maybe Pattern[2] == 'C' and we can 're-use' Pattern[0:2]
- * Read: ^ try ^ by checking if Pattern[2] == 'C'
+ * A B A B C A B A B A B A B
+ * Read: ^ to ^ Continue matching where possible, leading to the first 4 characters matched.
+ * Unable to match Pattern[4]. But notice that the last two matched characters
+ * form a sub-pattern with the first 2. Maybe Pattern[2] == 'C' and we can 're-use' "AB"
+ * Read: ^ ^ check if Pattern[2] == 'C'
* Read: Turns out no. No previously identified sub-pattern with 'C'. Restart matching Pattern.
- * Read: ^ to ^ Found complete match! But rather than restart, notice that last 4 characters
- * Read: form a prefix sub-pattern of Pattern, which is Pattern[0:4] = "ABAB", so,
- * Read: ^ ^ Start matching from Pattern[4] and finally Pattern[5]
+ * Read: ^ ^ Found complete match! But rather than restart, notice that the last 4 characters
+ * Read: of "ABABAB" form a prefix sub-pattern of Pattern, which is "ABAB", so,
+ * Read: ^ reuse ^ ^ then match the 5th and 6th chars of the pattern, which happen to be "AB"
*/
public class KMP {
/**
- * Find and indicate all suffix that match with a prefix.
+ * For each prefix of the pattern, captures the longest proper prefix that is also a suffix of it.
+ * Does this by tracking the number of characters (of the prefix and suffix) matched.
*
* @param pattern to search
- * @return an array of indices where the suffix ending at each position of they array can be matched with
- * corresponding a prefix of the pattern ending before the specified index
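+ * @return an array where entry p (1-indexed position) holds the number of prefix characters matched by the
+ * suffix ending at p; entry 0 is -1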
*/
- private static int[] getPrefixIndices(String pattern) {
+ private static int[] getPrefixTable(String pattern) {
+ // 1-indexed implementation
int len = pattern.length();
- int[] prefixIndices = new int[len + 1];
- prefixIndices[0] = -1;
- prefixIndices[1] = 0; // 1st character has no prefix to match with
+ int[] numCharsMatched = new int[len + 1];
+ numCharsMatched[0] = -1;
+ numCharsMatched[1] = 0; // 1st character has no prefix to match with
int currPrefixMatched = 0; // num of chars of prefix pattern currently matched
- int pos = 2; // Starting from the 2nd character, recall 1-indexed
+ int pos = 2; // Starting from the 2nd character
while (pos <= len) {
if (pattern.charAt(pos - 1) == pattern.charAt(currPrefixMatched)) {
currPrefixMatched += 1;
// note, the line below can also be interpreted as the index of the next char to match
- prefixIndices[pos] = currPrefixMatched; // an indexing trick, store at the pos, num of chars matched
+ numCharsMatched[pos] = currPrefixMatched;
pos += 1;
} else if (currPrefixMatched > 0) {
// go back to a previous known match and try to match again
- currPrefixMatched = prefixIndices[currPrefixMatched];
+ currPrefixMatched = numCharsMatched[currPrefixMatched];
} else {
// unable to match, time to move on
- prefixIndices[pos] = 0;
+ numCharsMatched[pos] = 0;
pos += 1;
}
}
- return prefixIndices;
+ return numCharsMatched;
}
/**
- * Main logic of KMP. Iterate the sequence, looking for patterns. If a difference is found, resume matching from
- * a previously identified sub-pattern, if possible. Length of pattern should be at least one.
- *
+ * Main logic of KMP. Iterate the sequence, looking for patterns. If a mismatch is found, resume matching from
+ * a previously identified sub-pattern, if possible. Here we assume length of pattern is at least one.
* @param sequence to search against
* @param pattern to search for
* @return start indices of all occurrences of pattern found
*/
public static List
Pattern-searching problems are prevalent across many fields of CS, for instance,
in text editors when searching for a pattern, in computational biology sequence matching problems,
in NLP problems, and even for looking for file patterns for effective file management.
@@ -11,9 +14,31 @@ Typically, the algorithm returns a list of indices that denote the start of each

Image Source: GeeksforGeeks
-## Analysis
+### Intuition
+KMP is efficient because it reuses the information gained from previous character comparisons. When a mismatch occurs,
+the algorithm uses this information to skip over as many characters as possible instead of restarting the match.
-**Time complexity**:
+Considering the string pattern:
+
+Instead, it leverages the information that "XYXY" has already been matched.
+
+Therefore, the algorithm continues matching from the 5th character of the pattern string (C in "XYXYCXYXYF").
+It checks this against the 10th character of the string (C in "XYXYCXYXYCXYXYFGABC").
+Since they match, the algorithm continues from there without re-checking the initial "XYXY".
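
The jump in the example above comes from a prefix table. A minimal sketch follows. It assumes a 0-indexed table, unlike the 1-indexed `getPrefixTable` in this PR, and the names `PrefixTableSketch` and `lps` are hypothetical. The 9 characters matched before the mismatch are "XYXYCXYXY", and the table entry for that prefix is 4, which is why matching resumes at the 5th pattern character.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the PR.
final class PrefixTableSketch {
    // table[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i] (0-indexed variant).
    static int[] lps(String pattern) {
        int[] table = new int[pattern.length()];
        int matched = 0; // number of prefix characters currently matched
        for (int i = 1; i < pattern.length(); i++) {
            // On a mismatch, fall back to the next shorter prefix that is also a suffix.
            while (matched > 0 && pattern.charAt(i) != pattern.charAt(matched)) {
                matched = table[matched - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            table[i] = matched;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = lps("XYXYCXYXYF");
        System.out.println(Arrays.toString(table)); // [0, 0, 1, 2, 0, 1, 2, 3, 4, 0]
        // After 9 matched characters, 4 of them ("XYXY") can be reused.
        System.out.println(table[8]); // 4
    }
}
```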
+
+## Complexity Analysis
+Let k be the length of the pattern and n be the length of the string to match against.
+**Time complexity**: O(n+k)
Naively, we can look for patterns in a given sequence in O(nk) where n is the length of the sequence and k
is the length of the pattern. We do this by iterating every character of the sequence, and look at the
@@ -27,7 +52,10 @@ O(n) traversal of the sequence. More details found in the src code.
**Space complexity**: O(k) auxiliary space to store suffix that matches with prefix of the pattern string
## Notes
-
-A detailed illustration of how the algorithm works is shown in the code.
+1. A detailed illustration of how the algorithm works is shown in the code.
But if you have trouble understanding the implementation,
here is a good [video](https://www.youtube.com/watch?v=EL4ZbRF587g) as well.
+2. A subroutine to find the Longest Prefix Suffix (LPS) is commonly used in the preprocessing step of KMP.
+It may be useful to interpret each LPS value as the number of characters matched between the suffix and the prefix.
+With 0-indexing, the number of prefix characters already matched is also the index of the next pattern character to
+match.
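
The interpretation in note 2 can be sketched end to end. The block below is an illustrative sketch, not the implementation in `KMP.java`. Its table follows the 1-indexed layout with a -1 sentinel described in this PR, but the class name `KmpSearchSketch` and the bodies of `prefixTable` and `findOccurrences` are assumptions written for this example. It also assumes an empty pattern yields no matches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only.
final class KmpSearchSketch {
    // 1-indexed table: table[p] = number of prefix characters matched by the
    // suffix ending at position p; table[0] = -1 is a sentinel.
    static int[] prefixTable(String pattern) {
        int len = pattern.length();
        int[] table = new int[len + 1];
        table[0] = -1;
        int matched = 0;
        int pos = 2; // position 1 has no proper prefix to match with
        while (pos <= len) {
            if (pattern.charAt(pos - 1) == pattern.charAt(matched)) {
                matched++;
                table[pos++] = matched;
            } else if (matched > 0) {
                matched = table[matched]; // retry with a shorter known match
            } else {
                table[pos++] = 0;
            }
        }
        return table;
    }

    static List<Integer> findOccurrences(String sequence, String pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // assumption: an empty pattern has no occurrences
        }
        int[] table = prefixTable(pattern);
        int matched = 0; // chars matched so far, also the index of the next pattern char
        for (int i = 0; i < sequence.length(); i++) {
            while (matched > 0 && sequence.charAt(i) != pattern.charAt(matched)) {
                matched = table[matched];
            }
            if (sequence.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                starts.add(i - matched + 1);
                matched = table[matched]; // reuse the overlap instead of restarting
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findOccurrences("ABABCABABABAB", "ABABAB")); // [5, 7]
    }
}
```

On the example from the code comments, the two occurrences overlap, and the second is found by reusing the 4 characters "ABAB" already matched rather than rescanning them.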
diff --git a/src/test/java/algorithms/patternFinding/KmpTest.java b/src/test/java/algorithms/patternFinding/KmpTest.java
index 1d795d64..0647817b 100644
--- a/src/test/java/algorithms/patternFinding/KmpTest.java
+++ b/src/test/java/algorithms/patternFinding/KmpTest.java
@@ -34,11 +34,16 @@ public void testEmptySequence_findOccurrences_shouldReturnStartIndices() {
@Test
public void testNoOccurence_findOccurrences_shouldReturnStartIndices() {
String seq = "abcabcabc";
- String pattern = "noway";
+ String patternOne = "noway";
+ String patternTwo = "cbc";
- List