# String matching.

String matching asks whether string A is a substring of string B. The brute-force approach runs in O(N*M) time, where N and M are the lengths of string A and string B.

In [1]:
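def stringMatchingBF(A, B):
    # Try every start position at which A still fits inside B.
    for i in range(len(B) - len(A) + 1):
        # The slice comparison costs up to len(A) character comparisons.
        if A == B[i: i + len(A)]:
            return True
    return False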

In [91]:
%timeit for x in range(10): stringMatchingBF('saveearth','thehelloworldchapterofsaveearth')

171 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [9]:
stringMatchingBF("hell", "thehelloworld" )

True
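
To see where the O(N*M) bound comes from, the slice comparison can be unrolled into an explicit inner loop. This is only a sketch; the name `string_matching_explicit` is introduced here for illustration and is not part of the cells above.

```python
def string_matching_explicit(A, B):
    """Return True if A occurs in B, comparing one character at a time."""
    n, m = len(A), len(B)
    # Try every start position at which A could still fit inside B.
    for start in range(m - n + 1):
        matched = 0
        # Up to n comparisons per start position, so roughly n * m in total.
        while matched < n and B[start + matched] == A[matched]:
            matched += 1
        if matched == n:
            return True
    return False

print(string_matching_explicit("hell", "thehelloworld"))   # True
print(string_matching_explicit("help", "thehelloworld"))   # False
```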

## Rolling Hash

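The core idea of a rolling hash is to define a hash function that lets us compare string A against each window W of string B cheaply: a window matches when `hash(A) == hash(W) and A == W`, and the string comparison is only needed when the hashes agree. The hash of the next window is derived from the current one in O(1) time, which reduces the expected time complexity to O(m), where m is the length of B.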

#### Algorithm

For a string S, say `hash(S) = (S[0] * P**0 + S[1] * P**1 + S[2] * P**2 + ...) % MOD`, where `X**Y` represents exponentiation, and `S[i]` is the ASCII character code of the string at that index.

The idea is that `hash(S)` has output that is approximately uniformly distributed over `[0, 1, 2, ..., MOD-1]`, so if `hash(S) == hash(T)` it is very likely that `S == T`.
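
A minimal sketch of this hash in Python. The values `P = 113` and `MOD = 10**9 + 7` are assumptions chosen for illustration, since the text does not fix them, and `poly_hash` is a hypothetical name.

```python
P = 113            # assumed base, not specified in the text
MOD = 10**9 + 7    # assumed modulus (a prime)

def poly_hash(S):
    """hash(S) = (S[0]*P**0 + S[1]*P**1 + S[2]*P**2 + ...) % MOD."""
    h = 0
    power = 1  # holds P**i % MOD for the current index i
    for ch in S:
        h = (h + ord(ch) * power) % MOD
        power = power * P % MOD
    return h

print(poly_hash("ab") == (97 + 98 * 113) % MOD)  # True
print(poly_hash("hell") == poly_hash("help"))    # False
```

Equal strings always hash to the same value; unequal strings can still collide, which is why a hash match is normally confirmed with a direct comparison.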

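Now say we have a hash `hash(A)`, and we want the hash of the rotated string `A[1], A[2], ..., A[N-1], A[0]`. We can subtract `A[0]` from the hash, divide by P, and add `A[0] * P**(N-1)`. (The division is carried out in the finite field $\mathbb{F}_\text{MOD}$ by multiplying by the modular inverse `Pinv = pow(P, MOD-2, MOD)`, which requires MOD to be prime.) The same update slides a window across a text: subtract the outgoing character, divide by P, and add the incoming character times `P**(N-1)`.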
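
The rotation update can be checked against a full recomputation. This is a sketch under assumed constants (`P = 113`, `MOD = 10**9 + 7`); `poly_hash` and `rotate_hash` are illustrative names, not functions used elsewhere in this notebook.

```python
P = 113                       # assumed base
MOD = 10**9 + 7               # assumed prime modulus
Pinv = pow(P, MOD - 2, MOD)   # modular inverse of P; valid because MOD is prime

def poly_hash(S):
    return sum(ord(ch) * pow(P, i, MOD) for i, ch in enumerate(S)) % MOD

def rotate_hash(h, first, n):
    """Hash of S[1:] + S[0], given h = hash(S), first = S[0], n = len(S)."""
    code = ord(first)
    h = (h - code) * Pinv % MOD                    # drop S[0], then divide by P
    return (h + code * pow(P, n - 1, MOD)) % MOD   # re-append S[0] at the end

S = "abcde"
h = poly_hash(S)
for _ in range(len(S)):
    h = rotate_hash(h, S[0], len(S))
    S = S[1:] + S[0]
    assert h == poly_hash(S)   # the incremental update agrees with recomputation
```

If `P**(N-1) % MOD` is computed once up front, each update needs only a constant number of modular operations.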

In [2]:
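def matchingStringRH(A, B):
    n = len(A)
    if n == 0:
        return True  # the empty string is a substring of any string
    # Hash of the pattern: sum of code(A[i]) * 26**i (exact integers, no modulus).
    target = 0
    for i, a in enumerate(A):
        target += (ord(a) - ord('a')) * 26 ** i
    ans = 0
    for j, b in enumerate(B):
        if j >= n:
            # Slide the window: drop B[j - n], divide by 26, add b as the top digit.
            k = ord(B[j - n]) - ord('a')
            ans -= k * 26 ** 0
            ans //= 26
            ans += (ord(b) - ord('a')) * 26 ** (n - 1)
        else:
            ans += (ord(b) - ord('a')) * 26 ** j
        # Compare only complete windows (a partial window can share a hash with
        # the pattern because 'a' maps to 0), and confirm a hash hit directly.
        if j >= n - 1 and ans == target and B[j - n + 1: j + 1] == A:
            return True
    return False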

In [3]:
%timeit for x in range(10): matchingStringRH('saveearth','thehelloworldchapterofsaveearth')

952 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
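
`matchingStringRH` works with exact integers that grow like `26**len(A)`. A variant that keeps every value below a modulus, following the `P`, `MOD` and `Pinv` scheme described above, might look like the sketch below. The constants and the name `rolling_hash_search` are assumptions for illustration, and no timing claim is made for it.

```python
def rolling_hash_search(A, B, P=113, MOD=10**9 + 7):
    """Return True if A occurs in B, using a modular rolling hash."""
    n, m = len(A), len(B)
    if n == 0:
        return True
    if n > m:
        return False
    Pinv = pow(P, MOD - 2, MOD)   # modular inverse of P (MOD must be prime)
    top = pow(P, n - 1, MOD)      # weight of the last character in a window

    def full_hash(s):
        h, power = 0, 1
        for ch in s:
            h = (h + ord(ch) * power) % MOD
            power = power * P % MOD
        return h

    target = full_hash(A)
    window = full_hash(B[:n])
    for start in range(m - n + 1):
        # Confirm a hash hit with a direct comparison to rule out collisions.
        if window == target and B[start:start + n] == A:
            return True
        if start + n < m:
            # Drop B[start], divide by P, then add the incoming character.
            window = (window - ord(B[start])) * Pinv % MOD
            window = (window + ord(B[start + n]) * top) % MOD
    return False

print(rolling_hash_search('saveearth', 'thehelloworldchapterofsaveearth'))  # True
```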


## KMP (Knuth Morris Pratt) Pattern Searching

#### Preprocessing Overview:

  * The KMP algorithm preprocesses the pattern A[] and constructs an auxiliary array aux[] of size m (the length of pattern A), which is used to skip characters while matching.
  * aux records the longest proper prefix which is also a suffix. A proper prefix is a prefix other than the whole string. For example, the prefixes of "ABC" are "", "A", "AB", and "ABC". The proper prefixes are "", "A", and "AB". The suffixes of the string are "", "C", "BC", and "ABC".
  * We compute aux[] over sub-patterns. More precisely, we focus on substrings of the pattern that are both a prefix and a suffix.
  * For each sub-pattern A[0..i], where i = 0 to m - 1, aux[i] stores the length of the longest proper prefix which is also a suffix of the sub-pattern A[0..i].
  
```py
aux[i] = the length of the longest proper prefix of A[0..i] which is also a suffix of A[0..i].

Examples of aux[] construction:
For the pattern “AAAA”, 
aux[] is [0, 1, 2, 3]

For the pattern “ABCDE”, 
aux[] is [0, 0, 0, 0, 0]

For the pattern “AABAACAABAA”, 
aux[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern “AAACAAAAAC”, 
aux[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4] 

For the pattern “AAABAAA”, 
aux[] is [0, 1, 2, 0, 1, 2, 3]
```

#### Searching Algorithm

Unlike the naive algorithm, which slides the pattern by one position and compares all characters at each shift, KMP uses a value from aux[] to decide which characters to match next. The idea is to avoid re-matching a character that we already know will match.

How do we use aux[] to decide the next position (that is, how many characters can be skipped)?
  * We start comparing A[j], with j = 0, against the characters of the current window of the text.
  * We keep matching characters B[i] and A[j], incrementing i and j while they agree.
  * When we see a mismatch:
    * We know that the pattern characters A[0..j-1] match the text characters B[i-j..i-1].
    * We also know (from the definition above) that aux[j-1] is the length of the longest proper prefix of A[0..j-1] that is also a suffix of it.
    * From these two points, we can conclude that we do not need to re-match these aux[j-1] characters against B[i-aux[j-1]..i-1], because we know they will match anyway.
    
```
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
aux[] = {0, 1, 2, 3} 
```

```
i = 0, j = 0
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
```

```
i = 1, j = 1
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
```

```
i = 2, j = 2
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
```

```
i = 3, j = 3
txt[] = "AAAAABAAABA" 
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
```

```
i = 4, j = 4
Since j == M, print pattern found and reset j,
j = aux[j-1] = aux[3] = 3
```

```
Here unlike Naive algorithm, we do not match first three 
characters of this window. Value of aux[j-1] (in above 
step) gave us index of next character to match.
```

```
i = 4, j = 3
txt[] = "AAAAABAAABA" 
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++
```

```
i = 5, j = 4
Since j == M, print pattern found and reset j,
j = aux[j-1] = aux[3] = 3
```

```
Again unlike Naive algorithm, we do not match first three 
characters of this window. Value of aux[j-1] (in above 
step) gave us index of next character to match.
```

```
i = 5, j = 3
txt[] = "AAAAABAAABA" 
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = aux[j-1] = aux[2] = 2
```

```
i = 5, j = 2
txt[] = "AAAAABAAABA" 
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = aux[j-1] = aux[1] = 1 
```

```
i = 5, j = 1
txt[] = "AAAAABAAABA" 
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = aux[j-1] = aux[0] = 0
```

```
i = 5, j = 0
txt[] = "AAAAABAAABA" 
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
```

```
i = 6, j = 0
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++
```

```
i = 7, j = 1
txt[] = "AAAAABAAABA" 
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++
```

```
We continue this way...
```
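
The walk-through above reports a match and then resets `j = aux[j-1]` instead of stopping, so it finds every occurrence. A minimal sketch of that variant follows; `build_aux` and `kmp_find_all` are names introduced here for illustration, and the block carries its own table builder so that it runs on its own.

```python
def build_aux(pat):
    """aux[i] = length of the longest proper prefix of pat[0..i] that is also its suffix."""
    aux = [0] * len(pat)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pat)):
        while k > 0 and pat[i] != pat[k]:
            k = aux[k - 1]
        if pat[i] == pat[k]:
            k += 1
        aux[i] = k
    return aux

def kmp_find_all(pat, txt):
    """Return the start index of every occurrence of pat in txt."""
    if not pat:
        return []
    aux = build_aux(pat)
    found = []
    i = j = 0  # i indexes txt, j indexes pat
    while i < len(txt):
        if txt[i] == pat[j]:
            i += 1
            j += 1
            if j == len(pat):
                found.append(i - j)   # pattern found, report its start
                j = aux[j - 1]        # reset j and keep searching
        elif j > 0:
            j = aux[j - 1]            # mismatch: change only j
        else:
            i += 1                    # mismatch with j == 0: advance i
    return found

print(build_aux("AAAA"))                     # [0, 1, 2, 3]
print(kmp_find_all("AAAA", "AAAAABAAABA"))   # [0, 1]
```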

In [43]:
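def kmpPrepare(pattern):
    # Brute-force construction: for every sub-pattern pattern[0..i], try each
    # proper-prefix length j and keep the longest one that is also a suffix.
    aux = [0] * len(pattern)
    for i in range(1, len(pattern)):
        for j in range(i + 1):
            prefix = pattern[:j]
            suffix = pattern[i-j+1:i+1]
            if prefix == suffix:
                aux[i] = max(len(prefix), aux[i])
    return aux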

In [90]:
%timeit kmpPrepare("AAABAAA")

44.5 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [87]:
def kmpPrepareIm(pattern):
    pre = 0 # length of the previous longest prefix suffix 
    aux = [0] * len(pattern)
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[pre]:
            pre += 1
            aux[i] = pre
            i += 1
        else:
            # This is the tricky part. On a mismatch, fall back to the next
            # shorter prefix that is also a suffix instead of restarting at 0.
            # Consider AAACAAAA at i = 7. The idea mirrors the search step.
            if pre != 0:
                pre = aux[pre - 1]
            else:
                aux[i] = 0
                i += 1
    return aux

In [89]:
%timeit kmpPrepareIm("AAABAAA")

6.94 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [94]:
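# import pysnooper  # optional third-party tracer; enable together with the decorator
# @pysnooper.snoop()
def KMPSearch(A, B):
    M = len(A)
    N = len(B)
    if M == 0:
        return True  # the empty pattern occurs in any text

    aux = kmpPrepareIm(A)
    i, j = 0, 0  # i indexes the text B, j indexes the pattern A
    while i < N:
        if A[j] == B[i]:
            i += 1
            j += 1
            if j == M:
                return True
        elif j != 0:
            # Mismatch after j matches: reuse the matched prefix, keep i.
            j = aux[j-1]
        else:
            i += 1
    return False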

In [96]:
KMPSearch('saveearth','thehelloworldchapterofsaveearth')

True

In [97]:
%timeit for x in range(10): KMPSearch('saveearth','thehelloworldchapterofsaveearth')

338 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
