# Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt (KMP) algorithm is a string matching algorithm that finds occurrences of a pattern within a larger text. Its significance is that it runs in O(n+m) time even in the worst case, where n is the length of the text and m is the length of the pattern. This makes it more efficient than naive string matching, which takes O(n*m) time in the worst case.
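
To see where the O(n*m) bound comes from, here is a minimal sketch of a naive matcher that also counts character comparisons. The name `naive_search` and the comparison counter are illustrative assumptions, not part of KMP itself.

```python
def naive_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break
            k += 1
        if k == m:
            return start, comparisons
    return -1, comparisons


# Adversarial input: every alignment matches four characters, then fails.
index, comparisons = naive_search("A" * 20, "AAAAB")
print(index, comparisons)  # -1 80
```

Each of the 16 alignments costs 5 comparisons here, so the work grows with the product of the two lengths on inputs of this shape.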

The KMP algorithm uses a prefix function to preprocess the pattern, which allows it to avoid unnecessary character comparisons when searching for the pattern in the text. For each position i in the pattern, the prefix function gives the length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i]. When a mismatch occurs, the search uses this value to shift the pattern forward without re-examining text characters that have already been matched.
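
The definition can be checked with a brute-force sketch that applies it literally to every prefix of the pattern. The helper name `prefix_function_naive` is hypothetical, and this version is deliberately slow; it exists only to show what the table contains.

```python
def prefix_function_naive(pattern):
    """Build the prefix table directly from its definition (slow, for illustration)."""
    table = []
    for i in range(len(pattern)):
        sub = pattern[:i + 1]
        best = 0
        # k < len(sub) keeps the prefix and the suffix proper.
        for k in range(1, len(sub)):
            if sub[:k] == sub[-k:]:
                best = k
        table.append(best)
    return table


print(prefix_function_naive("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
```

The final entry is 4 because "ABAB" is both a proper prefix and a proper suffix of "ABABCABAB".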

In [1]:
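def kmp(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n = len(text)
    m = len(pattern)
    if m == 0:
        return 0  # an empty pattern matches at index 0, mirroring str.find
    prefix = get_prefix(pattern)
    j = 0  # number of pattern characters currently matched
    for i in range(n):
        # On a mismatch, fall back to the next shorter prefix that is also a suffix.
        while j > 0 and text[i] != pattern[j]:
            j = prefix[j-1]
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1
    return -1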

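def get_prefix(pattern):
    """Return the prefix table: prefix[i] is the length of the longest
    proper prefix of pattern[:i+1] that is also a suffix of it."""
    m = len(pattern)
    prefix = [0] * m
    j = 0  # length of the currently matched prefix
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = prefix[j-1]
        if pattern[i] == pattern[j]:
            j += 1
        prefix[i] = j
    return prefix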


In [2]:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
result = kmp(text, pattern)
print("Pattern found at index:", result)

Pattern found at index: 10


An important application of the KMP algorithm is in **bioinformatics for DNA sequence analysis**. In particular, it can locate exact occurrences of a known motif in DNA and protein sequences, which can help identify functional and structural features of biological molecules.

The following function finds all occurrences of a DNA sequence motif in a larger DNA sequence, including overlapping ones:

In [3]:
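def find_motif(dna, motif):
    """Return the start indices of all occurrences of motif in dna (overlaps included)."""
    positions = []
    n = len(dna)
    m = len(motif)
    if m == 0:
        return positions  # avoid indexing into an empty motif
    prefix = get_prefix(motif)
    j = 0
    for i in range(n):
        while j > 0 and dna[i] != motif[j]:
            j = prefix[j-1]
        if dna[i] == motif[j]:
            j += 1
        if j == m:
            positions.append(i - m + 1)
            # Fall back instead of resetting to 0 so overlapping matches are found.
            j = prefix[j-1]
    return positions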


In [4]:
# Example DNA sequence
dna = "ATGCGCTAGCTAGATGCTAGCTAGCTAGCTAG"

# Motif to search for
motif = "ATG"

# Call the find_motif function to get the positions of the motif
positions = find_motif(dna, motif)

# Print the positions of the motif in the DNA sequence
print("Motif found at positions:", positions)


Motif found at positions: [0, 13]
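
Because the search falls back with `j = prefix[j-1]` after a full match instead of resetting `j` to 0, overlapping occurrences are reported as well. One way to sanity-check results like the one above is a simple reference built on Python's built-in `str.find`. This is a sketch; the helper name `find_all_overlapping` is an assumption introduced here for illustration.

```python
def find_all_overlapping(text, pattern):
    """Reference search: collect every start index, overlaps included."""
    positions = []
    if not pattern:
        return positions  # treat an empty pattern as having no occurrences
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = text.find(pattern, start + 1)
    return positions


print(find_all_overlapping("AAAA", "AA"))  # [0, 1, 2]
print(find_all_overlapping("ATGCGCTAGCTAGATGCTAGCTAGCTAGCTAG", "ATG"))  # [0, 13]
```

Comparing the output of a KMP implementation against a reference like this on small or random inputs is a cheap way to catch off-by-one errors.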


Regular expressions (regex) and the Knuth-Morris-Pratt (KMP) algorithm are both used for pattern matching in strings, but they differ in their approach and use cases.

Regular expressions are a general-purpose pattern matching tool used to match complex patterns in strings. Regex allows the user to specify a pattern using a combination of literal characters, operators, and wildcards to match against a string. Depending on the implementation, the regex engine searches the input using backtracking or finite automata (NFA or DFA simulation). Regex is often used in text processing, data mining, and natural language processing applications.

On the other hand, the KMP algorithm is a specialized algorithm for finding occurrences of one fixed pattern in a string. It works by pre-computing a partial match table (the prefix function described above), which is then used to find matches in linear time. KMP is often used in bioinformatics, text editors, and image processing applications.

If the goal is to find occurrences of one fixed pattern in a string, KMP guarantees linear time, whereas a regex engine adds compilation and matching overhead and, in backtracking implementations, can be slower. If the pattern includes a variety of possible variations, or the task is to match structured text such as email addresses or URLs, regex is the more appropriate tool. KMP handles only literal patterns, so it is not feasible in that case.

In summary, regular expressions are more powerful and flexible than KMP but can be slower for simple string matching tasks. KMP is efficient for exact matching of a single pattern, but it is not as flexible as regex. The choice between the two depends on the specific requirements of the task at hand.
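
To make the contrast concrete, here is a small sketch using Python's standard `re` module. The sequence and the pattern `TA[AG]` (matching either "TAA" or "TAG") are made-up examples; a single KMP search takes one literal pattern, so covering both variants would need one search per variant.

```python
import re

sequence = "ATGTAACCTAGG"

# A character class lets one regex cover several literal variants at once.
stop_like = re.compile(r"TA[AG]")
positions = [match.start() for match in stop_like.finditer(sequence)]
print(positions)  # [3, 8]

# An exact search, the kind of task KMP addresses, finds only one variant.
print(sequence.find("TAG"))  # 8
```

Note that `finditer` reports non-overlapping matches, unlike the KMP-based `find_motif` above.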