# String Matching Algorithms

Welcome to this Jupyter Notebook on String Matching Algorithms! This notebook is designed to provide a comprehensive introduction to string matching algorithms, which play a crucial role in various applications, such as text processing, data retrieval, and computational biology. Understanding these algorithms is essential for developing efficient solutions to problems involving pattern searching, string comparison, and text manipulation.

### What Are String Matching Algorithms?

String matching algorithms are techniques used to find occurrences of a pattern (substring) within a larger text (string). These algorithms are fundamental in areas such as search engines, DNA sequencing, and text editors, where efficiently locating specific sequences is essential.
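
As a point of reference, Python's built-in string methods already solve the basic version of this problem. The short sketch below uses a made-up text and pattern (they are illustrative, not taken from the rest of this notebook) to show `str.find` returning the index of the first occurrence, or `-1` when there is none.

```python
# Illustrative strings; any text and pattern would do.
text = "the quick brown fox"
pattern = "quick"

first_index = text.find(pattern)    # index of the first occurrence, or -1
missing_index = text.find("cat")    # "cat" does not occur in the text
contains_pattern = pattern in text  # membership test, True or False

print(first_index, missing_index, contains_pattern)
```

The algorithms in this notebook solve the same task explicitly, which makes their cost visible and lets them report every occurrence rather than only the first.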

### Key String Matching Algorithms

This notebook covers three essential string matching algorithms. They all solve the same pattern-searching problem but use different strategies to reduce the number of character comparisons.

1. **Naive String Matching**: A straightforward approach that checks for the pattern at every position in the text, providing a basic understanding of string matching.

2. **Knuth-Morris-Pratt (KMP) Algorithm**: An efficient algorithm that preprocesses the pattern to create a longest prefix-suffix table, allowing for skipping unnecessary comparisons in the text.

3. **Boyer-Moore Algorithm**: A highly efficient string matching algorithm that uses heuristics to skip sections of the text, making it particularly effective for large alphabets.

### Learning Objectives

By the end of this notebook, you will:
- Understand the fundamental concepts and applications of string matching algorithms.
- Learn how to implement and apply these algorithms in Python.

The choice of string matching algorithm depends on the specific requirements of your application, such as the size of the text and pattern, the frequency of searches, and whether preprocessing of the pattern is feasible. Efficient algorithms like KMP and Boyer-Moore are generally preferred for larger texts and patterns, while naive approaches may suffice for simpler tasks.

## Naive String Matching Algorithm

The **Naive String Matching Algorithm** is a straightforward approach to finding occurrences of a pattern string within a text string. This algorithm checks for matches by comparing the characters of the pattern to the text sequentially, making it simple to understand and implement. Despite its simplicity, it can be inefficient for larger texts or patterns.

### String Representation

In the context of string matching, strings are typically represented as sequences of characters. The text string $ T $ has a length $ n $, and the pattern string $ P $ has a length $ m $.

### Properties of Naive String Matching Algorithm

1. **Simple and Intuitive:** The algorithm is easy to understand and implement, making it a good educational tool for understanding string matching concepts.

2. **Exhaustive Search:** It systematically checks each possible position in the text for a potential match with the pattern.

3. **Comparison-Based:** The algorithm relies on character-by-character comparisons to determine if the pattern matches a substring of the text.

4. **Brute Force Approach:** It is a brute force method, which means it does not employ any advanced techniques to optimize the search process.

### How Naive String Matching Operates

1. **Initialization:** Start by determining the lengths of the text string $ T $ and the pattern string $ P $. Let $ n $ be the length of $ T $ and $ m $ be the length of $ P $.

2. **Iterate Through Text:** The algorithm uses a loop to slide the pattern over the text, checking for matches:
   - For each index $ i $ in the text, from $ 0 $ to $ n - m $ (the last position where the pattern can fit):
     - Compare the substring of $ T $ starting at index $ i $ with the pattern $ P $.

3. **Character Comparison:**
   - For each position $ i $, compare each character of the pattern $ P[j] $ with the corresponding character in the text substring $ T[i + j] $ for $ j $ from $ 0 $ to $ m - 1 $.
   - If all characters match, a successful match is found, and the starting index $ i $ is recorded.

4. **Termination:** After checking all possible positions, the algorithm concludes, returning the starting indices of all matches found in the text.

In [13]:
def naive_string_matching(text: str, pattern: str) -> list:
    # Determine the lengths of the text and the pattern
    # Length of the text
    text_length = len(text) # n
    # Length of the pattern
    pattern_length = len(pattern) # m

    # Initialize a list to store the starting indices of matches
    match_indices = []  

    # Iterate through the text to check for the pattern
    # From index 0 to n - m
    for i in range(text_length - pattern_length + 1):  
        # Assume a match is found
        match_found = True  

        # Compare each character in the pattern with the corresponding character in the text
        # For each character in the pattern
        for j in range(pattern_length):  
            # If characters do not match
            if text[i + j] != pattern[j]:  
                # Mark as not found
                match_found = False  
                # No need to check further; break out of the loop
                break  

        # If a match was found, record the starting index
        if match_found:  
            # Add the starting index to the list
            match_indices.append(i)  

    # Return the list of starting indices where matches were found
    return match_indices  


Example of use:

In [None]:

text_string = "ababcabcabababdababcabcabababd"
pattern_string = "bca"
    
results = naive_string_matching(text_string, pattern_string)
    
print("Pattern found at indices:", results)
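
The result of this example can be checked independently. The helper below is a compact restatement of the same brute-force idea using `str.startswith` with a start offset; the name `find_all_occurrences` is an illustrative choice and is not part of the notebook's own code.

```python
def find_all_occurrences(text: str, pattern: str) -> list:
    # Try every alignment i and keep those where the pattern starts at i.
    last_start = len(text) - len(pattern)
    return [i for i in range(last_start + 1) if text.startswith(pattern, i)]


print(find_all_occurrences("ababcabcabababdababcabcabababd", "bca"))
```

For the example strings above, both approaches should report the indices 3, 6, 18, and 21.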

## Knuth-Morris-Pratt (KMP) Algorithm

The **Knuth-Morris-Pratt (KMP) Algorithm** is a sophisticated string matching algorithm designed to efficiently search for occurrences of a pattern string within a text string. Unlike the naive approach, which can perform unnecessary comparisons, KMP leverages the structure of the pattern to avoid re-evaluating characters that have already been matched, significantly improving performance.

### String Representation

In the context of string matching, strings are represented as sequences of characters. The text string $ T $ has a length $ n $, and the pattern string $ P $ has a length $ m $.

### Properties of the KMP Algorithm

1. **Preprocessing Step:** KMP includes a preprocessing phase that builds a **partial match table** (also known as the **prefix table** or **failure function**), which helps to skip unnecessary comparisons during the search.

2. **Linear Time Complexity:** The KMP algorithm runs in linear time, $ O(n + m) $, where $ n $ is the length of the text and $ m $ is the length of the pattern. This efficiency is achieved through the use of the partial match table.

3. **No Backtracking in Text:** When a mismatch occurs, KMP does not backtrack in the text; instead, it uses the information in the partial match table to determine the next positions to compare.

4. **Versatile Application:** The KMP algorithm can be applied to a variety of exact string matching problems, such as DNA sequence analysis. Several patterns can be handled by running the search once per pattern.

### How the KMP Algorithm Operates

1. **Preprocessing Phase (Building the Partial Match Table):**
   - Create a partial match table for the pattern $ P $. Entry $ i $ stores the length of the longest proper prefix of $ P[0..i] $ that is also a suffix of $ P[0..i] $.
   - On a mismatch, the table tells the search how many leading pattern characters are already known to match, so the comparison resumes from that pattern position instead of restarting at the beginning of the pattern.

   For example, for the pattern `ABABAC`, the partial match table looks like this:
   - Index 0 (`A`): 0 (a single character has no non-empty proper prefix)
   - Index 1 (`AB`): 0 (no proper prefix is also a suffix)
   - Index 2 (`ABA`): 1 (prefix `A`)
   - Index 3 (`ABAB`): 2 (prefix `AB`)
   - Index 4 (`ABABA`): 3 (prefix `ABA`)
   - Index 5 (`ABABAC`): 0 (no proper prefix is also a suffix)

2. **Searching Phase:**
   - Initialize two indices: one for the text $ i $ (starting at 0) and one for the pattern $ j $ (also starting at 0).
   - Traverse the text string $ T $ using the index $ i $:
     - If $ T[i] $ matches $ P[j] $, increment both $ i $ and $ j $.
     - If $ j $ reaches the length of the pattern $ m $, a match is found at index $ i - j $. Set $ j $ to the table value at index $ j - 1 $ so that overlapping matches are also found.
     - If there is a mismatch after some matches:
       - Instead of resetting $ j $ to zero, use the partial match table to skip unnecessary comparisons. Set $ j $ to the table value at index $ j - 1 $, the position of the last matched pattern character.
     - If the mismatch occurs at $ j = 0 $, increment $ i $.

3. **Termination:** Continue the process until all characters in the text have been examined. The algorithm concludes, providing the starting indices of all matches found in the text.
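
The table from the preprocessing phase can also be computed directly from its definition, which is a useful check on the `ABABAC` example. The sketch below is deliberately slow (it tests every candidate length) and is meant only to illustrate the definition; the function name `partial_match_table_by_definition` is an assumption made for this example.

```python
def partial_match_table_by_definition(pattern: str) -> list:
    table = []
    for i in range(len(pattern)):
        prefix = pattern[: i + 1]
        best = 0
        # Test every proper prefix length; the longest one that is
        # also a suffix of pattern[0..i] is kept.
        for length in range(1, len(prefix)):
            if prefix[:length] == prefix[-length:]:
                best = length
        table.append(best)
    return table


print(partial_match_table_by_definition("ABABAC"))  # [0, 0, 1, 2, 3, 0]
```

An efficient implementation produces the same values in $ O(m) $ time by reusing earlier table entries instead of re-testing every length.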

### Applications of the KMP Algorithm

1. **Text Searching:** KMP is widely used in text searching applications, such as word processors, search engines, and DNA sequencing tools, where efficient pattern searching is essential.

2. **Natural Language Processing (NLP):** KMP can assist in NLP tasks involving substring searches, allowing for efficient text analysis and processing.

In [15]:
def create_partial_match_table(pattern: str) -> list:
    # Initialize the partial match table
    # Length of the pattern
    pattern_length = len(pattern) # m
    partial_match_table = [0] * pattern_length  
    # Length of previous longest prefix suffix
    j = 0  

    # Fill the partial match table
    # Start from the second character
    for i in range(1, pattern_length):  
        # If there is a mismatch
        while j > 0 and pattern[i] != pattern[j]:  
            # Use the table to skip comparisons
            j = partial_match_table[j - 1]  

        # If there is a match
        if pattern[i] == pattern[j]:  
            # Increase length of the prefix suffix
            j += 1  
            # Update the table
            partial_match_table[i] = j  
        else:
            # No prefix suffix
            partial_match_table[i] = 0  

    # Return the completed table
    return partial_match_table  

def kmp_search(text: str, pattern: str) -> list:
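    # An empty pattern matches at every position (same result as naive_string_matching);
    # without this guard, pattern[j] below raises an IndexError
    if not pattern:
        return list(range(len(text) + 1))

    # Preprocessing phase: Building the partial match table
    # Create the table
    partial_match_table = create_partial_match_table(pattern)  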
    # Length of the pattern
    pattern_length = len(pattern) # m
    # Length of the text
    text_length = len(text) # n

    # Searching phase
    # Index for text
    i = 0  
    # Index for pattern
    j = 0  
    # To store the starting indices of matches
    match_indices = []  

    # Traverse the text
    while i < text_length:
        # If there is a match
        if text[i] == pattern[j]:  
            # Move to the next character in text
            i += 1  
            # Move to the next character in pattern
            j += 1  

        # If we found a match
        if j == pattern_length:  
            # Record the starting index of the match
            match_indices.append(i - j)  
            # Use the table to skip unnecessary comparisons
            j = partial_match_table[j - 1]  

        # If there is a mismatch after some matches
        elif i < text_length and text[i] != pattern[j]:  
            # Use the partial match table to avoid unnecessary comparisons
            if j != 0:  
                # Move to the next candidate position in the pattern
                j = partial_match_table[j - 1]  
            else:
                # Move to the next character in the text
                i += 1  

    # Return the list of starting indices where matches were found
    return match_indices  

Example of use:

In [None]:
text_string = "ababcababacabcababcababacabc"
pattern_string = "ababa"

results = kmp_search(text_string, pattern_string)

print("Partial Match Table:", create_partial_match_table(pattern_string))

print("Pattern found at indices:", results)
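
Because KMP reports overlapping occurrences, a cross-check needs a reference that does the same. One option, sketched here with Python's standard `re` module, is a zero-width lookahead; the helper name `find_overlapping` is illustrative and not part of the notebook's code.

```python
import re


def find_overlapping(text: str, pattern: str) -> list:
    # A zero-width lookahead can match at every start position, so
    # overlapping occurrences are reported as well.
    lookahead = "(?=" + re.escape(pattern) + ")"
    return [match.start() for match in re.finditer(lookahead, text)]


print(find_overlapping("ababcababacabcababcababacabc", "ababa"))  # [5, 19]
print(find_overlapping("abababa", "aba"))                         # [0, 2, 4]
```

The second call shows the overlapping case: `aba` starts at indices 0, 2, and 4 of `abababa`.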

## Boyer-Moore Algorithm

The **Boyer-Moore Algorithm** is a highly efficient string matching algorithm that skips sections of the text during the search process, making it much faster in practice than many other algorithms like Naive String Matching or even Knuth-Morris-Pratt (KMP). Boyer-Moore achieves this by processing the pattern string in reverse order and using information from mismatches to skip over portions of the text, reducing the number of comparisons.

### String Representation

In string matching, the text string $ T $ has a length $ n $, and the pattern string $ P $ has a length $ m $. Boyer-Moore compares the pattern to the text by aligning the right end of the pattern with a substring in the text.

### Properties of the Boyer-Moore Algorithm

1. **Right-to-Left Comparison:** Unlike most string matching algorithms, Boyer-Moore compares the characters of the pattern and the text from right to left. This unique approach allows it to skip large portions of the text when mismatches occur.

2. **Two Key Heuristics:** The Boyer-Moore algorithm uses two powerful heuristics:
   - **Bad Character Rule:** When a mismatch occurs, the algorithm shifts the pattern to align the mismatched character in the text with the last occurrence of that character in the pattern (or skips past the character if it doesn't appear in the pattern).
   - **Good Suffix Rule:** If a mismatch occurs after matching a suffix of the pattern, the pattern is shifted so that the next occurrence of this suffix in the pattern (if any) is aligned with the same position in the text.

3. **Highly Efficient for Large Alphabets:** Boyer-Moore performs exceptionally well when the alphabet is large and the pattern is long, as it tends to skip large sections of the text in each iteration.

4. **Worst-Case Behavior:** Although its worst-case time complexity can be quadratic in nature, Boyer-Moore is often extremely efficient in practice, especially for long patterns and large texts.
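
The bad character rule can be illustrated in isolation. The sketch below computes the shift for a single mismatch, given the pattern, the pattern index `j` where the mismatch occurred, and the mismatched text character. The helper name and the `max(1, ...)` clamp are assumptions of this example; in the full algorithm the good suffix rule supplies the fallback shift.

```python
def bad_character_shift(pattern: str, j: int, text_char: str) -> int:
    # Rightmost index of text_char in the pattern, or -1 if it is absent.
    last_occurrence = pattern.rfind(text_char)
    # Align that occurrence with the text character, but never shift by
    # less than one position (the raw value can be negative).
    return max(1, j - last_occurrence)


print(bad_character_shift("abc", 2, "a"))    # 2: align pattern[0] with the text 'a'
print(bad_character_shift("abc", 2, "x"))    # 3: 'x' is absent, skip past it
print(bad_character_shift("abcab", 2, "b"))  # 1: rightmost 'b' lies to the right of j
```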

### How the Boyer-Moore Algorithm Operates

1. **Preprocessing Phase:**
   - **Bad Character Table:** The algorithm preprocesses the pattern to create a bad character table. For each character in the pattern, this table records the rightmost position of that character in the pattern. If the character does not appear in the pattern, its value is set to -1.
   - **Good Suffix Table:** The good suffix table is also preprocessed. This table provides shift values based on matching suffixes. It helps to determine how far the pattern can be shifted when a suffix is matched.

2. **Searching Phase:**
   - Start by aligning the pattern $ P $ with the beginning of the text $ T $, so that the right end of the pattern sits at text index $ m - 1 $.
   - Compare the characters of the pattern and text starting from the rightmost character of the pattern.
   - If all characters match, a match is found, and the index of the match is recorded.
   - If a mismatch occurs, use the **bad character rule** and **good suffix rule** to determine how far to shift the pattern for the next comparison:
     - **Bad Character Rule:** If the mismatched character in the text appears earlier in the pattern, shift the pattern so that the last occurrence of this character in the pattern aligns with the text. If it doesn't appear in the pattern, skip past the mismatched character in the text.
     - **Good Suffix Rule:** If the mismatch occurs after a partial match of the suffix, shift the pattern to align the next occurrence of the matched suffix with the text or as far as necessary to ensure progress.

3. **Iterative Process:** Continue shifting the pattern and comparing it with substrings of the text until the entire text has been searched.

4. **Termination:** Once all positions in the text have been checked, the algorithm returns the starting indices of all matches found.

### Applications of the Boyer-Moore Algorithm

1. **Text Search Tools:** Boyer-Moore is commonly used in text searching applications such as grep, where efficiency with large text files is critical.

2. **Pattern Matching in Large Data:** The algorithm is useful in any scenario where large amounts of data need to be searched for patterns, such as in search engines or large text datasets.

3. **Bioinformatics:** Boyer-Moore can be applied in bioinformatics to search for exact occurrences of specific DNA or protein sequences in large biological datasets.

4. **Plagiarism Detection:** Its efficiency on large documents makes it a candidate building block for plagiarism detection tools, for example to locate exact copies of a passage across multiple documents.

In [17]:
def create_bad_character_table(pattern: str) -> dict:
    # Create the bad character table
    bad_char_table = {}
    # Length of the pattern
    pattern_length = len(pattern) # m

    # Fill the table with the rightmost occurrence of each character in the pattern
    for i in range(pattern_length):
        # Store the index of the last occurrence of pattern[i]
        bad_char_table[pattern[i]] = i  

    return bad_char_table

def create_good_suffix_table(pattern: str) -> list:
    # Length of the pattern
    pattern_length = len(pattern) # m
    # Initialize the good suffix table and border positions
    good_suffix_table = [0] * pattern_length
    border_positions = [-1] * (pattern_length + 1)
    i = pattern_length
    j = pattern_length + 1
    border_positions[i] = j

    # Create the table by iterating from the end of the pattern
    while i > 0:
        while j <= pattern_length and pattern[i - 1] != pattern[j - 1]:
            # If no good suffix rule exists, set the value for skipping
            if good_suffix_table[j - 1] == 0:  
                good_suffix_table[j - 1] = j - i
            # Update j to the next border position
            j = border_positions[j]
        # Move the border one position left
        i -= 1
        j -= 1
        border_positions[i] = j

    # Fill in the remainder of the table
    j = border_positions[0]
    for i in range(pattern_length):
        # If no good suffix rule exists, set to the final border value
        if good_suffix_table[i] == 0:
            good_suffix_table[i] = j
        # Update j when the current index matches the border
        if i == j:
            j = border_positions[j]

    return good_suffix_table

def boyer_moore_search(text: str, pattern: str) -> list:
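    # An empty pattern matches at every position (same result as naive_string_matching);
    # without this guard, good_suffix_table[0] below raises an IndexError
    if not pattern:
        return list(range(len(text) + 1))

    # Preprocessing phase
    # Create the bad character rule table
    bad_char_table = create_bad_character_table(pattern)  
    # Create the good suffix rule table
    good_suffix_table = create_good_suffix_table(pattern)  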

    # Length of the pattern
    pattern_length = len(pattern) # m 
    # Length of the text
    text_length = len(text) # n
    # List to store the starting indices of matches found
    match_indices = []  

    # Searching phase
    # Shift of the pattern with respect to the text
    s = 0  
    # Continue the search while the pattern can still fit within the text
    while s <= text_length - pattern_length:
        # Start comparing from the last character of the pattern
        j = pattern_length - 1  

        # Compare the pattern and text from right to left
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1

        # If the pattern is fully matched
        if j < 0:
            # Record the starting index of the match
            match_indices.append(s)  
            # Shift the pattern according to the good suffix rule
            s += good_suffix_table[0]  

        else:
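            # Bad character rule: align the mismatched text character with its
            # rightmost occurrence in the pattern (-1 if it does not occur)
            last_occurrence = bad_char_table.get(text[s + j], -1)
            # This is negative when the rightmost occurrence lies to the right of j;
            # the max() below then falls back to the good suffix shift
            bad_char_move = j - last_occurrence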

            # Shift according to the good suffix rule
            good_suffix_move = good_suffix_table[j]

            # Choose the maximum shift from the bad character and good suffix rules
            s += max(bad_char_move, good_suffix_move)

    # Return the list of starting indices where matches were found
    return match_indices

Example of use:

In [None]:
text_string = "ababcababcabc"
pattern_string = "abc"

results = boyer_moore_search(text_string, pattern_string)

print("Bad Character table", create_bad_character_table(pattern_string))
print("Good Suffix table", create_good_suffix_table(pattern_string))
print("Pattern found at indices:", results)

## Time Complexity of Key String Matching Algorithms

### Naive String Matching
**Time Complexity: $O((n - m + 1) \cdot m)$**

The Naive String Matching algorithm checks for the pattern at every position in the text. It slides the pattern one position at a time, performing character-by-character comparison at each shift. The algorithm makes $n - m + 1$ shifts and performs up to $m$ comparisons at each shift, so the worst case is $O((n - m + 1) \cdot m)$, which simplifies to $O(n \cdot m)$ when $m$ is small relative to $n$. The worst case arises with repetitive inputs, for example a text consisting only of `a` characters and a pattern of `a` characters ending in `b`. This is inefficient for large inputs.
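
The worst-case bound can be observed by counting comparisons. The sketch below mirrors the naive double loop but returns the number of character comparisons instead of the matches; the function name and the test strings are chosen for illustration.

```python
def naive_count_comparisons(text: str, pattern: str) -> int:
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons


# 17 alignments, each failing only on the 4th character: 17 * 4 = 68
print(naive_count_comparisons("a" * 20, "aaab"))
```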

### Knuth-Morris-Pratt (KMP) Algorithm
**Time Complexity: $O(n + m)$**

The KMP algorithm improves on the naive approach by preprocessing the pattern to create a longest prefix-suffix (LPS) table, the same structure called the partial match table earlier in this notebook. On a mismatch, the table tells the algorithm how much of the pattern is still known to match, so the text index never moves backward. Preprocessing the pattern takes $O(m)$ time, and the search phase runs in $O(n)$ because the text index only advances and every fallback in the pattern is paid for by an earlier successful comparison. Therefore, the overall time complexity is $O(n + m)$, making KMP much more efficient than the naive approach on repetitive inputs.
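
The linear bound can be checked empirically as well. The following is a minimal sketch of a KMP search instrumented with a comparison counter. It is written independently of the implementation earlier in this notebook, assumes a non-empty pattern, and the name `kmp_count_comparisons` is an illustrative choice.

```python
def kmp_count_comparisons(text: str, pattern: str):
    m = len(pattern)  # assumed to be at least 1
    # Build the prefix table.
    table = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k

    comparisons = 0
    matches = []
    j = 0
    for i, ch in enumerate(text):
        # Each comparison either consumes the text character or lowers j.
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = table[j - 1]
        if j == m:
            matches.append(i - m + 1)
            j = table[j - 1]
    return matches, comparisons


matches, comparisons = kmp_count_comparisons("a" * 20, "aaab")
print(matches, comparisons, comparisons <= 2 * 20)
```

Since `j` can only decrease as often as it has increased, the search performs at most $2n$ comparisons, even on inputs that are adversarial for the naive approach.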

### Boyer-Moore Algorithm
**Time Complexity: Best Case: $O(n / m)$, Worst Case: $O(n \cdot m)$**

The Boyer-Moore algorithm uses two heuristics (the bad character rule and the good suffix rule) to skip over sections of the text, making it highly efficient in practice, especially for long patterns or large alphabets. In the best case, every alignment fails on its first comparison and the pattern shifts by its full length, giving a time complexity of $O(n / m)$. In the worst case, particularly with small alphabets or repetitive patterns (for example, a text and a pattern that both consist of a single repeated character, with all matches reported), the time complexity can degrade to $O(n \cdot m)$, but this is rare in practical applications.
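
Both extremes can be reproduced with a simplified search that uses only the bad character rule. This is a sketch under stated assumptions, not the full algorithm: it omits the good suffix rule, shifts by one position after a full match, and only counts comparisons. The function name is illustrative.

```python
def count_bad_character_comparisons(text: str, pattern: str) -> int:
    # Rightmost index of each pattern character.
    last = {ch: idx for idx, ch in enumerate(pattern)}
    n, m = len(text), len(pattern)
    comparisons = 0
    s = 0
    while s <= n - m:
        j = m - 1
        # Compare right to left, counting every character comparison.
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[s + j]:
                break
            j -= 1
        if j < 0:
            s += 1  # full match: shift by one (simplification)
        else:
            s += max(1, j - last.get(text[s + j], -1))
    return comparisons


print(count_bad_character_comparisons("a" * 20, "b" * 5))  # 4, about n / m
print(count_bad_character_comparisons("a" * 20, "a" * 5))  # 80, about n * m
```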

### Rabin-Karp Algorithm
**Time Complexity: Average Case: $O(n + m)$, Worst Case: $O(n \cdot m)$**

Rabin-Karp is not implemented in the sections above; it is listed here for comparison. The algorithm uses hashing to compare the pattern with substrings of the text. In the average case, the rolling hash function updates the hash of each text window in constant time, and the time complexity is $O(n + m)$. However, in the worst case (many hash collisions, or many true matches), the algorithm must perform character-by-character comparisons at most positions, leading to a time complexity of $O(n \cdot m)$. Despite this, Rabin-Karp is effective when searching for multiple patterns or when a hash function with few collisions is used.
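
The rolling-hash idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `rabin_karp_search`, the base 256, and the modulus 1,000,000,007 are assumptions chosen for the example. Every hash hit is verified by a direct comparison, so collisions cannot produce false matches.

```python
def rabin_karp_search(text: str, pattern: str,
                      base: int = 256, modulus: int = 1_000_000_007) -> list:
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))  # empty pattern matches everywhere
    if m > n:
        return []

    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, modulus)

    # Hash the pattern and the first text window.
    pattern_hash = 0
    window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        window_hash = (window_hash * base + ord(text[k])) % modulus

    matches = []
    for i in range(n - m + 1):
        # Verify on a hash hit to rule out collisions.
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the hash: drop text[i], append text[i + m].
            window_hash = ((window_hash - ord(text[i]) * high) * base
                           + ord(text[i + m])) % modulus
    return matches


print(rabin_karp_search("ababcababcabc", "abc"))  # [2, 7, 10]
```

The verification step is what causes the $O(n \cdot m)$ worst case: if many windows share the pattern's hash, each of them costs up to $m$ character comparisons.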