# Pattern Matching Algorithm Speed Comparison in Jupyter Notebook

## 1. Introduction and problem description
In this project, we compare the performance of three string pattern-matching algorithms: 
1. Brute-force
2. Knuth-Morris-Pratt (KMP)
3. Boyer-Moore

These algorithms are fundamental in string searching tasks and are used in many applications including text editors, search engines, and DNA sequence analysis.
Our goal is to experimentally analyze their relative performance by measuring the time each takes to find patterns of various lengths in large text documents.

## 2. Description of the 3 algorithms and how they work

### I. Brute-force
The brute-force algorithm is a straightforward approach that systematically explores all possible solutions to a problem until the correct one is found. It is often used when the problem space is small or when no optimized solution is available. The method is simple to implement but can be computationally expensive for larger problems: for string matching, the worst case needs on the order of n × m character comparisons for a text of length n and a pattern of length m.

#### Key Characteristics
Brute Force Search involves methodically testing every possible solution in a specified order. It does not use optimization techniques or heuristics, relying solely on exhaustive exploration. This makes it a generic approach applicable to various domains, such as string matching, combinatorial problems, and cybersecurity.

For example, in string search, the algorithm compares a substring against every possible position in the target string until a match is found. Similarly, in combinatorial problems, it tests all combinations of elements to find the desired result.


In [None]:
def brute_force(text, pattern):
    n = len(text)
    m = len(pattern)

    for i in range(n - m + 1):  # Iterate through all possible starting positions
        match = True
        for j in range(m):  # Compare each character of the pattern
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:  # Checked only after the inner loop has finished
            return i  # Return the starting index of the first match
    return -1  # Return -1 if no match is found
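
To make the cost of exhaustive comparison concrete, the sketch below counts character comparisons for a naive scan. The helper name `count_comparisons` and the synthetic input are assumptions chosen for illustration; they are not part of the notebook's benchmark.

```python
def count_comparisons(text, pattern):
    """Return (first_match_index, character_comparisons) for a naive scan."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:  # every pattern character matched at alignment i
            return i, comparisons
    return -1, comparisons


# Near-worst case: each alignment matches 9 characters before failing on the 10th.
text = "a" * 1000
pattern = "a" * 9 + "b"
index, comparisons = count_comparisons(text, pattern)
print(index, comparisons)  # -1 9910  (991 alignments x 10 comparisons)
```

On ordinary English text most alignments fail on the first character, so the count stays much closer to the text length than to this bound.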


### II. Knuth-Morris-Pratt (KMP)

The Knuth-Morris-Pratt (KMP) algorithm is an efficient string-searching method that finds all occurrences of a "pattern" within a "text" by avoiding unnecessary re-examinations of characters. Developed by Donald Knuth, Vaughan Pratt, and James H. Morris, KMP improves search speed by using information gathered during the pattern preprocessing phase.

#### Key Principles

KMP preprocesses the pattern to build an auxiliary array called the Longest Prefix Suffix (LPS) array. The LPS array indicates the longest proper prefix of the pattern that is also a suffix for each sub-pattern. This allows the algorithm to skip certain comparisons when a mismatch occurs, leading to improved performance over naive approaches.

##### Preprocessing the Pattern

The LPS array is constructed for the pattern before the search begins. For each position in the pattern, it stores the length of the longest proper prefix of the pattern that is also a suffix of the substring ending at that position ("proper" means the prefix is shorter than that substring). This preprocessing step enables the KMP algorithm to efficiently determine how far to shift the pattern after a mismatch, minimizing redundant comparisons during the search phase.

In [None]:
def computeLPSArray(pattern):
    """
    Computes the Longest Prefix Suffix (LPS) array for the given pattern.
    The LPS array is used in the KMP algorithm to skip unnecessary comparisons.
    """
    M = len(pattern)
    lps = [0] * M  # Initialize LPS array
    length = 0     # Length of the previous longest prefix suffix
    i = 1
    while i < M:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]  # Use previous LPS value
            else:
                lps[i] = 0
                i += 1
    return lps
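
As a worked example of the LPS values, the sketch below uses a compact equivalent of the construction above and prints the table for a few patterns. The name `prefix_table` is a hypothetical helper used only to keep the example self-contained.

```python
def prefix_table(pattern):
    """Compact equivalent of the LPS construction (hypothetical helper name)."""
    lps = [0] * len(pattern)
    length = 0  # length of the current longest proper prefix that is also a suffix
    for i in range(1, len(pattern)):
        # Fall back through shorter borders until one can be extended.
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps


for p in ("ABABCABAB", "AAAA", "ABCDE"):
    print(p, prefix_table(p))
# ABABCABAB [0, 0, 1, 2, 0, 1, 2, 3, 4]
# AAAA [0, 1, 2, 3]
# ABCDE [0, 0, 0, 0, 0]
```

In `ABABCABAB`, the final value 4 says that the first four characters (`ABAB`) reappear as the last four, so after a mismatch beyond that point the search can resume with four characters already matched.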

##### Searching the Pattern

Once the LPS array is ready, the KMP algorithm uses it to search for the pattern in the text. The algorithm compares characters of the pattern with the text and uses the LPS array to skip unnecessary comparisons when a mismatch occurs. The text index never moves backwards, so each text character is read only once.

In [None]:
def KMPSearch(text, pattern):
    """
    Searches for occurrences of 'pattern' in 'text' using the KMP algorithm.
    Returns a list of starting indices where 'pattern' is found in 'text'.
    Takes (text, pattern) in the same order as brute_force and boyer_moore.
    """
    M = len(pattern)
    N = len(text)
    result = []
    if M == 0:
        return result  # An empty pattern has no meaningful match positions
    lps = computeLPSArray(pattern)  # Preprocess the pattern to get the LPS array
    i = 0  # index for text
    j = 0  # index for pattern
    while i < N:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == M:
            result.append(i - j)  # Pattern found, append starting index
            j = lps[j - 1]       # Continue to search for next possible match
        elif i < N and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]   # Use LPS to avoid unnecessary comparisons
            else:
                i += 1           # Move to the next character in text
    return result
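
One property worth checking is that the search reports overlapping occurrences. The sketch below is a self-contained variant, cross-checked against Python's built-in `str.find`; the names `kmp_find_all` and `find_all_builtin` are assumptions for illustration and not the notebook's own functions.

```python
def kmp_find_all(text, pattern):
    """Return every start index of pattern in text, overlaps included (sketch)."""
    if not pattern:
        return []
    # Build the LPS table for the pattern.
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = lps[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    # Scan the text once; k is the number of pattern characters matched so far.
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = lps[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = lps[k - 1]  # keep the longest border so overlaps are found
    return matches


def find_all_builtin(text, pattern):
    """Reference answer using str.find, advancing one character at a time."""
    out, pos = [], text.find(pattern)
    while pos != -1:
        out.append(pos)
        pos = text.find(pattern, pos + 1)
    return out


print(kmp_find_all("ABABABA", "ABA"))  # [0, 2, 4]
```

Comparing against `find_all_builtin` on a few inputs is a cheap sanity check before trusting any timing numbers.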

### III. Boyer-Moore

The Boyer-Moore algorithm is a highly efficient string-searching algorithm, especially effective when the pattern is much shorter than the text. Unlike naive approaches, Boyer-Moore skips sections of the text, often resulting in faster searches. Longer patterns tend to help, because a single mismatch can then justify a shift of up to the full pattern length.

#### Key Concepts

- **Right-to-Left Matching:** The algorithm compares the pattern to the text from right to left.
- **Bad Character Rule:** On a mismatch, the pattern is shifted so that the mismatched character in the text aligns with its last occurrence in the pattern. If the character is not present in the pattern, the pattern is shifted past the mismatched character.
- **Good Suffix Rule:** If a mismatch occurs after matching a suffix, the pattern is shifted to align the next occurrence of this suffix (or a matching prefix) in the pattern.
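
The bad character rule can be illustrated in isolation. The sketch below computes the shift it suggests for a single alignment; the helper names `last_occurrence` and `bad_character_shift` and the sample strings are assumptions for illustration.

```python
def last_occurrence(pattern):
    """Map each character to the index of its rightmost occurrence in pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def bad_character_shift(pattern, text, s):
    """Shift suggested by the bad character rule at alignment s (0 means a full match)."""
    table = last_occurrence(pattern)
    j = len(pattern) - 1
    # Compare right to left until a mismatch or the start of the pattern.
    while j >= 0 and pattern[j] == text[s + j]:
        j -= 1
    if j < 0:
        return 0
    # The rule alone can suggest a non-positive shift, so clamp it to 1.
    return max(1, j - table.get(text[s + j], -1))


text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"
print(bad_character_shift(pattern, text, 0))  # 7: 'S' does not occur in the pattern
print(bad_character_shift(pattern, text, 7))  # 2: aligns text 'P' with pattern 'P'
```

In a full Boyer-Moore search this value is combined with the good suffix shift by taking the larger of the two, which is why the clamp to 1 is only needed when the rule is used on its own.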

#### Steps of the Algorithm

1. Preprocess the pattern to create:
    - A bad character table.
    - A good suffix table.
2. Start comparing the pattern with the text from the rightmost character.
3. Use the heuristics to determine the shift after a mismatch.
4. Repeat until the pattern is found or the text is fully scanned.

In [None]:
def boyer_moore(text, pattern):
    # Preprocess the pattern for the bad character rule
    def preprocess_bad_character(pattern):
        bad_char_table = {}
        for i, char in enumerate(pattern):
            bad_char_table[char] = i  # Store the last occurrence of each character
        return bad_char_table

    # Preprocess the pattern for the good suffix rule (basic version)
    def preprocess_good_suffix(pattern):
        m = len(pattern)
        good_suffix_table = [0] * (m + 1)
        border_pos = [0] * (m + 1)
        i, j = m, m + 1
        border_pos[i] = j

        while i > 0:
            # j <= m guarantees that pattern[j - 1] is a valid index
            while j <= m and pattern[i - 1] != pattern[j - 1]:
                if good_suffix_table[j] == 0:
                    good_suffix_table[j] = j - i
                j = border_pos[j]
            i -= 1
            j -= 1
            border_pos[i] = j

        j = border_pos[0]
        for i in range(m + 1):
            if good_suffix_table[i] == 0:
                good_suffix_table[i] = j
            if i == j:
                j = border_pos[j]

        return good_suffix_table

    bad_char_table = preprocess_bad_character(pattern)
    good_suffix_table = preprocess_good_suffix(pattern)

    n, m = len(text), len(pattern)
    s = 0  # s is the shift of the pattern with respect to text
    matches = []  # Collected instead of printed so I/O does not distort timings

    while s <= n - m:
        j = m - 1
        # Compare pattern from right to left
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)  # Full match at shift s
            s += good_suffix_table[0] if s + m < n else 1
        else:
            bad_char_shift = j - bad_char_table.get(text[s + j], -1)
            good_suffix_shift = good_suffix_table[j + 1]
            s += max(bad_char_shift, good_suffix_shift)

    return matches  # Starting indices of all occurrences


## 3. Description of the text document used for testing

We use several large text documents for our experiments:

- **Project Gutenberg books** (e.g., *Pride and Prejudice*, *Moby Dick*)
- **Synthetic large text** generated by repeating characters or words

Each document is preloaded and stored as a string in memory.  
Patterns are chosen to vary in length (short: 5 chars, medium: 20 chars, long: 100+ chars).
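
A minimal sketch of how a synthetic text and patterns of the three lengths could be produced is shown below. The function names, the `ACGT` alphabet, the text size, and the fixed seeds are all assumptions for illustration, not the inputs used in the experiments.

```python
import random


def make_synthetic_text(length, alphabet="ACGT", seed=0):
    """Build a reproducible random text over a small alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, k=length))


def sample_pattern(text, length, seed=0):
    """Cut a pattern of the given length out of the text, so a match is guaranteed."""
    rng = random.Random(seed)
    start = rng.randrange(len(text) - length + 1)
    return text[start:start + length]


synthetic = make_synthetic_text(100_000)
patterns_by_size = {size: sample_pattern(synthetic, size) for size in (5, 20, 100)}

for size, p in patterns_by_size.items():
    print(size, p[:20])
```

Sampling patterns from the text guarantees at least one match; a pattern containing a character outside the alphabet gives the complementary "no match" case.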

You can download a text from Project Gutenberg (or upload your own file) and load it like this:

In [None]:
with open("moby_dick.txt", "r", encoding="utf-8") as file:
    text = file.read()


## 4. Experimental results

Use Python’s time module to record runtimes:

In [None]:
import time

def measure_time(algorithm, text, pattern):
    start = time.perf_counter()
    result = algorithm(text, pattern)
    end = time.perf_counter()
    return end - start, result
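
A single measurement can be noisy, so one option is to repeat each run and keep the smallest time. The sketch below uses the standard-library `timeit` module; the helper name `best_time` and the stand-in search function are assumptions for illustration.

```python
import timeit


def best_time(algorithm, text, pattern, repeats=5):
    """Smallest wall-clock time in seconds over several runs (hypothetical helper)."""
    timer = timeit.Timer(lambda: algorithm(text, pattern))
    # number=1 runs the search once per measurement; repeat collects several measurements.
    return min(timer.repeat(repeat=repeats, number=1))


# Stand-in for one of the search functions, so the example runs on its own.
def builtin_find(text, pattern):
    return text.find(pattern)


elapsed = best_time(builtin_find, "a" * 100_000 + "b", "ab")
print(f"{elapsed:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the reported time.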


Run tests:

In [None]:
import pandas as pd

patterns = [
    "whale",
    "Call me Ishmael.",
    "ZZZUNLIKELYPATTERNZZZ",
    "Captain Ahab stood upon the quarter-deck",
    "the white whale",
]
results = []

for pattern in patterns:
    row = {
        "Pattern": pattern,
        "Brute-force": measure_time(brute_force, text, pattern)[0],
        "KMP": measure_time(KMPSearch, text, pattern)[0],
        "Boyer-Moore": measure_time(boyer_moore, text, pattern)[0]
    }
    results.append(row)

df = pd.DataFrame(results)
print(df)


Visualize the results with matplotlib:

In [None]:
import matplotlib.pyplot as plt

df.set_index("Pattern").plot(kind="bar", figsize=(10,6))
plt.ylabel("Time (seconds)")
plt.title("Comparison of Pattern Matching Algorithms")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


## 5. Conclusion and prospects

The experiments confirm the following:

- **Brute-force** is the slowest on hard inputs, especially for long texts and patterns.
- **KMP** provides consistent performance: after preprocessing the pattern, its search never re-reads a text character.
- **Boyer-Moore** is generally the fastest for longer patterns, benefiting from heuristic-based skipping.

**However**, while theory suggests that KMP and Boyer-Moore are more efficient than brute-force, we observed that brute-force was sometimes faster in practice. This is largely due to early matches in the text and the preprocessing overhead in KMP and Boyer-Moore. In harder scenarios, such as long patterns that occur near the end of the text or not at all, the advanced algorithms clearly outperformed brute-force.
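
Part of the early-match effect comes from the interfaces: `brute_force` returns at the first match, while `KMPSearch` keeps scanning to collect every occurrence. For a like-for-like timing, a brute-force variant that also scans the whole text could be used. The sketch below is one such variant; the name `brute_force_all` is an assumption and not part of the original experiments.

```python
def brute_force_all(text, pattern):
    """Naive search that reports every occurrence instead of stopping at the first."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        j = 0
        # Compare character by character, as in the single-match version.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches


print(brute_force_all("ABABABA", "ABA"))  # [0, 2, 4]
```

Timing this variant against the other two would separate the cost of the algorithms themselves from the benefit of stopping early.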

**Future prospects:**

- Implement a complete Boyer-Moore algorithm with an optimized good-suffix rule.
- Test on real-world large datasets (e.g., system logs, DNA sequences).
- Explore parallelized search or approximate matching techniques.


**By Bakwowi Junior**