Create aho_codesick.py. #14390
Closed
signore662-beep wants to merge 1 commit into TheAlgorithms:master from
Closing this pull request as invalid
@signore662-beep, this pull request is being closed because the submitted files contain an invalid extension. This repository only accepts Python algorithms. Please read the Contributing guidelines first. Invalid files in this pull request:
"""
Aho-Corasick Multi-Pattern String Matching

Builds a finite-state automaton (trie + failure links) from a set of patterns
and finds all occurrences of every pattern in a text in O(n + m + z) total
time, using a single pass over the text, where:
    n = length of the text
    m = total characters across all patterns
    z = total number of matches found

This beats the naive approach of running a single-pattern algorithm (KMP,
Z-function, Rabin-Karp) once per pattern, which costs O(n * k + m) for k
patterns.

Typical real-world uses: network intrusion detection (e.g. Snort), fixed-string
search (fgrep), antivirus signature scanning, spam filtering, DNA motif search.

References:
    Aho, A. V., & Corasick, M. J. (1975).
    "Efficient string matching: an aid to bibliographic search."
    Communications of the ACM, 18(6), 333-340.
    https://doi.org/10.1145/360825.360855
"""
from __future__ import annotations

from collections import deque
def _build_trie(
    patterns: list[str],
) -> tuple[list[dict[str, int]], list[set[str]]]:
    """
    Insert every non-empty, unique pattern into a trie.
    """


def _follow(goto_table: list[dict[str, int]], state: int, text: str) -> int:
    """
    Follow trie transitions from state, consuming every character in text.
    """


def _build_failure_links(
    goto_table: list[dict[str, int]],
    output_table: list[set[str]],
) -> list[int]:
    """
    Compute failure (suffix) links for every trie state via BFS.
    """


def build_automaton(
    patterns: list[str],
) -> tuple[list[dict[str, int]], list[set[str]], list[int]]:
    """
    Construct the complete Aho-Corasick automaton from patterns.
    """


def search(
    text: str,
    goto_table: list[dict[str, int]],
    output_table: list[set[str]],
    fail: list[int],
) -> dict[str, list[int]]:
    """
    Find all pattern occurrences in text using a pre-built automaton.
    """


def search_all(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    One-shot convenience wrapper: build automaton from patterns, search text.
    """


if __name__ == "__main__":
    import doctest
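
The diff as shown contains only the signatures and the first docstring line of each function, so the following is a minimal sketch of how functions with these signatures might be implemented. The bodies are an assumption and are not the submitted code. In particular, reporting start indices (rather than end indices), skipping empty patterns, and merging output sets along failure links during the BFS are illustrative choices, and the trie and failure-link steps are folded into `build_automaton` for brevity.

```python
from __future__ import annotations

from collections import deque


def build_automaton(
    patterns: list[str],
) -> tuple[list[dict[str, int]], list[set[str]], list[int]]:
    # State 0 is the root; goto_table[s] maps a character to a child state.
    goto_table: list[dict[str, int]] = [{}]
    output_table: list[set[str]] = [set()]
    for pattern in patterns:
        if not pattern:  # assumption: empty patterns are ignored
            continue
        state = 0
        for char in pattern:
            if char not in goto_table[state]:
                goto_table.append({})
                output_table.append(set())
                goto_table[state][char] = len(goto_table) - 1
            state = goto_table[state][char]
        output_table[state].add(pattern)

    # BFS by depth: fail[s] is the state for the longest proper suffix of
    # the path to s that is also a path from the root.
    fail = [0] * len(goto_table)
    queue = deque(goto_table[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for char, child in goto_table[state].items():
            fallback = fail[state]
            while fallback and char not in goto_table[fallback]:
                fallback = fail[fallback]
            fail[child] = goto_table[fallback].get(char, 0)
            # Inherit every pattern that ends at the suffix state.
            output_table[child] |= output_table[fail[child]]
            queue.append(child)
    return goto_table, output_table, fail


def search(
    text: str,
    goto_table: list[dict[str, int]],
    output_table: list[set[str]],
    fail: list[int],
) -> dict[str, list[int]]:
    # Maps each matched pattern to the start indices of its occurrences.
    matches: dict[str, list[int]] = {}
    state = 0
    for index, char in enumerate(text):
        # Fall back along failure links until a transition on char exists.
        while state and char not in goto_table[state]:
            state = fail[state]
        state = goto_table[state].get(char, 0)
        for pattern in output_table[state]:
            matches.setdefault(pattern, []).append(index - len(pattern) + 1)
    return matches


def search_all(text: str, patterns: list[str]) -> dict[str, list[int]]:
    return search(text, *build_automaton(patterns))
```

With the patterns `["he", "she", "his", "hers"]` and the text `"ushers"`, this sketch reports `"she"` at index 1 and both `"he"` and `"hers"` at index 2; `"his"` does not occur and is absent from the result.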