# Aho-Corasick Algorithm for Pattern Matching
The Aho-Corasick algorithm finds all occurrences of a set of keywords in a text in a single pass. It builds a Trie (prefix tree) from the keywords and adds failure links so that the search never has to back up in the text. The running time is linear in the length of the text plus the total length of the keywords plus the number of matches reported.

## Key Concepts:
- **Trie Construction**: A tree structure built for all the input patterns (keywords).
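- **Failure Links**: On a mismatch, these links let the search fall back to the state for the longest suffix of the text matched so far that is still a prefix of some keyword, instead of restarting at the root.

In this notebook, we'll break down the algorithm step by step and implement it in Scala.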
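
As a point of reference, the task itself can be stated with a brute-force baseline that scans the text once per keyword. The sketch below is not part of the Aho-Corasick algorithm; `naiveSearch` is a hypothetical helper name, and reporting the 1-based position of the last matched character is an assumption chosen to match the convention used by `PMM` later in this notebook.

```scala
// Brute-force baseline (hypothetical helper, not part of Aho-Corasick):
// try every keyword at every start offset of the text.
def naiveSearch(text: String, keywords: List[String]): List[(Int, String)] = {
  val hits = for {
    kw <- keywords if kw.nonEmpty
    start <- 0 to text.length - kw.length // empty range if kw is longer than text
    if text.startsWith(kw, start)
  } yield (start + kw.length, kw) // 1-based position of the last matched character
  hits.sortBy(_._1) // stable sort: ties keep the keyword order
}

println(naiveSearch("ushers", List("he", "she", "his", "hers")))
// List((4,he), (4,she), (6,hers))
```

The baseline does work proportional to the text length times the number of keywords, which is the cost the Trie and failure links are meant to avoid. It is still useful for cross-checking a faster matcher, comparing results with order set aside.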

## Step 1: Define the State Case Class
Each node in the Trie corresponds to a state. A state has the following properties:
- **ID**: A unique identifier for the state.
- **Successors**: A map of characters leading to other states (the transitions).
- **EndState**: A boolean flag to mark whether this state corresponds to the end of a keyword.
- **Keyword**: An optional field to store the actual keyword when the state is an end state.

Let's define a `State` class to represent a node in our Trie.

In [2]:
import scala.collection.mutable.Queue
case class State(ID: Int, Successor: Map[String, Int], endState: Boolean, keyword: Option[String] = None)

import scala.collection.mutable.Queue

defined class State

## Step 2: Build the Trie
The next step in the Aho-Corasick algorithm is constructing the Trie. For each keyword, we traverse the Trie, creating new states as needed, and linking them accordingly. The last state of each keyword will be marked as an 'end state' to signify the completion of a keyword.
We will now define a function `buildTrie` to build the Trie from a list of keywords.

In [3]:
def buildTrie(keywords: List[String]): Map[Int, State] = {
  var nextID = 0
  var states = Map[Int, State](nextID -> State(nextID, Map(), endState = false)) // Root state

  for (keyword <- keywords) {
    var currentStateID = 0 // Start at the root state

    for (char <- keyword) {
      val currentState = states(currentStateID)

      // Check if there's already a state for this character, else create a new state
      val nextStateID = currentState.Successor.getOrElse(char.toString, {
        nextID += 1
        nextID
      })

      // Update the current state to include the new successor
      states = states.updated(currentStateID, currentState.copy(Successor = currentState.Successor + (char.toString -> nextStateID)))

      // Add the new state if it doesn't exist
      if (!states.contains(nextStateID)) {
        states += nextStateID -> State(nextStateID, Map(), endState = false)
      }

      // Move to the next state
      currentStateID = nextStateID
    }

    // Mark the last state of the keyword as an end state
    val finalState = states(currentStateID)
    states += currentStateID -> finalState.copy(endState = true, keyword = Some(keyword))
  }

  states
}

defined function buildTrie
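
One way to sanity-check the construction: every state stands for one distinct prefix of some keyword, with the root standing for the empty prefix. The sketch below predicts the number of states from the keywords alone; `expectedStateCount` is a hypothetical helper name introduced here for illustration.

```scala
// Hypothetical helper: the trie has exactly one state per distinct keyword prefix,
// counting the empty prefix for the root.
def expectedStateCount(keywords: List[String]): Int = {
  val prefixes = keywords.flatMap(kw => (0 to kw.length).map(n => kw.take(n))).toSet
  (prefixes + "").size // the root exists even when there are no keywords
}

println(expectedStateCount(List("hers", "she", "his")))
// 10: "", h, he, her, hers, s, sh, she, hi, his
```

For the same keyword list, the size of the map returned by `buildTrie` should agree with this count, because shared prefixes such as `h` in `hers` and `his` reuse one state.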

## Step 3: Visualize the Trie
To better understand the structure of our Trie, we can visualize it. The following function `printGraph` will display each state and its successors, as well as whether it is an end state and which keyword it represents (if applicable).

In [3]:
def printGraph(states: Map[Int, State]): Unit = {
  // Print states in ID order; a plain Map does not guarantee iteration order
  states.toSeq.sortBy(_._1).foreach { case (id, state) =>
    val keywordStr = state.keyword match {
      case Some(kw) => s", Keyword = $kw"
      case None => ""
    }
    println(s"State $id: Successors = ${state.Successor}, End State = ${state.endState}$keywordStr")
  }
}

val keywords = List("hers", "she", "his")
printGraph(buildTrie(keywords))

## Step 4: Compute Failure Links
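Now we need to compute the failure links, which tell the algorithm where to continue when the current state has no transition for the next character. The failure link of a state points to the state for the longest proper suffix of the string spelled out on the path from the root to that state, provided that suffix is also a prefix of some keyword. If no such suffix exists, the link points to the root.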
We will now define the `computeFail` function to compute these failure links for all states.

In [4]:
def computeFail(states: Map[Int, State]): Map[Int, Int] = {
  // Initialize the fail function, setting all states to fail to the root (state 0)
  var fail = Map[Int, Int]().withDefaultValue(0)

  // Queue for Breadth-First Search (BFS) traversal of the automaton
  val queue = scala.collection.mutable.Queue[Int]()

  // Set failure links for root's direct successors
  for ((input, stateID) <- states(0).Successor) {
    fail += stateID -> 0 // Direct successors of root point back to root
    queue.enqueue(stateID) // Enqueue each direct successor for BFS processing
  }

  // BFS to compute failure links for the remaining states
  while (queue.nonEmpty) {
    val currentStateID = queue.dequeue() // Dequeue the current state for processing
    val currentState = states(currentStateID)

    // Process each transition from the current state
    for ((input, successorID) <- currentState.Successor) {
      queue.enqueue(successorID) // Enqueue the successor state for further processing

      // Start fallback resolution from the failure state of the current state
      var fallbackID = fail(currentStateID)
      while (fallbackID != 0 && !states(fallbackID).Successor.contains(input)) {
        fallbackID = fail(fallbackID) // Move to the next fallback state
      }

      // Update the fail link of the successor state
      val fallbackSuccessorID = states(fallbackID).Successor.getOrElse(input, 0)
      fail += successorID -> fallbackSuccessorID
    }
  }

  fail // Return the computed failure function as a map
}

defined function computeFail
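
To see what the BFS is computing, it can help to spell out the definition directly on strings rather than state IDs. The sketch below is a brute-force illustration only; `failTarget` is a hypothetical name, and each state is identified by the string that leads to it from the root.

```scala
// Hypothetical helper: the failure target of a state, identified by its path string,
// is the longest proper suffix of that string that is also a prefix of some keyword.
def failTarget(stateString: String, keywords: List[String]): String = {
  val prefixes = keywords.flatMap(kw => (0 to kw.length).map(n => kw.take(n))).toSet
  // Proper suffixes from longest to shortest; fall back to "" (the root) if none fits
  (1 to stateString.length)
    .map(n => stateString.drop(n))
    .find(s => prefixes.contains(s))
    .getOrElse("")
}

val sampleKeywords = List("he", "she", "his", "hers")
println(failTarget("she", sampleKeywords))  // he
println(failTarget("hers", sampleKeywords)) // s
println(failTarget("h", sampleKeywords))    // empty string, i.e. the root
```

`computeFail` reaches the same targets without comparing strings: it reuses the failure link of the parent state, which is why the states have to be processed in breadth-first order.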

## Step 5: Goto Function

The Aho-Corasick algorithm relies on the **Goto function**, which determines the next state from the current state and the character being processed. If the current state has a transition for the character, the Goto function returns the ID of the successor state. If it has none and the current state is the root, the automaton stays at the root and the function returns 0.

In any other state, a missing transition makes the function return -1. This signals that the caller has to follow the failure link and try the same character again from the fallback state.


In [4]:
// Standalone version for illustration: the automaton and the current state are passed in explicitly.
// Returns the successor's ID, 0 if the root has no transition for `input`,
// or -1 to tell the caller to follow the failure link.
def goto(states: Map[Int, State], currentStateID: Int, input: String): Int = {
  states.get(currentStateID) match {
    case Some(currentState) =>
      currentState.Successor.get(input) match {
        case Some(nextStateID) => nextStateID
        case None => if (currentStateID == 0) 0 else -1
      }
    case None =>
      throw new IllegalArgumentException(s"State $currentStateID does not exist")
  }
}
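
Taken together, a goto step and the failure links it may trigger have a simple effect: after each character, the automaton sits in the state for the longest suffix of the text read so far that is a prefix of some keyword. The sketch below expresses that combined step on strings instead of state IDs. It is an illustration of the invariant rather than an efficient implementation, and `nextState` is a hypothetical name.

```scala
// Hypothetical helper: one combined goto-plus-failure step, with states named by their path strings.
def nextState(state: String, c: Char, keywords: List[String]): String = {
  val prefixes = keywords.flatMap(kw => (0 to kw.length).map(n => kw.take(n))).toSet + ""
  val candidate = state + c
  // Longest suffix of state + c that is a keyword prefix; "" always qualifies, so get is safe
  (0 to candidate.length).map(n => candidate.drop(n)).find(s => prefixes.contains(s)).get
}

val sampleKeywords = List("he", "she", "his", "hers")
val trace = "ushers".toList.scanLeft("")((state, c) => nextState(state, c, sampleKeywords))
println(trace)
// List(, , s, sh, she, her, hers)
```

The step from `she` to `her` on the character `r` is the interesting one: `she` has no transition for `r`, so the automaton falls back to `he` and takes the `r` transition from there.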

## Step 6: Connecting the Pieces

In this step, we combine all the previously implemented functions into a single class: `FiniteStateMachine`. This class is responsible for handling the search process using the Aho-Corasick algorithm. It utilizes the following components:

- **Trie construction** (`buildTrie`): Builds a Trie from the list of keywords.
- **Goto function** (`goto`): Determines the next state based on the current state and input character.
- **Failure function** (`computeFail`): Computes the fail links for the Trie, allowing the search to continue efficiently even when a mismatch occurs.
  
The class `FiniteStateMachine` has a method called `PMM()` (Pattern Matching Method) that performs the actual pattern matching of the keywords against the given search text. It tracks the current state in the FSM and, when an end state is reached, it outputs the matched keyword along with the position of the last character.

The key functions in this class are:
1. `getCurrentStateID`: A getter for the current state ID.
2. `PMM`: Performs the pattern matching and returns a list of pairs containing the 1-based position of the last character of a matched keyword and the keyword itself.
3. `fail`: Retrieves the fail link for the current state.

This class is an integrated solution that uses the previous building blocks to search for multiple keywords in a text efficiently.

Here is the code for the `FiniteStateMachine`:

In [4]:

class FiniteStateMachine(SearchText: String, Keywords: List[String]) {

  private val states: Map[Int, State] = buildTrie(Keywords) // This map represents the finite state machine
  private val text: String = SearchText
  private val keywords: List[String] = Keywords
  private val fails: Map[Int, Int] = computeFail(states)

  private var currentStateID: Int = 0 // The start state is always 0 (the root)

  def getCurrentStateID: Int = currentStateID // simple getter for the current ID

  /**
   * Performs the pattern matching of the keywords against the search text.
   * @return A list of pairs (1-based position of the last character of a match, matched keyword)
   */
  def PMM(): List[(Int, String)] = {
    var Output: List[(Int, String)] = List()
    currentStateID = 0 // Restart at the root so repeated calls give the same result
    if (text.nonEmpty && keywords.nonEmpty) {
      for (index <- 0 until text.length) {
        val charPos = index + 1
        val input = text.charAt(index).toString

        // Follow failure links until a transition for this character exists.
        // goto never returns -1 at the root, so the loop terminates.
        var gotoOutput = goto(input)
        while (gotoOutput == -1) {
          currentStateID = fail(currentStateID)
          gotoOutput = goto(input)
        }
        currentStateID = gotoOutput

        // Report the keyword ending at this state and at every state on its failure chain
        // (for example, reaching "she" also completes "he").
        var outputStateID = currentStateID
        while (outputStateID != 0) {
          val outputState = states(outputStateID)
          if (outputState.endState) {
            Output = Output :+ ((charPos, outputState.keyword.get))
          }
          outputStateID = fail(outputStateID)
        }
      }
    }
    Output
  }

  /**
   * @return the ID of the successor for `input`, 0 if the root has no such transition,
   *         or -1 if PMM has to follow the failure link first
   */
  def goto(input: String): Int = {
    states.get(currentStateID) match {
      case Some(currentState) =>
        currentState.Successor.get(input) match {
          case Some(nextStateID) => nextStateID
          case None => if (currentStateID == 0) 0 else -1
        }
      case None => throw new Exception(s"This State ID: $currentStateID does not exist!")
    }
  }

  def fail(stateID: Int): Int = { // actual magic happens in computeFail
    fails(stateID) // Return the fail link for the given state
  }
}

## Conclusion
The Aho-Corasick algorithm searches for multiple keywords in a given text in one pass. By building a Trie and computing failure links, the search does not have to restart from the root or re-read earlier characters when a mismatch occurs, so the text is scanned once no matter how many keywords there are.
We have implemented the algorithm step by step. The final cell runs it on sample data; it depends on the `FiniteStateMachine` cell having been executed successfully, otherwise the kernel reports that the type is not found.

In [4]:
// Sample keywords and search text
val keywords = List("he", "she", "his", "hers")
val searchText = "ushers"



// Create the FiniteStateMachine instance
val fsm = new FiniteStateMachine(searchText, keywords)

// Run the pattern matching method
val result = fsm.PMM()

// Print the result
println("Matched Keywords:")
result.foreach { case (index, keyword) => 
  println(s"Keyword: $keyword, Position: $index")
}


-- [E006] Not Found Error: cmd5.sc:7:14 ----------------------------------------
7 |val fsm = new FiniteStateMachine(searchText, keywords)
  |              ^^^^^^^^^^^^^^^^^^
  |              Not found: type FiniteStateMachine
  |
  | longer explanation available when compiling with `-explain`
Compilation Failed