### Using FSMs to Lex in Project 1 ###

Let's implement the following FSM.

[![colon-dash FSM](colon-dash_fsm.png)](https://www.dropbox.com/scl/fi/fzyo4nqfty84hpjmdefnz/colon-dash_fsm.png?rlkey=c8xn514yhua8s2px1nn0t8ytj&dl=0)



First, notice what is output by the FSM: some outputs are Token types and others use the None type in Python. We'll have to deal with this when we define the class.

Second, let's discuss the notation for the transitions. The input set $I$ is the set of all legal characters. A symbol inside quotation marks means that the transition occurs if that single character is read. Thus, ':' means transition from state $s_0$ to state $s_1$ when the colon character is read. When the input set symbol $I$ is used on a transition (see the self loop on state $s_{\rm err}$), the transition occurs for any input character. Finally, when the transition is labeled with $I-\{c\}$, the transition occurs for any character in the set produced by the set difference operator. In other words, the transition occurs on any character except $c$. For example, $I-\{':'\}$ means the transition occurs on any character except the colon character.
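
The three kinds of transition labels can be mirrored with Python sets. The sketch below assumes, purely for illustration, that the input set $I$ is every printable ASCII character; the project's actual set of legal characters may differ, and the helper name `fires_on` is hypothetical.

```python
import string

# Assumption for illustration only: every printable ASCII character is a legal input.
input_set = set(string.printable)  # plays the role of I


def fires_on(label: set[str], symbol: str) -> bool:
    """Return True if a transition labeled with this set of characters occurs on symbol."""
    return symbol in label


colon_only = {":"}                 # the label ':'
any_character = input_set          # the label I
all_but_colon = input_set - {":"}  # the label I - {':'}

print(fires_on(colon_only, ":"))     # True
print(fires_on(any_character, "x"))  # True
print(fires_on(all_but_colon, ":"))  # False
print(fires_on(all_but_colon, "-"))  # True
```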

Stated simply, this FSM outputs  the _colon-dash_ token when it knows that it has seen the ":-" pattern and otherwise outputs nothing.  We can implement it, but first we'll copy and paste the token.py file from the starter code for Project 1.


The code below defines TokenType, a Literal type that lists the token types from the Project 1 description. It also defines the Token class, which creates the data structure seen in the slides: a token type, a token value, and a line number.

In [14]:
############
## Cell 1 ##
############

from typing import Literal, Any

TokenType = Literal[
    "COLON",
    "COLON_DASH",
    "COMMA",
    "COMMENT",
    "UNDEFINED",
    "EOF",
    "FACTS",
    "ID",
    "LEFT_PAREN",
    "PERIOD",
    "QUERIES",
    "Q_MARK",
    "RIGHT_PAREN",
    "RULES",
    "SCHEMES",
    "STRING",
    "WHITESPACE",
]


class Token:
    __slots__ = ["token_type", "value", "line_num"]

    def __init__(self, token_type: TokenType, value: str, line_num: int = 0):
        self.token_type: TokenType = token_type
        self.value: str = value
        self.line_num: int = line_num

    def __str__(self) -> str:
        return (
            "(" + self.token_type + ',"' + self.value + '",' + str(self.line_num) + ")"
        )

    def __eq__(self, other: Any) -> bool:
        if isinstance(other, Token):
            return (
                self.token_type == other.token_type
                and self.value == other.value
                and self.line_num == other.line_num
            )
        return False

    @staticmethod
    def colon(value: Literal[":"]) -> "Token":
        return Token("COLON", value)

    @staticmethod
    def colon_dash(value: Literal[":-"]) -> "Token":
        return Token("COLON_DASH", value)

    @staticmethod
    def comma(value: Literal[","]) -> "Token":
        return Token("COMMA", value)

    @staticmethod
    def comment(value: str) -> "Token":
        return Token("COMMENT", value)

    @staticmethod
    def undefined(value: str) -> "Token":
        return Token("UNDEFINED", value)

    @staticmethod
    def eof(value: Literal[""]) -> "Token":
        return Token("EOF", value)

    @staticmethod
    def facts(value: Literal["Facts"]) -> "Token":
        return Token("FACTS", value)

    @staticmethod
    def id(value: str) -> "Token":
        return Token("ID", value)

    @staticmethod
    def left_paren(value: Literal["("]) -> "Token":
        return Token("LEFT_PAREN", value)

    @staticmethod
    def period(value: Literal["."]) -> "Token":
        return Token("PERIOD", value)

    @staticmethod
    def queries(value: Literal["Queries"]) -> "Token":
        return Token("QUERIES", value)

    @staticmethod
    def q_mark(value: Literal["?"]) -> "Token":
        return Token("Q_MARK", value)

    @staticmethod
    def right_paren(value: Literal[")"]) -> "Token":
        return Token("RIGHT_PAREN", value)

    @staticmethod
    def rules(value: Literal["Rules"]) -> "Token":
        return Token("RULES", value)

    @staticmethod
    def schemes(value: Literal["Schemes"]) -> "Token":
        return Token("SCHEMES", value)

    @staticmethod
    def string(value: str) -> "Token":
        return Token("STRING", value)

    @staticmethod
    def whitespace(value: str) -> "Token":
        for i in value:
            assert i == " " or i == "\t" or i == "\n" or i == "\r"
        return Token("WHITESPACE", value)

Let's now create the FSM class.

In [15]:
############
## Cell 2 ##
############

from typing import Callable as function

# Define a type for State
State = function[[str], tuple[function, Token | None]]  # The output may be a Token or None

class ColonDashFSM:
    def __init__(self) -> None:
        self.start_state: State = self.s_0
        self.terminating_states: set[State] = {self.s_acc, self.s_rej}

    ######################################
    ## Define a Function for each state ##
    ######################################
    def s_0(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        match input_symbol:
            case ":": # Colon
                return (self.s_1, None)
            case _: # Anything else
                return (self.s_rej, None)
    
    def s_1(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        match input_symbol:
            case "-": # Dash
                return (self.s_acc, Token.colon_dash(":-"))
            case _: # Anything else
                return (self.s_rej, None)
    
    def s_rej(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_rej, None)

    def s_acc(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_acc, None)

We'll talk about what the terminating\_states attribute does in class.

Note that each state function returns a next state, which is denoted by the "State" in the return type "tuple[State, Token | None]". Each state function also returns an output that is either None or a Token, which is denoted by the "Token | None" in the return type.
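
To see the (next state, output) pairs concretely, the sketch below steps through the colon-dash machine one symbol at a time. It is a simplified stand-in and not the class above: the states are plain functions, the output is the string "COLON_DASH" instead of a Token so that the snippet runs on its own, and the helper `trace` is hypothetical.

```python
from typing import Callable, Optional

# Each state maps an input symbol to (next state, output).
StepResult = tuple[Callable, Optional[str]]


def s_0(symbol: str) -> StepResult:
    return (s_1, None) if symbol == ":" else (s_rej, None)


def s_1(symbol: str) -> StepResult:
    return (s_acc, "COLON_DASH") if symbol == "-" else (s_rej, None)


def s_acc(symbol: str) -> StepResult:
    return (s_acc, None)


def s_rej(symbol: str) -> StepResult:
    return (s_rej, None)


def trace(input_sequence: str) -> list[tuple[str, Optional[str]]]:
    """Record the name of the state reached and the output after each symbol."""
    state = s_0
    steps = []
    for symbol in input_sequence:
        state, output = state(symbol)
        steps.append((state.__name__, output))
    return steps


print(trace(":-"))  # [('s_1', None), ('s_acc', 'COLON_DASH')]
print(trace(":x"))  # [('s_1', None), ('s_rej', None)]
```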

In [16]:
#############################################
## Define a function that runs the machine ##
#############################################
def run(input_sequence: str, fsm: ColonDashFSM) -> Token | None:
    # Use a C++ style declaration of local variables
    present_state: State = fsm.start_state
    next_state: State
    token: Token | None  # None when the machine produces no output

    # We'll talk about what this line does when we discuss the whitespace FSM
    input_sequence += "#"

    # Step through each input symbol
    for symbol in list(input_sequence):
        # Ask the present state what the next state and output should be
        next_state, token = present_state(symbol)
        #print(f"present state = {present_state}, symbol = {symbol}, next_state = {next_state}")
        # Update the state
        present_state = next_state
        # Don't keep running if you know you've succeeded or failed
        if present_state in fsm.terminating_states: 
            break
        
    return token

In [17]:
fsm: ColonDashFSM = ColonDashFSM()
def test_fsm(fsm):
    assert run(":-", fsm).token_type == "COLON_DASH", "\':-\' failed"
    assert run(":", fsm) is None, ": failed"
    assert run(":---", fsm).token_type == "COLON_DASH", "\':---\' failed"
    assert run("-:", fsm) is None, "\'-:\' failed"
    print("All tests of the colon-dash fsms succeeded")

test_fsm(fsm)

All tests of the colon-dash fsms succeeded


---

### Inheritance ###

Since we are going to create a lot of FSMs, let's use inheritance to provide a consistent interface for the _run_ method. The base class should specify required elements of all FSMs.

In [18]:
############
## Cell 3 ##
############

class FiniteStateMachine:
    def __init__(self, start_state: State, terminating_states: set[State]) -> None:
        self.start_state = start_state
        self.terminating_states = terminating_states # To be explained later

    # The starter code has a few more elements in the base class

Let's now redo the colon-dash FSM so that it inherits from the FiniteStateMachine base class.

In [19]:
############
## Cell 4 ##
############

class ColonDashFSM(FiniteStateMachine):
    def __init__(self) -> None:
        start_state: State = self.s_0
        terminating_states: set[State] = {self.s_acc, self.s_rej}
        super().__init__(start_state, terminating_states)

    ######################################
    ## Define a Function for each state ##
    ######################################
    def s_0(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        match input_symbol:
            case ":": # Colon
                return (self.s_1, None)
            case _: # Anything else
                return (self.s_rej, None)
    
    def s_1(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        match input_symbol:
            case "-": # Dash
                return (self.s_acc, Token.colon_dash(":-"))
            case _: # Anything else
                return (self.s_rej, None)
    
    def s_rej(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_rej, None)

    def s_acc(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_acc, None)

We won't see a lot of benefit from inheritance yet, but hang on. A small benefit is that we can change the _run_ function to run on any FSM we design as long as the FSM inherits from the base class. The only difference is the type of the fsm argument. Because the colon-dash FSM class inherits from the FiniteStateMachine class, we can list the type of the argument as _FiniteStateMachine_ in the _run_ function.

In [20]:
############
## Cell 5 ##
############

#############################################
## Define a function that runs the machine ##
#############################################
del run
def run(input_sequence: str, fsm: FiniteStateMachine) -> Token | None:
    # Use a C++ style declaration of local variables
    present_state: State = fsm.start_state
    next_state: State
    token: Token | None  # None when the machine produces no output

    # We'll talk about what this line does when we discuss the whitespace FSM
    input_sequence += "#"

    # Step through each input symbol
    for symbol in list(input_sequence):
        # Ask the present state what the next state and output should be
        next_state, token = present_state(symbol)
        # Update the state
        present_state = next_state
        # Don't keep running if you know you've succeeded or failed
        if present_state in fsm.terminating_states: 
            break
        
    return token

And test

In [21]:
def test_colondash_fsm() -> None:  # The FSM is created inside the test, so no argument is needed
    fsm: ColonDashFSM = ColonDashFSM()
    assert run(":-", fsm).token_type == "COLON_DASH", "\':-\' failed"
    assert run(":", fsm) is None, ": failed"
    assert run(":---", fsm).token_type == "COLON_DASH", "\':---\' failed"
    assert run("-:", fsm) is None, "\'-:\' failed"
    print("All tests of the colon-dash fsms succeeded")

test_colondash_fsm()

All tests of the colon-dash fsms succeeded
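
A small benefit of the base class is that one _run_ function can drive more than one machine. The sketch below shows the idea with simplified stand-ins so that it runs on its own: `CommaFSM` is a hypothetical machine that is not part of the starter code, and the outputs are plain strings instead of Token objects.

```python
from typing import Callable, Optional


class FiniteStateMachine:
    def __init__(self, start_state: Callable, terminating_states: set) -> None:
        self.start_state = start_state
        self.terminating_states = terminating_states


class CommaFSM(FiniteStateMachine):  # hypothetical machine, for illustration only
    def __init__(self) -> None:
        super().__init__(self.s_0, {self.s_acc, self.s_rej})

    def s_0(self, symbol: str):
        return (self.s_acc, "COMMA") if symbol == "," else (self.s_rej, None)

    def s_acc(self, symbol: str):
        return (self.s_acc, None)

    def s_rej(self, symbol: str):
        return (self.s_rej, None)


def run(input_sequence: str, fsm: FiniteStateMachine) -> Optional[str]:
    """Works for any subclass because it only touches base-class attributes."""
    state = fsm.start_state
    output = None
    for symbol in input_sequence + "#":
        state, output = state(symbol)
        if state in fsm.terminating_states:
            break
    return output


print(run(",", CommaFSM()))  # COMMA
print(run("x", CommaFSM()))  # None
```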


---

Let's repeat the exercise but with a FSM that counts whitespace. Recall that whitespace includes spaces (' '), tabs ('\\t'), carriage returns ('\\r'), and newlines ('\\n'). Define the set $W = ${' ', '\\t', '\\r', '\\n'}.

We can implement the following FSM that counts the whitespaces until it finds a character that is not a whitespace. Let's try to implement it in the same way as the previous machine.

[![whitespace FSM](whitespace-token_fsm.png)](https://www.dropbox.com/scl/fi/4vqh101flb1ajvjr204jw/whitespace-token_fsm.png?rlkey=h2pfjjz8tacr4zsip0d7octnd&dl=0)

The class for this whitespace FSM follows the same pattern as the colon-dash FSM.

In [22]:
class WhitespaceFSM(FiniteStateMachine):
    def __init__(self) -> None:
        start_state: State = self.s_0
        terminating_states: set[State] = {self.s_acc, self.s_rej}
        super().__init__(start_state, terminating_states)

    ######################################
    ## Define a Function for each state ##
    ######################################
    def s_0(self, input_symbol: str) -> tuple[State, Token | None]:
        if input_symbol in {' ', '\t', '\r', '\n'}:
            return (self.s_1, None)
        else:
            return (self.s_rej, None)
    
    def s_1(self, input_symbol: str) -> tuple[State, Token | None]:
        if input_symbol in {' ', '\t', '\r', '\n'}:
            return (self.s_1, None)
        else:
            return (self.s_acc, Token.whitespace(" ")) # This is where the problem is
    
    def s_rej(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_rej, None)

    def s_acc(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_acc, None)

The whitespace FSM is very similar to the colon-dash FSM. There are two problems.
 - __Problem 1:__ The transition from $s_1$ to $s_{\rm acc}$ says to output a whitespace token; see line 20. The problem is that in order to create a token we must not only specify the token type but also the token value. In this case, the token value is the sequence of whitespace characters encountered. How will the FSM know this?
 - __Problem 2:__ The transition from $s_1$ to $s_{\rm acc}$ must read a character. What happens if the whitespace occurs at the end of the string so that there are no more characters to read? We'd still be in state $s_1$ (see line 18) and never get a chance to output the whitespace token.

Solving the second problem is easy: concatenate an extra symbol to the end of the string so that there is always one more character to read. I'll concatenate a #. This is done by the line `input_sequence += "#"` in the _run_ function.
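
The effect of the extra symbol can be seen in a small sketch. The function `final_state` is hypothetical and only mimics the transitions of the whitespace machine; it reports which state the machine is in when the input runs out.

```python
WHITESPACE = {" ", "\t", "\r", "\n"}


def final_state(input_sequence: str) -> str:
    """Return the name of the state the whitespace machine is in when the input runs out."""
    state = "s_0"
    for symbol in input_sequence:
        if state == "s_0":
            state = "s_1" if symbol in WHITESPACE else "s_rej"
        elif state == "s_1":
            state = "s_1" if symbol in WHITESPACE else "s_acc"
        # s_acc and s_rej loop back to themselves on every symbol
    return state


print(final_state("  "))        # s_1: the input ended before the token could be output
print(final_state("  " + "#"))  # s_acc: the extra symbol triggers the final transition
```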

Solving the first problem is harder. The value of the token should equal the sequence of whitespace characters read. Unfortunately, because of the way the FSM is defined, only a space is returned. Let's confirm this by writing tests for the whitespace fsm.

In [23]:
def test_whitespace_fsm():
    fsm: WhitespaceFSM = WhitespaceFSM()
    # Test that None is returned if string doesn't start with whitespace
    assert run(":", fsm) is None, ": type failed"
    print("Test that whitespace fsm returns None if string does not begin with whitespace char succeeded")

    # Test for correct token types
    assert run(" ", fsm).token_type == "WHITESPACE", "' ' type failed"
    assert run(" \n\t\r", fsm).token_type == "WHITESPACE", "' \\n\\t\\r' type failed"
    assert run(" \n\t hi \n\t", fsm).token_type == "WHITESPACE", "' \\n\\t hi \\n\\t' type failed"
    print("All tests of the whitespace token types succeeded")

    # Test for correct token values
    assert run(" ", fsm).value == " ", "' ' value failed"
    assert run(" \n\t\r", fsm).value == " \n\t\r", "' \\n\\t\\r' value failed"
    assert run(" \n\t hi \n\t", fsm).value == " \n\t ", "' \\n\\t hi \\n\\t' value failed"
    print("All tests of the whitespace fsm succeeded")

And now run the tests

In [24]:
test_whitespace_fsm()

Test that whitespace fsm returns None if string does not begin with whitespace char succeeded
All tests of the whitespace token types succeeded


AssertionError: ' \n\t\r' value failed

Each test of the token type succeeds, and so does the value test on a single space, but the value test fails as soon as the input contains more than one whitespace character. The assertion error stops the test function at that first failure. The problem is that the FSM above cannot "remember" which characters it has seen. Let's try to solve this by (incorrectly) adding an instance attribute to which we write the sequence of characters.

In [25]:
del WhitespaceFSM
class WhitespaceFSM(FiniteStateMachine):
    def __init__(self) -> None:
        start_state: State = self.s_0
        terminating_states: set[State] = {self.s_acc, self.s_rej}
        super().__init__(start_state, terminating_states)
        self.characters_seen: str

    ######################################
    ## Define a Function for each state ##
    ######################################
    def s_0(self, input_symbol: str) -> tuple[State, Token | None]:
        self.characters_seen = "" # Initialize the character memory to the empty string
        if input_symbol in {' ', '\t', '\r', '\n'}:
            self.characters_seen += input_symbol
            return (self.s_1, None)
        else:
            return (self.s_rej, None)
    
    def s_1(self, input_symbol: str) -> tuple[State, Token | None]:
        if input_symbol in {' ', '\t', '\r', '\n'}:
            self.characters_seen += input_symbol
            return (self.s_1, None)
        else:
            return (self.s_acc, Token.whitespace(self.characters_seen)) # This is where the problem is
    
    def s_rej(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_rej, None)

    def s_acc(self, input_symbol: str) -> tuple[State, Token | None]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_acc, None)

Notice the five changed lines in this definition of the FSM compared with the previous definition.
- line 7 declares an instance attribute that remembers what characters have been read as a string
- line 13 initializes the sequence of characters read to the empty string before anything else is done.
- lines 15 and 22 add the character read to the end of the sequence of characters read so far
- line 25 passes the string to the token function

Let's test!


In [26]:
test_whitespace_fsm()

Test that whitespace fsm returns None if string does not begin with whitespace char succeeded
All tests of the whitespace token types succeeded
All tests of the whitespace fsm succeeded


This seems like a good result since all tests passed, but there is a problem. Specifically, the machine we created is no longer a finite state machine. Recall the elements in the definition of a FSM:
- set of states
- set of inputs
- set of outputs
- start state
- transition function
- output function

There isn't anything in the definition that allows the machine to remember what inputs have been seen. In fact, the only form of memory in a FSM is the state. This limited form of FSM memory is important. Indeed, limiting memory to "what state am I in" is both the strength and the weakness of the FSM. It is the strength of the FSM because it means we can keep these machines simple. It is the weakness because it means there are certain kinds of patterns that an FSM can't detect.

Fortunately, a sequence of whitespace characters is a pattern that can be detected by the FSM. We just need to be a bit more clever in how we define the FSM.

---

We can create a FSM that we can use to recognize the whitespace pattern. There are a couple of tricks to understanding this new FSM.
- First, the FSM won't create and output the token. Instead, it gives instructions to the world.
- Second, the instructions to the world say things like
 - The string starts with a whitespace character, so if you are counting then you should set the number of whitespace characters read to one. See the transition from $s_0$ to $s_1$.
 - I've successfully read another whitespace character, so if you are counting then you should increment the number of whitespace characters that have been read. See the self loop on state $s_1$.
 - I've read a new character from the input string but it's not a whitespace character, so if you are counting don't increment. See the transitions on $s_{\rm rej}$ and $s_{\rm acc}$.
 - I've not read any whitespace characters at the start of the input so if you are counting then set the value to zero. See the transition from state $s_0$ to state $s_{\rm rej}$.
- Third, it's the responsibility of whatever is running the FSM to create the token. It's like a contract between the FSM and whatever is running it. We'll use a _run_ function to run the FSM.
 - The _run_ method is responsible for remembering how many whitespace characters were actually read.
 - The _run_ method is responsible for creating the token. It can create the token value because it knows how many whitespace characters it has read.
 - The FSM is responsible for telling whatever is calling it how it should count.
 

Here's the FSM.

[![whitespace FSM](revised-whitespace_fsm.png)](https://www.dropbox.com/scl/fi/dcrrgp10mp8wll5bho0q2/revised-whitespace_fsm.png?rlkey=xkoe3uvsukyelrp9yszkzmgvw&dl=0)


The FSM outputs instructions to the _run_ method. These instructions are of the form "whatever your variable $o$ is, map it to something else." These instructions are written as mappings "$o$ gets something else."
- The instruction on the transition from $s_0$ to $s_{\rm rej}$ says "no whitespace characters were found at the start of the input."
- The instruction on the transition from $s_0$ to $s_1$ says "the input starts with a whitespace character."
- The instruction on the self loop on $s_1$ says "I've read another whitespace character in the input sequence."
- The instruction on the transition from $s_1$ to $s_{\rm acc}$ says "I've reached the end of the whitespaces at the start of the input so you don't need to do anything."
- The instructions on the self loops on $s_{\rm rej}$ and $s_{\rm acc}$ say "I'm still reading input characters but you don't need to do anything."
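
As a small illustration of these mappings, the sketch below applies by hand the instructions the machine would emit for the input `"  x"` (two whitespace characters followed by a non-whitespace character). The list of instructions is written out manually as an assumption about what the diagram prescribes; it is not produced by running the FSM, and the function names are hypothetical.

```python
def set_to_one(o: int) -> int:
    """s_0 -> s_1: the input starts with a whitespace character."""
    return 1


def increment(o: int) -> int:
    """Self loop on s_1: another whitespace character was read."""
    return o + 1


def leave_unchanged(o: int) -> int:
    """s_1 -> s_acc: the whitespace has ended, so there is nothing to do."""
    return o


# Instructions for the input "  x", in the order the machine would emit them
instructions = [set_to_one, increment, leave_unchanged]

o = 0  # the count kept by whatever runs the machine
for instruction in instructions:
    o = instruction(o)

print(o)  # 2
```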

In [27]:
from typing import Callable as function

# Define a type for State and the Output operator
State = function[[str], tuple[function, function]] # Notice that the State return type has changed
OutputOperator = function[[int], int] # This defines a type for the instructions to the run method

del WhitespaceFSM
class WhitespaceFSM(FiniteStateMachine):
    def __init__(self) -> None:
        start_state: State = self.s_0
        terminating_states: set[State] = {self.s_acc, self.s_rej}
        super().__init__(start_state, terminating_states)

    ######################################
    ## Define a Function for each state ##
    ######################################
    def s_0(self, input_symbol: str) -> tuple[State, OutputOperator]:
        # return the next state and return the output as a tuple (next state, output operator)
        if input_symbol in {' ', '\t', '\r', '\n'}:
            return (self.s_1, lambda x: 1)
        else:
            return (self.s_rej, lambda x: 0)
    
    def s_1(self, input_symbol: str) -> tuple[State, OutputOperator]:
        # return the next state and return the output as a tuple (next state, output operator)
        if input_symbol in {' ', '\t', '\r', '\n'}:
            return (self.s_1, lambda x: x+1)
        else:
            return (self.s_acc, lambda x: x)
    
    def s_rej(self, input_symbol: str) -> tuple[State, OutputOperator]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_rej, lambda x: x)

    def s_acc(self, input_symbol: str) -> tuple[State, OutputOperator]:
        # return the next state and return the output as a tuple (next state, output operator)
        return (self.s_acc, lambda x: x)

Many of you will not have used lambda functions. I haven't used them a lot but thought they were really useful here.  The following can help you learn what a lambda function is and what each of the lambda functions above does.
- Do an internet search on "geeksforgeeks lambda functions python" to find a good tutorial on lambda functions.
- Ask an LLM "What does the expression `lambda x: x + 1` do?"
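
For a quick sense of the syntax, a lambda is an unnamed function written as a single expression. The sketch below pairs `lambda x: x + 1` with an equivalent `def` and then tries out the other kinds of lambdas used in the FSM above. The variable names are only illustrative.

```python
# A lambda and an equivalent named function
increment_lambda = lambda x: x + 1


def increment(x: int) -> int:
    return x + 1


print(increment_lambda(3), increment(3))  # 4 4

# The other operators in the FSM either ignore their argument or pass it through
set_to_one = lambda x: 1  # always returns 1, regardless of x
keep = lambda x: x        # returns x unchanged

print(set_to_one(7), keep(7))  # 1 7
```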

The _run_ method needs to be redefined since it is now part of a new contract with the FSM. Specifically, it must now keep track of the number of characters read by applying the operator returned from each state. Once it knows how many characters were read, it can create the correct token.

In [28]:
#############################################
## Define a function that runs the machine ##
#############################################
del run
def run(input_sequence: str, fsm: FiniteStateMachine) -> Token | None:
    # Use a C++ style declaration of local variables
    present_state: State = fsm.start_state
    next_state: State
    num_chars_read: int = 0
    operator: OutputOperator
    token: Token

    # Added so we don't terminate in state s_1
    input_sequence += "#"

    # Step through each input symbol
    for symbol in list(input_sequence):
        # Ask the present state what the next state and output should be
        next_state, operator = present_state(symbol)
        # Update the state
        present_state = next_state
        # Apply the operator to update the number of characters read
        num_chars_read = operator(num_chars_read)
        # Don't keep running if you know you've succeeded or failed
        if present_state in fsm.terminating_states: 
            break

    # Create the token 
    if num_chars_read == 0:
        return None
    else:
        token = Token.whitespace(input_sequence[:num_chars_read])
        return token

In [29]:
fsm: WhitespaceFSM = WhitespaceFSM()
returned_token = run("\n\n  \n\n", fsm)
str(returned_token)

'(WHITESPACE,"\n\n  \n\n",0)'
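
The token value comes from slicing the input with the final count. Here is a short sketch of just that step, with the count written in by hand as an assumption instead of being computed by the FSM. Notice that the slice never includes the appended "#", because counting stops at the first character that is not whitespace.

```python
input_sequence = " \n\t hi \n\t" + "#"  # the run function appends "#" before stepping
num_chars_read = 4  # assumed result: the input starts with four whitespace characters

value = input_sequence[:num_chars_read]
print(repr(value))  # ' \n\t '
```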

Test

In [30]:
test_whitespace_fsm()

Test that whitespace fsm returns None if string does not begin with whitespace char succeeded
All tests of the whitespace token types succeeded
All tests of the whitespace fsm succeeded


All tests pass! 

Since lambda functions might be really new to you, the starter code uses a different approach for counting the number of whitespace characters read. The key idea in the starter code is to treat an int as an additional input to the FSM, which then outputs an updated value of that int.
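
One way such an approach could look is sketched below. This is a guess at the general shape under stated assumptions, not the starter code itself: each state function takes the current count along with the input symbol and returns the updated count in place of a lambda. All names and signatures here are hypothetical.

```python
WHITESPACE = {" ", "\t", "\r", "\n"}


# Hypothetical signature: (symbol, count) -> (next state, updated count)
def s_0(symbol: str, count: int):
    return (s_1, 1) if symbol in WHITESPACE else (s_rej, 0)


def s_1(symbol: str, count: int):
    return (s_1, count + 1) if symbol in WHITESPACE else (s_acc, count)


def s_acc(symbol: str, count: int):
    return (s_acc, count)


def s_rej(symbol: str, count: int):
    return (s_rej, count)


def count_leading_whitespace(input_sequence: str) -> int:
    """Step through the input, letting each state return the updated count."""
    state, count = s_0, 0
    for symbol in input_sequence + "#":
        state, count = state(symbol, count)
        if state in (s_acc, s_rej):
            break
    return count


print(count_leading_whitespace(" \n\t hi"))  # 4
print(count_leading_whitespace("hi"))        # 0
```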