# First Project: Lexer

The first project requires you to implement a scanner for the uC language,
specified in the [uC BNF Grammar](./doc/uC_Grammar.ipynb) notebook. Study the uC grammar
specification carefully. To complete this first project, you will use
[PLY](http://www.dabeaz.com/ply/), a Python version of the
[lex/yacc](http://dinosaur.compilertools.net/) toolset with the same functionality
but a friendlier interface. Please read this section in full
and carefully complete the steps indicated.

## Regular Expressions

Regular expressions are concise ways of describing a set of strings that meet a given
pattern. For example, we can specify the regular expression:
```python
r'[a-zA-Z_][0-9a-zA-Z_]*'
``` 
to describe valid identifiers in the uC language. Regular expressions are a mini-language
for specifying the rules that define a set of strings. This mini-language is very similar
across the programming languages that support regular expressions (also called RE or
regex). Thus, learning to write regular expressions in Python will also be useful for
writing them in other programming languages.
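
As a quick check of the identifier pattern above, the following sketch uses Python's standard `re` module. The helper name `is_identifier` and the sample strings are illustrative only and are not part of the project code.

```python
import re

# The identifier pattern from the text: a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]*')


def is_identifier(text):
    """Return True if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None


for sample in ["x", "_tmp1", "main", "2fast", "a-b", ""]:
    print(repr(sample), is_identifier(sample))
```

Note that `fullmatch` requires the entire string to match. `re.match` only anchors at the start, so it would accept `"a-b"` by matching the leading `a`.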

Your first task is to write a set of regular expressions that the lexer will use to
recognize the following patterns:

```python
# valid uC identifiers
identifier =
```

```python
# integer constants
int_const =
```

```python
# floating constants
float_const =
```

```python
# Comments in C-Style /* ... */
ccomment =
```

```python
# Unterminated C-style comment
uccomment =
```

```python
# C++-style comment (//...)
cppcomment =
```

```python
# string_literal
string_literal =
```

```python
# unmatched_quote
unquote =
```

```python
# testing
import re

b = re.match(ccomment, "/***/")
if not b:
    print("Error.")
```

## Writing a Lexer
Lexing is the process of taking input text and breaking it down into a stream
of tokens. Each token is like a valid word from the dictionary. Essentially, the role of
the lexer is to make sure that the input text consists of valid symbols and tokens
before any further processing related to parsing.
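
To make the idea of a token stream concrete, here is a minimal sketch written with only the standard `re` module. It is not how PLY is used and not the project solution: the token set in `TOKEN_SPEC` is a small hypothetical subset, and the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are assumptions made for this illustration.

```python
import re

# Hypothetical token set for illustration; not the full uC specification.
TOKEN_SPEC = [
    ("INT_CONST", r"[0-9]+"),
    ("ID", r"[a-zA-Z_][0-9a-zA-Z_]*"),
    ("PLUS", r"\+"),
    ("SEMI", r";"),
    ("SKIP", r"[ \t\n]+"),   # whitespace, discarded
    ("MISMATCH", r"."),      # any other single character is an error
]

# Combine all patterns into one regex with a named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Yield (type, value, position) tuples for each token in text."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(
                f"Illegal character {match.group()!r} at position {match.start()}"
            )
        yield kind, match.group(), match.start()


print(list(tokenize("x + 42;")))
```

The input `x + 42;` becomes four tokens (`ID`, `PLUS`, `INT_CONST`, `SEMI`), and an input containing a character outside the token set raises an error instead of being silently accepted.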

Each token is defined by a regular expression. Thus, your task here is to define a set of
regular expressions for the uC language. The actual job of lexing will be handled by PLY.
For a better understanding study the [Lex](http://www.dabeaz.com/ply/ply.html#ply_nn3)
chapter in the PLY documentation.

### Specification
Your lexer must recognize the symbols and tokens of the uC grammar. For instance, in the
examples below, the name on the left is the token name, and the value on the right is
the matching text:

Reserved Keywords:
```
    FOR   : 'for'
    IF    : 'if'
    PRINT : 'print'
```

Identifiers:
```
    ID    : any text starting with a letter or '_', followed by any number of letters,
            digits, or underscores, that is not a reserved word.
```

Some Operators and Delimiters:
```
    PLUS    : '+'
    MINUS   : '-'
    TIMES   : '*'
    DIVIDE  : '/'
    EQUALS  : '='
    SEMI    : ';'
    LPAREN  : '('
    RPAREN  : ')'
```

Literals:
```
    INT_CONST : 123
    FLOAT_CONST : 1.234
    STRING_LITERAL : "Hello World\n"
```

For `INT_CONST` and `FLOAT_CONST`, you should consider only decimal integers and fixed-point numbers, respectively.


Comments:  To be ignored by your lexer
```
     //             Skips the rest of the line
     /* ... */      Skips a block (no nesting allowed)
```

Errors: Your lexer must report the following error messages:
```
     lineno: Unterminated string
     lineno: Unterminated comment
```
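
The "no nesting" rule and the unterminated-comment error can be illustrated without any regular expression. The sketch below is only a way to reason about the expected behavior; the helper `skip_block_comment` is hypothetical, and in the lexer itself comments are matched by the regular expressions you write.

```python
def skip_block_comment(text, start):
    """Return the index just past the comment that opens at `start`.

    The comment ends at the FIRST '*/' after the opening '/*', so an inner
    '/*' does not open a nested comment. Returns -1 if no '*/' is found,
    which corresponds to the "Unterminated comment" error.
    """
    if not text.startswith("/*", start):
        raise ValueError("no block comment starts at this position")
    end = text.find("*/", start + 2)
    return -1 if end == -1 else end + 2


nested = "/* a /* b */ c */"
stop = skip_block_comment(nested, 0)
print(repr(nested[:stop]))   # the part treated as the comment
print(repr(nested[stop:]))   # the part left over for the lexer

print(skip_block_comment("/* never closed", 0))
```

In the first input the comment stops after `b */`, leaving ` c */` to be lexed as ordinary text. In the second, no closing delimiter exists, so the lexer should report the error with the line number where the comment starts.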

**IMPORTANT: The uC BNF grammar contains specific rules that add support for pointers (and their addressing) to the language. This part is not compulsory and will not be evaluated, but feel free to implement it if you want.**

### Lex Skeleton

```python
import sys
import ply.lex as lex


class UCLexer:
    """A lexer for the uC language. After building it, set the
    input text with input(), and call token() to get new
    tokens.
    """

    def __init__(self, error_func):
        """Create a new Lexer.

        error_func: An error function. It will be called with an
        error message, line and column as arguments in case of
        an error during lexing.
        """
        self.error_func = error_func
        self.filename = ""

        # Keeps track of the last token returned from self.token()
        self.last_token = None

    def build(self, **kwargs):
        """Builds the lexer from the specification. Must be
        called after the lexer object is created.

        This method exists separately, because the PLY
        manual warns against calling lex.lex inside __init__
        """
        self.lexer = lex.lex(object=self, **kwargs)

    def reset_lineno(self):
        """Resets the internal line number counter of the lexer."""
        self.lexer.lineno = 1

    def input(self, text):
        self.lexer.input(text)

    def token(self):
        self.last_token = self.lexer.token()
        return self.last_token

    def find_tok_column(self, token):
        """Find the column of the token in its line."""
        last_cr = self.lexer.lexdata.rfind("\n", 0, token.lexpos)
        return token.lexpos - last_cr

    # Internal auxiliary methods
    def _error(self, msg, token):
        location = self._make_tok_location(token)
        self.error_func(msg, location[0], location[1])
        self.lexer.skip(1)

    def _make_tok_location(self, token):
        return (token.lineno, self.find_tok_column(token))

    # Reserved keywords
    keywords = (
        "ASSERT",
        "BREAK",
        "CHAR",
        "ELSE",
        "FLOAT",
        "FOR",
        "IF",
        "INT",
        "PRINT",
        "READ",
        "RETURN",
        "VOID",
        "WHILE",
    )

    keyword_map = {}
    for keyword in keywords:
        keyword_map[keyword.lower()] = keyword

    #
    # All the tokens recognized by the lexer
    #
    tokens = keywords + (
        # Identifiers
        "ID",
        # constants
        "INT_CONST",
        "FLOAT_CONST",
    )

    #
    # Rules
    #
    t_ignore = " \t"

    # Newlines
    def t_NEWLINE(self, t):
        # include a regex here for newline
        t.lexer.lineno += t.value.count("\n")

    def t_ID(self, t):
        # include a regex here for ID
        t.type = self.keyword_map.get(t.value, "ID")
        return t

    def t_comment(self, t):
        # include a regex here for comment
        t.lexer.lineno += t.value.count("\n")

    def t_error(self, t):
        msg = "Illegal character %s" % repr(t.value[0])
        self._error(msg, t)

    # Scanner (used only for test)
    def scan(self, data):
        self.lexer.input(data)
        output = ""
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print(tok)
            output += str(tok) + "\n"
        return output
```
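
The `t_ID` rule in the skeleton separates reserved words from ordinary identifiers with a dictionary lookup rather than with one regular expression per keyword. The standalone sketch below reproduces only that lookup, using a shortened keyword list; the helper `classify` is hypothetical and exists only for this illustration.

```python
# Shortened keyword list for illustration; the skeleton defines the full one.
keywords = ("FOR", "IF", "PRINT")

# Map each lowercase lexeme to its token name, e.g. "for" -> "FOR".
keyword_map = {keyword.lower(): keyword for keyword in keywords}


def classify(lexeme):
    """Return the keyword token name, or "ID" if the lexeme is not reserved."""
    return keyword_map.get(lexeme, "ID")


for lexeme in ["for", "form", "For", "print"]:
    print(lexeme, "->", classify(lexeme))
```

Because the lookup happens after the whole identifier has been matched, a name such as `form` is classified as `ID` even though it starts with a keyword. The lookup is also case-sensitive, so `For` is an ordinary identifier.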

## Testing
For initial development, try running the lexer on a sample input file such as:

```cpp
/* comment */
int j = 3;
int main () {
  int i = j;
  int k = 3;
  int p = 2 * j;
  assert p == 2 * i;
}
```

The result should look similar to the text shown below.

```
LexToken(INT,'int',2,14)
LexToken(ID,'j',2,18)
LexToken(EQUALS,'=',2,20)
LexToken(INT_CONST,'3',2,22)
LexToken(SEMI,';',2,23)
LexToken(INT,'int',3,25)
LexToken(ID,'main',3,29)
LexToken(LPAREN,'(',3,34)
LexToken(RPAREN,')',3,35)
LexToken(LBRACE,'{',3,37)
LexToken(INT,'int',4,41)
LexToken(ID,'i',4,45)
LexToken(EQUALS,'=',4,47)
LexToken(ID,'j',4,49)
LexToken(SEMI,';',4,50)
LexToken(INT,'int',5,54)
LexToken(ID,'k',5,58)
LexToken(EQUALS,'=',5,60)
LexToken(INT_CONST,'3',5,62)
LexToken(SEMI,';',5,63)
LexToken(INT,'int',6,67)
LexToken(ID,'p',6,71)
LexToken(EQUALS,'=',6,73)
LexToken(INT_CONST,'2',6,75)
LexToken(TIMES,'*',6,77)
LexToken(ID,'j',6,79)
LexToken(SEMI,';',6,80)
LexToken(ASSERT,'assert',7,84)
LexToken(ID,'p',7,91)
LexToken(EQ,'==',7,93)
LexToken(INT_CONST,'2',7,96)
LexToken(TIMES,'*',7,98)
LexToken(ID,'i',7,100)
LexToken(SEMI,';',7,101)
LexToken(RBRACE,'}',8,103)
```
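
Each line of this output has the form `LexToken(type, value, lineno, lexpos)`, where `lexpos` is the zero-based character offset of the token in the whole input rather than a column. The sketch below checks the first tokens of the sample by hand, using plain string operations; `column_of` is a hypothetical helper that repeats the arithmetic of `find_tok_column` from the skeleton.

```python
# First two lines of the sample input shown above.
source = (
    "/* comment */\n"
    "int j = 3;\n"
)


def column_of(text, lexpos):
    """Column of the character at offset lexpos, as in find_tok_column."""
    last_cr = text.rfind("\n", 0, lexpos)  # -1 when on the first line
    return lexpos - last_cr


for lexeme in ["int", "j", "=", "3", ";"]:
    lexpos = source.find(lexeme)
    lineno = source.count("\n", 0, lexpos) + 1
    print(lexeme, lineno, lexpos, column_of(source, lexpos))
```

The comment line occupies 14 characters including its newline, which is why the first `int` is reported at offset 14 on line 2. Its column is 1, since columns computed this way start at 1.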

Carefully study the output of the lexer and make sure that it makes sense. Once you are
reasonably happy with the output, try running some of the trickier tests designed to stress various corner cases. The repository provided as a base for the project contains a large set of tests to verify your code; check them for more examples.

Here is another example:


```cpp
int v[5] = { 1, 3, 5, 7, 9};
assert v[3] == 7;
```

```
LexToken(INT,'int',1,0)
LexToken(ID,'v',1,4)
LexToken(LBRACKET,'[',1,5)
LexToken(INT_CONST,'5',1,6)
LexToken(RBRACKET,']',1,7)
LexToken(EQUALS,'=',1,9)
LexToken(LBRACE,'{',1,11)
LexToken(INT_CONST,'1',1,13)
LexToken(COMMA,',',1,14)
LexToken(INT_CONST,'3',1,16)
LexToken(COMMA,',',1,17)
LexToken(INT_CONST,'5',1,19)
LexToken(COMMA,',',1,20)
LexToken(INT_CONST,'7',1,22)
LexToken(COMMA,',',1,23)
LexToken(INT_CONST,'9',1,25)
LexToken(RBRACE,'}',1,26)
LexToken(SEMI,';',1,27)
LexToken(ASSERT,'assert',2,29)
LexToken(ID,'v',2,36)
LexToken(LBRACKET,'[',2,37)
LexToken(INT_CONST,'3',2,38)
LexToken(RBRACKET,']',2,39)
LexToken(EQ,'==',2,41)
LexToken(INT_CONST,'7',2,44)
LexToken(SEMI,';',2,45)
```