#### Flex

The Unix tool [`flex`](https://github.com/westes/flex) is a _scanner (lexical analyzer) generator_: it takes a specification of a set of symbols (tokens) and associated actions as input and generates a C program which recognizes these symbols and executes the associated actions, which essentially are attribute evaluation rules written in C. Its predecessor [`lex`](http://dinosaur.compilertools.net/lex/index.html) ([pdf](http://poincare.matf.bg.ac.rs/~nemanja_micovic/materijali/ppj/2017.2018/05/lex2.pdf)) originated at Bell Labs in 1975. [Scanner generators](https://en.wikipedia.org/wiki/Comparison_of_parser_generators#Regular_languages) exist for numerous programming languages.

The input to flex (and lex) is a file of the form:

    definitions
    %%
    rules
    %%
    user code
    
A rule is of the form

    r		{action}

where `r` is a regular expression and `action` is a C statement. A definition

    ident    r

gives a name to a regular expression. The regular expressions are similar to those of `grep`. The generated scanner matches characters from standard input against the regular expressions of the rules following the principle of the _longest match_, e.g. `.+` matches everything up to the end of the current line and not just the next character (`.` matches any character except newline). If there is no match, the next character is copied to standard output. Upon a match, the matched prefix of the input is made available in the C variable `yytext` and its length in `yyleng`.

For example, the _ROT13 cipher_ encrypts a text by replacing each lowercase and uppercase letter with the one that comes 13 positions later in the alphabet, in a cyclical fashion. Other characters like digits are unchanged:

In [None]:
%%writefile rot13.l
%option noyywrap
%%
[a-z]   { char ch = yytext[0];
          ch = (ch - 'a' + 13) % 26 + 'a';
          printf("%c", ch);
        }
[A-Z]   { char ch = yytext[0];
          ch = (ch - 'A' + 13) % 26 + 'A';
          printf("%c", ch);
        }
%%
int main(void) {
    yylex();
    return 0;
}

Processing the above with `flex` results in `lex.yy.c`, in which the C function `yylex()` implements the rules and which literally includes the user code. The only user code needed here is a `main` function that calls `yylex()`. The file `lex.yy.c` is compiled with `cc` to an executable file, here called `rot13`:

In [None]:
!flex rot13.l

In [None]:
!cc -o rot13 lex.yy.c

As `rot13` reads from standard input, the Unix pipe (`|`) can be used to provide input:

In [None]:
!echo 1280 Main Street West | ./rot13

The ROT13 cipher is an _involution_, meaning that applying it twice results in the original text:

In [None]:
!echo 1280 Main Street West | ./rot13 | ./rot13
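For comparison, here is a minimal hand-written C sketch of what the two rules of `rot13.l` compute, without any generated scanner. The function names `rot13_char` and `rot13_string` are made up for this illustration and are not part of flex; the sketch assumes an ASCII-compatible character set in which the letters are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Rotate a single character by 13 positions, mirroring the two rules
   of rot13.l. Assumes letters are contiguous, as in ASCII. */
char rot13_char(char ch) {
    if (ch >= 'a' && ch <= 'z')
        return (char)((ch - 'a' + 13) % 26 + 'a');
    if (ch >= 'A' && ch <= 'Z')
        return (char)((ch - 'A' + 13) % 26 + 'A');
    /* No rule applies: pass the character through unchanged, as the
       scanner does when it copies unmatched input to the output. */
    return ch;
}

/* Apply ROT13 in place to a NUL-terminated string. */
void rot13_string(char *s) {
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = rot13_char(*s);
}
```

Calling `rot13_string` twice on the same buffer restores the original text, because 13 + 13 = 26 is a full turn around the alphabet.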

More than one rule can match. In that case, `flex` resolves the ambiguity as follows:
- The rule with the longest match is preferred.
- Among the rules with the same number of characters matching, the first is preferred.

As an example, consider converting roman numerals: `I` is `1`, `V` is `5`, so `II` is `2` and `VI` is `6`, but `IV` is `4`, not `I` followed by `V`. The following program takes two roman numerals and prints their sum. It also illustrates the use of a definition: `WS` (white space) is defined as `[ \t]+`. In the rules, `WS` is enclosed in braces to indicate that this refers to the definition, not the characters `WS`. The `|` after `{WS}` means that it shares the action of the next rule. The declaration of variable `total` is on a line that starts with a space; any such line is copied literally to the generated C code. Because the declaration appears before the first rule, `total` becomes a local variable of `yylex()` and is initialized to `0` on every call, so each call returns the value of one numeral. Note how in this example white space (tabs or spaces) separates names and patterns from what follows in the definitions and rules.

In [None]:
%%writefile roman.l
%option noyywrap
WS  [ \t]+
%%
    int total = 0;
I   total += 1;
IV  total += 4;
V   total += 5;
IX  total += 9;
X   total += 10;
XL  total += 40;
L   total += 50;
XC  total += 90;
C   total += 100;
CD  total += 400;
D   total += 500;
CM  total += 900;
M   total += 1000;
{WS}   |
\n  return total;
%%
int main(void) {
    int first = yylex();
    int second = yylex();
    printf("%d + %d = %d\n", first, second, first + second);
}

In [None]:
!flex roman.l

In [None]:
!cc -o roman lex.yy.c

In [None]:
!echo IV VI | ./roman
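The way the scanner arrives at `4` for `IV` and `6` for `VI` can be traced with a small hand-written C sketch of the rule selection. This is only an illustration under simplifying assumptions: the patterns are treated as literal strings, whereas flex compiles arbitrary regular expressions into an automaton, and the sketch stops at the first character no rule matches instead of echoing it. The names `rules`, `select_rule`, and `roman_value` are hypothetical and not part of flex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Literal patterns and values, in the same order as the rules of roman.l. */
struct rule { const char *pattern; int value; };

static const struct rule rules[] = {
    {"I", 1},   {"IV", 4},   {"V", 5},   {"IX", 9},   {"X", 10},
    {"XL", 40}, {"L", 50},   {"XC", 90}, {"C", 100},  {"CD", 400},
    {"D", 500}, {"CM", 900}, {"M", 1000},
};
enum { NUM_RULES = sizeof rules / sizeof rules[0] };

/* Return the index of the rule chosen for the start of `input`, or -1 if
   no rule matches. The length of the match is stored in *len. */
int select_rule(const char *input, size_t *len) {
    assert(input != NULL && len != NULL);
    int best = -1;
    *len = 0;
    for (int i = 0; i < NUM_RULES; i++) {
        size_t n = strlen(rules[i].pattern);
        /* The strict '>' prefers the longest match and, among matches
           of equal length, keeps the rule that was listed first. */
        if (strncmp(input, rules[i].pattern, n) == 0 && n > *len) {
            best = i;
            *len = n;
        }
    }
    return best;
}

/* Sum the values of the tokens at the start of `s`, stopping at the first
   character that no rule matches (white space, newline, end of string). */
int roman_value(const char *s) {
    assert(s != NULL);
    int total = 0;
    size_t len;
    int r;
    while ((r = select_rule(s, &len)) >= 0) {
        total += rules[r].value;
        s += len;
    }
    return total;
}
```

For the input `IV`, both the rule for `I` and the rule for `IV` match, and the longer match wins, so `roman_value("IV")` is `4`. For `VI`, only `V` matches at the first position and `I` at the second, so `roman_value("VI")` is `6`.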

Anything between `%{` and `%}` in the definitions section, as well as everything after the second `%%`, is also copied literally to the generated C code.

Running `man flex` from the terminal produces the man page and `info flex` produces the manual. The manual is also available [online](https://manpages.ubuntu.com/manpages/bionic/man1/lex.1.html).