Yet another not-so-simple lex analysor generator and naïve regular expression engine written in programming language Scheme.
Warning! This project is still evolving, most of the API will be changed later.
A regular expression usually is a sequence of characters that define a search pattern.
In usual programming language, Regular Expression is defined in form of String or literal. But for now in SLeX, you can only use RE-IR (Intermediate Representation) to define a RE (see Roadmap). In SLeX, we have:
- 1 primitive RE-IR constant:
eps
: eat none of input characters.
- 6 primitive constructors:
sig
,sig*
: eat a single character in the given char-set.sig-co
,sig*-co
: eat any single character which not in the given char-set.exact
: match exactly the given string (i.e. sequence of chars), case-sensitively.exact-ci
: match exactly the given string, but case-insensitively.
- and another 5 RE-IR combinators:
alt
seq
rep?
rep+
kln*
The constant eps
returns a ε-RE which matches whatever you gave:
(define RE-eps (RE/compile eps))
(RE/matches? RE-eps "") ; ==> #t
(RE/matches? RE-eps "asdf") ; ==> #t
Gavin a char-set, sig
construct a RE that matches exact one char in that candidate set. On the other hand, sig*
receives multiple chars as arguments and packed them as a char-set.
(define RE-digit (RE/compile (sig slex:digit)))
(define RE-digit (RE/compile (sig* #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)))
(RE/matches? RE-digit "1") ; ==> #t
(RE/matches? RE-digit "a") ; ==> #f
(RE/matches? RE-digit "12") ; ==> #f
There are some pre-defined char-sets that you can use:
(define slex:digit char-set:numeric)
(define slex:upper-case char-set:upper-case)
(define slex:lower-case char-set:lower-case)
(define slex:alpha char-set:alphabetic)
(define slex:alphanum char-set:alphanumeric)
(define slex:whitespace char-set:whitespace)
(define slex:standard char-set:standard)
(define slex:graphic char-set:graphic)
exact
and exact-ci
match exactly the given string, while exact-ci
means case-insensitive:
(define RE-switch0 (RE/compile (exact "switch0")))
(define RE-switch0-ci (RE/compile (exact-ci "switch0")))
(RE/matches? RE-switch0 "switch0") ; ==> #t
(RE/matches? RE-switch0 "SWITCH0") ; ==> #f
(RE/matches? RE-switch0-ci "sWiTCh0") ; ==> #t
alt
makes an alternation of two or more REs:
(define RE-peculiar-identifier (RE/compile (alt (sig* #\+ #\-) (exact "...")))
(RE/matches? RE-peculiar-identifier "+") ; ==> #t
(RE/matches? RE-peculiar-identifier "...") ; ==> #t
(RE/matches? RE-peculiar-identifier "+..") ; ==> #f
seq
connects two or more REs:
; equivalent to (exact "...")
(define RE-ldots (RE/compile (seq (sig* #\.) (sig* #\.) (sig* #\.))))
(RE/matches? RE-ldots "...") ; ==> #t
(RE/matches? RE-ldots ".") ; ==> #f
(RE/matches? RE-ldots "....") ; ==> #f
rep?
means that the RE should appear zero or exact one time:
; positive-integer may have a plus sign prefix
(define RE-positive-integer-revised
(RE/compile (seq (rep? (sig* #\+)) positive-integer)))
(RE/matches? RE-positive-integer-revised "+012345") ; ===> #t
kln*
creates a so called Kleene Closure of a RE:
; for simplicity, we suppose that integer could start with '0'
(define RE-positive-integer (RE/compile (kln* slex:digit)))
(RE/matches? RE-positive-integer "123456") ; ===> #t
(RE/matches? RE-positive-integer "012345") ; ===> #t
(RE/matches? RE-positive-integer "0.1234") ; ===> #f
if a RE r is equivalent to a DFA M, thus, a String s ∈ L(r) iff δ q0 s ∈ F or partial-δ q0 s "" = f s "".
scan
function will return all the possible matches occures in text.
To analysis the token stream, you should define a lexer with define-lex
special form and then add definitions and rules to it. The define-lex
special form has following forms:
(define-lex <lex-id> {<definition-exps>} <rule-exps>)
;;; where
;;; <definition-exps> ::= (<def-type> <def-clause> ... )
;;; <def-type> ::= definitions | definitions*
;;; <def-clause> ::= (<identifier> <scheme-exp>)
;;;
;;; <rule-exps> ::= (rule <rule-clause> ... {<default-exp>})
;;; <rule-clause> ::= (<pattern> <action>)
;;; <pattern> ::= <scheme-exp>
;;; <action> ::= <scheme-exp>
;;; <default-exp> ::= (default <scheme-exp>)
The semantic of define-lex
could be described as:
the special form
define-lex
defines a named lexer under a lexical enviornment expanded by keyworddefinition
(usinglet
) ordefinitions
(usinglet*
), whose pattern-action pairs are specified by keywordrule
, where the<pattern>
of<rule-clause>
must be a expression which evaluated to be a RE-IR or RE-string, and the<action>
of<rule-clause>
must be a expression which evaluated to be aself-evaluating
data or a tripple arugments procedure.
This may be confusing, let's take example/simple-lang.scm
as example:
(define-lex simple-L
(definition
(integer (rep+ (sig slex:digit)))
(identifier (rep+ (sig slex:alpha)))
(func (sig* #\+ #\-))
(keyword-set (exact-ci "set!"))
(delimiter (sig slex:whitespace))
; also action procedure
(sv-token
(lambda (color-func)
(lambda (token-str start-at)
(color-func token-str)))))
(rule
(integer (sv-token string-white))
(keyword-set (sv-token string-red))
(identifier (sv-token string-green))
(func (sv-token string-blue))
(delimiter Lex/action:token-str)
(default Lex/action:handle-error)))
~~~~~~~~~~~~~
^ String (DFANode . List<DFANode>)
| +\----+ /
TODO | RE | DFA
| +--+--+ ^
v Parse | | NFA->DFA
~~~~~~~~~~~~ v |
RE IR ---------------> NFA
/ RE->NFA \
(Symbol . Char-set | RE) (NFANode . NFANode)
A DFA M is a 5-tuple, (Q, Σ, δ, q0, F), consisting of
- a finite set of states (Q)
- a finite set of input symbols called the alphabet (Σ)
- a transition function (δ : Q × Σ → Q)
- an initial or start state (q0 ∈ Q)
- a set of accept states (F ⊆ Q)
It is very useful to extend the transition function to receive a string as argument, thus we may define δ' as (where $
stands for end-of-string):
δ' : Q x Σ* -> Q
δ' q $ = q
δ' q (w :: ws) = δ (δ' q ws) w
Sometime, when defining a scan
like function, we may find a partial transition function is useful also. A partial transition function also accept two argument as the normal transition function:
partial-δ Q x Σ* x Σ* -> Σ* x Σ*
partial-δ q xs $ = xs :: ""
partial-δ q xs (w :: ws) =
if δ q w == fail then xs :: (w :: ws)
else partial-δ (δ q w) (append xs w) ws