A lexical analyzer generator that converts regular expression definitions into a Non-deterministic Finite Automaton (NFA) for token recognition.
Supports: Keywords, punctuations, identifiers, numbers, operators, character ranges (a-z), and standard regex operations (|, *, +, concatenation).
compiler_project/
├── CMakeLists.txt
├── main.cpp
├── README.md
│
├── LexicalAnalyzerGenerator/
│ ├── PostfixConverter.cpp # Infix to postfix converter
│ ├── FiniteAutomaton.cpp # State and FA classes
│ ├── NFA.cpp # NFA with regex operations
│ ├── RegexToNfaConverter.cpp # Converts postfix to NFA
│ ├── LexicalParser.cpp # Parses regex files
├── utils
│ └── StringUtils.cpp # String utilities
│
└── test/
├── test_postfix.cpp
├── test_fa.cpp
├── test_nfa_operations.cpp
├── test_utils.cpp
├── test_parser.cpp
└── test_integration.cpp
| File | Purpose | Key Components |
|---|---|---|
| PostfixConverter.cpp | Converts infix regex to postfix using Shunting Yard algorithm | PostfixConverter class, handles operator precedence and parentheses |
| FiniteAutomaton.cpp | Basic automaton structure | State class (transitions, epsilon closure), FiniteAutomaton class (start/end states, state counting) |
| NFA.cpp | NFA with regex operations | Wraps FiniteAutomaton, operators: + (positive closure), * (Kleene star), ` |
| RegexToNfaConverter.cpp | Converts postfix expressions to NFAs | Stack-based Thompson's Construction, handles definitions and lambda transitions |
| LexicalParser.cpp | Parses regex specification files | Reads keywords {...}, punctuations [...], definitions =, expressions :, expands ranges (a-z), normalizes spaces |
| StringUtils.cpp | String manipulation utilities | split() |
Entry point - parses input file and builds complete NFA.
{keyword1 keyword2} # Keywords (priority 0)
[punct1 punct2] # Punctuations (priority 1)
name = expression # Regular definition
name : expression # Token pattern (priority 2+)
Example:
{if while for}
[; , ( )]
letter = a-z | A-Z
digit = 0-9
id: letter (letter|digit)*
num: digit+
# Create build directory
mkdir build && cd build
# Configure
cmake ..
# Build
make
# Run main program
./compiler_project
# Run tests
./test_postfix
./test_fa
./test_nfa_operations
./test_parser
./test_integrationg++ -std=c++20 -I. main.cpp LexicalAnalyzerGenerator/*.cpp -o compiler_project| Test File | Tests |
|---|---|
| test_postfix.cpp | Infix to postfix conversion |
| test_fa.cpp | Basic automaton operations |
| test_nfa_operations.cpp | NFA regex operations (union, concatenation, closures) |
| test_utils.cpp | String utilities |
| test_parser.cpp | File parsing and token priorities |
| test_integration.cpp | Complete pipeline |
Run all tests:
cd build
./test_postfix && ./test_fa && ./test_nfa_operations && ./test_parser && ./test_integrationInput File → LexicalParser → PostfixConverter → RegexToNfaConverter → Complete NFA
↓ ↓ ↓
Parse tokens Infix→Postfix Postfix→NFA (Thompson's)
Expand ranges Handle operators Stack-based construction
Normalize spaces
- Shunting Yard: Infix to postfix conversion with operator precedence
- Thompson's Construction: Postfix to NFA using stack-based approach
- Epsilon Closure: DFS to compute ε-closure for NFA states
- Range Expansion:
a-z→a|b|c|...|z
- Priority 0: Keywords (highest)
- Priority 1: Punctuations
- Priority 2+: Regular expressions (order of definition)
Used to resolve conflicts during token recognition.
|- Alternation (union)%- Concatenation*- Kleene closure (0 or more)+- Positive closure (1 or more)()- Grouping\- Escape character
\L- Lambda (epsilon/empty string)a-z- Character range\(- Escaped parenthesis