Compiler Project - Lexical Analyzer Generator

Overview

A lexical analyzer generator that converts regular expression definitions into a Non-deterministic Finite Automaton (NFA) for token recognition.

Supports keywords, punctuations, identifiers, numbers, operators, character ranges (a-z), and the standard regex operations (|, *, +, concatenation).

Project Structure

compiler_project/
├── CMakeLists.txt
├── main.cpp
├── README.md
│
├── LexicalAnalyzerGenerator/
│   ├── PostfixConverter.cpp       # Infix to postfix converter
│   ├── FiniteAutomaton.cpp        # State and FA classes
│   ├── NFA.cpp                    # NFA with regex operations
│   ├── RegexToNfaConverter.cpp    # Converts postfix to NFA
│   └── LexicalParser.cpp          # Parses regex files
│
├── utils/
│   └── StringUtils.cpp            # String utilities
│
└── test/
    ├── test_postfix.cpp
    ├── test_fa.cpp
    ├── test_nfa_operations.cpp
    ├── test_utils.cpp
    ├── test_parser.cpp
    └── test_integration.cpp

File Descriptions

LexicalAnalyzerGenerator/

File	Purpose	Key Components
PostfixConverter.cpp	Converts infix regex to postfix using Shunting Yard algorithm	`PostfixConverter` class, handles operator precedence and parentheses
FiniteAutomaton.cpp	Basic automaton structure	`State` class (transitions, epsilon closure), `FiniteAutomaton` class (start/end states, state counting)
NFA.cpp	NFA with regex operations	Wraps FiniteAutomaton, operators: `+` (positive closure), `*` (Kleene star), `|` (union), concatenation
RegexToNfaConverter.cpp	Converts postfix expressions to NFAs	Stack-based Thompson's Construction, handles definitions and lambda transitions
LexicalParser.cpp	Parses regex specification files	Reads keywords `{...}`, punctuations `[...]`, definitions `=`, expressions `:`, expands ranges (a-z), normalizes spaces
StringUtils.cpp	String manipulation utilities	`split()`

main.cpp

Entry point: parses the input rules file and builds the complete NFA.

Input File Format

{keyword1 keyword2}          # Keywords (priority 0)
[punct1 punct2]              # Punctuations (priority 1)
name = expression            # Regular definition
name : expression            # Token pattern (priority 2+)

Example:

{if while for}
[; , ( )]
letter = a-z | A-Z
digit = 0-9
id: letter (letter|digit)*
num: digit+
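
The four line forms can be told apart by the first non-blank character, or by whichever of `=` and `:` appears first after the rule name. The following is a minimal sketch of that idea, not the project's `LexicalParser`; the names `RuleKind` and `classifyLine` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical rule kinds, named for illustration only.
enum class RuleKind { Keywords, Punctuations, Definition, Expression, Unknown };

// Classify one line of the rules file by its leading character, or by the
// first '=' / ':' separator that follows the rule name.
RuleKind classifyLine(const std::string& line) {
    std::size_t first = line.find_first_not_of(" \t");
    if (first == std::string::npos) return RuleKind::Unknown;  // blank line
    if (line[first] == '{') return RuleKind::Keywords;
    if (line[first] == '[') return RuleKind::Punctuations;
    std::size_t sep = line.find_first_of("=:", first);
    if (sep == std::string::npos) return RuleKind::Unknown;
    return line[sep] == '=' ? RuleKind::Definition : RuleKind::Expression;
}
```

With the example above, `classifyLine("letter = a-z | A-Z")` yields `RuleKind::Definition` and `classifyLine("id: letter (letter|digit)*")` yields `RuleKind::Expression`. Checking the brackets first keeps a punctuation line such as `[: ;]` from being mistaken for a token pattern.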

How to Build and Run

Build with CMake

# Create build directory
mkdir build && cd build

# Configure
cmake ..

# Build
make

# Run main program
./compiler_project

# Run tests
./test_postfix
./test_fa
./test_nfa_operations
./test_parser
./test_integration

Build without CMake

g++ -std=c++20 -I. main.cpp LexicalAnalyzerGenerator/*.cpp -o compiler_project

Testing

Test File	Tests
test_postfix.cpp	Infix to postfix conversion
test_fa.cpp	Basic automaton operations
test_nfa_operations.cpp	NFA regex operations (union, concatenation, closures)
test_utils.cpp	String utilities
test_parser.cpp	File parsing and token priorities
test_integration.cpp	Complete pipeline

Run all tests:

cd build
./test_postfix && ./test_fa && ./test_nfa_operations && ./test_parser && ./test_integration

Pipeline Flow

Input File → LexicalParser → PostfixConverter → RegexToNfaConverter → Complete NFA
                ↓              ↓                      ↓
          Parse tokens   Infix→Postfix      Postfix→NFA (Thompson's)
          Expand ranges  Handle operators   Stack-based construction
          Normalize spaces
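
The last stage of the flow, postfix to NFA, can be illustrated with a compact stack-based Thompson's Construction. This is a sketch under stated assumptions rather than the code in `RegexToNfaConverter.cpp`: operands are single characters, `%` is the concatenation operator, the postfix input is well-formed, and the types `Edge`, `Fragment`, and `Nfa` plus the functions `buildNfa` and `accepts` are hypothetical. Definitions and `\L` are not handled here.

```cpp
#include <set>
#include <stack>
#include <string>
#include <vector>

constexpr char kEpsilon = 0;  // symbol value used to mark an epsilon edge

// Hypothetical minimal NFA: states are integer ids, edges are stored flat.
struct Edge { int from; char symbol; int to; };
struct Fragment { int start; int accept; };
struct Nfa {
    std::vector<Edge> edges;
    int stateCount = 0;
    int start = -1;
    int accept = -1;
    int newState() { return stateCount++; }
    void link(int from, char symbol, int to) { edges.push_back({from, symbol, to}); }
};

// Build an NFA from a postfix regex: each operand pushes a two-state
// fragment, each operator pops its operands and pushes the combined fragment.
Nfa buildNfa(const std::string& postfix) {
    Nfa nfa;
    std::stack<Fragment> frags;
    for (char c : postfix) {
        if (c == '%') {  // concatenation: left.accept --eps--> right.start
            Fragment right = frags.top(); frags.pop();
            Fragment left = frags.top(); frags.pop();
            nfa.link(left.accept, kEpsilon, right.start);
            frags.push({left.start, right.accept});
        } else if (c == '|') {  // union: new start and accept around both
            Fragment right = frags.top(); frags.pop();
            Fragment left = frags.top(); frags.pop();
            int s = nfa.newState();
            int a = nfa.newState();
            nfa.link(s, kEpsilon, left.start);
            nfa.link(s, kEpsilon, right.start);
            nfa.link(left.accept, kEpsilon, a);
            nfa.link(right.accept, kEpsilon, a);
            frags.push({s, a});
        } else if (c == '*' || c == '+') {  // closures
            Fragment inner = frags.top(); frags.pop();
            int s = nfa.newState();
            int a = nfa.newState();
            nfa.link(s, kEpsilon, inner.start);
            nfa.link(inner.accept, kEpsilon, a);
            nfa.link(inner.accept, kEpsilon, inner.start);  // repeat the body
            if (c == '*') nfa.link(s, kEpsilon, a);          // allow zero copies
            frags.push({s, a});
        } else {  // operand: start --c--> accept
            int s = nfa.newState();
            int a = nfa.newState();
            nfa.link(s, c, a);
            frags.push({s, a});
        }
    }
    Fragment whole = frags.top();
    nfa.start = whole.start;
    nfa.accept = whole.accept;
    return nfa;
}

// Simulate the NFA on `input` by tracking the set of reachable states.
bool accepts(const Nfa& nfa, const std::string& input) {
    auto closure = [&](std::set<int> states) {
        bool grew = true;
        while (grew) {  // keep following epsilon edges until nothing is added
            grew = false;
            for (const Edge& e : nfa.edges) {
                if (e.symbol == kEpsilon && states.count(e.from) &&
                    states.insert(e.to).second) {
                    grew = true;
                }
            }
        }
        return states;
    };
    std::set<int> current = closure({nfa.start});
    for (char c : input) {
        std::set<int> next;
        for (const Edge& e : nfa.edges) {
            if (e.symbol == c && current.count(e.from)) next.insert(e.to);
        }
        current = closure(next);
    }
    return current.count(nfa.accept) > 0;
}
```

For example, `buildNfa("ab|*a%")` builds an NFA for `(a|b)*a`, which accepts `abba` and rejects `ab`. The `accepts` helper exists only to check the construction; it is not a claim about how the project recognizes tokens.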

Key Algorithms

Shunting Yard: Infix to postfix conversion with operator precedence
Thompson's Construction: Postfix to NFA using stack-based approach
Epsilon Closure: DFS to compute ε-closure for NFA states
Range Expansion: a-z → a|b|c|...|z
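
The epsilon-closure step listed above can be sketched as an iterative depth-first search. The representation here (`EpsilonEdges`, a map from state id to its epsilon-successors) and the function name `epsilonClosure` are assumptions made for illustration; the project's `State` class may store transitions differently.

```cpp
#include <map>
#include <set>
#include <stack>
#include <vector>

// Hypothetical representation: state id -> list of epsilon-successor ids.
using EpsilonEdges = std::map<int, std::vector<int>>;

// Iterative DFS: every state reachable from `start` using only epsilon
// edges, including `start` itself.
std::set<int> epsilonClosure(const EpsilonEdges& edges, int start) {
    std::set<int> closure{start};
    std::stack<int> pending;
    pending.push(start);
    while (!pending.empty()) {
        int state = pending.top();
        pending.pop();
        auto it = edges.find(state);
        if (it == edges.end()) continue;  // no outgoing epsilon edges
        for (int next : it->second) {
            // insert() reports whether the state is new, which guards
            // against revisiting states on epsilon cycles.
            if (closure.insert(next).second) pending.push(next);
        }
    }
    return closure;
}
```

With edges `0 -> {1, 2}`, `1 -> {3}`, and `3 -> {0}`, the closure of state 0 is `{0, 1, 2, 3}`; the cycle back to 0 does not cause a loop because visited states are never pushed twice.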

Token Priority System

Priority 0: Keywords (highest)
Priority 1: Punctuations
Priority 2+: Regular expressions (order of definition)

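Priorities resolve conflicts during token recognition: when more than one pattern matches the same lexeme, the pattern with the lowest priority number wins.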
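
A minimal sketch of that conflict resolution, assuming the recognizer has already collected every token class that accepts the current lexeme. The `Candidate` struct and `pickToken` function are hypothetical names, not part of the project.

```cpp
#include <string>
#include <vector>

// Hypothetical record for a token class that accepted the current lexeme.
struct Candidate {
    std::string tokenName;
    int priority;  // 0 = keyword, 1 = punctuation, 2+ = order of definition
};

// Return the candidate with the smallest priority number, or nullptr if
// nothing accepted. On a tie, the earliest candidate in the list is kept.
const Candidate* pickToken(const std::vector<Candidate>& accepted) {
    const Candidate* best = nullptr;
    for (const Candidate& c : accepted) {
        if (best == nullptr || c.priority < best->priority) best = &c;
    }
    return best;
}
```

For the lexeme `if`, both the keyword (priority 0) and the `id` pattern (priority 2 or higher) accept, so `pickToken` returns the keyword.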

Quick Reference

Regex Operators

| - Alternation (union)
% - Concatenation
* - Kleene closure (0 or more)
+ - Positive closure (1 or more)
() - Grouping
\ - Escape character
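
The operators above can be fed through a small Shunting Yard pass. The sketch below is not the project's `PostfixConverter`. It assumes concatenation is already written explicitly as `%`, operands are single characters with no whitespace, and the precedence order is the usual one for regular expressions (closures, then concatenation, then alternation), which this README does not state. The name `toPostfix` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed precedence: '|' lowest, '%' (concatenation) above it. The postfix
// unary operators '*' and '+' bind tightest and are emitted immediately.
static int precedence(char op) {
    return op == '|' ? 1 : op == '%' ? 2 : 0;
}

// Shunting Yard over an infix regex with explicit '%' concatenation.
// A backslash escapes the next character; the pair is copied to the output
// as one operand.
std::string toPostfix(const std::string& infix) {
    std::string output;
    std::vector<char> ops;  // operator stack
    for (std::size_t i = 0; i < infix.size(); ++i) {
        char c = infix[i];
        if (c == '\\' && i + 1 < infix.size()) {
            output += c;
            output += infix[++i];
        } else if (c == '(') {
            ops.push_back(c);
        } else if (c == ')') {
            while (!ops.empty() && ops.back() != '(') {
                output += ops.back();
                ops.pop_back();
            }
            if (!ops.empty()) ops.pop_back();  // discard the '('
        } else if (c == '|' || c == '%') {
            // Left-associative: pop operators of equal or higher precedence.
            while (!ops.empty() && ops.back() != '(' &&
                   precedence(ops.back()) >= precedence(c)) {
                output += ops.back();
                ops.pop_back();
            }
            ops.push_back(c);
        } else {
            output += c;  // operand, or postfix '*' / '+'
        }
    }
    while (!ops.empty()) {
        output += ops.back();
        ops.pop_back();
    }
    return output;
}
```

For example, `toPostfix("a%(b|c)*")` returns `abc|*%`, and `toPostfix("a|b%c")` returns `abc%|` because concatenation binds tighter than alternation.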

Special Tokens

\L - Lambda (epsilon/empty string)
a-z - Character range
\( - Escaped parenthesis
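
Range expansion turns a range such as `a-z` into an alternation of its members. A minimal sketch, assuming a character encoding in which the range is contiguous (as in ASCII); the function name `expandRange` is hypothetical.

```cpp
#include <string>

// Expand an inclusive character range such as ('a', 'e') into "a|b|c|d|e".
// Returns an empty string if the bounds are reversed.
std::string expandRange(char first, char last) {
    std::string out;
    // Iterate with an int so the loop cannot overflow when `last` is the
    // largest char value.
    for (int c = first; c <= last; ++c) {
        if (!out.empty()) out += '|';
        out += static_cast<char>(c);
    }
    return out;
}
```

`expandRange('0', '9')` returns `0|1|2|3|4|5|6|7|8|9`. A caller would typically wrap the result in parentheses before splicing it into a larger expression so the alternation keeps its grouping.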
