Skip to content

Mini compiler (Flex/Bison): C++ subset → quadruples IR → optimizer → 16-bit 8086 assembly (MASM/TASM), runnable with emu8086.

Notifications You must be signed in to change notification settings

Pexieftw/flex-bison-cpp2asm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Compact C/C++ Compiler Pipeline

A compact compiler pipeline built with Flex and Bison that processes a C/C++-like subset language. The compiler performs complete lexical/syntax/semantic analysis, emits intermediate representation (IR) as quadruples, applies basic optimizations, and generates 16-bit 8086 assembly code compatible with MASM/TASM and emulators like emu8086.

Features

  • Complete compilation pipeline: From source code to assembly
  • Multi-stage optimization: Copy propagation, algebraic simplification, dead code elimination
  • 8086 assembly generation: MASM/TASM-style output for DOS emulators
  • Educational focus: Clean, readable implementation for learning compiler construction

Pipeline Overview

The compiler processes input through several stages:

  1. Lexical Analysis (lexical.l) - Tokenization for identifiers, literals, operators, keywords, and stream operators
  2. Syntax Analysis (syntax.y) - Bison grammar parsing with semantic context building
  3. Semantic Analysis (headers/ts.h) - Symbol table management and type checking
  4. IR Generation (headers/quads.h) - Quadruple intermediate representation
  5. Optimization (headers/quads.h) - Multiple optimization passes
  6. Assembly Generation (headers/assembly.h) - 8086 code generation

Repository Structure

.
├── headers/
│   ├── ts.h         # Symbol table + semantic helpers
│   ├── quads.h      # Quadruple IR + optimizer passes
│   ├── stack.h      # String/int stacks for parser actions
│   └── assembly.h   # 8086 assembly backend
├── lexical.l        # Flex lexer specification
├── syntax.y         # Bison parser grammar
├── input/
│   └── code.cpp     # Source program to compile
└── (generated files)
    ├── lex.yy.c
    ├── syntax.tab.c / syntax.tab.h
    ├── negative.asm  # Final assembly output
    └── temp files: data.txt, code.txt

Language Subset

Supported Types

  • int, float
  • char* / std::string (handled as String internally)
  • Basic template-style tokens (IDF::IDF)
  • Arrays with bounds checking

Declarations

int a;
float b = 2.5;
int arr[n];
const int c = 3;

Expressions & Operations

  • Arithmetic: +, -, *, /
  • Parentheses for precedence
  • Integer/float/string literals
  • Array indexing
  • Assignment operators

Control Flow

if (condition) { ... }
for (init; condition; increment) { ... }
// Post-increment in statements
variable++;

I/O Recognition

std::cout << expression << std::endl;

Note: Recognized for parsing but not fully implemented in backend

Function/Method Calls

  • namespace::function(args)
  • object.method(args)
  • object.attribute
  • object.method<T>(args)

Intermediate Representation

The compiler uses quadruples stored in the quad[] array:

typedef struct qdr {
    char oper[100];  // Operation: "=", "+", "-", "*", "/", "BR", "BLE", etc.
    char op1[100];   // First operand
    char op2[100];   // Second operand
    char res[100];   // Result
} qdr;

Example Quadruples

("=", "x", "", "y")           // y = x
("+", "a", "b", "t1")         // t1 = a + b
("BR", "12", "", "")          // Unconditional jump to quad 12
("BLE", "20", "x", "y")       // if x <= y then jump to quad 20
("BOUNDS", "0", "n-1", "")    // Array bounds declaration
("ADEC", "arr", "", "")       // Array declaration

Optimization Passes

The optimizer performs multiple passes until convergence or maximum iterations:

  1. Copy Propagation (propagationDeCopie) - Replace variables with their copied values
  2. Algebraic Simplification (simplificationAlgebrique) - Simplify arithmetic expressions
  3. Redundant Expression Elimination (eliminationDeExpRedondante) - Remove duplicate calculations
  4. Dead Code Elimination (eliminationDeCodeInutile) - Remove unreachable code

Branch targets are automatically updated after transformations.

8086 Assembly Backend

Generates segmented MASM/TASM-style assembly:

Memory Layout

  • Int: DW (16-bit word)
  • Float: DD (32-bit doubleword)
  • String: DB 100 DUP (?) (100-byte buffer)
  • Arrays: Calculated from BOUNDS + ADEC quadruples

Segments Generated

DATA SEGMENT
    ; Variable declarations
DATA ENDS

PILE SEGMENT STACK
    ; Stack space
PILE ENDS

CODE SEGMENT
MAIN:
    ; Program logic
    MOV AX, 4C00h
    INT 21h         ; DOS termination
CODE ENDS

Branching

  • BRJMP
  • BLE, BGE, BG, BL, BE, BNE, BZJxx conditionals

Build Instructions

Prerequisites

  • Flex (lexical analyzer generator)
  • Bison (parser generator)
  • GCC (C compiler)

Compilation

flex lexical.l
bison -d syntax.y
gcc lex.yy.c syntax.tab.c -o prog.exe

Usage

  1. Prepare source code: Place your program in ./input/code.cpp

  2. Run compiler:

    ./prog.exe

    Or execute run.bat

  3. Output:

    • Console: Symbol tables, IR quadruples, and optimization results
    • File: Final assembly code written to negative.asm, you can run it using emu8086

Example Workflow

# Build the compiler
flex lexical.l && bison -d syntax.y && gcc lex.yy.c syntax.tab.c -o compiler

# Write your source code to input/code.cpp
echo "int main() { int x = 5; return x; }" > input/code.cpp

# Compile
./compiler

# View generated assembly
cat negative.asm

About

Mini compiler (Flex/Bison): C++ subset → quadruples IR → optimizer → 16-bit 8086 assembly (MASM/TASM), runnable with emu8086.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published