## **UNIT I: Introduction to Compilers**

* Structure of a Compiler
* **Lexical Analysis**

  * Role of Lexical Analyzer
  * Input Buffering
  * Specification of Tokens
  * Recognition of Tokens
  * **Lex Tool**
* **Finite Automata**

  * Regular Expressions to Automata
  * Minimizing DFA



---

# 📝 **UNIT I – Introduction to Compilers**

---

## 🔹 1. Structure of a Compiler

A **compiler** is a program that translates source code (written in a high-level language such as C, C++, or Java) into target code (low-level machine code or an intermediate code).

👉 Why not execute the source code directly?
Because a processor executes only **machine code (binary instructions)**, not human-readable source text.

### 📌 Major Phases of a Compiler

The compiler works in **two major parts**:

**A. Front-End (Analysis part)** → Breaks down source code and checks correctness.  
**B. Back-End (Synthesis part)** → Generates optimized machine code.

---

### 📊 **Compiler Phases (Step-by-Step)**

1. **Lexical Analysis (Scanner)**

   * Reads characters → groups into tokens.
   * Example: `int x = 10;` → Tokens: `int` (keyword), `x` (identifier), `=` (operator), `10` (constant), `;` (delimiter).

2. **Syntax Analysis (Parser)**

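   * Checks grammar rules (structure) and builds a parse tree.
   * Example: Validates that `int x = 10;` follows the grammar for a declaration.

3. **Semantic Analysis**

   * Checks meaning & consistency (types, declarations, scope).
   * Example: `int x = 10.5;` → type mismatch between a floating-point value and an `int`. Java rejects this as an error (lossy conversion); C converts the value implicitly to `10`, with at most a warning.

4. **Intermediate Code Generation**

   * Produces intermediate representation (IR), for example three-address code.
   * Example: `x = 10;` → `t1 = 10; x = t1;`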

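5. **Code Optimization**

   * Improves performance (speed/memory) without changing the program's behavior.
   * Example: Removing redundant calculations, e.g. simplifying `t1 = 10; x = t1;` to `x = 10;`.

6. **Code Generation**

   * Converts IR to machine code.
   * Example: `MOV R1, 10; MOV x, R1` (illustrative assembly, not a specific instruction set)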

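7. **Symbol Table Management**

   * Stores information about variables, functions, classes, etc.
   * Not a sequential phase: every phase above reads or updates the symbol table.

8. **Error Handling**

   * Detects and reports errors found in any phase (lexical, syntax, or semantic).
   * Like the symbol table, it works alongside all phases rather than after them.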
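
To make phase 4 (intermediate code generation) more concrete, here is a minimal sketch that lowers one assignment to three-address code. It borrows Python's built-in `ast` module as a stand-in for the parser, so the input must be written in Python syntax; the function name `three_address_code` and the temporary names `t1`, `t2`, ... are assumptions made for illustration only.

```python
import ast
from itertools import count

def three_address_code(statement):
    """Lower one assignment such as 'x = a + b * 2' to three-address code."""
    assign = ast.parse(statement).body[0]
    if not isinstance(assign, ast.Assign) or not isinstance(assign.targets[0], ast.Name):
        raise ValueError("expected a simple assignment")
    code = []                 # generated instructions, in order
    temps = count(1)          # numbers for fresh temporaries t1, t2, ...
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def lower(node):
        """Emit code for node and return the name or literal holding its value."""
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            left = lower(node.left)
            right = lower(node.right)
            temp = f"t{next(temps)}"
            code.append(f"{temp} = {left} {ops[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    code.append(f"{assign.targets[0].id} = {lower(assign.value)}")
    return code

print("\n".join(three_address_code("x = a + b * 2")))
# t1 = b * 2
# t2 = a + t1
# x = t2
```

Each instruction has at most one operator on its right-hand side, which is what makes this form convenient for the optimization and code generation phases.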

---

✅ **Diagram of Compiler Structure (Textual Form)**

```
Source Program
     ↓
Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → Optimization → Code Generation
     ↓
Target Program
```

---

## 🔹 2. Lexical Analysis

### 📌 Role of Lexical Analyzer

* First phase of the compiler.
* Reads **characters** from source → produces **tokens**.
* Works like a “scanner”.
* Removes whitespace & comments.
* Interfaces with **symbol table**.

👉 Example:
Input: `int sum = a + b;`
Output tokens:

* `int` → keyword
* `sum` → identifier
* `=` → assignment operator
* `a` → identifier
* `+` → operator
* `b` → identifier
* `;` → delimiter
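
The role described above can be sketched as a small hand-written scanner. This is only an illustration under stated assumptions: the keyword set, the `//` comment style, and the token names are chosen for the example and are not tied to any particular language definition.

```python
KEYWORDS = {"int", "float", "if", "else", "while", "return"}  # illustrative subset

def scan(source):
    """Yield (token_name, lexeme) pairs, skipping whitespace and // comments."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():                          # whitespace is discarded
            i += 1
        elif source.startswith("//", i):          # comment runs to end of line
            while i < n and source[i] != "\n":
                i += 1
        elif ch.isalpha() or ch == "_":           # keyword or identifier
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            lexeme = source[start:i]
            yield ("keyword" if lexeme in KEYWORDS else "identifier", lexeme)
        elif ch.isdigit():                        # integer constant
            start = i
            while i < n and source[i].isdigit():
                i += 1
            yield ("constant", source[start:i])
        elif ch in "=+-*/":
            yield ("operator", ch)
            i += 1
        elif ch in ";,(){}":
            yield ("delimiter", ch)
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")

for token in scan("int sum = a + b; // add"):
    print(token)
```

The parser never sees the spaces or the trailing comment; it receives only the seven tokens listed in the example.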

---

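### 📌 Input Buffering

Problem: Reading the source program one character at a time, with one I/O operation per character, is slow.

Solution: Read the input in blocks into **two buffers** that are refilled alternately (like sliding windows), so that a lexeme can continue across a buffer boundary.

* **Lexeme** = the sequence of source characters that matches the pattern for a token.
* Two pointers track the scan: `lexemeBegin` marks the start of the current lexeme and `forward` scans ahead to find its end.
* **Sentinels** = a special end marker placed after each buffer, so that a single comparison per character detects both the end of a buffer and the end of the input.

👉 Example:
While scanning `while`, the scanner advances through characters already in memory; a new block is read only when `forward` reaches a sentinel.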
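
A minimal sketch of the two-buffer scheme with sentinels follows. The class name `TwoBufferReader`, the tiny `half_size`, and the choice of `"\0"` as sentinel are assumptions for illustration (the sentinel is assumed never to occur in the source text), and the start-of-lexeme pointer is left out to keep the example short.

```python
import io

SENTINEL = "\0"   # assumed never to occur in the source text

class TwoBufferReader:
    """Two buffer halves of half_size characters, each followed by a sentinel slot."""

    def __init__(self, stream, half_size=8):
        self.stream = stream
        self.n = half_size
        # Layout: [half 0: n chars][sentinel][half 1: n chars][sentinel]
        self.buf = [SENTINEL] * (2 * half_size + 2)
        self.forward = 0      # index of the next character to return
        self.loads = 0        # number of block reads performed
        self._load(0)

    def _load(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)            # one block read, not n reads
        self.buf[start:start + len(chunk)] = chunk
        # A full chunk puts the sentinel in the half's last slot;
        # a short chunk puts it earlier, which marks end of input.
        self.buf[start + len(chunk)] = SENTINEL
        self.loads += 1

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        if ch != SENTINEL:                          # common case: one comparison
            self.forward += 1
            return ch
        if self.forward == self.n:                  # sentinel at end of half 0
            self._load(1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:          # sentinel at end of half 1
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                                 # sentinel inside a half: end of input

source = "while (i < 10) i = i + 1;"
reader = TwoBufferReader(io.StringIO(source), half_size=8)
chars = []
while (ch := reader.next_char()) is not None:
    chars.append(ch)
print("".join(chars))
print("block reads:", reader.loads)
```

The point of the sentinel is visible in `next_char`: the usual path performs one comparison, and the buffer-boundary checks run only when a sentinel is actually hit.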

---

### 📌 Specification of Tokens

A **token** has:

1. **Token Name** (type: identifier, keyword, operator, etc.)
2. **Attribute Value** (optional extra information, e.g. a pointer to the symbol table entry for an identifier, or the value of a constant).

👉 Example: `int count = 5;`

* `int` → Token: KEYWORD
* `count` → Token: IDENTIFIER (symbol table entry: variable name, type = int)
* `=` → Token: ASSIGNMENT OP
* `5` → Token: CONSTANT (attribute: the value 5)
* `;` → Token: DELIMITER
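
The name/attribute pair can be sketched as a small data structure. The names `Token`, `SymbolTable`, and `intern`, and the dictionary layout of a symbol table entry, are hypothetical choices for this example; real compilers store much more per entry.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Token:
    name: str                                     # token name, e.g. "IDENTIFIER"
    attribute: Optional[Union[int, str]] = None   # symbol table index, literal value, or nothing

class SymbolTable:
    def __init__(self):
        self.entries = []     # one record per distinct identifier
        self.index = {}       # lexeme -> position in entries

    def intern(self, lexeme):
        """Return the entry index for lexeme, creating the entry on first sight."""
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme, "type": None})
        return self.index[lexeme]

# Tokens for: int count = 5;
table = SymbolTable()
tokens = [
    Token("KEYWORD", "int"),
    Token("IDENTIFIER", table.intern("count")),   # attribute points into the table
    Token("ASSIGN_OP"),                           # no attribute needed
    Token("CONSTANT", 5),
    Token("DELIMITER", ";"),
]

# A later phase (semantic analysis) would fill in the declared type.
table.entries[tokens[1].attribute]["type"] = "int"
print(tokens[1], table.entries[0])
```

Storing an index rather than the name itself means every later use of `count` yields a token that refers to the same entry.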

---

### 📌 Recognition of Tokens

Tokens are recognized using **regular expressions** and **finite automata**.

Example:

* Identifier → `[a-zA-Z_][a-zA-Z0-9_]*`
* Number → `[0-9]+`
* Whitespace → `[ \t\n]+`

The **Lexical Analyzer** uses a **DFA (Deterministic Finite Automaton)** to match characters against token patterns, taking the longest lexeme that matches.
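
As a rough illustration of DFA-based recognition, the sketch below hard-codes a transition table for the identifier and number patterns and returns the longest match. The state names, the `char_class` helper, and the function `longest_match` are assumptions introduced for this example; a generated scanner would build an equivalent table automatically.

```python
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# DFA transition table: state -> {character class -> next state}.
# A missing entry means "no move", i.e. the DFA is stuck.
TRANSITIONS = {
    "start":  {"letter": "in_id", "digit": "in_num"},
    "in_id":  {"letter": "in_id", "digit": "in_id"},
    "in_num": {"digit": "in_num"},
}
ACCEPTING = {"in_id": "IDENTIFIER", "in_num": "NUMBER"}

def longest_match(text, pos=0):
    """Return (token_name, lexeme) for the longest token starting at pos, or None."""
    state, i, last = "start", pos, None
    while i < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[i]))
        if nxt is None:
            break                      # stuck: fall back to the last accepting point
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last = (ACCEPTING[state], text[pos:i])
    return last

print(longest_match("count1 = 5"))
print(longest_match("100;"))
```

Each input character costs one table lookup, which is why scanners prefer a DFA over trying each regular expression in turn.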

---

### 📌 Lex Tool (Lex/Flex)

* A tool to generate **lexical analyzers** automatically.
* Programmer writes **patterns** (using regex), each paired with a C action.
* Lex generates the C code for the scanner (a function named `yylex`).

👉 Example Lex Program:

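```lex
%{
#include <stdio.h>
%}
%%
[ \t\n]+   { /* skip whitespace so it is not reported as SYMBOL */ }
[0-9]+     { printf("NUMBER "); }
[a-zA-Z]+  { printf("WORD "); }
.          { printf("SYMBOL "); }
%%
/* Called at end of input; returning 1 tells the scanner to stop. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();          /* run the generated scanner on standard input */
    printf("\n");
    return 0;
}
```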

Input: `sum = 100`
Output: `WORD SYMBOL NUMBER`

---

## 🔹 3. Finite Automata

Lexical analysis heavily relies on **finite automata** for token recognition.

---

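### 📌 Regular Expressions → Automata

* **Regular Expressions (RE):** Describe patterns.
* **Finite Automata:** Machines that recognize patterns.

Every regular expression can be converted into an equivalent finite automaton, which is how a scanner generator turns token patterns into a recognizer.

👉 Example:
Regex: `(a|b)*abb`
→ Language of all strings over {a, b} that end with `abb`.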

---

### 📌 Types of Finite Automata

1. **NFA (Nondeterministic Finite Automaton)**

   * A state may have zero, one, or several moves on the same input symbol, and may have ε-moves (moves that consume no input).
   * Easier to construct from regex.

2. **DFA (Deterministic Finite Automaton)**

   * Exactly one move per state and input symbol; no ε-moves.
   * Faster for recognition.
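
To show what "several possible moves" means in practice, here is a small sketch that simulates an NFA for `(a|b)*abb` by tracking the set of states it could be in. The state numbering 0 to 3 and the dictionary encoding are assumptions made for this example.

```python
# NFA for (a|b)*abb. State 0 has two moves on 'a': stay in 0, or guess that
# the final "abb" starts here and go to 1.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, {3}

def nfa_accepts(text):
    """Follow all possible moves at once by keeping a set of current states."""
    current = {START}
    for symbol in text:
        current = {t for s in current for t in NFA.get((s, symbol), set())}
    return bool(current & ACCEPT)

for word in ["abb", "babb", "ab", "abba"]:
    print(word, nfa_accepts(word))
```

A DFA would need only a single current state per step; the set-of-states bookkeeping here is exactly what the NFA to DFA conversion precomputes.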

---

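### 📌 NFA → DFA Conversion (Subset Construction)

Steps:

1. The DFA start state is the ε-closure of the NFA start state.
2. For each DFA state (a set of NFA states) and each input symbol, collect the NFA states reachable on that symbol, then take the ε-closure of the result.
3. Each distinct set obtained this way becomes a DFA state; repeat step 2 until no new sets appear.
4. A DFA state is accepting if it contains at least one accepting NFA state.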
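
The conversion steps can be sketched as follows. The example NFA (a compact ε-NFA for `(a|b)*abb`), the use of the empty string as the ε label, and all function names are assumptions for illustration; this is a sketch, not a tuned implementation.

```python
from collections import deque

EPS = ""   # label used here for epsilon moves

def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa, start, accepting, alphabet):
    """Return (dfa_start, dfa_transitions, dfa_accepting, dfa_states)."""
    d_start = epsilon_closure(nfa, {start})
    transitions, seen, queue = {}, {d_start}, deque([d_start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            moved = {t for s in S for t in nfa.get((s, a), ())}
            T = epsilon_closure(nfa, moved)      # a DFA state is a set of NFA states
            transitions[(S, a)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    d_accepting = {S for S in seen if S & set(accepting)}
    return d_start, transitions, d_accepting, seen

def dfa_accepts(dfa, text):
    """Run the constructed DFA; text must use only symbols from the alphabet."""
    start, transitions, accepting, _ = dfa
    state = start
    for ch in text:
        state = transitions[(state, ch)]
    return state in accepting

# Compact epsilon-NFA for (a|b)*abb with states 0..4 and accepting state 4.
NFA = {
    (0, EPS): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
    (3, "b"): {4},
}
dfa = subset_construction(NFA, start=0, accepting={4}, alphabet="ab")
print(len(dfa[3]), "DFA states")
print(dfa_accepts(dfa, "babb"), dfa_accepts(dfa, "abba"))
```

Only subsets that are actually reachable from the start state are created, which is why the result usually has far fewer than the 2^n states that are possible in theory.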

---

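### 📌 Minimization of DFA

Goal: Reduce the number of DFA states without changing the language it accepts.

Steps:

1. Partition states into two groups: final (accepting) and non-final.
2. Refine: split a group whenever two of its states move to different groups on some input symbol; repeat until no further split is possible.
3. Merge each remaining group into a single state.

👉 Example: A 5-state DFA for `(a|b)*abb`, as obtained by subset construction from its Thompson-style NFA, reduces to 4 states after minimization.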
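
The partition-refinement steps can be sketched as below. The example input is a five-state DFA for `(a|b)*abb` with states named A to E; the function name `minimize` and the table encoding are assumptions for this illustration. The sketch expects a total transition function and does not remove unreachable states.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the groups of equivalent states of a DFA with total transitions."""
    # Step 1: initial partition into accepting and non-accepting states.
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)

        # Step 2: two states stay together only if, for every symbol,
        # their targets lie in the same current group.
        refined = []
        for group in partition:
            buckets = {}
            for s in group:
                signature = tuple(group_of(delta[(s, a)]) for a in alphabet)
                buckets.setdefault(signature, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):
            # Step 3: nothing split, so each group becomes one merged state.
            return [frozenset(g) for g in refined]
        partition = refined

# Five-state DFA for (a|b)*abb; E is the only accepting state.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimize("ABCDE", "ab", DELTA, {"E"})
print(sorted("".join(sorted(g)) for g in groups))
# ['AC', 'B', 'D', 'E']
```

States A and C end up in the same group because they behave identically on every input, so the minimized DFA has four states.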

---

## ✅ **Summary (Unit I Key Takeaways)**

* Compiler works in phases: **Lexical → Syntax → Semantic → IR → Optimization → Code Generation.**
* **Lexical Analyzer** converts characters → tokens using **regex + finite automata**.
* **Input buffering** reduces I/O overhead while scanning.
* **Lex tool** automates scanner creation.
* **Finite Automata (NFA/DFA)** are used for token recognition, and DFA minimization reduces the number of states and so the size of the transition table.

---


![image.png](attachment:image.png)
