# 📌 Introduction to Lexical Analyzer

## 🔍 Phases of a Compiler
**Lexical analysis** is the first phase of a compiler. When source text such as an **arithmetic expression** enters this phase, the lexical analyzer produces a **stream of tokens** 🎭.

💡 Token patterns are specified with **regular expressions**, which describe the same languages as **regular grammars (Type-3 grammars)**.

---

## 📝 Features of Lexical Analyzer

![Screenshot (120).png](attachment:c5ad41f6-4502-42dc-996c-da03731efb10.png)

1. **Scans** the high-level source code **line by line**, reading each line one character at a time.
2. **Groups characters into lexemes** and produces **tokens as output**.
3. **Recognizes different types of tokens**:
   - **Identifiers** 🏷️
   - **Operators** ➕➖
   - **Constants** 🔢
   - **Keywords** (e.g., `int`, `return`) 🔑
   - **Literals** (e.g., string literals) 📝
   - **Punctuators** (e.g., `,`, `;`, `{}`, `()`) 📍
   - **Special Characters** (e.g., `&`, `_`) ✨
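
🔹 A minimal sketch of how these token categories could be represented in C. The names `TokenKind`, `Token`, `token_kind_name`, and `print_token` are assumptions made for illustration, not part of any particular compiler.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token categories mirroring the list above. */
typedef enum {
    TOK_IDENTIFIER,
    TOK_OPERATOR,
    TOK_CONSTANT,
    TOK_KEYWORD,
    TOK_LITERAL,
    TOK_PUNCTUATOR,
    TOK_SPECIAL
} TokenKind;

/* A token pairs a category with its lexeme (the matched source text). */
typedef struct {
    TokenKind kind;
    const char *lexeme;
} Token;

/* Returns a printable name for a token category. */
const char *token_kind_name(TokenKind kind) {
    switch (kind) {
        case TOK_IDENTIFIER: return "identifier";
        case TOK_OPERATOR:   return "operator";
        case TOK_CONSTANT:   return "constant";
        case TOK_KEYWORD:    return "keyword";
        case TOK_LITERAL:    return "literal";
        case TOK_PUNCTUATOR: return "punctuator";
        case TOK_SPECIAL:    return "special";
    }
    return "unknown";
}

/* Prints a token in the form <kind, lexeme>. */
void print_token(Token t) {
    assert(t.lexeme != NULL);
    printf("<%s, %s>\n", token_kind_name(t.kind), t.lexeme);
}
```

Under this classification, the statement `int x;` would yield the tokens `<keyword, int>`, `<identifier, x>`, and `<punctuator, ;>`.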

---

## 🔄 Functions in a Lexical Analyzer

![Screenshot (121).png](attachment:8c043bb1-b57b-41ad-8121-e47418beabd2.png)

1. **Scanning** 🔍
   - Removes **non-token elements** (e.g., **comments, white spaces**).
2. **Analyzing** 🔬
   - Converts lexemes into tokens using **Finite State Machines (FSMs)**.

---

## 📖 Understanding the Analyzing Phase
Let's take a look at how the lexical analyzer processes different tokens in the **C language**:

![Screenshot (126).png](attachment:a5369278-d42f-4cdf-9ea7-f7b0031b73cc.png)

### 🌟 Recognizing the `if` Keyword
- The **Finite State Machine (FSM)** starts at an **initial state (A)**.
- Seeing `i`, it transitions to **state B**.
- Seeing `f`, it transitions to **state C (final state)**.
- If another letter, digit, or underscore follows (as in `ifx`), the lexeme is an **identifier** instead of a keyword.
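
🔹 The machine above can be sketched in C as follows. States A, B, and C come from the description; `STATE_DEAD` is a hypothetical trap state added here so that any unexpected character rejects the lexeme, and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A is the start state and C is the final state.
   STATE_DEAD is an assumed trap state for all other inputs. */
typedef enum { STATE_A, STATE_B, STATE_C, STATE_DEAD } IfState;

/* Applies a single transition of the machine. */
static IfState if_step(IfState s, char c) {
    if (s == STATE_A && c == 'i') return STATE_B;
    if (s == STATE_B && c == 'f') return STATE_C;
    return STATE_DEAD;  /* no other transition is defined */
}

/* Returns true only when the whole lexeme is exactly "if". */
bool is_if_keyword(const char *lexeme) {
    assert(lexeme != NULL);
    IfState s = STATE_A;
    for (const char *p = lexeme; *p != '\0'; ++p) {
        s = if_step(s, *p);
    }
    return s == STATE_C;
}
```

With this sketch, `is_if_keyword("if")` is true, while `is_if_keyword("ifx")` and `is_if_keyword("i")` are false.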

### 🏷️ Recognizing Identifiers
- An identifier **must start with**:
  - A **small letter (`a-z`)** 🔡
  - A **capital letter (`A-Z`)** 🔠
  - An **underscore (`_`)** 🖍️
- After the first character, it can contain **letters, digits (`0-9`), or underscores (`_`)**.
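
🔹 These two rules translate almost directly into a check on a candidate lexeme. This is a sketch that assumes the default "C" locale, where `isalpha` matches only `a-z` and `A-Z`; the helper names are illustrative, and the check does not exclude keywords such as `if`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* First character: a letter or an underscore. */
static bool is_ident_start(char c) {
    return isalpha((unsigned char)c) || c == '_';
}

/* Later characters: a letter, a digit, or an underscore. */
static bool is_ident_part(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns true when the whole lexeme has the shape of an identifier. */
bool is_identifier(const char *lexeme) {
    assert(lexeme != NULL);
    if (!is_ident_start(lexeme[0])) {
        return false;  /* also rejects the empty string */
    }
    for (const char *p = lexeme + 1; *p != '\0'; ++p) {
        if (!is_ident_part(*p)) {
            return false;
        }
    }
    return true;
}
```

For example, `_count1` and `x` are accepted, while `1abc` and `a-b` are rejected.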

### 🔢 Recognizing Integers
- A leading `+` or `-` moves the machine to **state G**.
- Seeing a digit (`0-9`), it moves to **final state H**.
- To accept further digits, the machine goes back for the next digit (drawn as an **epsilon transition**), so integers of any length are recognized.
- The sign is optional: an integer **may or may not** have one, and an unsigned integer is taken as positive (`+` is implied by default).
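
🔹 A hedged sketch of the same recognizer in C. It follows the rules above (optional sign, then one or more digits) but replaces the epsilon transition with a plain loop, which accepts the same strings; `is_integer` is an illustrative name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the lexeme is an optional sign followed by digits. */
bool is_integer(const char *lexeme) {
    assert(lexeme != NULL);
    const char *p = lexeme;

    /* Optional sign: corresponds to moving to state G. */
    if (*p == '+' || *p == '-') {
        ++p;
    }

    /* At least one digit is required to reach final state H. */
    if (!isdigit((unsigned char)*p)) {
        return false;
    }

    /* Every further digit keeps the machine in the final state. */
    while (isdigit((unsigned char)*p)) {
        ++p;
    }

    /* Accept only if the whole lexeme was consumed. */
    return *p == '\0';
}
```

Here `42`, `-7`, and `+123` are accepted, while a lone `+`, the empty string, and `12a` are rejected.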

---

## 🔄 Combining FSMs into a Single DFA
Since we need to recognize **keywords, identifiers, and integers**, we merge multiple **Finite State Machines (FSMs)** into a **single Deterministic Finite Automaton (DFA)**:

1. **Introduce a new initial state (S)**.
2. **Add epsilon transitions** from S to the start state of each token's FSM.

![Screenshot (127).png](attachment:bafedc00-82cf-44a4-9ad3-068bad157ee1.png)

3. Convert the resulting **Non-Deterministic Finite Automaton (NFA)** into a **Deterministic Finite Automaton (DFA)**.

![Screenshot (128).png](attachment:1b1035ce-460c-4665-a3ae-1c5fb51b73d4.png)

🎯 **Key Final States in DFA**

![Screenshot (135).png](attachment:98997929-c269-426d-b739-a2ed79ed29aa.png)

- **State 2**: Recognizes the keyword `if` 🔑.
- **States 1 and 3**: Recognize identifiers 🏷️.
- **State 4**: Recognizes integers 🔢.

💡 **DFA is used in the analyzing phase of the lexical analyzer for pattern matching.**
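
🔹 A minimal sketch of such a combined DFA in C. Final states 1 to 4 follow the numbering listed above, but the individual transitions are an assumption reconstructed from the text rather than copied from the figure. The state after a sign (`S_SIGN`) and the trap state (`S_DEAD`) are hypothetical additions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { LEX_ERROR, LEX_KEYWORD_IF, LEX_IDENTIFIER, LEX_INTEGER } LexClass;

/* States 1-4 mirror the final states above; 0, 5 and 6 are assumed. */
enum { S_START = 0, S_I = 1, S_IF = 2, S_IDENT = 3, S_INT = 4,
       S_SIGN = 5, S_DEAD = 6 };

/* One deterministic transition: exactly one next state per input. */
static int dfa_step(int state, char c) {
    unsigned char u = (unsigned char)c;
    int ident_char = isalnum(u) || c == '_';
    switch (state) {
        case S_START:
            if (c == 'i') return S_I;              /* may become "if" */
            if (isalpha(u) || c == '_') return S_IDENT;
            if (isdigit(u)) return S_INT;
            if (c == '+' || c == '-') return S_SIGN;
            return S_DEAD;
        case S_I:
            if (c == 'f') return S_IF;
            return ident_char ? S_IDENT : S_DEAD;
        case S_IF:                                 /* "ifx" is an identifier */
        case S_IDENT:
            return ident_char ? S_IDENT : S_DEAD;
        case S_SIGN:
        case S_INT:
            return isdigit(u) ? S_INT : S_DEAD;
        default:
            return S_DEAD;
    }
}

/* Runs the DFA over the lexeme and maps the last state to a class. */
LexClass classify(const char *lexeme) {
    assert(lexeme != NULL);
    int state = S_START;
    for (const char *p = lexeme; *p != '\0'; ++p) {
        state = dfa_step(state, *p);
    }
    switch (state) {
        case S_IF:    return LEX_KEYWORD_IF;
        case S_I:
        case S_IDENT: return LEX_IDENTIFIER;
        case S_INT:   return LEX_INTEGER;
        default:      return LEX_ERROR;
    }
}
```

Because every state has exactly one successor per character, a single left-to-right pass is enough: `classify("if")` gives `LEX_KEYWORD_IF`, `classify("ifx")` and `classify("i")` give `LEX_IDENTIFIER`, and `classify("-42")` gives `LEX_INTEGER`.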

---

## 🚀 Additional Features of Lexical Analyzer
### 📝 Removing Comments
Lexical analyzers **ignore comments** while scanning the code:
1. **Single-line comments** (`//`): Ignored until a newline 📝.
2. **Multi-line comments** (`/* ... */`): Ignored until `*/`.

🔹 Example:
```c
int x; // This is a comment
```
🔹 After comment removal, only this text is left to be tokenized:
```c
int x;
```
![Screenshot (140).png](attachment:4a6b824a-1804-4293-8284-a46d2128d8fa.png)
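
🔹 A simplified sketch of comment removal in C. The function name `strip_comments` is illustrative. The sketch assumes the caller supplies an output buffer at least as large as the input, and it deliberately ignores the case of comment markers inside string literals, which a real lexer must handle.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src into dst without comments and returns the new length.
   dst must have room for at least as many bytes as src, including
   the terminating null character. */
size_t strip_comments(const char *src, char *dst) {
    assert(src != NULL && dst != NULL);
    size_t i = 0, j = 0;
    while (src[i] != '\0') {
        if (src[i] == '/' && src[i + 1] == '/') {
            /* Single-line comment: skip up to (not including) the newline. */
            while (src[i] != '\0' && src[i] != '\n') {
                ++i;
            }
        } else if (src[i] == '/' && src[i + 1] == '*') {
            /* Multi-line comment: skip until the closing marker. */
            i += 2;
            while (src[i] != '\0' && !(src[i] == '*' && src[i + 1] == '/')) {
                ++i;
            }
            if (src[i] != '\0') {
                i += 2;  /* step over the closing marker */
            }
            dst[j++] = ' ';  /* keep neighbouring tokens separated */
        } else {
            dst[j++] = src[i++];
        }
    }
    dst[j] = '\0';
    return j;
}
```

Applied to `int x; // This is a comment`, the sketch leaves `int x; ` (with one trailing space), and `a/* c */b` becomes `a b`.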

### ⬜ Handling White Spaces
White spaces are classified as:
- **Spaces (` `)**: Inserted via spacebar.
- **Tabs (`\t`)**: Inserted via Tab key.
- **Newline (`\n`)**: Inserted via Enter key.

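Lexical analyzers **remove unnecessary white spaces** but **retain essential ones**: white space that separates two tokens (`int x` versus `intx`) or that appears inside a string literal still matters to the program's meaning.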

![Screenshot (141).png](attachment:1dbf1256-bea7-4d99-ae62-2303387d7c58.png)
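
🔹 A small sketch showing that white space produces no token of its own yet still decides where lexemes end. The function `count_lexemes` is hypothetical and uses a simplified rule: a run of letters, digits, and underscores is one lexeme, and any other non-space character is a lexeme by itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts lexemes in src under the simplified rule described above. */
size_t count_lexemes(const char *src) {
    assert(src != NULL);
    size_t count = 0;
    const char *p = src;
    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (isspace(c)) {
            ++p;            /* spaces, tabs, newlines yield no token */
            continue;
        }
        if (isalnum(c) || c == '_') {
            /* Consume the whole word-like run as one lexeme. */
            while (isalnum((unsigned char)*p) || *p == '_') {
                ++p;
            }
        } else {
            ++p;            /* a single punctuation character */
        }
        ++count;
    }
    return count;
}
```

Both `int x;` and `int \t\n x ;` contain three lexemes, so the extra white space changes nothing. Removing the separating space, as in `intx;`, leaves only two.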

---