

---

# 🧩 Understanding Normalization: The Foundation Before 1NF

Before we zoom into **1NF**, you must understand **why normalization exists**.

Think of a database as a **warehouse of truth**: every piece of data should be clean, organized, and logically structured, like items neatly labeled and stored in boxes.

When data is not normalized:

* You’ll have **duplicate data** everywhere.
* Updates will become a nightmare (change one record → must change 10 others).
* You’ll get **inconsistent reports**.
* And the storage will blow up unnecessarily.

So, **Normalization** is the process of:

> Structuring a database to minimize redundancy and undesirable dependencies by dividing large tables into smaller, logical ones and defining relationships between them.

There are multiple “levels” (called **Normal Forms**), each adding stricter rules:

1. **1NF** – Minimum safety and structure.
2. **2NF** – Eliminates partial dependency.
3. **3NF** – Eliminates transitive dependency.
4. **BCNF**, **4NF**, **5NF** – More advanced forms (used rarely in most production systems).

Let’s focus on **1NF**, the *foundation layer*: the rule that every good data model must pass before anything else.

---

# 🧱 FIRST NORMAL FORM (1NF): Minimum Safety Guarantee

### 🧠 Definition:

A table is in **First Normal Form (1NF)** if:

1. Each **cell** contains **atomic (indivisible)** values.
2. Each **record (row)** is unique.
3. Each **column** has **values of the same type**.
4. The **order of rows** and **columns** does not matter.

It’s the **“minimum safety guarantee”**: if your table violates 1NF, it’s not even a proper relational table!

---

# 🧍‍♂️ Let’s Begin with a Story: “The Messy People Table”

Imagine you’re a data engineer at a company maintaining employee data.

You start with this table called **Employee**:

| Emp_ID | Name    | Phone_Numbers       | Department | Height_Order |
| ------ | ------- | ------------------- | ---------- | ------------ |
| 101    | Alice   | 12345, 67890        | HR         | 1st          |
| 102    | Bob     | 54321               | Finance    | 2nd          |
| 103    | Charlie | 22222, 33333, 44444 | IT         | 3rd          |

Looks okay, right? But this **violates multiple 1NF rules**.
Let’s dissect each violation one by one, like a detective finding clues.

---

## 🔍 Violation 1: Using Row Order to Convey Meaning

> ❌ “There is no such thing as row order in a relational table.”

In this table, the **Height_Order** column only restates each row’s position: the rows were entered tallest first, so Alice is the tallest, then Bob, then Charlie.
The heights themselves are not stored anywhere, so if we *removed or rearranged* the rows, that meaning would break.

👉 **Relational databases don’t store rows in any guaranteed order.**
If you want order, you must **explicitly store it as data**, not rely on physical sequence.

✅ **Fix:**
Instead of relying on row order, create a proper column like `Height` or `Rank`.

| Emp_ID | Name    | Height (cm) | Department |
| ------ | ------- | ----------- | ---------- |
| 101    | Alice   | 175         | HR         |
| 102    | Bob     | 172         | Finance    |
| 103    | Charlie | 169         | IT         |

Now, even if rows shuffle, the height information remains intact.
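To see this in action, here is a minimal sketch using Python’s built-in `sqlite3` module as a stand-in engine (an assumption; any relational database behaves the same way for this purpose). The column name `Height_cm` is illustrative. The rows are inserted in a scrambled order, yet the ranking is recovered purely from the stored data:

```python
import sqlite3

# In-memory SQLite database, used only as a convenient stand-in engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "Emp_ID INTEGER PRIMARY KEY, Name TEXT, Height_cm INTEGER, Department TEXT)"
)

# Insert the rows in a deliberately scrambled order.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (103, "Charlie", 169, "IT"),
        (101, "Alice", 175, "HR"),
        (102, "Bob", 172, "Finance"),
    ],
)

# The ordering comes from the stored Height_cm values, not from insertion order.
tallest_first = [
    name
    for (name,) in conn.execute("SELECT Name FROM Employee ORDER BY Height_cm DESC")
]
print(tallest_first)  # ['Alice', 'Bob', 'Charlie']
```

Because the ranking is derived with `ORDER BY`, deleting or adding an employee never leaves a stale “1st/2nd/3rd” label behind.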

---

## 🔍 Violation 2: Mixing Data Types in a Column

> ❌ “Mixing datatypes within a column violates 1NF.”

Imagine if your `Phone_Numbers` column has:

* One row containing integers (`12345`)
* Another containing a string with text like `"Office: 67890"`

That’s chaos: the column no longer has a single datatype, so numeric filters and string functions may fail or return inconsistent results.

✅ **Fix:**
Each column must have **one consistent datatype**.
So define `Phone_Number` as **VARCHAR** (phone numbers are identifiers, not numeric values for calculation).
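A quick way to convince yourself, sketched with Python’s built-in `sqlite3` module (an assumption made so the example is runnable; the sample number `012345` is made up for illustration). Stored as text, the value comes back exactly as entered, while treating it as a number silently changes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Phone_Number is declared as text: it is an identifier, not a quantity.
conn.execute("CREATE TABLE Employee_Phone (Emp_ID INTEGER, Phone_Number VARCHAR(20))")
conn.execute("INSERT INTO Employee_Phone VALUES (?, ?)", (101, "012345"))

stored = conn.execute("SELECT Phone_Number FROM Employee_Phone").fetchone()[0]
print(repr(stored))  # '012345' -- the leading zero survives as text

# Treating the same value as a number drops the leading zero.
as_number = int("012345")
print(as_number)  # 12345
```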

---

## 🔍 Violation 3: Table Without a Primary Key

> ❌ “A table without a primary key violates 1NF.”

If we don’t have a column or combination of columns that uniquely identifies each row, we can’t distinguish between two identical entries.

| Name  | Department | Phone |
| ----- | ---------- | ----- |
| Alice | HR         | 12345 |
| Alice | HR         | 12345 |

Now we have no clue which Alice is which.

✅ **Fix:**
Always have a **Primary Key** (e.g., `Emp_ID`) to ensure uniqueness.

| Emp_ID | Name  | Department | Phone |
| ------ | ----- | ---------- | ----- |
| 101    | Alice | HR         | 12345 |
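Here is a small sketch of the key doing its job, again using Python’s `sqlite3` as an assumed stand-in engine. Once `Emp_ID` is declared as the primary key, the database itself refuses the duplicate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "Emp_ID INTEGER PRIMARY KEY, Name TEXT, Department TEXT, Phone TEXT)"
)
conn.execute("INSERT INTO Employee VALUES (101, 'Alice', 'HR', '12345')")

# A second row with the same Emp_ID is rejected by the primary key constraint.
try:
    conn.execute("INSERT INTO Employee VALUES (101, 'Alice', 'HR', '12345')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```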

---

## 🔍 Violation 4: Repeating Groups (Multi-valued Columns)

> ❌ “Storing a repeating group of data items in a single row violates 1NF.”

In our first table, Alice’s `Phone_Numbers` cell stores **multiple values** (`12345, 67890`).
That’s not atomic: it’s a **list** inside one column.

Databases can’t efficiently filter or join on individual phone numbers in that format.

✅ **Fix:**
Make the data **atomic**: one value per cell.
This might require **splitting the table**.

### Split into two related tables:

**Employee Table**

| Emp_ID | Name    | Department |
| ------ | ------- | ---------- |
| 101    | Alice   | HR         |
| 102    | Bob     | Finance    |
| 103    | Charlie | IT         |

**Employee_Phone Table**

| Emp_ID | Phone_Number |
| ------ | ------------ |
| 101    | 12345        |
| 101    | 67890        |
| 102    | 54321        |
| 103    | 22222        |
| 103    | 33333        |
| 103    | 44444        |

Now we have **1NF compliance**:

* One fact per row.
* Atomic values.
* Clear relationships.
* Easy querying.
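The two-table design above could be declared roughly like this. It is a sketch under assumptions: Python’s `sqlite3` stands in for your real engine, and the composite primary key and foreign key are one reasonable choice rather than the only one. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(
    """
    CREATE TABLE Employee (
        Emp_ID     INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        Department TEXT NOT NULL
    );
    CREATE TABLE Employee_Phone (
        Emp_ID       INTEGER NOT NULL REFERENCES Employee (Emp_ID),
        Phone_Number TEXT    NOT NULL,
        PRIMARY KEY (Emp_ID, Phone_Number)  -- one row per (employee, phone) fact
    );
    """
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(101, "Alice", "HR"), (102, "Bob", "Finance"), (103, "Charlie", "IT")],
)
conn.executemany(
    "INSERT INTO Employee_Phone VALUES (?, ?)",
    [(101, "12345"), (101, "67890"), (102, "54321"),
     (103, "22222"), (103, "33333"), (103, "44444")],
)

# A phone number for an employee that does not exist is rejected.
try:
    conn.execute("INSERT INTO Employee_Phone VALUES (999, '00000')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

phone_rows = conn.execute("SELECT COUNT(*) FROM Employee_Phone").fetchone()[0]
print(orphan_rejected, phone_rows)  # True 6
```

The foreign key is what turns “clear relationships” from a convention into something the database checks for you.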

---

## 🔍 Why Atomic Values Matter (Real-world case)

Imagine you want to find which employees have phone number `33333`.
In the original design:

```sql
SELECT * FROM Employee WHERE Phone_Numbers LIKE '%33333%';
```

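That’s ugly, slow, and unreliable: the pattern also matches any longer number that merely contains `33333`, and it depends on every row using the same separator (what if commas are missing?).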

But with normalized design:

```sql
SELECT e.Name
FROM Employee e
JOIN Employee_Phone p ON e.Emp_ID = p.Emp_ID
WHERE p.Phone_Number = '33333';
```

Now, it’s **clean**, **indexable**, and **maintainable**.
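The unreliability of the `LIKE` approach is easy to reproduce. The sketch below uses Python’s `sqlite3` as an assumed stand-in engine and adds a hypothetical employee, Dana, whose only number is `133333`. She does not have the phone number `33333`, but the pattern still matches her:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Emp_ID INTEGER PRIMARY KEY, Name TEXT, Phone_Numbers TEXT)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        (103, "Charlie", "22222, 33333, 44444"),
        (104, "Dana", "133333"),  # hypothetical row: a different number containing 33333
    ],
)

matches = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM Employee WHERE Phone_Numbers LIKE '%33333%' ORDER BY Emp_ID"
    )
]
print(matches)  # ['Charlie', 'Dana'] -- Dana is a false positive
```

With one phone number per row, an equality comparison cannot produce this kind of accidental match.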

---

# 💡 Summary of 1NF Rules and Fixes

| Violation                   | Example                          | Why It’s Wrong   | Fix                      |
| --------------------------- | -------------------------------- | ---------------- | ------------------------ |
| Using row order for meaning | Ordered by height                | Order can change | Add a Height column      |
| Mixed datatypes             | Integers + strings in one column | Confuses engine  | Keep consistent datatype |
| No primary key              | Duplicate rows                   | No uniqueness    | Add a unique key         |
| Repeating groups            | “12345, 67890”                   | Not atomic       | Split into child table   |

---

# 🧠 Real-World Scenario: Why Data Engineers Must Care

In **data modeling for analytics (e.g., in Snowflake)**:

* When ingesting raw data (from logs, JSON, etc.), data often violates 1NF (nested arrays, repeated fields).
* Before building **dimension** and **fact tables**, we **normalize** (flatten) it to make data relational and joinable.

Example:
A JSON object:

```json
{
  "emp_id": 101,
  "name": "Alice",
  "phones": ["12345", "67890"]
}
```

This violates 1NF because `phones` is an array.
In Snowflake, you’d use **FLATTEN()** to normalize it into rows:

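```sql
-- Assumes emp_id, name, and phones are columns of raw_employee,
-- with phones holding the JSON array (ARRAY or VARIANT type).
SELECT emp_id, name, phone.value::STRING AS phone_number
FROM raw_employee, LATERAL FLATTEN(input => raw_employee.phones) AS phone;
```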

➡ This converts it into a **1NF-compliant** structure.
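If it helps to see what flattening does without any database at all, here is the same idea as a plain Python sketch (this is not Snowflake’s implementation, just the concept): one output row per array element, with the parent fields repeated on each row.

```python
import json

# One raw record, shaped like the JSON object above.
raw_record = json.loads(
    '{"emp_id": 101, "name": "Alice", "phones": ["12345", "67890"]}'
)

# Emit one flat row per array element, repeating the parent fields on each row.
rows = [
    {
        "emp_id": raw_record["emp_id"],
        "name": raw_record["name"],
        "phone_number": phone,
    }
    for phone in raw_record["phones"]
]

for row in rows:
    print(row)
```

Each resulting dictionary has only scalar values, which is exactly what a 1NF row looks like.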

---

# 💬 Must-Ask Questions to Master This Concept

1. What are the key rules of 1NF?
2. Why is atomicity important in databases?
3. Why can’t we rely on row order in a relational table?
4. How do we fix repeating groups in a table?
5. Why must every table have a primary key?
6. How does 1NF apply in modern databases like Snowflake when handling semi-structured data (like JSON)?

---

# 🎯 Final Takeaway

Think of **1NF** as the **entry gate** to relational data modeling.
If your table violates 1NF, everything else built on top will eventually collapse, just like a building with a weak foundation.

As a **data engineer**, your job is to:

* Ensure every dataset entering your model is at least **1NF compliant**.
* Flatten and structure semi-structured data (JSON, XML) before building analytical models.
* Remember: **Atomic data = Predictable queries = Reliable analytics.**

---

## 🧩 **1️⃣ What are the key rules of 1NF?**

**1NF ensures that your data table follows four golden rules** to be considered *relationally valid*.

Let’s restate them in an easy, memorable way: imagine your table is a “school classroom” and each rule is a discipline students must follow 👇

| Rule                                       | Description                                                                   | Analogy                                                             |
| ------------------------------------------ | ----------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Atomic values only**                     | Every cell must contain only one value (not a list or set).                   | Each student sits alone: no two students on the same chair.         |
| **Unique rows (Primary Key)**              | Each row must be uniquely identifiable by a column or combination of columns. | Every student has a unique roll number.                             |
| **Consistent data type per column**        | All entries in a column must store data of the same type.                     | The “Age” column should have only numbers, not names or dates.      |
| **No significance to row or column order** | The sequence of rows or columns doesn’t carry meaning.                        | Even if you rearrange students, attendance records remain the same. |

So, in simple words:

> ✅ A table is in **1NF** when it contains **unique, atomic, and consistently typed values**, and **row order doesn’t matter**.

---

## 🧩 **2️⃣ Why is atomicity important in databases?**

Imagine a database as a **library**.
If each “book” (row) has one title field that contains multiple book names separated by commas, how would you find a specific book? You’d have to **search inside text**, which is slow, error-prone, and hard to index efficiently.

That’s why **atomicity (one value per cell)** is crucial.

### 🧠 Real-world problem:

| Emp_ID | Name  | Skills            |
| ------ | ----- | ----------------- |
| 101    | Alice | Python, SQL, Java |

Now try to find all employees who know “SQL”:

```sql
SELECT * FROM Employee WHERE Skills = 'SQL';
```

❌ No result! Because "SQL" is hidden inside "Python, SQL, Java".

### ✅ Fixed:

| Emp_ID | Skill  |
| ------ | ------ |
| 101    | Python |
| 101    | SQL    |
| 101    | Java   |

Now:

```sql
-- Employee_Skill is a hypothetical name for the Emp_ID / Skill table above.
SELECT * FROM Employee_Skill WHERE Skill = 'SQL';
```

✅ Perfect match.
That’s why **atomic values = predictable and fast queries**.

---

## 🧩 **3️⃣ Why can’t we rely on row order in a relational table?**

In a relational database:

> There is **no inherent order of rows** unless explicitly defined.

If you insert records one by one, the **storage engine decides their physical order**.
So, using “the first row” or “the top record” to convey meaning is a **1NF violation**.

### Example:

| Rank | Student_Name |
| ---- | ------------ |
| 1    | Alice        |
| 2    | Bob          |

Looks fine? But what if the rows reorder?

| Rank | Student_Name |
| ---- | ------------ |
| 2    | Bob          |
| 1    | Alice        |

Now, who’s the topper? You can’t tell by row position anymore; only the stored `Rank` value answers the question.

✅ **Fix:**
Store “Rank” or “Score” explicitly as a column and sort on it when you need an order. That’s the relational way.

```sql
SELECT * FROM Students ORDER BY Rank;
```

That’s **data-driven ordering**, not **storage-driven ordering**: the correct approach under 1NF.

---

## 🧩 **4️⃣ How do we fix repeating groups in a table?**

Repeating groups (multiple values in one cell) are a **classic 1NF violation**.
They occur when you try to store lists or sets in one column.

### ❌ Bad design:

| Order_ID | Customer | Product_Names     |
| -------- | -------- | ----------------- |
| 1001     | Alice    | Keyboard, Mouse   |
| 1002     | Bob      | Monitor, CPU, SSD |

Here `Product_Names` contains **multiple items**, which breaks atomicity.

---

### ✅ Solution: Split into two tables (Normalization)

**Orders table**

| Order_ID | Customer |
| -------- | -------- |
| 1001     | Alice    |
| 1002     | Bob      |

**Order_Products table**

| Order_ID | Product_Name |
| -------- | ------------ |
| 1001     | Keyboard     |
| 1001     | Mouse        |
| 1002     | Monitor      |
| 1002     | CPU          |
| 1002     | SSD          |

Now, each row = one fact, one value per cell.
Querying becomes flexible:

```sql
SELECT DISTINCT Customer
FROM Orders o
JOIN Order_Products p ON o.Order_ID = p.Order_ID
WHERE p.Product_Name = 'Mouse';
```

✅ Easy, logical, and fully 1NF compliant.
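Another payoff of the split is that aggregation becomes trivial. The sketch below (Python’s `sqlite3` as an assumed stand-in engine, with illustrative constraints) counts items per order, something that would require string parsing against the comma-separated `Product_Names` column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Orders (Order_ID INTEGER PRIMARY KEY, Customer TEXT NOT NULL);
    CREATE TABLE Order_Products (
        Order_ID     INTEGER NOT NULL REFERENCES Orders (Order_ID),
        Product_Name TEXT    NOT NULL,
        PRIMARY KEY (Order_ID, Product_Name)
    );
    INSERT INTO Orders VALUES (1001, 'Alice'), (1002, 'Bob');
    INSERT INTO Order_Products VALUES
        (1001, 'Keyboard'), (1001, 'Mouse'),
        (1002, 'Monitor'), (1002, 'CPU'), (1002, 'SSD');
    """
)

# One row per product means COUNT(*) is the number of items on each order.
items_per_order = conn.execute(
    """
    SELECT o.Order_ID, o.Customer, COUNT(*) AS item_count
    FROM Orders o
    JOIN Order_Products p ON p.Order_ID = o.Order_ID
    GROUP BY o.Order_ID, o.Customer
    ORDER BY o.Order_ID
    """
).fetchall()
print(items_per_order)  # [(1001, 'Alice', 2), (1002, 'Bob', 3)]
```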

---

## 🧩 **5️⃣ Why must every table have a primary key?**

A **Primary Key** ensures **each row is unique** and can be **referenced reliably** by other tables.

Without it, you risk:

* Duplicate rows (same data multiple times).
* Ambiguity during updates or deletes.
* Wrong join results when connecting tables.

### ❌ Example:

| Name  | Department | Salary |
| ----- | ---------- | ------ |
| Alice | HR         | 5000   |
| Alice | HR         | 5000   |

If you want to update Alice’s salary, which row gets updated? Both? Just one? That’s inconsistent.

### ✅ Fix:

Add a **Primary Key** (`Emp_ID`).

| Emp_ID | Name  | Department | Salary |
| ------ | ----- | ---------- | ------ |
| 101    | Alice | HR         | 5000   |

Now:

```sql
UPDATE Employee SET Salary = 5500 WHERE Emp_ID = 101;
```

No confusion. One unique record.
That’s **data integrity** guaranteed by **1NF**.
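The difference shows up directly in how many rows an `UPDATE` touches. This is a sketch using Python’s `sqlite3` as an assumed stand-in engine; the table name `Employee_NoKey` and the second employee with `Emp_ID` 102 are hypothetical, added only to make the contrast visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee_NoKey (Name TEXT, Department TEXT, Salary INTEGER);
    INSERT INTO Employee_NoKey VALUES ('Alice', 'HR', 5000), ('Alice', 'HR', 5000);

    CREATE TABLE Employee (
        Emp_ID INTEGER PRIMARY KEY, Name TEXT, Department TEXT, Salary INTEGER
    );
    INSERT INTO Employee VALUES (101, 'Alice', 'HR', 5000), (102, 'Alice', 'HR', 5000);
    """
)

# Without a key, the only available filter hits both rows.
no_key_rows = conn.execute(
    "UPDATE Employee_NoKey SET Salary = 5500 WHERE Name = 'Alice'"
).rowcount

# With a key, exactly one row is targeted.
keyed_rows = conn.execute(
    "UPDATE Employee SET Salary = 5500 WHERE Emp_ID = 101"
).rowcount

print(no_key_rows, keyed_rows)  # 2 1
```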

---

## 🧩 **6️⃣ How does 1NF apply in modern databases like Snowflake (with JSON or semi-structured data)?**

Great question, because Snowflake and other cloud databases often deal with **semi-structured data** like JSON, which can naturally violate 1NF.

### ❌ Example JSON stored in a VARIANT column:

```json
{
  "emp_id": 101,
  "name": "Alice",
  "phones": ["12345", "67890"]
}
```

Here, `phones` is a **repeating group**: not atomic → violates 1NF.

---

### ✅ Solution: Flatten it!

In Snowflake, use the **FLATTEN()** function to make the data 1NF-compliant.

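```sql
-- Assumes the JSON document sits in a VARIANT column named payload (hypothetical name).
SELECT
  payload:emp_id::NUMBER AS emp_id,
  payload:name::STRING   AS name,
  phone.value::STRING    AS phone_number
FROM raw_employee,
LATERAL FLATTEN(input => raw_employee.payload:phones) AS phone;
```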

**Result:**

| emp_id | name  | phone_number |
| ------ | ----- | ------------ |
| 101    | Alice | 12345        |
| 101    | Alice | 67890        |

Now, each row represents one phone number: **atomic**, **queryable**, and **joinable**.

That’s how you **normalize semi-structured data** in Snowflake into a relational model that respects **1NF**.

---

# 🎯 **In Summary: Key Takeaways**

| Concept                      | Core Idea                   | Real-World Importance           |
| ---------------------------- | --------------------------- | ------------------------------- |
| **Atomic values**            | One fact per cell           | Enables clean joins and filters |
| **Primary key**              | Ensures unique rows         | Maintains integrity             |
| **No repeating groups**      | Avoid lists in cells        | Simplifies queries              |
| **Consistent datatypes**     | Column = one type           | Prevents query errors           |
| **No reliance on row order** | Order via data, not storage | Makes model portable            |
| **1NF in modern systems**    | Flatten JSON/XML arrays     | Converts raw to relational      |

---

### 🧠 Think Like a Data Engineer:

When building **dimensional models** in Snowflake or other systems:

* Always ensure your **staging layer** data conforms to 1NF.
* Use transformations (like `FLATTEN()`, `UNNEST()`, or ETL logic) to remove nested or repeating values.
* This sets the foundation for **2NF, 3NF**, and later, **star schema design**.

---

### 🧠 **1. What is the main goal of Second Normal Form (2NF)?**

**Answer:**
The main goal of 2NF is to **eliminate partial dependency**, meaning every non-key attribute should depend on the **whole primary key**, not just part of it.
This ensures better **data integrity** and **reduces redundancy**.

---

### 🧩 **2. What is Partial Dependency?**

**Answer:**
A **partial dependency** occurs when a **non-key column** depends on only **part of a composite primary key**, instead of the entire key.
👉 If your primary key is made up of **two or more columns**, and a non-key column depends on just one of them, that’s a **partial dependency**.

**Example:**

| player_id | item_type | item_quantity | player_rating |
| --------- | --------- | ------------- | ------------- |
| 1         | sword     | 3             | 95            |

* Primary Key → (player_id, item_type)
* `item_quantity` depends on both → ✅ OK
* `player_rating` depends only on `player_id` → ❌ Partial dependency → breaks 2NF
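You can probe sample data for this kind of dependency. The sketch below uses Python’s `sqlite3` as an assumed stand-in engine; the table name `Player_Inventory`, the helper `values_per_player`, and the extra rows are hypothetical. It counts how many distinct values a column takes for a single `player_id`. A result of 1 is consistent with the column depending on `player_id` alone, though sample data can only suggest a dependency, never prove it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Player_Inventory (
        player_id     INTEGER NOT NULL,
        item_type     TEXT    NOT NULL,
        item_quantity INTEGER NOT NULL,
        player_rating INTEGER NOT NULL,
        PRIMARY KEY (player_id, item_type)
    );
    INSERT INTO Player_Inventory VALUES
        (1, 'sword',  3, 95),
        (1, 'shield', 1, 95),
        (2, 'sword',  2, 80);
    """
)


def values_per_player(column):
    """Largest number of distinct `column` values seen for any single player_id."""
    sql = f"""
        SELECT MAX(n) FROM (
            SELECT COUNT(DISTINCT {column}) AS n
            FROM Player_Inventory
            GROUP BY player_id
        )
    """
    return conn.execute(sql).fetchone()[0]


rating_spread = values_per_player("player_rating")
quantity_spread = values_per_player("item_quantity")
print(rating_spread)    # 1: player_id alone fixes the rating (partial dependency)
print(quantity_spread)  # 2: quantity also needs item_type (full dependency)
```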

---

### ⚒️ **3. How do you fix a table that violates 2NF?**

**Answer:**
To fix it, **split the table** into smaller ones so that **each table’s non-key attributes depend entirely on its primary key.**

**Solution:**

**Table 1: Player_Items**

| player_id | item_type | item_quantity |
| --------- | --------- | ------------- |
| 1         | sword     | 3             |

**Table 2: Players**

| player_id | player_rating |
| --------- | ------------- |
| 1         | 95            |

Now both tables follow **2NF** because:

* In `Player_Items`, `item_quantity` depends on the full key (player_id + item_type).
* In `Players`, `player_rating` depends only on player_id.
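As a rough sketch of that decomposition (Python’s `sqlite3` as an assumed stand-in engine; the extra `shield` row and the new rating of 97 are made up for illustration), the rating now lives in exactly one row, and a join rebuilds the original wide view whenever you need it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Players (
        player_id     INTEGER PRIMARY KEY,
        player_rating INTEGER NOT NULL
    );
    CREATE TABLE Player_Items (
        player_id     INTEGER NOT NULL REFERENCES Players (player_id),
        item_type     TEXT    NOT NULL,
        item_quantity INTEGER NOT NULL,
        PRIMARY KEY (player_id, item_type)
    );
    INSERT INTO Players VALUES (1, 95);
    INSERT INTO Player_Items VALUES (1, 'sword', 3), (1, 'shield', 1);
    """
)

# The rating is stored once, so a single UPDATE changes it for every item row.
conn.execute("UPDATE Players SET player_rating = 97 WHERE player_id = 1")

rebuilt = conn.execute(
    """
    SELECT i.player_id, i.item_type, i.item_quantity, p.player_rating
    FROM Player_Items i
    JOIN Players p ON p.player_id = i.player_id
    ORDER BY i.item_type
    """
).fetchall()
print(rebuilt)  # [(1, 'shield', 1, 97), (1, 'sword', 3, 97)]
```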

---

### ⚙️ **4. What anomalies does 2NF prevent?**

**Answer:**
2NF helps prevent **partial-dependency-based anomalies**, which include:

| Type of Anomaly       | Example                                                                                |
| --------------------- | -------------------------------------------------------------------------------------- |
| **Update Anomaly**    | If a player's rating is stored in multiple rows, updating one row won’t update others. |
| **Insertion Anomaly** | You can’t insert a player rating without also inserting an item.                       |
| **Deletion Anomaly**  | If you delete the last item of a player, you lose the player rating too.               |
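Two of these anomalies can be reproduced in a few lines against the unsplit table. This is a sketch with Python’s `sqlite3` as an assumed stand-in engine; the table name `Player_Inventory` and the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Player_Inventory (
        player_id     INTEGER NOT NULL,
        item_type     TEXT    NOT NULL,
        item_quantity INTEGER NOT NULL,
        player_rating INTEGER NOT NULL,
        PRIMARY KEY (player_id, item_type)
    );
    INSERT INTO Player_Inventory VALUES
        (1, 'sword', 3, 95), (1, 'shield', 1, 95), (2, 'bow', 1, 80);
    """
)

# Update anomaly: changing the rating on one row leaves the other row stale.
conn.execute(
    "UPDATE Player_Inventory SET player_rating = 97 "
    "WHERE player_id = 1 AND item_type = 'sword'"
)
ratings_for_player_1 = conn.execute(
    "SELECT COUNT(DISTINCT player_rating) FROM Player_Inventory WHERE player_id = 1"
).fetchone()[0]

# Deletion anomaly: removing player 2's only item also erases their rating.
conn.execute("DELETE FROM Player_Inventory WHERE player_id = 2 AND item_type = 'bow'")
player_2_rows = conn.execute(
    "SELECT COUNT(*) FROM Player_Inventory WHERE player_id = 2"
).fetchone()[0]

print(ratings_for_player_1, player_2_rows)  # 2 0
```

Player 1 now has two conflicting ratings, and player 2’s rating is gone entirely, even though nobody asked to delete it.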

---

### 🧮 **5. How is 2NF different from 1NF?**

| Feature           | **1NF**                 | **2NF**                                |
| ----------------- | ----------------------- | -------------------------------------- |
| Goal              | Remove repeating groups | Remove partial dependencies            |
| Key Type          | Any key                 | Only matters for composite keys        |
| Focus             | Atomic values           | Full dependency on the primary key     |
| Anomalies Handled | Reduces data repetition | Reduces redundancy & anomalies further |

---

### 🧱 **6. Can a table with a single-column primary key ever violate 2NF?**

**Answer:**
❌ **No.**
Because **partial dependency** can only exist when there’s a **composite primary key**.
If a table has a **single-column primary key** (and no other composite candidate key), it’s **automatically in 2NF** (as long as it’s already in 1NF).

---

### 💡 **7. How does 2NF improve data safety compared to 1NF?**

**Answer:**
In 1NF, partial dependencies can cause **data duplication** and **inconsistent updates**.
2NF eliminates those issues, ensuring:

* Updates are consistent
* Deletions don’t remove unrelated data
* Insertions are simpler and cleaner

That‚Äôs why we say **2NF provides better data safety than 1NF**.

---

### 🏆 **8. What’s the next step after 2NF?**

**Answer:**
After achieving 2NF, you check for **transitive dependencies** (where a non-key attribute depends on another non-key attribute).
Removing those leads you to **Third Normal Form (3NF)**, the next level of normalization.

---
