
---

## 🔑 **1. Primary Key**

### 🔹 **Definition**:

A **Primary Key** is a column (or a combination of columns) in a table that **uniquely identifies each row**. It **cannot be NULL** and must be **unique** for each row.

### 🔹 **Features**:

* Must be **unique** across the table
* **Cannot contain NULLs**
* Enforces **entity integrity**
* Usually **indexed** for performance

### 🔹 **Purpose**:

To ensure **data integrity** by uniquely identifying each record.

### 🔹 **Real-world example**:

In a `customer_dim` table:

```sql
customer_id (PK) | customer_name | email       | phone
-----------------|---------------|-------------|--------
101              | Alice         | a@x.com     | 123456
```

Here, `customer_id` is the **primary key**. It uniquely identifies each customer.
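A minimal sketch of the uniqueness rule, assuming SQLite (through Python's built-in `sqlite3` module) as a stand-in database. The second customer, Bob, and the helper `try_insert` are hypothetical and exist only for illustration. Only uniqueness is exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT,
        email         TEXT,
        phone         TEXT
    )
""")
conn.execute("INSERT INTO customer_dim VALUES (101, 'Alice', 'a@x.com', '123456')")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO customer_dim VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical second customer, inserted twice with different ids.
duplicate_ok = try_insert((101, "Bob", "b@x.com", "654321"))  # reuses customer_id 101
new_id_ok = try_insert((102, "Bob", "b@x.com", "654321"))     # fresh customer_id
print(duplicate_ok, new_id_ok)
```

The insert that reuses `customer_id` 101 is rejected by the primary key constraint, while the same data under a new id is accepted.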

### 🔹 **Interview tip**:

> A dimension or fact table **should** have a primary key (declared, or at least logically defined) to uniquely identify each record. In data warehouses, we often use **surrogate keys** as primary keys for dimension tables.

---

## 🔑 **2. Foreign Key**

### 🔹 **Definition**:

A **Foreign Key** is a column (or set of columns) that **refers to the primary key** (or another unique key) of another table, creating a **relationship** between the two tables.

### 🔹 **Features**:

* Maintains **referential integrity**
* Can contain **NULLs** (if the relationship is optional)
* Usually used to link **Fact ↔ Dimension**

### 🔹 **Purpose**:

To **connect tables** and define **relationships** (especially star/snowflake schema).

### 🔹 **Real-world example**:

In a `sales_fact` table:

```sql
sale_id | product_id (FK) | customer_id (FK) | amount
--------|-----------------|------------------|-------
1       | 201             | 101              | 250.00
```

`product_id` and `customer_id` are **foreign keys** that reference `product_dim` and `customer_dim`.
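As a rough illustration of referential integrity, the sketch below assumes SQLite via Python's `sqlite3` module. The column types, the extra sale rows, and the helper `try_sale` are assumptions made for the example. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless this is enabled
conn.executescript("""
    CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales_fact (
        sale_id     INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES product_dim (product_id),
        customer_id INTEGER REFERENCES customer_dim (customer_id),
        amount      REAL
    );
    INSERT INTO product_dim  VALUES (201, 'Chair');
    INSERT INTO customer_dim VALUES (101, 'Alice');
""")


def try_sale(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = try_sale((1, 201, 101, 250.00))     # both parent rows exist
orphan_ok = try_sale((2, 999, 101, 80.00))     # product 999 is not in product_dim
optional_ok = try_sale((3, 201, None, 40.00))  # NULL FK: optional relationship
print(valid_ok, orphan_ok, optional_ok)
```

The orphan row is rejected because its `product_id` has no parent, while the row with a NULL `customer_id` is accepted because the column was not declared `NOT NULL`.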

---

## 🔑 **3. Composite Key**

### 🔹 **Definition**:

A **Composite Key** is a **primary key** made of **two or more columns** that together uniquely identify a row.

### 🔹 **Features**:

* Each individual column **may not be unique**
* Only the **combination** is unique
* Used when no single column is sufficient to act as a PK

### 🔹 **Purpose**:

To define uniqueness where **no single attribute** can do it alone.

### 🔹 **Real-world example**:

In a `student_enrollment` table:

```sql
student_id | course_id | enrollment_date
-----------|-----------|-----------------
101        | CS101     | 2023-09-01
101        | MA101     | 2023-09-02
```

Composite Primary Key: (`student_id`, `course_id`)

This ensures that a student can enroll **only once per course**, but the same `student_id` or `course_id` can repeat.
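The behaviour can be checked with a small sketch, again assuming SQLite through Python's `sqlite3` module. The rows for student 102 and the repeated enrollment are hypothetical additions, as is the helper `try_enroll`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_enrollment (
        student_id      INTEGER NOT NULL,
        course_id       TEXT    NOT NULL,
        enrollment_date TEXT,
        PRIMARY KEY (student_id, course_id)  -- composite primary key
    )
""")


def try_enroll(student_id, course_id, enrollment_date):
    """Return True if the enrollment is accepted, False if the key rejects it."""
    try:
        conn.execute(
            "INSERT INTO student_enrollment VALUES (?, ?, ?)",
            (student_id, course_id, enrollment_date),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_enroll(101, "CS101", "2023-09-01"),
    try_enroll(101, "MA101", "2023-09-02"),  # same student, different course
    try_enroll(102, "CS101", "2023-09-03"),  # same course, different student
    try_enroll(101, "CS101", "2023-10-01"),  # the pair (101, CS101) already exists
]
print(results)
```

Only the last insert fails: each column may repeat on its own, but the pair must stay unique.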

---

## 🔑 **4. Natural Key (a.k.a. Business Key)**

### 🔹 **Definition**:

A **Natural Key** is an attribute that **already exists in the real world** and can uniquely identify a record — like Social Security Number, Email, or ISBN.

### 🔹 **Features**:

* Comes from **business data**
* Often meaningful to humans
* Can be used as PK **but comes with risks**

### 🔹 **Purpose**:

To avoid using artificial identifiers and to stay **aligned with business logic**.

### 🔹 **Real-world example**:

In `employee_dim`:

```sql
employee_ssn (Natural PK) | name     | department
--------------------------|----------|-----------
AAA1111                   | Bob      | HR
```

Here, `employee_ssn` could be used as a **natural key**.

### 🔹 **When natural key breaks**:

* If the business rule changes (e.g., the company reissues employee IDs or changes product codes)
* If duplicates or NULLs sneak in over time due to dirty data
* If the value itself is subject to change (email address, phone number, etc.)

**Hence**, we usually prefer a **surrogate key** in data warehouses.

---

## 🔑 **5. Surrogate Key (a.k.a. Synthetic Key)**

### 🔹 **Definition**:

A **Surrogate Key** is a **system-generated, meaningless** identifier (usually a number) used as a **primary key** in dimension tables.

### 🔹 **Features**:

* Not derived from business data
* Not exposed to end-users
* Unaffected by changes in business rules
* Enables easy joins with fact tables

### 🔹 **Purpose**:

To provide a **stable, consistent** identifier, especially important for **SCD Type 2** and other history tracking.

### 🔹 **Real-world example**:

In `product_dim`:

```sql
product_sk (PK) | product_code | product_name
----------------|--------------|--------------
1001            | P-201        | Chair
1002            | P-201        | Chair (updated)
```

Both rows have the same `product_code` but different surrogate keys — **crucial** for tracking **changes** over time.

### 🔹 **In SCD Type 2**:

Each version of a record gets a **new surrogate key** so that we can differentiate them in the **fact table joins**.
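A small sketch of this idea, assuming SQLite through Python's `sqlite3` module. The `sales_fact` rows and their amounts are hypothetical, added only so that one sale points at each version of the product. A real SCD Type 2 dimension would normally also carry validity dates or a current-row flag, which are left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_sk   INTEGER PRIMARY KEY,  -- surrogate key, one per version
        product_code TEXT NOT NULL,        -- natural key, repeats across versions
        product_name TEXT
    );
    CREATE TABLE sales_fact (
        sale_id      INTEGER PRIMARY KEY,
        product_sk   INTEGER REFERENCES product_dim (product_sk),
        sales_amount REAL
    );
    INSERT INTO product_dim VALUES (1001, 'P-201', 'Chair');
    INSERT INTO product_dim VALUES (1002, 'P-201', 'Chair (updated)');
    -- Hypothetical sales: one recorded against each version.
    INSERT INTO sales_fact VALUES (1, 1001, 250.00);
    INSERT INTO sales_fact VALUES (2, 1002, 300.00);
""")

# Joining on the surrogate key gives each sale exactly one dimension version.
rows = conn.execute("""
    SELECT f.sale_id, d.product_code, d.product_name
    FROM sales_fact AS f
    JOIN product_dim AS d ON d.product_sk = f.product_sk
    ORDER BY f.sale_id
""").fetchall()

# The natural key alone is ambiguous: it matches every version of the product.
versions = conn.execute(
    "SELECT COUNT(*) FROM product_dim WHERE product_code = 'P-201'"
).fetchone()[0]

print(rows, versions)
```

Each sale resolves to the product name that was valid for its version, whereas `product_code` by itself matches both rows.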

---

## ✅ Summary Table:

| Key Type      | Unique? | Nulls Allowed? | Derived From          | Typical Use Case                               |
| ------------- | ------- | -------------- | --------------------- | ---------------------------------------------- |
| Primary Key   | Yes     | No             | Business/Surrogate    | Uniquely identify records                      |
| Foreign Key   | No      | Yes (optional) | References another PK | Define relationships between tables            |
| Composite Key | Yes     | No             | Business attributes   | Multi-column uniqueness                        |
| Natural Key   | Yes     | No             | Business data         | Uniquely identifies record from business side  |
| Surrogate Key | Yes     | No             | System-generated      | Stable identifier for DW + historical tracking |

---

## 🎯 Important Questions

1. **Why do we prefer surrogate keys over natural keys in dimension tables?**
2. **Can a surrogate key be used in a fact table?**
3. **Give an example where a natural key failed and needed to be replaced.**
4. **What’s the risk of using composite keys in large fact tables?**
5. **Can a foreign key be part of a primary key?**
6. **How do surrogate keys help with SCD Type 2 implementation?**

---



### **1. Why do we prefer surrogate keys over natural keys in dimension tables?**

**Answer**:
We prefer **surrogate keys** because:

* **Stability**: Natural keys can change (e.g., product codes or employee IDs), which breaks referential integrity. Surrogate keys carry no business meaning, so they never need to change.
* **Simplified joins**: Surrogate keys are typically compact integers, which usually makes joins faster and indexes smaller than with long string-based natural keys.
* **SCD Type 2 support**: To track historical versions of a record, each version needs a unique identifier. A natural key repeats across versions, so on its own it can't provide one.
* **Avoid data issues**: Natural keys might not be truly unique due to data quality issues in source systems. Surrogate keys are generated by the warehouse, so they are unaffected by this.

👉 **Example**: In `product_dim`, if product codes change over time, using them as PK breaks relationships. Surrogate keys maintain consistency.

---

### **2. Can a surrogate key be used in a fact table?**

**Answer**:
Yes. In fact, **fact tables almost always use surrogate keys** to link to dimension tables.

* Each foreign key in the fact table points to the **surrogate key** (PK) in a dimension table.
* This ensures **consistency**, supports **slowly changing dimensions**, and keeps the **data model performant**.

👉 **Example**:
In a `sales_fact` table:

```sql
sale_id | customer_sk | product_sk | sales_amount
--------|-------------|------------|--------------
1       | 1001        | 2005       | 350.00
```

`customer_sk` and `product_sk` are surrogate keys from `customer_dim` and `product_dim`.

---

### **3. Give an example where a natural key failed and needed to be replaced.**

**Answer**:
Let’s say a company used **email addresses** as natural keys for customers. But:

* A customer **changes their email** or creates a new one.
* Two customers **share** the same email temporarily (mistake or shared mailbox).
* One record gets created with a **typo** in the email.

This breaks the uniqueness of the natural key and causes **data integrity issues**. To solve this, a **surrogate key** is introduced as a stable, system-generated identifier.
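The "customer changes their email" case can be sketched as follows, assuming SQLite through Python's `sqlite3` module. All table names (`customer_by_email`, `order_by_email`, `customer_by_sk`, `order_by_sk`), the new address, and the helper `try_change_email` are hypothetical and chosen only to contrast the two designs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless this is enabled
conn.executescript("""
    -- Design A: the email (natural key) is the primary key.
    CREATE TABLE customer_by_email (email TEXT NOT NULL PRIMARY KEY, name TEXT);
    CREATE TABLE order_by_email (
        order_id INTEGER PRIMARY KEY,
        email    TEXT NOT NULL REFERENCES customer_by_email (email)
    );
    INSERT INTO customer_by_email VALUES ('a@x.com', 'Alice');
    INSERT INTO order_by_email VALUES (1, 'a@x.com');

    -- Design B: a surrogate key is the primary key; email is a plain attribute.
    CREATE TABLE customer_by_sk (customer_sk INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE TABLE order_by_sk (
        order_id    INTEGER PRIMARY KEY,
        customer_sk INTEGER NOT NULL REFERENCES customer_by_sk (customer_sk)
    );
    INSERT INTO customer_by_sk VALUES (1, 'a@x.com', 'Alice');
    INSERT INTO order_by_sk VALUES (1, 1);
""")


def try_change_email(table):
    """Try to update Alice's email in the given table; True if it succeeds."""
    try:
        conn.execute(
            f"UPDATE {table} SET email = 'alice@new.com' WHERE email = 'a@x.com'"
        )
        return True
    except sqlite3.IntegrityError:
        return False


natural_ok = try_change_email("customer_by_email")  # would orphan the existing order
surrogate_ok = try_change_email("customer_by_sk")   # orders reference customer_sk only
print(natural_ok, surrogate_ok)
```

With the email as primary key, the update is blocked because an order still references the old value. With a surrogate key, the email is just an attribute and can change freely.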

---

### **4. What’s the risk of using composite keys in large fact tables?**

**Answer**:
Composite keys in fact tables can cause:

* **Performance issues**: Joins and index lookups must compare several columns instead of one, which can slow them down.
* **Complexity**: More complex ETL logic and maintenance overhead.
* **Storage overhead**: Larger indexes due to multi-column combinations.
* **Error-prone**: Higher chance of mistakes in ETL and downstream reporting.

Hence, we usually **avoid wide composite keys** in fact tables and rely on compact **surrogate keys** to simplify the model.

---

### **5. Can a foreign key be part of a primary key?**

**Answer**:
Yes, this is commonly seen in **bridge tables** or **association tables**.

👉 **Example**:
In a `student_course` table where each student can enroll in multiple courses:

```sql
student_id (FK) | course_id (FK)
----------------|----------------
1001            | CS101
1001            | MA101
```

Here, (`student_id`, `course_id`) together form a **composite primary key**, and both are **foreign keys** referencing `student` and `course` tables.

---

### **6. How do surrogate keys help with SCD Type 2 implementation?**

**Answer**:
In **SCD Type 2**, every time a change occurs in a dimension record, we:

* **Insert a new row** with updated data
* Keep the old record for history
* Use a new **surrogate key** for the new version

This means multiple rows can exist for the same business entity (natural key), and the fact table should point to the **correct version** of the dimension record.

👉 **Without surrogate keys**, we couldn't tell the versions apart or join each fact to the **right historical dimension** record.

---

