# Pivot - Unpivot
The task is to reshape employee compensation data. The source table is in long format, with each compensation component stored as a separate row. We pivot it into a wide format with one column per component, then unpivot the wide table back into the original long format so either shape is available for analysis.


In [0]:
%sql
-- Switch to my Catalog
USE CATALOG workspace;

-- Create schema if not exists
CREATE SCHEMA IF NOT EXISTS sql_pyspark_practice;

-- Use this schema
USE sql_pyspark_practice;

In [0]:
%sql
create or replace table emp_compensation (
emp_id int,
salary_component_type varchar(20),
val int
);

insert into emp_compensation
values (1,'salary',10000),(1,'bonus',5000),(1,'hike_percent',10)
, (2,'salary',15000),(2,'bonus',7000),(2,'hike_percent',8)
, (3,'salary',12000),(3,'bonus',6000),(3,'hike_percent',7);

select * from emp_compensation;

## Pivot

### SQL Solution

In [0]:
%sql
create or replace table emp_compensation_pivot as 
select emp_id,
        sum(case when salary_component_type = 'salary' then val end) as salary,
        sum(case when salary_component_type = 'bonus' then val end) as bonus,
        sum(case when salary_component_type = 'hike_percent' then val end) as hike_percent
from emp_compensation
group by emp_id;

select * from emp_compensation_pivot;

### PySpark Solution

In [0]:
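# Import only what is needed; a wildcard import would shadow Python's built-in sum
from pyspark.sql.functions import col, sum as spark_sum

df = spark.table("emp_compensation")
display(df)

# One column per distinct salary_component_type, summing val within each emp_id
df_pivot = df.groupBy("emp_id").pivot("salary_component_type").agg(spark_sum("val"))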

df_final = df_pivot.select(
    col("emp_id"),
    col("salary"),
    col("bonus"),
    col("hike_percent")
)
display(df_final)

## Unpivot

### SQL Solution

In [0]:
%sql

select emp_id, 'salary' as salary_component_type, salary as val from emp_compensation_pivot
union all
select emp_id, 'bonus' as salary_component_type, bonus as val from emp_compensation_pivot
union all
select emp_id, 'hike_percent' as salary_component_type, hike_percent as val from emp_compensation_pivot;

### PySpark Solution

In [0]:
df_unpivot = df_final.selectExpr("emp_id", "stack(3, 'salary', salary, 'bonus', bonus, 'hike_percent', hike_percent) as (salary_component_type, val)")
display(df_unpivot)

## Learnings

### ✅ **Outcome**

* Successfully converted **tall/long data → wide format (pivot)** and then **wide → long format (unpivot)** using both **SQL** and **PySpark**.
* Got equivalent results in SQL tables and PySpark DataFrames.

---

### 🎯 **What I Learnt**

* How to **pivot** data in SQL using CASE + GROUP BY.
* How to **pivot** in PySpark using `groupBy().pivot().agg()`.
* How to **unpivot** in SQL using `UNION ALL`.
* How to **unpivot** in PySpark using `stack()`.
* Understanding of data reshaping between **row-based** and **column-based** formats.
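
The notebook uses `stack()` without spelling out what it does per row. As a rough mental model only (this is plain Python, not the Spark implementation), `stack(n, expr1, ..., exprk)` splits its arguments into `n` output rows. The helper name `stack` and the sample `wide_row` below are illustrative.

```python
from typing import Any, Iterator


def stack(n: int, *exprs: Any) -> Iterator[tuple]:
    """Plain-Python model of Spark SQL's stack(n, expr1, ..., exprk)."""
    width = -(-len(exprs) // n)  # ceiling division: columns per output row
    for i in range(n):
        chunk = exprs[i * width:(i + 1) * width]
        # Pad a short final chunk with None, standing in for SQL NULL
        yield tuple(chunk) + (None,) * (width - len(chunk))


# One row of the wide table, as a dict
wide_row = {"emp_id": 1, "salary": 10000, "bonus": 5000, "hike_percent": 10}

# Three (label, value) pairs become three long-format rows
long_rows = [
    (wide_row["emp_id"], component, val)
    for component, val in stack(
        3,
        "salary", wide_row["salary"],
        "bonus", wide_row["bonus"],
        "hike_percent", wide_row["hike_percent"],
    )
]
print(long_rows)
```

Each wide row therefore expands into three long rows, which is why a single `selectExpr` call is enough for the unpivot.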

---

### 🪜 **Steps Done**

1. Created base table `emp_compensation` (long format).
2. Built a pivoted table using:

   * SQL CASE statements
   * PySpark pivot function
3. Built an unpivoted version using:

   * SQL `UNION ALL`
   * PySpark `stack()` function

---

### 📘 **Main Topics Covered (SQL + PySpark)**

* **SQL**

  * `GROUP BY`
  * Conditional aggregation (`CASE WHEN`)
  * `UNION ALL`
  * Table creation
  * Manual PIVOT/UNPIVOT logic

* **PySpark**

  * DataFrame creation using `spark.table()`
  * `groupBy()`
  * `pivot()`
  * `agg(sum())`
  * `selectExpr()`
  * `stack()` for unpivot
  * Column selection with `col()`

---

# ✅ **1) Deeply analyze the SQL code**

Your SQL contains **two parts**:

### **(A) Pivot**

Transforms rows → columns.

It takes this input structure:

```
emp_id | salary_component_type | val
```

And converts it into:

```
emp_id | salary | bonus | hike_percent
```

This is done using:

```sql
sum(case when salary_component_type = 'salary' then val end)
```

This is the classic conditional-aggregation way to implement a pivot: one `SUM(CASE ...)` expression per output column.
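
To try the pivot outside Databricks, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for Spark SQL (an assumption; the `CASE`/`SUM`/`GROUP BY` constructs used here behave the same way in both). The data matches the notebook's `emp_compensation` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_compensation ("
    "emp_id INT, salary_component_type VARCHAR(20), val INT)"
)
conn.executemany(
    "INSERT INTO emp_compensation VALUES (?, ?, ?)",
    [
        (1, "salary", 10000), (1, "bonus", 5000), (1, "hike_percent", 10),
        (2, "salary", 15000), (2, "bonus", 7000), (2, "hike_percent", 8),
        (3, "salary", 12000), (3, "bonus", 6000), (3, "hike_percent", 7),
    ],
)

# One SUM(CASE ...) per component turns row categories into columns
pivot_rows = conn.execute(
    """
    SELECT emp_id,
           SUM(CASE WHEN salary_component_type = 'salary' THEN val END) AS salary,
           SUM(CASE WHEN salary_component_type = 'bonus' THEN val END) AS bonus,
           SUM(CASE WHEN salary_component_type = 'hike_percent' THEN val END) AS hike_percent
    FROM emp_compensation
    GROUP BY emp_id
    ORDER BY emp_id
    """
).fetchall()

for row in pivot_rows:
    print(row)
```

Each employee ends up with exactly one row, with the three components side by side.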

---

### **(B) Unpivot**

Transforms columns → rows.

It converts the pivoted table back to:

```
emp_id | salary_component_type | val
```

using:

```sql
select emp_id, 'salary' as salary_component_type, salary as val from emp_compensation_pivot
union all
...
```

This is a classic UNPIVOT using `UNION ALL`.
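
The reverse direction can be sketched the same way. This example again assumes SQLite (via Python's `sqlite3`) as a stand-in for Spark SQL and starts directly from the wide table, then checks that the unpivot yields one row per employee and component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_compensation_pivot ("
    "emp_id INT, salary INT, bonus INT, hike_percent INT)"
)
conn.executemany(
    "INSERT INTO emp_compensation_pivot VALUES (?, ?, ?, ?)",
    [(1, 10000, 5000, 10), (2, 15000, 7000, 8), (3, 12000, 6000, 7)],
)

# One SELECT per wide column; UNION ALL stacks the results vertically
unpivot_rows = conn.execute(
    """
    SELECT emp_id, 'salary' AS salary_component_type, salary AS val
    FROM emp_compensation_pivot
    UNION ALL
    SELECT emp_id, 'bonus', bonus
    FROM emp_compensation_pivot
    UNION ALL
    SELECT emp_id, 'hike_percent', hike_percent
    FROM emp_compensation_pivot
    """
).fetchall()

# Row order of a UNION ALL is not guaranteed, so sort before inspecting
for row in sorted(unpivot_rows):
    print(row)
```

Three wide rows with three value columns give nine long rows, matching the original table's contents.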

---

# ✅ **2) Reconstructed original problem**

Based on the SQL:

> **Problem:**
> Convert a compensation dataset from a tall/row-based format into a wide/pivoted format (salary, bonus, hike_percent as separate columns) **and then convert the wide table back into tall/unpivoted format**.

---

# ✅ **3) Explain the problem in plain English**

You have employee compensation data stored like this:

| emp_id | component_type | value |
| ------ | -------------- | ----- |
| 101    | salary         | 50000 |
| 101    | bonus          | 5000  |
| 101    | hike_percent   | 10    |

But you want it like this:

| emp_id | salary | bonus | hike_percent |
| ------ | ------ | ----- | ------------ |
| 101    | 50000  | 5000  | 10           |

This is **pivoting**.

And then the task asks you to convert that pivoted table back to the original long format.
This is **unpivoting**.
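
Stripped of SQL entirely, the two shapes are just two ways of nesting the same values. The following plain-Python sketch uses the example rows for employee 101 from the tables above; the variable names are illustrative.

```python
from collections import defaultdict

# Long format: one (emp_id, component_type, value) tuple per row
long_rows = [
    (101, "salary", 50000),
    (101, "bonus", 5000),
    (101, "hike_percent", 10),
]

# Pivot: collect each employee's components into one record
wide = defaultdict(dict)
for emp_id, component_type, value in long_rows:
    wide[emp_id][component_type] = value
print(dict(wide))

# Unpivot: emit one tuple per (employee, component) pair again
round_trip = [
    (emp_id, component_type, value)
    for emp_id, components in wide.items()
    for component_type, value in components.items()
]
print(round_trip == long_rows)
```

Because no information is lost in either direction, the round trip reproduces the original rows.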

---

# ✅ **4) Why the problem matters & what concepts it is testing**

This question tests two core data transformation skills:

### **📌 PIVOT**

Turning row categories into columns.

Used when:

* Creating reporting tables.
* Summarizing KPI components.
* Creating ML feature tables.

### **📌 UNPIVOT**

Turning columns back into rows.

Used when:

* Normalizing data.
* Reversing pivoted outputs.
* Preparing for joins/analytics that expect row-level data.

### **Skills tested**

* CASE-based pivoting
* Aggregation with GROUP BY
* UNION ALL-based unpivoting
* Understanding wide vs long data models
* Schema transformation thinking

This is extremely common in analytics, SQL interviews, ETL, and data engineering pipelines.

---

# ✅ **5) How to logically think to solve such problems (general methodology)**

### **PIVOT thinking**

1. Identify **grouping column** → `emp_id`
2. Identify **pivot category column** → `salary_component_type`
3. Identify **value column** → `val`
4. For each category, create:

   ```
   sum(case when salary_component_type = 'salary' then val end) as salary
   ```
5. Group by the base key (emp_id)
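
The steps above are mechanical enough to generate. As a sketch, the hypothetical helper `build_pivot_sql` below (not part of the notebook) assembles the CASE-based query from a list of categories; SQLite is assumed as a stand-in engine for the quick check at the end.

```python
import sqlite3
from typing import Sequence


def build_pivot_sql(table: str, key: str, category_col: str,
                    value_col: str, categories: Sequence[str]) -> str:
    """Build a CASE-based pivot query with one output column per category.

    Hypothetical helper: names are interpolated directly into the SQL,
    so pass only trusted identifiers and category values.
    """
    cases = ",\n       ".join(
        f"SUM(CASE WHEN {category_col} = '{c}' THEN {value_col} END) AS {c}"
        for c in categories
    )
    return f"SELECT {key},\n       {cases}\nFROM {table}\nGROUP BY {key}"


pivot_sql = build_pivot_sql(
    "emp_compensation", "emp_id", "salary_component_type", "val",
    ["salary", "bonus", "hike_percent"],
)
print(pivot_sql)

# Quick check against an in-memory SQLite table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_compensation (emp_id INT, salary_component_type TEXT, val INT)"
)
conn.executemany(
    "INSERT INTO emp_compensation VALUES (?, ?, ?)",
    [(1, "salary", 10000), (1, "bonus", 5000), (1, "hike_percent", 10)],
)
pivot_result = conn.execute(pivot_sql).fetchall()
print(pivot_result)
```

Adding a new component then means adding one entry to the category list rather than hand-writing another `SUM(CASE ...)` line.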

### **UNPIVOT thinking**

1. Identify each column you want to convert back into rows.
2. For each column, create a SELECT block:

   ```
   select emp_id, 'salary' as salary_component_type, salary as val from emp_compensation_pivot
   ```
3. Combine using `UNION ALL`.

Thinking pattern:

* Pivot = CASE WHEN per category
* Unpivot = UNION ALL per column
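
The "one SELECT per column" pattern can likewise be generated. The helper `build_unpivot_sql` below is hypothetical (it does not appear in the notebook), and SQLite is again assumed as a stand-in for Spark SQL in the check.

```python
import sqlite3
from typing import Sequence


def build_unpivot_sql(table: str, key: str, columns: Sequence[str],
                      name_col: str = "salary_component_type",
                      value_col: str = "val") -> str:
    """Build a UNION ALL unpivot query with one SELECT per wide column.

    Hypothetical helper: names are interpolated directly into the SQL,
    so pass only trusted identifiers.
    """
    selects = [
        f"SELECT {key}, '{c}' AS {name_col}, {c} AS {value_col} FROM {table}"
        for c in columns
    ]
    return "\nUNION ALL\n".join(selects)


unpivot_sql = build_unpivot_sql(
    "emp_compensation_pivot", "emp_id", ["salary", "bonus", "hike_percent"]
)
print(unpivot_sql)

# Quick check against an in-memory SQLite table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_compensation_pivot ("
    "emp_id INT, salary INT, bonus INT, hike_percent INT)"
)
conn.execute("INSERT INTO emp_compensation_pivot VALUES (1, 10000, 5000, 10)")
unpivot_result = sorted(conn.execute(unpivot_sql).fetchall())
print(unpivot_result)
```

The generated query has the same shape as the hand-written one: the label literal, the value column, and a `UNION ALL` between each block.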

---

# âœ… **6) Step-by-step breakdown of the SQL code**

---

## **PIVOT section**

### **Step 1: Create pivot table**

```sql
create or replace table emp_compensation_pivot as
select 
   emp_id,
   sum(case when salary_component_type = 'salary' then val end) as salary,
   sum(case when salary_component_type = 'bonus' then val end) as bonus,
   sum(case when salary_component_type = 'hike_percent' then val end) as hike_percent
from emp_compensation
group by emp_id;
```

#### Explanation:

* `emp_id` is grouped so each employee gets one row.
* CASE picks values only when the component matches a type.
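* SUM collapses each employee's rows into one. It ignores the NULLs produced by non-matching CASE branches, so with one row per component it simply returns that value.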
* Result: pivoted data (wide format).
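
Two edge cases are worth seeing once: a component that is missing for an employee, and a component that appears twice. The sketch below uses made-up rows (not the notebook's data) and assumes SQLite as a stand-in for Spark SQL; `SUM` over only NULL inputs returns NULL in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_compensation (emp_id INT, salary_component_type TEXT, val INT)"
)
conn.executemany(
    "INSERT INTO emp_compensation VALUES (?, ?, ?)",
    [
        (1, "salary", 10000),                 # employee 1 has no bonus row
        (2, "salary", 15000),
        (2, "bonus", 3000), (2, "bonus", 4000),  # employee 2 has two bonus rows
    ],
)

edge_rows = conn.execute(
    """
    SELECT emp_id,
           SUM(CASE WHEN salary_component_type = 'salary' THEN val END) AS salary,
           SUM(CASE WHEN salary_component_type = 'bonus' THEN val END) AS bonus
    FROM emp_compensation
    GROUP BY emp_id
    ORDER BY emp_id
    """
).fetchall()
print(edge_rows)
```

The missing bonus surfaces as `None` (NULL) rather than 0, and the duplicate bonus rows are added together. If duplicates should not be summed, a different aggregate such as `MAX` would be the thing to consider.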

---

## **Unpivot section**

```sql
select emp_id, 'salary' as salary_component_type, salary as val 
from emp_compensation_pivot
union all
select emp_id, 'bonus', bonus 
from emp_compensation_pivot
union all
select emp_id, 'hike_percent', hike_percent 
from emp_compensation_pivot;
```

#### Explanation:

* Each SELECT converts one column into rows.
* `'salary'` is the type label.
* `salary` is the value.
* Repeated for bonus and hike_percent.
* UNION ALL stacks them into a long table.
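
One consequence of this approach is that a NULL in the wide table becomes a row with a NULL value in the long table. If the original long table simply had no row for that component, a filter restores it. A small sketch with made-up rows, assuming SQLite as a stand-in for Spark SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_compensation_pivot (emp_id INT, salary INT, bonus INT)")
conn.executemany(
    "INSERT INTO emp_compensation_pivot VALUES (?, ?, ?)",
    [(1, 10000, None), (2, 15000, 7000)],  # employee 1 has no bonus
)

unpivot_sql = """
    SELECT emp_id, 'salary' AS salary_component_type, salary AS val
    FROM emp_compensation_pivot
    UNION ALL
    SELECT emp_id, 'bonus', bonus
    FROM emp_compensation_pivot
"""

# Plain unpivot keeps the NULL bonus as its own row
all_rows = conn.execute(unpivot_sql).fetchall()

# Wrapping the unpivot and filtering drops the placeholder rows
non_null_rows = conn.execute(
    f"SELECT * FROM ({unpivot_sql}) WHERE val IS NOT NULL"
).fetchall()

print(len(all_rows), len(non_null_rows))
```

Whether to keep or drop the NULL rows depends on whether "no value" is meaningful downstream.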

---

# ✅ **7) Clean, readable SQL with comments**

### **PIVOT**

```sql
-- Pivot employee compensation by component type
CREATE OR REPLACE TABLE emp_compensation_pivot AS
SELECT 
    emp_id,

    -- Pick salary values
    SUM(CASE WHEN salary_component_type = 'salary' THEN val END) AS salary,

    -- Pick bonus values
    SUM(CASE WHEN salary_component_type = 'bonus' THEN val END) AS bonus,

    -- Pick hike percentage values
    SUM(CASE WHEN salary_component_type = 'hike_percent' THEN val END) AS hike_percent

FROM emp_compensation
GROUP BY emp_id;
```

---

### **UNPIVOT**

```sql
-- Unpivot the wide table back into long form
SELECT emp_id, 'salary' AS salary_component_type, salary AS val
FROM emp_compensation_pivot
UNION ALL

SELECT emp_id, 'bonus', bonus
FROM emp_compensation_pivot
UNION ALL

SELECT emp_id, 'hike_percent', hike_percent
FROM emp_compensation_pivot;
```

---

# ✅ **8) Equivalent PySpark DataFrame code**

Assume:

```python
df = spark.table("emp_compensation")
```

---

## **PIVOT in PySpark**

```python
from pyspark.sql import functions as F

pivot_df = (
    df
    .groupBy("emp_id")
    .pivot("salary_component_type", ["salary", "bonus", "hike_percent"])
    .agg(F.sum("val"))
)
```

---

## **UNPIVOT in PySpark**

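PySpark versions before 3.4 have no `DataFrame.unpivot` method, so one portable approach is to build one DataFrame per column and stack them with `unionByName`. This mirrors the SQL `UNION ALL` query; the `stack()` expression used in the notebook's own PySpark solution is a more compact alternative.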

```python
salary_df = pivot_df.select(
    "emp_id",
    F.lit("salary").alias("salary_component_type"),
    F.col("salary").alias("val")
)

bonus_df = pivot_df.select(
    "emp_id",
    F.lit("bonus").alias("salary_component_type"),
    F.col("bonus").alias("val")
)

hike_df = pivot_df.select(
    "emp_id",
    F.lit("hike_percent").alias("salary_component_type"),
    F.col("hike_percent").alias("val")
)

unpivot_df = salary_df.unionByName(bonus_df).unionByName(hike_df)
```

---

# ✅ **9) Line-by-line Explanation of the PySpark Code**

### **Pivot**

```python
df.groupBy("emp_id")
```

Group by employee.

```python
.pivot("salary_component_type", ["salary", "bonus", "hike_percent"])
```

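Creates one output column per listed category. Supplying the list explicitly fixes the column order and saves Spark a pass over the data to discover the distinct values.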

```python
.agg(F.sum("val"))
```

Aggregates values for each category.

---

### **Unpivot**

Each section:

```python
salary_df = pivot_df.select(
    "emp_id",
    F.lit("salary").alias("salary_component_type"),
    F.col("salary").alias("val")
)
```

* Takes the salary column,
* Adds the label "salary",
* Renames it to val.

Then:

```python
unionByName()
```

Stacks the DataFrames vertically.

---

# ✅ **10) How SQL thinking differs from PySpark thinking**

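| Concept        | SQL                                                           | PySpark                                                                              |
| -------------- | ------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| Pivot          | Conditional aggregation (`CASE` inside `SUM`) with `GROUP BY` | `groupBy().pivot().agg()`                                                            |
| Unpivot        | One `SELECT` per column, combined with `UNION ALL`            | `stack()` in `selectExpr()`, or one `select` per column combined with `unionByName`  |
| Code style     | Declarative query                                             | Chained DataFrame transformations                                                    |
| NULL handling  | `CASE` without `ELSE` yields NULL, which `SUM` ignores        | `pivot()` leaves null where a key has no row for a category                          |
| Data structure | Tables                                                        | DataFrames                                                                           |

With `stack()`, the PySpark unpivot is about as compact as the SQL version. The `unionByName` approach is more verbose but mirrors the `UNION ALL` query one-to-one.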

---

# ✅ **11) Hints (3 basic + 3 advanced)**

### **Basic Hints**

1. Pivoting = CASE WHEN + SUM.
2. Group by the main entity (emp_id).
3. Unpivoting = multiple SELECTs stacked with UNION ALL.

### **Advanced Hints**

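1. Leave out the `ELSE` branch so non-matching rows yield NULL; `SUM` ignores NULLs, so a missing component comes out as NULL rather than 0.
2. `UNION ALL` matches columns by position, not by name, so keep the column order and types identical in every SELECT; the output column names come from the first SELECT.
3. Think in terms of long → wide → long transformations used in analytics pipelines.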
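
The second advanced hint is easy to get wrong, so here is a small demonstration of what happens when one branch lists its columns in a different order. It assumes SQLite as a stand-in for Spark SQL; SQLite's loose typing lets the mistake run without complaint, whereas a stricter engine may coerce types or raise an error instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_compensation_pivot (emp_id INT, salary INT, bonus INT)")
conn.execute("INSERT INTO emp_compensation_pivot VALUES (1, 10000, 5000)")

cur = conn.execute(
    """
    SELECT emp_id, 'salary' AS salary_component_type, salary AS val
    FROM emp_compensation_pivot
    UNION ALL
    -- Mistake: label and value are swapped; the aliases do not realign them
    SELECT emp_id, bonus AS val, 'bonus' AS salary_component_type
    FROM emp_compensation_pivot
    """
)
column_names = [d[0] for d in cur.description]
rows = cur.fetchall()

print(column_names)
for row in rows:
    print(row)
```

The output columns keep the names from the first SELECT, and the bonus amount lands in the label column. This is the problem that name-based alignment, such as PySpark's `unionByName`, is meant to avoid.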

---

# ✅ **12) Final SQL + PySpark Solutions**

### **Final SQL**

```sql
CREATE OR REPLACE TABLE emp_compensation_pivot AS
SELECT 
    emp_id,
    SUM(CASE WHEN salary_component_type = 'salary' THEN val END) AS salary,
    SUM(CASE WHEN salary_component_type = 'bonus' THEN val END) AS bonus,
    SUM(CASE WHEN salary_component_type = 'hike_percent' THEN val END) AS hike_percent
FROM emp_compensation
GROUP BY emp_id;

SELECT emp_id, 'salary' AS salary_component_type, salary AS val 
FROM emp_compensation_pivot
UNION ALL
SELECT emp_id, 'bonus', bonus 
FROM emp_compensation_pivot
UNION ALL
SELECT emp_id, 'hike_percent', hike_percent 
FROM emp_compensation_pivot;
```

---

### **Final PySpark**

```python
from pyspark.sql import functions as F

# Pivot
pivot_df = (
    df.groupBy("emp_id")
      .pivot("salary_component_type", ["salary", "bonus", "hike_percent"])
      .agg(F.sum("val"))
)

# Unpivot
salary_df = pivot_df.select("emp_id", F.lit("salary").alias("salary_component_type"), F.col("salary").alias("val"))
bonus_df  = pivot_df.select("emp_id", F.lit("bonus").alias("salary_component_type"), F.col("bonus").alias("val"))
hike_df   = pivot_df.select("emp_id", F.lit("hike_percent").alias("salary_component_type"), F.col("hike_percent").alias("val"))

unpivot_df = salary_df.unionByName(bonus_df).unionByName(hike_df)
```

---

# ✅ **13) Final Teaching Takeaway**

For **pivoting**, think:

> "For each category, pull its value into a separate column using CASE."

For **unpivoting**, think:

> "For each column, create a row with a type label and a value, then UNION them."

This pivot-unpivot pattern is extremely common in SQL interviews and ETL systems.

---