# Pivot - Unpivot
Employee compensation data is stored in long format, with each component (salary, bonus, hike percent) on its own row. The task is to reshape it into a wide, pivoted format with one column per component, and then to reverse the transformation by unpivoting the wide table back into the original long format for flexible analysis.


In [0]:
%sql
-- Switch to my Catalog
USE CATALOG workspace;

-- Create schema if not exists
CREATE SCHEMA IF NOT EXISTS sql_pyspark_practice;

-- Use this schema
USE sql_pyspark_practice;

In [0]:
%sql
create or replace table emp_compensation (
emp_id int,
salary_component_type varchar(20),
val int
);

insert into emp_compensation
values (1,'salary',10000),(1,'bonus',5000),(1,'hike_percent',10)
, (2,'salary',15000),(2,'bonus',7000),(2,'hike_percent',8)
, (3,'salary',12000),(3,'bonus',6000),(3,'hike_percent',7);

select * from emp_compensation;

## Pivot

### SQL Solution

In [0]:
%sql
create or replace table emp_compensation_pivot as 
select emp_id,
        sum(case when salary_component_type = 'salary' then val end) as salary,
        sum(case when salary_component_type = 'bonus' then val end) as bonus,
        sum(case when salary_component_type = 'hike_percent' then val end) as hike_percent
from emp_compensation
group by emp_id;

select * from emp_compensation_pivot;
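
To make the CASE + GROUP BY logic above concrete, here is a minimal plain-Python sketch of what the query computes. It is not Spark code: the rows are hard-coded from the insert statement, `pivot_rows` is a hypothetical helper introduced only for illustration, and `None` stands in for SQL `NULL`.

```python
from collections import defaultdict

# Same rows as the INSERT above: (emp_id, salary_component_type, val)
rows = [
    (1, "salary", 10000), (1, "bonus", 5000), (1, "hike_percent", 10),
    (2, "salary", 15000), (2, "bonus", 7000), (2, "hike_percent", 8),
    (3, "salary", 12000), (3, "bonus", 6000), (3, "hike_percent", 7),
]


def pivot_rows(long_rows, components=("salary", "bonus", "hike_percent")):
    """Hypothetical helper mimicking SUM(CASE WHEN ... THEN val END) ... GROUP BY emp_id."""
    # Each emp_id starts with every component set to None (the SQL NULL case).
    wide = defaultdict(lambda: dict.fromkeys(components))
    for emp_id, component, val in long_rows:
        row = wide[emp_id]  # GROUP BY emp_id: every emp_id gets an output row
        if component in row:  # rows matching no CASE branch contribute nothing
            row[component] = val if row[component] is None else row[component] + val
    return {emp_id: wide[emp_id] for emp_id in sorted(wide)}


pivoted = pivot_rows(rows)
for emp_id, cols in pivoted.items():
    print(emp_id, cols)
```

Each `CASE` branch acts as a filter that routes a value into one output column, and `SUM` collapses whatever lands there. With one row per employee and component, as in this data, the sum simply passes the single value through.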

### PySpark Solution

In [0]:
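# Import as F so that Python's built-in sum() is not shadowed
from pyspark.sql import functions as F

df = spark.table("emp_compensation")
display(df)

# One row per employee, one column per distinct salary_component_type
df_pivot = df.groupBy("emp_id").pivot("salary_component_type").agg(F.sum("val"))

# pivot() sorts the generated columns by value; reorder to match the SQL output
df_final = df_pivot.select(
    F.col("emp_id"),
    F.col("salary"),
    F.col("bonus"),
    F.col("hike_percent")
)
display(df_final)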

## Unpivot

### SQL Solution

In [0]:
%sql

select emp_id, 'salary' as salary_component_type, salary as val from emp_compensation_pivot
union all
select emp_id, 'bonus' as salary_component_type, bonus as val from emp_compensation_pivot
union all
select emp_id, 'hike_percent' as salary_component_type, hike_percent as val from emp_compensation_pivot;
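
One way to check that the `UNION ALL` unpivot really undoes the pivot is to run both steps and compare the result with the base table. The sketch below does this with Python's built-in `sqlite3` module. Using an in-memory SQLite database as a stand-in for the Databricks schema is an assumption made only so the check runs anywhere, and `create or replace table` is swapped for plain `create table` because SQLite does not accept the former.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Databricks catalog/schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table emp_compensation (emp_id int, salary_component_type varchar(20), val int);

insert into emp_compensation values
  (1,'salary',10000),(1,'bonus',5000),(1,'hike_percent',10),
  (2,'salary',15000),(2,'bonus',7000),(2,'hike_percent',8),
  (3,'salary',12000),(3,'bonus',6000),(3,'hike_percent',7);

-- Pivot: same conditional aggregation as in the notebook
create table emp_compensation_pivot as
select emp_id,
       sum(case when salary_component_type = 'salary' then val end) as salary,
       sum(case when salary_component_type = 'bonus' then val end) as bonus,
       sum(case when salary_component_type = 'hike_percent' then val end) as hike_percent
from emp_compensation
group by emp_id;
""")

# Unpivot: one SELECT per former column, glued together with UNION ALL
unpivoted = conn.execute("""
select emp_id, 'salary' as salary_component_type, salary as val from emp_compensation_pivot
union all
select emp_id, 'bonus', bonus from emp_compensation_pivot
union all
select emp_id, 'hike_percent', hike_percent from emp_compensation_pivot
""").fetchall()

original = conn.execute("select * from emp_compensation").fetchall()

# UNION ALL does not guarantee row order, so compare sorted copies.
round_trip_ok = sorted(unpivoted) == sorted(original)
print("round trip matches original:", round_trip_ok)
conn.close()
```

The round trip is lossless here because every employee has exactly one row per component. If the base table held duplicate rows for the same employee and component, the `SUM` in the pivot would merge them and the unpivot could not separate them again.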

### PySpark Solution

In [0]:
# stack(3, label1, col1, label2, col2, label3, col3) emits three (label, value) rows per input row
df_unpivot = df_final.selectExpr("emp_id", "stack(3, 'salary', salary, 'bonus', bonus, 'hike_percent', hike_percent) as (salary_component_type, val)")
display(df_unpivot)
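
As a rough mental model of what `stack(3, ...)` does to each row, the plain-Python sketch below expands every wide row into three long rows. It does not use Spark: the wide rows are written out by hand from the sample data, and `stack_rows` is a hypothetical helper named only for this illustration.

```python
# Wide rows as produced by the pivot step: (emp_id, salary, bonus, hike_percent)
wide_rows = [
    (1, 10000, 5000, 10),
    (2, 15000, 7000, 8),
    (3, 12000, 6000, 7),
]


def stack_rows(rows, labels=("salary", "bonus", "hike_percent")):
    """Hypothetical helper: emit one (emp_id, label, value) row per label."""
    for emp_id, *values in rows:
        # Pair each label with the value from the matching column position.
        for label, value in zip(labels, values):
            yield (emp_id, label, value)


long_rows = list(stack_rows(wide_rows))
for row in long_rows:
    print(row)
```

The label/value pairs passed to `stack()` play the same role as the separate `SELECT` branches in the `UNION ALL` version: each pair turns one column into one output row.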

## Learnings

### ✅ **Outcome**

* Successfully converted **tall/long data → wide format (pivot)** and then **wide → long format (unpivot)** using both **SQL** and **PySpark**.
* Got equivalent results in SQL tables and PySpark DataFrames.

---

### 🎯 **What I Learnt**

* How to **pivot** data in SQL using CASE + GROUP BY.
* How to **pivot** in PySpark using `groupBy().pivot().agg()`.
* How to **unpivot** in SQL using `UNION ALL`.
* How to **unpivot** in PySpark using `stack()`.
* Understanding of data reshaping between **row-based** and **column-based** formats.

---

### 🪜 **Steps Done**

1. Created base table `emp_compensation` (long format).
2. Built a pivoted table using:

   * SQL `CASE` expressions with `GROUP BY`
   * PySpark pivot function
3. Built an unpivoted version using:

   * SQL `UNION ALL`
   * PySpark `stack()` function

---

### 📘 **Main Topics Covered (SQL + PySpark)**

* **SQL**

  * `GROUP BY`
  * Conditional aggregation (`CASE WHEN`)
  * `UNION ALL`
  * Table creation
  * Manual pivot/unpivot logic

* **PySpark**

  * DataFrame creation using `spark.table()`
  * `groupBy()`
  * `pivot()`
  * `agg(sum())`
  * `selectExpr()`
  * `stack()` for unpivot
  * Column selection with `col()`

---