# 05 - Data Transformation

## Introduction

Data transformation is the core of ETL (Extract, Transform, Load) processes. This notebook covers grouping, merging, concatenating, and pivoting data, plus applying functions to columns.

## What You'll Learn

- GroupBy operations
- Merging DataFrames (joins)
- Concatenating DataFrames
- Pivot tables
- Applying functions to columns


In [1]:
import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['IT', 'Sales', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 70000, 55000]
})

df2 = pd.DataFrame({
    'Name': ['Eve', 'Frank', 'Alice', 'Bob'],
    'Age': [32, 28, 25, 30],
    'City': ['Sydney', 'Berlin', 'New York', 'London']
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)


DataFrame 1:
      Name Department  Salary
0    Alice         IT   50000
1      Bob      Sales   60000
2  Charlie         IT   70000
3    Diana  Marketing   55000

DataFrame 2:
    Name  Age      City
0    Eve   32    Sydney
1  Frank   28    Berlin
2  Alice   25  New York
3    Bob   30    London


## GroupBy Operations

GroupBy splits a DataFrame into groups by one or more columns, runs an operation on each group, and combines the results. It works much like SQL's `GROUP BY`.


In [2]:
# Group by Department and calculate mean salary
grouped = df1.groupby('Department')['Salary'].mean()
print("Average salary by department:")
print(grouped)


Average salary by department:
Department
IT           60000.0
Marketing    55000.0
Sales        60000.0
Name: Salary, dtype: float64


In [3]:
# Multiple aggregations
grouped_multi = df1.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print("Multiple aggregations by department:")
print(grouped_multi)


Multiple aggregations by department:
               mean     sum  count
Department                        
IT          60000.0  120000      2
Marketing   55000.0   55000      1
Sales       60000.0   60000      1
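
The multi-aggregation above names its output columns after the functions. As a minimal sketch, pandas' named aggregation (`keyword=(column, function)`) lets you choose the output names yourself. The variable names `employees` and `summary` and the output column names are illustrative assumptions, not part of the original notebook; the data is rebuilt here so the block runs on its own.

```python
import pandas as pd

# Same values as df1 above, rebuilt so this block is self-contained
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Department': ['IT', 'Sales', 'IT', 'Marketing'],
    'Salary': [50000, 60000, 70000, 55000]
})

# Named aggregation: output_column=(input_column, function)
# The output column names below are hypothetical, chosen for readability
summary = employees.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    total_salary=('Salary', 'sum'),
    headcount=('Name', 'count'),
)

# reset_index() turns the Department index back into a regular column
summary = summary.reset_index()
print(summary)
```

Resetting the index is optional; it is often convenient when the summary feeds a later merge or export step.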


## Merging DataFrames (Joins)

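Merging works like a SQL JOIN: `pd.merge` matches rows from two DataFrames on one or more shared key columns (here, `Name`), and the `how` argument controls which unmatched rows are kept. Unmatched rows are filled with `NaN`, which is why integer columns such as `Age` and `Salary` appear as floats in some of the outputs below.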


In [5]:
# Inner join (default)
merged_inner = pd.merge(df1, df2, on='Name', how='inner')
print("Inner join:")
print(merged_inner)


Inner join:
    Name Department  Salary  Age      City
0  Alice         IT   50000   25  New York
1    Bob      Sales   60000   30    London


In [6]:
# Left join
merged_left = pd.merge(df1, df2, on='Name', how='left')
print("Left join:")
print(merged_left)


Left join:
      Name Department  Salary   Age      City
0    Alice         IT   50000  25.0  New York
1      Bob      Sales   60000  30.0    London
2  Charlie         IT   70000   NaN       NaN
3    Diana  Marketing   55000   NaN       NaN


In [7]:
# Right join
merged_right = pd.merge(df1, df2, on='Name', how='right')
print("Right join:")
print(merged_right)


Right join:
    Name Department   Salary  Age      City
0    Eve        NaN      NaN   32    Sydney
1  Frank        NaN      NaN   28    Berlin
2  Alice         IT  50000.0   25  New York
3    Bob      Sales  60000.0   30    London


In [8]:
# Outer join (full outer)
merged_outer = pd.merge(df1, df2, on='Name', how='outer')
print("Outer join:")
print(merged_outer)


Outer join:
      Name Department   Salary   Age      City
0    Alice         IT  50000.0  25.0  New York
1      Bob      Sales  60000.0  30.0    London
2  Charlie         IT  70000.0   NaN       NaN
3    Diana  Marketing  55000.0   NaN       NaN
4      Eve        NaN      NaN  32.0    Sydney
5    Frank        NaN      NaN  28.0    Berlin
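
When checking a join, it can help to see which side each row came from. The sketch below uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The frames `left` and `right` are trimmed-down, illustrative copies of the sample data, not objects from the original notebook.

```python
import pandas as pd

# Trimmed-down copies of the sample data so this block is self-contained
left = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Salary': [50000, 60000, 70000, 55000]
})
right = pd.DataFrame({
    'Name': ['Eve', 'Frank', 'Alice', 'Bob'],
    'Age': [32, 28, 25, 30]
})

# indicator=True adds a '_merge' column recording each row's origin
audit = pd.merge(left, right, on='Name', how='outer', indicator=True)
print(audit[['Name', '_merge']])

# Count how many rows matched on both sides versus only one
origin_counts = audit['_merge'].value_counts()
print(origin_counts)
```

Filtering on `_merge == 'left_only'` is one way to list the keys that have no match in the right-hand DataFrame.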


## Concatenating DataFrames

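Concatenation stacks DataFrames along an axis: rows (`axis=0`, the default) or columns (`axis=1`). Passing `ignore_index=True` renumbers the resulting rows from 0.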


In [9]:
# Concatenate along rows (stack vertically)
df3 = pd.DataFrame({
    'Name': ['Grace', 'Henry'],
    'Department': ['IT', 'Sales'],
    'Salary': [80000, 65000]
})

concatenated = pd.concat([df1, df3], ignore_index=True)
print("Concatenated DataFrames:")
print(concatenated)


Concatenated DataFrames:
      Name Department  Salary
0    Alice         IT   50000
1      Bob      Sales   60000
2  Charlie         IT   70000
3    Diana  Marketing   55000
4    Grace         IT   80000
5    Henry      Sales   65000
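
The example above stacks rows. As a small sketch of the other direction, `axis=1` places DataFrames side by side, aligning rows on their index labels rather than their position. The frames `names`, `salaries`, and `shifted` are made up for illustration.

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Alice', 'Bob']})
salaries = pd.DataFrame({'Salary': [50000, 60000]})

# Both frames share the index 0..1, so rows line up one-to-one
side_by_side = pd.concat([names, salaries], axis=1)
print(side_by_side)

# With a different index, rows are matched by label and gaps become NaN
shifted = pd.DataFrame({'Salary': [50000, 60000]}, index=[1, 2])
misaligned = pd.concat([names, shifted], axis=1)
print(misaligned)
```

Because alignment is by index label, column-wise concatenation is safest when both inputs are known to share the same index; for matching on a key column, a merge is usually the better fit.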


## Pivot Tables

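Pivot tables reshape long data into a grid, much like an Excel pivot table: the values of one column become the row index, the values of another become the column headers, and a third column is aggregated into the cells.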


In [10]:
# Create sample data for pivot
df_sales = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
})

print("Original data:")
print(df_sales)

# Create pivot table
pivot = df_sales.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')
print("\nPivot table:")
print(pivot)


Original data:
         Date Product  Sales
0  2024-01-01       A    100
1  2024-01-01       B    150
2  2024-01-02       A    120
3  2024-01-02       B    180

Pivot table:
Product       A    B
Date                
2024-01-01  100  150
2024-01-02  120  180
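
In the data above, every Date and Product pair appears exactly once, so `aggfunc='sum'` has nothing to combine. The sketch below uses made-up data (`orders`, an assumption for illustration) with a repeated pair and a missing pair to show what the aggregation and the `fill_value` argument of `pivot_table` do.

```python
import pandas as pd

# Hypothetical data: product A sold twice on 2024-01-01,
# and product B has no sales on 2024-01-02
orders = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02'],
    'Product': ['A', 'A', 'B', 'A'],
    'Sales': [100, 50, 150, 120]
})

# aggfunc='sum' combines the duplicate rows;
# fill_value=0 replaces the empty cell for the missing pair
totals = orders.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum',
    fill_value=0,
)
print(totals)
```

Without `fill_value`, the missing combination would show up as `NaN` instead of 0.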


## Applying Functions to Columns

`Series.apply` calls a function once for each value in a column and returns the results as a new Series, which you can assign to a new column.


In [11]:
# Apply a function to a column
df1['Salary_K'] = df1['Salary'].apply(lambda x: x / 1000)
print("After applying function:")
print(df1)


After applying function:
      Name Department  Salary  Salary_K
0    Alice         IT   50000      50.0
1      Bob      Sales   60000      60.0
2  Charlie         IT   70000      70.0
3    Diana  Marketing   55000      55.0
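
For simple arithmetic like the division above, plain column arithmetic gives the same result without a per-value function call, while `apply` remains useful for logic that has no built-in equivalent. The sketch below shows both. The `staff` frame, the `salary_band` function, and its 60000 threshold are hypothetical, chosen only for illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Salary': [50000, 60000, 70000, 55000]
})

# Vectorized arithmetic: same result as .apply(lambda x: x / 1000)
staff['Salary_K'] = staff['Salary'] / 1000


def salary_band(salary):
    """Hypothetical banding rule; the 60000 threshold is an assumption."""
    return 'High' if salary >= 60000 else 'Standard'


# apply() with a named function for logic beyond simple arithmetic
staff['Band'] = staff['Salary'].apply(salary_band)
print(staff)
```

A named function is easier to test and reuse than a lambda once the logic grows beyond a single expression.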


## Summary

In this notebook, you learned:
- ✅ How to use GroupBy for aggregations
- ✅ How to merge DataFrames (inner, left, right, outer joins)
- ✅ How to concatenate DataFrames
- ✅ How to create pivot tables
- ✅ How to apply functions to columns

**Next:** Learn data aggregation in `06_data_aggregation.ipynb`
