In [1]:
import pandas as pd

# GroupBy & Aggregation

In [2]:
# Sample DataFrame
df = pd.DataFrame({
    "Employee": ["Onkar", "Amit", "Sara", "Rohit", "Neha"],
    "Department": ["IT", "IT", "HR", "IT", "HR"],
    "Salary": [50000, 65000, 55000, 70000, 48000],
    "Experience": [1, 3, 2, 5, 1]
})

df

|   | Employee | Department | Salary | Experience |
| - | - | - | - | - |
| 0 | Onkar | IT | 50000 | 1 |
| 1 | Amit | IT | 65000 | 3 |
| 2 | Sara | HR | 55000 | 2 |
| 3 | Rohit | IT | 70000 | 5 |
| 4 | Neha | HR | 48000 | 1 |


## 1. What is `groupby()`? Concept

`groupby()` follows the Split -> Apply -> Combine pattern:
1. Split the data into groups
2. Apply a calculation to each group
3. Combine the results
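
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the outcome with a single `groupby()` call. The variable names (`groups`, `means`, `manual`, `direct`) are illustrative only, and the DataFrame is a trimmed copy of the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "IT", "HR"],
    "Salary": [50000, 65000, 55000, 70000, 48000],
})

# 1. Split: one sub-DataFrame per distinct department
groups = {dept: rows for dept, rows in df.groupby("Department")}

# 2. Apply: compute a value for each group
means = {dept: rows["Salary"].mean() for dept, rows in groups.items()}

# 3. Combine: gather the per-group values into a single Series
manual = pd.Series(means, name="Salary").rename_axis("Department")

# The same three steps in one expression
direct = df.groupby("Department")["Salary"].mean()
```

Both `manual` and `direct` hold one average salary per department; in practice the one-line form is the one to use.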

## 2. Basic groupby -> one column

Selecting one column after `groupby()` and aggregating it returns a Series with the department names as its index.

In [5]:
# Average salary per department
df.groupby("Department")["Salary"].mean()

Department
HR    51500.000000
IT    61666.666667
Name: Salary, dtype: float64

In [7]:
# Total salary per department
df.groupby("Department")["Salary"].sum()

Department
HR    103000
IT    185000
Name: Salary, dtype: int64

In [8]:
# Employee count per department
df.groupby("Department")["Employee"].count()

Department
HR    2
IT    3
Name: Employee, dtype: int64
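
One detail worth keeping in mind: `count()` counts only non-missing values in the selected column, while `size()` counts every row in the group. The sample data has no missing values, so the sketch below uses a small hypothetical DataFrame (`df_missing`) with one missing name to show the difference.

```python
import pandas as pd

# Hypothetical data: the second IT employee has no name recorded
df_missing = pd.DataFrame({
    "Employee": ["Onkar", None, "Sara"],
    "Department": ["IT", "IT", "HR"],
})

# count() skips the missing Employee value
non_null = df_missing.groupby("Department")["Employee"].count()

# size() counts all rows in each group, missing values included
all_rows = df_missing.groupby("Department").size()
```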

## 3. GroupBy Multiple Aggregations — `.agg()`

In [9]:
df.groupby("Department")["Salary"].agg(["mean", "sum", "min", "max"])

| Department | mean | sum | min | max |
| - | - | - | - | - |
| HR | 51500.0 | 103000 | 48000 | 55000 |
| IT | 61666.666667 | 185000 | 50000 | 70000 |


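Instead of passing a list of function names for a single column, we can use named aggregation: each keyword becomes an output column name, and its value is a `(column, function)` pair.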

In [11]:
df.groupby("Department").agg(
    avg_salary = ("Salary", "mean"),
    max_salary = ("Salary", "max"),
    min_salary = ("Salary", "min"),
    emp_count = ("Employee", "count")
)

| Department | avg_salary | max_salary | min_salary | emp_count |
| - | - | - | - | - |
| HR | 51500.0 | 55000 | 48000 | 2 |
| IT | 61666.666667 | 70000 | 50000 | 3 |


## 4. Groupby multiple columns

In [12]:
df.groupby(["Department", "Experience"])["Salary"].mean()

Department  Experience
HR          1             48000.0
            2             55000.0
IT          1             50000.0
            3             65000.0
            5             70000.0
Name: Salary, dtype: float64

In [14]:
df.groupby(["Department", "Experience"]).agg(
    emp_count = ("Employee", "count"),
    max_salary = ("Salary", "max")
)

| Department | Experience | emp_count | max_salary |
| - | - | - | - |
| HR | 1 | 1 | 48000 |
| HR | 2 | 1 | 55000 |
| IT | 1 | 1 | 50000 |
| IT | 3 | 1 | 65000 |
| IT | 5 | 1 | 70000 |


## 5. GroupBy + Multiple Columns Aggregation

In [16]:
df.groupby("Department").agg({
    "Salary": ["mean", "sum"],
    "Experience": "mean"
})

| Department | Salary (mean) | Salary (sum) | Experience (mean) |
| - | - | - | - |
| HR | 51500.0 | 103000 | 1.5 |
| IT | 61666.666667 | 185000 | 3.0 |
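
The dictionary form produces two-level column labels such as `("Salary", "mean")`, which can be awkward to select later. One possible way to flatten them, shown here as a sketch rather than the only approach, is to join the two levels with an underscore. The name `summary` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "IT", "HR"],
    "Salary": [50000, 65000, 55000, 70000, 48000],
    "Experience": [1, 3, 2, 5, 1],
})

summary = df.groupby("Department").agg({
    "Salary": ["mean", "sum"],
    "Experience": "mean",
})

# summary.columns is a MultiIndex of (column, function) tuples;
# join each tuple into a single flat label
summary.columns = ["_".join(col) for col in summary.columns]
```

After this, the columns are `Salary_mean`, `Salary_sum`, and `Experience_mean`. Named aggregation (section 3) avoids the two-level labels in the first place.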


## 6. Reset Index After GroupBy

In [17]:
result = df.groupby("Department")["Salary"].mean()
result

Department
HR    51500.000000
IT    61666.666667
Name: Salary, dtype: float64

In [18]:
result.reset_index()

|   | Department | Salary |
| - | - | - |
| 0 | HR | 51500.0 |
| 1 | IT | 61666.666667 |
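
As an alternative to calling `reset_index()` afterwards, `groupby()` accepts `as_index=False`, which keeps the grouping key as a regular column. A short sketch comparing the two routes (the names `via_reset` and `flat` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "IT", "HR"],
    "Salary": [50000, 65000, 55000, 70000, 48000],
})

# Route 1: aggregate, then move the index back into a column
via_reset = df.groupby("Department")["Salary"].mean().reset_index()

# Route 2: ask groupby not to use the key as the index
flat = df.groupby("Department", as_index=False)["Salary"].mean()
```

Both results are DataFrames with `Department` and `Salary` columns and a default integer index.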


## 7. `transform()` — Return Same Shape as Original Data

In [19]:
df["Dept_Avg_Salary"] = df.groupby("Department")["Salary"].transform("mean")

In [20]:
df

|   | Employee | Department | Salary | Experience | Dept_Avg_Salary |
| - | - | - | - | - | - |
| 0 | Onkar | IT | 50000 | 1 | 61666.666667 |
| 1 | Amit | IT | 65000 | 3 | 61666.666667 |
| 2 | Sara | HR | 55000 | 2 | 51500.0 |
| 3 | Rohit | IT | 70000 | 5 | 61666.666667 |
| 4 | Neha | HR | 48000 | 1 | 51500.0 |


## Difference between `agg()` and `transform()`

| Feature     | `agg()`           | `transform()`            |
| - | - | - |
| Output rows | One per group     | Same as original         |
| Use case    | Summary           | Feature creation         |
| Index       | Group keys        | Original row index       |
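
The difference is easy to check on the sample data. The sketch below runs both methods on the same grouping and then uses the `transform()` result to build a per-row feature; the column name `Diff_From_Dept_Avg` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Onkar", "Amit", "Sara", "Rohit", "Neha"],
    "Department": ["IT", "IT", "HR", "IT", "HR"],
    "Salary": [50000, 65000, 55000, 70000, 48000],
})

grouped = df.groupby("Department")["Salary"]

# agg(): one row per department
summary = grouped.agg("mean")

# transform(): one row per original row, aligned with df's index
per_row = grouped.transform("mean")

# Because per_row lines up with df, it can be used directly in arithmetic
df["Diff_From_Dept_Avg"] = df["Salary"] - per_row
```

Here `summary` has 2 rows and `per_row` has 5, and `Diff_From_Dept_Avg` shows how far each salary sits from its department average.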
