# Grouping and Aggregation in Pandas (groupby)

Grouping and aggregation in pandas let you split a dataset into groups that share a key (typically the values of one or more columns) and then compute summary statistics or transformations for each group.

**Key steps:**
1. **Split**: Divide the dataset into groups based on column values.
2. **Apply**: Perform aggregation or transformation on each group.
3. **Combine**: Collect the per-group results into a single Series or DataFrame.
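
The three steps above can be spelled out by hand to show what `groupby` does in one call. This is only a sketch on a hypothetical toy table (`toy`, `team`, and `points` are illustrative names, not part of the dataset used later):

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    "team": ["a", "a", "b"],
    "points": [1, 2, 10],
})

pieces = {}
# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
for key, group in toy.groupby("team"):
    # Apply: reduce each group to a single number
    pieces[key] = group["points"].sum()

# Combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="points")

# The usual one-line form performs the same three steps
builtin = toy.groupby("team")["points"].sum()
```

In practice the one-line form is preferable; the explicit loop is here only to make the split, apply, and combine stages visible.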

**Common methods:**
- `sum()`, `mean()`, `count()`, `min()`, `max()`
- Applying multiple aggregations using `.agg()`
- Grouping by multiple columns

Grouping is especially useful for summarizing large datasets and finding patterns, for example comparing the average salary across sectors.


In [13]:
import pandas as pd

In [14]:
# Sample dataset
data = {
    "Sector": ["Music", "Music", "IT", "Finance", "HR"],
    "Name": ["Hayley", "Taylor", "Claire", "Aurora", "Evangeline"],
    "Salary": [100000, 95000, 70000, 80000, 90000],
    "Age": [25, 30, 35, 40, 45],
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

Original DataFrame:
     Sector        Name  Salary  Age
0    Music      Hayley  100000   25
1    Music      Taylor   95000   30
2       IT      Claire   70000   35
3  Finance      Aurora   80000   40
4       HR  Evangeline   90000   45


In [15]:
# 1. Group by Sector and find mean salary
mean_salary = df.groupby("Sector")["Salary"].mean()
print("\nMean Salary by Sector:\n", mean_salary)


Mean Salary by Sector:
 Sector
Finance    80000.0
HR         90000.0
IT         70000.0
Music      97500.0
Name: Salary, dtype: float64
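
The mean above collapses each sector to a single value. The introduction also mentions transformations, which the examples here do not otherwise show. As a minimal sketch, `transform` returns one value per original row, so the result can be stored as a new column. The column names `SectorMean` and `DiffFromMean` are illustrative assumptions, and the DataFrame is rebuilt so the block runs on its own:

```python
import pandas as pd

# Same Sector and Salary values as the sample dataset
df = pd.DataFrame({
    "Sector": ["Music", "Music", "IT", "Finance", "HR"],
    "Salary": [100000, 95000, 70000, 80000, 90000],
})

# Aggregation: one row per sector
per_sector = df.groupby("Sector")["Salary"].mean()

# Transformation: the sector mean repeated for every row of that sector,
# aligned to the original index
df["SectorMean"] = df.groupby("Sector")["Salary"].transform("mean")

# Each person's salary relative to their sector's mean
df["DiffFromMean"] = df["Salary"] - df["SectorMean"]
```

This pattern is handy when you want to compare each row against its own group rather than replace the rows with a summary.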


In [16]:
# 2. Group by Sector and get multiple aggregations
agg_stats = df.groupby("Sector").agg({
    "Salary": ["mean", "max", "min"],
    "Age": ["mean", "count"]
})
print("\nMultiple Aggregations by Sector:\n", agg_stats)


Multiple Aggregations by Sector:
           Salary                  Age      
            mean     max    min  mean count
Sector                                     
Finance  80000.0   80000  80000  40.0     1
HR       90000.0   90000  90000  45.0     1
IT       70000.0   70000  70000  35.0     1
Music    97500.0  100000  95000  27.5     2
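
Passing a dictionary of lists to `.agg()` produces the two-level column header shown above. If flat column names are easier to work with, pandas also supports named aggregation, where each keyword becomes an output column. The output names below (`mean_salary`, `headcount`, and so on) are illustrative choices:

```python
import pandas as pd

# Same Sector, Salary, and Age values as the sample dataset
df = pd.DataFrame({
    "Sector": ["Music", "Music", "IT", "Finance", "HR"],
    "Salary": [100000, 95000, 70000, 80000, 90000],
    "Age": [25, 30, 35, 40, 45],
})

# Named aggregation: output_name=(input_column, aggregation)
flat_stats = df.groupby("Sector").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    mean_age=("Age", "mean"),
    headcount=("Age", "count"),
)
```

The result has a single level of columns, so a value can be read with `flat_stats.loc["Music", "mean_salary"]` instead of a tuple key.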


In [17]:
# 3. Group by multiple columns (Sector & Age Category)
# Ages under 35 are labelled "Young"; 35 and over are labelled "Senior"
df["Age Category"] = df["Age"].apply(lambda x: "Young" if x < 35 else "Senior")
multi_group = df.groupby(["Sector", "Age Category"])["Salary"].mean()
print("\nMean Salary by Sector and Age Category:\n", multi_group)


Mean Salary by Sector and Age Category:
 Sector   Age Category
Finance  Senior          80000.0
HR       Senior          90000.0
IT       Senior          70000.0
Music    Young           97500.0
Name: Salary, dtype: float64


In [18]:
# 4. Reset the index so the group keys become regular columns
multi_group_reset = multi_group.reset_index()
print("\nGrouped Data with Reset Index:\n", multi_group_reset)


Grouped Data with Reset Index:
     Sector Age Category   Salary
0  Finance       Senior  80000.0
1       HR       Senior  90000.0
2       IT       Senior  70000.0
3    Music        Young  97500.0
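
An alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby`, which keeps the group keys as ordinary columns from the start. The sketch below rebuilds the sample data so it runs on its own; the variable names `via_reset` and `flat` are illustrative:

```python
import pandas as pd

# Same Sector, Salary, and Age values as the sample dataset
df = pd.DataFrame({
    "Sector": ["Music", "Music", "IT", "Finance", "HR"],
    "Salary": [100000, 95000, 70000, 80000, 90000],
    "Age": [25, 30, 35, 40, 45],
})
df["Age Category"] = df["Age"].apply(lambda x: "Young" if x < 35 else "Senior")

keys = ["Sector", "Age Category"]

# Two-step form: group keys land in the index, then move back to columns
via_reset = df.groupby(keys)["Salary"].mean().reset_index()

# One-step form: group keys stay as columns
flat = df.groupby(keys, as_index=False)["Salary"].mean()
```

Both forms give a DataFrame with `Sector`, `Age Category`, and `Salary` columns; which one to use is mostly a matter of taste.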


# Real-World Analogy: Sorting Mail

Imagine you have a large pile of letters:
- **Grouping**: You first sort them by city (like `groupby()` by a column).
- **Aggregation**: For each city, you count how many letters there are (`count()`), or find the highest postage fee (`max()`).
- **Multiple grouping**: You might sort by both city and neighborhood.

Just like organizing mail makes deliveries faster, grouping data makes it easier to analyze patterns.
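
The analogy maps directly onto code. The following sketch uses an invented `letters` table (the cities, neighborhoods, and postage values are made up for illustration) to count letters and find the highest postage per city, then count per city and neighborhood:

```python
import pandas as pd

# Hypothetical pile of letters, invented for this illustration
letters = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Oslo", "Bergen"],
    "neighborhood": ["East", "West", "North", "East", "North"],
    "postage": [1.5, 2.0, 1.0, 3.5, 2.5],
})

# Grouping + aggregation: letters per city and highest postage per city
by_city = letters.groupby("city")["postage"].agg(["count", "max"])

# Multiple grouping: letters per (city, neighborhood) pair
by_city_and_hood = letters.groupby(["city", "neighborhood"])["postage"].count()
```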