# 06 - Data Aggregation

## Introduction

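Aggregation operations summarize data by reducing many rows to a few statistics, such as a sum, mean, or count. In pandas this usually means grouping rows with `groupby` and then applying one or more aggregation functions, which is the basis of most reporting and analysis work.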

## What You'll Learn

- Basic aggregations (sum, mean, count, etc.)
- Custom aggregations
- Multiple aggregations
- Window functions
- Rolling calculations


In [2]:
import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 55000, 65000, 70000, 75000],
    'Sales': [100000, 120000, 150000, 180000, 200000, 220000]
})

print("Sample DataFrame:")
print(df)


Sample DataFrame:
  Department Employee  Salary   Sales
0         IT    Alice   50000  100000
1         IT      Bob   60000  120000
2      Sales  Charlie   55000  150000
3      Sales    Diana   65000  180000
4  Marketing      Eve   70000  200000
5  Marketing    Frank   75000  220000


## Basic Aggregations

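Common aggregation functions include `sum`, `mean`, `median`, `min`, `max`, `count`, and `std` (standard deviation). Each one reduces a group to a single value. Note that `count` counts only non-missing values, and `std` returns the sample standard deviation by default (`ddof=1`).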


In [None]:
# Group by Department and calculate various aggregations
print("Sum by department:")
print(df.groupby('Department')['Salary'].sum())

print("\nMean by department:")
print(df.groupby('Department')['Salary'].mean())

print("\nCount by department:")
print(df.groupby('Department')['Salary'].count())
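
The cell above covers `sum`, `mean`, and `count`. As a minimal sketch of two of the other functions listed, the snippet below applies `median` and `std` to the same grouped column. It rebuilds a reduced copy of the sample DataFrame so it can run on its own; the variable names `salary_by_dept`, `median_salary`, and `std_salary` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Reduced copy of the sample data (only the columns this example needs)
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'Salary': [50000, 60000, 55000, 65000, 70000, 75000],
})

# Reuse one grouped object for several aggregations
salary_by_dept = df.groupby('Department')['Salary']

median_salary = salary_by_dept.median()
std_salary = salary_by_dept.std()  # sample standard deviation (ddof=1)

print("Median by department:")
print(median_salary)

print("\nStandard deviation by department:")
print(std_salary)
```

Storing the grouped object once avoids repeating the `groupby` call for every statistic.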


In [3]:
# Multiple aggregations at once
agg_result = df.groupby('Department').agg({
    'Salary': ['mean', 'sum', 'min', 'max'],
    'Sales': ['sum', 'mean']
})
print("Multiple aggregations:")
print(agg_result)


Multiple aggregations:
             Salary                         Sales          
               mean     sum    min    max     sum      mean
Department                                                 
IT          55000.0  110000  50000  60000  220000  110000.0
Marketing   72500.0  145000  70000  75000  420000  210000.0
Sales       60000.0  120000  55000  65000  330000  165000.0
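
The learning goals also mention custom aggregations. One possible approach, sketched here under the assumption of pandas 0.25 or later, is named aggregation: each keyword passed to `agg` becomes a flat output column, and the aggregation can be either a built-in name or your own function. The helper `salary_range` and the output column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Copy of the sample data so the example runs on its own
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'Salary': [50000, 60000, 55000, 65000, 70000, 75000],
    'Sales': [100000, 120000, 150000, 180000, 200000, 220000],
})

def salary_range(s):
    """Custom aggregation: spread between the highest and lowest value."""
    return s.max() - s.min()

# Named aggregation: output_column=(input_column, function or function name)
custom = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    salary_range=('Salary', salary_range),
    total_sales=('Sales', 'sum'),
)

print("Custom aggregations:")
print(custom)
```

Compared with the dictionary form, named aggregation produces single-level column names, which are usually easier to reference later.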


## Window Functions

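Window functions perform calculations across a set of rows related to the current row. Unlike an aggregation, which collapses each group to a single row, a window function returns one value per input row, so the result can be assigned back to the DataFrame as a new column. The examples below use grouped `rank` and `cumsum`.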


In [6]:
# Rank salaries within each department (1.0 = highest salary;
# ranks are floats because ties share the average rank by default)
df['Salary_Rank'] = df.groupby('Department')['Salary'].rank(ascending=False)
print("DataFrame with salary rank:")
print(df[['Department', 'Employee', 'Salary', 'Salary_Rank']])


DataFrame with salary rank:
  Department Employee  Salary  Salary_Rank
0         IT    Alice   50000          2.0
1         IT      Bob   60000          1.0
2      Sales  Charlie   55000          2.0
3      Sales    Diana   65000          1.0
4  Marketing      Eve   70000          2.0
5  Marketing    Frank   75000          1.0


In [7]:
# Calculate cumulative sum by department (accumulates in row order within each group)
df['Cumulative_Salary'] = df.groupby('Department')['Salary'].cumsum()
print("Cumulative salary by department:")
print(df[['Department', 'Employee', 'Salary', 'Cumulative_Salary']])


Cumulative salary by department:
  Department Employee  Salary  Cumulative_Salary
0         IT    Alice   50000              50000
1         IT      Bob   60000             110000
2      Sales  Charlie   55000              55000
3      Sales    Diana   65000             120000
4  Marketing      Eve   70000              70000
5  Marketing    Frank   75000             145000
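
Another common window-style operation is broadcasting a group statistic back to every row with `transform`. The sketch below is an illustration rather than part of the original notebook; the column names `Dept_Mean_Salary` and `Diff_From_Dept_Mean` are assumptions.

```python
import pandas as pd

# Copy of the sample data so the example runs on its own
df = pd.DataFrame({
    'Department': ['IT', 'IT', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 55000, 65000, 70000, 75000],
})

# transform returns one value per original row, aligned to df's index
df['Dept_Mean_Salary'] = df.groupby('Department')['Salary'].transform('mean')

# Compare each employee with their own department's average
df['Diff_From_Dept_Mean'] = df['Salary'] - df['Dept_Mean_Salary']

print("Salary compared with department mean:")
print(df)
```

Because the result keeps the original index, it can be used directly in row-level arithmetic without a merge.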


## Rolling Calculations

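Rolling calculations compute statistics over a moving window of rows. With `window=3`, each result is computed from the current row and the two rows before it, so the first two rows are `NaN` because their window is incomplete.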


In [8]:
# Create time series data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
df_ts = pd.DataFrame({
    'Date': dates,
    'Sales': [100, 120, 110, 130, 125, 140, 135, 150, 145, 160]
})

# Calculate rolling mean over a 3-row window
# (equivalent to 3 days here because the data is daily with no gaps)
df_ts['Rolling_Mean_3'] = df_ts['Sales'].rolling(window=3).mean()
print("Time series with rolling mean:")
print(df_ts)


Time series with rolling mean:
        Date  Sales  Rolling_Mean_3
0 2024-01-01    100             NaN
1 2024-01-02    120             NaN
2 2024-01-03    110      110.000000
3 2024-01-04    130      120.000000
4 2024-01-05    125      121.666667
5 2024-01-06    140      131.666667
6 2024-01-07    135      133.333333
7 2024-01-08    150      141.666667
8 2024-01-09    145      143.333333
9 2024-01-10    160      151.666667
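
If the leading `NaN` values are not wanted, one option is the `min_periods` parameter of `rolling`, which sets how many observations a window needs before it yields a result. This is a minimal sketch on the same series; the column name `Rolling_Mean_3_Partial` is an assumption made for illustration.

```python
import pandas as pd

# Same daily series as above
dates = pd.date_range('2024-01-01', periods=10, freq='D')
df_ts = pd.DataFrame({
    'Date': dates,
    'Sales': [100, 120, 110, 130, 125, 140, 135, 150, 145, 160],
})

# min_periods=1 lets early windows use however many rows are available
df_ts['Rolling_Mean_3_Partial'] = (
    df_ts['Sales'].rolling(window=3, min_periods=1).mean()
)

print("Rolling mean with partial windows:")
print(df_ts)
```

The first two values are then averages over one and two rows respectively, so they are not directly comparable with the full three-row averages that follow.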


## Summary

In this notebook, you learned:
- ✅ Basic aggregation functions (sum, mean, count, etc.)
- ✅ Multiple aggregations
- ✅ Window functions (rank, cumulative sum)
- ✅ Rolling calculations

**Next:** Learn datetime operations in `07_datetime_operations.ipynb`
