# Basic Aggregations in Pandas

### What Are Aggregations?

In data science and machine learning, **aggregation** refers to the process of summarizing a dataset by computing metrics such as **totals, averages, minimums, maximums, and counts**. These statistics help us understand the distribution, trends, and structure of the data without having to look at each record individually.

For instance, in the Titanic dataset, instead of checking every passenger’s age, we might ask: “What is the **average age**?” or “What is the **total fare collected**?” These answers come from aggregations.

In Pandas, aggregation methods are **built-in** and optimized for speed. They work on both **Series** (individual columns) and **DataFrames** (entire tables): a Series aggregation returns a single value, while a DataFrame aggregation returns one value per column. These functions are central to **EDA (Exploratory Data Analysis)**, a step that typically comes before modeling in AI/ML workflows.
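
As a minimal sketch of the Series versus DataFrame distinction, the snippet below aggregates a small made-up table. The `toy` DataFrame and its values are invented for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the Titanic file.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Aggregating a Series (one column) returns a single scalar.
mean_age = toy["Age"].mean()

# Aggregating a DataFrame returns a Series with one value per column.
column_means = toy.mean()

print(mean_age)
print(column_means)
```

The same pattern applies to `.sum()`, `.min()`, `.max()`, and `.median()`.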

### Why Aggregations Matter in AI/ML

Aggregations form the backbone of **data profiling**, which is the act of analyzing our dataset’s contents and structure. Without aggregation:

- We won’t know if a column is heavily skewed.
- We might miss extreme outliers that could break our model.
- We might not notice data imbalance in a classification task.
- We could end up training on biased or misrepresented data.

Imagine trying to build a model to predict survival on the Titanic without knowing **how many survivors there are** or whether the **age** or **fare** distributions are skewed. The model might fail silently or make misleading predictions.

In short, we **can’t build clean or fair ML models without knowing our data**, and we can’t know our data without aggregation.
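
One quick way to check for the skew mentioned above is to compare the mean with the median: a mean far above the median suggests a long right tail. The sketch below uses a handful of hypothetical fare values (not the real column) and the `Series.skew()` method as a second signal.

```python
import pandas as pd

# Hypothetical fares: mostly cheap tickets plus one very expensive one.
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 512.33])

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```

When the mean and median disagree this much, the median is usually the safer "typical value" to report.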

### Most Common Aggregation Methods in Pandas

| Function | Purpose |
| --- | --- |
| `.sum()` | Total value |
| `.mean()` | Average |
| `.median()` | Middle value |
| `.min()` / `.max()` | Smallest/Largest |
| `.count()` | Count of non-null entries |
| `.value_counts()` | Frequency of each unique value |
| `.describe()` | Summary statistics (count, mean, std, min, quartiles, max) for numeric columns |
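
Because `.count()` only counts non-null entries, it can differ from the number of rows. The following sketch, using a hypothetical `ages` Series with two missing values, shows the difference and also that `.mean()` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

total_rows = len(ages)        # counts every row, missing or not
non_null = ages.count()       # counts only non-null values
missing = ages.isna().sum()   # number of missing values
mean_age = ages.mean()        # NaN is skipped by default (skipna=True)

print(total_rows, non_null, missing, mean_age)
```

Comparing `len(df)` with `df.count()` is a simple way to see which columns have gaps.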

### Examples

In [1]:
import pandas as pd
df = pd.read_csv("data/train.csv")

# Average age
print(df['Age'].mean())  # e.g., 29.69

# Total fare
print(df['Fare'].sum())  # e.g., 28693.949

# Age range
print(df['Age'].min(), df['Age'].max())

# Count of non-null entries per column
print(df.count())

# Frequency of passenger classes
print(df['Pclass'].value_counts())

# Survival distribution
print(df['Survived'].value_counts())

29.69911764705882
28693.9493
0.42 80.0
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
Survived
0    549
1    342
Name: count, dtype: int64


Use `.describe()` to get summary statistics for every numeric column at once:

In [2]:
print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


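For each numeric column, the summary includes:

- count (number of non-null values)
- mean
- std (standard deviation)
- min/max
- 25%, 50%, 75% percentiles (the 50% row is the median)

By default, non-numeric columns such as `Name`, `Sex`, `Ticket`, `Cabin`, and `Embarked` are left out of the summary.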
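
To make the behavior of `.describe()` concrete, here is a small sketch on made-up data (the `toy` DataFrame is an assumption for illustration). It checks that the `50%` row matches `.median()` and shows what `.describe()` returns for a text column.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Sex": ["male", "female", "female", "female", "male"],
})

summary = toy.describe()          # numeric columns only by default
print(summary.columns.tolist())   # 'Sex' is not included

# The 50% row is the median of each numeric column.
print(summary.loc["50%", "Age"], toy["Age"].median())

# Text columns get count, unique, top, and freq instead.
sex_summary = toy["Sex"].describe()
print(sex_summary)
```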

### When & Where to Use These

| Use Case | Method |
| --- | --- |
| Checking how values are spread across categories | `.value_counts()` |
| Imputing missing values with a typical value | `.mean()`, `.median()` |
| Checking a column's range before scaling | `.min()`, `.max()` |
| Understanding column distribution | `.describe()` |
| Validating class balance for ML | `.value_counts()` on target variable |
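
Two of the rows above can be sketched together: filling missing values with the median, then using `.min()` and `.max()` for min-max scaling. The `ages` values are hypothetical, and median imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 80.0])

# Fill missing ages with the median, which is less sensitive to outliers than the mean.
filled = ages.fillna(ages.median())

# Min-max scaling maps the observed range onto [0, 1].
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(scaled.round(3).tolist())
```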

In classification tasks (like survival prediction), knowing the **class balance** is critical. If 90% of passengers did not survive and we fail to notice, a model could learn to always predict “no survival” and still be 90% accurate while being useless for identifying survivors.
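
The accuracy of such an "always predict the majority class" model can be read straight off `.value_counts(normalize=True)`. The sketch below uses a hypothetical 90/10 target column to mirror the scenario above. For the Titanic counts shown earlier (549 of 891 did not survive), the same calculation gives a baseline of roughly 62%.

```python
import pandas as pd

# Hypothetical target column: 9 of 10 passengers did not survive.
survived = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="Survived")

# Share of each class instead of raw counts.
shares = survived.value_counts(normalize=True)

majority_class = shares.idxmax()   # the most frequent label
baseline_accuracy = shares.max()   # accuracy of always predicting that label

print(majority_class, baseline_accuracy)
```

Any trained model should be judged against this baseline rather than against 0%.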

### Exercises

Q1. What is the mean fare paid by passengers?

In [3]:
print(df['Fare'].mean())

32.204207968574636


Q2. What’s the total number of passengers who survived?

In [4]:
print(df['Survived'].sum())

342


Q3. Show how many passengers were in each class.

In [5]:
print(df['Pclass'].value_counts())

Pclass
3    491
1    216
2    184
Name: count, dtype: int64


Q4. Get the minimum and maximum ages.

In [6]:
print(df['Age'].min(), df['Age'].max())

0.42 80.0


Q5. Show the summary statistics for the numeric columns of the dataset.

In [7]:
print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


### Summary

Basic aggregations are simple but incredibly powerful. They help us spot patterns, errors, biases, and opportunities for feature engineering. They are used:

- Before visualizing
- Before modeling
- During cleaning
- While interpreting results

In machine learning, these aggregation stats help us:

- Choose thresholds (mean, median)
- Design preprocessing logic
- Balance datasets
- Debug performance issues

Learning to **read and trust our data** is the first skill of a serious AI/ML engineer — and **aggregations are our first set of tools.**