# **Pandas Data Transformations**

In [47]:
# Group by and aggregations: data transformations are the bridge between raw data and actionable insights.
import pandas as pd
# Raw string so the backslashes in the Windows path are not read as escape sequences.
penguins_body_mass = pd.read_csv(r"..\Datasets\penguins.csv")

# Aggregations in pandas

Aggregation operations summarize data within groups, reducing multiple values to a single value per group. They are a cornerstone of the split-apply-combine strategy in pandas' groupby:

- Split: Break the DataFrame into groups based on a key column (e.g., by `species`).
- Apply: Compute a summary (e.g., sum, mean, count) for each group.
- Combine: Collect results into a new DataFrame or Series.
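
The three steps above can be spelled out by hand to show what `groupby` does internally. This is a minimal sketch on a small made-up DataFrame (the values are illustrative, not taken from the penguins file); the names `manual_means` and `groupby_means` are introduced here only for the comparison.

```python
import pandas as pd

# Toy data standing in for the penguins dataset (illustrative values only).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500.0, 3900.0, 5000.0, 5200.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean body mass of each sub-DataFrame.
manual = {key: group["body_mass_g"].mean() for key, group in df.groupby("species")}

# Combine: collect the per-group results into a single Series.
manual_means = pd.Series(manual, name="body_mass_g")

# The usual one-liner performs the same three steps in one call.
groupby_means = df.groupby("species")["body_mass_g"].mean()

print(manual_means)
print(groupby_means)
```

Both Series should hold the same per-species means; the explicit loop is only for illustration and is slower than the built-in aggregation.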

Basic aggregation functions include:
- Numeric: `mean()`, `sum()`, `min()`, `max()`, `std()`, `median()`.
- General: `count()` (counts non-NaN values), `nunique()` (counts unique values), `size()` (counts total rows including NaNs).
- Custom: Use `.agg()` with functions or lambdas, e.g., `.agg(lambda x: max(x) - min(x))` for range, or `.agg(list)` to collect values.
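
The custom and general aggregations listed above might look like this in practice. The sketch uses a small made-up DataFrame rather than the penguins file, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500.0, 3900.0, 3500.0, 5000.0, 5200.0],
})
grouped = df.groupby("species")["body_mass_g"]

# Custom aggregation: the range (max minus min) within each group.
mass_range = grouped.agg(lambda x: x.max() - x.min())

# Collect every value of a group into a Python list.
mass_lists = grouped.agg(list)

# Number of distinct values per group.
distinct = grouped.nunique()

print(mass_range)
print(mass_lists)
print(distinct)
```

Lambdas are flexible but generally slower than the built-in string names such as `"mean"` or `"max"`, so they are best kept for summaries that have no built-in equivalent.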

Aggregations combine multiple values into a single result for each group, enabling efficient high-level insights into the data.


In [48]:
# Group data by 'species' and calculate the mean body mass for each group.
# This groups the penguins by species, then finds the average weight in grams.

penguins_body_mass.groupby(["species"])["body_mass_g"].mean()

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

Swapping the two columns, however, does not work:

`penguins_body_mass.groupby(["body_mass_g"])["species"].mean()`

This fails because `.mean()` requires numeric data, and the `species` column contains strings.

- `.mean()` → works only on numeric data. Applied to a string column such as `"species"`, it raises an error in recent pandas versions; older versions could instead silently drop non-numeric columns when averaging a whole grouped DataFrame.
- `.sum()` → with strings, it concatenates the values instead of summing numerically.
- `.min()` / `.max()` → applied to strings, these return the lexicographically smallest or largest value respectively.

Thus, `.mean()` is stricter and numeric-only, whereas `.sum()`, `.min()`, and `.max()` can operate on string data but with behavior that reflects string operations rather than arithmetic.
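
The difference can be checked on a tiny example. This sketch uses made-up rows and groups by a hypothetical `island` key; the string column is forced to `object` dtype so the behavior matches the description above, and the exact exception type raised by `.mean()` varies between pandas versions, so the code catches a generic `Exception`.

```python
import pandas as pd

# Toy data (illustrative only); species is stored as plain Python strings.
df = pd.DataFrame({
    "island": ["Dream", "Dream", "Biscoe"],
    "species": pd.Series(["Chinstrap", "Adelie", "Gentoo"], dtype=object),
})
by_island = df.groupby("island")["species"]

# sum() on strings concatenates the values in row order.
concatenated = by_island.sum()

# min() / max() on strings compare lexicographically.
smallest = by_island.min()
largest = by_island.max()

# mean() has no meaning for strings and is expected to raise.
try:
    by_island.mean()
    mean_failed = False
except Exception as exc:  # exception type differs across pandas versions
    mean_failed = True
    print(f"mean() failed: {type(exc).__name__}")

print(concatenated)
print(smallest)
print(largest)
```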

In [49]:
# List the distinct species values (the dataset contains three species)
penguins_body_mass["species"].unique()

array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)

In [50]:
# Using agg() to compute multiple aggregation metrics for each species
penguins_body_mass.groupby("species")["flipper_length_mm"].agg(["sum","mean"])

               sum        mean
species
Adelie     28683.0  189.953642
Chinstrap  13316.0  195.823529
Gentoo     26714.0  217.186992


In [51]:
result = penguins_body_mass.groupby("species").agg({
    "flipper_length_mm": ["sum", "mean"],   # multiple aggregations for flipper length
    "body_mass_g": ["mean", "max", "min"]   # multiple aggregations for body mass
})
result

          flipper_length_mm              body_mass_g
                        sum        mean         mean     max     min
species
Adelie              28683.0  189.953642  3700.662252  4775.0  2850.0
Chinstrap           13316.0  195.823529  3733.088235  4800.0  2700.0
Gentoo              26714.0  217.186992  5076.016260  6300.0  3950.0
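
A dictionary of lists produces two-level column headers, which can be awkward to select from later. One possible alternative is pandas' named aggregation, where each output column gets an explicit name. The sketch below uses made-up data, and the output column names (`flipper_sum`, `flipper_mean`, `mass_max`) are arbitrary choices for illustration.

```python
import pandas as pd

# Toy data (illustrative values only).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "flipper_length_mm": [181.0, 195.0, 210.0, 220.0],
    "body_mass_g": [3500.0, 3900.0, 5000.0, 5200.0],
})

# Named aggregation: output_name=(input_column, aggregation).
summary = df.groupby("species").agg(
    flipper_sum=("flipper_length_mm", "sum"),
    flipper_mean=("flipper_length_mm", "mean"),
    mass_max=("body_mass_g", "max"),
)

print(summary)
```

The result has a single level of column names, so a value can be read with `summary.loc["Adelie", "flipper_mean"]` instead of a tuple key.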


In [52]:
# Count non-null body_mass_g values for each species
penguins_body_mass.groupby("species")["body_mass_g"].count()

species
Adelie       151
Chinstrap     68
Gentoo       123
Name: body_mass_g, dtype: int64
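
Because `count()` skips missing values while `size()` counts every row, comparing the two is a quick way to find how many values are missing per group. A minimal sketch on made-up data with one deliberate `NaN`:

```python
import numpy as np
import pandas as pd

# Toy data with one missing body mass (illustrative only).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3500.0, np.nan, 5000.0],
})
grouped = df.groupby("species")["body_mass_g"]

non_null = grouped.count()   # rows with a non-NaN body mass
total_rows = grouped.size()  # all rows, NaN included
missing = total_rows - non_null

print(missing)
```

If the penguins file contains rows without a recorded body mass, the same subtraction on `penguins_body_mass` would show how many each species has.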