# **Pandas Data Transformations**

In [2]:
# Group by and aggregations: data transformation is the bridge between raw data and actionable insights.
import pandas as pd
penguins_body_mass=pd.read_csv("../Datasets/penguins.csv")

# Aggregations in pandas

Aggregation operations summarize data within groups, reducing multiple values to a single value per group. They are a cornerstone of the split-apply-combine strategy in pandas' groupby:

- Split: Break the DataFrame into groups (e.g., by `species`).
- Apply: Compute a summary (e.g., sum, mean, count) for each group.
- Combine: Collect results into a new DataFrame or Series.

Basic aggregation functions include:
- Numeric: `mean()`, `sum()`, `min()`, `max()`, `std()`, `median()`.
- General: `count()` (counts non-NaN values), `nunique()` (counts unique values), `size()` (counts total rows including NaNs).
- Custom: Use `.agg()` with functions or lambdas, e.g., `.agg(lambda x: max(x) - min(x))` for range, or `.agg(list)` to collect values.

Aggregations combine multiple values into a single result for each group, enabling efficient high-level insights into the data.
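
The difference between `count()`, `size()`, and `nunique()`, plus a lambda-based range, can be seen on a small hand-made table. This is only a sketch: the frame `toy` and its values are invented for illustration and are not taken from the penguins dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, one missing measurement.
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "mass": [10.0, 20.0, np.nan, 5.0, 5.0],
})

grouped = toy.groupby("species")["mass"]

summary = pd.DataFrame({
    "count": grouped.count(),      # non-NaN values only
    "size": grouped.size(),        # all rows, NaN included
    "nunique": grouped.nunique(),  # distinct non-NaN values
    "range": grouped.agg(lambda x: x.max() - x.min()),
})
print(summary)
```

The lambda uses `x.max() - x.min()` rather than the built-in `max(x) - min(x)` because the pandas methods skip NaN by default, whereas the built-ins can give order-dependent results when NaN is present.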


In [3]:
# Group data by 'species' and calculate the mean body mass for each group.
# This groups the penguins by species, then finds the average weight in grams.

penguins_body_mass.groupby(["species"])["body_mass_g"].mean()

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

By contrast, the reversed call `penguins_body_mass.groupby(["body_mass_g"])["species"].mean()` does not work, because `.mean()` requires numeric data and the `species` column holds strings.

- `.mean()` → numeric only; applied to a string column such as `"species"`, it raises an error (a `TypeError` in recent pandas versions).
- `.sum()` → with strings, it concatenates the values instead of summing numerically.
- `.min()` / `.max()` → applied to strings, these return the lexicographically smallest or largest value respectively.

Thus, `.mean()` is stricter and numeric-only, whereas `.sum()`, `.min()`, and `.max()` can operate on string data but with behavior that reflects string operations rather than arithmetic.
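
A minimal sketch of this difference on invented data (the frame `toy` is hypothetical). The exact exception raised by `.mean()` has varied between pandas versions, so the example catches a general `Exception` rather than assuming one type.

```python
import pandas as pd

# Hypothetical toy data with a string column.
toy = pd.DataFrame({
    "group": ["x", "x", "y"],
    "species": ["Gentoo", "Adelie", "Chinstrap"],
})

by_group = toy.groupby("group")["species"]

# min/max compare strings lexicographically.
smallest = by_group.min()
largest = by_group.max()
print(smallest["x"], largest["x"])

# mean has no meaning for strings and is expected to fail.
try:
    by_group.mean()
    mean_failed = False
except Exception as exc:
    mean_failed = True
    print(type(exc).__name__)
```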

In [4]:
# Get unique species values (3 unique species in dataset)
penguins_body_mass["species"].unique()
# three unique values. 

array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)

In [5]:
# Using agg() to compute multiple aggregation metrics for each species
penguins_body_mass.groupby("species")["flipper_length_mm"].agg(["sum","mean"])

species,sum,mean
Adelie,28683.0,189.953642
Chinstrap,13316.0,195.823529
Gentoo,26714.0,217.186992


In [6]:
result = penguins_body_mass.groupby("species").agg({
    "flipper_length_mm": ["sum", "mean"],   # multiple aggregations for flipper length
    "body_mass_g": ["mean", "max", "min"]   # multiple aggregations for body mass
})
result

,flipper_length_mm,flipper_length_mm,body_mass_g,body_mass_g,body_mass_g
species,sum,mean,mean,max,min
Adelie,28683.0,189.953642,3700.662252,4775.0,2850.0
Chinstrap,13316.0,195.823529,3733.088235,4800.0,2700.0
Gentoo,26714.0,217.186992,5076.01626,6300.0,3950.0


In [7]:
# Count non-null body_mass_g values for each species
penguins_body_mass.groupby("species")["body_mass_g"].count()

species
Adelie       151
Chinstrap     68
Gentoo       123
Name: body_mass_g, dtype: int64

In [8]:
# Advanced aggregations: applying multiple aggregation functions to summarize data.

aggregations= penguins_body_mass.groupby("island")["year"].agg(["max","min"])
aggregations

island,max,min
Biscoe,2009,2007
Dream,2009,2007
Torgersen,2009,2007


In [None]:
# To make results clearer, we can rename aggregation columns using a dictionary.
renamed_columns={
    "maximum year":"max",
    "average_year":"mean"
}
aggregations= penguins_body_mass.groupby("species")["year"].agg(**renamed_columns)
aggregations

species,maximum year,average_year
Adelie,2009,2008.013158
Chinstrap,2009,2007.970588
Gentoo,2009,2008.080645


In [10]:
# Named aggregation on a single column: each keyword name becomes an output column name.
aggregations= penguins_body_mass.groupby("island")["year"].agg(max_year="max",min_year="min")
aggregations

island,max_year,min_year
Biscoe,2009,2007
Dream,2009,2007
Torgersen,2009,2007


In [11]:
# Applying an aggregation function to several columns at once, using a dictionary of column -> function.
penguins_body_mass.groupby("island").agg({"flipper_length_mm":"min", "bill_length_mm":"min"})

island,flipper_length_mm,bill_length_mm
Biscoe,172.0,34.5
Dream,178.0,32.1
Torgersen,176.0,33.5


In [12]:
# Combine multiple aggregations into one grouped summary.
penguins_body_mass.groupby("island").agg(
    average_body_mass=("body_mass_g", "mean"),
    unique_species_count=("species", "nunique"),
    std_body_mass=("body_mass_g", "std") #standard deviation 
)

island,average_body_mass,unique_species_count,std_body_mass
Biscoe,4716.017964,2,782.855743
Dream,3712.903226,2,416.644112
Torgersen,3706.372549,1,445.10794


- **Aggregations** such as `sum()`, `mean()`, `min()`, `max()`, `count()`, and `nunique()` summarize data within groups.
- **Multiple aggregations** can be applied simultaneously using `.agg()` with lists, dictionaries, or named aggregation syntax.
- **Named aggregations** allow assigning descriptive names to output columns for clarity.
- Proper data types (numeric for math functions) are crucial to avoid errors during aggregation.
- For categorical data, functions like `nunique` or `count` provide meaningful insights rather than numeric aggregates like `sum`.

## Best Practices
- Use **named aggregations** for clean, readable output.
- When aggregating strings, avoid numeric functions; use counts or unique counts instead.
- Rename columns either during aggregation (named aggregation) or afterwards using `.rename()`.
- Examine and convert data types before aggregation to prevent errors like `TypeError: agg function failed`.
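
The last point can be sketched as follows, assuming a numeric column that was read in as text. The frame `raw` and its values are invented for illustration; `pd.to_numeric(..., errors="coerce")` is one possible way to convert, not the only one.

```python
import pandas as pd

# Hypothetical raw data where a numeric column arrived as strings.
raw = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": ["3700", "n/a", "5000", "5200"],
})

print(raw.dtypes)  # body_mass_g is not numeric at this point

# Convert to numbers; unparseable entries become NaN instead of raising.
raw["body_mass_g"] = pd.to_numeric(raw["body_mass_g"], errors="coerce")

# mean() now works and skips the NaN produced by "n/a".
mean_mass = raw.groupby("species")["body_mass_g"].mean()
print(mean_mass)
```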

## Summary Table of Useful Aggregations

| Function      | Description                             | Use Case                |
|---------------|-------------------------------------|-------------------------|
| `sum()`       | Sum of values                       | Numeric totals          |
| `mean()`      | Arithmetic average                  | Central tendency        |
| `min()`/`max()`| Minimum/Maximum values              | Range insights          |
| `count()`     | Non-null count                     | Number of entries       |
| `nunique()`   | Count of unique values             | Category diversity      |
| `std()`       | Standard deviation                 | Data variability        |

Mastering `groupby` and `agg` empowers efficient exploratory data analysis and reporting by summarizing complex datasets into comprehensible insights. Always choose aggregation functions considering data types and the analysis context for meaningful results.

In [13]:
# Using custom aggregation functions: custom functions let us perform advanced operations tailored to our specific needs.
import pandas as pd
penguins_data=pd.read_csv("../Datasets/penguins.csv")

## Why custom aggregation functions?
- When built-in functions do not meet our requirements.
- To implement specific business rules for analysis.
- When advanced metrics such as ranges or percentages are required.

### Key Methods

- **Basic Syntax:**  
  ```
  df.groupby('column')['column_to_aggregate'].agg(your_function)
  ```

- **Lambda:**  
  ```
  agg(lambda x: max(x) - min(x))
  ```

- **Named Function:**  
  ```
  def my_func(x):
      return ...
  agg(my_func)
  ```

- **Multiple Functions:**  
  ```
  agg([my_func, 'mean'])
  # or
  agg({'column': [('name', my_func)]})  # for named outputs ('name' is the output label)
  ```


In [14]:
def calculate_range(series):
    return series.max()-series.min()
custom_aggregations = penguins_data.groupby("species")["body_mass_g"].agg(
    unique_masses="nunique",   # distinct body_mass_g values per species (not a count of species)
    mass_min="min",
    mass_max="max",
    mass_range=calculate_range 
)
custom_aggregations

# The column used for aggregation is determined by what you select after the groupby (here body_mass_g).

# The aggregation functions (like min, max) are applied to this selected column group-wise.

# The output column names inside .agg() come from the keyword argument names you provide.

species,unique_masses,mass_min,mass_max,mass_range
Adelie,55,2850.0,4775.0,1925.0
Chinstrap,34,2700.0,4800.0,2100.0
Gentoo,47,3950.0,6300.0,2350.0


In [15]:
custom_agg = penguins_data.groupby('island').agg(
    unique_species=('species', 'nunique'),
    mass_range=('body_mass_g', lambda s: s.max() - s.min())
)
custom_agg

# Groups penguins by island.
# Counts unique species in each island group.
# Calculates range (max - min) of body_mass_g per island.

island,unique_species,mass_range
Biscoe,2,3450.0
Dream,2,2100.0
Torgersen,1,1800.0


In [16]:
# Custom function: count values above 4000
def count_heavy(x):
    return (x > 4000).sum()

result = penguins_body_mass.groupby('species')['body_mass_g'].agg(count_heavy)
print(result)

species
Adelie        35
Chinstrap     15
Gentoo       122
Name: body_mass_g, dtype: int64


In [17]:
# Custom complex aggregations
penguins_adelie=penguins_body_mass.query("species=='Adelie'")
len(penguins_adelie["island"])

152

In [18]:
sum(penguins_adelie["island"]=="Biscoe")
# Summing this boolean mask counts the True values (the matches), whereas len() returns the total number of rows regardless of any match.

44

To calculate the percentage of each species within a pandas DataFrame using custom complex aggregations, you can apply a groupby operation combined with a custom aggregation function. The general approach is:

1. Group the DataFrame by the categorical column (`species`).
2. Count the number of entries in each group.
3. Divide each group's count by the total count of all entries (to get the proportion).
4. Multiply by 100 to convert the proportion to a percentage.

Here is an example code snippet implementing this:

In [19]:
def percentage_of_total(x):
    return (len(x) / len(penguins_body_mass)) * 100
result = penguins_body_mass.groupby("species").agg(percentage=('species', percentage_of_total))
result

species,percentage
Adelie,44.186047
Chinstrap,19.767442
Gentoo,36.046512
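
The same four steps can also be written without a custom function. The sketch below uses an invented frame `toy` (not the penguins data) and shows two built-in routes: `value_counts(normalize=True)` and group sizes divided by the total row count.

```python
import pandas as pd

# Hypothetical stand-in for the species column.
toy = pd.DataFrame({"species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"]})

# normalize=True returns proportions; multiply by 100 for percentages.
pct = toy["species"].value_counts(normalize=True) * 100

# Equivalent via groupby: rows per group divided by the total number of rows.
pct_groupby = toy.groupby("species").size() / len(toy) * 100

print(pct.sort_index())
print(pct_groupby)
```

Note that `value_counts()` ignores missing values by default, so the two routes can differ when the grouping column contains NaN.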


In [20]:
result=penguins_body_mass.groupby("island").agg(percentage=("island",percentage_of_total))
result

island,percentage
Biscoe,48.837209
Dream,36.046512
Torgersen,15.116279


In [21]:
def percentage_of_total(x):
    return (len(x) / len(penguins_body_mass)) * 100

# Group by both 'island' and 'species',
# then apply the function to count rows per group and convert to percentage of total
result = penguins_body_mass.groupby(['island', 'species']).agg(percentage=('species', percentage_of_total))
result.reset_index()


,island,species,percentage
0,Biscoe,Adelie,12.790698
1,Biscoe,Gentoo,36.046512
2,Dream,Adelie,16.27907
3,Dream,Chinstrap,19.767442
4,Torgersen,Adelie,15.116279


# **How to Use the Apply Function for Transformations**

Transformations involve applying functions to modify or create new data in a DataFrame or Series, such as:

- Scaling numeric values (e.g., normalizing body_mass_g).
- Modifying strings (e.g., capitalizing species).
- Creating new columns based on row-wise calculations (e.g., body mass index).
- Applying group-based logic (e.g., subtracting the group mean).

Unlike aggregations (which reduce data, like `mean()` per group), transformations preserve the shape of the data (same number of rows or columns) or create new columns. Pandas provides several methods to apply functions for transformations:

- **`apply()`**: General-purpose, applies functions to rows, columns, or entire DataFrames.
- **`map()`**: Applies functions to each element in a Series (column).
- **`applymap()`**: Applies functions element-wise to an entire DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`).
- **`transform()`**: Applies group-wise transformations, aligning results with the original DataFrame’s index.
- **Vectorized Operations**: Use pandas/NumPy operations for faster, element-wise transformations without explicit loops.
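
A short sketch contrasting three of these methods on an invented frame (`toy` and its values are hypothetical). Each result keeps one value per original row, which is what distinguishes a transformation from an aggregation.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "species": ["adelie", "adelie", "gentoo", "gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, 6000.0],
})

# map(): element-wise function on a single column.
toy["species_title"] = toy["species"].map(str.title)

# transform(): group-wise result aligned with the original rows.
group_mean = toy.groupby("species")["body_mass_g"].transform("mean")
toy["mass_minus_group_mean"] = toy["body_mass_g"] - group_mean

# Vectorized arithmetic: no Python function call per row.
toy["body_mass_kg"] = toy["body_mass_g"] / 1000

print(toy)
```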


# Using `apply()` for Transformations in Pandas
Use `apply()` to apply custom functions to rows or columns, preserving data shape. 

- **Key Methods**:
  - Row-wise: `df.apply(func, axis=1)`
  - Column-wise: `df.apply(func, axis=0)`
  - Multiple outputs: Return `pd.Series` in the function
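
The three call patterns above can be sketched on a tiny invented frame. The names `toy` and `describe_row` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column as a Series.
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# Returning a pd.Series from a row-wise function yields one output column per key.
def describe_row(row):
    return pd.Series({"total": row["a"] + row["b"], "ratio": row["b"] / row["a"]})

features = toy.apply(describe_row, axis=1)
print(col_ranges)
print(row_sums)
print(features)
```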

In [22]:
# How to use apply.
import pandas as pd
penguins=pd.read_csv("../Datasets/penguins.csv")

In [23]:
# Calculate BMI: Row wise
def calculate_bmi(row):
    return row["body_mass_g"]/ (row["flipper_length_mm"]/1000)**2
penguins["bmi"]=penguins.apply(calculate_bmi,axis=1)
penguins

,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,bmi
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,114465.370410
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,109839.287779
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,85470.085470
3,4,Adelie,Torgersen,,,,,,2007,
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,92619.936106
...,...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009,93351.070037
340,341,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009,83325.164200
341,342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009,101345.002550
342,343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009,92970.521542


In [24]:
#Use pd.notnull() to check for non-null values before calculation.

def calculate_bmi(row):
    if pd.notnull(row["body_mass_g"]) and pd.notnull(row["flipper_length_mm"]):
        return row["body_mass_g"] / (row["flipper_length_mm"] / 1000) ** 2
    else:
        return None
penguins["bmi"]=penguins.apply(calculate_bmi,axis=1)
penguins

,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year,bmi
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007,114465.370410
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007,109839.287779
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007,85470.085470
3,4,Adelie,Torgersen,,,,,,2007,
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007,92619.936106
...,...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009,93351.070037
340,341,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009,83325.164200
341,342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009,101345.002550
342,343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009,92970.521542
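
For a formula like this one, plain column arithmetic is usually a simpler alternative to a row-wise `apply()`. The sketch below uses an invented three-row frame (`toy` is hypothetical) and the same mass-over-squared-flipper-length formula; missing inputs propagate to NaN without an explicit null check.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has missing measurements.
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3450.0],
    "flipper_length_mm": [181.0, np.nan, 193.0],
})

# Arithmetic on whole columns at once; NaN inputs yield NaN outputs.
toy["bmi"] = toy["body_mass_g"] / (toy["flipper_length_mm"] / 1000) ** 2
print(toy)
```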


In [25]:
# Size category
def heavy(size):
    return size["body_mass_g"]>4000
penguins["weighty"]=penguins.apply(heavy,axis=1)
penguins.drop(["bill_length_mm","bill_depth_mm","flipper_length_mm"],axis=1) #just to make output clear. 

,rowid,species,island,body_mass_g,sex,year,bmi,weighty
0,1,Adelie,Torgersen,3750.0,male,2007,114465.370410,False
1,2,Adelie,Torgersen,3800.0,female,2007,109839.287779,False
2,3,Adelie,Torgersen,3250.0,female,2007,85470.085470,False
3,4,Adelie,Torgersen,,,2007,,False
4,5,Adelie,Torgersen,3450.0,female,2007,92619.936106,False
...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,4000.0,male,2009,93351.070037,False
340,341,Chinstrap,Dream,3400.0,female,2009,83325.164200,False
341,342,Chinstrap,Dream,3775.0,male,2009,101345.002550,False
342,343,Chinstrap,Dream,4100.0,male,2009,92970.521542,True


In [28]:
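# Conditional Size Category:

def size_category(row):
    mass=row["body_mass_g"]
    flipper=row["flipper_length_mm"]
    if mass>4000 and flipper>200:
        return "large"
    elif mass<3600:
        # As written, masses below 3600 g are labelled "Medium".
        return "Medium"
    # Everything else falls through to "Small", including rows with
    # missing values (comparisons against NaN are always False).
    return "Small"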
penguins['size_category'] = penguins.apply(size_category, axis=1)
penguins.drop(["bill_length_mm","bill_depth_mm","flipper_length_mm"],axis=1) #just to make output clear. 

,rowid,species,island,body_mass_g,sex,year,bmi,weighty,size_category
0,1,Adelie,Torgersen,3750.0,male,2007,114465.370410,False,Small
1,2,Adelie,Torgersen,3800.0,female,2007,109839.287779,False,Small
2,3,Adelie,Torgersen,3250.0,female,2007,85470.085470,False,Medium
3,4,Adelie,Torgersen,,,2007,,False,Small
4,5,Adelie,Torgersen,3450.0,female,2007,92619.936106,False,Medium
...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,4000.0,male,2009,93351.070037,False,Small
340,341,Chinstrap,Dream,3400.0,female,2009,83325.164200,False,Medium
341,342,Chinstrap,Dream,3775.0,male,2009,101345.002550,False,Small
342,343,Chinstrap,Dream,4100.0,male,2009,92970.521542,True,large
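
Rule-based labelling of this kind can also be written without a row-wise function. The sketch below uses `numpy.select` on an invented frame. The label ordering (Small, Medium, Large), the extra "Unknown" label for missing values, and the sample rows are assumptions made for illustration, not the rules used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row has missing measurements.
toy = pd.DataFrame({
    "body_mass_g": [4100.0, 3250.0, 3750.0, np.nan],
    "flipper_length_mm": [210.0, 195.0, 181.0, np.nan],
})

# Conditions are checked in order; the first True one decides the label.
conditions = [
    toy["body_mass_g"].isna() | toy["flipper_length_mm"].isna(),
    (toy["body_mass_g"] > 4000) & (toy["flipper_length_mm"] > 200),
    toy["body_mass_g"] < 3600,
]
labels = ["Unknown", "Large", "Small"]

# Rows matching no condition receive the default label.
toy["size_category"] = np.select(conditions, labels, default="Medium")
print(toy)
```

Handling missing values as the first condition keeps them from silently landing in the default category.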


In [30]:
# Multiple Outputs:

# Compute the dataset-wide minimum and maximum once, instead of on every row.
min_mass = penguins["body_mass_g"].min()
max_mass = penguins["body_mass_g"].max()

def mass_features(row):
    mass = row["body_mass_g"]
    return pd.Series({
        'mass_normalized': (mass - min_mass) / (max_mass - min_mass),  # min-max scaling to [0, 1]
        'is_heavy': 1 if mass > 4000 else 0   # a missing mass compares as False, so it yields 0
    })
penguins[['mass_normalized', 'is_heavy']] = penguins.apply(mass_features, axis=1)
penguins.drop(["bill_length_mm","bill_depth_mm","flipper_length_mm"],axis=1) #just to make output clear. 

,rowid,species,island,body_mass_g,sex,year,bmi,weighty,size_category,mass_normalized,is_heavy
0,1,Adelie,Torgersen,3750.0,male,2007,114465.370410,False,Small,0.291667,0.0
1,2,Adelie,Torgersen,3800.0,female,2007,109839.287779,False,Small,0.305556,0.0
2,3,Adelie,Torgersen,3250.0,female,2007,85470.085470,False,Medium,0.152778,0.0
3,4,Adelie,Torgersen,,,2007,,False,Small,,0.0
4,5,Adelie,Torgersen,3450.0,female,2007,92619.936106,False,Medium,0.208333,0.0
...,...,...,...,...,...,...,...,...,...,...,...
339,340,Chinstrap,Dream,4000.0,male,2009,93351.070037,False,Small,0.361111,0.0
340,341,Chinstrap,Dream,3400.0,female,2009,83325.164200,False,Medium,0.194444,0.0
341,342,Chinstrap,Dream,3775.0,male,2009,101345.002550,False,Small,0.298611,0.0
342,343,Chinstrap,Dream,4100.0,male,2009,92970.521542,True,large,0.388889,1.0
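
The two features can also be computed with column operations instead of a row-wise function. This is a sketch on an invented frame (`toy` is hypothetical); it applies the same min-max formula and the same 4000 g threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing mass.
toy = pd.DataFrame({"body_mass_g": [2700.0, 3750.0, np.nan, 6300.0]})

mass = toy["body_mass_g"]
min_mass, max_mass = mass.min(), mass.max()  # NaN is skipped by default

# Min-max scaling to the [0, 1] range; NaN stays NaN.
toy["mass_normalized"] = (mass - min_mass) / (max_mass - min_mass)

# Boolean comparison cast to integers; NaN compares as False, so it becomes 0.
toy["is_heavy"] = (mass > 4000).astype(int)
print(toy)
```

Unlike the row-wise version, `is_heavy` stays an integer column here, because it is built separately from the float column that contains NaN.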
