# Group-Wise Operations and Transformations
*Curtis Miller*

This notebook focuses on operations and transformations performed at the group level. In particular, I demonstrate filling in missing values group by group and applying group-level statistical transformations such as standardization. I use the iris dataset, whose natural grouping is the species of the flower. I load the dataset and the needed libraries below.

In [None]:
import pandas as pd
import numpy as np

In [None]:
iris = pd.read_csv("iris.csv")
iris.head()

In [None]:
iris.shape

Group-wise operations typically follow a three-step process:

1. **Split** the dataset into groups
2. **Apply** an operation to each group (the name hints at the `apply()` method)
3. **Combine** the grouped datasets together again

Here are examples of this procedure in action.
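
Before turning to `iris`, the three steps can be spelled out by hand. This is a minimal sketch on a hypothetical toy DataFrame (the names `toy`, `pieces`, `means`, `manual`, and `builtin` are mine, not part of the notebook); it only illustrates that a one-line `groupby` call performs all three steps at once.

```python
import pandas as pd

# Hypothetical toy data (not the iris dataset)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 30.0],
})

# Split: one sub-DataFrame per group
pieces = {key: sub for key, sub in toy.groupby("group")}

# Apply: compute the mean of each piece
means = {key: sub["value"].mean() for key, sub in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name="value")
manual.index.name = "group"

# The same computation, with all three steps handled by pandas
builtin = toy.groupby("group")["value"].mean()
```

Both `manual` and `builtin` hold the group means (2.0 for `a`, 20.0 for `b`).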

## Group-Wise Missing Data Replacement

Let's randomly censor some of the `iris` data.

In [None]:
# Code in this block chooses random indices for censoring (True will be censored)
idx = np.array([False] * (150 * 5))
idx[np.random.choice(np.arange(150 * 5), size=150, replace=False)] = True
idx = idx.reshape(150, 5)
idx[:, 4] = False    # Last column is for species; never censor this
idx[:10, :]

In [None]:
# Convert to DataFrame for indexing
idx = pd.DataFrame(idx, index=iris.index, columns=iris.columns)
idx.head()

In [None]:
# Now do the actual censoring
iris_censor = iris.copy()
iris_censor[idx] = np.nan
iris_censor.head()
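
The censoring above draws different cells on every run. If reproducibility matters, one option is a seeded generator together with `DataFrame.mask()`. The sketch below uses a hypothetical four-row frame `df` rather than `iris`, and the seed and the number of censored cells are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a data set whose last column is a label
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [5.0, 6.0, 7.0, 8.0],
    "label": ["a", "a", "b", "b"],
})
num_cols = ["x", "y"]

rng = np.random.default_rng(0)    # arbitrary seed, fixed for reproducibility

# Choose exactly 3 numeric cells to censor (True will be censored)
flat = np.zeros(len(df) * len(num_cols), dtype=bool)
flat[rng.choice(flat.size, size=3, replace=False)] = True

# Boolean frame shaped like df; the label column stays False throughout
censor = pd.DataFrame(False, index=df.index, columns=df.columns)
censor[num_cols] = flat.reshape(len(df), len(num_cols))

# mask() replaces the cells where the condition is True with NaN
df_censor = df.mask(censor)
```

Because `mask()` returns a new object, `df` itself is left untouched, which plays the same role as the explicit `copy()` above.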

How could we replace the missing data? One approach is to replace each missing value with the mean of its column, but this is crude; the species do not share the same mean for each variable. We would rather fill each missing value with the mean for that flower's species.

We can do this by forming groups and filling with means at the group level. Let's see this in action.

In [None]:
# Split
iriscengroups = iris_censor.groupby("species")
iriscengroups.groups

In [None]:
# Apply/Combine
# numeric_only=True skips the non-numeric species column when computing the means
replace_nan = lambda s: s.fillna(s.mean(numeric_only=True))
iris_c_replaced = iriscengroups.apply(replace_nan)    # Recombination is done automatically
iris_c_replaced.head()
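
To see why the group-level fill differs from the crude one, here is a small sketch on hypothetical toy data (again not `iris`). It uses `transform("mean")`, which returns each row's group mean aligned to the original index, as an alternative to `apply()`; the names `toy`, `global_fill`, `group_means`, and `group_fill` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per group
toy = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "length": [1.0, np.nan, 3.0, 10.0, 12.0, np.nan],
})

# Crude: every missing cell gets the overall mean, (1 + 3 + 10 + 12) / 4 = 6.5
global_fill = toy["length"].fillna(toy["length"].mean())

# Group-wise: each missing cell gets the mean of its own group
group_means = toy.groupby("species")["length"].transform("mean")
group_fill = toy["length"].fillna(group_means)
```

The missing value in group `a` becomes 2.0 and the one in group `b` becomes 11.0, whereas the crude fill puts 6.5 in both places.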

## Group-Wise Standardization

Recall the standardization procedure:

$$z_i = \frac{x_i - \bar x}{s_x}$$

Sometimes we want each observation standardized with respect to its own group. If $k$ denotes the group, with group mean $\bar{x}_k$ and group standard deviation $s_k$, we want:

$$z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}$$
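
As a worked instance of this formula, consider a hypothetical toy column with two groups. Group `a` has mean 2 and standard deviation 1; group `b` has mean 20 and standard deviation 10 (pandas' `std()` uses the sample standard deviation, with $n - 1$ in the denominator). The names below are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: two groups on very different scales
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Per-row group mean and group standard deviation, aligned to toy's index
grouped_x = toy.groupby("group")["x"]
group_mean = grouped_x.transform("mean")
group_std = grouped_x.transform("std")

# z_ik = (x_ik - mean_k) / s_k
toy["z"] = (toy["x"] - group_mean) / group_std
```

After group-wise standardization both groups map to the same values, $-1, 0, 1$, even though their raw scales differ by a factor of ten.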

I demonstrate this by standardizing each variable in the `iris` dataset within its species group. Each standardized variable is stored in its own column, named with the suffix `_standardized`.

In [None]:
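# Split
# group_keys=False keeps the original row index so the result can be joined back to iris
irisgroups = iris.groupby("species", group_keys=False)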
irisgroups.groups

In [None]:
# Apply/Combine
standardize = lambda s: (s - s.mean()) / s.std()
iris_standardized = irisgroups[["sepal_length", "sepal_width", "petal_length", "petal_width"]].apply(standardize)
iris_standardized.head()

In [None]:
iris = iris.join(iris_standardized, rsuffix="_standardized")

In [None]:
iris.head()
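
One way to sanity-check a group-wise standardization is to confirm that, within every group, the standardized column has mean 0 and standard deviation 1. The sketch below does this on synthetic data, since it cannot assume `iris.csv` is available; the column names, seed, and group locations are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)    # arbitrary seed

# Synthetic data: three groups of 20 rows centered at different values
df = pd.DataFrame({
    "species": np.repeat(["a", "b", "c"], 20),
    "width": rng.normal(loc=np.repeat([1.0, 5.0, 9.0], 20), scale=1.0),
})

# Standardize within each group
grouped = df.groupby("species")["width"]
df["width_standardized"] = (df["width"] - grouped.transform("mean")) / grouped.transform("std")

# Per-group mean and standard deviation of the standardized column
check = df.groupby("species")["width_standardized"].agg(["mean", "std"])
```

Up to floating-point error, every row of `check` should show a mean of 0 and a standard deviation of 1. The same `agg(["mean", "std"])` call could be applied to the `_standardized` columns of `iris`.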