In [None]:
import pandas as pd
import numpy as np

# Group based analysis
The ability to apply both built-in and your own functions to data tables is particularly powerful when we group data into categories. Geographical data are a good place to apply this because they are often organised hierarchically (meshblocks into SA1s, SA1s into SA2s, and so on). It is worth keeping in mind that if you have labels for the layers in a geographical hierarchy, it is almost certainly quicker to aggregate data using those labels than by spatial joins.
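
As a minimal sketch of label-based aggregation, assume a small made-up table in which each meshblock row already carries the codes of its parent SA1 and SA2. The column names and values here are hypothetical and do not come from the census files. Rolling counts up the hierarchy is then just a `groupby` on the relevant label column, with no geometry involved:

```python
import pandas as pd

# Hypothetical toy data: each meshblock is labelled with its parent SA1 and SA2
meshblocks = pd.DataFrame({
    "mb_code": ["m1", "m2", "m3", "m4", "m5"],
    "sa1_code": ["a", "a", "b", "c", "c"],
    "sa2_code": ["X", "X", "X", "Y", "Y"],
    "population": [10, 20, 30, 40, 50],
})

# Aggregate up the hierarchy using the labels alone
sa1_pop = meshblocks.groupby("sa1_code")["population"].sum()
sa2_pop = meshblocks.groupby("sa2_code")["population"].sum()

print(sa1_pop)
print(sa2_pop)
```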

To explore this we go back to the source SA1 data for all of New Zealand.

In [None]:
# make an SA1 to TA (territorial authority) lookup
urban_areas = pd.read_csv("data/geographic-areas-table-2023.csv")[
    ["SA12023_code", "TA2023_name"]] \
    .drop_duplicates() \
    .set_index("SA12023_code")

# get the data (all 500+ columns), set SA1 as an index, and flag NAs
sa1 = pd.read_csv(
    "data/2023_Census_totals_by_topic_for_individuals_by_SA1.csv") \
        .rename(columns = {"Statistical area 1 (SA1) 2023 code": "sa1_code"}) \
        .set_index("sa1_code") \
        .replace([-999, -997], pd.NA)

# drop non-Mainland rows
sa1 = sa1[sa1["Landwater name"] == "Mainland"] \
    .drop(columns = ["OBJECTID", "Landwater code", "Landwater name"])

# attach the TA name to each SA1 (the inner join keeps only SA1s present in both tables)
sa1 = urban_areas.join(sa1, how = "inner")
sa1.index.name = "sa1_code"
sa1

Let's see how many SA1s (rows) are in each territorial authority.

In [None]:
sa1.TA2023_name.value_counts()

Now, if we use `groupby`, we can apply built-in or even our own functions to groups of data.

In [None]:
grouped_df = sa1.groupby("TA2023_name")
grouped_df.sum()

To apply a function to grouped data, use `agg()` in place of `apply()`. The name signals that the function expects a set of values and aggregates them, returning a single summary value as its output.

In [None]:
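def unevenness(values):
    # sum of squared shares of the total: 1/n when the total is spread
    # evenly over n values, 1 when it is concentrated in a single value
    total = np.sum(values)
    if total == 0:
        # avoid dividing by zero when there is nothing to share out
        return 0
    return np.sum([(x / total) ** 2 for x in values])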

grouped_df.agg(unevenness)
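
To get a feel for what `unevenness` returns, here is a small sketch on made-up data. The `toy` table and its `region` and `count` columns are hypothetical. The function is repeated from the cell above so the example runs on its own.

```python
import numpy as np
import pandas as pd

def unevenness(values):
    # Sum of squared shares, as defined in the cell above
    total = np.sum(values)
    if total == 0:
        return 0
    return np.sum([(x / total) ** 2 for x in values])

# Hypothetical toy data: two regions with three areas each
toy = pd.DataFrame({
    "region": ["even", "even", "even", "skewed", "skewed", "skewed"],
    "count": [10, 10, 10, 30, 0, 0],
})

# agg() passes each group's values to the function and keeps one number per group
result = toy.groupby("region")["count"].agg(unevenness)
print(result)
```

The evenly spread region should come out at one third (three equal shares), and the region with everything in one area at 1.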

We can even apply functions to the results of applying other functions:

In [None]:
display(grouped_df.sum().apply(unevenness, axis = "columns"))
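
Note that this chained version answers a different question from `agg(unevenness)`: it first sums each column within a group, then measures how unevenly each group's totals are spread across the columns. A minimal sketch on made-up data (the `toy` table and its column names are hypothetical) illustrates this:

```python
import numpy as np
import pandas as pd

def unevenness(values):
    # Sum of squared shares, as defined earlier
    total = np.sum(values)
    if total == 0:
        return 0
    return np.sum([(x / total) ** 2 for x in values])

# Hypothetical toy data: two count columns per area
toy = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "age_0_14": [5, 5, 0, 0],
    "age_15_plus": [5, 5, 20, 20],
})

# Step 1: one row of column totals per region
totals = toy.groupby("region").sum()

# Step 2: apply the function across the columns of each row
by_row = totals.apply(unevenness, axis="columns")
print(by_row)
```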

You can also conveniently iterate over the groups created by `groupby`. Each iteration yields a tuple containing the value of the selector and the subset of the data in that group.

In [None]:
groups = sa1.groupby("TA2023_name")
for name, group in groups:
    # use the pandas .sum() method, which skips missing values
    print(f"{group.shape[0]:4} SA1s in {name}, total population {group.iloc[:, 1].sum()}")
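
The same iteration pattern can be tried on a small made-up table. This is only a sketch: the `toy` data and the `summary` dictionary are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "region": ["north", "north", "south"],
    "population": [100, 150, 200],
})

summary = {}
for name, group in toy.groupby("region"):
    # name is the group key; group is the matching subset as a DataFrame
    n_rows = group.shape[0]
    total = int(group["population"].sum())
    summary[name] = (n_rows, total)
    print(f"{n_rows:4} rows in {name}, total population {total}")
```

Collecting results in a dictionary like this is handy for one-off inspection, although `agg()` is usually the more direct route when all you need is a summary table.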