In [None]:
import pandas as pd
import numpy as np

# Applying functions in `pandas`
We've already seen that `pandas` enables faster computation, and therefore faster processing of data. It is important to take advantage of this whenever possible (and it should pretty much always be possible). In this notebook we look at how to do that.

We'll use the dataset we just cleaned to do that.

In [None]:
welly_ages = pd.read_csv("data/welly-ages-final.csv", index_col = 0)
welly_ages

## Summary functions applied to a `DataFrame`
Many standard statistical summary functions are provided as built-in methods by `pandas`. They return the values of the summary statistic for each column in the table as a `Series`.

In [None]:
# mean
welly_ages.mean()

In [None]:
# standard deviation
welly_ages.std()

In [None]:
# median
welly_ages.median()

In [None]:
# deciles (the 10th, 20th, ... 90th percentiles)
welly_ages.quantile(q = [i / 10 for i in range(1, 10)])

If you want the calculation for each row (that is, across the columns), specify `axis = "columns"`.

In [None]:
welly_ages.mean(axis = "columns")
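
To make the `axis` argument concrete, here is a minimal sketch on a small made-up table. The `toy` data and its column names are invented for illustration and are not taken from the Wellington dataset.

```python
import pandas as pd

# A made-up table: 3 rows (areas) by 2 columns (age groups)
toy = pd.DataFrame(
    {"age_0_4": [10, 20, 30], "age_5_9": [30, 40, 50]},
    index=["a", "b", "c"],
)

# Default (axis = "index"): one value per column, computed down the rows
col_means = toy.mean()

# axis = "columns": one value per row, computed across the columns
row_means = toy.mean(axis="columns")

print(col_means)
print(row_means)
```

The first result is indexed by the column names and the second by the row labels, which is a quick way to check which direction a calculation ran in.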

### An overview
The `describe()` method provides a helpful overview.

In [None]:
welly_ages.describe()

There isn't much more to say about the built-in functions. Many useful ones are available, although the way the `pandas` API documentation is organised makes it a little tricky to find them all in one place. So here's a link to get you where you need to go: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics.

### But wait... there are also simple graphics
Using `matplotlib` or another plotting library to view graphical summaries of data can be very useful, and many basic plots are available, usually via `<dataframe>.<series>.plot.some_function()`. They won't always be useful (and for these data most of them are not), but it's good to know they're there: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#basic-plotting-plot.

In [None]:
welly_ages.loc[:, "age_20_24":"age_60_64"] \
    .sum(axis = "columns") \
    .plot.box(title = "Population ages 20-64, by SA1")

We will spend more time on the graphics options for mapping when we look more closely at `geopandas` a little later.

## Applying functions to a `DataFrame`
You aren't restricted to the built-in mathematical and other functions. As an example, a simple measure of the unevenness of a set of numbers is the sum of the squared fractions of the total that each value represents. For a set of numbers of a given size, the more unequal the numbers, the larger this sum will be.
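
Written out as a formula (the symbol $U$ is just a label chosen here for the unevenness measure), for values $x_1, \dots, x_n$ with a nonzero total:

$$
U = \sum_{i=1}^{n} \left( \frac{x_i}{\sum_{j=1}^{n} x_j} \right)^2
$$

For non-negative values, $U$ ranges from $1/n$ (all values equal) to $1$ (a single nonzero value).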

We could calculate this across the rows in our data table (i.e. for each SA1) in a series of steps. Like this:

In [None]:
# get row totals
welly_age_totals = welly_ages.sum(axis = "columns")
# this step is not obvious... it relies on the broadcasting rules
# see https://numpy.org/doc/stable/user/basics.broadcasting.html
# where the underlying numpy behaviour is explained (it's complicated!)
welly_age_fracs = welly_ages.divide(welly_age_totals, axis = "index") 
welly_age_fracs_squared = welly_age_fracs ** 2
welly_age_fracs_squared.sum(axis = "columns")

BUT... I'm not even going to pretend much of the above is obvious. Especially step 2 where we have to know about _broadcasting_ to understand how dividing the whole table by the row totals will behave. I had to look this step up, and I doubt that I will ever be certain what will happen in this situation until I try it. 
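
One way to get a feel for step 2 is to try it on a table small enough to check by hand. The sketch below uses a made-up `toy` table (names invented for illustration) and shows what `divide()` does with and without `axis = "index"`.

```python
import pandas as pd

# A made-up table; the names are for illustration only
toy = pd.DataFrame(
    {"x": [1, 2, 3], "y": [3, 2, 1]},
    index=["a", "b", "c"],
)

# One total per row, labelled by the row index (here 4 for every row)
row_totals = toy.sum(axis="columns")

# axis = "index" matches the labels of row_totals to the row labels of toy,
# so every value in row "a" is divided by the total for row "a", and so on
fracs = toy.divide(row_totals, axis="index")

# Without axis = "index", the labels a, b, c are matched against the
# column names x, y instead; nothing lines up, so every value is NaN
wrong = toy.divide(row_totals)

print(fracs)
print(wrong)
```

Each row of `fracs` sums to 1, which is a useful sanity check that the division ran in the intended direction.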

In practice, instead of the above, I write a function to perform the required calculation on a set of values. Here's that function:

In [None]:
def unevenness(values):
    total = np.sum(values)
    if total == 0:
        return 0
    return np.sum([(x / total) ** 2 for x in values])

# and to show it works
print(f"""
      {unevenness([20, 20, 20, 20,  20]) = :.3f}
      {unevenness([ 7, 14, 21, 28,  35]) = :.3f} 
      {unevenness([ 5,  5, 20, 30,  40]) = :.3f}
      {unevenness([ 2,  2,  2,  2,  92]) = :.3f}
      {unevenness([ 0,  0,  0,  0, 100]) = :.3f}
      """)

The equally split first set of values has the lowest unevenness of these five sets of five numbers, while the last set of values has the highest (all values are 0 except for one).

Of note here is that we use `numpy` mathematics functions rather than base Python or `math` module functions, because they are generally faster on large arrays of data. It probably doesn't matter very much here, but this is a good habit to get into, because when it does matter it will matter a lot!
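
Taking that habit one step further, the list comprehension inside `unevenness()` could itself be replaced by array arithmetic. The following is a sketch of one possible variant, not the function used in the rest of this notebook; the name `unevenness_vectorised` is made up for illustration.

```python
import numpy as np

def unevenness_vectorised(values):
    """Hypothetical array-based variant of unevenness()."""
    values = np.asarray(values, dtype=float)
    total = values.sum()
    if total == 0:
        return 0.0
    # Divide and square the whole array at once, with no Python-level loop
    return float(np.sum((values / total) ** 2))

print(unevenness_vectorised([20, 20, 20, 20, 20]))
print(unevenness_vectorised([0, 0, 0, 0, 100]))
```

It should give the same results as `unevenness()` for the test sets above. Whether it is noticeably faster depends on how many values are passed in at a time.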

Having written such a function, we can use `apply()` to uh... _apply_ it to rows or columns in a data table. Because my function does not deal with NA values, and the data table still contains NA values, we drop those before applying the function.

In [None]:
welly_ages.iloc[:, :-1] \
    .dropna() \
    .apply(unevenness, axis = "columns")

`apply` allows for the usual specification of the `axis` (i.e. `axis = "index"` or `axis = "columns"`), so it's a simple matter to apply this to each column in the data rather than across rows. This is an advantage of this approach over an attempt to perform this calculation as a series of data table processing steps, where you may have to change the `axis` parameter in several steps of the process.

In [None]:
welly_ages \
    .dropna() \
    .apply(unevenness, axis = "index")
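
To see both directions side by side on something small enough to verify by hand, here is a self-contained sketch. The `toy` table is made up for illustration, and `unevenness()` is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def unevenness(values):
    # Same calculation as the function defined earlier in the notebook
    total = np.sum(values)
    if total == 0:
        return 0
    return np.sum([(x / total) ** 2 for x in values])

# A made-up table with one missing value
toy = pd.DataFrame(
    {"x": [1.0, 5.0, np.nan], "y": [1.0, 0.0, 2.0]},
    index=["a", "b", "c"],
)

clean = toy.dropna()  # drops row "c", which contains an NA

by_row = clean.apply(unevenness, axis="columns")  # one value per row
by_col = clean.apply(unevenness, axis="index")    # one value per column

print(by_row)
print(by_col)
```

Only the `axis` argument changes between the two calls; the function itself stays the same.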