# Aggregation Functions

A very common task in numerical computing and data analysis is to compute one or more aggregate functions of an array. Especially common examples of aggregates include sums, products, averages, variances, and so on. By using `numpy`'s vectorized tools, it is often possible to construct sophisticated mathematical functions that incorporate aggregations, using very simple code. 

In [17]:
import numpy as np

In [18]:
#list of numbers 0 through 9 
a=np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can compute the sum of the entries of `a` in one of two ways: `np.sum(a)` or `a.sum()`.

In [19]:
np.sum(a)

45

In [21]:
a.sum()

45

The same holds for the mean, the minimum, and the maximum: each is available both as a function and as an array method.

In [22]:
a.mean()

4.5

In [23]:
np.mean(a)

4.5

In [24]:
np.max(a)

9

In [25]:
a.max()

9

In [26]:
a.min()

0

In [27]:
np.min(a)

0

The product of all entries in `a` is zero, because `a` contains 0:

In [28]:
a.prod()

0

In [29]:
np.prod(a)

0

In [30]:
#avoid the zero term
a[1:].prod()

362880

### Example: Variance

The *variance* of a set of numbers $x_1, \ldots, x_n$ is 

$$\mathrm{var}(x_1,\ldots,x_n) = \frac{1}{n}\left[(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]\;,$$

where 

$$\bar{x} = \frac{1}{n}\left(x_1 + \cdots + x_n\right)$$

is the *mean*. Data sets with large variance are more "spread out" or "variable." 

While the formula for variance looks somewhat complicated, we can compute it as a one-liner using `numpy`'s vectorization and aggregation tools. 

In [31]:
#array of 100 random numbers between 0 and 1
x=np.random.rand(100)

In [32]:
x.var(), np.var(x)

(0.06656206510383104, 0.06656206510383104)

We can also compute the variance by hand: it is the average of the squared deviations from the mean.

In [33]:
np.mean((x-x.mean())**2)

0.06656206510383104
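The formula above divides by $n$, and that is also what `np.var` does by default. As a small sketch (the seed and variable names are chosen only for illustration), the `ddof` argument of `np.var` switches to the $1/(n-1)$ version that is often called the sample variance:

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(0)
x = rng.random(100)
n = x.size

# Variance as defined above: the mean of the squared deviations (divides by n).
by_hand = np.mean((x - x.mean()) ** 2)

# Sample variance: the same sum of squares, divided by n - 1 instead.
sample_by_hand = np.sum((x - x.mean()) ** 2) / (n - 1)

print(np.isclose(by_hand, np.var(x)))                 # default ddof=0
print(np.isclose(sample_by_hand, np.var(x, ddof=1)))  # ddof=1
```

For an array of this size the two values differ only slightly, but the distinction matters when comparing results with other tools that may default to the $1/(n-1)$ convention.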

## Boolean Aggregations

There are two especially useful aggregation functions for working with boolean arrays. `np.any()` returns `True` if ANY of the array elements is `True`. `np.all()` returns `True` only if ALL of the array elements are `True`.

In [34]:
a=np.arange(10)
mask=a>7
a[mask]

array([8, 9])

In [36]:
#sum over a subset
np.sum(a[mask])

17

In [37]:
#is the mask true for any entry
np.any(mask)

True

In [39]:
#is the mask true for every entry
np.all(mask)

False
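Since `True` behaves like 1 and `False` like 0 in arithmetic, the ordinary aggregations are also useful on masks. The following sketch reuses the same `a` and `mask` as above; the names `count` and `fraction` are introduced only for this illustration.

```python
import numpy as np

a = np.arange(10)
mask = a > 7

# any: at least one entry satisfies the condition
has_any = np.any(mask)    # True, since 8 and 9 exceed 7
# all: every entry satisfies the condition
has_all = np.all(mask)    # False, since 0 through 7 do not

# True counts as 1 and False as 0, so sum counts the matches
count = np.sum(mask)      # 2
# and mean gives the fraction of entries that match
fraction = np.mean(mask)  # 0.2
```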

## Aggregation Along Multiple Dimensions

Often we don't just want the sum of all the numbers in an array -- we want the sum of all numbers *per row*. For another example, maybe we want the largest entry *in each column*. `numpy` makes it easy to accomplish these tasks via the `axis` argument. 

In [40]:
#3 by 5 grid
A=np.reshape(np.arange(15),(3,5))
A

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [41]:
#sum all entries
A.sum()

105

We can also sum over rows or over columns. Axis 0 is the row axis and axis 1 is the column axis.

In [42]:
#axis=0 means you sum over rows, result is the sum of each column
B=A.sum(axis=0)
B

array([15, 18, 21, 24, 27])

In [43]:
#axis=1 means you sum over columns, the result is the sum of each row
C=A.sum(axis=1)
C

array([10, 35, 60])

Remembering which is which is confusing for most people. Here is one way to remember: `A.shape` is `(3, 5)`. Setting `axis=0` means we eliminate axis 0 (the 3), so we are left with something of length 5. Setting `axis=1` means we eliminate axis 1 (the 5), so we are left with something of length 3.
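The shape rule can be checked directly. The sketch below also tries a three-dimensional array (`T` is a name made up for this illustration) to suggest that the same rule applies with more axes, and uses the `keepdims=True` option, which keeps the reduced axis with length 1 instead of dropping it.

```python
import numpy as np

A = np.reshape(np.arange(15), (3, 5))

# Reducing over an axis removes that axis from the shape.
shape_axis0 = A.sum(axis=0).shape                 # (5,)
shape_axis1 = A.sum(axis=1).shape                 # (3,)

# keepdims=True keeps the reduced axis, with length 1.
shape_keep = A.sum(axis=1, keepdims=True).shape   # (3, 1)

# The same rule with three axes: removing axis 1 from (2, 3, 4) leaves (2, 4).
T = np.zeros((2, 3, 4))
shape_3d = T.sum(axis=1).shape                    # (2, 4)

print(shape_axis0, shape_axis1, shape_keep, shape_3d)
```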

In matrix notation (with indices counted from 1), setting `axis=0` corresponds to
$$
B[j]=\sum_{i=1}^3 A[i,j],\quad j=1,2,3,4,5.
$$
When we write `axis=0`, we are summing over axis zero, which in this case corresponds to the index $i$. Therefore, the $j$-th entry of $B$ is the sum of all entries in column $j$. Similarly, when `axis=1`, we have
$$
C[i]=\sum_{j=1}^5 A[i,j],\quad i=1,2,3.
$$
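To make the two formulas concrete, here is a sketch that evaluates them with explicit Python loops and compares the results against the `axis` versions. Python indices start at 0, whereas the formulas count from 1. The names `B_loop` and `C_loop` are introduced only for this check.

```python
import numpy as np

A = np.reshape(np.arange(15), (3, 5))
n_rows, n_cols = A.shape

# B[j] = sum over i of A[i, j]: one entry per column
B_loop = np.array([sum(A[i, j] for i in range(n_rows)) for j in range(n_cols)])

# C[i] = sum over j of A[i, j]: one entry per row
C_loop = np.array([sum(A[i, j] for j in range(n_cols)) for i in range(n_rows)])

print(np.array_equal(B_loop, A.sum(axis=0)))
print(np.array_equal(C_loop, A.sum(axis=1)))
```

The loops are shown only to spell out the formulas; for real work the `axis` argument is shorter and much faster.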



We can also use `axis=0` or `axis=1` with other aggregation functions.

In [44]:
#min with axis=0 gives the min of each column, taken over rows
A.min(axis=0)

array([0, 1, 2, 3, 4])

In [45]:
#max with axis =1 gives the max of each row, taken over columns
A.max(axis=1)

array([ 4,  9, 14])

In [46]:
#mean with axis=1 gives the mean of each row
A.mean(axis=1)

array([ 2.,  7., 12.])

In [47]:
#cumsum with axis=1 gives the cumulative sum of each row
A.cumsum(axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]], dtype=int32)

## Dealing with NaN Values

What happens when there is a NaN value in your array? Since NaNs propagate, aggregations involving NaNs will generate more NaNs:

In [49]:
#convert to float, since an integer array cannot hold NaN
A=1.0*A
#overwrite the first row with NaN
A[0,:]=np.nan
A

array([[nan, nan, nan, nan, nan],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.]])

In [50]:
A.mean(axis=0)

array([nan, nan, nan, nan, nan])

In [51]:
A.mean(axis=1)

array([nan,  7., 12.])

`np.nanmean` lets us ignore the NaNs:

In [52]:
np.nanmean(A,axis=0)

array([ 7.5,  8.5,  9.5, 10.5, 11.5])

The story is similar with `np.nansum()`, `np.nanmax()`, and `np.nanmin()`.
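As a brief sketch of these functions, the block below rebuilds the same array with a NaN first row and aggregates down each column, so that every column still has two valid entries. The result names are chosen only for this illustration.

```python
import numpy as np

# Same 3 by 5 grid as above, converted to float with the first row set to NaN.
A = np.reshape(np.arange(15), (3, 5)) * 1.0
A[0, :] = np.nan

col_sum = np.nansum(A, axis=0)   # 15, 17, 19, 21, 23
col_max = np.nanmax(A, axis=0)   # 10, 11, 12, 13, 14
col_min = np.nanmin(A, axis=0)   # 5, 6, 7, 8, 9

# A row made up entirely of NaNs has nothing left to add, so nansum gives 0 there.
row_sum = np.nansum(A, axis=1)   # 0, 35, 60

print(col_sum, col_max, col_min, row_sum)
```

Note that an all-NaN row or column is a special case: `np.nansum` treats it as an empty sum, while `np.nanmax`, `np.nanmin`, and `np.nanmean` have no valid entries to work with and return NaN with a warning.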