# 02.04 - Aggregations: Min, Max, and Everything In Between

Often, the first step when dealing with large amounts of data is computing summary statistics (mean, median, min/max, quantiles, etc.).  

NumPy has fast built-in **aggregation functions** for these:

### Summing the Values in an Array

In plain Python, we would use the built-in <code>sum</code> function:

In [3]:
import numpy as np

In [6]:
L = np.random.random(100)
sum(L)

49.54737404228075

The same syntax works with NumPy's <code>sum</code> function (although the two functions differ in the meaning of their optional arguments):

In [7]:
np.sum(L)

49.54737404228075
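The difference in optional arguments is easy to trip over. As a minimal sketch (the array `A` is a made-up example, not taken from the cells above): the second positional argument of the built-in <code>sum</code> is a start value, while the second positional argument of <code>np.sum</code> is the axis to sum along.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])  # hypothetical example array

# Built-in sum iterates over the rows; the second argument is the start value.
builtin_result = sum(A, 1)      # 1 + [1, 2] + [3, 4] -> array([5, 7])

# np.sum treats the second argument as the axis to sum along.
numpy_result = np.sum(A, 1)     # [1 + 2, 3 + 4] -> array([3, 7])

print(builtin_result, numpy_result)
```

The same call shape therefore gives different results, which is one reason to prefer the explicit keyword form <code>np.sum(A, axis=1)</code>.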

However, the difference in performance is noticeable, especially for larger arrays:

In [8]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array) # 2 orders of magnitude faster!

238 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
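The <code>%timeit</code> magic is only available in IPython and Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard-library <code>timeit</code> module. The repeat count of 3 is an arbitrary choice, and the timings will vary by machine:

```python
import timeit
import numpy as np

big_array = np.random.rand(1_000_000)

# Run each call a few times and keep the average time per call (in seconds).
t_builtin = timeit.timeit(lambda: sum(big_array), number=3) / 3
t_numpy = timeit.timeit(lambda: np.sum(big_array), number=3) / 3

print(f"built-in sum: {t_builtin * 1e3:.2f} ms per call")
print(f"np.sum:       {t_numpy * 1e3:.2f} ms per call")
```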


### Minimum and Maximum

Similarly, Python's built-in <code>min</code> and <code>max</code> functions have faster equivalents in NumPy:

In [11]:
%timeit min(big_array)
%timeit np.min(big_array)

167 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.59 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


**Note**: a shorter syntax is to call the corresponding methods of the array object itself:

In [12]:
print(big_array.min(), big_array.max(), big_array.sum())

1.404321527953556e-07 0.9999990688289864 499831.49227356917


### Multidimensional Aggregates

One common type of aggregation operation is an aggregate along a row or column.  For example:

In [14]:
M = np.random.random((3, 4))
print(M)

[[0.29579535 0.37803529 0.00963572 0.93431572]
 [0.25882731 0.09027571 0.81799307 0.05484366]
 [0.45841601 0.8564048  0.9004233  0.58082741]]


By default, each aggregation function reduces over the entire array. If we want it to compute **only along a specified axis**, we can pass the optional <code>axis</code> argument.  

For example, to find the maximum of each column we use <code>axis=0</code>:

In [21]:
M.max(axis=0)

array([0.45841601, 0.8564048 , 0.9004233 , 0.93431572])

To sum the values in each row, we use <code>axis=1</code>:

In [23]:
M.sum(axis=1)

array([1.61778208, 1.22193975, 2.79607153])
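One way to read the <code>axis</code> argument is as the dimension that gets collapsed, not the dimension that is returned. The sketch below uses a small made-up integer array `X` so the results are easy to check by hand:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # hypothetical 3x4 array: rows 0..3, 4..7, 8..11

col_sums = X.sum(axis=0)  # collapses the 3 rows    -> one value per column
row_sums = X.sum(axis=1)  # collapses the 4 columns -> one value per row

print(col_sums.shape, col_sums)  # (4,) [12 15 18 21]
print(row_sums.shape, row_sums)  # (3,) [ 6 22 38]
```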

**Note**: NumPy has a large list of available aggregation functions. Most of them also have a <code>NaN</code>-safe counterpart that ignores missing values.  

Here is a list of useful aggregation functions:

<pre>
Function        NaN-safe version    Description
np.sum          np.nansum           Compute sum of elements
np.prod         np.nanprod          Compute product of elements
np.mean         np.nanmean          Compute mean of elements
np.std          np.nanstd           Compute standard deviation
np.var          np.nanvar           Compute variance
np.min          np.nanmin           Find minimum value
np.max          np.nanmax           Find maximum value
np.argmin       np.nanargmin        Find index of minimum value
np.argmax       np.nanargmax        Find index of maximum value
np.median       np.nanmedian        Compute median of elements
np.percentile   np.nanpercentile    Compute rank-based statistics of elements
np.any          N/A                 Evaluate whether any elements are true
np.all          N/A                 Evaluate whether all elements are true

</pre>
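As a brief illustration of the <code>NaN</code>-safe variants (the `data` array is a made-up example): a regular aggregate propagates a missing value, while its <code>nan</code>-prefixed counterpart skips it.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])  # hypothetical data with one missing value

plain_mean = np.mean(data)    # nan: the missing value propagates to the result
safe_mean = np.nanmean(data)  # 2.0: NaN entries are ignored
safe_max = np.nanmax(data)    # 3.0

print(plain_mean, safe_mean, safe_max)
```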