## Aggregations: Min, Max, and Everything In Between
Eber David Gaytan Medina

When faced with a large amount of data, a common first step is to compute summary statistics. Perhaps the most common are the mean and standard deviation, which summarize the "typical" value in a dataset and the spread around it, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

In [1]:
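import numpy as np
L = np.random.random(100)  # 100 uniform random values in [0, 1)
sum(L)  # Python's built-in sum, which loops over the array element by element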

51.11481177469645

In [None]:
np.sum(L)
55.612091166049424

In [None]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
10 loops, best of 3: 104 ms per loop
1000 loops, best of 3: 442 µs per loop
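
One caveat when swapping between the two: the built-in `sum` and `np.sum` are not interchangeable, because their optional second arguments mean different things. The following is a minimal sketch using a small hypothetical array `a`, which is not part of the timing example above:

```python
import numpy as np

# Hypothetical 2x3 array, chosen so the results are easy to check by hand
a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Built-in sum: the second argument is the starting value,
# and iteration runs over the rows of the array
builtin_result = sum(a, 10)      # 10 + [0, 1, 2] + [3, 4, 5]

# np.sum: the second positional argument is the axis to reduce over
numpy_result = np.sum(a, 1)      # one sum per row

print(builtin_result)  # [13 15 17]
print(numpy_result)    # [ 3 12]
```

Passing `axis` by keyword (`np.sum(a, axis=1)`) avoids this ambiguity.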

In [None]:
min(big_array), max(big_array)
(1.1717128136634614e-06, 0.9999976784968716)

In [None]:
np.min(big_array), np.max(big_array)
(1.1717128136634614e-06, 0.9999976784968716)

In [None]:
%timeit min(big_array)
%timeit np.min(big_array)
10 loops, best of 3: 82.3 ms per loop
1000 loops, best of 3: 497 µs per loop

In [None]:
print(big_array.min(), big_array.max(), big_array.sum())
1.17171281366e-06 0.999997678497 499911.628197

In [None]:
M = np.random.random((3, 4))
print(M)
[[ 0.8967576   0.03783739  0.75952519  0.06682827]
 [ 0.8354065   0.99196818  0.19544769  0.43447084]
 [ 0.66859307  0.15038721  0.37911423  0.6687194 ]]

In [None]:
M.sum()  # with no axis given, aggregate over every element of the array
6.0850555667307118

In [None]:
M.min(axis=0)  # minimum within each column (one value per column)
array([ 0.66859307,  0.03783739,  0.19544769,  0.06682827])

In [None]:
M.max(axis=1)  # maximum within each row (one value per row)
array([ 0.8967576 ,  0.99196818,  0.6687194 ])
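
The `axis` keyword names the dimension that is collapsed, not the one that is returned: `axis=0` reduces down the rows and leaves one value per column, while `axis=1` leaves one value per row. A small sketch with a hypothetical integer matrix `A` (not the random `M` above) makes this easy to verify by hand:

```python
import numpy as np

# Hypothetical 3x4 matrix with values that are easy to track
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

col_min = A.min(axis=0)                    # collapse the 3 rows -> 4 values
row_max = A.max(axis=1)                    # collapse the 4 columns -> 3 values
row_sum_2d = A.sum(axis=1, keepdims=True)  # keep the collapsed axis with length 1

print(col_min)           # [1 2 3 4]
print(row_max)           # [ 4  8 12]
print(row_sum_2d.shape)  # (3, 1)
```

`keepdims=True` is convenient when the result should broadcast back against the original array, for example `A / row_sum_2d` to normalize each row.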

In [None]:
!head -4 data/president_heights.csv
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189

In [None]:
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173
 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183
 177 185 188 188 182 185]

In [None]:
print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())
Mean height:        179.738095238
Standard deviation: 6.93184344275
Minimum height:     163
Maximum height:     193
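
Note that `std` computes the population standard deviation by default, dividing by N. If the sample standard deviation (dividing by N - 1) is what is wanted, NumPy's `ddof` argument can be used. A minimal sketch on a small hypothetical array `x`, rather than the presidents data:

```python
import numpy as np

# Hypothetical data: mean is 5, squared deviations sum to 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = x.std()           # default ddof=0: sqrt(32 / 8)
sample_std = x.std(ddof=1)  # ddof=1: sqrt(32 / 7)

print(pop_std)  # 2.0
print(sample_std > pop_std)  # True
```

For 42 heights the two versions differ only slightly, but the distinction matters when comparing against tools that default to the sample version.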

In [None]:
print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))
25th percentile:    174.25
Median:             182.0
75th percentile:    183.0
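
The median is the 50th percentile, and `np.percentile` accepts a sequence of percentiles, so several can be computed in one call. A sketch on a small hypothetical array `x`, relying on NumPy's default linear interpolation between data points:

```python
import numpy as np

# Hypothetical data: nine evenly spaced values
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Compute all three quartiles in a single call
q25, q50, q75 = np.percentile(x, [25, 50, 75])

print(q25, q50, q75)        # 3.0 5.0 7.0
print(q50 == np.median(x))  # True
print(q75 - q25)            # 4.0 (the interquartile range)
```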

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');

![Histogram of US presidents' heights: height (cm) on the x-axis, number on the y-axis](attachment:image.png)
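
The numbers behind a histogram like this can also be inspected without plotting. By default `plt.hist` groups the data into 10 equal-width bins, and `np.histogram` uses the same default, so the sketch below should give comparable counts. The `heights` values are copied from the array printed earlier, so no CSV file is needed; the printing format is only an illustrative choice.

```python
import numpy as np

# Heights (cm) copied from the printed array above
heights = np.array([
    189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178,
    183, 193, 178, 173, 174, 183, 183, 168, 170, 178, 182, 180, 183, 178,
    182, 188, 175, 179, 183, 193, 182, 183, 177, 185, 188, 188, 182, 185,
])

# Default: 10 equal-width bins spanning the min to the max of the data
counts, edges = np.histogram(heights)

# Print one line per bin: left edge, right edge, number of presidents
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} cm: {n}")
```

`counts` has one entry per bin and `edges` has one more entry than `counts`, since it holds both ends of every bin.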
