Most people use the average of the data as the standard summary statistic; for instance,
when receiving their exam scores, students will usually ask what the class average was for
the exam. The average is also the most commonly used summary statistic by data scientists.
Most statisticians use the term sample mean for this statistic and often refer to it as simply
the mean. However, using the term mean to refer to the average of a set of numbers can
create confusion with another type of mean for random data: the ensemble mean, which
is also usually just called the mean. The ensemble mean is introduced in Chapter 9.
Because of this ambiguity in the meaning of the word mean, I will use the terms average
or sample mean to refer to the value that is computed from data.
When teaching a class on this subject, I usually ask the class: “What does the average
or sample mean mean?” Here are some of the common answers:
1. The value that most of the data values are centered around
2. The value that is most likely to occur
3. The value that has minimum distance from every value
4. The value that divides the data into two sets of equal size
Only one of these descriptions of the average is always accurate, and even then, the
description is ambiguous.
To understand the average (and also some other summary statistics), we first have to
understand that representing data by a summary statistic results in errors, and we can use
these errors to help choose a “good” summary statistic. Let $d_0, d_1, \ldots, d_{N-1}$ be a set of
numerical data values, and let ν be a summary statistic. The error $e_i$ between data value
$d_i$ and ν is simply $e_i = d_i - \nu$. Note that the error may be positive or negative. Intuitively,
we should try to choose ν to minimize the errors to the data. However, how to do this is
not entirely clear because we have many different errors, and if we adjust the value of ν to
decrease the error to one data value, that may increase the error to another data value.
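
The trade-off described above can be seen in a minimal sketch. The data values and the two candidate values of ν below are hypothetical, chosen only for illustration, and `errors` is a helper name introduced here rather than one taken from the text:

```python
# Hypothetical data, chosen only for illustration
data = [0, 4, 10]


def errors(data, nu):
    """Return the list of errors e_i = d_i - nu."""
    return [d - nu for d in data]


# Moving nu from 3 to 4 shrinks the errors to the data values 4 and 10,
# but grows the magnitude of the error to the data value 0.
errors_at_3 = errors(data, 3)
errors_at_4 = errors(data, 4)
print(errors_at_3)
print(errors_at_4)
```

No single value of ν makes every error small at once, which is why the errors need to be combined into one number before they can be minimized.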

The typical way to overcome this problem is to combine the errors in some way to create
a single numerical value. Below I show some possible choices of functions for combining
the errors. Since the errors depend on the choice of summary statistic, ν, we write these
combinations as functions of ν:

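1. Sum of errors:
$$E_s(\nu) = \sum_{i=0}^{N-1} e_i$$
2. Number of nonzero error values:
$$E_0(\nu) = \sum_{i=0}^{N-1} \mathbb{1}_{\mathbb{R}\setminus\{0\}}(e_i)$$
Here, the indicator function $\mathbb{1}_{\mathbb{R}\setminus\{0\}}(e_i)$ returns 1 if $e_i$ is nonzero and 0 otherwise.
3. Sum of absolute errors:
$$E_1(\nu) = \sum_{i=0}^{N-1} |e_i|$$
4. Sum of squared errors:
$$E_2(\nu) = \sum_{i=0}^{N-1} (e_i)^2$$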
The code below calculates each of these error functions as the value of ν is varied, using
a simple data set, D = {−1, −1, 0, 2, 5}:

In [1]:
import numpy as np

# The data:
D = [-1, -1, 0, 2, 5]
# Sweep the value of nu from -2 to 6
nus = np.arange(-2, 6.01, 0.01)
# For clarity, store the different error metrics in different variables.
# Initialize them here
sum_errors = 0
num_nonzero_errors = 0
sum_abs_errors = 0
sum_square_errors = 0
# Calculate the error metrics
for d in D:
    sum_errors += d - nus
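    # Grid values from np.arange carry floating-point rounding, so test for
    # zero with a tolerance instead of an exact comparison
    num_nonzero_errors += ~np.isclose(d - nus, 0)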
    sum_abs_errors += np.abs(d - nus)
    sum_square_errors += (d - nus) ** 2

In [2]:
nu_e = nus[np.argmin(sum_errors)]
nu_0 = nus[np.argmin(num_nonzero_errors)]
nu_1 = nus[np.argmin(sum_abs_errors)]
nu_2 = nus[np.argmin(sum_square_errors)]

Interestingly, even for this very small data set, each of these error metrics results in a
different minimizing value of ν on the range ν ∈ [−2, 6]. None of these metrics is inherently
“correct” – although we explain below why one of them is not useful.
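
The minimizing values can also be checked numerically without the graph. The following is a minimal, self-contained sketch rather than the author's code: `grid_minimizers` is a hypothetical helper name, and the use of `np.linspace` together with a tolerance-based zero test is an assumption made here so that the grid comparison is robust to floating-point rounding.

```python
import numpy as np


def grid_minimizers(data, lo=-2.0, hi=6.0, num=801):
    """Return the grid value of nu that minimizes each error metric.

    Hypothetical helper: evaluates all four metrics on an evenly spaced grid.
    """
    data = np.asarray(data, dtype=float)
    nus = np.linspace(lo, hi, num)
    # errors[i, j] is the error between data value i and grid value j
    errors = data[:, None] - nus[None, :]
    metrics = {
        "sum of errors": errors.sum(axis=0),
        "nonzero errors": (~np.isclose(errors, 0.0)).sum(axis=0),
        "sum of absolute errors": np.abs(errors).sum(axis=0),
        "sum of squared errors": (errors ** 2).sum(axis=0),
    }
    # np.argmin returns the index of the first minimum on the grid
    return {name: float(nus[np.argmin(vals)]) for name, vals in metrics.items()}


result = grid_minimizers([-1, -1, 0, 2, 5])
for name, nu in result.items():
    print(f"{name}: nu = {nu:.2f}")
```

On this grid the four minimizers land at 6, −1, 0, and 1, respectively. The first of these is simply the right-hand edge of the grid, not a true minimum.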
Figure: Zoomed-in view of two error functions comparing a summary statistic ν to a data set.
Data are shown as stars.
From inspecting the graph and further thought, we can make the following observations:
1. The sum of errors decreases without bound as a function of ν. Thus, there is
no minimizing value of ν if ν can be any real value. Because of this, the sum of
errors is not useful in determining a summary statistic.
2. The number of nonzero errors is equal to the total number of data values, except
at each of the data values. The number of nonzero errors at a data value is equal
to the number of data values that differ from it. This metric will be minimized by
setting ν equal to one of the data values that appears the most in the data set.
3. The sum of absolute errors is a continuous, piecewise-linear function. Furthermore, the linear segments connect between data values, between the minimum
data value and −∞, and between the maximum data value and +∞. It is not
hard to see that the absolute error will increase linearly as ν is decreased from
the minimum data value or increased from the maximum data value. For such
a function, the minimizing value will be at one of the values where the linear
segments intersect, which is at one of the data values.
4. The sum of squared errors is a parabola. For this example, the minimum is not
at one of the data values.
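
Observations 1 and 4 above can be checked with a short derivation. This is a sketch based only on the definitions of the error metrics given earlier, using the same symbols:

```latex
\begin{align*}
E_s(\nu) &= \sum_{i=0}^{N-1} (d_i - \nu)
          = \Big(\sum_{i=0}^{N-1} d_i\Big) - N\nu, \\
E_2(\nu) &= \sum_{i=0}^{N-1} (d_i - \nu)^2, \qquad
\frac{dE_2}{d\nu} = -2\sum_{i=0}^{N-1} (d_i - \nu) = 0
\;\Longrightarrow\;
\nu = \frac{1}{N}\sum_{i=0}^{N-1} d_i .
\end{align*}
```

The first line is a straight line in ν with slope −N, so it has no minimum over the real numbers. In the second line, the second derivative is 2N > 0, so the stationary point is a minimum. For D = {−1, −1, 0, 2, 5}, the data sum to 5 and N = 5, which gives ν = 1, a value that is not in the data set.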
In the subsections below, we use the metrics E0, E1, and E2 to define three common
summary statistics.

In [3]:
import pandas as pd

df = pd.DataFrame(D)
df

   0
0 -1
1 -1
2  0
3  2
4  5


In [4]:
import scipy.stats as stats

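# stats.mode returns a result object holding the mode and its count
stats.mode(D, keepdims=False)
# Index 0 of the result is the mode itself
stats.mode(D, keepdims=False)[0]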

np.int64(-1)
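
The mode reported above can be tied back to the metric E0. The sketch below is an illustration under the assumption that only the distinct data values need to be tested, since any other ν leaves every error nonzero. The variable names are introduced here for illustration:

```python
import numpy as np

D = [-1, -1, 0, 2, 5]
data = np.asarray(D)

# Only data values can lower E0, so test each distinct value as a candidate
candidates = np.unique(data)
e0 = np.array([np.count_nonzero(data - nu) for nu in candidates])

# The candidate with the fewest nonzero errors is the most frequent value
best = candidates[np.argmin(e0)]
print(dict(zip(candidates.tolist(), e0.tolist())))
print(best)
```

The value −1 leaves three nonzero errors, while every other data value leaves four, so the minimizer of E0 agrees with the mode found above.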

In [5]:
np.median(D)

np.float64(0.0)

In [6]:
df.median()

0    0.0
dtype: float64
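
The median can likewise be related to the sum of absolute errors, E1. The following minimal sketch evaluates E1 at each distinct data value, which is where the minimum of a piecewise-linear function of this kind can occur. `sum_abs_errors` is a helper name introduced here for illustration:

```python
import numpy as np

D = [-1, -1, 0, 2, 5]
data = np.asarray(D, dtype=float)


def sum_abs_errors(nu):
    """Sum of absolute errors E1 between the data and the candidate nu."""
    return np.abs(data - nu).sum()


# Evaluate E1 at each distinct data value
e1_at_data = {float(d): float(sum_abs_errors(d)) for d in np.unique(data)}
best = min(e1_at_data, key=e1_at_data.get)
print(e1_at_data)
print(best, np.median(D))
```

For this data set, the smallest value of E1 occurs at 0, which matches `np.median(D)`. When the number of data values is even, E1 is constant between the two middle values, and `np.median` reports their midpoint.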

In [7]:
np.mean(D)

np.float64(1.0)

In [8]:
df.mean()

0    1.0
dtype: float64

In [9]:
D3 = D + [100]

In [10]:
np.mean(D3)

np.float64(17.5)

In [11]:
np.median(D3)

np.float64(1.0)
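
The last three cells appear to illustrate how differently the average and the median respond to a single extreme value. As a hedged extension, the sketch below repeats the comparison for a few outlier sizes. The values 10, 100, and 1000 are arbitrary choices made here for illustration:

```python
import numpy as np

D = [-1, -1, 0, 2, 5]

# Append one outlier of increasing size and compare the two statistics
results = {}
for outlier in [10, 100, 1000]:
    extended = D + [outlier]
    results[outlier] = (float(np.mean(extended)), float(np.median(extended)))
    print(outlier, results[outlier])
```

The average grows with the size of the outlier, while the median stays at 1.0 in every case, because the two middle values of the extended data set are still 0 and 2.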