# Question 0: Confidence Intervals

First, let's revisit some of the notation from class:
1. Let $S$ represent the unknown, "true" value of some statistic, which is what we would get if we conducted our experiments on an infinitely large sample.
2. Let $\widehat S$ (pronounced "S hat") represent the value of the statistic calculated on the data that we did obtain. If we re-ran the experiment tomorrow, the true $S$ wouldn't be different, but our particular dataset might be a little different and so would tomorrow's $\widehat S$.
3. Next, let $S^*$ represent the value of that statistic calculated from a re-sampled dataset. (If we do 10,000 re-samplings, there will be 10,000 different $S^*$ values: $S^*_0 \ldots S^*_{9999}$. We will use $S^*_i$ to refer to a generic entry in the list of resampled statistics.)
4. Last, recall the **bootstrap analogy principle**: $\widehat S$ is to $S$ as $S^*$ is to $\widehat S$. That is, the values seen when re-sampling our dataset (the $S^*_i$) generally have the same relationship to $\widehat S$ as different $\widehat S$ values (i.e. from repeating the experiment a few different times) will have to the true value $S$. More specifically, the two differences have approximately the same probability distribution: $P(\widehat S - S) \approx P(S^* - \widehat S)$.

In class, we did some algebra that used these above definitions to go from a $[\textrm{low}, \textrm{high}]$ range that contains 95% of the $S^*$ values (i.e. $P(\textrm{low} \le S^* \le \textrm{high}) = 0.95$) to a $[\mathscr{l}, \mathscr{h}]$ range that, 95% of the time, will contain the true $S$ (i.e. $P(\mathscr{l} \le S \le \mathscr{h}) = 0.95$).

At the end of the day, we had $\mathscr{l} = 2\widehat S - \textrm{high}$ and $\mathscr{h} = 2\widehat S - \textrm{low}$.
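
For reference, that algebra can be reconstructed roughly as follows. This is a sketch built only from the definitions above; the in-class derivation may have ordered the steps differently. The third line is where the bootstrap analogy is applied, swapping $S^* - \widehat S$ for $\widehat S - S$.

$$
\begin{aligned}
0.95 &= P(\textrm{low} \le S^* \le \textrm{high}) \\
&= P(\textrm{low} - \widehat S \le S^* - \widehat S \le \textrm{high} - \widehat S) \\
&\approx P(\textrm{low} - \widehat S \le \widehat S - S \le \textrm{high} - \widehat S) \\
&= P(2\widehat S - \textrm{high} \le S \le 2\widehat S - \textrm{low})
\end{aligned}
$$

The last line follows by solving each inequality for $S$: $\textrm{low} - \widehat S \le \widehat S - S$ gives $S \le 2\widehat S - \textrm{low}$, and $\widehat S - S \le \textrm{high} - \widehat S$ gives $S \ge 2\widehat S - \textrm{high}$.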

## Question 0.0
We will first learn how to calculate the $[\textrm{low}, \textrm{high}]$ range containing 95% of the $S^*$ values.

Below is a version of the `resample_1sample` code from last homework, and code to load in the blood pressure dataset.

Use `resample_1sample` with `numpy.mean` to calculate the means of 10,000 different resamplings of the GG blood pressure data. Store these in a variable `s_stars`. Also calculate `s_hat`, the actual mean.

Now, we need to find a range containing 95% of these means. The most straightforward way to do this is to choose a range that excludes the bottom 2.5% and the top 2.5% of the means. In other words, we need to find the value below which 2.5% of the means are found, and the value below which 97.5% of the means are found. (These values are otherwise known as the 2.5th percentile and 97.5th percentile.)

One way to calculate a percentile is to simply sort the data, and then look at the data points that fall 2.5% and 97.5% along the way from lowest to highest value. If there are 10,000 data points, then the 2.5th percentile is data point number $10000 \cdot 0.025 = 250$ (but recall that the index of the 250th data point is 249...).
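
As a minimal sketch of that index arithmetic, the snippet below uses synthetic, normally distributed numbers as a stand-in for the resampled statistics (they are not the blood pressure data). The variable names `values`, `low_index` and `high_index` are made up for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
# Stand-in for 10,000 resampled statistics (synthetic, not the homework data).
values = rng.normal(loc=124, scale=0.8, size=10000)

sorted_values = numpy.sort(values)       # sorted copy; `values` itself is unchanged
n = len(sorted_values)
low_index = int(round(n * 0.025)) - 1    # 250th data point -> index 249
high_index = int(round(n * 0.975)) - 1   # 9750th data point -> index 9749
low = sorted_values[low_index]
high = sorted_values[high_index]
print(low_index, high_index, low, high)
```

With these indices, exactly 250 values are less than or equal to `low` and exactly 9,750 are less than or equal to `high` (barring ties).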

Use `numpy.sort`, which returns a sorted copy of the input array, to calculate these percentiles. Store them in variables `low` and `high`. 

Note that the provided code plots two histograms (on the same scale): that of the actual GG blood pressure values, and that of the resampled means. Note again how narrow the distribution of means is.

In [None]:
import numpy

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [12, 4]

def resample_1sample(data, statistic, n_resamples):
    data = numpy.asarray(data)
    stats = []
    for i in range(n_resamples):
        resampled_data = numpy.random.choice(data, size=len(data), replace=True)
        stats.append(statistic(resampled_data))
    return stats

aa_bp = []
gg_bp = []
with open('bloodpressure3.txt') as file:
    header = file.readline()  # skip the column-header line
    for line in file:
        bp, genotype = line.strip('\n').split('\t')
        bp = float(bp)
        if genotype == 'AA':
            aa_bp.append(bp)
        elif genotype == 'GG':
            gg_bp.append(bp)
        else:
            print('unknown genotype!', genotype)
aa_bp = numpy.array(aa_bp)
gg_bp = numpy.array(gg_bp)


# YOUR ANSWER HERE


plt.xlim(gg_bp.min()-2, gg_bp.max()+2)
plt.hist(gg_bp, bins='auto', label='GG data')
plt.axvline(s_hat, color='green')
plt.legend()

plt.figure() # start a new plot below the first
plt.xlim(gg_bp.min()-2, gg_bp.max()+2)
plt.hist(s_stars, bins='auto', label='GG mean resamples')
plt.axvline(low, color='blue')
plt.axvline(high, color='blue')
plt.axvline(s_hat, color='green')
plt.legend()

print(low, s_hat, high)

In [None]:
assert 122 < low < 124
assert 125 < high < 126
assert abs(s_hat - 124.021365854) < 0.001
assert len(s_stars) == 10000
sss = numpy.array(s_stars)
assert numpy.count_nonzero(sss <= low) == 250
assert numpy.count_nonzero(sss <= high) == 9750

## Question 0.1
Now do the same thing (including plotting histograms) with the median value of the GG blood pressures (using `numpy.median` as your statistic).

In [None]:
# YOUR ANSWER HERE

print(low, s_hat, high)

In [None]:
assert low == 120.962
assert high == 125.414

Interestingly, the distribution of resampled medians is quite "spiky". Can you think why this might be?

Of course, the approach above is a bit of a fussy way to calculate percentiles, and it doesn't handle the case of arrays with sizes that aren't as amenable to percentiling as was our 10,000-element array. Instead, we can use the `numpy.percentile` function, which takes either a single percentile (0 to 100) or a list of them:
```python
median = numpy.percentile(s_stars, 50)
low, high = numpy.percentile(s_stars, [2.5, 97.5])
```
Note that `numpy.percentile` will take care of sorting the input data for you.
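
One detail worth knowing: with its default settings, `numpy.percentile` interpolates linearly between neighboring sorted values, so its answer can differ very slightly from the sort-and-index approach. The sketch below, on synthetic data, shows where the 2.5th percentile of 10,000 values lands.

```python
import numpy

rng = numpy.random.default_rng(1)
values = rng.normal(size=10000)   # synthetic stand-in for s_stars
sorted_values = numpy.sort(values)

by_sorting = sorted_values[249]
by_percentile = numpy.percentile(values, 2.5)
# With the default (linear) method, the 2.5th percentile of 10,000 values
# sits at fractional position 0.025 * 9999 = 249.975, i.e. between
# sorted_values[249] and sorted_values[250].
print(by_sorting, by_percentile, sorted_values[250])
```

For a 10,000-element array the two approaches agree to within the gap between two adjacent sorted values, which is usually negligible.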


## Question 0.2
Use `numpy.percentile` and your knowledge of writing generic resampling functions from last homework to write a function to generate bootstrap confidence intervals: `bootstrap_ci(data_sets, statistic, n_resamples=10000, ci_percent=95)`, where:
 - `data_sets` is a list of one or more data sets
 - `statistic` is a function that will be used to calculate the statistic around which to produce the confidence interval. Just like with last homework's `resample` function, the `statistic` parameter could be a 1-sample function like `numpy.mean` or a two-sample function like `t_stat` from last homework, or some more exotic function that takes three parameters or whatnot.
 - `n_resamples` is the number of resamples to do
 - `ci_percent` is the percent confidence interval to produce
 
The function should return four things: `s_stars`, `l`, `s_hat` and `h`, corresponding to:
  - a list of the $S^*$ values
  - $\mathscr{l}$, the low end of the confidence interval
  - $\widehat S$, the calculated statistic
  - $\mathscr{h}$, the high end of the interval
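
As a hedged sketch of how `ci_percent` might map onto percentiles, the helper below (a hypothetical `pivot_interval`, not part of the required API) splits the excluded percentage evenly between the two tails and then flips the percentile range around `s_hat` using the formulas for $\mathscr{l}$ and $\mathscr{h}$ given earlier. The `s_stars` here are synthetic.

```python
import numpy

def pivot_interval(s_stars, s_hat, ci_percent=95):
    # Hypothetical helper, not part of the assignment's required API.
    tail = (100 - ci_percent) / 2            # e.g. 2.5 for a 95% interval
    low, high = numpy.percentile(s_stars, [tail, 100 - tail])
    # Flip the percentile range around s_hat: l = 2*s_hat - high, h = 2*s_hat - low.
    return 2 * s_hat - high, 2 * s_hat - low

rng = numpy.random.default_rng(2)
s_hat = 10.0
s_stars = s_hat + rng.normal(scale=0.5, size=10000)   # synthetic resampled statistics
l95, h95 = pivot_interval(s_stars, s_hat, ci_percent=95)
l99, h99 = pivot_interval(s_stars, s_hat, ci_percent=99)
print(l95, h95)
print(l99, h99)
```

A 99% interval excludes less of each tail than a 95% interval does, so it comes out wider.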

When using a `statistic` function that takes one argument (like `numpy.mean`, that you call as `numpy.mean(data)`), the `bootstrap_ci` function would be called as follows:

```python
s_stars, l, s_hat, h = bootstrap_ci([data], numpy.mean)
```

When using a `statistic` that takes two arguments (like e.g. a `t_statistic` function that you might write and call as `t_statistic(data1, data2)`), `bootstrap_ci` would be called as follows:

```python
s_stars, l, s_hat, h = bootstrap_ci([data1, data2], t_statistic)
```
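
The reason `data_sets` is always a list, even for a single data set, is that the list can be expanded into positional arguments with `*`. A small illustration, using a hypothetical two-sample statistic `mean_gap` and toy arrays:

```python
import numpy

def mean_gap(data1, data2):
    # Hypothetical two-sample statistic used only for this illustration.
    return numpy.mean(data1) - numpy.mean(data2)

one_set = [numpy.array([1.0, 2.0, 3.0])]
two_sets = [numpy.array([1.0, 2.0, 3.0]), numpy.array([0.0, 1.0])]

# statistic(*data_sets) expands the list into positional arguments:
result_one = numpy.mean(*one_set)   # same as numpy.mean(one_set[0])
result_two = mean_gap(*two_sets)    # same as mean_gap(two_sets[0], two_sets[1])
print(result_one, result_two)
```

The same call form, `statistic(*data_sets)`, therefore works for one-sample, two-sample, or more exotic statistics without any special cases.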

Since this is the last homework, you should be ready to write this function from scratch. The `resample` function from last week is provided: it will be useful to call `resample` from `bootstrap_ci`.

Note that the below code plots the resampled values for the means of the GG and AA blood pressures, plus 95% confidence intervals for those means.

In [None]:
def resample(data_sets, statistic, n_resamples):
    # turn every dataset in data_sets into a numpy array 
    data_sets = [numpy.asarray(data_set) for data_set in data_sets]
    stats = []
    for i in range(n_resamples):
        resampled_data_sets = [numpy.random.choice(data, size=len(data), replace=True) for data in data_sets]
        stats.append(statistic(*resampled_data_sets))
    return stats

# YOUR ANSWER HERE

s_stars, l, s_hat, h = bootstrap_ci([gg_bp], numpy.mean, ci_percent=99)
print('GG mean, 99% CI:', l, h)

s_stars, l, s_hat, h = bootstrap_ci([gg_bp], numpy.mean)
print('GG mean, 95% CI:', l, h)
plt.hist(s_stars, bins='auto', label='GG $S^*$')
plt.axvline(l, color='orange')
plt.axvline(h, color='orange')

s_stars, l, s_hat, h = bootstrap_ci([aa_bp], numpy.mean)
print('AA mean, 95% CI:', l, h)
plt.hist(s_stars, bins='auto', alpha=0.6, label='AA $S^*$')
plt.axvline(l, color='blue')
plt.axvline(h, color='blue')
plt.legend()

s_stars, l, s_hat, h = bootstrap_ci([aa_bp], numpy.max)
print('AA max observed:', s_hat)
print('AA max, 95% CI:', l, h)

In [None]:
s_stars, l, s_hat, h = bootstrap_ci([gg_bp], numpy.mean, ci_percent=99)
assert len(s_stars) == 10000
assert 121 < l < 122
assert 126 < h < 127
s_stars, l, s_hat, h = bootstrap_ci([aa_bp], numpy.max, n_resamples=5000)
assert len(s_stars) == 5000
assert l == 155.373
assert h == 167.129

So, it appears that the 95% confidence intervals for the mean of the GG and the AA blood pressure data overlap! Does this mean that there is no statistically significant difference in the means of the GG and AA groups?

## Question 0.3
Using the provided functions (which are simplified versions of what you wrote for the last homework), calculate the p-value for the difference in means between the GG and AA data (store it in a variable `p_val`). Make sure to order the data sets such that the (larger) GG mean is subtracted from the AA mean.

Note that the p-value should be significant! Next, using your `bootstrap_ci` function above, calculate a 95% confidence interval around the observed difference in means between these data sets, storing the results as `l`, `s_hat` and `h` as above. Also plot a histogram of the resampled differences in the means of the groups (i.e. the `s_stars`) with vertical lines at `h` and `l`.

In [None]:
def mean_difference(data1, data2):
    return numpy.mean(data1) - numpy.mean(data2)

def two_tail_p_value(actual_stat, resampled_stats, null_hyp_stat=0):
    resampled_stats = numpy.asarray(resampled_stats)
    actual_diff = abs(actual_stat - null_hyp_stat)
    diff = numpy.abs(resampled_stats - null_hyp_stat)
    count_extreme = numpy.count_nonzero(diff >= actual_diff)
    return count_extreme / len(resampled_stats)

def bootstrap_means_different(data1, data2, statistic, n_resamples=10000):
    data1 = numpy.asarray(data1)
    data2 = numpy.asarray(data2)
    actual_stat = statistic(data1, data2)
    # Just shift both means to zero (unlike last homework... but the result is the same)
    shifted1 = data1 - numpy.mean(data1)
    shifted2 = data2 - numpy.mean(data2) 
    resampled_stats = resample([shifted1, shifted2], statistic, n_resamples)
    p_val = two_tail_p_value(actual_stat, resampled_stats)
    return p_val

# YOUR ANSWER HERE

print('Difference between means:', s_hat, 'p =', p_val)
print('Confidence interval on differences:', l, h)


In [None]:
assert abs(s_hat - 5.25058013937) < 0.000001
assert 0.0225 < p_val < 0.0275
assert 0.5 < l < 0.8
assert 9.7 < h < 10.0

So, even though the confidence interval for the mean of the AA data overlaps that for the mean of the GG data, these two populations are still statistically distinct. There is a significant p-value for the difference in their means, and the 95% confidence interval on this difference does not include zero.

We should conclude from this that it's hard to read much into whether two confidence intervals overlap. Such an occurrence means that some percent of the time, the means of the resampled GG and AA data are close to each other. What matters is *what* percent of the time!

We can see this most easily from the histogram of the resampled differences in means. There *is* a small fraction below zero, but as long as that's less than 5% of the time (which it clearly is), the means are statistically significantly distinct. And indeed, that is what we observe in the p-value.
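
To see why overlap and significance can coexist, here is a back-of-the-envelope sketch. It swaps in the normal approximation (mean ± 1.96 standard errors) in place of the bootstrap, and the means and standard errors are illustrative numbers, not values computed from the homework data. The key point is that the standard error of a difference, $\sqrt{SE_1^2 + SE_2^2}$, is smaller than $SE_1 + SE_2$.

```python
import math

# Illustrative numbers only (not the homework data): two group means with
# equal standard errors, using the normal approximation mean +/- 1.96 * SE.
mean_a, se_a = 124.0, 1.6
mean_b, se_b = 129.25, 1.6
z = 1.96

ci_a = (mean_a - z * se_a, mean_a + z * se_a)
ci_b = (mean_b - z * se_b, mean_b + z * se_b)
intervals_overlap = ci_a[1] > ci_b[0]

diff = mean_b - mean_a
se_diff = math.sqrt(se_a**2 + se_b**2)   # smaller than se_a + se_b
ci_diff = (diff - z * se_diff, diff + z * se_diff)

print(ci_a, ci_b, intervals_overlap)
print(ci_diff)
```

With these numbers the two individual intervals overlap, yet the interval on the difference stays above zero.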

## Question 0.4
How does the size of the confidence interval scale with the sample size? To examine this question, we will take a few random subsets of the GG data and calculate the width of the bootstrap CIs produced.

Calculate the width (i.e. `h - l`) of a 95% CI around the mean for each of the subsets below (i.e. the `subsampled_ggs`), store those in a list `ci_widths`, and use `plt.plot` to produce a line-plot of the relationship between the sample sizes (on the x axis) and those widths.

In [None]:
sample_sizes = [5, 10, 25, 50, 100, 150, 200]
subsampled_ggs = [numpy.random.choice(gg_bp, size=sample_size) for sample_size in sample_sizes]

# YOUR ANSWER HERE
print(ci_widths)

In [None]:
assert len(ci_widths) == len(sample_sizes)
assert ci_widths[0] > ci_widths[3] > ci_widths[6]
assert 2.9 < ci_widths[6] < 3.9

As you might expect, the larger the sample, the narrower the confidence interval.
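
For the mean, the narrowing is expected to go roughly as $1/\sqrt{n}$: quadrupling the sample size should about halve the width. The sketch below checks this on synthetic normal data, with a hypothetical helper `percentile_range_width`. It measures `high - low`, which equals `h - l` because $\mathscr{h} - \mathscr{l} = (2\widehat S - \textrm{low}) - (2\widehat S - \textrm{high})$.

```python
import numpy

rng = numpy.random.default_rng(3)

def percentile_range_width(data, n_resamples=2000):
    # Width of the central 95% range of resampled means (equal to h - l).
    resamples = rng.choice(data, size=(n_resamples, len(data)), replace=True)
    low, high = numpy.percentile(resamples.mean(axis=1), [2.5, 97.5])
    return high - low

sizes = [25, 100, 400]
widths = [percentile_range_width(rng.normal(size=n)) for n in sizes]
for n, width in zip(sizes, widths):
    # If width ~ 1/sqrt(n), the last column should stay roughly constant.
    print(n, width, width * numpy.sqrt(n))
```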

## Question 0.5
Next, we will examine the reliability of the confidence interval procedure. Recall from class that we showed that, if the bootstrap analogy holds, then for 95% of samples, a "95% confidence interval" constructed from that sample will contain the true value of the statistic.

In order to test this, we need to be able to generate samples from a dataset where the true value is known. I will provide some functions, defined below, that will produce a sample of data where the true value is known, e.g.:
```python
def get_normal_sample(sample_size):
    # draw data from a normal distribution with a "true mean" = 0
    return numpy.random.normal(size=sample_size)

def get_uniform_sample(sample_size):
    # draw data from a uniform distribution between 0 and 1,
    # so the "true min" = 0, "true mean" = 0.5, and "true max" = 1.
    return numpy.random.uniform(size=sample_size)

def get_gg_sample(sample_size):
    # return a re-sample of the GG data. The "true" value of any given statistic
    # will just be the value of the statistic calculated for the original GG dataset.
    return numpy.random.choice(gg_bp, size=sample_size)
```

For simplicity, we will restrict ourselves to one-sample statistics.

Write a function that will count how often the bootstrap CI procedure generates a "correct" CI:

`good_ci_fraction(statistic, get_sample, true_value, sample_size, n_trials=300, n_ci_resamples=1000, ci_percent=95)`

Where:
 - `statistic` is a function that calculates a 1-sample statistic
 - `get_sample` is a function, such as defined above, that will generate a sample of data. It is **critical** that you generate a new sample for each of the `n_trials` times that a confidence interval is constructed.
 - `true_value` is the true value of that statistic
 - `sample_size` is the sample size parameter to pass to the `get_sample` function
 - `n_trials` is the number of times that you should calculate a bootstrap CI (using your `bootstrap_ci` function)
 - `n_ci_resamples` is the number of resamples that you should pass to `bootstrap_ci`
 - `ci_percent` is the confidence interval percentage, which you should also pass to `bootstrap_ci`
 
This function should return one value: the fraction of `n_trials` for which `true_value` is within the [l, h] range produced by `bootstrap_ci`.

Note: this can take a while, even when doing relatively few trials (and within each trial, drawing relatively few bootstrap resamples). If any given run of `good_ci_fraction` takes more than 10 seconds, though, something is probably wrong.
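
To make the idea concrete, here is a stripped-down sketch of such a coverage check, hard-wired to the mean of synthetic normal data. The name `coverage_fraction` and its internals are illustrative assumptions, not the required `good_ci_fraction` signature, and it computes the interval inline rather than calling `bootstrap_ci`.

```python
import numpy

rng = numpy.random.default_rng(4)

def coverage_fraction(true_value, sample_size, n_trials=200, n_resamples=500):
    # Hypothetical helper: checks 95% intervals for the mean of normal data.
    hits = 0
    for _ in range(n_trials):
        # A fresh sample is drawn for every trial.
        sample = rng.normal(loc=true_value, size=sample_size)
        s_hat = sample.mean()
        resamples = rng.choice(sample, size=(n_resamples, sample_size), replace=True)
        low, high = numpy.percentile(resamples.mean(axis=1), [2.5, 97.5])
        l, h = 2 * s_hat - high, 2 * s_hat - low
        if l <= true_value <= h:
            hits += 1
    return hits / n_trials

fraction = coverage_fraction(true_value=0.0, sample_size=50)
print(fraction)
```

Drawing all resamples for a trial as one 2-D array keeps each trial fast, which is one way to stay well inside the time budget mentioned above.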

In [None]:
def get_normal_sample(sample_size):
    # draw data from a normal distribution with a "true mean" = 0
    return numpy.random.normal(size=sample_size)

def get_uniform_sample(sample_size):
    # draw data from a uniform distribution between 0 and 1,
    # so the "true min" = 0 and "true max" = 1.
    return numpy.random.uniform(size=sample_size)

def get_gg_sample(sample_size):
    # return a re-sample of the GG data. The "true" value of any given statistic
    # will just be the value of the statistic calculated for the original GG dataset.
    return numpy.random.choice(gg_bp, size=sample_size)

# YOUR ANSWER HERE

good_norm_mean_frac = good_ci_fraction(numpy.mean, get_normal_sample, true_value=0, sample_size=50)
print('Normal Mean:', good_norm_mean_frac)

good_uni_mean_frac = good_ci_fraction(numpy.mean, get_uniform_sample, true_value=0.5, sample_size=50)
print('Uniform Mean:', good_uni_mean_frac)

good_gg_mean_frac = good_ci_fraction(numpy.mean, get_gg_sample, true_value=numpy.mean(gg_bp), sample_size=50)
print('GG Mean:', good_gg_mean_frac)

In [None]:
assert 0.90 < good_norm_mean_frac < 0.99
assert 0.90 < good_uni_mean_frac < 0.99
assert 0.90 < good_gg_mean_frac < 0.99
# make sure that 300 replicate trials were performed
assert abs(good_norm_mean_frac * 300 - round(good_norm_mean_frac * 300)) < 1e-9

You should see that the 95% confidence interval contains the true value around 95% of the time. (With only 300 trials, and 1000 bootstrap resamplings per trial, you wouldn't expect exactly 95% every time... but more trials or resamplings would make this code pretty slow to run, so we compromise.)
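
A rough way to gauge how much wobble to expect: if each trial is treated as an independent yes/no outcome with success probability 0.95, the standard error of the observed fraction is $\sqrt{p(1-p)/n}$. This sketch ignores the extra noise from using only 1000 resamples per trial, so treat it as a lower bound on the spread.

```python
import math

n_trials = 300
nominal = 0.95
# Standard error of a proportion estimated from n_trials independent trials.
standard_error = math.sqrt(nominal * (1 - nominal) / n_trials)
print(standard_error)
# Approximate range covering about two standard errors either side of 0.95.
print(nominal - 2 * standard_error, nominal + 2 * standard_error)
```

So observed fractions a couple of percentage points away from 0.95 are unremarkable with 300 trials.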

## Question 0.6
The above was with a sample size of 50. How small a sample will work? Write loop(s) to test the uniform and GG data with sample sizes of 25, 10, and 5. Store the results in two lists: `uniform_fracs` and `gg_fracs`.

In [None]:
sample_sizes = [25, 10, 5]
# YOUR ANSWER HERE

print('Uniform:', uniform_fracs)
print('GG:', gg_fracs)

In [None]:
assert uniform_fracs[0] > uniform_fracs[2]
assert gg_fracs[0] > gg_fracs[2]
assert 0.85 < uniform_fracs[0] < 0.95
assert 0.84 < gg_fracs[0] < 0.96

As you can see, with n less than 25 or so, it's hard to get a representative enough sample to reliably build a 95% confidence interval. That's pretty good, though!

## Question 0.7
What about the other statistic we discussed in class: the maximum? Let's test! In this case, we might expect the validity of the confidence interval to depend strongly on the sample size, since the larger the sample, the more likely it is to contain the true maximum (or numbers near it).

Using sample sizes of 20, 100, and 200, try to find the 95% CI around the maximum of a uniform distribution. (Use `numpy.max` as the statistic, and recall that the maximum of the uniform distribution we're using is 1.) Next, do the same for the GG blood pressure data instead of the uniform. Store the results in `uniform_fracs_sample_size` and `gg_fracs_sample_size`, respectively.

In [None]:
sample_sizes = [20, 100, 200]

# YOUR ANSWER HERE
print('Uniform max size:', uniform_fracs_sample_size)

# YOUR ANSWER HERE
print('GG max size:', gg_fracs_sample_size)

In [None]:
uf = numpy.array(uniform_fracs_sample_size)
assert numpy.all(uf > 0.84) and numpy.all(uf < 0.92)
assert gg_fracs_sample_size[-1] > gg_fracs_sample_size[0]
assert 0.55 < gg_fracs_sample_size[1] < 0.67



Now this is very interesting! Overall, the percentages are nowhere near 95%, even for relatively large samples (compared to what it takes to get a good CI around the mean). Moreover, these values are highly dependent on the shape of the true distribution (uniform vs. the more-or-less bell curve of the GG dataset).

**For the GG data, the story is relatively clear.** Recall, for the purposes of examining the CI procedure, we are pretending that the GG data is the "true distribution", and we are taking random samples from that "true distribution". We will use these random samples to calculate a confidence interval and compare it to the true value.

Anyhow, from the histogram of `gg_bp` above, note that there are very few data points very close to the true maximum -- but a little further below the maximum, there start to be many more data points.

Thus, a sample of the GG data that does not include, for example, the top two data points (which will happen around 13% of the time) is non-representative in two ways. First, the maximum is too low. This, alone, is fine: the CI procedure should handle that.

The real problem is the second way in which that sample is non-representative: the density of data points around the (too-small) maximum value will be too large. For example, there is only one other data point within 5 units of the maximum value of the GG dataset. But there are nine data points within 5 units of the third-from-the-maximum value! So if a sample does not include the top two points from our "true" population, then it will appear that data points are relatively dense around the sample's maximum. So when we apply the bootstrap CI procedure to the sample, we will see that most of the time the bootstrap resamplings (of our non-representative sample) will have a maximum value that is relatively close to the maximum of that sample. This will cause us to estimate a too-narrow confidence interval!

In other words, the bootstrap analogy has failed, because the distances between the true maximum and sample maximum (relatively large) are not representative of the distances between the sample maximum and the resample maxima (relatively small).

However, with a larger size, samples are more likely to include the true maximum or a point nearby, and thus will have a shape that is more representative of the truth.

**But how can we explain the case of the uniform distribution, where more samples just don't seem to help?** Well, while a large sample will invariably get values closer to the true maximum, it will also sample more densely, which will lead to a narrower confidence interval. (See your answer to question 0.4...) It turns out that for a uniform distribution, these two effects exactly cancel each other out, and so more data never helps. But why is the confidence interval over-optimistic in the first place?

It turns out the answer is similar to the GG case. Unless the sample has a minimum of zero and a maximum of one, the sample will be less spread-out than the true case (with true minimum of zero and true maximum of one), so bootstrapping will find more data within a given distance of the sample maximum than you would expect for the fully-spread-out distribution. This effect is small but non-zero, and it accounts for the inability of the "95% CI" to actually contain the true maximum of 1 in 95% of trials.
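
A related effect, not spelled out above but easy to check, is that resampled maxima pile up on the sample maximum itself. A resample of size $n$ contains the sample's largest point with probability $1 - (1 - 1/n)^n$, which is about 0.63 for large $n$, and it can never exceed it. The sketch below verifies this on a synthetic uniform sample.

```python
import numpy

rng = numpy.random.default_rng(5)
n = 200
sample = rng.uniform(size=n)        # synthetic sample; the true maximum is 1
sample_max = sample.max()

# Chance that a resample of size n contains the sample's largest point.
p_contains_max = 1 - (1 - 1 / n) ** n

resamples = rng.choice(sample, size=(5000, n), replace=True)
resample_maxima = resamples.max(axis=1)
observed = numpy.mean(resample_maxima == sample_max)

print(p_contains_max, observed)
print(1 - sample_max)   # gap between the true maximum and the sample maximum
```

So $S^* - \widehat S$ is exactly zero in most resamples, whereas $\widehat S - S$ is essentially never zero for a continuous distribution. That is one more sign that the difference-based analogy is strained for the maximum.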

Note that in this case, if we re-formulated the bootstrap analogy in terms of $\frac{\widehat S}{S}$ and $\frac{S^*}{ \widehat S}$ rather than $\widehat S - S$ and $S^* - \widehat S$, the CI procedure would work better here. You have to use the right analogy for your data... Fortunately, most of the time, using the basic analogy of differences between the statistics works best. But again, statistics isn't magic. It's important to run tests like these with simulated data to make sure things are behaving how you expect. 
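
For the curious, the ratio version of the interval can be derived the same way as the difference version. If $\widehat S / S$ behaves like $S^* / \widehat S$ and everything is positive, then $\textrm{low}/\widehat S \le \widehat S / S \le \textrm{high}/\widehat S$, which rearranges to $\widehat S^2/\textrm{high} \le S \le \widehat S^2/\textrm{low}$. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
def ratio_pivot_interval(s_hat, low, high):
    # Hypothetical helper; assumes s_hat, low and high are all positive.
    # Solves low/s_hat <= s_hat/S <= high/s_hat for S.
    return s_hat**2 / high, s_hat**2 / low

# Illustrative numbers: sample maximum 0.98, with 95% of resampled maxima
# falling in [0.95, 0.98].
l, h = ratio_pivot_interval(s_hat=0.98, low=0.95, high=0.98)
print(l, h)
# For comparison, the difference-based interval from the same inputs:
print(2 * 0.98 - 0.98, 2 * 0.98 - 0.95)
```

Whether this version actually improves coverage for a given dataset is something to check by simulation, in the same spirit as the tests above.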

In any case, it's clearly very complex to estimate the true maximum from the maximum of a sample -- even for a uniform distribution! This is known as [the German tank problem](https://en.wikipedia.org/wiki/German_tank_problem), and is a classic question in statistics with an interesting history (as its name might suggest).

## Question 0.8
Surely a confidence interval on the median will behave, though... right?

Test the ability to create a 95% CI around the median (using `numpy.median`) with a sample size of 200 on the normal (true median = 0), uniform (true median = 0.5) and GG datasets. Save these values as `good_norm_med_frac`, `good_uni_med_frac`, and `good_gg_med_frac`, respectively.

In [None]:
# YOUR ANSWER HERE
print('Normal Median:', good_norm_med_frac)

# YOUR ANSWER HERE
print('Uniform Median:', good_uni_med_frac)

# YOUR ANSWER HERE
print('GG Median:', good_gg_med_frac)

In [None]:
assert 0.86 < good_norm_med_frac < 0.96
assert 0.83 < good_uni_med_frac < 0.93
assert 0.77 < good_gg_med_frac < 0.87

Oh dear! Even the median seems to trip up the confidence interval procedure a bit.

As above, the fact that the confidence intervals are again a bit too small means that the bootstrap analogy isn't quite holding. In this case, the medians of the resamples are closer to the median of the sample than the median of the sample is to the true median.

The problem is that the distribution of resampled medians is a little bit "spiky" (as you saw above). That is, it is very concentrated at the median of the original sample, and a few other places. (Note that the highest histogram bin for the resampled means in Question 0.0 should have had something like 500 counts, but the highest bin for the medians in 0.1 should have been closer to 2000.)

This is because the median, though it is resistant to outliers, is very dependent on the input data: the median can only ever be at the exact value of a data point, or halfway between two data points. So there are a limited number of places a resampled median could fall, wholly determined by the data points in the original sample.
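
One quick way to see how limited those landing places are is to count distinct values. The sketch below, on a synthetic normal sample, compares the number of unique resampled medians with the number of unique resampled means.

```python
import numpy

rng = numpy.random.default_rng(6)
sample = rng.normal(size=200)   # synthetic sample
resamples = rng.choice(sample, size=(2000, len(sample)), replace=True)

resampled_medians = numpy.median(resamples, axis=1)
resampled_means = numpy.mean(resamples, axis=1)

n_unique_medians = len(numpy.unique(resampled_medians))
n_unique_means = len(numpy.unique(resampled_means))
print(n_unique_medians, n_unique_means)
```

Nearly every resampled mean is a distinct number, while the resampled medians keep landing on the same handful of values near the middle of the sample.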

In comparison, every new sample can (in principle) get completely unique data points. So the distribution of medians of real samples will be much smoother than that of bootstrap resamples. Again, this slightly breaks the bootstrap analogy, and we get confidence intervals that are a bit over-confident.

To see this effect, run the code below: first, it generates 10,000 medians from 10,000 samples from the normal distribution. Then, it gets a single sample from a normal distribution and generates 10,000 re-sampled medians. The histograms of each look very different! (Try changing `numpy.median` to `numpy.mean` below and see how that changes things.)

Note: don't worry if the distributions are shifted relative to one another -- it is completely to be expected that an outlier sample will produce resample statistics that are all shifted, and the bootstrap CI procedure is designed to account for this (as we saw in class, most samples aren't so crazy that the confidence interval created from them won't include the true value of the statistic). But note that the relative shape of the distributions is much more similar when using means vs. medians.

In [None]:
medians = [numpy.median(get_normal_sample(200)) for i in range(10000)]
plt.hist(medians, bins='auto', label='sample medians')

norm_sample = get_normal_sample(200)
resampled_medians = [numpy.median(numpy.random.choice(norm_sample, size=200)) for i in range(10000)]
plt.hist(resampled_medians, bins='auto', alpha=0.6, label='resample medians')
x = plt.legend() # store result of this to suppress printing it out

**Perhaps this all has you feeling a bit unhappy about the general utility of the bootstrap procedure!** Don't worry too much! Every branch of statistics makes certain assumptions, and sometimes the assumptions break down a bit. But if you understand where and how the assumptions break, you can avoid and correct for the problems.

Running sanity-check simulations like the above `good_ci_fraction`, and testing your statistical power (as we will do in another question), are key tools for diagnosing and understanding these issues, rather than just blindly trusting that a statistic is behaving well.

Also, you probably have a better feeling now for why the mean is so beloved by statisticians (compared to the median or mode or other statistics to describe the location of a set of data): the mean is generally very well behaved in statistical tests, despite its susceptibility to outliers.