# Measures of Central Tendency

## Objectives ##
- Calculate the mean and median for a given data set.

## Mean and Median ##
The "center" of a data set is also a way of describing location. The two most widely used measures of the "center" of the data are the **mean** (average) and the **median**. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. In the last section, we learned that the median is the second quartile $Q_2$ or the 50th percentile. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.

```{note}
The words *mean* and *average* are often used interchangeably. The substitution of one word for the other is common practice. The technical term for the type of mean we are discussing in this section is *arithmetic mean*, and *average* is technically a general term for any center location. For example, the *median* of a data set is also technically an average of the data set because it is one measure of the center location. However, in practice among non-statisticians, the *average* of a data set is commonly understood as the *arithmetic mean*.
```

We use different symbols to differentiate between the mean of a sample and the mean of a population. The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): $\bar{x}$. The Greek letter $\mu$ (pronounced "mew" and spelled in English "mu") represents the population mean. Both the sample mean and population mean are calculated the same way: add together all the data values in the sample or population, then divide by the number of data values in the sample or population. In practice, we usually calculate the sample mean $\bar{x}$. If we know enough about the population to calculate the population mean $\mu$, then we don't need to collect a smaller sample to estimate features of the population.

When values in the data set are repeated, the mean can also be calculated by multiplying each distinct value by its frequency, adding the products together, and then dividing the sum by the total number of data values. To see that both ways of calculating the mean give the same result, consider this sample of 11 data values:
<center>
    1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4
</center>
<br/>

$$ \bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} \approx 2.727$$
<br/>

$$ \bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} \approx 2.727$$
<br/>

In the second calculation, the frequencies are 3, 2, 1, and 5 since the data contains 3 ones, 2 twos, 1 three, and 5 fours.

In R, these calculations look like:

In [1]:
xbar = (1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4)/11
xbar

In [1]:
xbar = (3*1 + 2*2 + 1*3 + 5*4)/11
xbar

```{note}
In each case, we first calculate the mean and store the value in the variable <code>xbar</code>. To have the computer display the value that we stored in <code>xbar</code>, we simply type <code>xbar</code> by itself on the next line. If we don't type that extra <code>xbar</code>, the computer will store the value of the mean in the variable, but it won't tell us what the mean is.
```
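The frequencies in the second calculation do not have to be counted by hand. The following is a minimal sketch, assuming base R's <code>table</code> function is used to count how often each distinct value occurs; the variable names <code>counts</code>, <code>values</code>, and <code>freq</code> are illustrative choices, not names used elsewhere in this section.

```R
x = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)

# Count how many times each distinct value occurs
counts = table(x)

values = as.numeric(names(counts))   # the distinct values: 1, 2, 3, 4
freq = as.vector(counts)             # their frequencies: 3, 2, 1, 5

# Multiply each distinct value by its frequency, add the products,
# then divide by the total number of data values
xbar = sum(values*freq)/sum(freq)
xbar
```

Since <code>sum(freq)</code> is just the number of data values, this should give the same mean as adding up all 11 values and dividing by 11.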

Observe that all we do to calculate the mean is add all the data values together, then divide by the number of data values. This concept can be more succinctly expressed using the formula

$$ \bar{x} = \frac{\sum x}{n}. $$

We use $\sum$ (the capital Greek letter sigma) when we want to add up or find the sum of values. In this case, the formula is telling us to add up all the $x$'s, where we use $x$ as a placeholder for the data values in the sample. Then we divide the sum of $x$'s by $n$, where $n$ is the number of data values in the sample. Note, though we've used the sample mean $\bar{x}$ in the formula, the formula is essentially the same for the population mean $\mu$:

$$ \mu = \frac{\sum x}{N}, $$

where the $x$'s are the data values in the population, and $N$ is the number of data values in the population.

Calculating a mean is easy using R. We can use the <code>sum</code> function to add up the values in a list, which gives us $\sum x$. And we can use the <code>length</code> function to find out how many values are in a list, which gives us the sample size $n$ (or population size $N$ if working with an entire population). Both the <code>sum</code> function and the <code>length</code> function have just one argument:

```R
sum(x)
```

```R
length(x)
```

In both cases, <code>x</code> is a list of data values.

So for the above sample data, we can calculate the sample mean using R as follows:

In [1]:
x = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
n = length(x)

xbar = sum(x)/n
xbar

````{note}
We will usually store a list of data in variable <code>x</code> to make the computer code as similar as possible to the formula. For example, <code>xbar = sum(x)/n</code> has an obviously similar structure to $\bar{x} = \sum x/n$. However, any variable name would work. For example, the following code calculates the mean just as well as the code above:

```R
values = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
n = length(values)

xbar = sum(values)/n
xbar
```
````
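As a cross-check, base R also provides a built-in <code>mean</code> function. The sketch below compares it with the formula-based calculation used in this section; the names <code>xbar_formula</code> and <code>xbar_builtin</code> are illustrative only.

```R
x = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)

xbar_formula = sum(x)/length(x)   # the formula from this section
xbar_builtin = mean(x)            # base R's built-in mean function

xbar_formula
xbar_builtin
```

Both lines should display the same value. Writing out <code>sum(x)/n</code> keeps the code close to the formula, while <code>mean(x)</code> is a convenient shortcut once the formula is understood.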

We've already discussed the median: the median $M$ is the same as $Q_2$, the second quartile, or the 50th percentile. The median is the "middle value" of the ordered data: half the data values lie at or below the median, and half lie at or above it. We've seen that we can find the median in R using <code>quantile(x, probs=0.50)</code>, where <code>x</code> is the list of data values we want the median of. For example, to find the median of the data above, we would type:

In [2]:
x = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)

quantile(x, probs = 0.50)

The median is $M = 3$. Five of the 11 values in the data are less than the median, and five of the values are greater than the median.
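To see where this middle value comes from, the following sketch orders the data and picks out the middle position directly, then compares the result with base R's built-in <code>median</code> function. The names <code>sorted</code>, <code>M_by_hand</code>, and <code>M_builtin</code> are illustrative. The indexing step assumes the number of data values is odd, as it is here.

```R
x = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
n = length(x)

# Order the data from smallest to largest
sorted = sort(x)

# n = 11 is odd, so the middle value sits at position (n + 1)/2 = 6
M_by_hand = sorted[(n + 1)/2]

# Base R's built-in median function
M_builtin = median(x)

M_by_hand
M_builtin
```

When the number of data values is even, there is no single middle position; the usual convention, which <code>median</code> follows, is to average the two middle values.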

***


### Example 2.2.1 ###
The following data show the number of months patients wait on a transplant list before getting surgery. The data are ordered from smallest to largest. Calculate the mean and median.
<center>
    3, 4, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14, 15, 15, 17, 17, 18, 19, 19, 19, 21, 21, 22, 22, 23, 24, 24, 24, 24
</center>

#### Solution ####

In [6]:
x = c(3, 4, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14, 15, 15, 17, 17, 18, 19, 19, 19, 21, 21, 22, 22, 23, 24, 24, 24, 24)

# Find the Mean
n = length(x)

xbar = sum(x)/n
xbar

# Find the Median
M = quantile(x, probs = 0.50)
M

So the mean is about $\bar{x} = 13.9487$ months and the median is $M = 13$ months.

(We use $\bar{x}$ instead of $\mu$ because we are dealing with only a *sample* of patients on the transplant list, not the entire *population* of transplant patients.)

***

### Example 2.2.2 ###
All exam scores from a Calculus class are shown below. Find the mean score and the median score.

<center>
    42, 24, 35, 31.5, 32.5, 32, 48.5, 35.5, 38, 40, 20.5, 37, 34, 41.5, 43, 49, 48, 28, 35, 48.5, 22, 35.5, 44.5, 39, 21.5, 34.5, 40
</center>

#### Solution ####

In [17]:
x = c(42, 24, 35, 31.5, 32.5, 32, 48.5, 35.5, 38, 40, 20.5, 37, 34, 41.5, 43, 49, 48, 28, 35, 48.5, 22, 35.5, 44.5, 39, 21.5, 34.5, 40)

# Find the Mean
N = length(x)

mu = sum(x)/N
mu

# Find the Median
M = quantile(x, probs = 0.50)
M

So the mean score is about $\mu = 36.3148$ points and the median is $M = 35.5$ points.

(We use $\mu$ instead of $\bar{x}$ because we are dealing with the whole *population* of exam scores, not a *sample* of the class exam scores.)

***


### Example 2.2.3 ###
Suppose that we sample 50 people in one city. One person earns \$5,000,000 per year and the other 49 each earn \$30,000. Find the mean and the median of the data. Which is the better measure of the "center," the mean or the median?

#### Solution ####
The data in this example includes lots of large numbers. We could calculate the mean and median of the data the same way we have in the previous examples, but it would take a long time to list out all 50 numbers. Instead, we can take advantage of the fact that many of the values are repeated to perform the calculations more quickly.

Since there is one person who earns \$5,000,000 per year and 49 people who earn \$30,000 per year, we calculate the mean as follows:

In [1]:
xbar = (1*5000000 + 49*30000)/50
xbar

So the mean annual income is $\bar{x} = \$129,400$ per year.

To calculate the median, remember that the median is the middle value. Imagine lining up the data from smallest to largest. The \$5,000,000 value lies on the edge of our list of values. All the other values are \$30,000, so the median, the middle value, must be $M = \$30,000$. (See {numref}`Figure {number} <visual-median>`.)

```{figure} visual_median.png
---
width: 100%
alt: An illustration of where the median lies in the example data.
name: visual-median
---
If we imagine lining up the data values, we see that the median must be $M = \$30,000$.
```

The median is a better measure of the "center" than the mean in this case because 49 of the values are \$30,000 and one is \$5,000,000. The \$5,000,000 value is an outlier and significantly skews the mean. The median of $M = \$30,000$ gives us a better sense of the income of an ordinary person in the city than the mean of $\bar{x} = \$129,400$.
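These results can also be checked without typing out all 50 incomes. The sketch below assumes base R's <code>rep</code> function is used to repeat the \$30,000 value 49 times; the name <code>x_no_outlier</code> is an illustrative choice. It then repeats the calculation with the \$5,000,000 value removed to show how much that single value moves the mean.

```R
# Build the 50 incomes: 49 values of 30000 and one value of 5000000
x = c(rep(30000, 49), 5000000)
n = length(x)

# Find the Mean
xbar = sum(x)/n
xbar

# Find the Median
M = quantile(x, probs = 0.50)
M

# The same sample without the $5,000,000 outlier
x_no_outlier = rep(30000, 49)
sum(x_no_outlier)/length(x_no_outlier)
quantile(x_no_outlier, probs = 0.50)
```

With the outlier removed, the mean drops to \$30,000 while the median stays at \$30,000, which is why the median describes the typical income in this sample better.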

In [125]:
library(shape)

png("visual_median.png", width = 1000, height = 300)

# Draw a horizontal bar from x0 to x1 at height y, with vertical end caps of half-height d
rangebar = function(x0, x1, y, d, col){
    segments(x0, y-d, x0, y+d, lwd = 4, col = col)
    segments(x0, y, x1, y, lwd = 4, col = col)
    segments(x1, y-d, x1, y+d, lwd = 4, col = col)
}

# Draw a brace-like shape spanning x0 to x1, built from two cube-root curves
# that meet at the midpoint at height y and reach the endpoints at height y + d
rangebrace = function(x0, x1, y, d, col){
    Y = seq(y, y+d, length = 500)
    Xl = (x1 - x0)/4 * ( (2/d)^(1/3) * sign(Y - (y + d/2))*abs(Y - (y + d/2))^(1/3) + 1 ) + (x0 + x1)/2
    Xr = (x1 - x0)/4 * ( -(2/d)^(1/3) * sign(Y - (y + d/2))*abs(Y - (y + d/2))^(1/3) - 1 ) + (x0 + x1)/2
    lines(Xl, Y, lwd = 3, col = col)
    lines(Xr, Y, lwd = 3, col = col)
}

c1 = "blue3"
c2 = "forestgreen"
c3 = "red3"

a = 0
b = 9.5
c = 10
eps = 0.025

par(mar = c(0, 0, 0, 0))
plot(NULL, xlim = c(-0.1, 10.1), ylim = c(-2, 1), axes = FALSE, ann = FALSE)
rangebar(a, b-eps, 0, 0.15, col = c1)
rangebar(b+eps, c, 0, 0.15, col = c2)

text(x = (a + b-eps)/3, y = 0.3, labels = "$30,000", cex = 2, col = c1)
text(x = (a + b-eps)/3, y = -0.3, labels = "49 Values", cex = 2, col = c1)
text(x = (b+eps + c)/2, y = 0.3, labels = "$5,000,000", cex = 2, col = c2)
text(x = (b+eps + c)/2, y = -0.3, labels = "1 Value", cex = 2, col = c2)

rangebrace(a, (a + c)/2 - 4*eps, -1, 0.4, col = c3)
rangebrace((a + c)/2 + 4*eps, c, -1, 0.4, col = c3)
text(x = (3*a + c)/4, y = -1.2, labels = "Half of Data Values", cex = 2, col = c3)
text(x = (a + 3*c)/4, y = -1.2, labels = "Half of Data Values", cex = 2, col = c3)

Arrows((a + c)/2, -1.5, (a + c)/2, -0.17, lwd = 3, col = c3, arr.type = "triangle", arr.width = 0.3)
text(x = (a + c)/2, y = -1.5, labels = "Median", cex = 2, pos = 1, col = c3)

dev.off()