# 3.6 Analysis of Relative Location
The most widely used measures of central location and dispersion are the mean and standard deviation, respectively. Unlike the mean, the standard deviation can be hard to interpret intuitively. All we know so far is that a low standard deviation indicates that observations are close to the mean, while a high standard deviation indicates that they are spread out.

In this section we'll use Chebyshev's Theorem and the empirical rule to make more precise statements regarding the proportion of observations that fall within a specified number of standard deviations from the mean.

We'll also use the mean and standard deviation to compute z-scores that measure the relative location of a particular observation; z-scores can also be used to detect outliers.

## Chebyshev's Theorem
We'll see how important the standard deviation is in later chapters. The Russian mathematician Pafnuty Chebyshev (1821 - 1894) found bounds for the proportion of the observations that lie within a specified number of standard deviations from the mean.

**Chebyshev's Theorem**: For any variable, the proportion of observations that lie within $k$ standard deviations from the mean is at least $1 - 1/k^2$, where $k \geq 1$.
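
To get a feel for how the bound grows with $k$, here is a minimal sketch that evaluates $1 - 1/k^2$ for a few values. The function name `chebyshev_bound` is hypothetical and not part of the original notes; the bound is clipped at 0 because it is uninformative for $k \leq 1$.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula gives a value <= 0, which says nothing useful.
    return max(0.0, 1 - 1 / k**2)

for k in (1, 1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_bound(k):.1%} of observations")
```

Note that at $k = 1$ the bound is 0, so the theorem only becomes informative for $k > 1$.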

In [1]:
""" Example 3.20
A lecture has n=280 students. The professor
says that the mean score on an exam is 74
with a standard deviation of 8. At least
how many students scored within 58 and 90?
"""
mean, stddev = 74, 8
low, high = 58, 90

# How many stddevs away are the bounds?
k1 = (mean - low) / stddev
k2 = (high - mean) / stddev
k1, k2

(2.0, 2.0)

Since $\bar{x} - ks = 58$ and $\bar{x} + ks = 90$, we have $k = \frac{\bar{x} - 58}{s} = \frac{90 - \bar{x}}{s} = 2$. By Chebyshev's theorem, at least $1 - 1/2^2 = 3/4$ of the students scored between 58 and 90, that is, at least $0.75 \cdot 280 = 210$ students.

The theorem holds for both samples and populations. The advantage of Chebyshev's theorem is that it applies to all variables, but the price is that it yields conservative bounds: the actual percentage of observations in the interval may be considerably larger.
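
To illustrate how conservative the bound can be, the sketch below compares it with the observed proportion in a simulated, strongly right-skewed sample. The data are hypothetical (drawn from an exponential distribution with a fixed seed) and are not part of the original examples.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible
# Hypothetical right-skewed sample, far from bell-shaped.
sample = [random.expovariate(1.0) for _ in range(10_000)]

x_bar, s = mean(sample), stdev(sample)
k = 2

# Proportion of observations within k standard deviations of the mean.
within = sum(1 for x in sample if abs(x - x_bar) <= k * s)
actual = within / len(sample)
bound = 1 - 1 / k**2

print(f"Chebyshev lower bound: {bound:.2%}")
print(f"Observed proportion:   {actual:.2%}")
```

Even for a skewed variable like this one, the observed proportion sits well above the guaranteed 75%.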

## The Empirical Rule
If we know that our observations are drawn from a normally distributed variable (a relatively symmetric, bell-shaped distribution), then we can make more precise statements about the percentage of observations that fall within certain intervals. Being bell-shaped and symmetric are characteristics of the normal distribution, which we'll discuss in a later chapter. The empirical rule states that, given a sample mean $\bar{x}$, a sample standard deviation $s$, and a relatively normal distribution:

1) Approximately $68\%$ of all observations fall in the interval $\bar{x} \pm s$,
2) Approximately $95\%$ of all observations fall in the interval $\bar{x} \pm 2s$, and
3) Almost all (approximately $99.7\%$) observations fall in the interval $\bar{x} \pm 3s$.
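
These percentages are rounded areas under the normal curve. As a quick check, the sketch below computes them with `statistics.NormalDist` from Python's standard library (available from Python 3.8); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal variable within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The three values round to 0.6827, 0.9545, and 0.9973, which is where the 68%, 95%, and "almost all" figures come from.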

In [2]:
""" Example 3.21
Revisit 3.20. Assume
that the distribution
follows a symmetric, bell curve.
"""
# a. How many students scored within 58 and 90?
n, p = 280, 0.95
print(round(n*p), 'students scored between 58 and 90')

# b. How many students scored more than 90?
p2 = (1 - p)/2
print(round(n*p2), 'students scored more than 90')

266 students scored between 58 and 90
7 students scored more than 90


The main difference between Chebyshev's theorem and the empirical rule is the assumption of normality. Chebyshev's theorem assumes nothing about the distribution, and because of this it gave us a much more conservative estimate for the number of students scoring within two standard deviations of the mean. Indeed, our lower bound was 210, whereas under the normality assumption approximately 266 students scored between 58 and 90. The bound is valid, but quite conservative.

We prefer using the empirical rule when our data follows a normal distribution!

## Z-scores
We can use the mean and standard deviation to find the relative location of an observation within a distribution. The **z-score** gives the relative position of a sample value within the data set: it divides the deviation of the sample value from the mean by the standard deviation. A z-score is computed as
$$
z = \frac{x - \bar{x}}{s}
$$
where $x$ is an observation of a variable, and $\bar{x}$ and $s$ are the variable's sample mean and sample standard deviation, respectively.

Note that a z-score is a unitless measure! The units cancel out with the division, and thus it measures the distance of a given observation from the mean in terms of standard deviations. For example, a z-score of 1 implies that the given observation is one standard deviation above the mean. Similarly, a z-score of -2 implies that the given observation is two standard deviations below the mean. Converting observations into z-scores is also called **standardizing** the observations.
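
As a small illustration of standardizing, the sketch below converts a handful of hypothetical exam scores (made up for this example) into z-scores using only the standard library. After standardizing, the values have mean 0 and sample standard deviation 1.

```python
from statistics import mean, stdev

scores = [58, 66, 74, 82, 90]  # hypothetical exam scores

x_bar, s = mean(scores), stdev(scores)  # stdev uses the n - 1 denominator
z_scores = [(x - x_bar) / s for x in scores]

for x, z in zip(scores, z_scores):
    print(f"score {x}: z = {z:+.3f}")

# Standardized values are centered at 0 with unit standard deviation.
print("mean of z-scores:", round(mean(z_scores), 10))
print("stdev of z-scores:", round(stdev(z_scores), 10))
```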

In [3]:
""" Example 3.22
The mean and stddev for an accounting exam
are 74 and 8, respectively. The mean and stddev
of scores on a marketing exam are 78 and 10. Find
the z-scores for a student who scores 90 on both.
"""
def get_zscore(x, mean, stdd):
    return (x - mean) / stdd

print('Accounting:', get_zscore(90, 74, 8))
print('Marketing:', get_zscore(90, 78, 10))

Accounting: 2.0
Marketing: 1.2


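A score of 90 is 2 standard deviations above the accounting mean but only 1.2 standard deviations above the marketing mean, so the student performed better in accounting than in marketing, relative to the class.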

In Section 3.2 we used boxplots to visualize outliers. For a normally distributed variable we can also use z-scores to identify outliers. It's common to treat an observation as an outlier if its z-score is more than 3 or less than -3.
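
The rule can be sketched as a small pure-Python helper. The function name `find_outliers`, the `threshold` parameter, and the data are all hypothetical; this is one possible implementation under the $|z| > 3$ convention, not the only way to flag outliers.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=3):
    """Return the observations whose z-score is above threshold or below -threshold."""
    x_bar, s = mean(values), stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# Hypothetical data: twelve values near 10 and one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 10, 50]
print("Outliers:", find_outliers(data))
```

Here the value 50 lies a little over 3.3 standard deviations above the mean, so it is the only observation flagged.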

In [4]:
""" Example 3.23
Are there outliers in growth and value?
"""
import pandas as pd
import math

gv = pd.read_csv('Growth_Value.csv', index_col=0)
gv.head(5)

Year,Growth,Value
1984,-5.5,-8.59
1985,39.91,22.1
1986,13.03,14.74
1987,-1.7,-8.58
1988,16.05,29.05


In [5]:
# Means.
g_mean, v_mean = gv.Growth.mean(), gv.Value.mean()

# Standard deviations.
g_std = math.sqrt(sum((gv.Growth - g_mean)**2) / (len(gv) - 1))
v_std = math.sqrt(sum((gv.Value - v_mean)**2) / (len(gv) - 1))

# Assign z-scores.
gv['g_zscores'] = (gv.Growth - g_mean) / g_std
gv['v_zscores'] = (gv.Value - v_mean) / v_std

# Outlier masks: flag observations with a z-score above 3 or below -3.
g_outlier = gv['g_zscores'].abs() > 3
v_outlier = gv['v_zscores'].abs() > 3

print('Outliers in growth:', gv[g_outlier]['Growth'])
print('Outliers in value:', gv[v_outlier]['Value'])

Outliers in growth: Series([], Name: Growth, dtype: float64)
Outliers in value: Year
2008   -46.52
Name: Value, dtype: float64


Notice that there are no outliers in the Growth mutual fund, while there is one outlier, $-46.52$ (in 2008), in the Value fund. This is consistent with the boxplots from Section 3.2.