# The Mean

## Reading

[Chapter 14: 14.1 - 14.5](https://inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html)

In the previous notebook we saw how to estimate parameters of an entire population from statistics computed on a sample. We also saw that the estimates are likely to be accurate only under certain conditions, and that these conditions depend on properties of the sample such as its mean and its distribution.

---

## Properties of the Mean

- <u>Calculation of the Mean</u>

The _mean_ is also commonly called the average. It is found by summing all the data values in the dataset and dividing by the total number of data values.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

array = np.array([12, 3, 23, 4, 15])
print(sum(array)/len(array))
print(np.mean(array))

- <u>When Data is 1 and 0, or True and False</u>

When the data sequence contains only 1's and 0's, then:
> - The sum is the number of 1's.
> - The mean is the proportion of the 1's.


In [None]:
array = np.array([1,0,1,1,1,0])
print("Count of 1's:", sum(array))
print("Proportion of 1's:", sum(array)/len(array))
print("The mean:", np.mean(array))

Python treats True as 1 and False as 0 in arithmetic. Therefore, when the data sequence contains True and False values, we can calculate the count and the proportion of True values in the same way.

In [None]:
array = np.array([True, False, True, True, True, False])
print("Count of True's:", sum(array))
print("Proportion of True's:", sum(array)/len(array))
print("The mean:", np.mean(array))
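A common use of this fact is sketched below with a made-up array of ages (the values and the threshold of 18 are assumptions for illustration, not course data). A comparison such as `ages >= 18` produces an array of True and False values, so its sum is a count and its mean is a proportion.

```python
import numpy as np

# hypothetical data, not from the course datasets
ages = np.array([12, 25, 17, 40, 18, 9, 33, 21])

# the comparison produces an array of True/False values
is_adult = ages >= 18
print("Comparison result:", is_adult)

# True counts as 1 and False counts as 0
count_adults = np.sum(is_adult)
proportion_adults = np.mean(is_adult)
print("Count of adults:", count_adults)
print("Proportion of adults:", proportion_adults)
```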

- <u>The Mean and the Histogram</u>

The mean value depends on each unique data value and its proportion in the dataset.

In the following `array1`, the unique data values are 1, 2, 6, and each value's proportion is dependent on how many times it appears in the sequence:

In [None]:
array1 = np.array([1,2,2,6])
print("The mean:", 1*1/4 + 2*2/4 + 6*1/4)
print("The mean:", np.mean(array1))

In the following `array2`, the unique data values are also 1, 2, 6, and each value's proportion is the same as in `array1`:

In [None]:
array2 = np.array([1,1,2,2,2,2,6,6])
print("The mean:", 1*2/8 + 2*4/8 + 6*2/8)
print("The mean:", np.mean(array2))

This means _if two arrays have the same distribution_ (same proportion for each unique data value), _then they have the same mean_.
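The same calculation can be written for any array. The sketch below uses `np.unique` with `return_counts=True` to get each unique value and its count; the helper name `mean_from_proportions` is made up for this illustration.

```python
import numpy as np

def mean_from_proportions(values):
    """Compute the mean as the sum of (unique value * its proportion)."""
    unique_values, counts = np.unique(values, return_counts=True)
    proportions = counts / len(values)
    return np.sum(unique_values * proportions)

array1 = np.array([1, 2, 2, 6])
array2 = np.array([1, 1, 2, 2, 2, 2, 6, 6])

# both arrays have the same proportions, so both results match
mean1 = mean_from_proportions(array1)
mean2 = mean_from_proportions(array2)
print("Mean of array1 from proportions:", mean1)
print("Mean of array2 from proportions:", mean2)
```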

If we plot the distribution of `array1`, which is the same as the distribution of `array2`, we get:

In [None]:
plt.figure(figsize=(4,3))
plt.hist(array2, bins=np.arange(1,7), density=True, edgecolor='black', alpha=0.5)
plt.hist(array1, bins=np.arange(1,7), density=True, edgecolor='black', alpha=0.5)
plt.title("Distribution of array1 and array2")
plt.xlabel("Value")

# You don't need to write the following code.
# The code draws a triangle at the mean.
triangle_x = [2.75 - 0.1, 2.75 + 0.1, 2.75]
triangle_y = [-0.03,-0.03, 0]
plt.fill(triangle_x, triangle_y, color='black', zorder=3)
plt.ylim(bottom=-0.04)

plt.show()

It looks like only one distribution was plotted, but actually both `array1` and `array2` distributions were plotted. Since they're the same distribution, the two histograms overlap each other perfectly. You can turn one of the `plt.hist()` lines of code into a comment so it doesn't run, and you'll see that the plot above is made of a blue histogram and an orange histogram that overlap each other.

For the histogram above, imagine that the x-axis is a straight board or a plank, and the 3 bars rest on top of the plank. As shown, the plank is balanced on a fulcrum, which is the black triangle.
> - If the fulcrum is in the middle of the plank, which is at 3.5, the plank will tilt left because there's more "weight" on the left side.
> - If the fulcrum is at 1.5, then the plank will tilt right.
> - If the fulcrum is at 2.75 as shown, which is where the mean is, then the plank will be balanced.

_The mean is the balance point of the histogram_.
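The balance idea can also be checked numerically. The sketch below works with the data values of `array1` rather than the drawn bars, and the helper `net_tilt` is a made-up name: it adds up the signed distances of all values from a chosen fulcrum. A negative total means more weight on the left, a positive total means more weight on the right, and zero means balanced.

```python
import numpy as np

array1 = np.array([1, 2, 2, 6])

def net_tilt(values, fulcrum):
    # sum of signed distances from the fulcrum:
    # negative -> more weight on the left, positive -> more on the right
    return np.sum(values - fulcrum)

for fulcrum in [3.5, 1.5, np.mean(array1)]:
    print("Fulcrum at", fulcrum, "-> net tilt:", net_tilt(array1, fulcrum))
```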

- <u>The Mean and the Median</u>

Recall that the median is the midpoint of the sorted data values. _If a distribution is symmetrical, then the mean and the median are the same_.

Below we create a symmetrical distribution, where the left and right side are balanced.

In [None]:
array = np.array([1,2,2,3,3,3,4,4,5])
print("The mean:", np.mean(array))
print("The median:", np.median(array))

In [None]:
plt.figure(figsize=(4,3))
plt.hist(array, bins=np.arange(1,7), density=True, edgecolor='black', alpha=0.5)
plt.title("Distribution of a Symmetrical Array")
plt.grid()
plt.show()

_When the distribution is not symmetric_, or is _skewed_, so that one side extends farther than the other side (also called having a tail), _then the mean is pulled away from the median towards the tail_.

Below we create an array with a right hand tail.

In [None]:
array = np.array([1,2,2,3,3,3,4,4,4,4,5,5,6,6,7,8,12,15,16,20])
print("The mean:", np.mean(array))
print("The median:", np.median(array))

In [None]:
plt.figure(figsize=(4,3))
plt.hist(array, bins=np.arange(1,22), density=True, edgecolor='black')
plt.title("Distribution of Skewed Array")
plt.scatter(np.mean(array), 0, color='red', s=40, zorder=2)
plt.scatter(np.median(array), 0, color='yellow', s=40, zorder=2)
plt.grid()
plt.show()

The histogram is skewed right and has a right-hand tail, so the red dot (the mean) is to the right of the yellow dot (the median).

The mean is affected by the tail: the farther the tail stretches to the right, the farther the mean moves to the right (the larger it becomes). The median, however, is not affected by how extreme the values at the ends of the distribution are, such as the rightmost values of the tail.

This is why for skewed distributions that have a long tail, it is better to look at the median than the mean. In the distribution above, the majority of the data values are around 2-6, and the median of 4.5 describes the data better than the mean of 6.5.
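To see how differently the two statistics react to the tail, the sketch below stretches the tail of the skewed array by replacing its largest value, 20, with 200 (a made-up value chosen only for illustration).

```python
import numpy as np

array = np.array([1,2,2,3,3,3,4,4,4,4,5,5,6,6,7,8,12,15,16,20])

# hypothetical change: stretch the tail by turning the largest value into 200
stretched = array.copy()
stretched[-1] = 200

print("Original mean and median: ", np.mean(array), np.median(array))
print("Stretched mean and median:", np.mean(stretched), np.median(stretched))
```

The mean more than doubles, while the median stays where it was.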

---

## Deviation from the Mean

We've found that the mean is the balance point of the histogram, and that data spread out from the mean on both the left and right sides of the mean. In this section we will measure the spread of the data.

The spread of the data is measured by roughly how far, on average, the data points are from the mean. To find the spread, we take the following steps:

1. Find the _deviation_.

We find the distance between the mean and each data point. This is called the deviation from the mean.

In [None]:
# create an array of 4 numbers
array = np.array([12, 5, 9, 3])
# find the mean
the_mean = np.mean(array)
print("The mean:", the_mean)

In [None]:
# find the deviations
deviations = array - np.mean(array)
# recall that the code above subtracts the mean from
# each value in array, creating an array of differences

print("The deviations:", deviations)

Note that if we add up all the deviations we would get 0 as the result. This makes sense because the mean is the balance point for the histogram. The sum of distances on the left of the mean should be equal to the sum of distances on the right of the mean.

In [None]:
print(sum(deviations))

2. Find the _variance_.

We square the deviations and take their average.<br>
We need to square the deviations for 2 reasons:
> - We want to remove the negative signs, otherwise when we add the deviations as a step of finding the average, we would get 0.
> - Squaring will put more emphasis on the larger deviations so they count more. For example, $2^2$ is 4, and $8^2$ is 64. The 64 will influence the average more than the 4.

Then we average the squares to get the variance. The variance is the mean of the squared deviations.

In [None]:
variance = np.mean(deviations**2)
# the code above squares each deviation
# and finds the mean of the squares

print("The variance:", variance)

3. Find the _standard deviation_.

Since we squared the deviations to find the variance, the variance is in squared units. We now take the square root to "undo" the squaring and return to the original units.

In [None]:
sd = np.sqrt(variance)
print("The standard deviation:", round(sd, 2))

Lucky for us, numpy also has the `std()` function to calculate the standard deviation of an array, so we don't have to go through all 3 steps above when we want to find the standard deviation.

In [None]:
print("The standard deviation:", round(np.std(array), 2))
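One detail worth knowing is that `np.std()` by default divides by the number of values, exactly as in step 2 above. Its `ddof` parameter changes the divisor to `n - ddof`, and the pandas `Series.std()` method uses `ddof=1` by default, so the two libraries can give different answers for the same data. A minimal sketch of the difference, using the same 4 numbers:

```python
import numpy as np

array = np.array([12, 5, 9, 3])

# default ddof=0: divide the sum of squared deviations by n
sd_divide_by_n = np.std(array)

# ddof=1: divide by n - 1 instead
sd_divide_by_n_minus_1 = np.std(array, ddof=1)

print("SD dividing by n:    ", round(sd_divide_by_n, 2))
print("SD dividing by n - 1:", round(sd_divide_by_n_minus_1, 2))
```

For large datasets the two values are nearly identical; the difference only matters for small arrays like this one.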

When the standard deviation is large compared to the data range, we say that the distribution has a large spread. When the standard deviation is small compared to the data range, we say the spread is small.

In the example above the data values are from 3 to 12, which means the range is 9. Since the standard deviation is 3.49, we can say that the data has a large spread (3.49/9 is about 0.39 or 39%). As a rough rule of thumb in these notes, a small spread is around 10% of the range or below, and a large spread is around 40% or above.

---

## The Standard Deviation

<u>Example of Using the Standard Deviation</u>

We will use the dataset of basketball players and their heights to work with the <u>s</u>tandard <u>d</u>eviation, or SD.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/nba2013.csv"
nba = pd.read_csv(url)
print("First 5 rows:")
nba.head()

We now find the mean and the standard deviation (SD).

In [None]:
mean_height = np.mean(nba["Height"])
sd_height = np.std(nba["Height"])
print("Mean height:", round(mean_height,2))
print("Standard deviation:", round(sd_height,2))

Then we analyze the players' height and compare it against the SD.

Below we look at the tallest player's height and the SD.

In [None]:
tallest_height = nba["Height"].max()
tallest_row = nba[nba["Height"] == tallest_height]
print("Row of tallest player:")
display(tallest_row)
print("Number of standard deviations:")
print(round((tallest_height - mean_height)/sd_height,2))

The tallest player, Hasheem Thabeet, is more than 2 standard deviations from the mean height.

Can you write code to find the shortest player and how many standard deviations his height is from the mean?

In [None]:
shortest_height = ...  # replace ... with your code


In general, the majority of data in a dataset are within 2 - 3 standard deviations (SDs) from the mean.

The [textbook](https://inferentialthinking.com/chapters/14/2/Variability.html#chebychev-s-bounds) discusses how the mathematician Pafnuty Chebyshev proved that, for any dataset:
- at least 75% of the data is within 2 SDs
- at least 89% of the data is within 3 SDs
- at least 95% of the data is within 4.5 SDs
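All three figures come from the same formula given in the textbook: the proportion of data within z SDs of the mean is at least $1 - 1/z^2$. A small sketch (the function name `chebyshev_bound` is made up for this illustration):

```python
def chebyshev_bound(z):
    """Lower bound on the proportion of data within z SDs of the mean."""
    return 1 - 1 / z**2

for z in [2, 3, 4.5]:
    print("Within", z, "SDs: at least", round(chebyshev_bound(z) * 100, 1), "%")
```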


We can check Chebyshev's bound visually by plotting the distribution of the players' heights and marking the lines that are 2 SDs on either side of the mean.

In [None]:
plt.figure(figsize=(5,3))
plt.hist(nba["Height"], bins=np.arange(50, 95, 1), density=True, edgecolor='black')
plt.axvline(x = mean_height + 2*sd_height, color='red', linestyle='--')
plt.axvline(x = mean_height - 2*sd_height, color='red', linestyle='--')
plt.title("Distribution of Height")
plt.xlabel("Height (inches)")
plt.grid()
plt.show()

The dashed red lines are 2 SDs from the mean, and it does look like at least 75% of the data are within 2 SDs.
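Instead of judging by eye, the proportion can be computed directly. The sketch below is written for any array; because it should run without downloading a file, it is tried on made-up right-skewed data from a random number generator rather than on the NBA heights. The helper name `proportion_within` is an assumption for this illustration.

```python
import numpy as np

def proportion_within(values, z):
    """Proportion of values that lie within z SDs of the mean."""
    values = np.asarray(values)
    distance = np.abs(values - np.mean(values))
    return np.mean(distance <= z * np.std(values))

# hypothetical right-skewed data (not the NBA heights)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10, size=5000)

within_2 = proportion_within(skewed, 2)
within_3 = proportion_within(skewed, 3)
print("Within 2 SDs:", round(within_2, 3), "(bound: at least 0.75)")
print("Within 3 SDs:", round(within_3, 3), "(bound: at least 0.889)")
```

The same function could be called with `nba["Height"]` to check the plot above.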

---

### Standard Units

The number of SDs that a value lies above or below the mean is its value in _standard units_, often written _z_. We can write a general function to find z.

In [None]:
def standard_units(array):
    return (array - np.mean(array))/np.std(array)

# in the function above, the mean is subtracted from each data value,
# then the differences are divided by the SD to get the number of SDs,
# and the array of number of SDs (z) is returned
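A quick way to sanity-check a function like this is to confirm that the converted values have a mean of 0 and an SD of 1, which is always true of data in standard units. The sketch below repeats the function so that it runs on its own and tries it on the small array used earlier.

```python
import numpy as np

def standard_units(array):
    return (array - np.mean(array)) / np.std(array)

array = np.array([12, 5, 9, 3])
z = standard_units(array)

print("Standard units:", np.round(z, 2))
# apart from tiny floating-point rounding error, the mean is 0 and the SD is 1
print("Mean of z:", round(abs(np.mean(z)), 10))
print("SD of z:", round(np.std(z), 10))
```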

Using the dataset of flight delay times, we check the standard units for the delay times.

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/united_summer2015.csv"
delays = pd.read_csv(url)
print("First 5 rows")
delays.head()

We find the standard units of delay times, using the function we just wrote above.

In [None]:
delay_su = standard_units(delays["Delay"])
print("First 5 rows:")
delay_su.head()

Then we add the `delay_su` as a new column in the `delays` DataFrame.

In [None]:
delays["Delay_SU"] = delay_su
delays.head()

Next we sort the DataFrame in descending order by the `Delay_SU` column so we can see the longest delay times.

In [None]:
print("First 10 rows of longest delays:")
delays.sort_values("Delay_SU", ascending=False).head(10)

We see that there are quite a few flights whose delays are more than 10 standard units. Since Chebyshev's bound says that at least 89% of the data should be within 3 standard units of the mean, we want to check whether the bound holds for this dataset.

We find the proportion of flights whose delay is within 3 standard units of the mean, that is, where the absolute value of the standard units is 3 or less.

In [None]:
percent_flights = len(delays[np.abs(delays["Delay_SU"]) <= 3]) / len(delays)
print(round(percent_flights, 3))

Chebyshev's bound holds for the delays: about 98% of them are within 3 standard units, well above the guaranteed 89%. The remaining 2% or so are flights that have long delays.

Looking at the plot for the delays in standard units, we see that the majority of the delays are within 3 SDs, where the red dashed lines are.

In [None]:
plt.figure(figsize=(5,3))
plt.hist(delays.Delay_SU, bins=np.arange(-5, 15, 0.5), density=True, edgecolor='black')
plt.title("Distribution of Delay SU")
plt.xlabel("Delay SU")
plt.axvline(x=np.std(delays['Delay_SU'])*3, color='red', linestyle='--')
plt.axvline(x=np.std(delays['Delay_SU'])*-3, color='red', linestyle='--')
plt.grid()
plt.show()

---

## The Normal Distribution and Standard Deviation

The normal distribution is also known as the Gaussian distribution or the bell curve. It has a symmetric bell-shaped curve, with a peak at the mean value and tapering off equally on both left and right sides. The normal distribution has a special relationship with the standard deviation.

We first create a normal distribution and fill in the areas under the normal curve that are within 1 standard deviation and within 2 standard deviations of the mean.

In [None]:
# You don't need to write this code.
# The code is to demo the standard deviations of the normal distribution.

from scipy.stats import norm

mean = 0
sd = 1

# create 1000 x values centered around 0
x = np.linspace(mean - 3*sd, mean + 3*sd, 1000)
# create y values that form the normal distribution
y = norm.pdf(x, mean, sd)

# create 2 side-by-side plots
fig, axes = plt.subplots(1, 2, figsize=(12, 3))

# plot normal distribution with filled area within 1 SD
axes[0].plot(x, y)
x_fill_1 = np.linspace(mean - sd, mean + sd, 1000)
y_fill_1 = norm.pdf(x_fill_1, mean, sd)
axes[0].fill_between(x_fill_1, y_fill_1, alpha=0.4)
axes[0].set_title('Area Under Normal Curve Within 1 SD')
axes[0].grid(True)

# plot normal distribution with filled area within 2 SD
axes[1].plot(x, y)
x_fill_2 = np.linspace(mean - 2*sd, mean + 2*sd, 1000)
y_fill_2 = norm.pdf(x_fill_2, mean, sd)
axes[1].fill_between(x_fill_2, y_fill_2, alpha=0.4)
axes[1].set_title('Area Under Normal Curve within 2 SDs')
axes[1].grid(True)

plt.show()


We can see that the area under the curve that's within 2 SDs covers a substantial part of the entire area under the curve.

To calculate the area under the curve, we use the `cdf` method of the `norm` object from `scipy.stats`. The `cdf`, or _Cumulative Distribution Function_, returns the proportion
$\frac{\text{area under the curve up to an x value}}{\text{total area under the curve}}$

To call the `cdf` function we use the format:<br>
`area_up_to_x = norm.cdf(x)`

Using this function, we find the proportion of the area under the curve within 1 SD, 2 SDs, and 3 SDs of the mean, compared to the total area under the curve.

In [None]:
# Area within 1 SD is between -1 and 1
print("Proportion of area within 1 SD:", round(norm.cdf(1) - norm.cdf(-1), 3))

# Area within 2 SD is between -2 and 2
print("Proportion of area within 2 SDs:", round(norm.cdf(2) - norm.cdf(-2), 3))

# Area within 3 SD is between -3 and 3
print("Proportion of area within 3 SDs:", round(norm.cdf(3) - norm.cdf(-3), 3))

We note that for a normal distribution:
- Within 2 SDs, the proportion of area under the curve is about 95% of the total area, and within 3 SDs, the area under the curve is almost the same as the total area under the curve (about 99.7%).
- The proportions of area under the curve within 2 SDs and 3 SDs, 95% and 99.7%, are quite a bit higher than Chebyshev's lower bounds of 75% and 89%. This is because Chebyshev's bounds must hold for every distribution, while the areas calculated above apply only to the normal distribution.
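The comparison in the list above can be tabulated with a short sketch. To keep it free of extra libraries, it does not call `norm.cdf`; it uses the standard identity that the proportion of a normal distribution within z SDs of the mean equals `math.erf(z / sqrt(2))`. The function names are made up for this illustration.

```python
import math

def normal_within(z):
    """Proportion of a normal distribution within z SDs of the mean."""
    return math.erf(z / math.sqrt(2))

def chebyshev_bound(z):
    """Chebyshev's lower bound for any distribution."""
    return 1 - 1 / z**2

for z in [2, 3]:
    print("z =", z,
          "| normal:", round(normal_within(z), 3),
          "| Chebyshev bound:", round(chebyshev_bound(z), 3))
```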

---

## The Central Limit Theorem

The _central limit theorem_ says that if we draw large random samples with replacement from a population and find the mean of each sample, then the distribution of the sample means is approximately normal, regardless of the original population's distribution.

Just like with Chebyshev's bound, we won't go into the mathematical proof of the central limit theorem. Instead, we will check that it does work with the dataset of flight delays.

First we take a look again at the flight delay dataset, which we read in earlier, and the distribution of delay times.



In [None]:
delays.head()

In [None]:
plt.figure(figsize=(4,3))
plt.hist(delays["Delay"], bins=np.arange(-20, 300, 10), density=True, edgecolor='black')
plt.title("Distribution of Flight Delay Times")
plt.xlabel("Delay Time (min)")
plt.grid()
plt.show()

The distribution of delay times does not follow a normal curve; it is right-skewed with a long right tail.

We also calculate the mean and the SD.

In [None]:
print("The mean:", round(np.mean(delays["Delay"]),2))
print("The SD:", round(np.std(delays["Delay"]),2))

Note that the mean is only about 17 minutes, but the standard deviation is close to 39.5 minutes. The SD is large because of the small number of very long delays at the far end of the right tail.

Next we take large samples from the delay times and plot the means of the samples.

We write a function that selects a sample of a given size with replacement and returns the mean of the sample. We will first use it with samples of 400 values.

In [None]:
def get_sample_mean(num_of_data):
    return np.mean(delays["Delay"].sample(num_of_data, replace=True ))

Then we run the simulation by calling `get_sample_mean` in a loop 10,000 times, and we plot the means of the samples.

In [None]:
L = []
for i in range(10000):
    L.append(get_sample_mean(400))
mean_array = np.array(L)
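The loop above is easy to read, but the same simulation can also be done without a Python loop. The sketch below is one possible alternative, not the approach used in these notes: it draws all 10,000 samples in a single call as the rows of a 2-D array and averages each row. So that it runs on its own, it uses a made-up right-skewed population in place of `delays["Delay"]`.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical stand-in for the delay times: a right-skewed population
population = rng.exponential(scale=20, size=10000)

# draw 10,000 samples of size 400 with replacement; each row is one sample
samples = rng.choice(population, size=(10000, 400), replace=True)

# average across each row to get one mean per sample
sample_means = samples.mean(axis=1)

print("Number of sample means:", len(sample_means))
print("Population mean:", round(population.mean(), 2))
print("Average of the sample means:", round(sample_means.mean(), 2))
```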

In [None]:
plt.figure(figsize=(4,3))
plt.hist(mean_array, bins=np.arange(10, 25, 0.5), edgecolor='black')
plt.xlabel('Average Delay')
plt.title('Distribution of Average Delays')
plt.grid()
plt.show()

We see that the distribution of average delay times is approximately bell-shaped, like the normal distribution, as predicted by the central limit theorem. We also see that the distribution is centered around the population mean, which we calculated above to be 16.66.

We now repeat the same simulation as above, but our sample size is 1,000 instead of 400.

In [None]:
L = []
for i in range(10000):
    L.append(get_sample_mean(1000))
mean_array = np.array(L)

plt.figure(figsize=(4,3))
plt.hist(mean_array, bins=np.arange(10, 25, 0.5), edgecolor='black')
plt.xlabel('Average Delay')
plt.title('Distribution of Average Delays')
plt.grid()
plt.show()

We notice that once again the distribution of average delay times is approximately a normal distribution that's centered around the mean of 16.66.

But we also notice that for samples of 1000 values, the distribution is taller and narrower than the distribution for samples of 400 values.

Since one of the 2 distributions above is narrower than the other, they have different standard deviations. The [textbook](https://inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html#the-sd-of-all-the-sample-means) shows that in general, the standard deviation of the means is:

$$
\text{SD of all means} = \frac{\text{population SD}}{\sqrt{\text{sample size}}}
$$

In [None]:
print("SD of sample means, sample size 400:", round(np.std(delays["Delay"]) / np.sqrt(400), 2))
print("SD of sample means, sample size 1000:", round(np.std(delays["Delay"]) / np.sqrt(1000), 2))

The SD of the sample means for the larger sample size (1000) is smaller than for the smaller sample size (400). A smaller SD means the distribution of sample means is narrower, so a sample mean is more likely to be close to the population mean.

In fact, as the equation above shows, the SD of the means decreases (or the accuracy increases) in inverse proportion to $\sqrt{\text{sample size}}$. For the SD to decrease by a factor of 10, the sample size must increase by a factor of 100.
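The formula can be checked by simulation. The sketch below uses a made-up right-skewed population (an assumption, standing in for the delay times) and compares the SD of 2,000 simulated sample means with the value the formula predicts, for sample sizes that grow by a factor of 4 each time.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical right-skewed population, not the flight delay data
population = rng.exponential(scale=20, size=10000)
population_sd = np.std(population)

results = {}
for n in [100, 400, 1600]:
    # 2,000 samples of size n, one per row, then the mean of each row
    means = rng.choice(population, size=(2000, n), replace=True).mean(axis=1)
    simulated = np.std(means)
    predicted = population_sd / np.sqrt(n)
    results[n] = (simulated, predicted)
    print("Sample size", n,
          "| simulated SD:", round(simulated, 2),
          "| predicted SD:", round(predicted, 2))
```

Each time the sample size is multiplied by 4, the predicted SD is cut in half, and the simulated SD should follow closely.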

---

In this notebook we learned the properties of the mean and the standard deviation of a dataset. We observed that for large samples from a population, the distribution of the sample means is approximately a normal distribution. We also learned that the accuracy of the sample mean increases in proportion to the square root of the sample size.