# The Central Limit Theorem for Proportions

## Objectives

- Apply the central limit theorem for proportions to identify sampling distributions of sample proportions.
- Compute probabilities involving sampling distributions of proportions.

## Population Proportions

A **population proportion** is the fraction, ratio, or percentage of the population that possesses a certain characteristic. For example, in 2006, the **proportion** of adult Americans that were married was $55.7\%$. Note a few important properties about proportions:
 
- For each member of the population, there are only two options: either the member possesses the characteristic or the member doesn't possess the characteristic. In the above example, since $55.7\%$ of U.S. adults were married in 2006, we can conclude that $100\% - 55.7\% = 44.3\%$ of U.S. adults were not married in 2006.
- The value of a proportion is always between $0$ and $1$. In the above example, the proportion $55.7\% = 0.557$ lies between $0$ and $1$.
- A proportion is the probability that a randomly selected member of the population will possess the characteristic. In the example above, if we were to randomly select a U.S. adult in the year 2006, there is a $55.7\%$ chance that the selected adult would be married.

A population proportion $p$ is calculated using the formula

$$ p = \frac{X}{N}, $$

where $X$ is the number of individuals in the population with the desired characteristic and $N$ is the size of the population.

For example, if we know that Menifee High School has a population of $4{,}582$ students, of whom $1{,}231$ are freshmen, the proportion of Menifee High School students that are freshmen is

$$p = \frac{X}{N} = \frac{1{,}231}{4{,}582} = 0.2687.$$

So $26.87\%$ of all Menifee High School students are freshmen.
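The same arithmetic can be checked in R. This is only a minimal sketch; the variable names `X`, `N`, and `p` are chosen here to mirror the formula and carry no special meaning in R.

```r
# Counts from the Menifee High School example
X <- 1231   # number of students with the characteristic (freshmen)
N <- 4582   # size of the population

p <- X / N  # population proportion
round(p, 4)
```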

## The Central Limit Theorem for Proportions

We can also find the proportion of a sample, denoted by $\hat{p}$ (which is read "*p*-hat"). The formula for a sample proportion is

$$ \hat{p} = \frac{x}{n}, $$

where $x$ is the number of individuals in the *sample* with the desired characteristic, and $n$ is the size of the *sample*.

Recall that the central limit theorem for sample means says that, if the sample size is large enough, we can generally expect the mean of a sample to approximate the mean of the population. We might naturally ask the same question of proportions: if we take a large enough sample from a population, can we expect the sample proportion to be close to the population proportion?

The answer is "Yes". This is the idea of the **central limit theorem** for sample proportions.

```{prf:theorem} The Central Limit Theorem for Sample Proportions

Consider a population, and let $p$ be the proportion of the population that possesses some characteristic. Let $\hat{P}$ be the random variable of the sampling distribution of the sample proportions for this population, where the samples are of size $n$. If the sample size $n$ is big enough (so that $np \geq 5$ and $n(1-p) \geq 5$), then the sampling distribution is roughly a normal distribution with mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$. That is,

$$ \hat{P} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}} \right). $$
```

We reiterate the observations about the central limit theorem that we made in the previous section:
- The central limit theorem does not make any requirements on how the population is distributed. This is why the central limit theorem is so powerful. The population can have any distribution. We do not even need to know *how* the population is distributed. No matter how the population is distributed, the sampling distribution of sample proportions (with $np \geq 5$ and $n(1-p) \geq 5$) will approximate a normal distribution.
- The theorem states that the distribution of sample proportions $\hat{P}$ only *approximates* a normal distribution. But in practice, as long as the sample size $n$ is large enough so that $np \geq 5$ and $n(1-p) \geq 5$, the distribution of sample proportions approximates the normal distribution so closely that we can treat the distribution as if it is normally distributed.
- We call the standard deviation of the sampling distribution, $\sqrt{\frac{p(1-p)}{n}}$, the **standard error** to distinguish it from the standard deviation of a population.
- Note that the larger the sample size $n$ is, the smaller the standard error $\sqrt{\frac{p(1-p)}{n}}$ becomes. That is, sample proportions from larger samples will tend to better approximate the population proportion than sample proportions from smaller samples.
- The central limit theorem only applies for samples that are randomly selected. If a sample is not selected randomly, the central limit theorem may not apply.
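One way to see these observations at work is a small simulation. The sketch below assumes an illustrative population proportion of $0.30$, a sample size of $50$, and $10{,}000$ repeated samples; none of these values come from the text, and the seed is arbitrary. It draws many random samples, computes $\hat{p}$ for each, and compares the mean and standard deviation of the simulated sample proportions with $p$ and the standard error.

```r
set.seed(1)            # arbitrary seed so the run is reproducible

p    <- 0.30           # assumed population proportion (illustrative only)
n    <- 50             # sample size; np = 15 and n(1 - p) = 35, both at least 5
reps <- 10000          # number of simulated random samples

# Each draw counts how many members of one random sample have the characteristic
x     <- rbinom(reps, size = n, prob = p)
p_hat <- x / n         # one sample proportion per simulated sample

mean(p_hat)            # should land close to p
sd(p_hat)              # should land close to the standard error
sqrt(p * (1 - p) / n)  # theoretical standard error

# The standard error shrinks as the sample size grows
sqrt(p * (1 - p) / c(20, 100, 1000))
```

The simulated mean and standard deviation will not match the theoretical values exactly, since they depend on the random draws.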

As in the previous section, we need to adjust our formulas

$$ z = \frac{x - \mu}{\sigma}, \hspace{0.5in} x = \mu + z\sigma,$$

which allow us to calculate a $z$-score from an $x$-value or an $x$-value from a $z$-score, to use the appropriate symbols for the sampling distribution of sample proportions. We use $\hat{p}$ instead of $x$ (since we are dealing with the distribution of sample proportions, not particular values of the population); we use the mean of the sampling distribution, which is the population proportion $p$, instead of $\mu$; and we use the standard error $\sqrt{\frac{p(1-p)}{n}}$ instead of the population standard deviation $\sigma$ (since the standard error is the standard deviation of the sampling distribution). Making these substitutions, the formulas above become

$$ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}, \hspace{0.5in} \hat{p} = p + z\sqrt{\frac{p(1-p)}{n}}. $$
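These two formulas translate directly into small R functions. The helper names `std_error`, `z_from_phat`, and `phat_from_z` are hypothetical and are not part of base R, and the inputs ($\hat{p} = 0.45$, $p = 0.40$, $n = 100$) are made up for illustration.

```r
# Hypothetical helpers; these names are not built into R
std_error <- function(p, n) sqrt(p * (1 - p) / n)

# z-score of a sample proportion p_hat
z_from_phat <- function(p_hat, p, n) (p_hat - p) / std_error(p, n)

# Sample proportion that corresponds to a given z-score
phat_from_z <- function(z, p, n) p + z * std_error(p, n)

# Round trip with illustrative values
z <- z_from_phat(0.45, p = 0.40, n = 100)
z
phat_from_z(z, p = 0.40, n = 100)   # recovers the original 0.45
```

Writing the standard error once and reusing it keeps the two conversions consistent with each other.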

***


### Example 4.4.1
A poll is conducted of $1{,}000$ Americans to determine whether or not they approve of the President of the United States. Suppose that, in reality, $54\%$ of all Americans approve of the President.

1. What is the probability that fewer than $50\%$ of those polled approve of the President?
2. What is the probability that more than $56\%$ of those polled approve of the President?
3. What is the probability that between $50\%$ and $56\%$ of those polled approve of the President?

#### Solution
First, note that the sample size is $n = 1{,}000$ and the population proportion is $p = 0.54$. Since $np = 540 \geq 5$ and $n(1-p) = 460 \geq 5$, the central limit theorem tells us that the sample proportions are approximately normally distributed with mean

$$ p = 0.54 $$

and standard error

$$ \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.54(1 - 0.54)}{1000}} = 0.0158. $$

So $\hat{P} \sim N(0.54, 0.0158)$.
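As a quick check, the sample-size conditions and the standard error for this poll can be computed in R. This is a sketch; the variable names `n`, `p`, and `se` are chosen here for convenience.

```r
n <- 1000   # sample size
p <- 0.54   # population proportion

# Sample-size conditions from the theorem: both values must be at least 5
c(n * p, n * (1 - p))

# Standard error of the sampling distribution
se <- sqrt(p * (1 - p) / n)
round(se, 4)
```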

##### Part 1
We want to find $P(\hat{p} < 0.50)$. First find the $z$-score of $\hat{p} = 0.50$:

$$ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.50 - 0.54}{0.0158} = -2.5316. $$

So $P(\hat{p} < 0.50) = P(z < -2.5316)$. Let's use R to calculate the probability.

```r
# Area to the left of z = -2.5316 under the standard normal curve
pnorm(q = -2.5316)
```

Then $P(\hat{p} < 0.50) = P(z < -2.5316) = 0.0057$. There is a $0.57\%$ chance that a sample of $1{,}000$ Americans will yield a sample proportion less than $50\%$.

##### Part 2
We want to find $P(\hat{p} > 0.56)$. First find the $z$-score of $\hat{p} = 0.56$:

$$ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.56 - 0.54}{0.0158} = 1.2658. $$

So $P(\hat{p} > 0.56) = P(z > 1.2658)$. Let's use R to calculate the probability.

```r
# Area to the right of z = 1.2658: one minus the area to the left
1 - pnorm(q = 1.2658)
```

Then $P(\hat{p} > 0.56) = P(z > 1.2658) = 0.1028$. There is a $10.28\%$ chance that a sample of $1{,}000$ Americans will yield a sample proportion more than $56\%$.
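As an aside, `pnorm` also accepts a `lower.tail` argument, so the right-tail area can be requested directly instead of subtracting from $1$. The sketch below compares the two routes; the names `upper_a` and `upper_b` are just labels chosen here.

```r
z <- 1.2658

upper_a <- 1 - pnorm(q = z)                   # complement of the left-tail area
upper_b <- pnorm(q = z, lower.tail = FALSE)   # right-tail area requested directly

c(upper_a, upper_b)   # the two values should agree
```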

##### Part 3
We want $P(0.50 < \hat{p} < 0.56)$. We know from parts 1 and 2 that the $z$-score of $\hat{p} = 0.50$ is $z = -2.5316$ and the $z$-score of $\hat{p} = 0.56$ is $z = 1.2658$. So $P(0.50 < \hat{p} < 0.56) = P(-2.5316 < z < 1.2658)$.

The probability is the entire area under the normal density function between $z = -2.5316$ and $z = 1.2658$. To find this, we will use R to first find all the area to the left of the larger $z$-score, $z = 1.2658$, then subtract off the excess area to the left of the smaller $z$-score, $z = -2.5316$.

```r
# Area left of the larger z-score minus area left of the smaller z-score
pnorm(q = 1.2658) - pnorm(q = -2.5316)
```

So $P(0.50 < \hat{p} < 0.56) = P(-2.5316 < z < 1.2658) = 0.8915$. There is an $89.15\%$ chance that a sample of $1{,}000$ Americans will have a sample proportion of between $50\%$ and $56\%$.
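The three parts can also be computed without converting to $z$-scores by hand, since `pnorm` accepts `mean` and `sd` arguments. The sketch below uses the unrounded standard error, so its results may differ slightly from the values above, which were obtained with the standard error rounded to $0.0158$. The names `part1`, `part2`, and `part3` are labels chosen here.

```r
p  <- 0.54
n  <- 1000
se <- sqrt(p * (1 - p) / n)   # unrounded standard error

part1 <- pnorm(q = 0.50, mean = p, sd = se)        # P(p-hat < 0.50)
part2 <- 1 - pnorm(q = 0.56, mean = p, sd = se)    # P(p-hat > 0.56)
part3 <- pnorm(q = 0.56, mean = p, sd = se) -
         pnorm(q = 0.50, mean = p, sd = se)        # P(0.50 < p-hat < 0.56)

round(c(part1, part2, part3), 4)
```

Because the three events cover every possible outcome, the three probabilities should sum to $1$, which is a useful sanity check.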

***

### Example 4.4.2

Apparently having nothing better to do, Yan flips $20$ coins, calculates the proportion of heads that come up, then repeats the process. Below what value would we expect the proportion of heads to fall $25\%$ of the time?

#### Solution

Let's start by calculating the properties of the sampling distribution according to the central limit theorem. We know the mean of the sampling distribution is the population proportion. We are not directly told what the population proportion is, but assuming the coins are fair, each one comes up heads $50\%$ of the time, so the proportion of heads in the population is

$$ p = 0.50. $$

We can use $p$ to calculate the standard error:

$$ \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5(1 - 0.5)}{20}} = 0.11180. $$

So $\hat{P} \sim N(0.50, 0.11180)$.

Now we need to find the $\hat{p}$-value so that $25\%$ of possible sample proportions will be smaller than that $\hat{p}$-value.

We start by using R to find the $z$-score associated with the probability we are given.

```r
# z-score with 25% of the standard normal curve's area to its left
qnorm(p = 0.25)
```

This means $P(z < -0.67449) = 0.25$.

Now we use this $z$-score to calculate

$$ \hat{p} = p + z\sqrt{\frac{p(1-p)}{n}} = 0.50 + (-0.67449)(0.11180) = 0.42459. $$

So $P(\hat{p} < 0.42459) = 0.25$. That is, when Yan tosses his $20$ coins, we expect the proportion of heads among the $20$ coins to be smaller than $\hat{p} = 0.42459 = 42.459\%$ only $25\%$ of the time.
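The cutoff can also be found in a single step, because `qnorm` accepts `mean` and `sd` arguments. The sketch below computes it both ways as a cross-check; the names `se` and `cutoff` are chosen here for convenience.

```r
n  <- 20
p  <- 0.50
se <- sqrt(p * (1 - p) / n)   # standard error

# 25th percentile of the approximating normal distribution, found directly
cutoff <- qnorm(p = 0.25, mean = p, sd = se)
round(cutoff, 5)

# The same value via the z-score route: p + z * standard error
p + qnorm(p = 0.25) * se
```

Note that the `p` argument of `qnorm` is the left-tail probability, not the population proportion, so the two uses of the letter should not be confused.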