# Central Limit Theorem

## Sampling Variability and CLT

**Example**

* P(a randomly chosen song lasts more than 5 minutes)?

In [1]:
# number of songs on the iPod = 3000
n <- 3000
# mean song length = 3.45 mins
mu <- 3.45
# sd of song length = 1.63 mins
s <- 1.63

In [2]:
# x = length of a single song (mins)
# P(x > 5) = (number of songs longer than 5 mins, summed from the histogram bins) / n
(p_x_greater_5 <- (350+100+25+20+5) / n)

* P(A random playlist of 100 songs lasts more than 6 hours)? Equivalently, P(average song length in the playlist is more than 3.6 minutes)?

In [3]:
# sample size is 100 songs
m <- 100
# standard error = sd / sqrt(m)
se <- s/sqrt(m)

In [4]:
# 6 hours = 360 mins
# P(x1 + x2 + ... + x100 > 360 mins)
# = P(sample mean of x > 3.6 mins), approximately normal by the CLT
(p_xbar_greater_3.6 <- 1 - pnorm(3.6, mu, se))
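
The normal approximation above can be sanity-checked by simulation. The sketch below assumes a hypothetical right-skewed population: gamma-distributed song lengths chosen only to match the stated mean (3.45 mins) and sd (1.63 mins). The real iPod data are not available here, so the gamma shape and the names `pop_mean`, `pop_sd`, `playlist_means`, `sim_prob`, and `clt_prob` are illustrative assumptions.

```r
# Hypothetical population: gamma song lengths matching the stated mean and sd.
# This distribution is an assumption, not the actual iPod data.
set.seed(1)
pop_mean <- 3.45
pop_sd <- 1.63
shape <- (pop_mean / pop_sd)^2  # gamma shape = mean^2 / variance
rate <- pop_mean / pop_sd^2     # gamma rate = mean / variance

# Draw 10,000 playlists of 100 songs and record each playlist's mean length
playlist_means <- replicate(10000, mean(rgamma(100, shape = shape, rate = rate)))

# Simulated P(mean > 3.6 mins), i.e. the playlist runs longer than 6 hours
sim_prob <- mean(playlist_means > 3.6)

# CLT-based normal approximation with standard error sd / sqrt(100)
clt_prob <- pnorm(3.6, mean = pop_mean, sd = pop_sd / sqrt(100), lower.tail = FALSE)

c(simulated = sim_prob, clt = clt_prob)
```

If the two values are close, the skew of individual song lengths has little effect on the mean of 100 songs, which is what the CLT suggests.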

## Exercises

OpenIntro Statistics, 4th edition <br>
5.1, 5.3, 5.5

**5.1 Identify the parameter, Part I.** 

* For each of the following situations, state whether the parameter of interest is a mean or a proportion. It may be helpful to examine whether individual responses are numerical or categorical.
* (a) In a survey, one hundred college students are asked how many hours per week they spend on the Internet.
* (b) In a survey, one hundred college students are asked: “What percentage of the time you spend on the Internet is part of your course work?”
* (c) In a survey, one hundred college students are asked whether or not they cited information from Wikipedia in their papers.
* (d) In a survey, one hundred college students are asked what percentage of their total weekly spending is on alcoholic beverages.
* (e) In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.

In [45]:
# (a) Mean. The response is numerical - number of hours.
# (b) Mean. The response is numerical - a percentage.
# (c) Proportion. The response is a binary categorical - yes or no. 
# (d) Mean. The response is numerical - a percentage.
# (e) Proportion. The response is a binary categorical - get a job or not get a job. 

**5.3 Quality control.** 

* As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.
* (a) What population is under consideration in the data set? 
* (b) What parameter is being estimated? 
* (c) What is the point estimate for the parameter? 
* (d) What is the name of the statistic we can use to measure the uncertainty of the point estimate? 
* (e) Compute the value from part (d) for this context.
* (f) The historical rate of defects is 10%. Should the engineer be surprised by the observed rate of defects during the current week? 
* (g) Suppose the true population value was found to be 10%. If we use this proportion to recompute the value in part (e) using $p = 0.1$ instead of $\hat{p}$, does the resulting value change much?

In [46]:
# (a) The population is all computer chips manufactured at the factory during the week of production.
# (b) The parameter is the proportion of chips with severe defects (the defect rate).

In [50]:
# (c) Point estimate.
round(27/212, 3)

In [49]:
# (d) Standard error (the margin of error is a multiple of it, e.g. 1.96 * SE for 95% confidence).

In [57]:
# (e) SE = sqrt(p * (1-p) / n)
round(sqrt((0.127 * (1-0.127)) / 212), 3)

In [60]:
# (f) We compute the 95% confidence interval, which runs from about 0.08 to 0.17.
# The historical defect rate of 10% lies within this interval, so the engineer should not be surprised.
me <- 1.96 * 0.023
0.127 - me
0.127 + me
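
The interval above uses values rounded to three decimals. As a cross-check, the sketch below recomputes it from the raw counts without intermediate rounding and also expresses the gap between the observed and historical rates in standard-error units. The variable names (`defects`, `chips`, `p_hat`, `se_hat`, `ci`, `z_hist`) are illustrative and not part of the original solution.

```r
# Recompute the 95% interval from the raw counts, with no intermediate rounding
defects <- 27
chips <- 212
p_hat <- defects / chips                       # point estimate
se_hat <- sqrt(p_hat * (1 - p_hat) / chips)    # standard error of p_hat
z_star <- qnorm(0.975)                         # critical value, about 1.96

ci <- p_hat + c(-1, 1) * z_star * se_hat
round(ci, 3)

# How many standard errors separate the observed rate from the historical 10%
z_hist <- (p_hat - 0.10) / se_hat
round(z_hist, 2)
```

A gap of only about one standard error is not unusual, which agrees with the conclusion drawn from the interval.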

In [61]:
# (g) The value does not change much.
round(sqrt(0.1 * (1-0.1) / 212), 3)

**5.5 Repeated water samples.** 

* A nonprofit wants to understand the fraction of households that have elevated levels of lead in their drinking water. They expect at least 5% of homes will have elevated levels of lead, but not more than about 30%. They randomly sample 800 homes and work with the owners to retrieve water samples, and they compute the fraction of these homes with elevated lead levels. They repeat this 1,000 times and build a distribution of sample proportions.
* (a) What is this distribution called? 
* (b) Would you expect the shape of this distribution to be symmetric, right skewed, or left skewed? Explain your reasoning.
* (c) If the proportions are distributed around 8%, what is the variability of the distribution? 
* (d) What is the formal name of the value you computed in (c)? 
* (e) Suppose the researchers’ budget is reduced, and they are only able to collect 250 observations per sample, but they can still collect 1,000 samples. They build a new distribution of sample proportions. How will the variability of this new distribution compare to the variability of the distribution when each sample contained 800 observations?

In [5]:
# (a) Sampling distribution.
# (b) With n = 800 and a population proportion in the 5-30% range, np >= 40 and n(1-p) >= 560,
# so the success-failure condition is satisfied and the sampling distribution would be
# approximately normal, hence symmetric.

In [8]:
# (c) The variability can be represented by the standard error, sqrt(p * (1-p) / n).
p <- 0.08
round(sqrt(p * (1 - p) / 800), 4)
# (d) Standard error.
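
The repeated-sampling process described in the exercise can be imitated with a short simulation. This is a minimal sketch that assumes the true proportion is exactly 8% and that homes are sampled independently; the names `p_true`, `n_homes`, `p_hats`, `sd_sim`, and `se_theory` are hypothetical.

```r
# Simulate 1,000 sample proportions, each from 800 homes, assuming p = 0.08
set.seed(1)
p_true <- 0.08    # assumed center of the distribution, from part (c)
n_homes <- 800

# rbinom gives the count of homes with elevated lead in each sample
p_hats <- rbinom(1000, size = n_homes, prob = p_true) / n_homes

sd_sim <- sd(p_hats)                                  # spread of the simulated proportions
se_theory <- sqrt(p_true * (1 - p_true) / n_homes)    # standard error formula

c(simulated_sd = sd_sim, theoretical_se = se_theory)
```

The standard deviation of the simulated proportions should land close to the value given by the standard error formula.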

In [9]:
# (e) The variability will increase: the standard error is proportional to 1/sqrt(n),
# so samples of 250 give a wider distribution than samples of 800.
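
To put a number on part (e), the sketch below compares the standard error at the two sample sizes, again assuming the proportions are centered at 8%. The helper `se_for_n` and the other names are hypothetical, introduced only for illustration.

```r
# Standard error of a sample proportion for a given sample size, assuming p = 0.08
p_assumed <- 0.08
se_for_n <- function(n) sqrt(p_assumed * (1 - p_assumed) / n)

se_800 <- se_for_n(800)
se_250 <- se_for_n(250)

# The ratio depends only on the sample sizes: sqrt(800 / 250)
round(c(n_800 = se_800, n_250 = se_250, ratio = se_250 / se_800), 4)
```

Because the ratio equals sqrt(800 / 250) regardless of the assumed proportion, the smaller samples produce a distribution that is wider by the same factor whatever the true lead rate is.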