# Confidence Interval

## Confidence Interval (for a Mean)

$$ 
ME = z^\star \frac{s}{\sqrt{n}}
$$

In [40]:
# 95% CI = mu +/- 1.96 se
qnorm((1-0.95)/2)

In [41]:
# 98% CI = mu +/- 2.33 se
qnorm((1-0.98)/2)

In [42]:
# 99% CI = mu +/- 2.58 se
qnorm((1-0.99)/2)

## Accuracy vs Precision
* Commonly used confidence levels are 90%, 95%, 98%, and 99%.
* A wider interval (higher confidence level) is more likely to capture the true population parameter, which increases accuracy but decreases precision.
* The way to get both higher precision and higher accuracy is to increase the sample size, which shrinks the standard error and therefore the margin of error.
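
The trade-off in the bullets above can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper `margin_of_error` is a hypothetical name, and the standard deviation (10) and sample sizes (100 and 400) are made-up illustration values.

```r
# Margin of error for a mean: z* times s / sqrt(n).
# NOTE: s = 10 and the sample sizes are made-up values for illustration.
margin_of_error <- function(level, s, n) {
  z_star <- qnorm(1 - (1 - level) / 2)  # positive critical value
  z_star * s / sqrt(n)
}

conf_levels <- c(0.90, 0.95, 0.98, 0.99)

# Same data, higher confidence level: wider interval (less precise).
me_n100 <- margin_of_error(conf_levels, s = 10, n = 100)

# Four times the sample size: every margin of error is halved.
me_n400 <- margin_of_error(conf_levels, s = 10, n = 400)

round(rbind(n100 = me_n100, n400 = me_n400), 2)
```

Each row grows from left to right as the confidence level rises, and the `n400` row is exactly half of the `n100` row.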

**Example**
* The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. 
* In 2010, the survey collected responses from 1,154 US residents. Based on the survey results, a 95% confidence interval for the average number of hours Americans have to relax or pursue activities that they enjoy after an average work day is 3.53 to 3.83 hours.

In [43]:
# sample mean
3.53 + (3.83-3.53)/2

In [44]:
# standard error
(3.83-3.53)/2/1.96

In [45]:
# margin of error
(3.83-3.53)/2
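
As a further step, the same interval can be worked backwards to the sample standard deviation, since the standard error is `s / sqrt(n)`. This is a sketch under the assumption that the interval was built with the normal critical value; the `gss_` variable names are introduced here for illustration only.

```r
gss_lower <- 3.53
gss_upper <- 3.83
gss_n     <- 1154

z_star <- qnorm(1 - (1 - 0.95) / 2)    # about 1.96
gss_me <- (gss_upper - gss_lower) / 2  # margin of error
gss_se <- gss_me / z_star              # standard error
gss_s  <- gss_se * sqrt(gss_n)         # implied sample standard deviation

round(c(me = gss_me, se = gss_se, s = gss_s), 3)
```

The implied standard deviation comes out at roughly 2.6 hours.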

## Required Sample Size for Margin of Error (ME)
* All else held constant, as sample size increases, the margin of error decreases.

$$ 
n = \left( \frac{z^\star s}{ME} \right)^2
$$

**Example**

* Suppose a group of researchers want to test the possible effect of an epilepsy medication taken by pregnant mothers on the cognitive development of their children. As evidence, they want to estimate the IQs of three-year-old children born to mothers who were on this medication during their pregnancy.
* Previous studies suggest that the standard deviation of IQ scores of three-year-old children is 18 points. 

* How many such children should the researchers sample in order to obtain a 90% confidence interval with a margin of error less than or equal to four points?

In [58]:
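me <- 4    # target margin of error (IQ points)
ci <- 0.9  # confidence level
sd <- 18   # standard deviation from previous studies
z <- abs(qnorm((1-ci)/2))  # critical value, about 1.645

(n <- ((z * sd)/me)^2)
ceiling(n)  # round up so the margin of error is at most 4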

* How would the required sample size change if we want to further decrease the margin of error, to two points?

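$$
\begin{aligned}
\frac{1}{x} ME &= z^\star \frac{s}{\sqrt{n}} \cdot \frac{1}{x} \\
&= z^\star \frac{s}{\sqrt{n x^2}}
\end{aligned}
$$

* Shrinking the margin of error by a factor of $x$ therefore requires $x^2$ times the sample size: halving it ($x = 2$) takes about four times as many children.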

In [60]:
me <- 2
z <- abs(qnorm((1-0.9)/2))  # 90% critical value, about 1.645
(n <- ((z * sd)/me)^2)
ceiling(n)
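
The two calculations can be wrapped in one small helper. This is a sketch: `required_n` is a hypothetical function name, and it uses the unrounded critical value from `qnorm`, so its results can differ slightly from a hand calculation with a rounded $z^\star$.

```r
# Smallest sample size whose margin of error is at most `me`.
required_n <- function(me, s, level) {
  z_star <- qnorm(1 - (1 - level) / 2)  # positive critical value
  ceiling((z_star * s / me)^2)          # always round up
}

n_me4 <- required_n(me = 4, s = 18, level = 0.90)
n_me2 <- required_n(me = 2, s = 18, level = 0.90)

c(me_4 = n_me4, me_2 = n_me2)
```

Halving the margin of error from four points to two raises the required sample from 55 to 220 children, a factor of four.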

**Example**

* A sample of 50 college students was asked how many exclusive relationships they had been in so far.
* The students in the sample had an average of 3.2 exclusive relationships, with a standard deviation of 1.74.
* In addition, the sample distribution was only slightly skewed to the right.

* Estimate the true average number of exclusive relationships based on this sample using a 95% confidence interval.

In [77]:
n <- 50  
mu <- 3.2  
sd <- 1.74  

ci <- 0.95
z <- abs(round(qnorm((1-ci)/2), 2))

se <- sd/sqrt(n)

me <- z * se

# 1.96 * 1.74/sqrt(50)

mu - me
mu + me

* What is the correct calculation of the 98% confidence interval for the average number of exclusive relationships college students have been in?

In [78]:
ci <- 0.98
z <- abs(round(qnorm((1-ci)/2), 2))
me <- z * se

# 2.33 * 1.74/sqrt(50)

mu - me
mu + me
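
Both intervals follow the same recipe, so they can be computed side by side. This is a minimal sketch: `mean_ci` is a hypothetical helper, and it uses the unrounded normal critical value rather than the two-decimal `z` used in the cells above.

```r
# Normal-based confidence interval for a mean from summary statistics.
mean_ci <- function(x_bar, s, n, level) {
  z_star <- qnorm(1 - (1 - level) / 2)  # positive critical value
  me <- z_star * s / sqrt(n)            # margin of error
  c(lower = x_bar - me, upper = x_bar + me)
}

ci_95 <- mean_ci(x_bar = 3.2, s = 1.74, n = 50, level = 0.95)
ci_98 <- mean_ci(x_bar = 3.2, s = 1.74, n = 50, level = 0.98)

round(rbind(ci_95, ci_98), 2)
```

The 98% interval contains the 95% interval: the same data with a higher confidence level gives a wider range.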

## Exercises

OpenIntro Statistics, 4th edition<br>
5.7, 5.9, 5.11, 5.13

**5.7 Chronic illness, Part I.** 
* In 2013, the Pew Research Foundation reported that “45% of U.S. adults report that they live with one or more chronic conditions”. However, this value was based on a sample, so it may not be a perfect estimate for the population parameter of interest on its own. The study reported a standard error of about 1.2%, and a normal model may reasonably be used in this setting. 
* Create a 95% confidence interval for the proportion of U.S. adults who live with one or more chronic conditions. 
* Also interpret the confidence interval in the context of the study.

In [1]:
mu <- 0.45
se <- 0.012
mu - se * 1.96
mu + se * 1.96
# We are 95% confident that the proportion of U.S. adults who live with 
# one or more chronic conditions is between 42.6% and 47.4%.

**5.9 Chronic illness, Part II.** 
* In 2013, the Pew Research Foundation reported that “45% of U.S. adults report that they live with one or more chronic conditions”, and the standard error for this estimate is 1.2%.
Identify each of the following statements as true or false. Provide an explanation to justify each of your answers.
* (a) We can say with certainty that the confidence interval from Exercise 5.7 contains the true percentage of U.S. adults who suffer from a chronic illness.
* (b) If we repeated this study 1,000 times and constructed a 95% confidence interval for each study, then approximately 950 of those confidence intervals would contain the true fraction of U.S. adults who suffer from chronic illnesses.
* (c) The poll provides statistically significant evidence (at the α = 0.05 level) that the percentage of U.S. adults who suffer from chronic illnesses is below 50%.
* (d) Since the standard error is 1.2%, only 1.2% of people in the study communicated uncertainty about their answer.

In [4]:
# (a) False. We are only 95% confident that the confidence interval from Exercise 5.7
# contains the true parameter. A confidence interval is a range of plausible values,
# and sometimes it misses the truth: about 5% of samples produce a 95% confidence
# interval that does not contain the true value.
# (b) True. 950 out of 1,000 intervals is 95% of the intervals, which matches
# the 95% confidence level.
# (c) True. The 95% confidence interval from Exercise 5.7 (42.6% to 47.4%) lies
# entirely below 50%, so 50% is not a plausible value at the α = 0.05 level.
# (d) False. The standard error represents the variability between samples.
# It describes the uncertainty in the overall point estimate due to randomness,
# not the uncertainty in any individual's response.
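
Statement (b) can be illustrated with a small simulation. This is a sketch under stated assumptions, not a result from the study: the "true" proportion is set to 0.45, the sample size of 1,719 is chosen only because it gives a standard error of about 1.2%, and the seed is arbitrary.

```r
set.seed(1)           # arbitrary seed, for reproducibility only
sim_true_p <- 0.45    # ASSUMED true proportion for the simulation
sim_n      <- 1719    # ASSUMED sample size (gives an SE of about 1.2%)
sim_reps   <- 1000    # number of repeated studies

# One sample proportion and one 95% interval per simulated study.
sim_p_hat <- rbinom(sim_reps, size = sim_n, prob = sim_true_p) / sim_n
sim_se    <- sqrt(sim_p_hat * (1 - sim_p_hat) / sim_n)
z_star    <- qnorm(0.975)

# TRUE where the interval contains the assumed true proportion.
sim_covered <- (sim_p_hat - z_star * sim_se <= sim_true_p) &
               (sim_true_p <= sim_p_hat + z_star * sim_se)

mean(sim_covered)     # share of intervals that capture the truth
```

The share of covering intervals should land close to 0.95, though it varies from run to run with the seed.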

**5.11 Waiting at an ER, Part I.** 
* A hospital administrator hoping to improve wait times decides to estimate the average emergency room waiting time at her hospital. She collects a simple random sample of 64 patients and determines the time (in minutes) between when they checked in to the ER until they were first seen by a doctor. A 95% confidence interval based on this sample is (128 minutes, 147 minutes), which is based on the normal model for the mean. Determine whether the following statements are true or false, and explain your reasoning.
* (a) We are 95% confident that the average waiting time of these 64 emergency room patients is between 128 and 147 minutes.
* (b) We are 95% confident that the average waiting time of all patients at this hospital’s emergency room is between 128 and 147 minutes.
* (c) 95% of random samples have a sample mean between 128 and 147 minutes.
* (d) A 99% confidence interval would be narrower than the 95% confidence interval since we need to be more sure of our estimate.
* (e) The margin of error is 9.5 and the sample mean is 137.5.
* (f) In order to decrease the margin of error of a 95% confidence interval to half of what it is now, we would need to double the sample size.

In [17]:
# (a) False. We are 100% confident that the average waiting time of the sampled 64 patients
# is 137.5 minutes. The confidence interval is about the population mean, not the sample mean.
# (b) True, provided the observations are independent and the sample is large enough
# (here n = 64) with no extreme skew, so the normal model for the mean is reasonable.
# (c) False. The confidence interval estimates the population mean. It does not describe
# where the means of other random samples will fall.
# (d) False. A 99% confidence interval would be wider since we have to capture more plausible 
# values of the true parameter.
# (e) True. The sample mean is the midpoint of the interval, (128 + 147)/2 = 137.5.
# The margin of error is half the width of the interval, (147 - 128)/2 = 9.5.
# (f) False. We would need 4 times the sample size, because the margin of error
# shrinks with the square root of n.
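
The numerical claims in (d), (e), and (f) can be checked directly from the reported interval. This is a minimal sketch that assumes the interval was built with the normal critical value; the `er_` variable names are introduced here for illustration.

```r
er_lower <- 128
er_upper <- 147
er_n     <- 64

# (e) Midpoint and half-width of the interval.
er_mean <- (er_lower + er_upper) / 2   # sample mean
er_me   <- (er_upper - er_lower) / 2   # margin of error at 95%

# (d) Same standard error, larger critical value: a wider 99% interval.
er_se    <- er_me / qnorm(0.975)
er_me_99 <- qnorm(0.995) * er_se

# (f) Halving the margin of error needs 2^2 = 4 times the sample size.
er_n_half_me <- er_n * 2^2

c(mean = er_mean, me_95 = er_me, me_99 = er_me_99, n_half_me = er_n_half_me)
```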

**5.13 Website registration.** 
* A website is trying to increase registration for first-time visitors, exposing 1% of these visitors to a new site design. Of 752 randomly sampled visitors over a month who saw the new design, 64 registered.
* (a) Check any conditions required for constructing a confidence interval.
* (b) Compute the standard error.
* (c) Construct and interpret a 90% confidence interval for the fraction of first-time visitors of the site who would register under the new design (assuming stable behaviors by new visitors over time).

In [38]:
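# (a) The visitors are randomly sampled, so independence is reasonable.
# The success-failure condition is also satisfied: there are at least 10 successes
# (64 registrations) and at least 10 failures (688 non-registrations).
# Hence, the central limit theorem should hold and the sampling distribution
# of the sample proportion should be nearly normal.
(p <- 64/752)
752 * p       # number of successes
752 * (1-p)   # number of failures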

In [24]:
# (b)
round(sqrt(p * (1-p) / 752), 3)

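In [37]:
# (c)
(z <- abs(round(qnorm((1-0.9)/2), 3)))
se <- sqrt(p * (1-p) / 752)  # unrounded standard error from (b)
(me <- z * se)

round(p - me, 3)
round(p + me, 3)
# We are 90% confident that between 6.8% and 10.2% of first-time visitors
# would register under the new design.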