# Confidence Intervals and p-Values

To see how often an interval of two standard errors around $\bar{X}$ contains the true proportion $p$, we need to compute:
$$
    \mathbb{P}\left( \bar{X} - 2\hat{SE}\left(\bar{X}\right) \le p \le \bar{X} + 2\hat{SE}\left(\bar{X}\right) \right)
$$
which can be simplified into:
$$
    \mathbb{P}\left( -2 \le \frac{\bar{X} - p}{\hat{SE}\left( \bar{X} \right)} \le 2 \right)
$$
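By the central limit theorem, the ratio in the middle is approximately standard normal:
$$
    \frac{\bar{X} - p}{\hat{SE}\left( \bar{X} \right)} =: Z \sim N(0, 1),
$$
so the probability we need is approximately $\mathbb{P}\left( -2 \le Z \le 2 \right)$.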
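The probability for a standard normal $Z$ can be evaluated directly. The cell below is a minimal sketch that assumes the `Distributions` package is installed; the variable name `coverage` is only illustrative.

```julia
using Distributions

# Probability that a standard normal Z lands within 2 of zero.
coverage = cdf(Normal(0, 1), 2) - cdf(Normal(0, 1), -2)
println("P(-2 <= Z <= 2) = ", round(coverage, digits=4))
```

This comes out at about 0.9545, which is why an interval of two standard errors is commonly treated as a roughly 95% confidence interval.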

In [2]:
using Distributions
using StatsBase

### Solving for $z$ with `qnorm`
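For a 99% confidence interval we need the value $z$ such that $\mathbb{P}(-z \le Z \le z) = 0.99$. That leaves $0.005$ in each tail, so $z$ is the $0.995$ quantile of the standard normal: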

In [3]:
# Define the equivalent of R's pnorm: the standard normal CDF.
pnorm(x::Number) = cdf(Normal(0, 1), x);
# Define the equivalent of R's qnorm: the standard normal quantile function.
qnorm(x::Number) = quantile(Normal(0, 1), x);
# A 99% interval leaves 0.005 in each tail, so we need the 0.995 quantile.
confidence = 0.995
# Calculate z for the confidence interval.
z = qnorm(confidence)
# pnorm inverts qnorm, so this recovers the probability we started from.
println("pnorm(qnorm($confidence)) = ", pnorm(z))
# The lower tail holds the remaining 1 - confidence.
println("pnorm(qnorm(1 - $confidence)) = ", pnorm(qnorm(1 - confidence)))
# The interval [-z, z] therefore has probability 0.99.
println("pnorm(z) - pnorm(-z) = ", pnorm(z) - pnorm(-z))

pnorm(qnorm(0.995)) = 0.9950000000000001
pnorm(qnorm(1 - 0.995)) = 0.0049999999999999324
pnorm(z) - pnorm(-z) = 0.9900000000000002
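The same reasoning works for any confidence level: split the leftover probability evenly between the two tails and take the upper quantile. The sketch below assumes the `Distributions` package; `z_for_level` is a hypothetical helper name, not something defined elsewhere in this notebook.

```julia
using Distributions

# Hypothetical helper: z such that P(-z <= Z <= z) = level for a standard normal Z.
z_for_level(level::Real) = quantile(Normal(0, 1), 1 - (1 - level) / 2)

for level in (0.90, 0.95, 0.99)
    println("level = ", level, "  z = ", round(z_for_level(level), digits=3))
end
```

For 95% this gives roughly 1.96, which is the value that the factor of 2 used in this section rounds up to.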


### A confidence interval from one simulated sample

In [4]:
p = 0.45
N = 1000
# Generate N observations.
x = sample([0, 1], Weights([1-p, p]), N)
# Calculate the sample mean X̄ (stored in x̂), our estimate of p.
x̂ = mean(x)
# Estimate the standard error of the mean of N observations.
sê = sqrt(x̂ * (1 - x̂) / N)
# Build interval of 2 * SE above and below mean.
interval = [x̂ - 2*sê, x̂ + 2*sê]

2-element Vector{Float64}:
 0.42449990476204874
 0.4875000952379513

## A Monte Carlo Simulation for Confidence Intervals

In [5]:
p = 0.45
N = 1000
B = 10000
function getMonteCarloSimulation(p::Float64, N::Int64, B::Int64)
    # Obtain the samples.
    x = sample([0, 1], Weights([1 - p, p]), (N, B))
    # Get the mean vector.
    x̂ = mean(x, dims=1) |> vec
    # Get the standard error.
    sê = sqrt.(x̂ .* (1 .- x̂) / N)
    # Check, for each sample, whether p falls inside its confidence interval.
    inside = x̂ - 2 * sê .<= p .<= x̂ + 2 * sê
    # Return the fraction of intervals that contain p.
    return mean(inside)
end

getMonteCarloSimulation(p, N, B)

0.9559
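As a rough sanity check on this number (a sketch that assumes the normal approximation holds and that the $B$ simulated intervals are independent): the nominal coverage of an interval of two standard errors is $\mathbb{P}(-2 \le Z \le 2) \approx 0.9545$, and the Monte Carlo standard error of a proportion estimated from $B = 10000$ replications is
$$
    \sqrt{\frac{0.9545 \cdot (1 - 0.9545)}{10000}} \approx 0.0021,
$$
so the simulated $0.9559$ lies within one standard error of the nominal value.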

## Power

### Confidence interval for the spread with a sample of size 25

In [6]:
N = 25
x̂ = 0.48
(2 * x̂ - 1) .+ [-2, 2] * 2 * sqrt(x̂ * (1 - x̂) / N)

2-element Vector{Float64}:
 -0.4396798718974975
  0.35967987189749745

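This interval includes $0$, so a sample of 25 is too small to tell whether the spread differs from zero. If the sample size were $2500$ with the same $\bar{X} = 0.48$, the poll would tell us more, since $0$ would no longer be part of the computed interval.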
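The effect of the sample size can be checked by repeating the calculation for both values of $N$. This is a minimal sketch; `spread_interval` is a hypothetical helper introduced only for this comparison, and it reuses the same factor of 2 standard errors as the cell above.

```julia
# Hypothetical helper: interval of 2 standard errors for the spread 2 * xbar - 1.
# The standard error of the spread is twice the standard error of xbar.
function spread_interval(xbar::Real, N::Integer)
    se_spread = 2 * sqrt(xbar * (1 - xbar) / N)
    return (2 * xbar - 1) .+ [-2, 2] * se_spread
end

for N in (25, 2500)
    lo, hi = spread_interval(0.48, N)
    println("N = ", N, ": [", lo, ", ", hi, "]  contains 0: ", lo <= 0 <= hi)
end
```

With $N = 2500$ the upper end of the interval falls just below zero, so the exclusion of $0$ is marginal rather than decisive.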

## $p$-Values

Let us define the null hypothesis: in this case, that the spread is 0.

Take the observed spread $2 \bar{X} - 1 = 0.04$ (that is, $\bar{X} = 0.52$) and consider the $p$-value: how likely is it to see a spread at least this large in absolute value when the null hypothesis is true?

We write it with probability:
$$
    \mathbb{P}\left( \left| \bar{X} - 0.5 \right| > 0.02 \right)
$$
that is, _the probability that the absolute value of the spread is 4% or more_.

Under the null hypothesis $p = 0.5$, so we know that
$$
    \sqrt{N} \frac{\bar{X} - 0.5}{\sqrt{0.5 \cdot (1 - 0.5)}}
$$
is approximately standard normal. We can therefore compute the probability by standardizing both sides of the inequality:
$$
    \mathbb{P}\left( \sqrt{N}\frac{\left| \bar{X} - 0.5 \right|}{\sqrt{0.5 \cdot (1 - 0.5)}} > \sqrt{N} \frac{0.02}{\sqrt{0.5 \cdot (1 - 0.5)}} \right)
$$
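which, since $\sqrt{0.5 \cdot (1 - 0.5)} = 0.5$, reduces to
$$
    \mathbb{P}\left( \sqrt{N}\frac{\left| \bar{X} - 0.5 \right|}{0.5} > z \right), \qquad z = \sqrt{N} \frac{0.02}{0.5}.
$$
The left-hand side is the absolute value of an approximately standard normal variable, so for a sample of size $N = 100$ the probability can be computed with the following Julia cell: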

In [7]:
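N = 100
# Standardized threshold: sqrt(N) * 0.02 / sqrt(0.5 * (1 - 0.5)).
z = sqrt(N) * 0.02 / 0.5
# Two-sided p-value: the probability outside [-z, z] (pnorm is defined in an earlier cell).
pValue = 1 - (pnorm(z) - pnorm(-z))
print("p-value: ", pValue)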

p-value: 0.6891565167793516
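To see how strongly this result depends on the sample size, the calculation can be wrapped in a function. The sketch below assumes the `Distributions` package; `spread_pvalue` is a hypothetical helper name, and it uses `ccdf` (the upper-tail probability) in place of `1 - pnorm(z)`.

```julia
using Distributions

# Hypothetical helper: two-sided p-value for the null hypothesis p = 0.5
# (a spread of 0), using the normal approximation for the sample mean.
function spread_pvalue(xbar::Real, N::Integer)
    z = sqrt(N) * abs(xbar - 0.5) / 0.5
    return 2 * ccdf(Normal(0, 1), z)
end

for N in (100, 2500)
    println("N = ", N, "  p-value = ", round(spread_pvalue(0.52, N), digits=4))
end
```

The same observed $\bar{X} = 0.52$ gives a $p$-value of about 0.69 with 100 observations but about 0.046 with 2500.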

Note that there is a close relationship between confidence intervals and $p$-values. If a 95% confidence interval for the spread does not include $0$, we know that the $p$-value must be smaller than $0.05$, or 5%.

### Another Explanation of $p$-Values

The $p$-value is the probability of observing a value as extreme as, or more extreme than, the observed result, given that the null hypothesis is true.

In the context of the normal distribution, this refers to the probability of observing a Z-score whose absolute value is at least as large as that of the Z-score of interest.

Suppose we want to find the $p$-value of an observation 2 standard deviations larger than the mean. This means we are looking for anything with $|z| \ge 2$.

Graphically, the $p$-value gives the probability of an observation that is at least as far from the mean as the one observed. This plot shows a standard normal distribution (centered at $z = 0$ with a standard deviation of $1$). The shaded tails are the regions of the graph that are 2 standard deviations or more away from the mean.

The right tail can be found with `1 - pnorm(2)`. We want to have both tails, though, because we want to find the probability of any observation as far away from the mean or farther, in either direction. (This is what's meant by a two-tailed $p$-value.) Because the distribution is symmetrical, the right and left tails are the same size and we know that our desired value is just `2 * (1 - pnorm(2))`.

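Recall that, in R, `pnorm()` by default gives the CDF of a normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$, and accepts the mean and standard deviation as extra arguments. The `pnorm` defined in this notebook takes a single argument, so to find the two-tailed $p$-value of an observation $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$, standardize first: `2 * (1 - pnorm(abs(x - μ) / σ))`.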
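A two-tailed $p$-value for an arbitrary normal distribution can be sketched as follows. This assumes the `Distributions` package; `two_tailed_pvalue` is a hypothetical helper, and its default arguments mimic the standard normal defaults of R's `pnorm`.

```julia
using Distributions

# Hypothetical helper: two-tailed p-value of an observation x under Normal(μ, σ).
# abs() measures the distance from the mean, so x may lie on either side of μ.
two_tailed_pvalue(x::Real, μ::Real=0, σ::Real=1) =
    2 * ccdf(Normal(μ, σ), μ + abs(x - μ))

println(round(two_tailed_pvalue(2), digits=4))             # 2 SDs above the mean
println(round(two_tailed_pvalue(-2), digits=4))            # 2 SDs below the mean
println(round(two_tailed_pvalue(130, 100, 15), digits=4))  # also 2 SDs above
```

All three calls describe an observation 2 standard deviations from the mean, so they return the same value, about 0.0455.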