# Sampling and Confidence

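**Population** the full set of items or individuals of interest
**Sampling** drawing a subset of a population, usually at random, to draw conclusions about the whole population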
**Probability sampling** each member of the population of interest has some nonzero probability of being included in the sample
**Simple random sample** each member of the population has an equal chance of being chosen for the sample
**Stratified sampling** the population is partitioned into subgroups, and the sample is built by randomly sampling from each subgroup; used to increase the probability that the sample is representative of the population
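**Sample mean** the average of the values in a sample
**Population mean** the average of the values in the entire population; usually unknown and estimated by the sample mean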
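
The difference between a simple random sample and a stratified sample can be sketched in a few lines. This is only an illustration: the population, the group labels `"A"` and `"B"`, and the sample size of 50 are all hypothetical, and proportional allocation is just one possible way to size the strata.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 900 members of group "A" and 100 of group "B".
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

# Simple random sample: every member has an equal chance of being chosen.
simple = random.sample(population, 50)

# Stratified sample: partition the population by group, then sample each
# group at random, in proportion to the group's share of the population.
strata = {}
for member in population:
    strata.setdefault(member[0], []).append(member)

stratified = []
for group, members in strata.items():
    k = round(50 * len(members) / len(population))
    stratified.extend(random.sample(members, k))

counts = {g: sum(1 for m in stratified if m[0] == g) for g in strata}
print(counts)  # {'A': 45, 'B': 5}
```

The stratified sample always contains 45 members of A and 5 of B, while the group counts in the simple random sample vary from draw to draw.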

In [2]:
import random
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate

In [3]:
def gaussian(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    factor1 = 1 / (sigma * np.sqrt(2 * np.pi))
    factor2 = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return factor1 * factor2

area = round(integrate.quad(gaussian, -3, 3, (0, 1))[0], 4)  # mu = 0, sigma = 1
print('Probability of being within 3', 'of true mean of tight dist. = ', area)
area = round(integrate.quad(gaussian, -3, 3, (0, 100))[0], 4)  # mu = 0, sigma = 100
print('Probability of being within 3', 'of true mean of wide dist. = ', area)

Probability of being within 3 of true mean of tight dist. =  0.9973
Probability of being within 3 of true mean of wide dist. =  0.0239


# Central Limit Theorem
The central limit theorem explains why it is possible to use a single sample drawn from a population to estimate the variability of the means of a set of hypothetical samples.
* Given a set of sufficiently large samples drawn from the same population, the means of the samples (the sample means) will be approximately normally distributed.
* The normal distribution will have a mean close to the mean of the population.
* The variance (computed using numpy.var) of the sample means will be close to the variance of the population divided by the sample size.
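
The three claims above can be checked with a small simulation. This is a minimal sketch, not a proof: the exponential population, the seed, `sample_size`, and `num_samples` are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, deliberately non-normal population (exponential, mean near 10).
population = rng.exponential(scale=10.0, size=100_000)

sample_size = 100
num_samples = 2_000

# Draw many samples (with replacement) and record the mean of each one.
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(num_samples)
])

print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())
print("population var / n:  ", np.var(population) / sample_size)
print("var of sample means: ", np.var(sample_means))
```

The first two printed values should be close to each other, and so should the last two. A histogram of `sample_means` (for example with `plt.hist`) should look roughly bell-shaped even though the population itself is skewed.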

# Standard Error of the Mean
**SEM** for a sample of size n is the standard deviation of the means of an infinite number of samples of size n drawn from the same population.


If $s$ is the *sample* standard deviation and $n$ is the sample size, the SEM is estimated as:

$$
\text{SEM} = \frac{s}{\sqrt{n}}
$$

If the *population* standard deviation $(\sigma)$ is known:

$$
\text{SEM} = \frac{\sigma}{\sqrt{n}}
$$
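
Both formulas can be sketched in Python. The sample below is synthetic, and the values `loc=50.0`, `scale=8.0`, and `size=40` are assumptions chosen only for illustration. Note that `np.std` needs `ddof=1` to give the sample standard deviation, and that `scipy.stats.sem` uses `ddof=1` by default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sample of 40 values from a population with sigma = 8.
sample = rng.normal(loc=50.0, scale=8.0, size=40)
n = len(sample)

# SEM estimated from the sample standard deviation s.
s = np.std(sample, ddof=1)
sem_from_s = s / np.sqrt(n)

# SEM from the population standard deviation, known here only because
# this sketch generated the data itself.
sigma = 8.0
sem_from_sigma = sigma / np.sqrt(n)

print("s / sqrt(n):      ", sem_from_s)
print("scipy.stats.sem:  ", stats.sem(sample))
print("sigma / sqrt(n):  ", sem_from_sigma)
```

The first two values agree exactly. The third differs slightly because $s$ is only an estimate of $\sigma$.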


If two 95% CIs overlap, the two means MIGHT still be significantly different.

Why?
Because each CI reflects only the uncertainty of its own mean; checking whether the intervals overlap is not the same as a proper two-sample test.

Instead, use the difference between the means.

Difference between means:

$$
83.5 - 67.2 = 16.3
$$

This is the observed effect size.

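Now compare it to the combined uncertainty. For two independent samples, the standard error of the difference is roughly:

$$
\text{SE}_{\text{diff}} = \sqrt{\text{SEM}_1^2 + \text{SEM}_2^2}
$$

An approximate 95% CI for the difference is the observed difference $\pm\ 1.96 \cdot \text{SE}_{\text{diff}}$.

Why the difference is significant at the 95% level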

To decide whether two means differ significantly, we ask:

❓ Does the Confidence Interval for the difference include 0?

Why zero?
Because 0 difference means “the two means are equal.”

Our CI is:

$$
[1.5,\ 31.1]
$$

Key point:

0 is NOT inside that interval.

This exclusion of 0 is the criterion that makes the difference statistically significant at the 95% level.

✔ Because 0 is excluded → the two means are statistically different
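
The reasoning above can be sketched in Python. The means 83.5 and 67.2 come from the example, but the notes do not state the individual SEMs. The value `5.34` used for both groups is an assumption, chosen only because it is consistent with the interval $[1.5, 31.1]$ quoted above. The helper names `mean_ci` and `diff_ci` are hypothetical, and the normal approximation with $z = 1.96$ and independent samples is assumed throughout.

```python
import math

def mean_ci(mean, sem, z=1.96):
    """Approximate 95% CI for a single mean (normal approximation)."""
    margin = z * sem
    return mean - margin, mean + margin

def diff_ci(mean1, sem1, mean2, sem2, z=1.96):
    """Approximate 95% CI for the difference of two independent means."""
    diff = mean1 - mean2
    se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)  # combined uncertainty
    margin = z * se_diff
    return diff - margin, diff + margin

mean1, mean2 = 83.5, 67.2   # means from the example above
sem1 = sem2 = 5.34          # ASSUMED: the SEMs are not given in the notes

ci1 = mean_ci(mean1, sem1)
ci2 = mean_ci(mean2, sem2)
lo, hi = diff_ci(mean1, sem1, mean2, sem2)

cis_overlap = ci1[0] <= ci2[1] and ci2[0] <= ci1[1]
print("individual CIs overlap:", cis_overlap)               # True
print("CI for the difference: [%.1f, %.1f]" % (lo, hi))     # [1.5, 31.1]
print("significant at 95%:", not (lo <= 0 <= hi))           # True
```

Under these assumed SEMs the two individual intervals overlap, yet the interval for the difference excludes 0, which is the situation described at the start of this section.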