# Sampling Distributions Notes

In [6]:
import numpy as np
import random as rd
import itertools as it  # used as an iteration tool for obtaining all possible samples
import matplotlib.pyplot as plt

## Determining Sampling Distribution of Sample Means

Suppose the following represents the prices of all laptops in existence:

\\$1000, \\$1200, \\$1600, \\$2000

We can determine the population mean and population standard deviation after defining the prices as a list in the cell below. 

In [2]:
prices = [1000, 1200, 1600, 2000]

In [None]:
np.mean(prices)

In [None]:
np.std(prices)
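One detail worth checking here: `np.std` uses `ddof=0` by default, which gives the population standard deviation. That is the appropriate choice because `prices` is treated as the entire population. A minimal sketch comparing the two conventions (the names `pop_sd` and `sample_sd` are illustrative, not part of the notebook):

```python
import numpy as np

prices = [1000, 1200, 1600, 2000]

pop_mean = np.mean(prices)            # population mean
pop_sd = np.std(prices)               # ddof=0 (default): divides by N
sample_sd = np.std(prices, ddof=1)    # ddof=1: divides by N - 1

print(pop_mean, pop_sd, sample_sd)
```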

In this section, we look at samples drawn from this population and at how their means are distributed.

### Example 1.

Let's start by calculating the sample means of all possible samples of size 2 from the prices list.

*Note*: We will sample with replacement. The difference between sampling with and without replacement becomes negligible when the population is large relative to the sample size.

In [None]:
samples = it.product(prices, repeat = 2)  # generates all possible samples of size 2 WITH replacement
sample_means = []

for i in samples:
    #print(i)  # uncomment to see the samples
    sample_means.append(np.mean(i))  # calculates the mean of each sample and adds it to the sample_means list

for mean in sample_means:
    print(mean)
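To see how much the choice between sampling with and without replacement matters for a population this small, the sketch below enumerates both sets of samples. It assumes ordered samples without replacement are generated with `it.permutations`; the names `means_with` and `means_without` are illustrative.

```python
import itertools as it
import numpy as np

prices = [1000, 1200, 1600, 2000]

# All ordered samples of size 2 WITH replacement (4 * 4 = 16 samples)
means_with = [np.mean(s) for s in it.product(prices, repeat=2)]

# All ordered samples of size 2 WITHOUT replacement (4 * 3 = 12 samples)
means_without = [np.mean(s) for s in it.permutations(prices, 2)]

print(len(means_with), np.mean(means_with), np.std(means_with))
print(len(means_without), np.mean(means_without), np.std(means_without))
```

Both sets of sample means average to the population mean, but with only four values in the population their spreads differ noticeably.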

### Example 2.

Create a histogram of the sample means from the previous example. Use a bin width of 150.

In [None]:
fig, ax = plt.subplots()
bin_width = 150
ax.hist(
    sample_means,
    color='r',
    edgecolor='k',
    bins = np.arange(min(sample_means), max(sample_means) + bin_width, bin_width)
)

plt.show()

### Example 3.

Determine the mean and the population standard deviation of the sample means (since we have the entire population of samples of size 2 here).

Do these values target the population mean and population standard deviation of laptop prices?

In [None]:
np.mean(sample_means)

In [None]:
np.std(sample_means)
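A quick numerical check for the question above, as a sketch: it recomputes the sample means so that it can run on its own, then compares their mean with $\mu$ and their standard deviation with both $\sigma$ and $\sigma/\sqrt{n}$ for $n = 2$.

```python
import itertools as it
import numpy as np

prices = [1000, 1200, 1600, 2000]
n = 2  # sample size

# Means of all 16 samples of size 2, drawn with replacement
sample_means = [np.mean(s) for s in it.product(prices, repeat=n)]

mu = np.mean(prices)     # population mean
sigma = np.std(prices)   # population standard deviation

print("mean of sample means:", np.mean(sample_means), "vs mu:", mu)
print("sd of sample means:  ", np.std(sample_means), "vs sigma:", sigma)
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))
```

The mean of the sample means matches $\mu$, while their standard deviation matches $\sigma/\sqrt{n}$ rather than $\sigma$.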

## Mean and Standard Error

What happens if we have a much larger population and take much larger samples, say of size 100?

In [36]:
larger_population = [rd.randint(500, 2500) for _ in range(50_000)]  # generate 50,000 random integers between 500 and 2500

In [None]:
np.mean(larger_population)

In [None]:
np.std(larger_population)

In [41]:
num_samples = 10_000
sample_means = []

for i in range(num_samples):  # create 10,000 samples
    sample_means.append(
        np.mean(rd.choices(larger_population, k = 100))  # sample 100 values (w/ replacement) from larger_population and take the mean
    )

In [None]:
np.mean(sample_means)

In [None]:
np.std(sample_means, ddof=1)  # is there much difference if we use ddof = 1 (sample standard deviation) instead?

* Does the mean of the sample means target the population mean?

* Does the sample standard deviation of the sample means target the population standard deviation?

The mean of the sample means targets $\mu$, but their standard deviation targets $\frac{\sigma}{\sqrt{n}}$, not $\sigma$:

$$
\mu_{\overline{x}} = \mu \qquad \sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}
$$

*Note:* The expression $\frac{\sigma}{\sqrt{n}}$ is called the ***standard error***.
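Where the $\sqrt{n}$ comes from can be sketched in a few lines, assuming the $n$ observations $x_1, \dots, x_n$ are drawn independently (as in sampling with replacement), each with mean $\mu$ and variance $\sigma^2$:

$$
\begin{aligned}
E[\overline{x}] &= \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \mu \\
\operatorname{Var}(\overline{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \\
\sigma_{\overline{x}} &= \sqrt{\operatorname{Var}(\overline{x})} = \frac{\sigma}{\sqrt{n}}
\end{aligned}
$$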

### Example 4.

IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. A sample of 200 participants had their IQ measured.

What are the mean and standard error of the sampling distribution of the sample mean?
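One way to work this out, as a sketch: the sample mean is centered at the population mean, and its standard error is $\sigma/\sqrt{n}$. The variable names below are illustrative.

```python
import math

mu = 100      # population mean IQ
sigma = 15    # population standard deviation
n = 200       # sample size

mean_of_sample_means = mu                # sample means are centered at mu
standard_error = sigma / math.sqrt(n)    # spread of the sample means

print(mean_of_sample_means, round(standard_error, 2))
```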

## The Central Limit Theorem

Different populations can have different distributions. For instance, some may be uniform, others may be normal, and others still could be something else.

However, the ***Central Limit Theorem*** states that, for a population with mean $\mu$ and finite standard deviation $\sigma$:

<p style="text-align: center;">
As the sample size $n$ increases,
<ul>
    <li>the distribution of sample means approaches a normal distribution with</li>
    <li>a mean of $\mu$ and</li>
    <li>a standard deviation of $\frac{\sigma}{\sqrt{n}}$, called the <strong>standard error</strong>.</li>
</ul>
</p>

In [None]:
"""
Demonstrates the Central Limit Theorem on randomly generated uniform, 
normal, and geometric probability distributions.
"""

# Uniform distribution of population
data_unif = [rd.uniform(0,1000) for _ in range(50_000)]  # generate a population of 50,000 random uniform values between 0 and 1,000

# Normal distribution of population
data_norm = [rd.gauss(100,15) for _ in range(50_000)]  # generate a population of 50,000 random normal values with mean 100 and s.d. 15

# Geometric distribution of population
data_geom = [np.random.geometric(0.75) for _ in range(50_000)]  # generate a population of 50,000 random geometric values with p = 0.75 (mean 1/0.75, about 1.33)

sample_means_unif = []
sample_size = 50
for _ in range(10_000):  # do the following 10,000 times
    sample_unif = rd.choices(data_unif,k=sample_size)  # take a sample of 50 values from the population
    sample_means_unif.append(np.mean(sample_unif))  # add the mean of the sample to the sample_means list

sample_means_norm = []
sample_size = 50
for _ in range(10_000):  # do the following 10,000 times
    sample_norm = rd.choices(data_norm,k=sample_size)  # take a sample of 50 values from the population
    sample_means_norm.append(np.mean(sample_norm))  # add the mean of the sample to the sample_means list

sample_means_geom = []
sample_size = 50
for _ in range(10_000):  # do the following 10,000 times
    sample_geom = rd.choices(data_geom,k=sample_size)  # take a sample of 50 values from the population
    sample_means_geom.append(np.mean(sample_geom))  # add the mean of the sample to the sample_means list

pop_mean_unif = np.mean(data_unif)
mean_sample_means_unif = np.mean(sample_means_unif)

fig,ax = plt.subplots(1,3, figsize = (12,4))
ax[0].hist(data_unif, bins=20, density = True, alpha = 0.5, ec ='b', label='population')
ax[0].hist(sample_means_unif, bins=20, density = True, alpha = 0.6, label='sample means')
ax[0].set_title('Uniform Population')
ax[0].legend(loc='upper right')
# plt.figtext(-0.25, -0.25, txt_unif_mean, wrap=True, horizontalalignment='center', fontsize=12)
ax[1].hist(data_norm, bins=20, density = True, alpha = 0.5, ec='b', label='population')
ax[1].hist(sample_means_norm, bins=20, density = True, alpha = 0.6, label='sample means')
ax[1].set_title('Normal Population')
ax[1].legend(loc='upper right')
ax[2].hist(data_geom, bins=20, density = True, alpha = 0.5, ec='b', label='population')
ax[2].hist(sample_means_geom, bins=20, density = True, alpha = 0.6, label='sample means')
ax[2].set_title('Geometric Population')
ax[2].legend(loc='upper right')
plt.show()
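The histograms can be complemented with a numerical check. The sketch below is not part of the original demonstration: it assumes a right-skewed exponential population drawn with NumPy's `default_rng` (seeded for reproducibility) and compares the mean and standard deviation of the sample means with $\mu$ and $\sigma/\sqrt{n}$ for a few sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for reproducibility)

# Right-skewed population: 50,000 exponential values with mean 5
population = rng.exponential(scale=5, size=50_000)
mu, sigma = population.mean(), population.std()

results = {}
for n in (5, 50, 200):
    # 10,000 samples of size n, drawn with replacement; one sample per row
    samples = rng.choice(population, size=(10_000, n), replace=True)
    means = samples.mean(axis=1)
    # (mean of sample means, sd of sample means, theoretical standard error)
    results[n] = (means.mean(), means.std(), sigma / np.sqrt(n))
    print(n, results[n])
```

For each sample size, the first value should sit close to $\mu$ and the second close to the third, even though the population itself is far from normal.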

### Example 5. 

Refer back to IQ scores, which are normally distributed with a mean of 100 and a standard deviation of 15.

(a) What is the probability that an individual has an IQ greater than 102?

(b) What is the probability that in a sample of 50 people, the mean IQ is greater than 102?

(c) What is the probability that in a sample of 1000 people, the mean IQ is greater than 102?
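A possible way to compute these probabilities, sketched with `statistics.NormalDist` from the Python standard library (a choice made here; the notebook does not prescribe a tool). The helper `prob_mean_greater` is hypothetical. For a sample of size $n$, the sample mean is treated as normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, and $n = 1$ gives the individual case.

```python
import math
from statistics import NormalDist

mu, sigma = 100, 15  # population mean and standard deviation of IQ scores

def prob_mean_greater(threshold, n):
    """P(sample mean > threshold) for samples of size n (hypothetical helper)."""
    standard_error = sigma / math.sqrt(n)
    return 1 - NormalDist(mu, standard_error).cdf(threshold)

p_a = prob_mean_greater(102, 1)      # (a) a single individual
p_b = prob_mean_greater(102, 50)     # (b) mean of 50 people
p_c = prob_mean_greater(102, 1000)   # (c) mean of 1000 people

print(p_a, p_b, p_c)
```

The probability shrinks as $n$ grows because the standard error shrinks, so a sample mean 2 points above $\mu$ becomes increasingly unlikely.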