# <a name="Intro">Sampling Distributions</a>

------------------------------------------------------------------------

-   A <font color="dodgerblue">**sampling distribution**</font> is the distribution of sample statistics (such as a mean, proportion, median, maximum, etc.) computed for **different samples of the same size from the same population**. A sampling distribution shows us how the sample statistic varies from sample to sample.

-   The problems below compare the sampling distributions for means from three different distributions.
  -   Let $X$ denote the distribution of Body Mass Index (BMI) of all adult men.
  -   Let $Y$ denote all times (in minutes) that people wait before their train arrives at a certain stop.
  -   Let $Z$ denote the depth (in km) of all earthquakes that have occurred near Fiji since 1964.



## <a name="quest1">Question 1</a>
---

1. Do you believe the distribution for $X$ will be approximately symmetric, skewed left, or skewed right? Explain.

2. Do you believe the distribution for $Y$ will be approximately symmetric, skewed left, or skewed right? Explain.

3. Do you believe the distribution for $Z$ will be approximately symmetric, skewed left, or skewed right? Explain.

### <a name="sol1">Solution to Question 1</a>
---

1.

<br>

2.

<br>

3.

## <a name="quest2">Question 2: Plotting Population Data for BMI</a>

----
Let $X$ denote the distribution of BMI of all adult men. We can
approximate this distribution by $X \sim N(26, 4)$.

- Interpret the code below. 
- Add comments to explain what each command will do. 
- Then run the code cell.

### <a name="sol2">Solution to Question 2</a>

----

Enter comments in the code cell below.


In [None]:
# Create a vector named bmi of 100 evenly spaced BMI values
# between x=10 and x=42
bmi <- seq(26-4*4, 26+4*4, length=100)

# Add your comment here
pdf.bmi <- dnorm(bmi, 26, 4)

# Add your comment here
plot(bmi, pdf.bmi,
     type="l", lty=1, # type="l" draws a line; lty=1 makes it solid
     xlab="Body Mass Index (BMI)",
     ylab="Density", main="Distribution of Population")

## <a name="quest3">Question 3: Picking One Random Sample</a>

---
-   We can pick a random sample of size $n$ from a normal distribution $N(\mu, \sigma)$ using `rnorm(n, mean, sd)`.
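As a small illustration of the call signature, the sketch below draws from a hypothetical population $N(100, 15)$ that is not one of the three populations in this worksheet. The name `demo.sample` and the seed are assumptions chosen only for this example. Note that the third argument of `rnorm()` is the standard deviation, not the variance.

```r
# Illustrative only: N(100, 15) is a made-up population for this sketch
set.seed(2024)  # arbitrary seed so the draw can be reproduced

# Draw 5 values with mean 100 and standard deviation 15
demo.sample <- rnorm(5, mean = 100, sd = 15)
demo.sample          # print the 5 sampled values
length(demo.sample)  # number of values drawn
```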

Replace the question mark in the code below to randomly select 4
individual BMI values from the population $X \sim N(26,4)$.

### <a name="sol3">Solution to Question 3</a>

----

Edit and run the code cell below.

In [None]:
#Randomly picks 4 values from N(26,4)
my.sample <- ? 
my.sample  # print your sample to the screen

## <a name="quest4">Question 4: Comparing Statistics and Parameters</a>
----

Calculate the mean and standard deviation of your sample using the code
below. Then:

-   Discuss how the statistics of your sample compare to the population parameters $\mu = 26$ and $\sigma =4$.
-   Discuss how the statistics of your sample compare to the statistics that others in class obtain from their own samples.

In [None]:
  # enter a command to compute the mean of my.sample

  # enter a command to compute the st. dev. of my.sample
  

### <a name="sol4">Solution to Question 4</a>
----
  
After completing and running the code cell above, answer the questions.
  
<br> <br>  



# <a name="plot-bmi-samp">Plotting a Sampling Distribution with $n=4$</a>

----
A sample of $n=4$ adult men is randomly selected. The mean BMI of the
sample is calculated using the formula for a sample mean: 

$$\bar{x} = \frac{x_1 + x_2 + x_3+x_4}{4}.$$

- Then another random sample of $n=4$ adult men is selected.
- The mean of the second sample is calculated.
- Repeat this (pick random sample and calculate mean) many, many times (for example 1,000 times).

The <font color="dodgerblue">sampling distribution for the mean</font> is the distribution of all sample means obtained by repeatedly picking random samples each of size $n$.

- It is important to note that **each sample must have the same size**, $n$.
- The sampling distribution for the mean is the distribution of the sample means from all possible random samples of size $n$.
  - It is extremely difficult and time intensive to write and run code that generates every possible random sample once, without any duplicates.
  - In practice, we generate many (say 1,000) such random samples to approximate the sampling distribution.
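For a very small finite population, listing every possible sample is feasible, which makes the idea concrete. The sketch below is only an illustration under stated assumptions: the six-value population `toy.pop` is made up, samples have size 2, and samples are drawn without replacement, so the setting differs from sampling a normal population as in this worksheet.

```r
# Made-up population of six values (an assumption for illustration)
toy.pop <- c(2, 4, 6, 8, 10, 12)

# combn() enumerates every subset of size 2 and applies mean() to each
all.means <- combn(toy.pop, 2, FUN = mean)

length(all.means)  # number of possible samples of size 2
mean(all.means)    # center of the exact sampling distribution
mean(toy.pop)      # population mean, for comparison
```

With six values there are only `choose(6, 2)` possible samples of size 2. For realistic populations the count grows far too quickly to enumerate, which is why simulation is used instead.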

A sampling distribution for the mean BMI for $n=4$ can be constructed with the code below.

In [None]:
# creates an empty vector to store results
n4.bmi.bar <- numeric(1000) 

# A for loop that generates 1000 random samples,
# each of size n=4, and calculates each sample mean.
for (i in 1:1000)
{
  n4.bmi.sample <- rnorm(4, 26, 4) #Randomly picks 4 values from N(26,4)
  n4.bmi.bar[i] <- mean(n4.bmi.sample)
}

# Plot the sampling distribution
hist(n4.bmi.bar, xlim = c(14, 38), 
     xlab = "Mean BMI of Sample",
     main = "Sampling Distribution of Mean BMI for n=4",
     xaxt='n')
axis(1, at=seq(14, 38, 4), pos=0)
abline(v = 26, col = "red", lwd = 2, lty = 2)

## <a name="quest5">Question 5: Center and Spread of the Sampling Distribution</a>
----

In the code cell below, enter commands to compute the center (as
measured by the mean) and spread (as measured by the standard deviation)
of the sampling distribution created in the previous code cell when $n=4$.

Then comment on how these values compare to the population parameters $\mu=26$ and $\sigma =4$.

- Note sample means from the previous code cell are stored in the vector `n4.bmi.bar`.


In [None]:
  # enter a command to compute the mean of sampling dist

  # enter a command to compute the st. dev. of sampling dist
  

### <a name="sol5">Solution to Question 5</a>
----

After completing and running the code cell above, comment on how the output compares to the mean and standard deviation of the population $X$.
  
<br> <br>  


# <a name="shape">Shape of the Sampling Distribution</a>
----

We can use a <font color="dodgerblue">**Quantile-Quantile**</font> plot (also called a <font color="dodgerblue">**qq-plot**</font>) to compare
the shape of our sampling distribution to the standard normal
distribution $N(0,1)$.

-   Run the code cell below to generate a qq-plot for the sampling distribution for the mean.
  -   The closer the points are to the line, the more normal the distribution.
  -   The plot below seems mostly normal in the middle, but the tails deviate slightly from those of a normal distribution.

In [None]:
qqnorm(n4.bmi.bar)
qqline(n4.bmi.bar)

## <a name="shiny-bmi">Question 6: Center, Shape, and Spread of the BMI Sampling Distribution</a>

------------------------------------------------------------------------

Open the app <https://adamspiegler.shinyapps.io/clt_bmi/> to experiment with changing the sample size $n$ used when constructing a sampling distribution for the mean BMI. In particular, explore the following properties of the sampling distribution for the mean BMI:

- What is the shape of the sampling distribution?
- What is the mean of the sampling distribution?
- What is the standard deviation of the sampling distribution?
- How does the population change when generating different sampling distributions?

Fill in values that describe these properties of the sampling distribution for $n=4$, $n=9$, $n=16$, and $n=81$.

### <a name="sol6">Solution to Question 6</a>

----

<br>

| Property | Population   | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|-----|-----|-----|-----|-----|-----|
| Shape | Normal |   |   |   |   |
| Mean  | 26 |   |   |   |   |
| Standard Deviation | 4 |   |   |   |   |

<br> <br>


## <a name="quest7">Question 7: Sampling Distribution for Mean Wait Time</a>
----

Open the app <https://adamspiegler.shinyapps.io/clt_wait/> to experiment with changing the sample size $n$ used when constructing a sampling distribution for the mean wait time $\mu_Y$ between successive trains at a certain train stop using the distribution $Y \sim \mbox{Exp} \left( \frac{1}{40} \right)$. Based on your observations, complete the table below for $n=4$, $n=9$, $n=16$, and $n=81$.
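If the app is unavailable, a similar experiment can be run locally. The sketch below is not the app's code; the helper `sim.wait.means`, the 1,000 repetitions, and the seed are assumptions made for illustration. It uses `rexp()` with `rate = 1/40` to match $Y \sim \mbox{Exp} \left( \frac{1}{40} \right)$.

```r
# Hypothetical helper (not the app's code): draw `reps` random samples of
# size n from Exp(rate) and return the mean of each sample
sim.wait.means <- function(n, reps = 1000, rate = 1/40) {
  replicate(reps, mean(rexp(n, rate = rate)))
}

set.seed(1)  # arbitrary seed, only for reproducibility
n9.wait.bar <- sim.wait.means(9)

mean(n9.wait.bar)  # center of the simulated sampling distribution
sd(n9.wait.bar)    # spread of the simulated sampling distribution
hist(n9.wait.bar,
     xlab = "Mean Wait Time of Sample (in minutes)",
     main = "Simulated Sampling Distribution of Mean Wait Time for n=9")
```

Changing the argument of `sim.wait.means()` to 4, 16, or 81 gives the other sample sizes.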



### <a name="sol7">Solution to Question 7</a>
----

<br>

| Property | Population   | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|-----|-----|-----|-----|-----|-----|
| Shape | Skewed Right |   |   |   |   |
| Mean  | 40 |   |   |   |   |
| Standard Deviation | 40 |   |   |   |   |

<br> <br>

# <a name="quake">Working with Empirical Data: Earthquake Depth</a>

----

The help documentation for the `quakes` data set, which ships with R's built-in `datasets` package, provides the following summary:

> "The data set give the locations of 1000 seismic events of MB $>$ 4.0.  The events occurred in a cube near Fiji since 1964."

The data set contains the five variables listed below.

-   `lat`: Latitude of event
-   `long`: Longitude
-   `depth`: Depth (km)
-   `mag`: Richter Magnitude
-   `stations`: Number of stations reporting

Run the code cell below to load the `dplyr` package before we work with the `quakes` data set.

In [None]:
library(dplyr)


## <a name="quake-summary">Numerical Summary of Quakes Data</a>
----

Run the `summary()` function below to get numerical summaries for each variable in the data set `quakes`. 

Then run the `mean()` and `sd()` commands to summarize the mean and standard deviation of the population data in `quakes$depth`.

In [None]:
# quakes is a built-in R data set (datasets package)
summary(quakes)

In [None]:
mean(quakes$depth)
sd(quakes$depth)

## <a name="quake-graph">Graphical Summary of Depth of Quakes</a>
----

Before we construct a sampling distribution for the mean depth of the earthquake data in `quakes`, run the `plot()` function in the code cell below to create a density plot of the depths and get a sense of the population we will be sampling from.

In [None]:
plot(density(quakes$depth),
     xlab = "Depth (in km)",
     main = "Depths of All Earthquakes in Fiji Since 1964",
     xaxt='n')
axis(1, at=seq(-100, 800, 100), pos=0)
abline(v = mean(quakes$depth), col = "red", lwd = 2, lty = 2)

## <a name="quest8">Question 8: Center, Shape, and Spread of the Earthquake Depth Sampling Distribution</a>
----

Open the app <https://adamspiegler.shinyapps.io/clt_quake/> to experiment with changing the sample size $n$ used when constructing a sampling distribution for the mean earthquake depth $\mu_Z$ using the data in `quakes$depth`. Based on your observations, complete the table below for $n=4$, $n=9$, $n=16$, and $n=81$.
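A comparable experiment can be sketched locally by resampling the recorded depths. The code below is not the app's code; the helper `sim.depth.means`, the 1,000 repetitions, the seed, and the choice to sample without replacement (the default of `sample()`) are assumptions made for illustration.

```r
# Hypothetical helper (not the app's code): draw `reps` random samples of
# size n from the recorded depths and return the mean of each sample
sim.depth.means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(quakes$depth, size = n)))
}

set.seed(1)  # arbitrary seed, only for reproducibility
n16.depth.bar <- sim.depth.means(16)

mean(n16.depth.bar)    # center of the simulated sampling distribution
sd(n16.depth.bar)      # spread of the simulated sampling distribution
qqnorm(n16.depth.bar)  # compare the shape to a normal distribution
qqline(n16.depth.bar)
```

Changing the argument of `sim.depth.means()` to 4, 9, or 81 gives the other sample sizes.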


### <a name="sol8">Solution to Question 8</a>
----

<br>

| Property | Population   | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|-----|-----|-----|-----|-----|-----|
| Shape | Bimodal and Symmetric |   |   |   |   |
| Mean  | 311.4 |   |   |   |   |
| Standard Deviation | 215.5 |   |   |   |   |

<br> <br>

# <a name="notation">Notation for Population, Sample, and Sampling Distribution</a>
----

When describing the <font color="dodgerblue">**mean**</font> of a distribution we use the notation:

-   Population mean: $\mu_X$
-   Sample mean: $\overline{x}$
-   Center of the sampling distribution for a mean: $\mu_{\overline{X}}$

When describing the <font color="dodgerblue">**standard deviation**</font> of a distribution we use the
notation:

-   Population standard deviation: $\sigma_X$

-   Sample standard deviation: $s_X$

-   Spread of the sampling distribution is called the <font color="dodgerblue">**Standard Error**</font>.
  -   The standard error measures the variability in sample statistics due to randomness.
  -   We use the notation $\mbox{SE}(\overline{X}) = \sigma_{\overline{X}}$.


## <a name="quest9">Question 9: Shape of Sampling Distributions for a Mean</a>
----

For each of the three sampling distributions we examined, summarize how the <font color="dodgerblue">**shape**</font> of the sampling distribution for the mean changes as the size of the samples, $n$, increases.


### <a name="sol9">Solution to Question 9</a>
----

<br> <br> <br>
  
  



## <a name="quest10">Question 10: Center of Sampling Distributions for a Mean</a>
----


For each of the three sampling distributions we examined, summarize how the <font color="dodgerblue">**mean**</font> (center) of the sampling distribution changes as the size of the samples, $n$, increases.


### <a name="sol10">Solution to Question 10</a>
----

<br> <br> <br>



## <a name="quest11">Question 11: Spread of Sampling Distributions for a Mean</a>
----

For each of the three sampling distributions we examined, summarize how the <font color="dodgerblue">**standard error**</font> of the sampling distribution changes as the size of the samples, $n$, increases.

### <a name="sol11">Solution to Question 11</a>

----

<br> <br> <br>
  



# <a name="CLT">Formal Statement of the Central Limit Theorem (for Sample Means)</a>

----

Let $X_1, X_2, \ldots , X_n$ be independent, identically distributed
(iid) random variables from a population with mean $\mu_X$ and standard
deviation $\sigma_X$. Then, as long as $n$ is large enough
(informally $\mathbf{n \geq 30}$), the sampling distribution for the
mean, $\overline{X}$, will:

-   Be (approximately) <font color="dodgerblue">**normally distributed**</font>.
-   Have mean equal to the mean of the population, $\color{dodgerblue}{\mu_{\overline{X}}=\mu_X}$.  
-   Have standard error $\color{dodgerblue}{\mbox{SE}(\overline{X}) = \sigma_{\overline{X}} = \frac{\sigma_X}{\sqrt{n}}}$.

The three statements above become more accurate as $n$ gets larger. We summarize the results more concisely below.

$${\large \color{dodgerblue}{\overline{X} \sim N \left( \mu_{\overline{X}} , \sigma_{\overline{X}} \right) = N \left( \mu_X  , \frac{\sigma_X}{\sqrt{n}} \right)}}$$
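The standard error formula can be checked numerically. The sketch below reuses the BMI population $N(26, 4)$ from earlier in this worksheet; the names `n.values`, `se.simulated`, and `se.theory`, the 1,000 repetitions, and the seed are choices made for this illustration.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
sigma <- 4                   # population standard deviation
n.values <- c(4, 9, 16, 81)  # sample sizes to compare

# For each n, simulate 1000 sample means and record their standard deviation
se.simulated <- sapply(n.values, function(n) {
  sd(replicate(1000, mean(rnorm(n, mean = 26, sd = sigma))))
})

# Standard error predicted by the Central Limit Theorem
se.theory <- sigma / sqrt(n.values)

data.frame(n = n.values, se.simulated, se.theory)
```

The two columns should be close but not identical, because the simulated values depend on the random samples drawn.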

## <a name="quest12">Question 12</a>
----

Using properties of the expected value of linear combinations of random variables prove the following:

$$\mu_{\overline{X}} = E \left( \overline{X} \right) = E \left( \frac{X_1 + X_2 + \ldots + X_n}{n} \right) = \mu_X.$$

### <a name="sol12">Solution to Question 12</a>
----

<br> <br> <br>




## <a name="quest13">Question 13</a>
----

Using properties of the variance of linear combinations of independent random variables prove the following:


$$ \sigma^2_{\overline{X}} = \mbox{Var} \left( \overline{X} \right) = \mbox{Var} \left( \frac{X_1 + X_2 + \ldots + X_n}{n} \right) = \frac{\sigma^2_X}{n}.$$

### <a name="sol13">Solution to Question 13</a>
---

<br> <br> <br>