## Importing Packages

------------------------------------------------------------------------

In [None]:
library(dplyr)

# Shiny Apps

------------------------------------------------------------------------

[`Shiny`](https://shiny.rstudio.com) is an R package for building
interactive apps and widgets. If you have not installed the `shiny`
package before, first run the command `install.packages("shiny")` in the
console.

-   See <https://shiny.rstudio.com/tutorial/>.

In [None]:
library(shiny) # loads shiny package

# Sampling Distributions

------------------------------------------------------------------------

-   A <span style="color: blue;">**sampling distribution**</span> is the
    distribution of sample statistics (such as a mean, proportion,
    median, maximum, etc.) computed for **different samples of the same
    size from the same population**. A sampling distribution shows us
    how the sample statistic varies from sample to sample.

-   The problems below compare the sampling distributions for means from
    three different distributions.

    -   Let $X$ denote the distribution of Body Mass Index (BMI) of all
        adult men.
    -   Let $Y$ denote all times (in minutes) that people wait before
        their train arrives at a certain stop.
    -   Let $Z$ denote the depth (in km) of all earthquakes that have
        occurred near Fiji since 1964.
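
To make the definition of a sampling distribution concrete, here is a minimal sketch that builds one for a sample *median*. The population $N(50, 10)$, the sample size 5, the seed, and the name `sample.medians` are all illustrative assumptions, not part of the problems below.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Draw 1000 samples of size 5 from a hypothetical N(50, 10) population
# and keep the median of each sample
sample.medians <- replicate(1000, median(rnorm(5, mean = 50, sd = 10)))

# The 1000 medians differ from sample to sample; their histogram
# approximates the sampling distribution of the median for n = 5
hist(sample.medians,
     xlab = "Median of Sample",
     main = "Sampling Distribution of the Median for n=5")
```

Swapping `median` for `max` or `mean` would give the sampling distribution of a different statistic from the same population.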

## Question 1: Plotting a Normal Population

------------------------------------------------------------------------

Let $X$ denote the distribution of BMI of all adult men. We can
approximate this distribution by $X \sim N(26, 4)$. Interpret the code
below. Add comments to explain what each command will do. Then run the
code.

In [None]:
# Create a vector named bmi of 100 evenly spaced BMI values
# between x=10 and x=42
bmi <- seq(26-4*4, 26+4*4, length=100)

# Add your comment here
pdf.bmi <- dnorm(bmi, 26, 4)

# Add your comment here
plot(bmi, pdf.bmi, 
     type="l", lty=1, # type="l" draws a line; lty=1 makes it solid
     xlab="Body Mass Index (BMI)",
     ylab="Density", main="Distribution of Population")

### Solution to Question 1

------------------------------------------------------------------------

Enter comments in code cell above.

  
  

## Question 2: Picking One Random Sample

------------------------------------------------------------------------

-   We can pick a random sample of size $n$ from a normal distribution
    $N(\mu, \sigma)$ using `rnorm(n, mean, sd)`.
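
As a quick illustration of the `rnorm(n, mean, sd)` call pattern, the sketch below draws from a different, purely hypothetical population (mean 100, standard deviation 15). The name `demo.sample` and the seed are assumptions made for this example only.

```r
set.seed(2)  # arbitrary seed so the draw is reproducible

# Five random values from a hypothetical N(100, 15) population
# (illustrative parameters, not the BMI population)
demo.sample <- rnorm(5, mean = 100, sd = 15)
demo.sample
```

The first argument is always the number of values to draw; the second and third set the center and spread of the population.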

Replace the `??` in the code below to randomly select 4
individual BMI values from the population $X \sim N(26,4)$.

In [None]:
#my.sample <- ?? #Randomly picks 4 values from N(26,4)
#my.sample

### Solution to Question 2

------------------------------------------------------------------------

-   Delete the `#` symbol at the start of each line.
-   Do not delete the `#` symbol in the comment
    `#Randomly picks 4 values from N(26,4)`
-   Replace the `??` with an appropriate R Command.

  
  

## Question 3: Comparing Statistics and Parameters

------------------------------------------------------------------------

Calculate the mean and standard deviation of your sample using the code
below. Then:

-   Discuss how the statistics of your sample will compare to the
    population parameters $\mu = 26$ and $\sigma =4$.
-   Discuss how the statistics of your sample will compare to the
    statistics that others in class obtain with their own samples.

In [None]:
  # enter a command to compute the mean of my.sample
  # enter a command to compute the st. dev. of my.sample

### Solution to Question 3

------------------------------------------------------------------------

  
  
  

## Plotting a Sampling Distribution with $n=4$

------------------------------------------------------------------------

A sample of $n=4$ adult men is randomly selected. The mean BMI of the
sample is calculated: $$ \bar{x} = \frac{x_1 + x_2 + x_3+x_4}{4}.$$

Then another random sample of $n=4$ adult men is selected, and
again the mean BMI of this sample is computed. This is repeated 1000
times (each sample of size $n=4$), and the sampling distribution for the
mean BMI can be constructed with the code below:

In [None]:
# creates an empty vector to store results
n4.bmi.bar <- numeric(1000) 

# A for loop that generates 1000 random samples,
# each of size n=4, and calculates the mean of each sample.
for (i in 1:1000)
{
  n4.bmi.sample <- rnorm(4, 26, 4) #Randomly picks 4 values from N(26,4)
  n4.bmi.bar[i] <- mean(n4.bmi.sample)
}

# Plot the sampling distribution
hist(n4.bmi.bar, xlim = c(14, 38), 
     xlab = "Mean BMI of Sample",
     main = "Sampling Distribution of Mean BMI for n=4",
     xaxt='n')
axis(1, at=seq(14, 38, 4), pos=0)
abline(v = 26, col = "red", lwd = 2, lty = 2)
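
The loop above can also be written more compactly with `replicate()`, which repeats an expression and collects the results in a vector. This is only an alternative sketch of the same simulation; the name `n4.bmi.bar.alt` and the seed are assumptions introduced here.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Repeat 1000 times: draw 4 values from N(26, 4) and take their mean
n4.bmi.bar.alt <- replicate(1000, mean(rnorm(4, 26, 4)))

# Same plot as before, built from the replicate() results
hist(n4.bmi.bar.alt, xlim = c(14, 38),
     xlab = "Mean BMI of Sample",
     main = "Sampling Distribution of Mean BMI for n=4")
abline(v = 26, col = "red", lwd = 2, lty = 2)
```

Both versions produce a vector of 1000 sample means; `replicate()` simply avoids creating the empty vector and indexing into it by hand.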

## Question 4: Center and Spread of the Sampling Distribution

------------------------------------------------------------------------

In the R code block below, enter commands to compute the center (as
measured by the mean) and spread (as measured by the standard deviation)
of the sampling distribution when $n=4$.

-   Then comment on how these values compare to the population
    parameters $\mu=26$ and $\sigma =4$.

In [None]:
  # enter a command to compute the mean of the sampling dist
  # enter a command to compute the st. dev. of the sampling dist

### Solution to Question 4

------------------------------------------------------------------------

  
  
  

## Shape of the Sampling Distribution

------------------------------------------------------------------------

We can use a **Quantile-Quantile** plot (called a **qqplot**) to compare
the shape of our sampling distribution to the standard normal
distribution $N(0,1)$.

-   The closer the points are to the line, the more normal the
    distribution.
-   The plot below seems mostly normal in the middle, but the tails are
    slightly deviating from the tails of a normal distribution.

In [None]:
qqnorm(n4.bmi.bar)
qqline(n4.bmi.bar)
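
To help calibrate what "close to the line" looks like, the sketch below compares QQ plots for data that are normal and data that are strongly right-skewed. The simulated vectors, the seed, and the helper `qq.cor` are hypothetical names introduced for this illustration; the correlation is only a rough numeric companion to the plot, not a formal test of normality.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

normal.draws <- rnorm(1000)            # symmetric, bell-shaped data
skewed.draws <- rexp(1000, rate = 1)   # strongly right-skewed data

par(mfrow = c(1, 2))  # show two plots side by side
qqnorm(normal.draws, main = "Normal data")
qqline(normal.draws)
qqnorm(skewed.draws, main = "Right-skewed data")
qqline(skewed.draws)
par(mfrow = c(1, 1))  # restore the default layout

# Correlation between theoretical and sample quantiles:
# values nearer 1 mean the points hug a straight line more closely
qq.cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}
qq.cor(normal.draws)
qq.cor(skewed.draws)
```

In the skewed panel the points bend away from the line, most visibly in the tails, which is the pattern to watch for in the apps that follow.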

## Question 5: Center, Shape, and Spread BMI Sampling Distribution

------------------------------------------------------------------------

Run the app below to investigate properties of the sampling distribution
for the mean BMI, $\mu_X$, using the distribution $X \sim N(26,4)$. Based
on your observations, complete the table below.

### Solution to Question 5 for BMI

------------------------------------------------------------------------

In [None]:
# Be sure you have loaded the shiny package
# This command is in a previous code cell

shinyApp(

ui = fluidPage(  # Define UI for random distribution app
  titlePanel("Sampling Distribution for the Mean (Normal)"),  # App title
  
  sidebarLayout(
      sidebarPanel(
      
      # Input: Slider for the sample size
      sliderInput(inputId = "sampsize",
                  label = "Sample Size:",
                  min = 1,
                  max = 100,
                  value = 4)
    ),
    
    mainPanel(
      
      # Output: Tabset w/ pop, samp dist, summary, and qqplot
      tabsetPanel(type = "tabs",
                  tabPanel("Population", plotOutput("popplot")),
                  tabPanel("Sampling Dist Plot", plotOutput("sampdistplot")),
                  tabPanel("Summary", textOutput("mean"), textOutput("se")),
                  tabPanel("Shape", plotOutput("qq"))
      )
      
    )
  )
),

server = function(input, output) {
  
  boot.stuff <- reactive({
    boot.bmi <- numeric(1000)
    for (i in 1:1000){
      bmi.sample <- rnorm(input$sampsize, 26, 4)
      boot.bmi[i] <- mean(bmi.sample)
    }
    return(boot.bmi)
  })

  output$popplot <- renderPlot({
    
    bmi <- seq(26-4*4, 26+4*4, length=100)
    pdf.bmi <- dnorm(bmi, 26, 4)
    plot(bmi, pdf.bmi,
         type="l", lty=1,  # type="l" draws line lty=1 is solid line
         xlab="Body Mass Index (BMI)",
         ylab="Density", main="Distribution of BMI for Population")
  })
  
  
  output$sampdistplot <- renderPlot({
  
    hist(boot.stuff(), xlim = c(14, 38), 
         xlab = "Sample Mean",
         main = paste("Sampling Distribution of Mean BMI for n = ",
                      input$sampsize,
                      sep = ""),
         xaxt='n')
    axis(1, at=seq(14, 38, 4), pos=0)
    abline(v = 26, col = "red", lwd = 2, lty = 2)
  })
    
  output$mean <- renderText({
    paste("The mean of the sampling dist is ",
          round(mean(boot.stuff()), 2),
          sep = "")
  })
  output$se <- renderText({
    paste("The standard error of the sampling dist is ",
          round(sd(boot.stuff()), 4),
          sep = "")
  })
  
  output$qq <- renderPlot({
    qqnorm(boot.stuff())
    qqline(boot.stuff())
      })
},

options = list(height = 500)
)

In [None]:
#runApp("clt_bmi") # be sure you've loaded shiny package and setwd

| Property           | Population | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|--------------------|------------|-------|-------|--------|--------|
| Shape              | Normal     |       |       |        |        |
| Mean               | 26         |       |       |        |        |
| Standard Deviation | 4          |       |       |        |        |

## Question 6: Center, Shape, and Spread Wait Time Sampling Distribution

------------------------------------------------------------------------

Run the app below to investigate properties of the sampling distribution
for the mean wait time, $\mu_Y$, between successive trains at a certain
train stop using the distribution
$Y \sim \mbox{Exp} \left( \frac{1}{40} \right)$. Based on your
observations, complete the table below.
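
For an exponential distribution with rate $\lambda$, both the mean and the standard deviation equal $1/\lambda$. The sketch below checks this numerically for $\lambda = 1/40$ with a large simulated sample; the names `rate` and `wait.draws` and the seed are assumptions made for this example.

```r
set.seed(5)  # arbitrary seed so the sketch is reproducible

rate <- 1/40                       # rate parameter of Exp(1/40)
wait.draws <- rexp(100000, rate)   # large sample of simulated wait times

mean(wait.draws)  # should be close to 1/rate = 40
sd(wait.draws)    # should also be close to 1/rate = 40
```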

### Solution to Question 6 for Wait Times

------------------------------------------------------------------------

In [None]:
# Be sure you have loaded the shiny package
# This command is in a previous code cell

shinyApp(
  
ui = fluidPage(  # Define UI for random distribution app
  titlePanel("Sampling Distribution for the Mean (Skewed Right)"),  # App title
  
  sidebarLayout(
      sidebarPanel(
      
      # Input: Slider for the sample size
      sliderInput(inputId = "sampsize",
                  label = "Sample Size:",
                  min = 1,
                  max = 100,
                  value = 4)
    ),
    
    mainPanel(
      
      # Output: Tabset w/ pop, samp dist, summary, and qqplot
      tabsetPanel(type = "tabs",
                  tabPanel("Population", plotOutput("popplot")),
                  tabPanel("Sampling Dist Plot", plotOutput("sampdistplot")),
                  tabPanel("Summary", textOutput("mean"), textOutput("se")),
                  tabPanel("Shape", plotOutput("qq"))
      )
      
    )
  )
),

server = function(input, output) {
  
  boot.stuff <- reactive({
    boot.wait <- numeric(1000)
    for (i in 1:1000){
      wait.sample <- rexp(input$sampsize, 1/40)
      boot.wait[i] <- mean(wait.sample)
    }
    return(boot.wait)
  })

  output$popplot <- renderPlot({
    # load possible wait times.
    wait.time <- seq(0, 100, length=200)
    
    # Compute the value of f(x) of each wait time x if we assume
    # the times are exponentially distributed with mean 40 min.
    pdf.wait.time <- dexp(wait.time, 1/40)
    
    # Plot wait time on x-axis and the value of pdf, f(x) on y-axis.
    plot(wait.time, pdf.wait.time, 
         type="l", lty=1,         # type="l" draws line lty=1 is solid line
         xlab="Wait time (in minutes)",
         ylab="Density", main="Distribution of All Train Wait Times")
  })
  
  
  output$sampdistplot <- renderPlot({
  
    hist(boot.stuff(), xlim = c(0, 100), 
         xlab = "Sample Mean",
         main = paste("Sampling Distribution of Mean Wait Time for n = ",
                      input$sampsize,
                      sep = ""),
         xaxt='n')
    axis(1, at=seq(0, 100, 10), pos=0)
    abline(v = 40, col = "red", lwd = 2, lty = 2)
  })
    
  output$mean <- renderText({
    paste("The mean of the sampling dist is ",
          round(mean(boot.stuff()), 2),
          sep = "")
  })
  output$se <- renderText({
    paste("The standard error of the sampling dist is ",
          round(sd(boot.stuff()), 4),
          sep = "")
  })
  
  output$qq <- renderPlot({
    qqnorm(boot.stuff())
    qqline(boot.stuff())
      })
},

options = list(height = 500)
)

| Property           | Population   | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|--------------------|--------------|-------|-------|--------|--------|
| Shape              | Skewed Right |       |       |        |        |
| Mean               | 40           |       |       |        |        |
| Standard Deviation | 40           |       |       |        |        |

## Working with Empirical Data: Earthquake Depth

------------------------------------------------------------------------

The dataset `quakes`, from the `datasets` package that ships with base
R, is described on its help page as follows:

> give the locations of 1000 seismic events of MB \> 4.0. The events
> occurred in a cube near Fiji since 1964.

-   `lat`: Latitude of event
-   `long`: Longitude
-   `depth`: Depth (km)
-   `mag`: Richter Magnitude
-   `stations`: Number of stations reporting
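
A short sketch for confirming where `quakes` comes from and how it is laid out. It uses only base R functions; nothing beyond a default R session is assumed.

```r
# Report which attached package provides the quakes object
find("quakes")

# Number of rows (seismic events) and columns (variables)
dim(quakes)

# Variable names, matching the list above
names(quakes)
```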

### Numerical Summary of Quakes Data

------------------------------------------------------------------------

In [None]:
# quakes is in the datasets package, which R attaches
# by default, so no extra library() call is needed
summary(quakes)

### Graphical Summary of Depth of Quakes

------------------------------------------------------------------------

In [None]:
plot(density(quakes$depth), 
     xlab = "Depth (in km)",
     main = "Depths of All Earthquakes in Fiji Since 1964",
     xaxt='n')
axis(1, at=seq(-100, 800, 100), pos=0)
abline(v = mean(quakes$depth), col = "red", lwd = 2, lty = 2)

## Question 7: Center, Shape, and Spread Earthquake Depth Sampling Distribution

------------------------------------------------------------------------

Run the app below to investigate properties of the sampling distribution
for the mean depth of earthquakes near Fiji since 1964, $\mu_Z$, using
the observed data in `quakes`. Based on your observations, complete the
table below.
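
With observed data there is no `rnorm`-style generator, so a random sample is drawn directly from the recorded values with `sample()`. Here is a minimal sketch of one such draw; the name `depth.sample`, the sample size 4, and the seed are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed so the draw is reproducible

# Pick 4 of the recorded depths at random, without reusing any event
depth.sample <- sample(quakes$depth, 4, replace = FALSE)
depth.sample

# The statistic of interest for this one sample
mean(depth.sample)
```

Repeating this draw many times and collecting the means is how a sampling distribution can be approximated from the data.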

### Solution to Question 7 for Earthquake Depth

------------------------------------------------------------------------

In [None]:
# Be sure you have loaded the shiny package
# This command is in a previous code cell

shinyApp(

ui = fluidPage(  # Define UI for random distribution app
  titlePanel("Sampling Distribution for the Mean (Bimodal)"),  # App title
  
  sidebarLayout(
      sidebarPanel(
      
      # Input: Slider for the sample size
      sliderInput(inputId = "sampsize",
                  label = "Sample Size:",
                  min = 1,
                  max = 100,
                  value = 4)
    ),
    
    mainPanel(
      
      # Output: Tabset w/ pop, samp dist, summary, and qqplot
      tabsetPanel(type = "tabs",
                  tabPanel("Population", plotOutput("popplot")),
                  tabPanel("Sampling Dist Plot", plotOutput("sampdistplot")),
                  tabPanel("Summary", textOutput("mean"), textOutput("se")),
                  tabPanel("Shape", plotOutput("qq"))
      )
      
    )
  )
),

server = function(input, output) {
  
  boot.stuff <- reactive({
    boot.quake <- numeric(1000)
    for (i in 1:1000){
      quake.sample <- sample(quakes$depth, input$sampsize, replace=FALSE)
      boot.quake[i] <- mean(quake.sample)
    }
    return(boot.quake)
  })

  output$popplot <- renderPlot({
    plot(density(quakes$depth), 
         xlab = "Depth (in km)",
         main = "Depths of All Earthquakes in Fiji Since 1964",
         xaxt='n')
    axis(1, at=seq(-100, 800, 100), pos=0)
    abline(v = mean(quakes$depth), col = "red", lwd = 2, lty = 2)
  })
  
  
  output$sampdistplot <- renderPlot({
  
    hist(boot.stuff(), xlim = c(0, 700), 
         xlab = "Sample Mean",
         main = paste("Sampling Distribution of Mean Depth for n = ",
                      input$sampsize,
                      sep = ""),
         xaxt='n')
    axis(1, at=seq(0, 700, 100), pos=0)
    abline(v = mean(quakes$depth), col = "red", lwd = 2, lty = 2)
  })
    
  output$mean <- renderText({
    paste("The mean of the sampling dist is ",
          round(mean(boot.stuff()), 2),
          sep = "")
  })
  output$se <- renderText({
    paste("The standard error of the sampling dist is ",
          round(sd(boot.stuff()), 4),
          sep = "")
  })
  
  output$qq <- renderPlot({
    qqnorm(boot.stuff())
    qqline(boot.stuff())
      })
},

options = list(height = 500)
)

In [None]:
mean(quakes$depth)
sd(quakes$depth)

| Property           | Population | $n=4$ | $n=9$ | $n=16$ | $n=81$ |
|--------------------|------------|-------|-------|--------|--------|
| Shape              | Bimodal    |       |       |        |        |
| Mean               | 311.4      |       |       |        |        |
| Standard Deviation | 215.5      |       |       |        |        |

# Notation for Population, Mean, and Distribution of Sample Means

------------------------------------------------------------------------

When describing the **mean** of a distribution we use the notation:

-   Population mean: $\mu_X$
-   Sample mean: $\bar{x}$
-   Center of the Sampling distribution for a mean: $\mu_{\overline{X}}$

When describing the **standard deviation** of a distribution we use the
notation:

-   Population standard deviation: $\sigma_X$

-   Sample standard deviation: $s_X$

-   Spread of the sampling distribution is called the **Standard
    Error**.

    -   The standard error measures the variability in sample statistics
        due to randomness.
    -   We use the notation
        $\mbox{SE}(\overline{X}) = \sigma_{\overline{X}}$.

## Question 8: Shape of Sampling Distributions for a Mean

------------------------------------------------------------------------

In each of the three sampling distributions we examined, let's summarize
what seems to be happening as the sample size, $n$, increases.

Does the shape of the sampling distribution stay the same as the
population or does it change as $n$ increases?

### Solution to Question 8

------------------------------------------------------------------------

  
  

## Question 9: Center of Sampling Distributions for a Mean

------------------------------------------------------------------------

Does the center of the sampling distribution, $\mu_{\overline{X}}$, stay
the same or change as $n$ increases? How does the value of
$\mu_{\overline{X}}$ compare to the population mean $\mu_X$?

### Solution to Question 9

------------------------------------------------------------------------

  
  

## Question 10: Spread of Sampling Distributions for a Mean

------------------------------------------------------------------------

Does the standard error of the sampling distribution,
$\mbox{SE}(\overline{X})$, stay the same or change as $n$ increases?

### Solution to Question 10

------------------------------------------------------------------------

  
  
  

# Formal Statement of the Central Limit Theorem (for Sample Means)

------------------------------------------------------------------------

Let $X_1, X_2, \ldots , X_n$ be independent, identically distributed
(iid) random variables from a population with mean $\mu$ and standard
deviation $\sigma$. Then, as long as $n$ is large enough
(informally $\mathbf{n \geq 30}$), the sampling distribution for the
mean, $\bar{X}$, will:

-   Be (approximately) normally distributed.
-   Have mean equal to the mean of the population, $\mu$.
-   Have standard error $\mbox{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$.

We summarize the results more concisely below:

$$ \overline{X} \sim N \left( \mu_{\overline{X}} , \sigma_{\overline{X}} \right) = N \left( \mu  , \frac{\sigma}{\sqrt{n}} \right)$$
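
The statement above can be checked by simulation. The sketch below uses the skewed wait-time population $\mbox{Exp}\left(\frac{1}{40}\right)$, for which $\mu = \sigma = 40$, with an illustrative sample size of $n = 36$. The names `n`, `reps`, and `xbar` and the seed are assumptions made for this example.

```r
set.seed(8)  # arbitrary seed so the sketch is reproducible

n    <- 36      # sample size (meets the informal n >= 30 guideline)
reps <- 10000   # number of simulated samples

# Mean of each of 10000 samples of size 36 from Exp(1/40)
xbar <- replicate(reps, mean(rexp(n, rate = 1/40)))

# Center: simulated mean of the sample means vs. the population mean
c(simulated = mean(xbar), theory = 40)

# Spread: simulated standard error vs. sigma / sqrt(n)
c(simulated = sd(xbar), theory = 40 / sqrt(n))

# Shape: points should fall near the line if xbar is roughly normal
qqnorm(xbar)
qqline(xbar)
```

The simulated values will not match the theoretical ones exactly, since they are themselves estimates from a finite number of samples.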

## Question 11

------------------------------------------------------------------------

Using properties of expected value and variance of linear combinations
of independent random variables, prove each of the following:

1.  $\displaystyle E \left( \bar{X} \right) = E \left( \frac{X_1 + X_2 + \ldots + X_n}{n} \right) = \mu_X$

2.  $\displaystyle \mbox{Var} \left( \bar{X} \right) = \mbox{Var} \left( \frac{X_1 + X_2 + \ldots + X_n}{n} \right) = \frac{\sigma^2_X}{n}$
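
As a reminder, the standard properties being referred to are stated here for two random variables $X$ and $Y$ and constants $a$ and $b$ (symbols introduced only for this reminder). The variance rule in this form assumes $X$ and $Y$ are independent.

$$ E \left( aX + bY \right) = a E(X) + b E(Y), \qquad \mbox{Var} \left( aX + bY \right) = a^2 \mbox{Var}(X) + b^2 \mbox{Var}(Y) $$

Both rules extend to sums of $n$ random variables by applying them repeatedly.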

### Solution to Question 11a

------------------------------------------------------------------------

  
  
  

### Solution to Question 11b

------------------------------------------------------------------------