# 5.1

We usually want to know one of four things when making probabilistic statements about a distribution:
- Density (pdf) at a particular value (dname)
- Distribution (cdf) at a particular value (pname)
- Quantile value corresponding to a particular probability (qname)
- Random draw of values from a particular distribution (rname)

In the R functions above, name stands for the name of the distribution (e.g. norm gives dnorm). To calculate the value of the pdf at $x = 3$ (the height of the curve at $x = 3$) use:

In [1]:
dnorm( x = 3, mean = 2, sd = 5 )

To calculate the value of the cdf at $x = 3$, that is $P(X \le 3)$, the probability that $X$ is less than or equal to $3$, use:

In [2]:
pnorm( q = 3, mean = 2, sd = 5 )

Or to calculate the quantile for probability $0.975$ use:

In [4]:
qnorm( p = 0.975, mean = 2, sd = 5 )

To generate a random sample of size $n = 10$ use:

In [5]:
rnorm( n = 10, mean = 2, sd = 5 )
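
The four function families are linked to one another. As a minimal sketch of that link, `qnorm` inverts `pnorm`, and the proportion of `rnorm` draws at or below a value should be close to the `pnorm` value there. The seed and the sample size of 100000 are arbitrary choices made for illustration, not values from these notes.

```r
# qnorm() is the inverse of pnorm(): passing the quantile back to the cdf
# recovers the original probability
q <- qnorm( p = 0.975, mean = 2, sd = 5 )
pnorm( q = q, mean = 2, sd = 5 )

# The proportion of random draws at or below 3 should be close to pnorm( 3, ... )
set.seed( 42 )  # arbitrary seed, for reproducibility only
draws <- rnorm( n = 100000, mean = 2, sd = 5 )
c( empirical = mean( draws <= 3 ), exact = pnorm( q = 3, mean = 2, sd = 5 ) )
```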

Other names beyond norm include:
- *binom
- *t
- *pois
- *f
- *chisq

where the * can be d, p, q, or r. For example, to get the probability of seeing $6$ heads in $10$ coin flips when the probability of heads is $0.75$, use:

In [6]:
dbinom( x = 6, size = 10, prob = 0.75 )

Binomial Distribution Formula:

$$P( Y = y ) = {n \choose y}\ \theta^y\ ( 1 - \theta )^{ n - y }$$

The call above is formally $P( Y = 6 )$ for $Y \sim b( n = 10, \theta = 0.75 )$, which can be checked by hand:

In [27]:
# Can also do choose( 10, 6 )
n_choose_y <- ( 10 * 9 * 8 * 7 * 6 * 5 ) / ( 6 * 5 * 4 * 3 * 2 * 1 )
theta      <- 0.75
n          <- 10
y          <- 6

n_choose_y * theta ^ y * ( 1 - theta ) ^ ( n - y )
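
As a quick check, the hand calculation can be compared with the built-in functions. This sketch uses `choose()` for the binomial coefficient and `all.equal()` for a floating-point-safe comparison. The variable names `by_hand` and `built_in` are introduced here for illustration only.

```r
n     <- 10
y     <- 6
theta <- 0.75

# Binomial formula written out, using choose() for the coefficient
by_hand  <- choose( n, y ) * theta ^ y * ( 1 - theta ) ^ ( n - y )

# The same probability from the built-in density function
built_in <- dbinom( x = y, size = n, prob = theta )

all.equal( by_hand, built_in )
```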

# 5.2

#### Hypothesis Testing

##### One Sample t-Test

Suppose $x_i \sim N( \mu, \sigma^2 )$ and we want to test $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. If we assume $\sigma$ is unknown, we use the one-sample $t$ statistic:

$$t = \frac{\bar x - \mu_0}{s / \sqrt{n}} \sim t_{n - 1}$$

where

$$\bar x = \frac{\sum_{i = 1}^n x_i}{n}$$

and

$$s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar x)^2}$$

A $100(1 - \alpha)\%$ confidence interval for $\mu$ is given by:

$$\bar x \pm t_{n - 1}\left( \frac{\alpha}{2} \right) \frac{s}{\sqrt{n}}$$

where $t_{n - 1}( \frac{\alpha}{2} )$ is the critical value such that:

$$P\left( t > t_{n - 1}\left( \frac{\alpha}{2} \right) \right) = \frac{\alpha}{2}$$

for $n -1$ degrees of freedom.
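
Because the critical value is defined by an upper-tail area, in R it corresponds to the $1 - \frac{\alpha}{2}$ quantile. A minimal sketch, assuming $\alpha = 0.05$ and $n = 9$ purely for illustration:

```r
alpha <- 0.05  # illustrative significance level
n     <- 9     # illustrative sample size

# Upper-tail critical value: the point with area alpha / 2 to its right
t_crit <- qt( 1 - alpha / 2, df = n - 1 )
t_crit

# Check the defining property P( t > t_crit ) = alpha / 2
pt( t_crit, df = n - 1, lower.tail = FALSE )
```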

#### Example

Suppose a grocery store sells 16 ounce boxes of cereal. A random sample of 9 boxes is taken and weighed:

In [28]:
cereal <- data.frame( weight = c( 15.5, 16.2, 16.1, 15.8, 15.6, 16.0, 15.8, 15.9, 16.2 ) )

The claim is that a box weighs at least 16 ounces. Assume the weight is normally distributed and use a $0.05$ level of significance to test the claim. So:  
- $H_0: \mu \ge 16$
- $H_1: \mu < 16$

In [29]:
x_bar <- mean( cereal$weight )
s     <- sd( cereal$weight )
mu_0  <- 16
n     <- 9

t <- ( x_bar - mu_0 ) / ( s / sqrt( n ) )
t

Under the null hypothesis the test statistic has a $t$ distribution with $n - 1$ degrees of freedom, which is $8$ in this case. Let's get the p-value of the test. Since this is a one-sided test with a less-than alternative, we need the area to the left of $-1.2$ for a $t$ distribution with $8$ degrees of freedom:

$$P( t_8 < -1.2 )$$

In [30]:
pt( t, df = n - 1 )

The p-value is greater than our significance level of $0.05$, so we fail to reject the null hypothesis. A more condensed way to run the test in R is as follows:

In [31]:
t.test( x = cereal, mu = 16, alternative = c('less'), conf.level = 0.95 )


	One Sample t-test

data:  cereal
t = -1.2, df = 8, p-value = 0.1322
alternative hypothesis: true mean is less than 16
95 percent confidence interval:
     -Inf 16.05496
sample estimates:
mean of x 
     15.9 


In [33]:
# For Two Sided Test
cereal_results <- t.test( x = cereal, mu = 16, alternative = c('two.sided'), conf.level = 0.95 )

In [34]:
names( cereal_results )

In [35]:
cereal_results$conf.int

Let us check this by hand. First we need the critical value, which has area $0.025$ to its right and is therefore the $0.975$ quantile:

$$t_{n - 1}(\frac{\alpha}{2} ) = t_8( 0.025 )$$ 

In [36]:
qt( 0.975, df = 8 )

Now plug into the formula:

$$\bar x \pm t_{n - 1}\left( \frac{\alpha}{2} \right) \frac{s}{\sqrt{n}}$$

In [43]:
c( 
    mean( cereal$weight ) - qt( 0.975, df = 8 ) * sd( cereal$weight ) / sqrt(9), 
    mean( cereal$weight ) + qt( 0.975, df = 8 ) * sd( cereal$weight ) / sqrt(9)
)
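
The hand-computed interval can also be compared programmatically with the one stored in the `t.test` result. This is a sketch that works on the weights as a plain numeric vector. The names `half_width`, `by_hand` and `from_t_test` are introduced here for illustration.

```r
weight <- c( 15.5, 16.2, 16.1, 15.8, 15.6, 16.0, 15.8, 15.9, 16.2 )
n      <- length( weight )

# Interval from the formula: x_bar +/- critical value * standard error
half_width <- qt( 0.975, df = n - 1 ) * sd( weight ) / sqrt( n )
by_hand    <- mean( weight ) + c( -1, 1 ) * half_width

# Interval reported by t.test(); as.numeric() drops the conf.level attribute
from_t_test <- t.test( x = weight, mu = 16, alternative = 'two.sided', conf.level = 0.95 )$conf.int

all.equal( by_hand, as.numeric( from_t_test ) )
```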

#### Two-Sample t-Test

Suppose:
- $x_i \sim N(\mu_x, \sigma^2)$
- $y_i \sim N(\mu_y, \sigma^2)$

We want to test:
- $H_0: \mu_x - \mu_y = \mu_0$
- $H_1: \mu_x - \mu_y \neq \mu_0$

If $\sigma$ is unknown, then the two-sample $t$ statistic is:

$$t = \frac{(\bar x - \bar y) - \mu_0}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t_{n + m - 2}$$

where:

$$\bar x = \frac{\sum_{i = 1}^n x_i}{n}, \quad \bar y = \frac{\sum_{i = 1}^m y_i}{m}, \quad s_p^2 = \frac{(n - 1)s_x^2 + (m - 1)s_y^2}{n + m - 2}$$

A $100(1 - \alpha)\%$ confidence interval for $\mu_x - \mu_y$ is given by:

$$(\bar x - \bar y) \pm t_{n + m - 2}\left( \frac{\alpha}{2} \right) s_p \sqrt{\frac{1}{n} + \frac{1}{m}}$$

where:

$$t_{n + m - 2}\left( \frac{\alpha}{2} \right)$$

is a critical value such that $P( t > t_{n + m - 2}( \frac{\alpha}{2} ) ) = \frac{\alpha}{2}$

#### Example

Given $n = 6$ observations of $X$ and $m = 8$ observations of $Y$:

In [46]:
x <- c( 70, 82, 78, 74, 94, 82 )
n <- length( x )

y <- c( 64, 72, 60, 76, 72, 80, 84, 68 )
m <- length( y )

Test:
- $H_0: \mu_x = \mu_y$
- $H_1: \mu_x > \mu_y$

In [47]:
x_bar <- mean(x)
s_x   <- sd(x)

y_bar <- mean(y)
s_y   <- sd(y)

Now calculate the pooled standard deviation:

In [48]:
s_p <- sqrt( ( ( n - 1 ) * s_x ^ 2 + ( m - 1 ) * s_y ^ 2 ) / ( n + m - 2 ) )

And the test statistic is

In [49]:
t <- ( ( x_bar - y_bar ) - 0 ) / ( s_p * sqrt( 1 / n + 1 / m ) )
t

Now calculate the p-value. The alternative is greater-than, so we need the area to the right of $t$:

In [50]:
1 - pt( t, df = n + m - 2 )

Or use the shortcut:

In [51]:
t.test( x, y, alternative = c( 'greater' ), var.equal = TRUE )


	Two Sample t-test

data:  x and y
t = 1.8234, df = 12, p-value = 0.04662
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.1802451       Inf
sample estimates:
mean of x mean of y 
       80        72 
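
As a sketch of how the two routes can be checked against each other, the block below recomputes the pooled statistic and p-value and compares them with the components of the `t.test` result. The names `t_stat`, `p_val` and `result` are introduced here for illustration.

```r
x <- c( 70, 82, 78, 74, 94, 82 )
y <- c( 64, 72, 60, 76, 72, 80, 84, 68 )
n <- length( x )
m <- length( y )

# Pooled standard deviation, test statistic, and upper-tail p-value by hand
s_p    <- sqrt( ( ( n - 1 ) * var( x ) + ( m - 1 ) * var( y ) ) / ( n + m - 2 ) )
t_stat <- ( mean( x ) - mean( y ) ) / ( s_p * sqrt( 1 / n + 1 / m ) )
p_val  <- pt( t_stat, df = n + m - 2, lower.tail = FALSE )

# The same test via t.test(); unname() drops the "t" label from the statistic
result <- t.test( x, y, alternative = 'greater', var.equal = TRUE )

all.equal( t_stat, unname( result$statistic ) )
all.equal( p_val, result$p.value )
```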


# Simulation

Simulation and model fitting are related, but they work in opposite directions:
- Simulation: data generating process is known. Know form of the model as well as the value of each of the parameters. Control the distribution and parameters which define the randomness, or noise in the data.
- Model Fitting: data is known. Assume a certain form of the model and find the best possible values of the parameters given the observed data. Seeking to uncover the truth. Often attempt to fit many models, and we will learn metrics to assess which model fits best.
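
The contrast can be sketched in a few lines: first simulate data from a known normal model, then treat the data as all that is known and estimate the parameters. The seed, the sample size of 1000, and the parameter values are arbitrary assumptions made for illustration.

```r
set.seed( 1 )  # arbitrary seed, for reproducibility only

# Simulation: the data-generating process and its parameters are known
true_mean <- 6
true_sd   <- 2
sim_data  <- rnorm( n = 1000, mean = true_mean, sd = true_sd )

# Model fitting: pretend only the data are known and estimate the parameters
c( estimated_mean = mean( sim_data ), estimated_sd = sd( sim_data ) )
```

The estimates should land close to, but not exactly on, the values used to generate the data.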

#### Paired Differences

Consider the model:
- $X_{11}, ..., X_{1n} \sim N(\mu_1, \sigma^2)$
- $X_{21}, ..., X_{2n} \sim N(\mu_2, \sigma^2)$

Assume $\mu_1 = 6$, $\mu_2 = 5$, $\sigma^2 = 4$, and $n = 25$. Also let $D = \bar X_1 - \bar X_2$, the difference of the two sample means. Suppose we would like to calculate $P(0 < D < 2)$. First we will need to obtain the distribution of $D$.

$$D = \bar X_1 - \bar X_2 \sim N(\mu_1 - \mu_2, \frac{\sigma^2}{n} + \frac{\sigma^2}{n}) = N(6 - 5, \frac{4}{25} + \frac{4}{25})$$

In other words:

$$D \sim N(\mu = 1, \sigma^2 = 0.32)$$

and thus:

$$P(0 < D < 2) = P(D < 2) - P(D < 0)$$

Using R:

In [54]:
pnorm( 2, mean = 1, sd = sqrt( 0.32 ) ) - pnorm( 0, mean = 1, sd = sqrt( 0.32 ) )

An alternative is to simulate a large number of observations of $D$, then use the empirical distribution to calculate the probability. The strategy is to repeatedly:
- Generate a sample of $25$ random observations from $N(\mu_1 = 6, \sigma^2 = 4)$ and call its mean $\bar x_{1s}$
- Generate a sample of $25$ random observations from $N(\mu_2 = 5, \sigma^2 = 4)$ and call its mean $\bar x_{2s}$
- Calculate the difference of the means $d_s = \bar x_{1s} - \bar x_{2s}$

Repeat a large number of times and then use the distribution of the simulated observations of $d_s$ as an estimate for the true distribution of $D$.

In [55]:
set.seed( 19920917 )

num_samples <- 10000
differences <- rep( 0, num_samples ) # Store the d_s

for ( s in 1:num_samples ) {
    x1 <- rnorm( n = 25, mean = 6, sd = 2 )
    x2 <- rnorm( n = 25, mean = 5, sd = 2 )
    differences[s] = mean(x1) - mean(x2)
}

mean( 0 < differences & differences < 2 )

In [57]:
hist(
    differences,
    breaks = 20,
    main   = 'Empirical Distribution of D',
    xlab   = 'Simulated Values of D',
    col    = 'dodgerblue',
    border = 'darkorange'
)

Plot with title "Empirical Distribution of D"

In [58]:
mean( differences )

In [59]:
var( differences )
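
The same simulation can be written without an explicit loop. This is a sketch using `replicate()`, with the empirical mean, variance, and probability placed next to the theoretical values for $D \sim N(1, 0.32)$. It is an alternative formulation, not a replacement for the loop.

```r
set.seed( 19920917 )

# replicate() evaluates the expression num_samples times and collects the results
num_samples <- 10000
differences <- replicate(
    num_samples,
    mean( rnorm( n = 25, mean = 6, sd = 2 ) ) - mean( rnorm( n = 25, mean = 5, sd = 2 ) )
)

# Empirical value first, theoretical value second
c( mean( differences ), 1 )
c( var( differences ), 0.32 )
c(
    mean( 0 < differences & differences < 2 ),
    pnorm( 2, mean = 1, sd = sqrt( 0.32 ) ) - pnorm( 0, mean = 1, sd = sqrt( 0.32 ) )
)
```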

#### Distribution of Sample Mean

Now simulate from a Poisson distribution. If:

$$X \sim Pois(\mu)$$

then:

$$E[X] = \mu$$

and 

$$Var[X] = \mu$$

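For a random variable $X$ with finite mean $\mu$ and finite variance $\sigma^2$, the central limit theorem tells us that the mean $\bar X$ of random samples of size $n$ is approximately normal for large values of $n$, with mean $\mu$ and variance $\frac{\sigma^2}{n}$. For the Poisson distribution $\sigma^2 = \mu$, so the variance of $\bar X$ is $\frac{\mu}{n}$. Assume $\mu = 10$ and $n = 50$: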

In [60]:
set.seed( 19920917 )

mu          = 10
sample_size = 50
samples     = 100000
x_bars      = rep( 0, samples )

for( i in 1:samples ) {
    x_bars[i] = mean( rpois( sample_size, lambda = mu ) )
}

Now compare sample statistics from the empirical distribution with their known values based on the parent distribution:

In [61]:
c( mean( x_bars ), mu )

In [62]:
c( var( x_bars ), mu / sample_size )

In [63]:
c( sd( x_bars ), sqrt( mu ) / sqrt( sample_size ) )

Calculate the proportion of sample means that are within 2 standard deviations of the sample mean, that is $2 \sqrt{\mu / n}$, of the population mean:

In [65]:
mean(
    x_bars > mu - 2 * sqrt( mu ) / sqrt( sample_size ) &
    x_bars < mu + 2 * sqrt( mu ) / sqrt( sample_size )
)
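
If the normal approximation is good, this proportion should be close to the probability that a normal random variable falls within 2 standard deviations of its mean. A minimal sketch of that comparison follows. It assumes a smaller number of replications (10000) to keep it fast and introduces the name `se` for illustration.

```r
set.seed( 19920917 )

mu          <- 10
sample_size <- 50
samples     <- 10000  # fewer replications than above, to keep the sketch fast

x_bars <- replicate( samples, mean( rpois( sample_size, lambda = mu ) ) )
se     <- sqrt( mu / sample_size )  # standard deviation of the sample mean

# Simulated proportion versus the value implied by the normal approximation
c(
    simulated = mean( abs( x_bars - mu ) < 2 * se ),
    normal    = pnorm( 2 ) - pnorm( -2 )
)
```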