<center><h1>Review Basics of Hypothesis Testing</h1></center>
<center><h3>Ellen Duong</h3></center>
<center><h3>Paul Stey</h3></center>
<center><h3>2023-10-26</h3></center>

# 1. Resources 
 - Discovering Statistics using R, Field, A. _et al_., 2012
 - Statistical Rethinking, McElreath, R., 2015
 - All of Statistics, Wasserman, L., 2004

## 1.1 Review of Core Concepts
	
- Bayesians, Frequentists, and Likelihoodists
- There are a few approaches to statistical inference:
    - Bayesian
    - Likelihoodist
    - Frequentist

We will be concerned primarily with the frequentist approach.

### 1.1.1 What is hypothesis testing?
<br>
<br>
<center>Hypothesis testing is the process of using data to make decisions under uncertainty.</center>


### 1.1.2 What is hypothesis testing? (cont.)
	
In the frequentist approach, we typically choose between two competing hypotheses:

  - Null hypothesis (usually written $H_0$)
  - Alternative hypothesis (usually written $H_1$ or sometimes $H_A$)

### 1.1.3 What is hypothesis testing? (cont.)

For example, we might be interested in whether some new medication, $M$, reduces cholesterol. Here the competing hypotheses are:

<center>
<br>

  $H_0$: $\mu_{1} = \mu_{2}$ $M$ does not reduce cholesterol (null hypothesis)
  
  $H_1$: $\mu_{1} < \mu_{2}$ $M$ reduces cholesterol (alternative hypothesis)

<br>
</center>

where $\mu_1$ is mean cholesterol for those receiving $M$ in the population, and $\mu_2$ is mean cholesterol for those _not_ receiving $M$ in the population.

### 1.1.4 Notes on hypothesis testing

Some important things to note:

1. The previous example is a one-sided test; two-sided tests generally look like:

  - $H_0$: $\mu_{1} = \mu_{2}$
  - $H_1$: $\mu_{1} \ne \mu_{2}$

2. Two-sided tests tend to be more common
3. You should clearly articulate hypotheses _prior to conducting statistical tests_

### 1.1.5 Notes on Hypothesis Testing (cont.)

General process of hypothesis testing:

1. Specify the null and alternative hypotheses, $H_0$ and $H_1$
2. Determine the test to be used, which gives us:

  - Our test statistic
  - Corresponding probability distribution 

3. Set a level of significance (e.g., $\alpha = 0.05$)
4. Use our data to compute our test statistic (and perhaps its standard error)
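5. Use the test statistic and its accompanying distribution to obtain a _p_-value
6. Compare the _p_-value to $\alpha$: reject $H_0$ if the _p_-value is smaller than $\alpha$; otherwise, fail to reject $H_0$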


# 2. Review of _p_-values

<br>
<br>
<center>What is a <i>p</i>-value?</center>


## 2.1. Understanding _p_-values

A <i>p</i>-value is a probability. In particular, it is the probability of finding data <i>as extreme or more extreme</i> than what we have observed, given that the null hypothesis is true.

### 2.1.1 Understanding _p_-values (cont.)

In other words, a _p_-value can be used to answer this question:

<br>

  <center><i>If the null hypothesis is true, are my data unusual?</i></center>

<br>
When a _p_-value is small, our answer is "yes". And when the answer is "yes", we are generally inclined to take this as evidence against the null hypothesis.
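
One way to make this question concrete is to simulate the null hypothesis directly. The sketch below assumes a hypothetical experiment (60 heads in 100 tosses, numbers chosen purely for illustration) and estimates how often a fair coin would produce a result at least that far from 50. It then compares the estimate with the exact binomial tail probability.

```r
# Hypothetical experiment: 60 heads observed in 100 tosses (illustrative numbers)
n_flips  <- 100
observed <- 60

# Simulate 100,000 experiments in a world where H0 (fair coin) is true
set.seed(1)
null_heads <- rbinom(100000, size = n_flips, prob = 0.5)

# Two-sided p-value: share of simulated results at least as far from 50 as ours
p_sim <- mean(abs(null_heads - n_flips / 2) >= abs(observed - n_flips / 2))

# Exact counterpart from the binomial distribution: P(X <= 40) + P(X >= 60)
p_exact <- pbinom(40, n_flips, 0.5) + pbinom(59, n_flips, 0.5, lower.tail = FALSE)

print(c(simulated = p_sim, exact = p_exact))
```

Both values land a little above 0.05, so under the usual $\alpha = 0.05$ this hypothetical result would not be unusual enough to reject the null hypothesis.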
			

### 2.1.2 Understanding _p_-values (cont.)

A _p_-value is **NOT**:

  - The probability the null hypothesis is true
  - The probability that the data were produced by chance alone
  - A measure of effect size
     + Be wary of papers discussing "highly" or "extremely" significant results based on _p_-values
     + Also beware of studies using _p_-values as inputs to subsequent computations or tests


### 2.1.3 Understanding _p_-values (cont.)

Other notes on _p_-values:

1. Their use is controversial in some circles
2. Can be easily abused to show significant results
3. Despite limitations, they are ubiquitous in science
  - We have used them for so long, it's hard to change course (but Bayesians are trying!)
  - For many applied researchers and practitioners, they are a convenient way to turn observed data into a "yes"/"no" decision

# 3. The Decision Problem

<br>

<center>Ultimately, we want to be able to draw conclusions and make decisions based on data</center>




## 3.1 Deciding between $H_0$ and $H_1$
<br>
		<center>So, how do we choose between our hypotheses?</center>
<br>

1. Our default is to believe $H_0$

2. We use our data to determine if we have sufficient reason to reject $H_0$ 

3. This is where we rely on work from probability theory

### 3.1.1 Deciding between $H_0$ and $H_1$ (cont.)
<br>
<center>Because we are relying on probabilistic reasoning about whether or not to reject $H_0$, we can be wrong.</center>

![image](images/error_types.png)

### 3.1.2 Deciding between $H_0$ and $H_1$ (cont.)


<br>

Question:

<br>

<center>How do we know when we have committed a Type I error or a Type II error?</center>

<br>

### 3.1.3 Deciding between $H_0$ and $H_1$ (cont.)

<br>

Answer: 
<br>

<center>In general, we cannot <i>know</i> unequivocally when we have committed a Type I error or a Type II error.</center>

<br>

This has important implications:

1. Replication is _absolutely crucial_ in science
2. Must be _hypervigilant_ about inflated Type I error from repeated testing (more on this later)
3. Should be generally skeptical, and especially so for low-powered studies with "oh-wow" results
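
To see why repeated testing inflates the Type I error rate, consider a simplified sketch that assumes $m$ independent tests, each run at $\alpha = 0.05$ with every null hypothesis true (the independence assumption is ours, for illustration only). The chance of at least one false positive is then $1 - (1 - \alpha)^m$:

```r
alpha   <- 0.05
n_tests <- c(1, 5, 10, 20)

# Probability of at least one Type I error across m independent tests
p_any_false_positive <- 1 - (1 - alpha)^n_tests

print(data.frame(n_tests, p_any_false_positive))
```

With 10 such tests, the chance of at least one false positive is already about 40%.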

<center><h1>The Binomial Test and Categorical Data</h1></center>

# 4. Categorical Data

  - Variables representing group membership
  - Examples: 
    + Political party affiliation
    + City of origin
    + Gender
    + Ethnicity

# 5. The Binomial Test
  - Probably the most basic example of a hypothesis test (and very useful)
  - Used to compare the distribution of observations in two categories against a theoretical distribution
  - Essentially, we use the binomial test when we have a problem that can be expressed in terms of "successes" and "failures"

## 5.1 Binomial Test Examples

Example questions we can answer:
  - Given $N$ tosses of a coin, $X_1, X_2, \ldots, X_N$, where $X_i = 1$ denotes heads and $X_i = 0$ is tails, is this a fair coin?
  - Given the counts of females and males in a particular class, are there significantly more females than males?
  - Suppose we are doing quality control on a medical device known to have a 0.01\% failure rate. Given the number of failures in a specific batch and the batch size, does this batch have significantly more failures than we expect?

### 5.1.1 Review Binomial Distribution

1. Discrete probability distribution
2. Has two parameters
  - $n$: number of "trials"
  - $p$: probability of "success" for a given trial
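
In R, the binomial probability mass function is available as `dbinom()`. As a small illustration (the values $n = 15$ and $p = 0.5$ are chosen here only as an example), we can tabulate the probability of each possible number of successes:

```r
n <- 15     # number of trials
p <- 0.5    # probability of success on each trial

k   <- 0:n                            # every possible number of successes
pmf <- dbinom(k, size = n, prob = p)  # P(X = k) for each k

print(round(pmf, 4))
print(sum(pmf))         # probabilities over all outcomes sum to 1
print(sum(k * pmf))     # mean of the distribution, equal to n * p
```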
  


<center><img src="images/binomial_distribution_pmf.png" width="700"></center>

[1.] Image source: wikipedia.org

## 5.2 Binomial Test

<center><img src="images/binomial_plot.png" width="700"></center>





## 5.3 Binomial Test: Coin Toss Example

Suppose we have the following data after tossing a coin several times:

[H, T, T, T, H, H, T, H, T, T, H, T, T, T, T]

Is this a fair coin?

### 5.3.1 Data Generation

In [1]:
# create variable to store data
coin_tosses <- c("H", "T", "T", "T", "H", "H", "T", "H", "T", "T", "H", 
                 "T", "T", "T", "T")

# get number of tosses
n_tosses <- length(coin_tosses)

# get number of heads
n_heads <- sum(coin_tosses == "H")

# print variables we created to check sanity
print(n_tosses)
print(n_heads)

[1] 15
[1] 5


### 5.3.2 Using `binom.test()`

In [2]:
# run binomial test on coin toss data

bin_test1 <- binom.test(n_heads, n_tosses)

print(bin_test1)


	Exact binomial test

data:  n_heads and n_tosses
number of successes = 5, number of trials = 15, p-value = 0.3018
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1182411 0.6161963
sample estimates:
probability of success 
             0.3333333 
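
The two-sided _p_-value reported by `binom.test()` can be reproduced from the binomial distribution itself. The sketch below adds up the probabilities of every outcome that is no more likely than the one we observed (5 heads in 15 tosses) under a fair coin. The variable names `outcome_probs`, `observed_prob`, and `p_manual` are ours.

```r
n_tosses <- 15
n_heads  <- 5

# Probability of each possible number of heads if the coin is fair
outcome_probs <- dbinom(0:n_tosses, size = n_tosses, prob = 0.5)
observed_prob <- dbinom(n_heads, size = n_tosses, prob = 0.5)

# Sum over outcomes at most as likely as the observed one
# (a small tolerance guards against floating-point ties)
p_manual <- sum(outcome_probs[outcome_probs <= observed_prob * (1 + 1e-7)])

print(round(p_manual, 4))
```

This matches the _p_-value of 0.3018 in the output above.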



## 5.4 Binomial Test: Device Defects Example

Suppose we are doing quality control for a medical device known to have a 0.01% failure rate (i.e., a failure probability of 0.0001). We are given a batch of 250000 to be tested. Of these, we find 17 defective devices. Does this batch have a significantly higher failure rate than our known failure rate?

In [3]:
# specify our inputs

p_failure <- 0.0001      # a-priori known failure rate

n_trials <- 250000        # number of devices produced

n_defectives <- 17        # number of defective devices

### 5.4.1 Device Defects Example (cont.)

In [4]:
# run binomial test on medical device data

test2 <- binom.test(n_defectives, n_trials, p = p_failure, alternative = "greater")

print(test2)


	Exact binomial test

data:  n_defectives and n_trials
number of successes = 17, number of trials = 250000, p-value = 0.9623
alternative hypothesis: true probability of success is greater than 1e-04
95 percent confidence interval:
 4.332901e-05 1.000000e+00
sample estimates:
probability of success 
               6.8e-05 
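
To see why this _p_-value is so large, it helps to compare the observed count with what the null hypothesis predicts. The sketch below recomputes the one-sided _p_-value as an upper-tail binomial probability, using the same inputs as above (`expected_failures` and `p_upper` are our names).

```r
p_failure    <- 0.0001   # known failure rate (as a proportion)
n_trials     <- 250000   # devices in the batch
n_defectives <- 17       # defective devices found

# Number of failures we would expect if the batch matched the known rate
expected_failures <- n_trials * p_failure
print(expected_failures)

# One-sided p-value: P(X >= 17) under H0
p_upper <- pbinom(n_defectives - 1, n_trials, p_failure, lower.tail = FALSE)
print(round(p_upper, 4))
```

We observed fewer failures (17) than the 25 expected under the null hypothesis, so there is no evidence that this batch fails more often than the known rate.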



<center><h1>Challenge Problem</h1></center>

Let's use the Providence Police Department's arrests data to answer the following research question: _Is there a statistically significant difference between the number of males and females arrested?_

In [None]:
# Read in the arrests data, then run a binomial test

# Using Pearson's $\chi^2$ Test for Categorical Data

# 6. Pearson's $\chi^2$ Test

What if our categorical variable has more than 2 categories?

There are a few options when you have a variable with more than 2 categories:
  - Exact Multinomial Test (EMT package in _R_)
  - _G_-Test for Goodness-of-Fit (also called likelihood ratio test)
  - Pearson's $\chi^2$ (Goodness-of-Fit) Test

## 6.1 Pearson's $\chi^2$ (Goodness-of-Fit) Test
Pearson's $\chi^2$ goodness-of-fit test can be used when we have some categorical variable, $X$, where each $X_i$ takes a value from one of $K$ categories (with $K \ge 2$), and we have an expected probability, $P_k$, for each category.

## 6.2 Pearson's $\chi^2$ Goodness-of-Fit Test Example
		
Suppose we want to determine whether or not a die is loaded (i.e., not a fair die). Say we roll the die 100 times, and we obtain the following results:

| Face | Count |
| :- | --- |
| 1 | 13 |
| 2 | 21 |
| 3 | 15 |
| 4 | 17 |
| 5 | 20 |
| 6 | 14 |


Are we confident that this is a fair die?

### 6.2.1 Pearson's $\chi^2$ Test Example (cont.)
The test statistic is $\chi^2$ and is computed using:

$$ \chi^{2}=\sum _{k=1}^{K}{\frac {(O_{k}-E_{k})^{2}}{E_{k}}},  $$

where $K$ is the number of categories, $O_k$ is the observed count for category $k$, and $E_k$ is the expected count for category $k$ under the null hypothesis. The degrees of freedom are: $df = K - 1$.
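
As a sketch of how this formula plays out for the die-rolling data, the statistic can be computed by hand in a few lines (the variable names are ours):

```r
observed <- c(13, 21, 15, 17, 20, 14)           # counts for faces 1 to 6
expected <- sum(observed) * rep(1/6, 6)         # fair die: 100 / 6 rolls per face

chi_sq <- sum((observed - expected)^2 / expected)
df     <- length(observed) - 1

print(chi_sq)
print(df)
```

For these rolls the statistic works out to 3.2 on 5 degrees of freedom.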

### 6.2.2 Pearson's $\chi^2$ Test Example (cont.)
Under the null hypothesis, the $\chi^2$ test statistic approximately follows the $\chi^2$ distribution, a continuous distribution with a single parameter: the degrees of freedom (i.e., $df$).

<center><img src="images/chisq_dist.jpg" width="650"></center>

[1.] Image source: https://stats.libretext.org

### 6.2.3 Pearson's $\chi^2$ Test Example (cont.)
	
With this $\chi^2$ and $df$, we evaluate the probability of obtaining a test statistic at least as large as the one we observed if the null hypothesis is true.
  - Note that Pearson's $\chi^2$ goodness-of-fit test assumes observations are independent from one another

<br>
<center><img src="images/chisq_dist2.jpg" width="650"></center>

[1.] Image source: https://actuarialmodelingtopics.wordpress.com
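
In R, this tail probability is available through `pchisq()`, and the cutoff for a given $\alpha$ through `qchisq()`. A minimal sketch for the die-rolling data (variable names are ours):

```r
roll_cnts <- c(13, 21, 15, 17, 20, 14)
expected  <- sum(roll_cnts) / 6                  # expected count per face for a fair die

chi_sq <- sum((roll_cnts - expected)^2 / expected)
df     <- length(roll_cnts) - 1

# Upper-tail area of the chi-squared distribution beyond our statistic
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Value the statistic would have to exceed to reject H0 at alpha = 0.05
critical_value <- qchisq(0.95, df = df)

print(c(chi_sq = chi_sq, p_value = p_value, critical_value = critical_value))
```

Because the statistic falls well below the critical value, these rolls give no reason to reject the hypothesis of a fair die.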

### 6.2.4 Using the `chisq.test()` Function 

In [None]:
roll_cnts <- c(13, 21, 15, 17, 20, 14)     # create vector with our counts

probs <- rep(1/6, 6)                       # create vector with 6 elements, all 1/6

In [None]:
test1 <- chisq.test(roll_cnts, p = probs)  # run test

print(test1)

### 6.2.5 Using `str()` on Output of `chisq.test()`

In [None]:
str(test1)              # examine components of test object

In [None]:
test1$residuals
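
The `residuals` component holds Pearson residuals, $(O_k - E_k) / \sqrt{E_k}$, which show how far each category sits from its expected count. As a quick check (recreating the die-roll test so the snippet runs on its own; `die_test` is our name), squaring and summing them recovers the test statistic:

```r
roll_cnts <- c(13, 21, 15, 17, 20, 14)
die_test  <- chisq.test(roll_cnts, p = rep(1/6, 6))

# Pearson residuals computed by hand from observed and expected counts
manual_resid <- (die_test$observed - die_test$expected) / sqrt(die_test$expected)

print(round(manual_resid, 3))
print(sum(manual_resid^2))     # equals the chi-squared statistic
```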

# Using Pearson's $\chi^2$ Test of Independence

# 7. Pearson's $\chi^2$ (Independence) Test

We can also use Pearson's $\chi^2$ to solve a different sort of problem. In particular, we can use Pearson's $\chi^2$ to test the extent to which two categorical variables are independent.


## 7.1 Pearson's $\chi^2$ (Independence) Test Example

Suppose we would like to teach cats to dance. 

		
We have two training systems: using food as a reward, and using affection as a reward. Suppose after a week of training the cats, we test dancing ability. So, we have two categorical variables: _training_ and _dance_, each with two levels.


| Cat dances? | Food as reward | Affection as reward |
|---|---|---|
| Yes | 28 | 48 |
| No | 10 | 114 |


From these data, are the _training_ and _dance_ variables independent?

*Source: Field _et al._ (2012)

### 7.1.1 Pearson's $\chi^2$ Independence Test (cont.)
The test statistic is $\chi^2$ and is computed using:

 $$ \chi ^{2}=\sum _{{i=1}}^{{r}}\sum _{{j=1}}^{{c}} {(O_{{i,j}}-E_{{i,j}})^{2} \over E_{{i,j}}},  $$
	
where $$ E_{i,j} = { \text{row-total}_i \times \text{column-total}_j \over N} $$
	
and where $O_{i,j}$ is the observed count in cell $i, j$ and $E_{i,j}$ is the expected count for cell $i,j$ under the null hypothesis. 

### 7.1.2 Pearson's $\chi^2$ Independence Test (cont.)
Note:
  - Degrees of freedom: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns
  - Assumption that observations are independent from one another 
    + E.g., In above example, a cat could only be in one _training_ condition
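
The expected counts, test statistic, and degrees of freedom for the dancing-cats table can be worked out by hand. The following is a sketch under the formulas above, with variable names of our choosing:

```r
# Observed counts from the table above
observed <- matrix(c(28, 48,
                     10, 114),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(dance = c("yes", "no"),
                                   training = c("food", "affection")))

# Expected counts under independence: row total * column total / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df))
```

Note that `chisq.test()` applies Yates' continuity correction to 2 × 2 tables by default, so its reported statistic will be somewhat smaller than this hand calculation unless `correct = FALSE` is passed.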

# 8. Pearson's $\chi^2$ Independence Test in R

In [None]:
can_dance <- c(rep(TRUE, 76), rep(FALSE, 124))

training <- c(rep("food", 28), rep("affection", 48), rep("food", 10), rep("affection", 114))

cats <- data.frame(can_dance, training)

head(cats)

## 8.1 Running $\chi^2$ Test of Independence

In [None]:
# sanity check to make sure data are correct
xtab1 <- table(cats$can_dance, cats$training)

print(xtab1)

In [None]:
test1 <- chisq.test(cats$training, cats$can_dance)

print(test1)

<center><h1>Continuous Variables and <i>T</i>-Tests</h1></center> 

# 9. Introduction
To this point, we have been looking at categorical data (e.g., "heads"/"tails", yes/no, cat dances/cat doesn't dance). We will now explore some interesting new methods; in particular, we can start looking at continuous variables. 

# 10. Student's _T_-Test


- The _t_-test refers to a family of statistical tests whose test statistic follows the _t_-distribution. 
- First published by William Sealy Gosset under the pseudonym "Student"

<center><img src="images/gossett.jpg" width=180/></center>



## 10.1 _T_-Distribution

- The t-distribution is a continuous probability distribution
- Has 1 parameter 
  + $\nu$: degrees of freedom
- Similar to normal distribution
  + Symmetric and bell-shaped
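
The two distributions differ mainly in the tails: the _t_-distribution has heavier tails when $\nu$ is small and approaches the normal distribution as $\nu$ grows. A brief sketch comparing cutoff values (the particular degrees of freedom are chosen for illustration):

```r
df_values <- c(2, 5, 30, 1000)

# 97.5th percentile: the cutoff for a two-sided test at alpha = 0.05
t_cutoffs <- qt(0.975, df = df_values)
z_cutoff  <- qnorm(0.975)

print(data.frame(df = df_values, t_cutoff = round(t_cutoffs, 3)))
print(round(z_cutoff, 3))
```

The _t_ cutoffs shrink toward the normal cutoff as the degrees of freedom increase.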
  
<center><img src="images/t_dist.png" width=270/></center>


### 10.1.1 Student's _T_-Test (cont.)

We will discuss three types of _t_-tests
  - One-sample _t_-test
  - Independent (two-sample) _t_-test
  - Dependent samples _t_-test
    + Also known as "paired-samples" _t_-test


## 10.2 Notes on _t_-tests in R
  - All three versions of the _t_-test are implemented in R as the `t.test()` function
  - Specifying different arguments to the function will give you a different type of _t_-test
  - In all three cases, the _t_-test can be done as one-sided or two-sided. We will generally prefer two-sided tests

# 11. One-Sample _t_-test
The one-sample _t_-test is used to test the null hypothesis that the population mean is equal to some value $\mu_0$. The test statistic is defined as $$t = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}},$$
where $\overline{x}$ is the sample mean and $\sigma_{\overline{x}}$ is our estimate of the standard error of the mean. Recall it is defined as $$\sigma_{\overline{x}} = {s \over \sqrt{n}},$$
where $s$ is the sample standard deviation and $n$ is the sample size.


## 11.1 One-Sample _t_-test (cont.)
So, our test statistic is defined as $$t = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}.$$ We also need to know the degrees of freedom ($\nu$) so we can compare our $t$ to the appropriate _t_-distribution. 

In the one-sample case, $\nu = n - 1$, where $n$ is our sample size. 

### 11.1.1 One-Sample _t_-test Example 

Suppose you teach high school math and you would like to know whether your students perform at, above, or below the average score on the math portion of the SAT, which is known to be 527.

In [None]:
# Define vector of students' SAT scores
sat <- c(527, 554, 534, 541, 539, 542, 498, 512, 
         528, 531, 563, 566, 498, 503, 551, 582, 
         529, 549, 571, 523, 543, 588, 571)

# our sample mean
mean(sat)

### 11.1.2 One-Sample _t_-test Example (cont.) 

In [None]:
t.test(sat, mu = 527)
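
To connect this call back to the formula, the same quantities can be computed by hand. This is a sketch that follows the definitions above, with variable names of our choosing:

```r
sat <- c(527, 554, 534, 541, 539, 542, 498, 512,
         528, 531, 563, 566, 498, 503, 551, 582,
         529, 549, 571, 523, 543, 588, 571)
mu_0 <- 527

n      <- length(sat)
se     <- sd(sat) / sqrt(n)            # estimated standard error of the mean
t_stat <- (mean(sat) - mu_0) / se
df     <- n - 1

# Two-sided p-value: area in both tails beyond |t|
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(c(t = t_stat, df = df, p_value = p_value))
```

These values should agree with the `t`, `df`, and `p-value` fields printed by `t.test()`.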



## 11.2 One-Sample _t_-test  Assumptions
Assumptions of one-sample _t_-test:

  - Observations are independent 
  - Variable is normally distributed in population
    + In practice, _t_-test is fairly robust to violations of normality provided $n$ is not small.



# 12. Independent (Two-Sample) _t_-test
The independent _t_-test:

  - More common version of the _t_-test
  - Used to compare means from two different groups
  - Test statistic is:
    $$t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{{s_{1}^{2} \over n_1} + {s_{2}^{2} \over n_2}}},$$
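    where $s_{k}^{2}$ is the sample variance of Group $k$, and $n_k$ is the sample size of Group $k$.

  - Our degrees of freedom are $\nu = n_1 + n_2 - 2$ when the two populations are assumed to have equal variances. By default, R's `t.test()` does not make this assumption: it runs Welch's _t_-test, which approximates $\nu$ from the data (pass `var.equal = TRUE` for the equal-variance version).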



## 12.1 Two-Sample _t_-test Example

  - `spider` data from Andy Field's _Discovering Statistics Using R_
  - Treating arachnophobia
  - Two treatment groups (12 subjects per group):
      + real spider
      + picture of spider
  - Measure anxiety after exposure to spider or picture

In [None]:
library(ggplot2)

spider <- read.csv("data/spiderlong.csv")

### 12.1.1 Examine Data

In [None]:
head(spider)
tail(spider)

### 12.1.2 Examine Data (cont.)

In [None]:
ggplot(spider, aes(x = anxiety, fill = group)) +
    geom_density(alpha = 0.5, colour = "grey") +
    xlim(10, 85) +
    scale_fill_manual(values = c("navy", "purple"))

### 12.1.3 Examine Data (cont.)

In [None]:
ggplot(spider, aes(y = anxiety, x = group, fill = group)) +
    geom_boxplot(width = 0.2) + 
    geom_jitter(width = 0.2)

## 12.2 Two-Sample _t_-test

In [None]:
t.test(anxiety ~ group, data = spider)
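
To see how the `var.equal` argument of `t.test()` changes the result, the sketch below runs both versions of the two-sample test on simulated stand-in data (the numbers are hypothetical, not the `spider` data, and `demo` is our name):

```r
# Hypothetical anxiety scores, 12 subjects per group (simulated for illustration)
set.seed(42)
demo <- data.frame(
    group   = rep(c("picture", "real"), each = 12),
    anxiety = c(rnorm(12, mean = 40, sd = 9), rnorm(12, mean = 47, sd = 11))
)

welch   <- t.test(anxiety ~ group, data = demo)                    # R's default (Welch)
student <- t.test(anxiety ~ group, data = demo, var.equal = TRUE)  # pooled variance

print(c(welch_df = unname(welch$parameter), student_df = unname(student$parameter)))
print(c(welch_t = unname(welch$statistic), student_t = unname(student$statistic)))
```

With equal group sizes the two _t_ statistics coincide; only the degrees of freedom (and hence the _p_-value) differ.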

## 12.3 Independent (two-sample) _t_-test (cont.)
Assumptions of independent (two-sample) _t_-test:

  - Observations are independent 
  - Variable is normally distributed in population
    + In practice, _t_-test is fairly robust to violations of normality provided $n$ is not small.

  - Homogeneity of variance (equal variances in the two populations)
    + The _t_-test is also quite robust to violations of homogeneity of variance


# 13. Dependent (paired-sample) _t_-test

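The paired _t_-test is often used when we have repeated measurements (i.e., one sample with two measurement occasions). The test statistic is defined as $$t = \frac{\overline{x}_D - \mu_0}{{s_D \over \sqrt{n}}},$$
where $\overline{x}_D$ is the mean of the within-pair differences, $s_D$ is the standard deviation of those differences, $n$ is the number of pairs, and $\mu_0$ is the hypothesized mean difference (usually 0). The degrees of freedom are $\nu = n - 1$.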

## 13.1 Paired-Sample _t_-test Example

  - Suppose we are interested in cholesterol level changes over time
  - Recruit 100 patients to follow over time
  - Two time points:
    + Time1: baseline measurement at beginning of study
    + Time2: after 10 years
  - Measure their total cholesterol
  - Research question:

<center><i>Do cholesterol levels in adults increase over time?</i></center>

### 13.1.1 Paired-Sample _t_-test Example (cont.)

In [None]:
chol_df <- read.csv("data/cholesterol_data.csv")

head(chol_df)

### 13.1.2 Paired-Sample _t_-test Example (cont.)

In [None]:
ggplot(chol_df) +
    geom_density(alpha = 0.5, colour = "grey", fill = "navy", aes(x = time1)) +
    geom_density(alpha = 0.5, colour = "grey", fill = "purple", aes(x = time2)) +
    xlim(80, 380)

### 13.1.3 Paired-Sample _t_-test (cont.)

Running paired-sample _t_-test using the `t.test()` function

In [None]:
t.test(chol_df$time1, chol_df$time2, paired = TRUE)
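
A paired-sample _t_-test is equivalent to a one-sample _t_-test on the within-subject differences. The sketch below checks this on simulated stand-in data (hypothetical values, not the contents of `cholesterol_data.csv`):

```r
# Hypothetical cholesterol measurements for 100 patients (simulated for illustration)
set.seed(7)
time1 <- rnorm(100, mean = 190, sd = 30)
time2 <- time1 + rnorm(100, mean = 8, sd = 15)

paired_test <- t.test(time1, time2, paired = TRUE)
diff_test   <- t.test(time1 - time2, mu = 0)      # one-sample test on differences

print(c(paired_t = unname(paired_test$statistic), diff_t = unname(diff_test$statistic)))
print(c(paired_p = paired_test$p.value, diff_p = diff_test$p.value))
```

Both calls give the same _t_ statistic and _p_-value, on $n - 1 = 99$ degrees of freedom.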