# **Week 8: Large Sample Hypothesis Testing**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```

Through the following examples, we will explore the concepts of (large-sample) hypothesis testing (LSHT) and examine their practical implications.


## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, this notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to change the runtime type manually.

### **Importing Required Packages**
**Run the following lines of code**:

In [1]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Run the pre-configuration script (loads the required packages)
invisible(source("R/preConfigurated.R"))




**Do not modify the following**

In [2]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr", "knitr") %in% loadedNamespaces()))

})

Test passed 🥳


## **Reference Tables for LSHT for Sample Means**

| Scenario | Parameter | Null Hypothesis | Test Statistic (z) |
|----------|-----------|----------------|----------------|
| One-sample mean | $\mu$ | $\mu = \mu_0$ | $\frac{\bar{x}-\mu_0}{s/\sqrt{n}}$ |
| One-sample proportion | $p$ | $p = p_0$ | $\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}$ |
| Two-sample mean | $\mu_1 - \mu_2$ | $\mu_1  = \mu_2$ | $\frac{\bar{x}_1-\bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ |
| Two-sample proportion | $p_1 - p_2$ | $p_1  = p_2$ | $\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$ |


Under the null hypothesis, all the test statistics $z$ in these tables are approximately distributed as a standard Gaussian $N(0,1)$ (for large enough sample sizes). If $\sigma$ is unknown, it can be replaced with the sample standard deviation.
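The four statistics in the table translate directly into R. The helper functions below are a minimal sketch: their names (`z_one_mean`, `z_one_prop`, `z_two_mean`, `z_two_prop`) and the toy inputs are assumptions made for illustration and are not part of the unit's code.

```r
# Hypothetical helpers mirroring the reference table (names are illustrative only)

# One-sample mean: (xbar - mu0) / (s / sqrt(n))
z_one_mean <- function(xbar, s, n, mu0) (xbar - mu0) / (s / sqrt(n))

# One-sample proportion: the standard error uses the null value p0
z_one_prop <- function(p_hat, n, p0) (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sample mean: difference of means over the combined standard error
z_two_mean <- function(xbar1, s1, n1, xbar2, s2, n2) {
  (xbar1 - xbar2) / sqrt(s1^2 / n1 + s2^2 / n2)
}

# Two-sample proportion: standard error built from the two sample proportions
z_two_prop <- function(p1, n1, p2, n2) {
  (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
}

# Toy inputs (made up for illustration)
z_one_mean(xbar = 7.9, s = 0.5, n = 100, mu0 = 8)   # (7.9 - 8) / 0.05 = -2
z_one_prop(p_hat = 0.45, n = 100, p0 = 0.4)
z_two_mean(7.5, 0.6, 80, 7.4, 0.7, 170)
z_two_prop(0.10, 80, 0.40, 170)
```

Each function returns a single $z$ value that is then compared against a standard Gaussian quantile.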

Any substantial deviation from the null hypothesis will tend to produce $z$ values that are unlikely under this standard Gaussian distribution, which is why extreme values of $z$ provide evidence against $H_0$. Even though any deviation from $H_0$ can provide evidence against it, the choice between a one-sided and a two-sided test depends on our research goal and the direction of interest.

| Test Type | Alternative Hypothesis | Rejection Region |
|-----------|----------------------|----------------|
| One-sided (right) | $H_1: \theta > \theta_0$ | Reject $H_0$ if $z > z_{1-\alpha}$ |
| One-sided (left) | $H_1: \theta < \theta_0$ | Reject $H_0$ if $z < z_{\alpha}$ |
| Two-sided | $H_1: \theta \neq \theta_0$ | Reject $H_0$ if $|z| > z_{1-\alpha/2}$ |
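The critical values in the rejection-region table can be obtained with `qnorm()`. The short sketch below assumes $\alpha = 0.05$; the variable names are illustrative.

```r
alpha <- 0.05

z_right <- qnorm(1 - alpha)      # right-tailed: reject H0 if z > z_right
z_left  <- qnorm(alpha)          # left-tailed:  reject H0 if z < z_left
z_two   <- qnorm(1 - alpha / 2)  # two-sided:    reject H0 if |z| > z_two

round(c(right = z_right, left = z_left, two_sided = z_two), 3)
# right = 1.645, left = -1.645, two_sided = 1.960
```

By symmetry of the standard Gaussian, the left-tailed critical value is the negative of the right-tailed one.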


If we specifically care about deviations in one direction — for example, testing whether the average battery life is less than 8 hours — a one-sided test is appropriate. Allocating all of the Type I error $\alpha$ to that direction increases the test’s ability to detect deviations that matter in practice.

On the other hand, if deviations in either direction are meaningful — for instance, testing whether the average rating of a show differs from 7.7, whether higher or lower — a two-sided test is necessary. Splitting $\alpha$ between both tails ensures we properly account for evidence against $H_0$ in either direction.

There are other scenarios (e.g., both the null and the alternative hypotheses are intervals). However, they are out of the scope of this unit.

## **Making Sense of Hypothesis Testing**


In [3]:
smpl_data = c(7.2, 8.53, 8.07, 7.99, 7.79, 7.77, 8.9, 7.64, 7.35, 8.45, 9.14, 7.93, 7.35, 7.52, 7.41, 8.27,
7.55, 7.5, 8.53, 8.37, 8.17, 8.15, 8.02, 7.63,7.64, 8.83, 8.17, 7.41, 7.7, 8.21)

### **Demonstrating Example**

A phone company advertises that the average battery life of their phones (when continuously watching videos), denoted as $\mu$, is 8 hours.

To verify this claim, an independent random sample of 30 phones was tested. Battery life is assumed to follow a normal distribution, and the population standard deviation $\sigma$ is unknown.

**Hint**:
- Use the asymptotic properties of the sample mean
- Replace the unknown standard deviation $\sigma$ with its estimate

**Write down the asymptotic sampling distribution of the sample mean.**

Given i.i.d. $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with $n = 30$, we have:

$$\bar{x} \sim \mathcal{N}\Big(\mu,\frac{\sigma^2}{n}\Big) = \mathcal{N}\Big(\mu,\frac{\sigma^2}{30}\Big)$$

This is the exact sampling distribution, not just an asymptotic approximation, because $x_1, \ldots, x_n$ are i.i.d. Gaussian random variables.

**Write down the null and alternative hypotheses for testing whether the company’s claim is correct.**


Since we are testing whether or not there is evidence *against* the company’s claim that the average battery life is 8 hours, the alternative hypothesis should challenge this claim.  

Because we are not concerned if the battery lasts longer than 8 hours (that would be favorable to consumers), we only test if it is **less** than 8 hours.  

$$
\begin{align}
H_0: \mu &= 8 \\
H_1: \mu &< 8
\end{align}
$$

This is a **left-tailed test** of the mean (often simply referred to as a one-tailed or one-sided test).

**Approximate the sampling distribution of the sample mean under the null hypothesis.**

In [4]:
var(smpl_data)

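Under $H_0$ we have $\mu = 8$, and replacing $\sigma^2$ with the sample variance $s^2 \approx 0.253$ gives the approximation

$$\bar{x} \sim \mathcal{N}\Big(8,\frac{0.253}{30}\Big)$$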

**Define the z-test statistic for testing the null hypothesis and derive the rejection region.**


$$
z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - 8}{\sqrt{\frac{0.253}{30}}}
$$


At $\alpha = 0.05$, we reject the null hypothesis if $z < z_{0.05} = -1.645$. Thus, the rejection region is $(-\infty, -1.645)$.

**Why is this approach valid?**

Under the null hypothesis (i.e., if $H_0$ is true), the test statistic follows (approximately) a standard Gaussian distribution:  

$$
z \mid H_0 \sim \mathcal{N}(0,1)
$$  

Here, the probability of observing $z < -1.645$ under $H_0$ is 0.05, which is relatively unlikely. If we observe a test statistic less than -1.645, this provides evidence **against** the null hypothesis that $\mu = 8$.  
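A small simulation can make this concrete. The sketch below generates many samples for which $H_0$ is true and records how often the test statistic lands in the rejection region. The seed, the number of simulations, and the true standard deviation of 0.5 are assumptions chosen for illustration.

```r
set.seed(107)          # arbitrary seed so the sketch is reproducible
n_sim <- 10000         # number of simulated samples (assumed)
n     <- 30
mu0   <- 8
sigma <- 0.5           # assumed true standard deviation for the simulation

z_sim <- replicate(n_sim, {
  x <- rnorm(n, mean = mu0, sd = sigma)    # data generated with H0 true
  (mean(x) - mu0) / (sd(x) / sqrt(n))      # z statistic with sigma estimated by s
})

# Proportion of simulated samples that fall in the rejection region
reject_rate <- mean(z_sim < qnorm(0.05))
reject_rate
```

The rate should be close to 0.05. Because $\sigma$ is estimated from only 30 observations, it is typically slightly above 0.05, and the approximation improves as $n$ grows.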

For example, if the true population mean is substantially smaller than 8, the sample mean is likely to be smaller, resulting in a more negative test statistic $z$.  




**Given the sample data, compute the test statistic and state the Neyman-Pearson decision.**

In [5]:
xbar = mean(smpl_data)
s = sd(smpl_data)
n = 30
z = (xbar-8)/(s/sqrt(n))
z

As $z \approx -0.294 \notin (-\infty, -1.645)$, the test statistic does not fall in the rejection region. There is insufficient evidence to reject the null hypothesis that $\mu = 8$.
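The decision can also be expressed in code. The sketch below repeats the sample so that it runs on its own, applies the left-tailed rejection rule, and additionally reports the one-sided p-value $P(Z \le z)$ under $H_0$ (the p-value approach is only mentioned later in this notebook). Rejecting when the p-value is below $\alpha$ is equivalent to rejecting when $z < z_{\alpha}$. The variable names `reject_H0` and `p_value` are illustrative.

```r
smpl_data <- c(7.2, 8.53, 8.07, 7.99, 7.79, 7.77, 8.9, 7.64, 7.35, 8.45,
               9.14, 7.93, 7.35, 7.52, 7.41, 8.27, 7.55, 7.5, 8.53, 8.37,
               8.17, 8.15, 8.02, 7.63, 7.64, 8.83, 8.17, 7.41, 7.7, 8.21)

alpha <- 0.05
n     <- length(smpl_data)
z     <- (mean(smpl_data) - 8) / (sd(smpl_data) / sqrt(n))

reject_H0 <- z < qnorm(alpha)   # left-tailed rejection rule
p_value   <- pnorm(z)           # P(Z <= z) under H0

z           # approximately -0.294
reject_H0   # FALSE: z is not in the rejection region
p_value     # well above alpha = 0.05
```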

### **Intuitions of Hypothesis Testing**

#### **Hypothesis Testing Looks For Evidence**

Hypothesis testing is based on the philosophy that if an event is unlikely under scenario A, but we still observe it in reality, this serves as evidence against scenario A (calling into question the validity or existence of A).

By convention, the **null hypothesis** is set to represent the idea that "nothing special is happening," while the **alternative hypothesis** is the one that *challenges* this assumption.  

For example:  
- If you want to test whether the average battery life is less than 8 hours, the null hypothesis would be:  
  $$
  H_0: \mu = 8
  $$  
  This is the "nothing special" scenario.  

- If you suspect someone might have malicious intent, the null hypothesis would be:  
  $$
  H_0: \text{No malicious intent}
  $$
  The alternative would be:  
  $$
  H_1: \text{Malicious intent}
  $$  

Of course, if you start observing lots of *suspicious* actions, those observations serve as **evidence against the null**, which may lead you to favour the alternative.



#### **Hypothesis Testing Does Not Prove Truth**

Hypothesis testing cannot establish whether a hypothesis is true or false—it only assesses whether the data provide sufficient evidence to reject the null hypothesis.

Even if we reject the null hypothesis, this does **not** mean that $H_0$ is false. This is because we may still commit a **Type I error**, which occurs when we reject the null hypothesis even though it is actually true.  

Therefore, we should **never say**:  
- "The null hypothesis is wrong."  
- "The alternative hypothesis is correct."
- "We accept the alternative hypothesis."

Instead, say:

- "There is evidence against the null hypothesis."
- "We reject the null hypothesis in favour of the alternative."

The good news is that the probability of a Type I error is something we can control.  Most conventional hypothesis testing procedures (such as Neyman-Pearson or Fisher’s p-value approach) are based on pre-specifying a Type I error probability, often denoted by $\alpha$. A common choice is $\alpha = 0.05$, which serves as the threshold for deciding whether the observed data provide sufficient evidence against $H_0$.  

Back to this example:

$$
\begin{align}
H_0: &\text{ No malicious intent} \\
H_1: &\text{ Malicious intent}
\end{align}
$$

What if we observe no suspicious actions? Does that mean $H_0$ is true? Not necessarily: they may simply be waiting for an opportunity. In hypothesis testing, we can also commit a **Type II error**, which occurs when we fail to reject the null hypothesis even though it is actually false. As a result, failing to reject the null does not imply:

- "The null hypothesis is correct."
- "We accept the null hypothesis."

Instead, we should conclude that
- "There is no (or insufficient) evidence against the null hypothesis."

Unfortunately, there is an inherent trade-off between Type I and Type II errors: for a fixed sample size, lowering one tends to raise the other. The Neyman–Pearson lemma shows how to construct the most powerful test for a given size (i.e., a fixed Type I error rate). This test minimises the Type II error among all tests with that Type I error rate, so once the Type I error rate is fixed, the Type II error cannot be reduced any further. In practice, you first choose the Type I error rate you are willing to tolerate, and then apply Neyman–Pearson to obtain a test that achieves the best possible power against a given alternative.
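The trade-off can be illustrated numerically for the left-tailed battery-life test. The sketch below assumes, for simplicity, that $\sigma$ is known and equal to 0.5, and that the true mean is 7.8 hours; both values and the function name `power_left` are hypothetical. Power is the probability of rejecting $H_0$ when it is false, so the Type II error rate is one minus the power.

```r
# Power of the left-tailed z-test of H0: mu = mu0 when the true mean is mu1,
# assuming sigma is known. H0 is rejected when (xbar - mu0) / (sigma / sqrt(n)) < z_alpha.
power_left <- function(alpha, mu1, mu0 = 8, sigma = 0.5, n = 30) {
  pnorm(qnorm(alpha) + (mu0 - mu1) / (sigma / sqrt(n)))
}

alphas  <- c(0.10, 0.05, 0.01)
power   <- power_left(alphas, mu1 = 7.8)   # hypothetical true mean of 7.8 hours
type_II <- 1 - power

data.frame(alpha = alphas, power = round(power, 3), type_II = round(type_II, 3))
```

As the tolerated Type I error rate decreases down the rows, the Type II error rate increases.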

How to actually construct such tests is beyond the scope of this unit. Instead, we will only state a corollary of the Neyman–Pearson lemma, which shows how to determine the rejection region in the simple case of testing hypotheses about population means.

#### **Only Meaningful Deviation Matters**

We do hypothesis testing not to determine whether $H_0$ is true, but rather to assess whether the data provide evidence of a **meaningful** deviation from the null hypothesis — usually the “nothing is happening” scenario that we care about.


For example, suppose the true mean battery life is 7.999 hours instead of 8. Such a tiny difference is practically irrelevant. What matters is whether the observed data show a meaningful departure from the claimed value of 8 hours. In this case, it is very likely that we will fail to reject the null hypothesis $H_0: \mu = 8$, because the deviation is too small to detect, so the Type II error rate will be high. Is that a problem? Not at all. Failing to reject $H_0$ simply indicates that there is no evidence against the hypothesis that the mean battery life is 8 hours, a value that is very close to the truth.
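To see how small this deviation is in statistical terms, the sketch below computes the probability of rejecting $H_0$ when the true mean is 7.999, together with the approximate sample size a left-tailed z-test would need to detect it with 80% power. It assumes a known $\sigma = 0.5$ (close to the sample value); the 80% target and the variable names are illustrative choices.

```r
alpha <- 0.05
n     <- 30
sigma <- 0.5      # assumed known standard deviation
mu0   <- 8
mu1   <- 7.999    # true mean in this thought experiment
delta <- mu0 - mu1

# Probability of rejecting H0 with n = 30 when the true mean is mu1
power_tiny <- pnorm(qnorm(alpha) + delta / (sigma / sqrt(n)))

# Approximate sample size for 80% power: n = ((z_{1-alpha} + z_{0.8}) * sigma / delta)^2
n_needed <- ceiling(((qnorm(1 - alpha) + qnorm(0.80)) * sigma / delta)^2)

power_tiny   # barely above alpha, so the Type II error rate is about 0.95
n_needed     # more than a million phones
```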


#### **The Danger of Post-hoc Hypotheses**

It is very bad practice to adjust the hypothesis after looking at the data.

Hypothesis testing assumes that the null and alternative hypotheses are specified before collecting or examining the data. Changing your hypothesis after observing the data (sometimes called “data snooping” or “p-hacking”) inflates the Type I error rate and makes your conclusions unreliable.

**Connection to the battery-life example:** suppose you originally want to test

$$
\begin{align}
H_0: \mu &= 8 \\
H_1: \mu &< 8
\end{align}
$$

If you peek at the data and see a mean around 7.95 hours, and then decide to only test a smaller deviation (say $\mu < 7.9$) to get “nicer” results, you are **post-hoc adjusting the hypothesis**.  This biases the test: your Type I error is no longer controlled.

The correct approach: decide in advance what deviation you want to detect (e.g., battery life shorter than 8 hours) and stick with it, regardless of what the observed sample mean turns out to be. If you want to change the hypotheses, you need to collect new data.
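The inflation of the Type I error rate can be seen in a simulation. The sketch below uses a related form of post-hoc adjustment rather than the exact scenario above: the direction of the one-sided alternative is chosen only after seeing on which side of 8 the sample mean falls. The seed, the number of simulations, and the known $\sigma = 0.5$ are assumptions for illustration.

```r
set.seed(2024)         # arbitrary seed for reproducibility
n_sim <- 10000
n     <- 30
mu0   <- 8
sigma <- 0.5           # assumed known for simplicity
alpha <- 0.05

# Simulate sample means with H0 true, then compute the z statistics
xbar <- replicate(n_sim, mean(rnorm(n, mean = mu0, sd = sigma)))
z    <- (xbar - mu0) / (sigma / sqrt(n))

# Pre-specified left-tailed test, H1: mu < 8
rate_planned <- mean(z < qnorm(alpha))

# Post-hoc test: pick the tail after looking at the data, i.e. reject whenever
# z is extreme in whichever direction the sample mean happens to point
rate_posthoc <- mean(abs(z) > qnorm(1 - alpha))

c(planned = rate_planned, post_hoc = rate_posthoc)
```

The pre-specified test rejects a true $H_0$ about 5% of the time, while the post-hoc version rejects roughly twice as often even though each individual test nominally uses $\alpha = 0.05$.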

**Another issue, called multiple testing, can lead to invalid inferences if not properly addressed. This topic will be covered in the next lecture.**

## **Workshop Questions**
Throughout this section, we assume a Type I error rate of 0.05.




### **Question 1**

The following questions are based on the `episodes` dataset. While you are expected to use R to compute the answers, the underlying concepts are identical to those in pen-and-paper hypothesis testing calculations.

In [12]:
episodes = read.csv("./datasets/episodes.csv")
episodes %>% str()
episodes

'data.frame':	704 obs. of  57 variables:
 $ Series                        : chr  "TOS" "TOS" "TOS" "TOS" ...
 $ Series.Name                   : chr  "The Original Series" "The Original Series" "The Original Series" "The Original Series" ...
 $ Season                        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Episode                       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ IMDB.Ranking                  : num  7.3 7.2 7.8 8 7.8 6.9 7.6 7.1 7.5 8.2 ...
 $ Title                         : chr  "The Man Trap" "Charlie X" "Where No Man Has Gone Before" "The Naked Time" ...
 $ Star.date                     : chr  "1513.1" "1533.6" "1312.4" "1704.2" ...
 $ Air.date                      : chr  "8/9/66" "15/9/66" "22/9/66" "29/9/66" ...
 $ Bechdel.Wallace.Test          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Director                      : chr  "Marc Daniels" "Lawrence Dobkin" "James Goldstone" "Marc Daniels" ...
 $ Writer.1                      : chr  "George Clayton Johnson" "Gene Rode

Series,Series.Name,Season,Episode,IMDB.Ranking,Title,Star.date,Air.date,Bechdel.Wallace.Test,Director,⋯,Consulting.Producer.1,Consulting.Producer.2,Female.Executive.Producer,Female.Co.Executive.Producer,Female.Producer,Female.Co.Producer,Female.Associate.Producer,Female.Supervising.Producer,Female.Co.Supervising.Producer,Female.Line.Producer
<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<chr>,<chr>,<lgl>,<chr>,⋯,<chr>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
TOS,The Original Series,1,1,7.3,The Man Trap,1513.1,8/9/66,FALSE,Marc Daniels,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,2,7.2,Charlie X,1533.6,15/9/66,FALSE,Lawrence Dobkin,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,3,7.8,Where No Man Has Gone Before,1312.4,22/9/66,FALSE,James Goldstone,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,4,8.0,The Naked Time,1704.2,29/9/66,FALSE,Marc Daniels,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,5,7.8,The Enemy Within,1672.1,6/10/66,FALSE,Leo Penn,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,6,6.9,Mudd's Women,1329.8,13/10/66,FALSE,Harey Hart,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,7,7.6,What Are Little Girls Made Of?,2712.4,20/10/66,FALSE,James Goldstone,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,8,7.1,Miri,2713.5,27/10/66,FALSE,Vincent McEveety,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,9,7.5,Dagger of the Mind,2715.1,3/11/66,FALSE,Vincent McEveety,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
TOS,The Original Series,1,10,8.2,The Corbomite Maneuver,1512.2,10/11/66,FALSE,Joseph Sargent,⋯,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE


#### **Question 1.1**

Is there evidence that the average IMDB rating of Star Trek: The Original Series episodes is lower than 7.7? Interpret the results for a non-statistician stakeholder.

In [11]:
# H0: mu = 7.7, H1: mu < 7.7 (left-tailed test)
# Significance level alpha = 0.05, so the critical value is Z0 = qnorm(0.05)

episodes %>%
  filter(Series == "TOS") %>%   # restrict to The Original Series
  summarise(Z0 = qnorm(0.05), Z = ((mean(IMDB.Ranking) - 7.7) / (sd(IMDB.Ranking) / sqrt(n())) ) )

# As Z < Z0, there is evidence against the null hypothesis H0. We reject H0 in favour of
# the alternative hypothesis that the average IMDB rating of TOS episodes is below 7.7.




<details>
<summary>▶️ Click to show the solution</summary>

```r

#H0: mu = 7.7
#H1: mu < 7.7

episodes %>%
  filter(Series == "TOS") %>%
  summarise(avgIMDB = mean(IMDB.Ranking),
            sdIMDB = sd(IMDB.Ranking),
            n = n(),
            mu_H0 = 7.7,
            z_05 = qnorm(0.05)) %>%
  mutate(z_obs = (avgIMDB - mu_H0)/(sdIMDB/sqrt(n)))

#As z_obs < z_05, at the 5% significance level, there is evidence against the null hypothesis that the average IMDB rating of TOS is 7.7.
#We reject the null hypothesis in favour of the alternative that the average rating of TOS is lower than 7.7.

```

</details>


#### **Question 1.2**


Is there evidence that the mean IMDB rating of episodes of Star Trek: The Original Series differs from the mean rating of Star Trek: The Next Generation? Interpret the results for a non-statistician stakeholder.

In [35]:
# H0: mu_TOS = mu_TNG
# H1: mu_TOS != mu_TNG

episodes %>%
  filter(Series %in% c("TOS", "TNG")) %>%
  group_by(Series) %>%
  summarise(mean = mean(IMDB.Ranking), sd = sd(IMDB.Ranking), n = n()) -> seriesStats

TOS_Statistics = filter(seriesStats, Series == "TOS")
TNG_Statistics = filter(seriesStats, Series == "TNG")

# Two-sided test: the critical value is the upper alpha/2 quantile
Z0 = qnorm(0.975)
# The standard error uses the variances (sd^2), not the standard deviations
Z = (TOS_Statistics$mean - TNG_Statistics$mean) / sqrt((TOS_Statistics$sd^2/TOS_Statistics$n) + (TNG_Statistics$sd^2/TNG_Statistics$n))
Z0
Z
# Decision rule: reject H0 if |Z| > Z0.
# If |Z| <= Z0, there is insufficient evidence against H0; this does not show that the two means are equal.



<details>
<summary>▶️ Click to show the solution</summary>

```r

#H0: mu_TOS - mu_TNG = 0
#H1: mu_TOS - mu_TNG != 0

episodes %>%
  filter(Series %in% c("TOS", "TNG")) %>%
  group_by(Series) %>%
  summarise(avgIMDB = mean(IMDB.Ranking),
            varIMDB = var(IMDB.Ranking),
            n = n()) -> summaryStats

TNG_stats = summaryStats %>% filter(Series == "TNG")
TOS_stats = summaryStats %>% filter(Series == "TOS")

z_obs = (TOS_stats$avgIMDB - TNG_stats$avgIMDB)/sqrt(TOS_stats$varIMDB/TOS_stats$n + TNG_stats$varIMDB/TNG_stats$n)
z_975 = qnorm(0.975)

abs(z_obs) > z_975

#Decision rule: if abs(z_obs) < z_975, then at the 5% significance level there is no evidence against the null hypothesis that
#the average rating of TOS is equal to that of TNG, and we do not reject the null hypothesis.

```

</details>


#### **Question 1.3**

Is there evidence that the proportion of Star Trek: The Next Generation episodes that pass the Bechdel-Wallace Test is different from 0.4? Interpret the results for a non-statistician stakeholder.


**Note**: While these series have ended and you technically have the full “population” of some values (e.g., results of the Bechdel-Wallace test), we still ask you to test whether or not the proportion of episodes that pass the test is equal to 0.4. This may seem counter-intuitive, but you can think of it as follows:

- The episode test results are treated as realisations from an unknown probability distribution $f$ (here, Bernoulli(p)).
- Although the episodes are released, we are interested in the underlying process that generates these values. This includes not-yet-released episodes or hypothetical similar episodes. Simply examining the “complete” population of Bechdel test results is not sufficient; instead, we rely on a statistical model to quantify uncertainty.

In [81]:
# H0: p = 0.4
# H1: p != 0.4
TNG_episodes = filter(episodes, Series == "TNG")
n = nrow(TNG_episodes)
p = 0.4
# Proportion of TNG episodes that pass the test (mean of a logical column)
p_hat = mean(TNG_episodes$Bechdel.Wallace.Test)

Z0 = qnorm(0.975)
Z = (p_hat - p) / sqrt(p*(1-p)/n)
Z0
Z

# As |Z| < Z0, there is no evidence against the null hypothesis H0 at the 5% significance level.
# We cannot conclude that the proportion of episodes passing the Bechdel-Wallace Test differs from 0.4.



<details>
<summary>▶️ Click to show the solution</summary>

```r

#H0: p = 0.4
#H1: p != 0.4

episodes %>%
  filter(Series == "TNG") %>%
  summarise(p_hat = mean(Bechdel.Wallace.Test),
            n = n(),
            p_H0 = 0.4,
            z_975 = qnorm(0.975)) %>%
  mutate(z_obs = abs((p_hat-p_H0)/sqrt(p_H0*(1-p_H0)/n)))


#As z_obs < z_975 (z_obs is already an absolute value), at the 5% significance level, there is no evidence against the null hypothesis that
#the proportion of TNG episodes that pass the Bechdel-Wallace Test is 0.4. Do not reject the null hypothesis.

```

</details>
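As a cross-check on the one-sample proportion statistic, base R's `prop.test()` with `correct = FALSE` reports an X-squared statistic equal to the square of the $z$ statistic from the reference table, and the same two-sided p-value. The sketch below uses hypothetical counts (45 successes out of 100), not values from the `episodes` dataset.

```r
# Hypothetical counts, not taken from the episodes dataset
x  <- 45      # number of "successes"
n  <- 100     # sample size
p0 <- 0.4     # null value

p_hat <- x / n
z_obs <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Built-in test without continuity correction
pt <- prop.test(x, n, p = p0, alternative = "two.sided", correct = FALSE)

c(z_squared = z_obs^2, chi_squared = unname(pt$statistic))
c(p_manual = 2 * pnorm(-abs(z_obs)), p_builtin = pt$p.value)
```

Both pairs of numbers should agree up to floating-point error.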


#### **Question 1.4**


Is there evidence that the proportion of episodes that pass the Bechdel-Wallace Test differs between Star Trek: The Original Series and Star Trek: The Next Generation? Interpret the results for a non-statistician stakeholder.

In [70]:
#H0: p_TNG = p_TOS
#H1: p_TNG != p_TOS

episodes %>%
  filter(Series %in% c("TNG", "TOS")) %>%
  group_by(Series) %>%
  summarise(p_hat = mean(Bechdel.Wallace.Test), n = n()) -> summarisedEpisodes

TNG_statistics = filter(summarisedEpisodes, Series == "TNG")
TOS_statistics = filter(summarisedEpisodes, Series == "TOS")

# Two-sided test: compare |Z| with the upper alpha/2 quantile
Z0 = qnorm(0.975)
Z = (TNG_statistics$p_hat - TOS_statistics$p_hat) / sqrt((TNG_statistics$p_hat*(1-TNG_statistics$p_hat) / TNG_statistics$n) + (TOS_statistics$p_hat*(1-TOS_statistics$p_hat) / TOS_statistics$n))

Z0
Z
# As |Z| > Z0, there is evidence against the null hypothesis H0 at the 5% significance level.
# There is evidence that the proportion of episodes that pass the Bechdel-Wallace Test differs between TNG and TOS.



<details>
<summary>▶️ Click to show the solution</summary>

```r

#H0: p_TOS - p_TNG = 0
#H1: p_TOS - p_TNG != 0

episodes %>%
  filter(Series %in% c("TOS", "TNG")) %>%
  group_by(Series) %>%
  summarise(p_hat = mean(Bechdel.Wallace.Test),
            n = n()) -> summaryStats

TNG_stats = summaryStats %>% filter(Series == "TNG")
TOS_stats = summaryStats %>% filter(Series == "TOS")

z_obs = (TOS_stats$p_hat - TNG_stats$p_hat)/sqrt(TOS_stats$p_hat*(1-TOS_stats$p_hat)/TOS_stats$n + TNG_stats$p_hat*(1-TNG_stats$p_hat)/TNG_stats$n)
z_975 = qnorm(0.975)

abs(z_obs) > z_975

#As abs(z_obs) > z_975, at the 5% significance level, there is evidence against the null hypothesis that
#the proportion of episodes passing the Bechdel test of TOS is equal to that of TNG. Reject the null hypothesis in favour of the
#alternative that these two proportions are different.

```

</details>


### **Question 2**

The following questions are based on the `epa_data` dataset. While you are expected to use R to compute the answers, the underlying concepts are identical to those in pen-and-paper hypothesis testing calculations.


In [4]:
epa_data = read.csv("./datasets/epa_data.csv")
epa_data %>% str()

'data.frame':	13569 obs. of  9 variables:
 $ city : int  16 15 16 19 19 19 19 19 19 19 ...
 $ hwy  : int  24 22 22 27 29 24 26 27 29 24 ...
 $ cyl  : int  8 8 8 4 4 4 4 4 4 4 ...
 $ disp : num  5 5 5 2 2 2.4 2.4 2 2 2.4 ...
 $ drive: chr  "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" ...
 $ make : chr  "Jaguar" "Jaguar" "Jaguar" "Pontiac" ...
 $ model: chr  "XK" "XK" "XK Convertible" "Solstice" ...
 $ trans: chr  "Automatic" "Automatic" "Automatic" "Automatic" ...
 $ year : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...


#### **Question 2.1**

Is there evidence of a difference in the average city mileage between cars manufactured in 2015 and 2020?  Interpret the results for a non-statistician stakeholder.

In [5]:
#H0: 2015Mean = 2020Mean
#H1: 2015Mean != 2020Mean

epa_data %>%
  filter(year %in% c("2015", "2020")) %>%
  group_by(year) %>%
  summarise(
    mean = mean(city),
    sd = sd(city),
    n = n()
  ) -> epa_data_summarised

city_2015 = filter(epa_data_summarised, year == "2015")
city_2020 = filter(epa_data_summarised, year == "2020")

Z0 = qnorm(0.975)
Z = (city_2015$mean - city_2020$mean) / sqrt((city_2015$sd^2 / city_2015$n) + (city_2020$sd^2 / city_2020$n))
Z0
Z
# As |Z| > Z0, at the 5% significance level, there is evidence against the null hypothesis H0.
# There is evidence of a difference between the average city mileage of cars manufactured in 2015 and in 2020.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 2.2**

Is there evidence that the proportion of cars produced with manual transmissions in 2010 is greater than 0.5? Interpret the results for a non-statistician stakeholder.

In [6]:
# H0: p_2010 = 0.5
# H1: p_2010 > 0.5
epa_data %>%
  filter(year == "2010") %>%
  mutate(manual = ifelse(trans == "Manual", 1, 0)) %>%
  summarise(
    p_hat = mean(manual),
    n = n()
  ) -> epa_summary
p = 0.5

Z0 = qnorm(0.95)
Z = (epa_summary$p_hat - p) / sqrt(p*(1-p)/epa_summary$n)

Z0
Z
# As Z < Z0, at the 5% significance level, there is no evidence against the null hypothesis H0.
# We cannot conclude that the proportion of cars produced with manual transmissions in 2010 is greater than 0.5.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 2.3**

Is there evidence that the proportion of cars produced with manual transmissions decreased from 2011 to 2012? Interpret the results for a non-statistician stakeholder.

In [7]:
# H0: p2012 = p2011
# H1: p2012 < p2011
# 5% significance level (alpha = 0.05)
# left-tailed test

Zscore = qnorm(0.05)

epa_data %>%
  mutate(manual = ifelse(trans == "Manual",1, 0)) -> epa_data_with_manual

epa_data_with_manual %>%
  filter(year == "2012") %>%
  summarise(phat = mean(manual), n = n()) -> summary2012

epa_data_with_manual %>%
  filter(year == "2011") %>%
  summarise(phat = mean(manual), n = n()) -> summary2011

SE = sqrt((summary2012$phat*(1-summary2012$phat)/summary2012$n) + (summary2011$phat*(1-summary2011$phat)/summary2011$n))
Z = (summary2012$phat - summary2011$phat) / SE

Zscore
Z
# As Z > Zscore, Z is not in the rejection region, so there is no evidence against the null hypothesis.
# We cannot conclude that the proportion of cars produced with manual transmissions decreased from 2011 to 2012.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>
