### **15.1** EPO EXPERIMENT WITH REPEATED MEASURES

- Any observed mean difference between two samples might result from uncontrolled variables rather than from the effect being tested, if that effect exists at all.
- To minimize the influence of these variables, investigators can use repeated measures: each subject is measured under both the treatment condition and the control condition, so that the difference within each subject can be examined.

#### **Difference (D) Scores** 

- Definition:
    + The arithmetic difference between each pair of scores in repeated measures or, more generally, in two related samples.

<center><b>DIFFERENCE SCORE (D)</b></center> <br>
<center>$\Large D = X_1 - X_2$</center>
Where: <br>
+ $X_1, X_2$: paired scores for each subject measured twice. <br>
+ $X_1$: the score of the subject that was assigned to the treatment group. <br>
+ $X_2$: the score of the subject that was assigned to the control group.

#### **Mean Difference Score ($\overline{D}$)** 

<center><b>MEAN DIFFERENCE SCORE ($\overline{D}$)</b></center> <br>
<center>$\Large \overline{D} = \frac{\sum{D}}{n}$</center>
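As a minimal sketch of the two formulas above, the following computes difference scores and their mean. The paired scores are hypothetical, chosen only for illustration.

```python
# Hypothetical paired scores: each position is one subject measured twice.
x1 = [12, 15, 11, 14, 13]  # scores under the treatment condition (X_1)
x2 = [10, 14, 12, 11, 12]  # scores under the control condition (X_2)

# Difference score for each pair: D = X_1 - X_2
d_scores = [a - b for a, b in zip(x1, x2)]

# Mean difference score: D-bar = sum(D) / n
d_bar = sum(d_scores) / len(d_scores)

print(d_scores)
print(d_bar)
```

A positive $\overline{D}$ here means that, on average, scores were higher under the treatment condition.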

#### **Comparing the Two Experiments** 

- Difference scores from repeated measures usually have smaller variability than scores from two independent samples. The resulting smaller standard error gives a more precise test, so a true effect is more likely to lead to rejection of the null hypothesis (and to a smaller p-value).
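To illustrate this comparison, the sketch below computes the estimated standard error two ways for the same hypothetical scores: once treating them as related samples and once ignoring the pairing. The data are invented, and the advantage shown depends on the paired scores being positively correlated, as they are here.

```python
import math
import statistics

# Hypothetical scores for five subjects, each measured under both conditions.
treatment = [12, 15, 11, 14, 13]
control = [10, 14, 12, 11, 12]
n = len(treatment)

# Related samples: standard error of the mean difference, s_D / sqrt(n).
d_scores = [t - c for t, c in zip(treatment, control)]
se_related = statistics.stdev(d_scores) / math.sqrt(n)

# Independent samples (pairing ignored): pooled variance, then
# sqrt(s_p^2 / n1 + s_p^2 / n2). With equal sample sizes the pooled
# variance is the average of the two sample variances.
pooled_var = (statistics.variance(treatment) + statistics.variance(control)) / 2
se_independent = math.sqrt(pooled_var / n + pooled_var / n)

print(round(se_related, 3), round(se_independent, 3))
```

Because each subject's two scores tend to rise and fall together, subtracting them removes much of the subject-to-subject variability, which is why `se_related` comes out smaller.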

#### **Repeated Measures** 

- Definition:
    + Whenever the same subject is measured more than once.

#### **Two Related Samples**

- Definition:
    + Each observation in one sample is paired, on a one-to-one basis, with a single observation in the other sample.

- Repeated measures might not always be feasible since, as discussed below, several potential complications must be resolved before measuring subjects twice. An investigator still might choose to use two related samples by matching pairs of different subjects in terms of some uncontrolled variable that appears to have a considerable impact on the dependent variable.

#### **Progress Check 15.1**

(a) Two independent samples.

(b) Two related samples, repeated measures.

(c) Two related samples, matched pairs.

(d) Two related samples, matched pairs.

#### **Some Complications with Repeated Measurements**

#### **Counterbalancing**

- Definition:
    + Reversing the order of conditions for equal numbers of all subjects.
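One way to set up counterbalancing is sketched below: subjects are shuffled and split into two equal halves, and the second half receives the conditions in reverse order. The function name `counterbalance`, the condition labels, and the subject identifiers are all hypothetical.

```python
import random

def counterbalance(subjects, seed=None):
    """Randomly split subjects into two equal halves with reversed condition orders."""
    if len(subjects) % 2 != 0:
        raise ValueError("an even number of subjects is needed for equal halves")
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half: treatment then control; second half: control then treatment.
    orders = {s: ("treatment", "control") for s in shuffled[:half]}
    orders.update({s: ("control", "treatment") for s in shuffled[half:]})
    return orders

# Hypothetical subject identifiers.
assignment = counterbalance(["S1", "S2", "S3", "S4", "S5", "S6"], seed=0)
for subject in sorted(assignment):
    print(subject, assignment[subject])
```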

### **15.2** STATISTICAL HYPOTHESES

#### **Null Hypothesis**

#### **Alternative (or Research) Hypothesis**

#### **Two Other Possible Alternative Hypotheses**

### **15.3** SAMPLING DISTRIBUTION OF $\overline{D}$

<center><b>MEAN OF ALL SAMPLE MEAN DIFFERENCE SCORES</b></center> <br>
<center>$\Large \mu_{\overline{D}} = \mu_D$</center>
Where: <br>
+ $\mu_{\overline{D}}$: the mean of mean difference scores for all possible samples. <br>
+ $\mu_D$: the mean of the population of difference scores.

<center><b>STANDARD ERROR OF ALL SAMPLE MEAN DIFFERENCE SCORES</b></center> <br>
<center>$\Large \sigma_{\overline{D}} = \frac{\sigma_D}{\sqrt{n}}$</center>
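The two formulas above can be checked informally by simulation. The sketch below draws many samples of difference scores from a normal population with hypothetical parameters ($\mu_D = 2$, $\sigma_D = 5$, $n = 25$) and compares the mean and standard deviation of the sample means with $\mu_D$ and $\sigma_D / \sqrt{n}$. It is an illustration under these assumed values, not a proof.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
mu_d, sigma_d = 2.0, 5.0  # hypothetical population of difference scores
n = 25                    # pairs per sample
n_samples = 5000          # number of simulated samples

# Draw many samples of n difference scores and record each sample's mean.
sample_means = [
    statistics.fmean([rng.gauss(mu_d, sigma_d) for _ in range(n)])
    for _ in range(n_samples)
]

print("mean of sample means:", round(statistics.fmean(sample_means), 2))
print("sd of sample means:  ", round(statistics.pstdev(sample_means), 2))
print("sigma_D / sqrt(n):   ", sigma_d / math.sqrt(n))
```

The first two printed values should land close to 2.0 and 1.0 respectively, with small deviations due to sampling.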

### **15.4** t TEST

<center><b>t RATIO FOR TWO POPULATION MEANS (TWO RELATED SAMPLES)</b></center> <br>
<center>$\Large t = \frac{\overline{D} \,-\, \mu_{\text{hyp}}}{s_{\overline{D}}}$</center>

#### **Finding Critical t Values**

#### **Summary for EPO Experiment**

- Repeated measures eliminates one important source of variability: the variability due to individual differences. When present, this variability inflates the standard error, which stretches the sampling distribution horizontally and increases the probability of a type II error ($\beta$).

- It is important to mention the use of repeated measures (or any matching) in the conclusion of the report.

### **15.5** DETAILS: CALCULATIONS FOR t TEST

<center><b>SAMPLE STANDARD DEVIATION, $s_D$</b></center> <br>
<center>$\Large s_D = \sqrt{\frac{SS_D}{n-1}}$</center>

<center><b>ESTIMATED STANDARD ERROR, $s_{\overline{D}}$</b></center> <br>
<center>$\Large s_{\overline{D}} = \frac{s_D}{\sqrt{n}}$</center>
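The calculation chain from difference scores to the t ratio can be sketched as a single function. The name `paired_t` and the example scores are hypothetical; this stands in for, and is not, the notebook's `utils.HypothesisTest` helpers.

```python
import math

def paired_t(x1, x2, mu_hyp=0.0):
    """Return (d_bar, s_d, se, t) for two related samples of equal length."""
    if len(x1) != len(x2):
        raise ValueError("related samples must have the same number of scores")
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    d_bar = sum(d) / n
    ss_d = sum((value - d_bar) ** 2 for value in d)  # sum of squares, SS_D
    s_d = math.sqrt(ss_d / (n - 1))                  # sample standard deviation, s_D
    se = s_d / math.sqrt(n)                          # estimated standard error
    t = (d_bar - mu_hyp) / se                        # t ratio
    return d_bar, s_d, se, t

# Hypothetical paired scores for six subjects.
d_bar, s_d, se, t = paired_t([8, 6, 9, 7, 10, 8], [6, 5, 9, 4, 8, 7])
print(round(d_bar, 2), round(s_d, 2), round(se, 2), round(t, 2))
```

The resulting t is compared with the critical t for df = n - 1, where n is the number of pairs.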

#### **Progress Check 15.2**

Research Problems: <br>
$\;\;\;\;$When subjects are matched in pairs for home environment, and the members of each pair are randomly assigned to receive either real or fake vitamin C, does vitamin C reduce the severity of common colds? <br>
Statistical Hypotheses: <br>
$\;\;\;\;$$H_0: \mu_D \ge 0$ <br>
$\;\;\;\;$$H_1: \mu_D \lt 0$ <br>
Decision Rule: <br>
$\;\;\;\;$Reject the null hypothesis at the 0.05 level of significance if $t \le -1.833$ given that df = n - 1 = 9. <br>
Calculations: <br>

In [1]:
import numpy as np
import os
import sys

sys.path.append(os.path.join(sys.path[0], os.path.pardir))

import utils.HypothesisTest as HT

In [5]:
t_group = np.array([2, 5, 7, 0, 3, 7, 4, 5, 1, 3])
c_group = np.array([3, 4, 9, 3, 5, 7, 6, 8, 2, 5])
diff_scores = t_group - c_group
std_error = HT.estimated_stderror(diff_scores, 10, "1 sample")
t = HT.calc_tscore(diff_scores.mean(), 0, std_error)
t

-3.75

Decision: <br>
$\;\;\;\;$Reject $H_0$ at the 0.05 level of significance because the calculated t of -3.75 is less than -1.833. <br>
Interpretation: <br>
$\;\;\;\;$There is evidence that vitamin C reduces the severity of common colds when subjects are matched in pairs for home environment.
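As a cross-check that does not depend on the `utils` helpers, the same calculation can be sketched with the standard library. Without intermediate rounding the t ratio is about -3.74; the notebook's -3.75 is what results if the standard error is rounded to 0.40 before dividing. Whether `utils.HypothesisTest` rounds in that way is an assumption, since its source is not shown here. Either value leads to the same decision.

```python
import math
import statistics

# Data from the calculation cell above.
t_group = [2, 5, 7, 0, 3, 7, 4, 5, 1, 3]
c_group = [3, 4, 9, 3, 5, 7, 6, 8, 2, 5]

d = [t - c for t, c in zip(t_group, c_group)]
d_bar = statistics.mean(d)            # mean difference score
s_d = statistics.stdev(d)             # sample standard deviation (n - 1)
se = s_d / math.sqrt(len(d))          # estimated standard error
t_unrounded = d_bar / se              # t ratio with mu_hyp = 0
t_rounded_se = d_bar / round(se, 2)   # same ratio after rounding the standard error

print(d_bar, round(s_d, 2), round(se, 2))
print(round(t_unrounded, 2), round(t_rounded_se, 2))
```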

### **15.6** ESTIMATING EFFECT SIZE

#### **Confidence Interval for $\mu_D$**

<center><b>CONFIDENCE INTERVAL FOR $\mu_D$ (TWO RELATED SAMPLES)</b></center> <br>
<center>$\Large \overline{D} \pm (t_{\text{conf}})(s_{\overline{D}})$</center>

#### **Finding $t_{\text{conf}}$**

#### **Interpreting Confidence Intervals for $\mu_D$**

#### **Progress Check 15.3**

In [6]:
import utils.ConfidenceInterval as CI

In [11]:
interval = CI.confidence_interval(diff_scores.mean(), 2.262, std_error)
interval

array([-2.4, -0.6])

Interpretation: <br>
$\;\;\;\;$When subjects are matched for home environment, we can be 95% confident that the interval between -2.4 and -0.6 includes the true mean difference in the severity of common colds, which corresponds to a reduction with vitamin C.
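The interval arithmetic can be sketched directly from the formula $\overline{D} \pm (t_{\text{conf}})(s_{\overline{D}})$. The function name `paired_confidence_interval` is hypothetical and is not the notebook's `CI.confidence_interval`; the standard error of 0.40 is the rounded value assumed for this example.

```python
def paired_confidence_interval(d_bar, t_conf, se):
    """Return (lower, upper) limits: d_bar minus and plus t_conf * se."""
    margin = t_conf * se
    return d_bar - margin, d_bar + margin

# Values from the vitamin C example: D-bar = -1.5, t_conf = 2.262 (df = 9),
# and an estimated standard error of 0.40 (rounded, as assumed here).
lower, upper = paired_confidence_interval(-1.5, 2.262, 0.40)
print(round(lower, 2), round(upper, 2))
```

Because both limits are negative, the interval excludes zero, which agrees with the rejection of the null hypothesis.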

#### **Standardized Effect Size, Cohen's d**

<center><b>STANDARDIZED EFFECT SIZE, COHEN'S d (TWO RELATED SAMPLES)</b></center> <br>
<center>$\Large d = \frac{\overline{D}}{s_D}$</center>

#### **Progress Check 15.4**

In [13]:
d = -1.5 / 1.27
d

-1.1811023622047243
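The same effect size can be sketched from the difference scores themselves rather than from rounded summary values. The key detail is that $s_D$ is the sample standard deviation, with $n - 1$ in the denominator.

```python
import statistics

# Difference scores from the vitamin C example (treatment minus control).
d_scores = [-1, 1, -2, -3, -2, 0, -2, -3, -1, -2]

d_bar = statistics.mean(d_scores)
s_d = statistics.stdev(d_scores)   # divides SS_D by n - 1
cohens_d = d_bar / s_d

# statistics.pstdev divides by n instead and would overstate the effect size.
print(round(s_d, 2), round(cohens_d, 2))
```

With NumPy arrays the equivalent is `np.std(d_scores, ddof=1)`, since `np.std` defaults to `ddof=0`.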

### **15.7** ASSUMPTIONS

### **15.8** OVERVIEW: THREE t TESTS FOR POPULATION MEANS

![image.png](attachment:05c0ffd9-97e6-4dde-ab16-d2b76a3dd14c.png)

#### **One or Two Samples**

#### **Are the Two Samples Paired**

#### **Examples**

#### **Study A**

#### **Study B**

#### **Study C**

#### **Progress Check 15.5**

(a) Because there is no indication of any pairing and the subjects are split into two separate groups, the appropriate test is the t test for two independent samples.

(b) The subjects are matched in pairs on biological characteristics and divided into two groups, so the appropriate test is the t test for two related samples.

(c) 

(d) t test for one sample.

(e) t test for two related samples, repeated measures.

### **15.9** t TEST FOR POPULATION CORRELATION COEFFICIENT, $\rho$

- Definition:
    + $\rho$ (population correlation coefficient): A number between -1.00 and 1.00 that describes the linear relationship between pairs of quantitative variables for some population.

- The t test for $\rho$ examines whether the correlation coefficient observed in a sample is statistically significant, in order to decide whether a relationship exists between the pairs of variables in the population.

#### **Null Hypothesis**

- The null hypothesis in this t test always assumes $\rho = 0$.

#### **Focus on Relationships Instead of Mean Difference**

- As mentioned above, this t test asks whether a relationship exists in the population, based on the correlation coefficient observed in the sample; any mean difference between the two variables is ignored.

#### **t Test**

<center><b>t RATIO FOR A SINGLE POPULATION CORRELATION COEFFICIENT</b></center> <br>
<center>$\Large t = \frac{r - \rho_{\text{hyp}}}{\sqrt{\frac{1 - r^2}{n - 2}}}$</center>
Where: <br>
+ $r$: the observed correlation coefficient for the pairs of scores in the sample. <br>
+ $r^2$: the proportion of variability in one variable that is predictable from its relationship with the other variable. <br>
+ $\rho_{\text{hyp}}$: the hypothesized population correlation coefficient (zero under the null hypothesis). <br>
+ $n$: the number of pairs in the sample.
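A minimal sketch of this ratio as a function follows. The name `t_for_r` and the example values are hypothetical; the notes use this test only with $\rho_{\text{hyp}} = 0$, so that is the default.

```python
import math

def t_for_r(r, n, rho_hyp=0.0):
    """t ratio for a sample correlation r based on n pairs (df = n - 2)."""
    if n <= 2:
        raise ValueError("at least three pairs are needed")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    se_r = math.sqrt((1.0 - r ** 2) / (n - 2))  # estimated standard error of r
    return (r - rho_hyp) / se_r

# Hypothetical sample: r = 0.50 from 27 pairs.
print(round(t_for_r(0.50, 27), 2))
```

The result is compared with the critical t for df = n - 2.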

#### **Importance of Sample Size**

- The precision of the generalization to the population increases with sample size: the estimated standard error of $r$ shrinks as $n$ grows, so with a very large sample even a weak correlation can be statistically significant.
- Where possible, an investigator should consult power curves or the desired limits of the confidence interval when choosing a sample size.
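The effect of sample size on this test can be seen by holding $r$ fixed and varying $n$. The value $r = 0.20$ and the sample sizes below are hypothetical.

```python
import math

r = 0.20  # hypothetical sample correlation, held fixed

results = []
for n in (10, 30, 100, 1000):
    se_r = math.sqrt((1 - r ** 2) / (n - 2))  # shrinks as n grows
    results.append((n, se_r, r / se_r))       # t ratio under rho_hyp = 0

for n, se_r, t in results:
    print(f"n = {n:4d}  standard error = {se_r:.3f}  t = {t:.2f}")
```

The same weak correlation yields a steadily larger t as the number of pairs increases, which is why statistical significance alone says little about the strength of a relationship.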

#### **Progress Check 15.6**

In [2]:
std_error = HT.estimated_stderror(sample_size=25, pearson_r=0.43, test_type="correlation coefficient")
t = HT.calc_tscore(0.43, 0, std_error)
t

2.263

- Based on a t test, there is sufficient evidence of a correlation between educational level and annual income in the population of California taxpayers.

#### **Assumptions**

- This t test assumes that the relationship between the two variables can be described by a straight line and that the sample comes from a bivariate normal distribution, which implies that each variable is itself normally distributed. If these assumptions are violated, the result should be interpreted with caution.

### **Review Questions**

In [1]:
import numpy as np
import os
import sys

sys.path.append(os.path.join(sys.path[0], os.path.pardir))

import utils.HypothesisTest as HT
import utils.ConfidenceInterval as CI

#### **15.7**

(a) <br>
- Research Problem: <br>
Does physical exercise cause an increase in the mean GPAs of students, given that pairs of students are originally matched for their GPAs? <br>
- Statistical Hypotheses: <br>
$H_0: \mu_D \le 0$ <br>
$H_1: \mu_D \gt 0$ <br>
- Decision Rule: <br>
Reject the null hypothesis at the 0.01 level of significance if $t \ge 3.143$, given df = 7 - 1 = 6. <br>
- Calculations:

In [3]:
t_group = np.array([4.00, 2.67, 3.65, 2.11, 3.21, 3.60, 2.80])
c_group = np.array([3.75, 2.74, 3.42, 1.67, 3.00, 3.25, 2.65])
diff = t_group - c_group
std_error = HT.estimated_stderror(data=diff, sample_size=7, test_type="1 sample")
t_score = HT.calc_tscore(diff.mean(), 0, std_error)
t_score

3.714

- Decision: <br>
Reject the null hypothesis at the 0.01 level of significance because t = 3.714 exceeds 3.143. <br>
- Interpretation: <br>
There is sufficient evidence that regular physical exercise improves academic achievement in the population of college students.

(b) p < 0.01

(c) d = 1.39

In [None]:
d = diff.mean() / diff.std(ddof=1)  # ddof=1 gives the sample standard deviation s_D (n - 1)
round(float(d), 2)

1.39

(d) On average, academic achievement for the group of college students who exercise regularly ($\overline{X}$ = 3.15, s = 0.66) significantly exceeds that for the group of college students who do not exercise ($\overline{X}$ = 2.93, s = 0.67) according to a t test [t(6) = 3.714, p < 0.01, d = 1.39], when pairs of students are matched for their original GPAs.

#### **15.8**

(a)

In [16]:
c_group = np.array([28, 29, 31, 44, 35, 20, 50, 25])
t_group = np.array([26, 27, 32, 44, 35, 16, 47, 23])
diff = t_group - c_group
std_error = HT.estimated_stderror(data=diff, sample_size=8)
t_score = HT.calc_tscore(diff.mean(), 0, std_error)
t_score

-2.5

(b) p < 0.05

(c)

In [18]:
ci = CI.confidence_interval(diff.mean(), 2.365, std_error)
ci

array([-2.92, -0.08])

Interpretation: <br>
We can be 95% confident that the interval includes the true mean reduction in cigarette consumption for the population of teenage smokers during the month after the antismoking film presentation.

In [19]:
d = diff.mean() / diff.std(ddof=1)  # ddof=1 gives the sample standard deviation s_D (n - 1)
round(float(d), 2)

-0.89

Interpretation: <br>
The value of d = -0.89 indicates a large estimated effect size (beyond 0.8 in absolute value by Cohen's guidelines), which is consistent with the decision to reject the null hypothesis.

(d) Even though there is sufficient evidence to reject the null hypothesis, the upper limit of the confidence interval (-0.08) shows that the true mean reduction could be very small. If the investigator judges the result to be insufficiently precise, another study with a larger sample size, chosen with the aid of power curves or by working back from the desired confidence interval limits, should be conducted.

#### **15.9**

(a)

In [20]:
t_score = HT.calc_tscore(2.12, 0, 1.5)
t_score

1.413

Retain the null hypothesis because t = 1.413 is less than the critical value of 1.699.

(b) p > 0.05.

(c) Several uncontrolled factors should be taken into account and resolved before conducting an experiment such as this one, including:
+ The speed at which the drivers drive the cars.
+ The types of car that are used in the experiment.
+ The types of road on which the cars travel.
+ The weather condition under which the experiment is conducted.

If these problems are not addressed, the results of the experiment cannot be taken seriously, because the influence of the uncontrolled variables is unknown.

#### **15.10**

(a)

In [23]:
std_error = 66.33 / np.sqrt(12)
t_score = HT.calc_tscore(51.33, 0, std_error)
t_score

2.681

(b) p < 0.05.

(c) Yes.

(d) We can be 95% confident that the obtained interval includes the true mean difference in running times of the athletes after and before taking the treatment.

In [24]:
ci = CI.confidence_interval(51.33, 2.201, std_error)
ci

array([ 9.19, 93.47])

(e) The value of Cohen's d, 0.77, falls just below Cohen's guideline of 0.8 for a large effect, indicating a medium-to-large estimated effect.

In [26]:
d = HT.calc_cohensd(51.33, 0, 66.33**2)
d

0.77

(f)

(g) Counterbalancing minimizes the effect of an uncontrolled variable: the athletes' beliefs that receiving the treatment could enhance or hinder their performance.

(h) Twenty-four hours is a short interval for such a repeated measure. Even if that amount of time were scientifically shown to be sufficient for the effect of the treatment to dissipate, it is still not a reasonable recovery period for athletes before another long-distance run, with or without blood doping.

#### **15.11**

Although the subjects share a characteristic (they are all first-year college students), there is no evidence of a one-to-one relationship between any pairs of students, so the t test for two related samples is not appropriate: individual differences remain uncontrolled.

#### **15.12**

The t test for two independent samples, even when applied to the same set of subjects, will produce a larger standard error and thus increase the probability of a type II error.

#### **15.13**

t = -1.5, p > 0.05, so there is no evidence of a relationship between test scores and the amount of time students spent taking the test.

In [8]:
estimated_stderror = HT.estimated_stderror(sample_size=38, pearson_r=-0.24, test_type="correlation coefficient")
tscore = HT.calc_tscore(-0.24, 0, estimated_stderror)
tscore

-1.5

#### **15.14**

(a) t score  = 2.486

In [10]:
data = [28, 53, 17, 37, 27, 27, 22, -25, -7, 0]
estimated_stderror = HT.estimated_stderror(data=data, sample_size=len(data), test_type="2 related samples")
tscore = HT.calc_tscore(np.mean(data), 0, estimated_stderror)
tscore

2.486

(b) p < 0.05

(c) CI = [1.61, 34.19]

In [11]:
CI.confidence_interval(np.mean(data), 2.262, estimated_stderror)

array([ 1.61, 34.19])

(d) Cohen's d = 0.79

In [13]:
cohensd = np.mean(data) / np.std(data, ddof=1)  # ddof=1 gives the sample standard deviation s_D (n - 1)
round(float(cohensd), 2)

0.79

(e) On average, the top ten major league batters for 2014 regressed toward the mean in 2015, with a mean drop of 17.90 points in batting average that was statistically significant at the 0.05 level. This is a large effect (d of about 0.8).