In [1]:
import numpy as np
import os
import sys

sys.path.append(os.path.join(sys.path[0], os.path.pardir))

import utils.HypothesisTest as HT
import utils.CorrelationCoefficient as CC

### **14.1** EPO EXPERIMENT

#### **Two Independent Samples**

- Definition:
    + Two Independent Samples: Observations in each sample are based on different (and unmatched) subjects.

#### **Two Populations**

- In such an experiment, where the sample sizes are limited, the populations can only be regarded as hypothetical, so any generalization drawn from the experiment requires caution.
- Establishing that the finding generalizes requires additional experimentation.

#### **Difference between Population Means**

- Definition:
    + Effect: Any difference between the population means.

- The sign of the difference between observed means suggests the direction of the effect:
    + Negative: the treatment given to one group appears to have a negative effect.
    + Zero: the treatment appears to have no effect.
    + Positive: the treatment appears to have a positive effect.

### **14.2** STATISTICAL HYPOTHESES

#### **Null Hypothesis**

- Depending on the goal of the experiment, the null hypothesis takes one of the following forms:
    + Research interest in a negative effect: $H_0: \mu_1 - \mu_2 \ge 0$
    + Research interest in any effect, positive or negative: $H_0: \mu_1 - \mu_2 = 0$
    + Research interest in a positive effect: $H_0: \mu_1 - \mu_2 \le 0$ <br>

Where:
+ $\mu_1$: the mean score of the treatment group
+ $\mu_2$: the mean score of the control group

#### **Alternative (or Research) Hypothesis**

- The alternative hypothesis is the complement of the null hypothesis, so it takes one of the following forms:
    + Research interest in a negative effect: $H_1: \mu_1 - \mu_2 \lt 0$
    + Research interest in any effect, positive or negative: $H_1: \mu_1 - \mu_2 \neq 0$
    + Research interest in a positive effect: $H_1: \mu_1 - \mu_2 \gt 0$

#### **Progress Check 14.1** 

(a) <br>
Research Problem: <br>
$\;\;\;\;$Do the mean reading scores differ between the group exposed to a special bilingual program and the group exposed to a traditional reading program? <br>
Statistical Hypotheses: <br>
$\;\;\;\;$$H_0: \mu_1 - \mu_2 = 0$ <br>
$\;\;\;\;$$H_1: \mu_1 - \mu_2 \neq 0$

(b) <br>
$\;\;\;\;$$H_0: \mu_1 - \mu_2 \le 0$ <br>
$\;\;\;\;$$H_1: \mu_1 - \mu_2 \gt 0$

(c) <br>
$\;\;\;\;$$H_0: \mu_1 - \mu_2 \ge 0$ <br>
$\;\;\;\;$$H_1: \mu_1 - \mu_2 \lt 0$

(d) <br>
$\;\;\;\;$$H_0: \mu_1 - \mu_2 = 0$ <br>
$\;\;\;\;$$H_1: \mu_1 - \mu_2 \neq 0$

### **14.3** SAMPLING DISTRIBUTION OF $\overline{X}_1 - \overline{X}_2$

- Definition:
    + Sampling Distribution of $\overline{X}_1 - \overline{X}_2$: A distribution of the differences between sample means of all possible random samples from two underlying populations.

#### **Mean of the Sampling Distribution, $\mu_{\overline{X}_1 - \overline{X}_2}$**

- The mean of the sampling distribution of $\overline{X}_1 - \overline{X}_2$ is analogous to that of the sampling distribution of the sample mean introduced in chapter 9. It equals the difference between the population means: <br> <br>
<center>$\Large \mu_{\overline{X}_1 - \overline{X}_2} = \mu_1 - \mu_2$</center>

#### **Standard Error of the Sampling Distribution, $\sigma_{\overline{X}_1 - \overline{X}_2}$**

- Definition:
    + Standard Error of the Difference Between Means $\sigma_{\overline{X}_1 - \overline{X}_2}$: A rough measure of the average amount by which any difference between sample means deviates from the difference between population means.

- The standard deviation of the sampling distribution of $\overline{X}_1 - \overline{X}_2$ equals: <br><br>
<center>$\Large \sigma_{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{{\sigma_1}^2}{n_1} + \frac{{\sigma_2}^2}{n_2}}$</center>
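Both formulas can be checked empirically. The following is a minimal simulation sketch using only the standard library; the population parameters and the function name `simulate_mean_differences` are hypothetical, chosen only for illustration.

```python
import math
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
MU_1, SIGMA_1, N_1 = 12.0, 3.0, 9
MU_2, SIGMA_2, N_2 = 10.0, 4.0, 16


def simulate_mean_differences(repetitions=5000, seed=0):
    """Draw many pairs of independent samples and record each X1bar - X2bar."""
    rng = random.Random(seed)
    differences = []
    for _ in range(repetitions):
        sample_1 = [rng.gauss(MU_1, SIGMA_1) for _ in range(N_1)]
        sample_2 = [rng.gauss(MU_2, SIGMA_2) for _ in range(N_2)]
        differences.append(statistics.fmean(sample_1) - statistics.fmean(sample_2))
    return differences


differences = simulate_mean_differences()

# Theoretical mean and standard error of the sampling distribution.
theoretical_mean = MU_1 - MU_2
theoretical_se = math.sqrt(SIGMA_1**2 / N_1 + SIGMA_2**2 / N_2)

# Empirical counterparts from the simulated differences.
simulated_mean = statistics.fmean(differences)
simulated_se = statistics.pstdev(differences)

print(f"mean: theory {theoretical_mean:.3f}, simulated {simulated_mean:.3f}")
print(f"standard error: theory {theoretical_se:.3f}, simulated {simulated_se:.3f}")
```

With these parameters the theoretical standard error is $\sqrt{9/9 + 16/16} = \sqrt{2} \approx 1.414$, and the simulated values should land close to the theoretical ones, with the gap shrinking as `repetitions` grows.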

### **14.4** t TEST

- The z test requires that the standard deviations of both populations be known, which is rare in practice. When they must be estimated from the samples, the t test is used instead.

#### **t Ratio**

- The null hypothesis can be tested using a t ratio. Expressed in words, <br> <br>
<center>$\Large t = \frac{\text{difference between sample means}\, -\, \text{hypothesized difference between population means}}{\text{estimated standard error}}$</center> <br> <br>

- Expressed in symbols,
<center><b>t RATIO FOR TWO POPULATION MEANS (TWO INDEPENDENT SAMPLES)</b></center>
<center>$\Large t = \frac{(\overline{X}_1 - \overline{X}_2)\, -\, (\mu_1 - \mu_2)_{\text{hyp}}}{s_{\overline{X}_1 - \overline{X}_2}}$</center>

#### **Finding Critical t Values**

- Because the test involves two independent samples drawn from two different populations, the degrees of freedom are df = $n_1$ + $n_2$ - 2 (one degree of freedom is lost in each sample). The critical t is then read from the same table used in the previous chapter.
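Critical values like those in the t table can also be approximated numerically. The sketch below is one way to do it with the standard library only: it integrates the t density with Simpson's rule and bisects for the value that leaves the requested tail area. The function names (`t_pdf`, `upper_tail_area`, `critical_t`) are illustrative and are not part of the `utils` package.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coefficient = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coefficient * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_area(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def critical_t(n1, n2, alpha, two_tailed=True):
    """Positive critical t for df = n1 + n2 - 2, found by bisection."""
    df = n1 + n2 - 2
    tail = alpha / 2 if two_tailed else alpha
    low, high = 0.0, 100.0
    for _ in range(50):
        middle = (low + high) / 2
        if upper_tail_area(middle, df) > tail:
            low = middle   # too much area remains in the tail: move right
        else:
            high = middle
    return (low + high) / 2


print(round(critical_t(12, 11, 0.05), 3))                    # df = 21, two-tailed
print(round(critical_t(15, 13, 0.05, two_tailed=False), 3))  # df = 26, one-tailed
print(round(critical_t(8, 10, 0.01), 3))                     # df = 16, two-tailed
```

Assuming Progress Check 14.2 asks for a two-tailed .05 test in (a), a one-tailed .05 test in (b) and a two-tailed .01 test in (d), the three printed values should agree with the tabled 2.080, 1.706 and 2.921 to about three decimals.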

#### **Progress Check 14.2** 

(a) df = 12 + 11 - 2 = 21. <br>
t = 2.080.

(b) df = 15 + 13 - 2 = 26. <br>
t = 1.706.

(c) df = 25 + 25 - 2 = 48. <br>
t = -2.423.

(d) df = 8 + 10 - 2 = 16. <br>
t = 2.921.

#### **Summary for EPO Experiment**

### **14.5** DETAILS: CALCULATIONS FOR THE t TEST

#### **Pooled Variance Estimate, ${s_p}^2$**

- Definition:
    + The most accurate estimate of the population variance (assumed to be the same for both populations) based on a combination of two sums of squares and their degrees of freedom.

<center><b>POOLED VARIANCE ESTIMATE ${s_p}^2$</b></center> <br>
<center>$\Large {s_p}^2 = \frac{SS_1 + SS_2}{df} = \frac{SS_1 + SS_2}{n_1 + n_2 - 2}$</center>

#### **Estimated Standard Error, $s_{\overline{X}_1 - \overline{X}_2}$**

- Definition:
    + The standard deviation of the sampling distribution of the difference between means, used when the unknown variance common to both populations must be estimated.

<center><b>ESTIMATED STANDARD ERROR, $s_{\overline{X}_1 - \overline{X}_2}$</b></center> <br>
<center>$\Large s_{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{{s_p}^2}{n_1} + \frac{{s_p}^2}{n_2}}$</center>

#### **Progress Check 14.3**

In [2]:
diff_group = [5, 20, 7, 23, 30, 24, 9, 8, 20, 12]
easy_group = [13, 6, 6, 5, 3, 6, 10, 20, 9, 12]

In [3]:
estimated_stderror = HT.estimated_stderror(data=[diff_group, easy_group], sample_size=[10, 10], test_type="2 independent samples")
diffgroup_mean = np.mean(diff_group)
easygroup_mean = np.mean(easy_group)
tscore = HT.calc_tscore(diffgroup_mean - easygroup_mean, 0, estimated_stderror)
tscore

2.152
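The result above can be cross-checked without the `HT` helpers by applying the pooled variance and estimated standard error formulas directly. This is a minimal sketch; `sum_of_squares` and `independent_t` are illustrative names, not functions from `utils`.

```python
import math

diff_group = [5, 20, 7, 23, 30, 24, 9, 8, 20, 12]
easy_group = [13, 6, 6, 5, 3, 6, 10, 20, 9, 12]


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean (SS)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


def independent_t(sample_1, sample_2, hypothesized_difference=0.0):
    """Return (pooled variance, estimated standard error, t ratio)."""
    n1, n2 = len(sample_1), len(sample_2)
    # Pooled variance: (SS1 + SS2) / (n1 + n2 - 2)
    pooled_variance = (sum_of_squares(sample_1) + sum_of_squares(sample_2)) / (n1 + n2 - 2)
    # Estimated standard error: sqrt(sp^2 / n1 + sp^2 / n2)
    standard_error = math.sqrt(pooled_variance / n1 + pooled_variance / n2)
    mean_difference = sum(sample_1) / n1 - sum(sample_2) / n2
    t = (mean_difference - hypothesized_difference) / standard_error
    return pooled_variance, standard_error, t


pooled_variance, standard_error, t = independent_t(diff_group, easy_group)
print(round(pooled_variance, 2), round(standard_error, 2), round(t, 3))
```

Carried at full precision, t comes out near 2.153. The small gap from 2.152 is consistent with the standard error being rounded to 3.16 before dividing, although that is a guess about the helper's internals.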

### **14.6** p-VALUES

- Definition:
    + p-Value: The degree of rarity of the observed result, given that the null hypothesis is true. Smaller p-values tend to discredit the null hypothesis and to support the research hypothesis.

#### **Finding Approximate p-Values**
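An approximate p-value is found by locating the observed t among the critical values in the relevant row of the t table. The sketch below assumes a two-tailed test with df = 18; the three critical values are copied from a standard t table, and `approximate_p` is an illustrative name.

```python
# Two-tailed critical t values for df = 18, copied from a standard t table.
# Each pair is (level of significance, critical t).
CRITICAL_T_DF18 = [(0.05, 2.101), (0.01, 2.878), (0.001, 3.922)]


def approximate_p(t, table_row):
    """Return a bracket such as 'p < 0.05' for a two-tailed test."""
    statement = f"p > {table_row[0][0]}"
    for level, critical in table_row:   # levels run from largest to smallest
        if abs(t) >= critical:
            statement = f"p < {level}"  # keep the smallest level that t reaches
    return statement


print(approximate_p(2.15, CRITICAL_T_DF18))
print(approximate_p(1.50, CRITICAL_T_DF18))
```

For the t of about 2.15 obtained in Progress Check 14.3, the observed value falls between the .05 and .01 critical values, so the bracket is p < 0.05.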

#### **Progress Check 14.4**

(a) p < 0.001

(b) p < 0.05

(c) p < 0.01

(d) p > 0.05

(e) p > 0.05

#### **Reading p-Values Reported by Others**

#### **Evaluation of the p-Value Approach**

- Advantage:
    + Provides some support for the research hypothesis even when the observed t value is slightly less deviant than the critical t value for some level of significance.
- Disadvantage:
    + The decision to reject or to retain the null hypothesis is ambiguous.
    + The approach does not acknowledge type I or type II errors: it treats the outcome as if the null hypothesis were simply either false or true, without stating the risk of a wrong decision.

#### **Level of Significance of p-Value?**

#### **Progress Check 14.5**

$(a_2)\, (b_1)\, (c_2)\, (d_1)\, (e_2)$

#### **Progress Check 14.6**

$(a_2)\, (b_1)\, (c_2)\, (e_2)$

### **14.7** STATISTICALLY SIGNIFICANT RESULTS

- Essentially, <i>rejecting the null hypothesis</i> and <i>statistically significant</i> describe the same outcome: the observed difference cannot reasonably be attributed to chance because its probability under the null hypothesis is too low.
- However:
    + <i>Rejecting the null hypothesis</i>: refers to the population.
    + <i>Statistically significant</i>: refers to the sample.

- Definition:
    + Statistical Significance: Implies only that the null hypothesis is probably false, not whether it's false because of a large or small difference between population means.
- Statistical significance therefore implies neither the size nor the importance of the effect.

#### **Beware of Excessively Large Sample Sizes**

- A large sample size increases the precision of the observed result and therefore also the probability of detecting even a small, unimportant effect. Before conducting a hypothesis test, the investigator should decide on the smallest effect that would be considered important, then use sample-size tools to choose sample sizes that give an acceptable probability of detecting an effect of that size.
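The dependence on sample size can be made concrete. With equal group sizes $n$, the estimated standard error is $s_p\sqrt{2/n}$, so the t ratio for a fixed standardized effect $d = (\overline{X}_1 - \overline{X}_2)/s_p$ is $t = d\sqrt{n/2}$. The sketch below tabulates this for a hypothetical, very small effect; `t_for_fixed_effect` is an illustrative name.

```python
import math


def t_for_fixed_effect(d, n_per_group):
    """t ratio implied by a standardized effect d with n subjects in each group."""
    return d * math.sqrt(n_per_group / 2)


tiny_effect = 0.05   # hypothetical effect: one-twentieth of a standard deviation
for n in (20, 200, 2000, 20000):
    print(n, round(t_for_fixed_effect(tiny_effect, n), 2))
```

With 20000 subjects per group the t ratio reaches 5.0, far beyond any conventional critical value, even though the underlying effect is unchanged and very small.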

#### **Avoid an Erroneous Conditional Probability**

- Rejecting the null hypothesis at, for example, the 0.05 level of significance involves the following reasoning:
    + First, the null hypothesis is assumed to be true, that is, the sampling distribution of the observed mean is assumed to be centered about the hypothesized population mean.
    + Then, after finding the actual value of the observed mean, we consult the hypothesized distribution and conclude that the probability of a mean at least this deviant is $\le 0.05$.
    + Since this probability is too low, we suspect that the observed mean does not originate from the hypothesized distribution but from a different, unknown population distribution.
    + Finally, we conclude that there is a real effect that cannot reasonably be attributed to chance variability.
- None of these steps says anything about the probability that $H_0$ itself is true or false. The 0.05 is the probability of a result this rare given that $H_0$ is true, not the probability that $H_0$ is true given the result.
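The direction of the conditional probability can be illustrated by simulation. In the sketch below, both samples are always drawn from the same hypothetical normal population, so $H_0$ is true by construction, and the proportion of rejections estimates the probability of rejecting given a true $H_0$. The population parameters, the seed and the function names are assumptions made for illustration; the critical value 2.101 is the two-tailed .05 table entry for df = 18.

```python
import math
import random

CRITICAL_T = 2.101   # two-tailed, 0.05 level, df = 10 + 10 - 2 = 18


def t_ratio(sample_1, sample_2):
    """Pooled-variance t ratio for a hypothesized difference of zero."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = sum(sample_1) / n1, sum(sample_2) / n2
    ss_1 = sum((x - mean_1) ** 2 for x in sample_1)
    ss_2 = sum((x - mean_2) ** 2 for x in sample_2)
    pooled_variance = (ss_1 + ss_2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_variance / n1 + pooled_variance / n2)
    return (mean_1 - mean_2) / standard_error


def rejection_rate_when_null_true(experiments=2000, n=10, seed=1):
    """Proportion of simulated experiments that reject a true null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        # Both samples come from the same population, so H0 is true.
        sample_1 = [rng.gauss(50, 10) for _ in range(n)]
        sample_2 = [rng.gauss(50, 10) for _ in range(n)]
        if abs(t_ratio(sample_1, sample_2)) >= CRITICAL_T:
            rejections += 1
    return rejections / experiments


rate = rejection_rate_when_null_true()
print(f"proportion of rejections when H0 is true: {rate:.3f}")
```

The proportion should be close to 0.05. It describes how often a true $H_0$ is rejected, and it says nothing about how often a rejected $H_0$ is actually true.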

### **14.8** ESTIMATING EFFECT SIZE: POINT ESTIMATES AND CONFIDENCE INTERVALS

#### **Point Estimate ($\overline{X}_1 - \overline{X}_2$)**

#### **Confidence Interval**

- Definition:
    + Confidence Intervals for $\mu_1 - \mu_2$: The ranges of values that, in the long run, include the true unknown effect (difference between population means) a certain percent of the time, with that percent determined by the chosen level of confidence.

<center><b>CONFIDENCE INTERVAL (CI) FOR $\mu_1 - \mu_2$ (TWO INDEPENDENT SAMPLES)</b></center>
<center>$\Large \overline{X}_1 - \overline{X}_2 \pm (t_{\text{conf}})(s_{\overline{X}_1 - \overline{X}_2})$</center>

#### **Interpreting Confidence Intervals for $\mu_1 - \mu_2$**

- Key point: 
    + The signs of the limits (upper and lower) are particularly important, since they indicate the direction of the difference between the two population means, which eventually affects the test decision.
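    + A single interpretation is possible only if the two limits of the confidence interval for $\mu_1 - \mu_2$ share the same sign, either both positive or both negative. If the limits differ in sign, the interval includes zero, so no effect remains a plausible value.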
    + Besides these points, any confidence interval can be interpreted similarly to that introduced in chapter 12.
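The interval formula and the sign rule can be sketched in a few lines. The numbers are taken from Progress Check 14.3 (means 15.8 and 9.0, estimated standard error about 3.16), and 2.101 is the two-tailed table value for 95 percent confidence with df = 18. The function names are illustrative and distinct from the `CI` helper module.

```python
def ci_for_mean_difference(mean_difference, t_conf, standard_error):
    """Lower and upper limits of the confidence interval for mu_1 - mu_2."""
    margin = t_conf * standard_error
    return mean_difference - margin, mean_difference + margin


def interpret(lower, upper):
    """Describe the interval according to the signs of its limits."""
    if lower > 0:
        return "both limits positive: mu_1 appears to exceed mu_2"
    if upper < 0:
        return "both limits negative: mu_2 appears to exceed mu_1"
    return "limits differ in sign: zero is plausible, so no single interpretation"


lower, upper = ci_for_mean_difference(15.8 - 9.0, 2.101, 3.16)
print(round(lower, 2), round(upper, 2))
print(interpret(lower, upper))
```

Both limits are positive here (roughly 0.16 and 13.44), which agrees with rejecting the null hypothesis at the .05 level in that progress check.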

#### **Progress Check 14.7**

(a) CI 2

(b) CI 2

(c) CI 3

(d) CI 1

### **14.9** ESTIMATING EFFECT SIZE: COHEN'S d

- Definition:
    + Standardized Effect Estimate, Cohen's d: Describes effect size by expressing the observed mean difference in standard deviation units.

<center><b>STANDARDIZED EFFECT SIZE, COHEN'S d (TWO INDEPENDENT SAMPLES)</b></center>
<center>$\Large d = \frac{\text{mean difference}}{\text{standard deviation}} = \frac{\overline{X}_1 - \overline{X}_2}{\sqrt{{s_p}^2}}$</center>

- Advantages:
    + Provides a stable frame of reference that is not influenced by the sample sizes.
    + Is independent of the units of measurement, so different d's can be compared with one another.

#### **Cohen's Guidelines for d**

- Guidelines: the effect size is considered
    + Small: if it's less than or in the vicinity of 0.20 - or one-fifth of a standard deviation.
    + Medium: if it's in the vicinity of 0.50 - or one-half of a standard deviation.
    + Large: if it's more than or in the vicinity of 0.80 - or four-fifths of a standard deviation.
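As a rough sketch, d can be computed from two means and a pooled variance and then matched against the guideline values. Labeling by the nearest benchmark is an assumption made here to turn "in the vicinity of" into code; it is not part of Cohen's guidelines, and the function names are illustrative.

```python
import math


def cohens_d(mean_1, mean_2, pooled_variance):
    """Observed mean difference expressed in pooled standard deviation units."""
    return (mean_1 - mean_2) / math.sqrt(pooled_variance)


def nearest_guideline(d):
    """Label |d| by whichever benchmark (0.20, 0.50, 0.80) is closest (an assumption)."""
    benchmarks = {"small": 0.20, "medium": 0.50, "large": 0.80}
    return min(benchmarks, key=lambda label: abs(abs(d) - benchmarks[label]))


# Means from Progress Check 14.3 and the pooled variance of those two samples.
d = cohens_d(15.8, 9.0, 49.87)
print(round(d, 2), nearest_guideline(d))
```

For these values d is about 0.96, which sits beyond the 0.80 benchmark and is labeled large.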

- Illustrations of Cohen's d based on two normal distributions: <br>
![image.png](attachment:cf3d8186-c1b3-4e4f-81b8-8f8726d0803a.png)

- Remark:
    + To evaluate an effect size as small, medium or large, an investigator should always take the context into consideration rather than relying only on these guideline values.

### **14.10** META-ANALYSIS

- Definition:
    + Meta-analysis: a set of data-collecting and statistical procedures designed to summarize the various effects reported by groups of similar studies.

### **14.11** IMPORTANCE OF REPLICATION

### **14.12** REPORTS IN THE LITERATURE

#### **Progress Check 14.8** 

In [4]:
# (a):
HT.calc_cohensd(15.8, 9.0, 7.06**2)

0.96

(b) The average required time for solving the same puzzle of the group receiving "difficult" instructions (M = 15.8, SD = 8.64) significantly exceeds that of the group receiving "easy" instructions (M = 9.0, SD = 5.01), according to a t test [t(18) = 2.15, p < 0.05 and d = 0.96]

### **14.13** ASSUMPTIONS

- Whether testing a hypothesis or constructing a confidence interval, t assumes that both underlying populations are normally distributed with equal variances.
- In the event that there are conspicuous departures from normality or equality of variances in the data for the two groups, consider the following possibilities:
    + Increase sample sizes to minimize the effect of any non-normality
    + Equate sample sizes to minimize the effect of unequal population variances
    + Use a slightly less sensitive, more complex version of t designed for unequal variances
    + Use a less sensitive but more assumption-free test, such as the Mann-Whitney U test.
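The version of t for unequal variances is commonly Welch's t test; the notes do not name it, so treating it as the intended alternative is an assumption. It replaces the pooled variance with separate variance estimates and adjusts the degrees of freedom with the Welch-Satterthwaite formula. A minimal sketch, with `welch_t` as an illustrative name and the data from Progress Check 14.3:

```python
import math
import statistics


def welch_t(sample_1, sample_2):
    """Welch's t ratio and its approximate degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    v1 = statistics.variance(sample_1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample_2) / n2   # s2^2 / n2
    t = (statistics.fmean(sample_1) - statistics.fmean(sample_2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


diff_group = [5, 20, 7, 23, 30, 24, 9, 8, 20, 12]
easy_group = [13, 6, 6, 5, 3, 6, 10, 20, 9, 12]

t, df = welch_t(diff_group, easy_group)
print(round(t, 3), round(df, 2))
```

Because the two samples have equal sizes, the t ratio matches the pooled-variance result (about 2.153); only the degrees of freedom drop, from 18 to roughly 14.4, which raises the critical value and makes the test slightly less sensitive.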

### **14.14** COMPUTER OUTPUT

#### **Progress Check 14.9**

(a) The result of equal variances t should be reported. Why?

(b)

### **Review Questions**

#### **14.10** 

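(a) The difference between the groups in experiment B is more likely to be viewed as real because the scores within each group are less variable, which yields a larger t ratio, as the two calculations below show (experiment B first, then experiment C).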

In [None]:
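# Experiment B: scores cluster tightly around each group mean
control_gr1 = [10, 10, 10, 10, 10, 9, 11]
treatment_gr1 = [11, 12, 12, 12, 12, 12, 13]
estimated_stderror1 = HT.estimated_stderror([control_gr1, treatment_gr1], [7, 7], "2 samples")
# t ratio: (observed mean difference - hypothesized difference of 0) / estimated standard error
t_ratio1 = (np.mean(treatment_gr1) - np.mean(control_gr1) - 0) / estimated_stderror1
t_ratio1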

In [19]:
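# Experiment C: same mean difference, but more variability within each group
control_gr2 = [7, 9, 9, 10, 11, 11, 13]
treatment_gr2 = [9, 11, 11, 12, 13, 13, 15]
estimated_stderror2 = HT.estimated_stderror([control_gr2, treatment_gr2], [7, 7], "2 samples")
# t ratio: (observed mean difference - hypothesized difference of 0) / estimated standard error
t_ratio2 = (np.mean(treatment_gr2) - np.mean(control_gr2) - 0) / estimated_stderror2
t_ratio2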

1.9607843137254901

(b) <br>
Statistical Hypotheses: <br>
$\;\;\;\;$$H_0: \mu_1 - \mu_2 = 0$ <br>
$\;\;\;\;$$H_1: \mu_1 - \mu_2 \neq 0$ <br>
Decision Rule: <br>
$\;\;\;\;$Reject $H_0$ at the 0.05 level of significance if $t \ge 2.179$ or $t \le -2.179$, given df = 7 + 7 - 2 = 12. <br>
Calculation: <br>
$\;\;\;\;$$s_{\overline{X}_1 - \overline{X}_2} = 0.31$ <br> <br>
$\;\;\;\;$$t = \frac{2 - 0}{0.31} = 6.45$ <br>
Decision: <br>
$\;\;\;\;$Reject $H_0$ because $t = 6.45$, which is greater than 2.179. <br>
Conclusion: <br>
$\;\;\;\;$Population means differ for experiment B.

(c) t = 1.96; therefore, $H_0$ is retained because t is less than the critical value of 2.179.

(d) Experiment B: p < 0.001. <br>
Experiment C: p > 0.05. <br>

(e) Based on these results, the mean difference in experiment B is statistically significant and can therefore be viewed as real. The observed difference in experiment C, on the other hand, is not significant and can only be viewed as merely transitory, that is, attributable to chance.

(f)

In [21]:
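# The third argument is a variance (as in the earlier call with 7.06**2);
# pooled_variance already returns a variance, so it is not squared again.
HT.calc_cohensd(12, 10, HT.pooled_variance(*[control_gr1, treatment_gr1], 12))

3.46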

#### **14.11**

In [29]:
def calc_tscore(data):
    length = list()
    means = list()
    for each_set in data:
        length.append(len(each_set))
        means.append(np.mean(each_set))
        
    estimated_stderror = HT.estimated_stderror(data, length, "2 samples")
    
    return (means[0] - means[1]) / estimated_stderror 

In [30]:
treatment_gr = [2, 5, 20, 15, 4, 10]
control_gr = [3, 8, 7, 10, 14, 0]
calc_tscore([treatment_gr, control_gr])

0.6628787878787881

t = 0.66. <br>
Retain the null hypothesis (that the so-called "committee atmosphere" has no effect on compliance) because t is less than the critical value of 2.228.

#### **14.12**

(a) t = 1.11 is less than the critical value 1.671. Therefore, the null hypothesis is retained.

(b) Not appropriate.

#### **14.13**

In [10]:
import utils.HypothesisTest as HT
import utils.ConfidenceInterval as CI

(a) t = 3.067. <br>
Reject the null hypothesis because t = 3.067, which is greater than the critical value of 2.042 at the 0.05 level of significance. <br>
There's evidence that, on average, the letter-grade policy has a positive effect on achievement scores compared with the pass/fail policy.

In [None]:
t = HT.calc_tscore(86.2 - 81.6, 0, 1.5)
t

(b) The null hypothesis would still be rejected. The interpretation keeps the same meaning but is stated in the opposite direction: "There's evidence that, on average, the pass/fail grading policy has a negative effect on achievement scores compared with the letter-grade policy."

(c) Because of self-selection, groups might differ with respect to any one or several uncontrolled variables, such as motivation, aptitude, and so on, in addition to the difference in grading policy. Hence, any observed difference between the mean achievement scores cannot be solely attributed to the difference in grading policy.

(d) p < 0.01

(e)

In [7]:
d = HT.calc_cohensd(86.2, 81.6, 5**2)
d

0.92

(f) The achievement scores for the group under the letter-grade policy (M = 86.2, SD = 5.39) and the group under the pass/fail policy (M = 81.6, SD = 4.58) differed significantly [t(38) = 3.07, p < 0.01 and d = 0.92].

#### **14.14**

(a) t = 3.25. <br>
Reject the null hypothesis because t exceeds the critical value of 1.980. <br>
There's evidence that drinking alcohol does affect the performance of drivers.

In [8]:
t = HT.calc_tscore(26.4 - 18.6, 0, 2.4)
t

3.25

(b) p < 0.01

(c) CI = [3.05, 12.55]. <br>
Interpretation: we can claim with 95% confidence that the interval includes the true difference between the mean performance scores of the populations of drivers in the treatment and control groups.

In [11]:
CI.confidence_interval(26.4 - 18.6, 1.980, 2.4)

array([ 3.05, 12.55])

(d) d = 0.59

In [13]:
d = HT.calc_cohensd(26.4, 18.6, 13.15**2)
d

0.59

(e) The average performance scores of drivers who drank alcohol (M = 26.4, SD = 13.99) and drivers who did not (M = 18.6, SD = 12.15) before a driving-simulator test differed significantly, according to a t test [t(120) = 3.25, p < 0.01 and d = 0.59].

#### **14.15**

#### **14.16**

#### **14.17**

#### **14.18**