# Statistical Data Management Session 10: Inferences Based on Two Samples: Tests of Hypothesis (Chapter 9 in McClave & Sincich)


**We expect you to be able to solve these exercises both with and without Python.**

## 1. Executive Workout Dropouts *(Ex 9.116 from the book)*

The *Journal of Sport Behaviour* (2001) conducted a study of variety in exercise workouts. One group of 40 people varied their exercise routine in workouts, while a second group of 40 exercisers had no set schedule or regulations for their workouts. By the end of the study, 15 people had dropped out of the first exercise group and 23 had dropped out of the second group.

1. Find the dropout rates (i.e., the percentage of exercisers who had dropped out of the exercise group) for each of the two groups of exercisers. 
2. Find a 90% confidence interval for the difference between the dropout rates of the two groups of exercisers.
3. Give a practical interpretation of the confidence interval you found in part 2.
4. Suppose you want to estimate the true difference in dropout rates to within 0.1 with 90% confidence. Determine the number of exercisers to be sampled from each group in order to obtain such an estimate. Assume equal sample sizes, and assume $p_1 \approx \hat{p}_1$ and $p_2 \approx \hat{p}_2$.


1. Dropout rates for the two groups of exercisers:

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
import time
%matplotlib inline

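n_1 = 40  # group 1: varied exercise routine
n_2 = 40  # group 2: no set schedule
p_hat1 = 15 / n_1
p_hat2 = 23 / n_2
q_hat1 = 1 - p_hat1
q_hat2 = 1 - p_hat2
print(f"Dropout rate group 1: {p_hat1:.1%}")
print(f"Dropout rate group 2: {p_hat2:.1%}")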

2. 90% confidence interval for the difference between the dropout rates of the two groups of exercisers: 

In [None]:
# check for large sample => all satisfied
print(n_1*p_hat1 >= 15)
print(n_1*q_hat1 >= 15)
print(n_2*p_hat2 >= 15)
print(n_2*q_hat2 >= 15)

# confidence interval
standard_normal = sts.norm(0,1)
alpha = 0.1
z_alpha_div2 = standard_normal.ppf(1 - alpha/2)
print(z_alpha_div2)

left_bound =  (p_hat1 - p_hat2) - z_alpha_div2 * np.sqrt((p_hat1*q_hat1/n_1)+(p_hat2*q_hat2/n_2))
right_bound = (p_hat1 - p_hat2) + z_alpha_div2 * np.sqrt((p_hat1*q_hat1/n_1)+(p_hat2*q_hat2/n_2))
interval = [left_bound, right_bound]
print(interval)

3. We are 90% confident that the difference $p_1 - p_2$ between the dropout rates of the two groups of exercisers is between -0.38 and -0.02. Since both ends of this confidence interval are less than 0, there is evidence that the dropout rate for the "varied exercise" group is less than the rate for the "no schedule" group.
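As a cross-check, the interval computation can be wrapped in a small reusable function. This is only a sketch: the helper name `two_proportion_ci` is hypothetical, and it uses `statistics.NormalDist` from the standard library for the normal quantile instead of `sts.norm`, which should give the same $z_{\alpha/2}$.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence=0.90):
    """Large-sample confidence interval for p1 - p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    # z_{alpha/2}: upper alpha/2 quantile of the standard normal
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # estimated standard error of p_hat1 - p_hat2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# 15 of 40 dropped out in group 1, 23 of 40 in group 2
lower, upper = two_proportion_ci(15, 40, 23, 40, confidence=0.90)
print(f"90% CI for p1 - p2: ({lower:.3f}, {upper:.3f})")
```

Both bounds should come out negative, in line with the interpretation above.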

4. This means that the margin of error, i.e. the term subtracted and added when calculating the confidence interval,

    $z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}+\frac{\hat{p}_2\hat{q}_2}{n_2}}$

    is at most 0.1. Using the fact that $n_1 = n_2$, squaring both sides and solving for $n_1$ yields:
    $n_1 \geq (\hat{p}_1\hat{q}_1+\hat{p}_2\hat{q}_2)z_{\alpha/2}^2 \frac{1}{0.1^2}$. 
    
    Round up to the next natural number to be on the safe side:

In [None]:
print(np.ceil((p_hat1*q_hat1+p_hat2*q_hat2)*z_alpha_div2**2/0.01))
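The rounding step can be verified by checking that the resulting sample size actually achieves the requested margin, while one exerciser fewer per group does not. The sketch below is self-contained; the helper names `n_per_group` and `half_width` are hypothetical, and the normal quantile comes from the standard library's `statistics.NormalDist`.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, margin, confidence=0.90):
    """Smallest equal group size with half-width <= margin (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / margin**2)


def half_width(p1, p2, n, confidence=0.90):
    """Half-width of the CI for p1 - p2 when both groups have size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)


p1, p2 = 15 / 40, 23 / 40
n = n_per_group(p1, p2, margin=0.1)
print("Required size per group:", n)
print("Half-width with n:      ", half_width(p1, p2, n))
print("Half-width with n - 1:  ", half_width(p1, p2, n - 1))
```

The half-width with `n` should be just below 0.1 and the one with `n - 1` just above it, which is why rounding up (rather than to the nearest integer) is the safe choice.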

## 2. Salary Increase

Do workers generally increase their salary when changing jobs? To test this, 18 workers in a certain field are interviewed before and after they change jobs. Assume that salaries in this field are normally distributed.

1. Formulate $H_0$ and $H_a$.

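This is a paired sample. Let $\mu_d = \mu_{\text{after}} - \mu_{\text{before}}$ be the mean salary difference. $H_0: \mu_d = D_0 = 0$. We test whether workers increase their salary, so we perform a one-sided (upper-tailed) test with $H_a: \mu_d > 0$.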

2. Run the cell below to define the dataframe.

In [None]:
df_salaries = pd.DataFrame({
    'before_change': [1750,1875,1803,1862,1543,2122,1967,1781,2071,2051,1700,1564,1444,1715,1599,1907,2142,1801],
    'after_change': [1795,1928,1896,1834,1567,1630,1832,1892,1854,1831,1823,1816,1915,1734,2018,1727,1688,2089]})
print(df_salaries)

3. Perform the test of hypothesis at $\alpha = 0.05$.

In [None]:
differences = df_salaries['after_change'] - df_salaries['before_change']
#print(differences)
x_d = differences.mean()
s_d = differences.std()
n_d = len(differences)

D_0 = 0
alpha = 0.05
t_distribution = sts.t(n_d - 1)  # sigma_d is unknown, so use a t-distribution with n_d - 1 degrees of freedom (with known sigma_d, use the standard normal instead)
t_alpha = t_distribution.ppf(1 - alpha)

t_statistic = (x_d - D_0) / (s_d / np.sqrt(n_d))
print("Test statistic:   ", t_statistic)
print("Critical t value: ", t_alpha)
# Conclusion: the test statistic is below the critical value, so H_0 is not rejected.
# At alpha = 0.05, the observed difference in wages is not significant.
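The manual computation can be compared with SciPy's built-in paired t-test. A minimal sketch, assuming a SciPy version that supports the `alternative` keyword of `ttest_rel` (1.6 or later); the data are repeated here so that the block runs on its own.

```python
import numpy as np
from scipy import stats as sts

before = np.array([1750, 1875, 1803, 1862, 1543, 2122, 1967, 1781, 2071,
                   2051, 1700, 1564, 1444, 1715, 1599, 1907, 2142, 1801])
after = np.array([1795, 1928, 1896, 1834, 1567, 1630, 1832, 1892, 1854,
                  1831, 1823, 1816, 1915, 1734, 2018, 1727, 1688, 2089])

# Paired t-test on after - before; 'greater' matches H_a: mu_d > 0
result = sts.ttest_rel(after, before, alternative='greater')
print("Test statistic:    ", result.statistic)
print("One-sided p-value: ", result.pvalue)
```

The test statistic should match the one computed by hand, and a p-value above 0.05 leads to the same conclusion as comparing the statistic with the critical t value.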

## 3. Compare Argument Skills *(ex. 9.23 from the book)*
Educators frequently lament weaknesses in students' oral and written arguments. In *Thinking and Reasoning* (April 2007), researchers at Columbia University conducted a series of studies to assess the cognitive skills required for successful arguments. One study focused on whether students would choose to argue by weakening the opposing position or by strengthening the favored position. (For example, suppose you are told you would do better at basketball than soccer, but you like soccer. An argument that weakens the opposing position is "You need to be tall to play basketball." An argument that strengthens the favored position is "With practice, I can become really good at soccer.") A sample of 52 graduate students in psychology was equally divided into two groups. Group 1 was presented with 10 items such that the argument always attempts to strengthen the favored position. Group 2 was presented with the same 10 items, but this time the argument always attempts to weaken the nonfavored position. Each student then rated the 10 arguments on a 5-point scale from very weak (1) to very strong (5). The variable of interest was the sum of the 10 item scores, called the total rating. Summary statistics for the data are shown in the accompanying table. You may assume both total ratings follow a normal distribution.

| | Group 1 (support favored position)| Group 2 (weaken opposing position)|
|:---| :---:| :---:|
|sample size | 26| 26|
|mean|28.6|24.9|
|standard deviation|12.5|12.2|


1. In order to determine whether or not the difference in mean between the two positions is significant, the researchers would have to assume that the variance of the total rating for Group 1 and Group 2 is the same. Test the validity of this assumption using a test at $\alpha=0.05$.
2. Compare the mean total ratings for the two groups at $\alpha=0.05$ and give a practical interpretation of the result.

1. We test, two-sided, whether there is a difference in the variances between the groups: $H_0: \sigma_1^2 = \sigma_2^2; H_a: \sigma_1^2 \neq \sigma_2^2$.

In [None]:
n_1 = 26
x_bar1 = 28.6
s_1 = 12.5
n_2 = 26
x_bar2 = 24.9
s_2 = 12.2

alpha = 0.05

F_test_statistic = s_1**2 / s_2**2 # always "largest / smallest"!
print("Test statistic:   ", F_test_statistic)

F_distribution = sts.f(dfn=n_1-1, dfd=n_2-1)
critical_F_value = F_distribution.ppf(1 - alpha/2) 
print("Critical F value: ", critical_F_value)

# code for illustration purposes only
interval = np.linspace(0, 5, 1000)
part = np.linspace(critical_F_value, 5, 100)  # right-hand rejection region
plt.plot(interval, F_distribution.pdf(interval))
plt.fill_between(part, F_distribution.pdf(part), color='red')
# a single point cannot be filled, so mark the test statistic with a vertical line
plt.axvline(F_test_statistic, color='green')
plt.show()
plt.close()

The test statistic doesn't lie in the critical region, so we don't reject $H_0$ and can continue the analysis under the assumption that the population variances are equal.
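The same decision can be reached through a p-value instead of a critical value. The sketch below assumes the textbook convention of putting the larger sample variance in the numerator and doubling the upper-tail probability; the helper name `f_test_from_stats` is hypothetical.

```python
from scipy import stats as sts


def f_test_from_stats(s_a, n_a, s_b, n_b):
    """Two-sided F-test for equal variances from summary statistics."""
    # Put the larger sample variance in the numerator
    if s_a < s_b:
        s_a, n_a, s_b, n_b = s_b, n_b, s_a, n_a
    F = s_a**2 / s_b**2
    # Double the upper-tail probability for a two-sided test (capped at 1)
    p_value = min(1.0, 2 * sts.f.sf(F, n_a - 1, n_b - 1))
    return F, p_value


F, p_value = f_test_from_stats(12.5, 26, 12.2, 26)
print("Test statistic:", F)
print("p-value:       ", p_value)
```

A p-value above $\alpha = 0.05$ corresponds to a test statistic outside the rejection region, so $H_0$ is not rejected.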

2. Here, $H_0$ is that there is no difference between the means of the groups ($\mu_1-\mu_2=D_0=0$) and $H_a$ that there is ($\mu_1-\mu_2\neq 0$), so this is a two-sided test. As both $n_1 < 30$ and $n_2 < 30$, we use the small-sample t-test with a pooled variance estimate, which relies on the equal-variance assumption checked in part 1.

In [None]:
D_0 = 0
s2_p = (((n_1-1) * s_1**2) + ((n_2-1) * s_2**2)) / (n_1-1+n_2-1)
print(s2_p)
t_test_statistic = (x_bar1 - x_bar2 - D_0) / np.sqrt((s2_p) * ((1/n_1) + (1/n_2)))
print("Test statistic:   ", t_test_statistic)

degrees_of_freedom = n_1 + n_2 - 2
t_distribution = sts.t(degrees_of_freedom)
critical_t_value = t_distribution.ppf(1 - alpha/2)
print("Critical t value: ", critical_t_value)

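We cannot reject $H_0$ because the absolute value of the test statistic is below the critical value: the observed difference between the groups is not significant at the $\alpha=0.05$ level. In practical terms, there is insufficient evidence that the mean total rating differs between arguments that strengthen the favored position and arguments that weaken the opposing one.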
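Because only summary statistics are available, the pooled t-test can be cross-checked with `scipy.stats.ttest_ind_from_stats`. A minimal sketch, using `equal_var=True` to request the pooled-variance version of the test:

```python
from scipy import stats as sts

# Pooled two-sample t-test directly from the summary table
result = sts.ttest_ind_from_stats(
    mean1=28.6, std1=12.5, nobs1=26,
    mean2=24.9, std2=12.2, nobs2=26,
    equal_var=True,
)
print("Test statistic:    ", result.statistic)
print("Two-sided p-value: ", result.pvalue)
```

The statistic should agree with the manual calculation, and the two-sided p-value can be compared directly with $\alpha = 0.05$.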

## 4. Patent Infringement Case

*Chance* (Fall 2002) described a lawsuit charging Intel Corp. with infringing on a patent for an invention used in the automatic manufacture of computer chips. In response, Intel accused the inventor of adding material to his patent notebook after the patent was witnessed and granted. The case rested on whether a patent witness's signature was written on top of or under key text in the notebook. Intel hired a physicist who used an X-ray beam to measure the relative concentrations of certain elements (e.g., nickel, zinc, potassium) at several spots on the notebook page. The zinc measurements for three notebook locations (on a text line, on a witness line, and on the intersection of the witness and text line) are provided in the following table. You may assume that measurements are drawn from a normal distribution.

| Location | Zinc measurements |
|---:| :---|
|text line: | .335 .374 .440|
|witness line:| .210 .262 .188 .329 .439 .397|
|intersection: | .393 .353 .285 .295 .319|

A large difference in variation in zinc level between the intersection and e.g. the text line would support Intel's claim.
 
1.	Use a test (at $\alpha=.05$) to compare the variation in zinc measurements for the text line with the corresponding variation for the intersection.
2.	Use a test (at $\alpha=.05$) to compare the variation in zinc measurements for the witness line with the corresponding variation for the intersection.

In [None]:
# one-dimensional samples, so a Series is enough (var() then returns a scalar)
df_text = pd.Series([0.335, 0.374, 0.44])
df_witness = pd.Series([0.21, 0.262, 0.188, 0.329, 0.439, 0.397])
df_intersection = pd.Series([0.393, 0.353, 0.285, 0.295, 0.319])

1. Say $\sigma_t^2, \sigma_w^2, \sigma_i^2$ are the variances of zinc measurements for the text-line, witness-line and intersection, respectively.

    To determine whether the variation in zinc measurements for the text-line and the variation in zinc measurements for the intersection differ, we test $H_0: \sigma_t^2=\sigma_i^2, H_a: \sigma_t^2\neq\sigma_i^2$. We use a two-sided F-test.

    The test statistic is $F = s_t^2/s_i^2$.
    
    Calculations are below, with an indication of the value of the test statistic in green and the rejection region (barely visible) in red.

In [None]:
# variances (s2 to indicate s² as "²" is not allowed in variable names):
s2_t = df_text.var()
s2_i = df_intersection.var()

n_t = len(df_text)
n_i = len(df_intersection)

F1 = s2_t / s2_i
print("Test statistic:   ", F1)

f_tVSi = sts.f(n_t - 1, n_i - 1)
# when in doubt about the order, use keywords: sts.f(dfn=n_t-1, dfd=n_i-1)

alpha = 0.05
critical_F_value = f_tVSi.ppf(1 - alpha/2) # "from which point onwards does the rhs tail weigh 0.025?"
print("Critical F value: ", critical_F_value)
# note that we only have to consider the rhs critical value, because we always use "largest variance/ smallest variance"

# code for illustration purposes only
interval = np.linspace(0, 13, 1000)  # the F distribution only has mass for x >= 0
part = np.linspace(critical_F_value, 13, 1000)  # right-hand rejection region
plt.figure(figsize=(10,6))
plt.plot(interval, f_tVSi.pdf(interval))
plt.fill_between(part, f_tVSi.pdf(part), color='red')
# a single point cannot be filled, so mark the test statistic with a vertical line
plt.axvline(float(F1), color='green')
plt.show()
plt.close()

The rejection region consists of test statistic values above 10.65; the observed value of about 1.43 lies well below it. There is insufficient evidence to indicate that the variation in zinc measurements for the text line and the variation in zinc measurements for the intersection differ at $\alpha = 0.05$.

2. Similarly, for $H_0: \sigma_w^2 = \sigma_i^2$ and $H_a: \sigma_w^2 \neq \sigma_i^2$:

In [None]:
s2_w = df_witness.var()
n_w = len(df_witness)

F2 = s2_w / s2_i
print("Test statistic:   ", F2)

f_wVSi = sts.f(n_w-1, n_i-1)
# when in doubt about the order, use keywords: sts.f(dfn=n_w-1, dfd=n_i-1)

critical_F_value2 = f_wVSi.ppf(1 - alpha/2)
print("Critical F value: ", critical_F_value2)

interval = np.linspace(0, 11, 1000)  # the F distribution only has mass for x >= 0
part = np.linspace(critical_F_value2, 11, 1000)  # right-hand rejection region
plt.figure(figsize=(10,6))
plt.plot(interval, f_wVSi.pdf(interval))
plt.fill_between(part, f_wVSi.pdf(part), color='red')
# a single point cannot be filled, so mark the test statistic with a vertical line
plt.axvline(float(F2), color='green')
plt.show()
plt.close()

Again, the observed difference in variation is not significant and $H_0$ is not rejected.
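Both comparisons follow the same recipe, so they can be sketched as one helper that always places the larger sample variance in the numerator and picks the degrees of freedom accordingly. The function name `two_sided_f_test` and the `results` dictionary are hypothetical; the data are repeated so that the block runs on its own.

```python
import numpy as np
from scipy import stats as sts


def two_sided_f_test(sample_a, sample_b, alpha=0.05):
    """Return (F, critical value, reject?) for H_0: equal variances."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)  # sample variances
    # Larger variance goes in the numerator, with its own degrees of freedom
    if var_a >= var_b:
        F, dfn, dfd = var_a / var_b, a.size - 1, b.size - 1
    else:
        F, dfn, dfd = var_b / var_a, b.size - 1, a.size - 1
    critical = sts.f.ppf(1 - alpha / 2, dfn, dfd)
    return F, critical, bool(F > critical)


text = [0.335, 0.374, 0.44]
witness = [0.21, 0.262, 0.188, 0.329, 0.439, 0.397]
intersection = [0.393, 0.353, 0.285, 0.295, 0.319]

results = {}
for label, sample in [("text", text), ("witness", witness)]:
    results[label] = two_sided_f_test(sample, intersection)
    F, critical, reject = results[label]
    print(f"{label} vs intersection: F = {F:.3f}, critical = {critical:.3f}, reject H_0: {reject}")
```

Letting the helper choose numerator and denominator avoids having to check by hand which sample variance is larger before building the F distribution.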

## 5. SQL Recap

The file ``salary_differences.sql`` provided on Toledo contains the information used in exercise 2. Import the file using MySQL Workbench and write the appropriate queries to retrieve the relevant information. Re-run your analysis (without running the cell which defined the dataframe!) to check whether you have the correct information. Note that some workers have a ``NULL`` value in the table listing wages after they changed jobs. This indicates that these workers didn't change jobs (and should therefore be excluded from the result).

In [None]:
conn = sqlite3.connect("../../shared/salary_differences.db")

query = """
SELECT before_change.salary AS before_change, after_change.salary AS after_change 
FROM before_change 
JOIN after_change ON before_change.worker_id = after_change.worker_id 
WHERE after_change.salary IS NOT NULL
"""

df_salaries = pd.read_sql_query(query, conn)
conn.close()  # release the database file once the data is loaded
print(df_salaries)

# Note that our analysis still works!
differences = df_salaries['after_change'] - df_salaries['before_change']
#print(differences)
x_d = differences.mean()
s_d = differences.std()
n_d = len(differences)

D_0 = 0
alpha = 0.05
t_distribution = sts.t(n_d - 1)
t_alpha = t_distribution.ppf(1 - alpha)

t_statistic = (x_d - D_0) / (s_d / np.sqrt(n_d))
print("Test statistic:   ", t_statistic)
print("Critical t value: ", t_alpha)
# Conclusion: at alpha = 0.05, the observed difference in wages is not significant.
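To see how the join and the `NULL` filter behave without access to the course database, the query can be tried on a tiny in-memory SQLite database. This is only a sketch: the table and column names are taken from the query above, but the schema details and the three rows are made up for illustration (worker 2 is given a `NULL` salary after the change).

```python
import sqlite3

# In-memory database with a hypothetical, minimal version of the two tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE before_change (worker_id INTEGER PRIMARY KEY, salary INTEGER);
CREATE TABLE after_change  (worker_id INTEGER PRIMARY KEY, salary INTEGER);
INSERT INTO before_change VALUES (1, 1750), (2, 1875), (3, 1803);
INSERT INTO after_change  VALUES (1, 1795), (2, NULL), (3, 1896);
""")

# Same join as above; workers without a salary after the change are dropped
rows = conn.execute("""
SELECT b.worker_id, b.salary AS before_change, a.salary AS after_change
FROM before_change AS b
JOIN after_change AS a ON b.worker_id = a.worker_id
WHERE a.salary IS NOT NULL
ORDER BY b.worker_id
""").fetchall()
conn.close()

print(rows)
```

Only workers 1 and 3 remain in the result, because the `IS NOT NULL` condition removes the worker who did not change jobs.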