# Statistical Data Management Session 10: Inferences Based on Two Samples: Tests of Hypothesis (Chapter 9 in McClave & Sincich)


**We expect you to be able to solve these exercises both with and without Python.**

## 1. Executive Workout Dropouts *(Ex 9.116 from the book)*

The *Journal of Sport Behaviour* (2001) conducted a study of variety in exercise workouts. One group of 40 people varied their exercise routine in workouts, while a second group of 40 exercisers had no set schedule or regulations for their workouts. By the end of the study, 15 people had dropped out of the first exercise group and 23 had dropped out of the second group.

1. Find the dropout rates (i.e., the percentage of exercisers who had dropped out of the exercise group) for each of the two groups of exercisers. 
2. Find a 90% confidence interval for the difference between the dropout rates of the two groups of exercisers.
3. Give a practical interpretation of the confidence interval you found in part 2.
4. Suppose you want to estimate the true difference in dropout rates to within 0.1 with 90% confidence. Determine the number of exercisers to be sampled from each group in order to obtain such an estimate. Assume equal sample sizes, and assume $p_1 \approx \hat{p}_1$ and $p_2 \approx \hat{p}_2$.
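
To check a hand calculation for parts 1, 2 and 4, the following is a minimal sketch based on the large-sample (normal-approximation) interval for $p_1 - p_2$. The variable names are illustrative and not prescribed by the exercise, and the sketch assumes the usual large-sample conditions hold.

```python
import math
from scipy import stats as sts

# Data from the exercise: group sizes and number of dropouts
n1, n2 = 40, 40
x1, x2 = 15, 23

# Part 1: sample dropout rates
p1_hat = x1 / n1
p2_hat = x2 / n2

# Part 2: 90% confidence interval for p1 - p2 (normal approximation)
conf_level = 0.90
z = sts.norm.ppf(1 - (1 - conf_level) / 2)
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
diff = p1_hat - p2_hat
ci_lower, ci_upper = diff - z * se, diff + z * se

# Part 4: equal sample size per group for a margin of error of 0.1,
# plugging in the sample proportions as stand-ins for p1 and p2 (assumption from the exercise)
margin = 0.1
n_required = math.ceil(z**2 * (p1_hat * (1 - p1_hat) + p2_hat * (1 - p2_hat)) / margin**2)

print(f"dropout rates: {p1_hat:.3f}, {p2_hat:.3f}")
print(f"90% CI for p1 - p2: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"required sample size per group: {n_required}")
```

Rounding the sample size up (rather than to the nearest integer) keeps the margin of error at or below the target.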

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
import time
%matplotlib inline




## 2. Salary Increase

Do workers generally increase their salary when changing jobs? To test this, 18 workers in a certain field are interviewed before and after they change jobs. Assume that salaries in this field are normally distributed.

1. Formulate $H_0$ and $H_a$.
2. Run the cell below to define the dataframe.

In [None]:
df_salaries = pd.DataFrame({
    'before_change': [1750,1875,1803,1862,1543,2122,1967,1781,2071,2051,1700,1564,1444,1715,1599,1907,2142,1801],
    'after_change': [1795,1928,1896,1834,1567,1630,1832,1892,1854,1831,1823,1816,1915,1734,2018,1727,1688,2089]})
print(df_salaries)

3. Perform the test of hypothesis at $\alpha = 0.05$.
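
Because each worker is measured twice, one way to approach part 3 is a paired-difference $t$-test. The sketch below assumes the hypotheses $H_0: \mu_d = 0$ and $H_a: \mu_d > 0$ with $d = \text{after} - \text{before}$; that formulation is an assumption here, and your answer to part 1 should justify it. The variable names are illustrative.

```python
import numpy as np
from scipy import stats as sts

# Salaries copied from the dataframe defined above
before = np.array([1750, 1875, 1803, 1862, 1543, 2122, 1967, 1781, 2071,
                   2051, 1700, 1564, 1444, 1715, 1599, 1907, 2142, 1801])
after = np.array([1795, 1928, 1896, 1834, 1567, 1630, 1832, 1892, 1854,
                  1831, 1823, 1816, 1915, 1734, 2018, 1727, 1688, 2089])

# Paired differences: positive values mean the salary went up
d = after - before
n = len(d)
d_bar = d.mean()
s_d = d.std(ddof=1)  # sample standard deviation of the differences

# Test statistic and upper-tailed p-value (assumed H_a: mu_d > 0)
t_stat = d_bar / (s_d / np.sqrt(n))
p_value = sts.t.sf(t_stat, df=n - 1)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The statistic can be cross-checked against `sts.ttest_rel(after, before)`, which reports a two-sided p-value by default.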

## 3. Compare Argument Skills *(ex. 9.23 from the book)*
Educators frequently lament weaknesses in students' oral and written arguments. In *Thinking and Reasoning* (April 2007), researchers at Columbia University conducted a series of studies to assess the cognitive skills required for successful arguments. One study focused on whether students would choose to argue by weakening the opposing position or by strengthening the favored position. (For example, suppose you are told you would do better at basketball than soccer, but you like soccer. An argument that weakens the opposing position is "You need to be tall to play basketball." An argument that strengthens the favored position is "With practice, I can become really good at soccer.") A sample of 52 graduate students in psychology was equally divided into two groups. Group 1 was presented with 10 items such that the argument always attempts to strengthen the favored position. Group 2 was presented with the same 10 items, but this time the argument always attempts to weaken the nonfavored position. Each student then rated the 10 arguments on a 5-point scale from very weak (1) to very strong (5). The variable of interest was the sum of the 10 item scores, called the total rating. Summary statistics for the data are shown in the accompanying table. You may assume both total ratings follow a normal distribution.

| | Group 1 (support favored position)| Group 2 (weaken opposing position)|
|:---| :---:| :---:|
|sample size | 26| 26|
|mean|28.6|24.9|
|standard deviation|12.5|12.2|


1. In order to determine whether the difference in means between the two groups is significant, the researchers have to assume that the variance of the total rating is the same for Group 1 and Group 2. Test the validity of this assumption at $\alpha=0.05$.
2. Compare the mean total ratings for the two groups at $\alpha=0.05$ and give a practical interpretation of the result.

In [None]:
n_1 = 26
x_bar1 = 28.6
s_1 = 12.5
n_2 = 26
x_bar2 = 24.9
s_2 = 12.2

F_distribution = sts.f(dfn=..., dfd=...)

interval = np.linspace(0, 5, 1000)
plt.plot(interval, F_distribution.pdf(interval))
plt.show()
plt.close()
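
One possible way to carry out both parts from the summary statistics alone is sketched below: a two-sided $F$-test for equal variances, followed by a pooled-variance $t$-test for the means. The degrees of freedom $n_1 - 1$ and $n_2 - 1$ and the two-sided alternatives are assumptions of this sketch; check them against your own formulation.

```python
import math
from scipy import stats as sts

# Summary statistics from the table
n_1, x_bar1, s_1 = 26, 28.6, 12.5
n_2, x_bar2, s_2 = 26, 24.9, 12.2
alpha = 0.05

# Part 1: F-test for equal variances (assumed two-sided alternative)
F_stat = s_1**2 / s_2**2
F_dist = sts.f(dfn=n_1 - 1, dfd=n_2 - 1)
p_value_F = 2 * min(F_dist.cdf(F_stat), F_dist.sf(F_stat))

# Part 2: pooled-variance t-test for the difference in means (two-sided)
df_pooled = n_1 + n_2 - 2
s_p2 = ((n_1 - 1) * s_1**2 + (n_2 - 1) * s_2**2) / df_pooled
se = math.sqrt(s_p2 * (1 / n_1 + 1 / n_2))
t_stat = (x_bar1 - x_bar2) / se
p_value_t = 2 * sts.t.sf(abs(t_stat), df=df_pooled)

print(f"F = {F_stat:.3f}, p = {p_value_F:.3f}")
print(f"t = {t_stat:.3f}, df = {df_pooled}, p = {p_value_t:.3f}")
```

The $t$-test can be cross-checked with `sts.ttest_ind_from_stats(x_bar1, s_1, n_1, x_bar2, s_2, n_2, equal_var=True)`.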

## 4. Patent Infringement Case

*Chance* (Fall 2002) described a lawsuit charging Intel Corp. with infringing on a patent for an invention used in the automatic manufacture of computer chips. In response, Intel accused the inventor of adding material to his patent notebook after the patent was witnessed and granted. The case rested on whether a patent witness's signature was written on top of or under key text in the notebook. Intel hired a physicist who used an X-ray beam to measure the relative concentrations of certain elements (e.g., nickel, zinc, potassium) at several spots on the notebook page. The zinc measurements for three notebook locations (on a text line, on a witness line, and on the intersection of the witness and text line) are provided in the following table. You may assume that measurements are drawn from a normal distribution.

| $\qquad $ | $\qquad$|
|---:| :---|
|text line: | .335 .374 .440|
|witness line:| .210 .262 .188 .329 .439 .397|
|intersection: | .393 .353 .285 .295 .319|

A large difference in variation in zinc level between the intersection and e.g. the text line would support Intel's claim.
 
1. Use a test (at $\alpha=.05$) to compare the variation in zinc measurements for the text line with the corresponding variation for the intersection.
2. Use a test (at $\alpha=.05$) to compare the variation in zinc measurements for the witness line with the corresponding variation for the intersection.

In [None]:
df_text = pd.DataFrame([0.335, 0.374, 0.44])
df_witness = pd.DataFrame([0.21, 0.262, 0.188, 0.329, 0.439, 0.397])
df_intersection = pd.DataFrame([0.393, 0.353, 0.285, 0.295, 0.319])
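
Both parts compare two sample variances, so a small helper can be reused. The function `variance_ratio_test` below is a hypothetical name introduced for illustration; it is a sketch of a two-sided $F$-test that relies on the normality assumption stated in the exercise.

```python
import numpy as np
from scipy import stats as sts

def variance_ratio_test(sample_a, sample_b):
    """Two-sided F-test of H0: var_a = var_b (illustrative helper, assumes normal data)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Ratio of sample variances (ddof=1 gives the unbiased estimator)
    F_stat = a.var(ddof=1) / b.var(ddof=1)
    F_dist = sts.f(dfn=a.size - 1, dfd=b.size - 1)
    # Double the smaller tail probability for a two-sided p-value
    p_value = min(1.0, 2 * min(F_dist.cdf(F_stat), F_dist.sf(F_stat)))
    return F_stat, p_value

# Zinc measurements from the table
text = [0.335, 0.374, 0.440]
witness = [0.210, 0.262, 0.188, 0.329, 0.439, 0.397]
intersection = [0.393, 0.353, 0.285, 0.295, 0.319]

F_text, p_text = variance_ratio_test(text, intersection)
F_witness, p_witness = variance_ratio_test(witness, intersection)

print(f"text vs intersection:    F = {F_text:.3f}, p = {p_text:.3f}")
print(f"witness vs intersection: F = {F_witness:.3f}, p = {p_witness:.3f}")
```

With samples this small the test has little power, which is worth keeping in mind when interpreting the result.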

## 5. SQL Recap

The file ``salary_differences.sql`` provided on Toledo contains the information used in exercise 2. Import the file using MySQL Workbench and write the appropriate queries to retrieve the relevant information. Re-run your analysis (without running the cell which defined the dataframe!) to check whether you have the correct information. Note that some workers have a ``NULL`` value in the table listing wages after they changed jobs. This indicates that these workers didn't change jobs (and should therefore be excluded from the result).

In [None]:
conn = sqlite3.connect("../../salary_differences.db")
query = """
SELECT before_change.salary AS before_change,
       after_change.salary AS after_change
FROM before_change
JOIN after_change
  ON before_change.worker_id = after_change.worker_id
WHERE after_change.salary IS NOT NULL
"""
df_salaries = pd.read_sql_query(query, conn)
print(df_salaries)
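
To see how the `IS NOT NULL` filter behaves without access to the course database, the sketch below rebuilds the two tables in an in-memory SQLite database. The table and column names follow the query above; the column types and the three rows of data are made up for illustration only.

```python
import sqlite3
import pandas as pd

# In-memory database with toy data (hypothetical rows, not the course file)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE before_change (worker_id INTEGER PRIMARY KEY, salary INTEGER);
CREATE TABLE after_change (worker_id INTEGER PRIMARY KEY, salary INTEGER);
INSERT INTO before_change VALUES (1, 1750), (2, 1875), (3, 1803);
INSERT INTO after_change VALUES (1, 1795), (2, NULL), (3, 1896);
""")

# Same join as above; worker 2 has no salary after the change and is dropped
query = """
SELECT before_change.salary AS before_change,
       after_change.salary AS after_change
FROM before_change
JOIN after_change
  ON before_change.worker_id = after_change.worker_id
WHERE after_change.salary IS NOT NULL
ORDER BY before_change.worker_id
"""
df_check = pd.read_sql_query(query, conn)
conn.close()
print(df_check)
```

The `ORDER BY` clause is added only to make the toy output deterministic; it does not affect which rows are returned.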