# Statistical Data Management Session 10: Inferences Based on Two Samples: Tests of Hypothesis (chapter 9 in McClave & Sincich)


## 1. Executive Workout Dropouts *(Ex 9.116 from the book)*

The *Journal of Sport Behaviour* (2001) published a study of variety in exercise workouts. One group of 40 people varied their exercise routine in workouts, while a second group of 40 exercisers had no set schedule or regulations for their workouts. By the end of the study, 15 people had dropped out of the first exercise group and 23 had dropped out of the second group.

1. Find the dropout rates (i.e., the percentage of exercisers who had dropped out of the exercise group) for each of the two groups of exercisers. 
2. Find a 90% confidence interval for the difference between the dropout rates of the two groups of exercisers.
3. Give a practical interpretation of the confidence interval you found in part 2.
4. Suppose you want to estimate the true difference in dropout rates to within 0.1 with 90% confidence. Determine the number of exercisers to be sampled from each group to obtain such an estimate. Assume equal sample sizes and use the earlier estimates, i.e. assume $p_1 \approx \hat{p}_1$ and $p_2 \approx \hat{p}_2$.
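
The four parts above can be sketched as follows. This is a minimal sketch, assuming the usual large-sample normal approximation for the difference of two proportions; the variable names (`p1_hat`, `ci`, `n_required`, ...) are illustrative and do not come from the book.

```python
import math
from scipy import stats as sts

# Data from the exercise: 40 people per group, 15 and 23 dropouts.
n1, n2 = 40, 40
x1, x2 = 15, 23

# Part 1: sample dropout rates.
p1_hat = x1 / n1
p2_hat = x2 / n2

# Part 2: 90% confidence interval for p1 - p2 (normal approximation).
conf_level = 0.90
z = sts.norm.ppf(1 - (1 - conf_level) / 2)   # about 1.645
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
diff = p1_hat - p2_hat
ci = (diff - z * se, diff + z * se)

# Part 4: equal sample size per group for a margin of error of 0.1,
# using the sample proportions as stand-ins for p1 and p2.
margin = 0.1
n_required = math.ceil(
    z**2 * (p1_hat * (1 - p1_hat) + p2_hat * (1 - p2_hat)) / margin**2
)

print("dropout rates:", p1_hat, p2_hat)
print("90% CI for p1 - p2:", ci)
print("required n per group:", n_required)
```

For part 3, check whether the interval contains 0: if it does not, the data suggest a real difference in dropout rates at this confidence level.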

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
import time
%matplotlib inline



## 2. Salary Increase

Do workers generally increase their salary when changing jobs? To test this, 18 workers in a certain field are interviewed before and after they change jobs. Assume that salaries in this field are normally distributed.

1. Formulate $H_0$ and $H_a$.
2. Run the cell below to define the dataframe.

In [None]:
df_salaries = pd.DataFrame({
    'before_change': [1750,1875,1803,1862,1543,2122,1967,1781,2071,2051,1700,1564,1444,1715,1599,1907,2142,1801],
    'after_change': [1795,1928,1896,1834,1567,1630,1832,1892,1854,1831,1823,1816,1915,1734,2018,1727,1688,2089]})
print(df_salaries)

3. Perform the test of hypothesis at $\alpha = 0.05$. Also calculate and interpret the $p$-value.

In [None]:
differences = df_salaries['after_change'] - df_salaries['before_change']




4. What would change in the case of a large sample?
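
One possible way to carry out part 3 is a paired-difference $t$-test computed by hand. This is a sketch, assuming the alternative hypothesis is that the mean difference (after minus before) is positive; the names `t_stat`, `t_crit` and `reject_h0` are illustrative.

```python
import pandas as pd
from scipy import stats as sts

# Same data as in the cell above, repeated so the sketch runs on its own.
df_salaries = pd.DataFrame({
    'before_change': [1750,1875,1803,1862,1543,2122,1967,1781,2071,2051,1700,1564,1444,1715,1599,1907,2142,1801],
    'after_change': [1795,1928,1896,1834,1567,1630,1832,1892,1854,1831,1823,1816,1915,1734,2018,1727,1688,2089]})

# Paired design: work with the per-worker differences.
differences = df_salaries['after_change'] - df_salaries['before_change']
n = len(differences)
d_bar = differences.mean()
s_d = differences.std(ddof=1)          # sample standard deviation

# Test statistic for H0: mu_d = 0 against Ha: mu_d > 0 (assumed direction).
alpha = 0.05
t_stat = d_bar / (s_d / n**0.5)
t_crit = sts.t.ppf(1 - alpha, df=n - 1)
p_value = sts.t.sf(t_stat, df=n - 1)   # upper-tail p-value
reject_h0 = bool(t_stat > t_crit)

print("t =", t_stat, "critical value =", t_crit)
print("p-value =", p_value, "reject H0:", reject_h0)
```

The $t$ distribution with $n-1$ degrees of freedom is used here because the sample is small and the salaries are assumed to be normally distributed.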

## 3. Pumpkin fertiliser *(Ex 9.23 from the book)*

To test the effect of two types of fertiliser, a sample of 52 pumpkins was divided equally into two groups, and each group was fertilised using only type A or only type B, respectively. Summary statistics for the weights (in kg) of the pumpkins at harvest are shown in the accompanying table. You may assume pumpkin weights follow a normal distribution.

| | Type A| Type B|
|:---| :---:| :---:|
|sample size | 26| 26|
|mean|28.6|24.9|
|standard deviation|12.5|12.2|


1. To determine whether the difference in means between the two types is significant, the researchers have to assume that the variance of the weights is the same in both groups. Test the validity of this assumption at $\alpha=0.05$.

2. Compare the mean weight for the two groups at $\alpha=0.05$ and give a practical interpretation of the result.
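
Both parts can be worked from the summary statistics alone. The sketch below assumes a two-sided $F$-test for equal variances (larger sample variance in the numerator) followed by a two-sided pooled-variance $t$-test; all variable names are illustrative.

```python
from scipy import stats as sts

# Summary statistics from the table above.
n_a, mean_a, sd_a = 26, 28.6, 12.5
n_b, mean_b, sd_b = 26, 24.9, 12.2
alpha = 0.05

# Part 1: F-test for equal variances. Type A has the larger sample
# variance, so it goes in the numerator and F >= 1.
f_stat = sd_a**2 / sd_b**2
p_f = 2 * sts.f.sf(f_stat, n_a - 1, n_b - 1)   # two-sided p-value

# Part 2: pooled-variance t-test for a difference in means.
df = n_a + n_b - 2
sp2 = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
t_stat = (mean_a - mean_b) / (sp2 * (1 / n_a + 1 / n_b)) ** 0.5
p_t = 2 * sts.t.sf(abs(t_stat), df=df)         # two-sided p-value

# Cross-check of part 2 with SciPy's summary-statistics helper.
check = sts.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b,
                                 equal_var=True)

print("F =", f_stat, "p =", p_f)
print("t =", t_stat, "p =", p_t)
```

In each part, reject $H_0$ when the $p$-value is below $\alpha$.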

## 4. Comparing e-scooter run distances

The following two dataframes contain distances (in km) of one-charge runs of two different types (A and B) of e-scooter. Test the following:
1. Whether type A has a significantly longer mean run distance than type B, for level of significance $\alpha=0.01$.
2. Whether type B has a significantly longer mean run distance than type A, for level of significance $\alpha=0.01$.
3. Whether type A has a significantly different mean run distance than type B, for level of significance $\alpha=0.02$. (Why can you draw a conclusion directly, without doing additional calculations?)
4. Whether type B has a significantly longer mean run distance than type A, for level of significance $\alpha=0.005$.
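
As a hint for the question in part 3, and assuming a large-sample $z$-test is used: a two-sided test at $\alpha=0.02$ puts $0.01$ in each tail, so it uses the same critical value as the one-sided tests at $\alpha=0.01$. A small check (variable names are illustrative):

```python
from scipy import stats as sts

# Critical value of a one-sided z-test at alpha = 0.01.
z_one_sided = sts.norm.ppf(1 - 0.01)

# Critical value of a two-sided z-test at alpha = 0.02 (0.01 per tail).
z_two_sided = sts.norm.ppf(1 - 0.02 / 2)

print(z_one_sided, z_two_sided)   # both about 2.326
```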

In [None]:
dataframe_a = pd.read_csv("../../shared/dataframe_a.csv", header=None)
dataframe_b = pd.read_csv("../../shared/dataframe_b.csv", header=None)
print("mean a:", dataframe_a.mean())
print("mean b:", dataframe_b.mean())

## 5. Challenge: automated testing

The previous exercise invites automation! Write a function that accepts 4 parameters: (1 & 2) two large ($n_i \geq 30$) dataframes containing quantitative data, (3) a string indicating whether we test for "larger", "two-sided" or "smaller", and (4) a significance level alpha. The function returns whether the mean of the first dataframe is significantly larger than, different from, or smaller than the mean of the second at the given level of significance.
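
One possible shape for such a function is sketched below. It assumes single-column dataframes and a large-sample $z$-test on the difference in means. The function name `z_test_is_significant` and the synthetic `demo_a` / `demo_b` data are made up for illustration, so the sketch does not depend on the CSV files or overwrite the stub below.

```python
import numpy as np
import pandas as pd
from scipy import stats as sts

def z_test_is_significant(df1, df2, kind, alpha):
    """Large-sample z-test: is mean(df1) larger/smaller/different vs mean(df2)?"""
    x1 = df1.to_numpy(dtype=float).ravel()
    x2 = df2.to_numpy(dtype=float).ravel()
    # Standard error of the difference in sample means.
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    z = (x1.mean() - x2.mean()) / se
    if kind == "larger":
        p_value = sts.norm.sf(z)            # upper tail
    elif kind == "smaller":
        p_value = sts.norm.cdf(z)           # lower tail
    elif kind == "two-sided":
        p_value = 2 * sts.norm.sf(abs(z))   # both tails
    else:
        raise ValueError("kind must be 'larger', 'smaller' or 'two-sided'")
    return bool(p_value < alpha)

# Synthetic demo data (NOT the e-scooter data): clearly separated means.
rng = np.random.default_rng(0)
demo_a = pd.DataFrame(rng.normal(loc=30.0, scale=2.0, size=50))
demo_b = pd.DataFrame(rng.normal(loc=25.0, scale=2.0, size=50))

print(z_test_is_significant(demo_a, demo_b, "larger", 0.01))
print(z_test_is_significant(demo_a, demo_b, "smaller", 0.01))
```

Comparing the $p$-value with `alpha` avoids computing a separate critical value for each kind of test.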

In [None]:
def test_is_significant(df1, df2, kind, alpha):
    # < your implementation goes here >
    return False
    

In [None]:
# These should yield a non-significant test result (False, i.e. don't reject H_0)
print(test_is_significant(dataframe_a, dataframe_b, "larger", 0.005))
print(test_is_significant(dataframe_b, dataframe_a, "smaller", 0.005))
print(test_is_significant(dataframe_a, dataframe_b, "two-sided", 0.01))
print(test_is_significant(dataframe_b, dataframe_a, "two-sided", 0.01))
print(test_is_significant(dataframe_b, dataframe_a, "larger", 0.01))
print()

# These tests should be significant (True, i.e. reject H_0)
print(test_is_significant(dataframe_a, dataframe_b, "larger", 0.01))
print(test_is_significant(dataframe_b, dataframe_a, "smaller", 0.01))
print(test_is_significant(dataframe_a, dataframe_b, "two-sided", 0.02))
print(test_is_significant(dataframe_b, dataframe_a, "two-sided", 0.02))



## 6. SQL Recap

The file ``salary_differences.sql`` provided on Toledo contains the information used in exercise 2. Import the file using MySQL Workbench and write the appropriate queries to retrieve the relevant information. Re-run your analysis (without running the cell which defined the dataframe!) to check whether you have the correct information. Note that some workers have a ``NULL`` value in the table listing wages after they changed jobs. This indicates that these workers didn't change jobs (and should therefore be excluded from the result).
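
The real table and column names are in the provided file and are not reproduced here. As an illustration only, the sketch below builds a toy in-memory SQLite database with hypothetical tables `wages_before` and `wages_after` (and a hypothetical `worker_id` key) to show one way of excluding workers whose wage after the change is `NULL`.

```python
import sqlite3
import pandas as pd

# Toy in-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wages_before (worker_id INTEGER PRIMARY KEY, wage INTEGER);
CREATE TABLE wages_after  (worker_id INTEGER PRIMARY KEY, wage INTEGER);
INSERT INTO wages_before VALUES (1, 1750), (2, 1875), (3, 1600);
INSERT INTO wages_after  VALUES (1, 1795), (2, 1928), (3, NULL);
""")

# Join the two tables and keep only workers who actually changed jobs.
query = """
SELECT b.wage AS before_change, a.wage AS after_change
FROM wages_before AS b
JOIN wages_after AS a ON a.worker_id = b.worker_id
WHERE a.wage IS NOT NULL
ORDER BY b.worker_id
"""

demo = pd.read_sql_query(query, conn)
conn.close()
print(demo)
```

Worker 3 in the toy data has a `NULL` wage after the change and is therefore dropped by the `WHERE` clause.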

In [None]:
conn = sqlite3.connect("../../shared/salary_differences.db")

query = """
<your query here>
"""

df_salaries = pd.read_sql_query(query, conn)
print(df_salaries)