Hypothesis Testing - Tutorial
===

**Dr Chao Shu (chao.shu@qmul.ac.uk)**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats

sns.set_theme(style="ticks")

# Set the random seed for reproducibility
np.random.seed(42)

# Import the tutorial dataset
stack_overflow_df = pd.read_csv("datasets/T03_stack_overflow.csv")

## Introduction
---

Each year, [Stack Overflow](https://stackoverflow.com/) surveys its users, most of whom are software developers. The survey covers users' personal information, how they use Stack Overflow, their work, the development tools they use, and more.

In this tutorial, we'll look at a subset of the [2020](https://insights.stackoverflow.com/survey/2020) survey responses from users who identified as Data Scientists and explore some fun facts.

As usual, let's first take a quick look at the dataset. 


In [None]:
print(stack_overflow_df.shape)
stack_overflow_df.head()


In [None]:
stack_overflow_df.info()

## 💰Annual Compensation of Data Scientists
---

Suppose we are interested in the income of data scientists, and we heard somewhere on the internet that the mean annual compensation of data scientists is $100,000. Based on the Stack Overflow survey data, we would like to know whether there is sufficient evidence to support this claim.

**Step 1**: Set up a hypothesis

**Q1.1**: State the null and alternative hypotheses.

Put your answer here.

**Step 2**: Collect a sample and find out the summary statistics of the sample.

Since we already have the data, let's explore the sample data. The annual compensation, converted to dollars, is stored in the `converted_comp` column.

**Q1.2**: Find the sample size, sample mean and sample standard deviation. Furthermore, plot a histogram and a boxplot to examine the distribution of the annual compensation data.

In [None]:
sample_size = stack_overflow_df._____[0]
sample_mean = stack_overflow_df['_____']._____()
sample_std = stack_overflow_df['_____']._____()

print("sample size =  ", sample_size)
print("Sample (mean, std) = ({}, {})".format(sample_mean, sample_std))

In [None]:
# Check the distribution of the data of interest
fig, axs = plt.subplots(1, 2, figsize=(8, 3))

sns._____(data=_____['________'], ax=axs[0])
sns._____(data=_____['________'], ax=axs[1])

fig.tight_layout()
plt.show()

**Q1.3**: What hypothesis test method(s) should be used? Why?

Put your answer here.

**Step 3**: Define a significance level

As usual, we set $\alpha = 0.05$.

**Step 4**: Generate the sampling distribution of the statistic of interest under the null hypothesis

**Q1.4**: Under the null hypothesis, what is the mean annual compensation? 

In [None]:
mean_null = _____

**Q1.5**: Find out the **standard error** of the mean annual compensation.
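
As a reminder of what a standard error measures, the sketch below estimates the standard error of a sample mean in two ways: analytically, as $s / \sqrt{n}$, and by bootstrapping. It runs on synthetic, right-skewed data; the array `toy_comp` and every number in it are made up for illustration and are not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample (synthetic, not the survey data)
toy_comp = rng.lognormal(mean=11, sigma=0.8, size=500)

# Analytical estimate: sample standard deviation divided by sqrt(n)
se_formula = toy_comp.std(ddof=1) / np.sqrt(toy_comp.size)

# Bootstrap estimate: standard deviation of the means of the resampled data
boot_means = [
    rng.choice(toy_comp, size=toy_comp.size, replace=True).mean()
    for _ in range(2000)
]
se_boot = np.std(boot_means, ddof=1)

print("SE (formula)   =", se_formula)
print("SE (bootstrap) =", se_boot)
```

For a sample of this size the two estimates should be close; either approach could be applied to the exercise.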

**Q1.6**: What distribution does the sampling distribution of the mean annual compensation follow? What distribution will the null distribution follow?

Put your answer here.

**Step 5**: Determine the p-value

**Q1.7**: Calculate p-value using a suitable approach.
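
If the null distribution is approximately normal, a z-score can be turned into a p-value with `scipy.stats.norm`. The sketch below shows the three common cases using made-up numbers; `toy_stat`, `toy_null` and `toy_se` are hypothetical and unrelated to the survey.

```python
from scipy import stats

# Hypothetical values for illustration only
toy_stat = 105.0  # observed sample statistic
toy_null = 100.0  # value of the parameter under the null hypothesis
toy_se = 2.5      # standard error of the statistic

# Standardise the observed statistic
z_toy = (toy_stat - toy_null) / toy_se

p_right = stats.norm.sf(z_toy)         # right-tailed: P(Z >= z)
p_left = stats.norm.cdf(z_toy)         # left-tailed:  P(Z <= z)
p_two = 2 * stats.norm.sf(abs(z_toy))  # two-tailed

print("z =", z_toy)
print("p (right, left, two-tailed) =", p_right, p_left, p_two)
```

Which tail applies depends on the direction of the alternative hypothesis.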

In [None]:
# Calculate z-score
z = (________ - ________) / ________

# Calculate p-value for this right-tailed test
p_value = ________

print("(z-score, p-value) = ({}, {})".format(z, p_value))

**Step 6**: Draw conclusions


**Q1.8**: Based on the p-value and the significance level, what can we conclude about the mean annual compensation of data scientists?

Put your answer here.

## 🧒Annual Compensation of Data Scientists who First Programmed as a Child
---

We noticed that the survey asked participants at what age they wrote their first line of code; the answers are stored in the `age_1st_code` column. We would like to know whether data scientists who first programmed as children earn more than those who started as adults.

### Data Wrangling

In this mini-project, we classify the participants into two categories, those who first programmed as children and those who first programmed as adults, based on the age at which they wrote their first line of code.

**Q2.1**: Add a column `age_category` to `stack_overflow_df` that labels every respondent according to the definition below:
- Respondents who wrote their first line of code at 14 or older are labelled as "adult"
- Respondents who wrote their first line of code before 14 are labelled as "child"

*hint: use [`pandas.cut`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) to generate labels for each respondent*
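
To illustrate how `pandas.cut` treats bin edges, here is a small sketch on made-up data. The series `toy_ages`, the bin edges and the labels are all hypothetical and deliberately different from those needed in the exercise.

```python
import pandas as pd

# Hypothetical ages, not taken from the survey
toy_ages = pd.Series([5, 17, 18, 30, 64, 65, 80])

# With right=True (the default) the bins are (0, 17], (17, 64], (64, 120],
# so a value equal to an upper edge falls into the lower bin.
toy_groups = pd.cut(
    toy_ages,
    bins=[0, 17, 64, 120],
    labels=["minor", "working_age", "senior"],
)

print(toy_groups.tolist())
```

Note that with the default settings the lowest edge itself is excluded, so a value equal to it is labelled `NaN` unless `include_lowest=True` is passed.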

In [None]:
# Check the summary statistics of the data in `age_1st_code` column
stack_overflow_df['________'].________()

In [None]:
# Convert the ages to integer (optional)
stack_overflow_df['age_1st_code'] = stack_overflow_df['age_1st_code'].astype(int)

# Define the age categories
bins = [0, ___, stack_overflow_df['age_1st_code'].max()]  # The range will be (0, 13], (13, 45] with right=True as default
labels = ['______', '______']

# Create a new column to store the categories
stack_overflow_df['________'] = pd.cut(stack_overflow_df['________'], bins=bins, labels=labels)

### Hypothesis Test

The question we try to answer in this hypothesis test is whether data scientists who first programmed as children earn more than those who started as adults.

**Step 1**: Set up a hypothesis

**Q2.2**: State the null and alternative hypotheses.

Put your answer here.

**Step 2**: Collect a sample and find out the summary statistics of the sample.

Since we already have the data, let's explore the sample data in different age categories.

**Q2.3**: Find the sample mean, sample standard deviation and sample size for each age group.

In [None]:
# Save data scientists in different age categories in separate dataframes
child_df = stack_overflow_df.query('__________ == "______"')
adult_df = stack_overflow_df.query('__________ == "______"')

# Sample statistics for the "child" and "adult" groups
mean_child, std_child, n_child = child_df['___________'].______(), child_df['___________'].______(), child_df['___________'].________
mean_adult, std_adult, n_adult = adult_df['___________'].______(), adult_df['___________'].______(), adult_df['___________'].________
print("'child' group: (mean, std, n) = ({}, {}, {})".format(mean_child, std_child, n_child))
print("'adult' group: (mean, std, n) = ({}, {}, {})".format(mean_adult, std_adult, n_adult))


**Q2.4**: What shape does the distribution of each age group's annual compensation follow?

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(8, 4))

# Show the histograms of the annual compensation of each age group 
sns.histplot(child_df['________'], ax=axs[0])
sns.histplot(adult_df['________'], ax=axs[1])

fig.tight_layout()
plt.show()

**Step 3**: Define a significance level

As usual, we set $\alpha = 0.05$.

**Step 4**: Generate the sampling distribution of the statistic of interest under the null hypothesis

**Q2.5**: Using bootstrapping, obtain the standard errors of the mean annual compensation for both age groups, as well as the standard error of the difference in mean annual compensation between the two age groups.
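
The general idea can be sketched on synthetic data: resample the rows with replacement, recompute the statistic on each resample, and take the standard deviation of the results as the standard error. The dataframe `toy_df`, its columns `group` and `value`, and all numbers below are hypothetical. The resampling here draws row positions with NumPy, which is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic two-group data; column names and values are hypothetical
toy_df = pd.DataFrame({
    "group": ["a"] * 200 + ["b"] * 300,
    "value": np.concatenate([rng.normal(10, 2, 200), rng.normal(9, 2, 300)]),
})
n_rows = len(toy_df)

toy_diffs = []
for _ in range(1000):
    # Draw n_rows row positions with replacement to build one bootstrap sample
    idx = rng.integers(0, n_rows, size=n_rows)
    resampled = toy_df.iloc[idx]

    # Recompute the statistic of interest on the bootstrap sample
    group_means = resampled.groupby("group")["value"].mean()
    toy_diffs.append(group_means["a"] - group_means["b"])

# The spread of the bootstrap statistics estimates the standard error
se_diff = np.std(toy_diffs, ddof=1)
print("bootstrap SE of the difference in means:", se_diff)
```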

In [None]:
boot_child_means, boot_adult_means, boot_diff_means = [], [], []

for i in range(10000):
    # Bootstrap the original sample to generate a new sample
    boot_sample = stack_overflow_df.sample(frac=___, replace=___)

    # Calculate the mean annual compensation for the two age groups, respectively.
    child_mean = ____________[____________['________'] == 'child']['____________'].mean() 
    adult_mean = ____________[____________['________'] == '_____']['____________'].mean() 

    # Insert current mean to the list of means to generate the sampling distributions of the mean annual compensation for both age groups
    # as well as the sampling distribution of the mean difference between the two age groups.
    boot_child_means.append(____________)
    boot_adult_means.append(____________)
    boot_diff_means.append(____________ - ____________)

print("standard error of the mean annual compensation for data scientists who start programming as children from bootstrapping: ", np.std(boot_child_means))
print("standard error of the mean annual compensation for data scientists who start programming as adults from bootstrapping: ", np.std(boot_adult_means))
print("standard error of the mean difference from bootstrapping: ", np.std(boot_diff_means))


**Q2.6**: Use histograms to visualise the sampling distributions of the mean annual compensation for both age groups and the sampling distribution of the difference in mean annual compensation between the two groups.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(8, 3))

sns.histplot(data=____________, ax=axs[0], alpha=0.5, label="child")
sns.histplot(data=____________, ax=axs[0], alpha=0.5, label="adult")
axs[0].set_xlabel("Mean Annual Compensation")
axs[0].legend()

sns.histplot(data=____________, ax=axs[1])
axs[1].set_xlabel("Mean Difference between two age groups")

fig.tight_layout()
plt.show()

**Q2.7**: What distribution does the sampling distribution of the mean annual compensation difference follow? What distribution will the null distribution of the mean annual compensation difference follow?

Put your answer here.

**Step 5**: Determine the p-value

**Q2.8**: Calculate p-value using a suitable approach.

In [None]:
# Calculate z-score
z = (________ - ________) / ________

# Calculate p-value for this right-tailed test
p_value = ________

print("(z-score, p-value) = ({}, {})".format(z, p_value))

**Step 6**: Draw conclusions


**Q2.9**: Based on the p-value and the significance level, what conclusion can you draw?

Put your answer here.

### Perform a Parametric Test

We have been using bootstrapping in this tutorial so far. From the bootstrapping results, it can be observed that the sampling distribution of the statistic of interest is approximately normal, which suggests that the sample size in this example is sufficiently large for the CLT to apply.

**Q2.10**: Perform a two-sample t-test using `scipy.stats.ttest_ind()` and compare the results from the two-sample t-test with the results based on bootstrapping.
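
As a reference for the call signature, the sketch below runs `scipy.stats.ttest_ind` on two synthetic samples. The arrays `toy_a` and `toy_b` and their parameters are hypothetical. The choices `equal_var=False` (Welch's t-test) and `alternative="greater"` are assumptions made for this illustration; whether they suit the exercise is for you to decide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic samples; names and parameters are hypothetical
toy_a = rng.normal(loc=10.5, scale=2.0, size=200)
toy_b = rng.normal(loc=10.0, scale=2.5, size=300)

# equal_var=False requests Welch's t-test, which does not assume equal variances.
# alternative="greater" tests whether mean(toy_a) > mean(toy_b) (needs SciPy >= 1.6).
t_stat, p_val = stats.ttest_ind(toy_a, toy_b, equal_var=False, alternative="greater")

print("t =", t_stat)
print("one-sided p-value =", p_val)
```

With large samples, the t statistic and its p-value should be close to the z-score and p-value obtained from a normal approximation.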