# 1. Hypothesis Testing Fundamentals

## 1.1 Hypothesis Tests and Z-Scores

*A little bit of introduction*

---

A hypothesis test starts with a statement about a population parameter, such as the population mean or proportion.
This kind of statement is known as the **null hypothesis** $(H_0)$. 

We then collect sample data and calculate a sample statistic (e.g., the sample mean) to see whether there is enough evidence to reject the **null hypothesis** in favor of an alternative hypothesis $(H_A)$.

What is the main reason to use an A/B test?

It provides a way to check the outcomes of competing scenarios and decide which way to proceed. In other words, A/B testing lets you compare scenarios to see which best achieves some goal.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
late_shipments= pd.read_feather('datasets/late_shipments.feather')
stack_overflow= pd.read_feather('datasets/stack_overflow.feather')

In [4]:
# Print the late_shipments dataset
late_shipments.head()

| Unnamed: 0 | id | country | managed_by | fulfill_via | vendor_inco_term | shipment_mode | late_delivery | late | product_group | sub_classification | ... | line_item_quantity | line_item_value | pack_price | unit_price | manufacturing_site | first_line_designation | weight_kilograms | freight_cost_usd | freight_cost_groups | line_item_insurance_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36203.0 | Nigeria | PMO - US | Direct Drop | EXW | Air | 1.0 | Yes | HRDT | HIV test | ... | 2996.0 | 266644.0 | 89.0 | 0.89 | Alere Medical Co., Ltd. | Yes | 1426.0 | 33279.83 | expensive | 373.83 |
| 1 | 30998.0 | Botswana | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test | ... | 25.0 | 800.0 | 32.0 | 1.6 | Trinity Biotech, Plc | Yes | 10.0 | 559.89 | reasonable | 1.72 |
| 2 | 69871.0 | Vietnam | PMO - US | Direct Drop | EXW | Air | 0.0 | No | ARV | Adult | ... | 22925.0 | 110040.0 | 4.8 | 0.08 | Hetero Unit III Hyderabad IN | Yes | 3723.0 | 19056.13 | expensive | 181.57 |
| 3 | 17648.0 | South Africa | PMO - US | Direct Drop | DDP | Ocean | 0.0 | No | ARV | Adult | ... | 152535.0 | 361507.95 | 2.37 | 0.04 | Aurobindo Unit III, India | Yes | 7698.0 | 11372.23 | expensive | 779.41 |
| 4 | 5647.0 | Uganda | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test - Ancillary | ... | 850.0 | 8.5 | 0.01 | 0.0 | Inverness Japan | Yes | 56.0 | 360.0 | reasonable | 0.01 |


In [5]:
# Calculate the proportion of late shipments
print(late_shipments['late'].value_counts(normalize= True))
late_prop_samp = late_shipments['late'].value_counts(normalize= True)['Yes']

# Print the results
print(late_prop_samp) # This is for yes only

late
No     0.939
Yes    0.061
Name: proportion, dtype: float64
0.061


The proportion of late shipments in the sample is `0.061`, or `6.1%`.

In [6]:
# Creating the bootstrap distribution
late_shipments_boot_distn= []
for i in range(5000):
    late_shipments_boot_distn.append(
        np.mean(
            late_shipments.sample(frac= 1, replace= True)['late_delivery']
        )
    )

*Z-Scores*

---

Z-scores play a crucial role in hypothesis testing. They standardize the difference between the sample statistic and the hypothesized parameter value under the null hypothesis. The formula for calculating a z-score is as follows:

$$
z = \frac{\text{sample\_stat} - \text{hypoth\_param\_value}}{\text{standard\_error}}
$$

Where:
- `sample_stat` is the sample statistic (e.g., sample mean)
- `hypoth_param_value` is the hypothesized value of the population parameter under the null hypothesis
- `standard_error` is the standard error of the sample statistic
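
As a small illustration of the formula, the sketch below computes a z-score for made-up numbers. The function name `z_score_from` and the input values are assumptions chosen for illustration; they are not taken from the shipments data.

```python
def z_score_from(sample_stat, hypoth_param_value, standard_error):
    """Standardize the gap between a sample statistic and a hypothesized value."""
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    return (sample_stat - hypoth_param_value) / standard_error

# Made-up inputs: sample proportion 0.25, hypothesized 0.20, standard error 0.02
example_z = z_score_from(0.25, 0.20, 0.02)
print(round(example_z, 2))
```

A z-score near zero means the sample statistic sits close to the hypothesized value; a large absolute value means it sits many standard errors away.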

In [7]:
# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06

# Calculate the standard error
std_error = np.std(late_shipments_boot_distn, ddof= 1)

# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp)/std_error

# Print z_score
print(z_score)

0.13284462933329028


## 1.2 P-Values

Hypothesis tests are used to determine whether the sample statistic lies in the tails of the null distribution. However, the way that the alternative hypothesis is phrased affects which tail(s) we are interested in.

The tails of the distribution that are relevant depend on whether the alternative hypothesis refers to "greater than" (right tail), "less than" (left tail), or "differences between" (both tails).
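
The link between the alternative hypothesis and the tail(s) can be sketched with the standard normal distribution. The helper `p_value_from_z` and its `alternative` labels are hypothetical names used only for this illustration, and the z-score of 1.5 is made up.

```python
from scipy.stats import norm

def p_value_from_z(z, alternative):
    """Tail probability for a z-score under a standard normal null distribution."""
    if alternative == "greater":      # right-tailed test
        return 1 - norm.cdf(z)
    if alternative == "less":         # left-tailed test
        return norm.cdf(z)
    if alternative == "two-sided":    # both tails
        return 2 * (1 - norm.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

# Same z-score, three different alternatives
for alt in ("greater", "less", "two-sided"):
    print(alt, p_value_from_z(1.5, alt))
```

The same z-score gives three different tail probabilities, which is why the alternative hypothesis has to be fixed before the test is run.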




*What are p-values?*

---

P-values provide a measure of the strength of evidence against the null hypothesis. A p-value is the probability of observing a sample statistic at least as extreme as the one calculated from the data, assuming the null hypothesis is true.

In [8]:
from scipy.stats import norm
# Calculate the right-tailed p-value (area above z_score under the standard normal)
p_value = 1 - norm.cdf(z_score, loc= 0, scale= 1)
                 
# Print the p-value
print(p_value)

0.4471581290129252


*Interpreting p-values*

---

1. A small p-value, typically $p \leq 0.05$, indicates strong evidence against the null hypothesis. In such cases, we reject the null hypothesis in favor of the alternative hypothesis.

2. A large p-value suggests weak evidence against the null hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis.

## 1.3 Statistical Significance
1. Recall that p-values quantify the evidence against the null hypothesis.
2. Large p-value -> fail to reject the null hypothesis (this is not the same as proving $H_0$ true).
3. Small p-value -> reject the null hypothesis (i.e., go with $H_A$).
4. Where is the cutoff point? At the significance level $\alpha$.
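
A minimal sketch of the resulting decision rule, assuming a significance level of 0.05 chosen before the test is run (a common convention, not a value fixed by this notebook). The names `decide` and `alpha` are illustrative.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value to the significance level alpha (assumed to be 0.05)."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# The p-value computed earlier in these notes was about 0.447
print(decide(0.447))
```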

In [9]:
# Calculate 95% confidence interval using quantile method
lower= np.quantile(late_shipments_boot_distn, 0.025)
upper= np.quantile(late_shipments_boot_distn, 0.975)

# Print the confidence interval
print((lower, upper))

(0.046, 0.076)
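
A confidence interval gives a complementary view: when the hypothesized value lies inside the interval, the sample is roughly consistent with it at the matching significance level. The sketch below uses the rounded interval printed above and the hypothesized proportion of 0.06 from earlier; the helper name `in_interval` is illustrative.

```python
def in_interval(value, lower, upper):
    """Return True if value lies within the closed interval [lower, upper]."""
    return lower <= value <= upper

# Rounded 95% bootstrap interval from the output above, and the hypothesized proportion
ci_lower, ci_upper = 0.046, 0.076
hypothesized = 0.06
print(in_interval(hypothesized, ci_lower, ci_upper))
```

Here the hypothesized 6% falls inside the interval, which agrees with the large p-value found earlier.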


# 2. Two-Sample and ANOVA tests

## 2.1 Performing t-tests

1. **What is hypothesis testing?**

**Purpose**: Test whether an assumption (hypothesis) about a dataset is true.

What are the two types of hypotheses?
- Null Hypothesis ($H_0$): No effect or no difference (e.g. $\mu_1 = \mu_2$)
- Alternative Hypothesis ($H_A$): Indicates an effect or a difference (e.g. $\mu_1 > \mu_2$)


---
2. **Comparing two groups**

For example, compare the compensation of those who started coding as a child vs. as an adult.

Data:

- $\bar{x}_{\text{child}} = 132{,}419.57$
- $\bar{x}_{\text{adult}} = 111{,}313.31$

Question: Are child coders compensated more than adult coders?

---
3. **Hypothesis Testing Workflow**

- Identify the population parameter that is hypothesized about.
- Specify the null and alternative hypotheses.
- Determine the (standardized) test statistic and the corresponding null distribution.
- Conduct the hypothesis test in Python.
- Measure the evidence against the null hypothesis and compare it to the significance level.
- Interpret the results in the context of the original problem.
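
The workflow above can be sketched end to end on synthetic data. Everything in this block is an assumption for illustration: the two groups are randomly generated, the 0.05 significance level is a convention, and Welch's t-test via `scipy.stats.ttest_ind` (its `alternative` argument needs SciPy 1.6 or later) is one reasonable choice of test, not necessarily the one used in the course.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Parameter of interest: the difference in population means.
# H0: mu_a = mu_b, HA: mu_a > mu_b (synthetic groups, not real data)
group_a = rng.normal(loc=110, scale=15, size=200)
group_b = rng.normal(loc=100, scale=15, size=200)

# Test statistic and null distribution: Welch's t-test, right-tailed
result = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

# Compare the evidence (p-value) to the significance level
alpha = 0.05
reject_null = result.pvalue <= alpha

# Interpret the result in context
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g}")
print("Reject H0" if reject_null else "Fail to reject H0")
```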

Source: DataCamp

---

**The exercises below are practice from DataCamp**

The late shipments dataset has been split into two groups: the "Yes" group and the "No" group.

In [19]:
# Split the shipments by whether they were late
late_shipments_yes= late_shipments[late_shipments['late'] == 'Yes']
late_shipments_no= late_shipments[late_shipments['late'] == 'No']

# Extract the shipment weights for each group
late_shipments_yes_kg= late_shipments_yes['weight_kilograms']
late_shipments_no_kg= late_shipments_no['weight_kilograms']

# Sample size of each group
n_yes= len(late_shipments_yes_kg)
n_no= len(late_shipments_no_kg)

Recall that,
> The hypothesis test for determining whether there is a difference between the means of two populations uses a different type of test statistic, called "t". It can be calculated from three values from each sample (the mean, the standard deviation, and the sample size) using the following equation.

$$
t = \frac{\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}}{\sqrt{\frac{s^2_{\text{child}}}{n_{\text{child}}} + \frac{s^2_{\text{adult}}}{n_{\text{adult}}}}}
$$
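
To make the formula concrete, the sketch below computes the statistic from the three per-sample values (mean, standard deviation, size) and checks it against `scipy.stats.ttest_ind` with `equal_var=False`, which uses the same unpooled standard error. The helper name `t_from_summary` and the synthetic samples are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def t_from_summary(xbar_1, s_1, n_1, xbar_2, s_2, n_2):
    """t statistic for a difference in means using the unpooled standard error."""
    return (xbar_1 - xbar_2) / np.sqrt(s_1**2 / n_1 + s_2**2 / n_2)

rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=5.0, scale=2.0, size=80)   # synthetic data
sample_2 = rng.normal(loc=4.0, scale=3.0, size=120)  # synthetic data

# Sample standard deviations use ddof=1, matching pandas' default .std()
t_manual = t_from_summary(
    sample_1.mean(), sample_1.std(ddof=1), len(sample_1),
    sample_2.mean(), sample_2.std(ddof=1), len(sample_2),
)
t_scipy = ttest_ind(sample_1, sample_2, equal_var=False).statistic
print(np.isclose(t_manual, t_scipy))
```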

Why might shipments be late?

Do shipments that arrived _on time_ weigh less than shipments that were late?

In [21]:
# Sample mean weight of each group
xbar_yes= late_shipments_yes_kg.mean()
xbar_no= late_shipments_no_kg.mean()

# Sample standard deviation of weight for each group
s_yes= late_shipments_yes_kg.std()
s_no= late_shipments_no_kg.std()

# Calculate the numerator of the test statistic
numerator = xbar_yes - xbar_no

# Calculate the denominator of the test statistic
denominator = np.sqrt( (s_yes**2)/n_yes +  (s_no**2)/n_no)

# Calculate the test statistic
t_stat = numerator/denominator

# Print the test statistic
print(t_stat)

2.3936661778766433


## 2.2 Calculating p-values from t-statistic