## A/B testing with Python



>### Use Case Scenario 
>
>Imagine you are working as a waitress in a restaurant. It's only a summer job, but you’ve realized that your basic salary is nothing and that you need to make good tips to earn enough money for your winter ski trip with your high school friends. You’re not happy with your tips either, and you’re wondering if there is anything that could help you increase your salary.
>
>You've decided to take a scientific approach and perform an A/B testing experiment. To be sure about what causes any potential difference, you decided to change only one thing at work and test its effect: the time of the shift (day shift or night shift). You’ve always worked during the day, but maybe evening shifts would be more profitable in terms of tips. Now you want to learn how to apply A/B testing methods to help you with your experiment.

## Let's break it down:

**Goal:** Find out whether switching from day shifts to night shifts in a waitressing job increases tips enough to fund the ski trip.

#### Defining A/B test:

1. **Initial goal:** 
   - The night shift should bring a higher number or amount of tips.
   
2. **Define test metrics:**
   - **Option A:** Conversion Rate: tips per shift.
   - **Option B:** Average Shift-Tips Amount.

3. **Define Minimum Effect of Interest (MEI):** 
   - +5% Conversion Rate (CR) or +15% Average Shift-Tips Amount.

4. **Decide on the direction of change:** 
   - One-sided -> larger.

5. **Formulate the Null Hypothesis:** 
   - $H_{0}$: The type of shift doesn't influence the tips conversion rate or average shift-tips amount.
   - $H_{1}$: The type of shift will improve the tips conversion rate or average shift-tips amount by 5%.

6. **Consider the significance level $\alpha$, Type I error (wrongly finding a difference):** 
   - Confidence level $(1-\alpha)$ at 95% is acceptable. (p-value must be below 0.05 to reject $H_{0}$).

7. **Choose the appropriate test based on the distribution and sample size.**
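
Steps 3 to 6 can be combined into a simple decision rule. The sketch below is only an illustration: the function name `evaluate_ab_test` and the example numbers are hypothetical, and the MEI is read here as an absolute lift in percentage points (an assumption).

```python
def evaluate_ab_test(p_value, rate_a, rate_b, alpha=0.05, mei=0.05):
    """Combine the significance check with the MEI check.

    rate_a is the control (day shift), rate_b the variant (night shift).
    mei is interpreted as an absolute lift, e.g. 0.05 = 5 percentage points
    (assumption made for this sketch).
    """
    significant = p_value < alpha   # can we reject H0?
    lift = rate_b - rate_a          # observed difference between the variants
    reaches_mei = lift >= mei       # is the difference large enough to matter?
    return {
        "significant": significant,
        "lift": lift,
        "reaches_mei": reaches_mei,
        "switch": significant and reaches_mei,
    }


# hypothetical numbers, for illustration only
result = evaluate_ab_test(p_value=0.03, rate_a=0.10, rate_b=0.16)
print(result)
```

Switching is only recommended when both checks pass: a significant but tiny difference, or a large but non-significant one, is not enough.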

#### Objectives:

The second part of the A/B testing topic is more hands-on.

1. Selecting a test statistic 

2. Example of comparing conversion rates with **z-test and statsmodels**

3. Example of comparing means with **t-test and scipy**

4. Exercise: **t-test online calculator** 

5. Bonus: sample size with an online calculator and exercises

>Since we are not going to perform any calculations by hand, we will look at two ways to calculate the p-value:
>- with python libraries (like `statsmodels` or `scipy`) 
>- and online calculators 
  



# 1. Selecting a test statistic (Statistical Tests)

*In hypothesis testing we always work with the same concept but depending on the metric we're testing and the population characteristics we might be using different test statistics.*

*Test statistics refer to different distribution functions and will have different assumptions about the data. Know what kind of data you're dealing with and then pick the best method for your A/B test.*

### t-test

Also known as Student's t-test. It is used when:

- the observations are normally distributed
- *the two samples have similar variances*
- we want to compare the means of two groups
- the sample sizes are small and the population variances are unknown

###  z-test

- the sample is normally distributed
- we know the true characteristics of the populations (mean & standard deviation)
- each sample has more than 30 observations
- appropriate for comparing means or proportions

*Note: if the samples are large enough, we use the z-test instead of the t-test even if not all of these conditions are met.*
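
One way to see why large samples make the two tests nearly interchangeable is to compare their critical values. This is a minimal sketch, assuming the one-sided 5% significance level from the test definition; the sample sizes in the loop are arbitrary.

```python
from scipy import stats

alpha = 0.05  # one-sided significance level
z_crit = stats.norm.ppf(1 - alpha)  # critical value of the standard normal

# the t critical value depends on the degrees of freedom (n - 1 for one sample)
for n in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    print(f"n={n:>4}: t critical = {t_crit:.3f}, z critical = {z_crit:.3f}")
```

The t critical value is always a bit larger than the z critical value, and the gap shrinks as the sample grows.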

<details><summary><b><font color="yellow">And many more...</font></b></summary>

- **Chi-square test:**   
  <sup>examines relationships between categorical variables. Used to test if distributions of categorical variables differ.</sup>  
- **Welch's t-test:**  
  <sup>similar to regular t-test but doesn't hold the requirement on the similar variances.</sup>  
- **Fisher's exact test:**  
  <sup>used when comparing two binomial distributions such as a click-through rate.</sup> 
- **ANOVA (Analysis of Variance):**  
  <sup>Compares means among three or more groups. Useful for examining more than two groups.</sup>
- **Bayesian Methods:**  
  <sup>Offers a probability of a hypothesis being true given the data, rather than a p-value.</sup>
- **F-test:**  
  <sup>Compares variances of two populations. Often used in the context of ANOVA.</sup>
- **Mann-Whitney U Test:**  
  <sup>Non-parametric test for assessing whether two independent samples come from the same distribution.</sup>  
- **Wilcoxon Signed-Rank Test:**  
  <sup>Non-parametric test for comparing two related samples, matched samples, or repeated measurements on a single sample.</sup>  
- **Kruskal-Wallis H Test:**  
  <sup>Non-parametric method for testing whether samples originate from the same distribution. Used for more than two groups.</sup>  
</details>
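
Two of the alternatives listed above are available in `scipy.stats`. The sketch below runs Welch's t-test and the Mann-Whitney U test on synthetic data; the numbers are randomly generated for illustration and are not part of the course datasets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# synthetic, made-up daily tips in euros with clearly different spreads
day = rng.normal(loc=50, scale=10, size=30)
night = rng.normal(loc=58, scale=25, size=30)

# Welch's t-test: like ttest_ind, but without the equal-variance assumption
welch = stats.ttest_ind(night, day, equal_var=False, alternative="greater")

# Mann-Whitney U: non-parametric, makes no normality assumption
mwu = stats.mannwhitneyu(night, day, alternative="greater")

print(f"Welch's t-test: statistic={welch.statistic:.2f}, p-value={welch.pvalue:.3f}")
print(f"Mann-Whitney U: statistic={mwu.statistic:.1f}, p-value={mwu.pvalue:.3f}")
```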

> ### Python packages in hypothesis testing
>*Besides the online calculators and other dedicated software we can also find supporting libraries in python. The two main libraries that share different hypothesis testing tools are: **statsmodels** and **scipy**.*

# 2. `statsmodels` - Z-Test showcase

*Statsmodels is a statistics-oriented Python module that implements a range of statistical models, statistical tests and related tools.*

### Z-Test Example - comparing conversion rates

Using the **`proportions_ztest`** function from **`statsmodels.stats.proportion`** we can compare two conversion rates. The code below shows how to run the test on a dataset with a variant column and a conversion column. Before running this kind of test, make sure the z-test conditions are met (see above).

In the `data` folder you can find two datasets with conversion data that we can use for testing the hypothesis.

- `tips_success.csv` gives a small p-value, and the difference between the conversion rates is bigger than 5 percentage points.
- `tips_too_small_diff.csv` gives a p-value above 0.05 and a smaller difference between the conversion rates.

### Let's run a z-test on the dataset `tips_success.csv`:

In [16]:
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

In [17]:
# getting data

tips = pd.read_csv('data/tips_success.csv')
tips

# tips.groupby(['date', 'version']).agg(['count', 'sum'])

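|     | date       | version | conversion |
|-----|------------|---------|------------|
| 0   | 2023-04-12 | night   | 0          |
| 1   | 2023-03-30 | night   | 0          |
| 2   | 2023-04-12 | day     | 0          |
| 3   | 2023-03-28 | day     | 0          |
| 4   | 2023-04-13 | night   | 0          |
| ... | ...        | ...     | ...        |
| 663 | 2023-04-04 | day     | 0          |
| 664 | 2023-03-30 | night   | 0          |
| 665 | 2023-03-30 | day     | 0          |
| 666 | 2023-04-01 | day     | 0          |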


In [18]:
# separating in subsets of dataframe

tips_night = tips[tips["version"] == "night"]
tips_day = tips[tips["version"] == "day"]

In [19]:
# calculating the sizes of the samples (number of observations)

n_night = len(tips_night)
n_day = len(tips_day)

n_night, n_day

(333, 335)

In [20]:
# defining a list of total conversions per sample
# defining a list of sample sizes (nobs -> number of observations)

successes = [tips_night['conversion'].sum(), tips_day['conversion'].sum()]
nobs = [n_night, n_day]
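
The same counts can also be obtained in one step with `groupby`. This is a minimal sketch on a tiny made-up frame that only mimics the `version` and `conversion` columns; the `demo_` names are hypothetical and chosen so they do not overwrite the notebook variables.

```python
import pandas as pd

# tiny made-up frame with the same columns as the tips datasets
demo = pd.DataFrame({
    "version": ["night", "night", "night", "day", "day", "day", "day"],
    "conversion": [1, 0, 1, 0, 0, 1, 0],
})

# sum = number of conversions, count = number of observations per variant
summary = demo.groupby("version")["conversion"].agg(["sum", "count"])

# order the rows explicitly: the first entry is the group tested as "larger"
summary = summary.loc[["night", "day"]]

demo_successes = summary["sum"].tolist()
demo_nobs = summary["count"].tolist()
print(demo_successes, demo_nobs)
```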

In [21]:
# plugging in the values into proportions_ztest function

z_stat, pval = proportions_ztest(successes, nobs=nobs, alternative="larger")  # one-tailed test: is the first proportion (night) larger than the second (day)?


print(f'conversion rate for day shift: {tips_day["conversion"].sum()/n_day}') 
print(f'conversion rate for night shift: {tips_night["conversion"].sum()/n_night}')
print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')

conversion rate for day shift: 0.0955223880597015
conversion rate for night shift: 0.15915915915915915
z statistic: 2.47
p-value: 0.007


> **Conclusions:**  
>- The p-value is ???, so we can ??? reject $H_{0}$.  
>- The difference between the variants is ??? than 5 percentage points. The MEI is ???.
>- Based on the results of the calculations, it makes ??? sense to switch to the night shift.  

**In summary:**  
`proportions_ztest` uses the provided counts and sample sizes to calculate the sample proportions, the pooled proportion, and the z-statistic. It then compares this statistic to the standard normal distribution to assess statistical significance. Remember that this test relies on the normal approximation and requires large sample sizes (>30) for accurate results.
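
To make the calculation concrete, here is a hand-rolled version of the pooled two-proportion z-test using only the standard library. The counts 53/333 (night) and 32/335 (day) are inferred from the conversion rates printed above, so treat them as an assumption; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# counts inferred from the printed conversion rates (assumption)
x_night, n_night_obs = 53, 333
x_day, n_day_obs = 32, 335

p_night = x_night / n_night_obs
p_day = x_day / n_day_obs

# pooled proportion under H0 (both shifts share one conversion rate)
p_pooled = (x_night + x_day) / (n_night_obs + n_day_obs)

# standard error of the difference under H0
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_night_obs + 1 / n_day_obs))

z_manual = (p_night - p_day) / se
# one-sided p-value: probability of a z at least this large under H0
p_manual = 1 - NormalDist().cdf(z_manual)

print(f"z statistic: {z_manual:.2f}")
print(f"p-value: {p_manual:.3f}")
```

With these counts the sketch reproduces the z statistic of 2.47 and the p-value of 0.007 reported by `proportions_ztest`.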

### Let's check it now for the dataset called `tips_too_small_diff.csv`:

In [22]:
# reading data
tips = pd.read_csv('data/tips_too_small_diff.csv')

# separating in subsets of dataframe
tips_night = tips[tips['version'] == 'night']
tips_day = tips[tips['version'] == 'day']

# calculating the sizes of the samples (number of observations)
n_night = len(tips_night)
n_day = len(tips_day)

# defining a list of total conversions per sample
# defining a list of sample sizes (nobs -> number of observations)
successes = [tips_night['conversion'].sum(), tips_day['conversion'].sum()]
nobs = [n_night, n_day]

# plugging in the values into proportions_ztest function
z_stat, pval = proportions_ztest(successes, nobs=nobs, alternative='larger')
                                 
print(f'conversion rate for day shift: {tips_day["conversion"].sum()/n_day}')
print(f'conversion rate for night shift: {tips_night["conversion"].sum()/n_night}')
print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')

conversion rate for day shift: 0.12312312312312312
conversion rate for night shift: 0.0955223880597015
z statistic: -1.14
p-value: 0.874


> **Conclusions:**  
>- The p-value is ??? than the 5% significance level, hence the test ??? reject $H_{0}$.  
>- The MEI is ???.
>- Based on the results of the calculations we ??? switch to the night shift.  

------------------------

                            >>> This is probably a good time for a break ;) <<<

----------------------

# 3. `scipy` - t-Test showcase

*This is a short Python demonstration of the t-test for comparing means.*

SciPy (short for Scientific Python) is a Python library for scientific and technical computing. SciPy uses [NumPy](https://www.w3schools.com/python/numpy/default.asp) underneath.
  
  
  
>  **We are going to use the `ttest_ind` function from the `stats` module in the [scipy python package](https://scipy.org/)**

## Example - Comparing means 


Instead of the tip conversion rate, we could also measure the average amount of tips in euros that we receive per day. In this case a t-test is a suitable choice.

*If the p-value is above 0.05, we cannot reject the null hypothesis, e.g. that different shifts have the same average tips.*

*If it is below 0.05, we can reject the null hypothesis and conclude that the true means differ (with the one-sided test used here: that the night-shift mean is larger).*

#### Option 1:
In this scenario we imagine that we worked both shifts over several days and recorded **the average tip per bill for each day**. The code below runs the test and prints the results:

In [23]:
import scipy.stats as sps

# reading data
tips = pd.read_csv('data/tips_means_2.csv')
tips.head()

|   | day      | night    |
|---|----------|----------|
| 0 | 3.9264   | 4.983045 |
| 1 | 4.7804   | 5.064333 |
| 2 | 3.9586   | 4.638667 |
| 3 | 4.8727   | 5.148778 |
| 4 | 6.425889 | 5.630667 |


In [24]:
tips['night'].mean(), tips['day'].mean()

(np.float64(5.775934560532062), np.float64(5.221549213656715))

In [25]:
# separating in subsets of dataframe
tips_day = tips['day']
tips_night = tips['night']

# plugging in the values into ttest_ind function
test_statistic, pvalue = sps.ttest_ind(tips_night, tips_day, alternative='greater')

print(f'average tips for day shift: {tips_day.mean()}')
print(f'average tips for night shift: {tips_night.mean()}')

print(f't statistic: {test_statistic:.2f}')
print(f'p-value: {pvalue:.6f}')

print(f'tips change in percent: {(tips_night.mean()/tips_day.mean()-1)*100:.2f} %')

average tips for day shift: 5.221549213656715
average tips for night shift: 5.775934560532062
t statistic: 1.47
p-value: 0.073847
tips change in percent: 10.62 %


>**Conclusion:**  
> there is ??? statistically significant difference between the two sample means because the **p-value ??? 0.05**

#### Option 2:
We also measured **daily total tips** in euros per shift. We could use the t-test to compare averages of these samples.

In [26]:
# reading data
tips = pd.read_csv('data/tips_daily.csv')

# separating in subsets of dataframe
tips_day = tips['day']
tips_night = tips['night']

# plugging in the values into ttest_ind function
test_statistic, pvalue = sps.ttest_ind(tips_night, tips_day, alternative='greater')

print(f'average daily total tips for day shift: {tips_day.mean()}')
print(f'average daily total tips for night shift: {tips_night.mean()}')

print(f't statistic: {test_statistic:.2f}')
print(f'p-value: {pvalue:.3f}')

print(f'tips change in percent: {(tips_night.mean()/tips_day.mean()-1)*100:.2f} %')

average daily total tips for day shift: 51.393350000000005
average daily total tips for night shift: 61.25133333333333
t statistic: 1.78
p-value: 0.040
tips change in percent: 19.18 %


**Conclusion:** 
- there is ??? statistically significant difference between the two sample means (p-value ??? 0.05)
- The difference between the variants is ??? than 15%. The MEI is ???.

# 4. Online calculators

In this section we will use the **t-test calculator** as an example.  

Different online calculators and formulas for hypothesis testing may require different inputs, depending on the test statistic they use for the calculation. However, they all follow the same thinking process with the hypotheses and the p-value.


*This is an example of an online calculator that uses t statistics:*
*- [t-test online calculator](https://www.medcalc.org/calc/comparison_of_means.php)*



## Exercise: Parameter changes in a statistical test

#### **Question:** How is the **p-value** in a t-test affected by... ?

(hint: use this [online calculator](https://www.medcalc.org/calc/comparison_of_means.php) to help you)

In [27]:
sample_stats = {'sample_1' : [tips['night'].mean(), tips['night'].std(), tips['night'].count()],
'sample_2' : [tips['day'].mean(), tips['day'].std(), tips['day'].count()]
}
pd.DataFrame(sample_stats, index=['mean', 'standard deviation', 'sample size'])

|                    | sample_1  | sample_2  |
|--------------------|-----------|-----------|
| mean               | 61.251333 | 51.39335  |
| standard deviation | 19.589261 | 23.168388 |
| sample size        | 30.0      | 30.0      |
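
As an alternative to the online calculator, the same three inputs per sample can be passed to `scipy.stats.ttest_ind_from_stats`. This is a sketch that assumes SciPy 1.6 or newer (needed for the `alternative` argument) and plugs in the summary statistics shown above; changing one input at a time is a quick way to explore the questions below.

```python
from scipy import stats

# summary statistics copied from the table above
res = stats.ttest_ind_from_stats(
    mean1=61.251333, std1=19.589261, nobs1=30,  # sample_1: night shift
    mean2=51.39335, std2=23.168388, nobs2=30,   # sample_2: day shift
    equal_var=True,          # classic Student's t-test with pooled variance
    alternative="greater",   # one-sided: is the night-shift mean larger?
)

print(f"t statistic: {res.statistic:.2f}")
print(f"p-value: {res.pvalue:.3f}")
```

The result matches the `ttest_ind` output from Option 2. An online calculator may report a two-sided p-value instead, so compare like with like.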


#####   *1. the difference between the means of the two groups*

 	 `Answer: ...`

#####  *2. the variance (or the standard deviation) of the two samples*

 	 `Answer: ...`

#####   *3. the size of the samples*

  	`Answer: ...`

# Collection of Samples

### Online calculators for sample size

*As with the test statistics, there are online calculators for computing how many data points you should gather:*
*- [abtestguide sample size online calculator](https://abtestguide.com/abtestsize/)*

*The amount of data we need to collect depends on the **significance level**, the **test power**, and the **MEI** that we decide upon before running the A/B test. The smaller the difference between the two groups we aim to detect, the more data we need to collect.*
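
The relationship can also be explored in code. The sketch below uses a common textbook approximation for a one-sided test on two proportions; the function name, the 10% baseline rate and the 80% power are assumptions for illustration, and the MEI is given in absolute percentage points. Online calculators may use slightly different formulas, so their numbers can differ.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, mei_abs, alpha=0.05, power=0.80):
    """Approximate observations needed per variant (one-sided test)."""
    p_var = p_base + mei_abs                    # rate we hope to see in the variant
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # quantile for the significance level
    z_beta = NormalDist().inv_cdf(power)        # quantile for the test power
    variance_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mei_abs ** 2)


# hypothetical baseline of 10% with two different MEIs
n_5pp = sample_size_per_group(p_base=0.10, mei_abs=0.05)
n_2pp = sample_size_per_group(p_base=0.10, mei_abs=0.02)
print(f"MEI of 5 points: {n_5pp} observations per shift type")
print(f"MEI of 2 points: {n_2pp} observations per shift type")
```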

### Which parameters affect the required sample size?

#####   *1. relative MEI*

`Answer: ...`

#####   *2. test power*

`Answer: ...`

#####   *3. significance level*

`Answer: ...`

**In our scenario:**  
*How can we estimate the number of weeks needed to run the experiment?*

<details><summary><font size=4, color="skyblue">more about T-Test (Student's t-Test):</font></summary>

- **Purpose**: The t-test is used to compare **sample means** from two groups and determine if their differences are statistically significant.

- **Scenarios**:
  - When comparing means of two independent groups (e.g., treatment vs. control groups).
  - When the population variance is **unknown**.
  
- **Steps**:
  1. Calculate the sample means for both groups.
  2. Compute the **t-statistic** using the formula:
     $$ t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$
     
     where:
     
     - $\bar{x}_1$ and $\bar{x}_2$ are the sample means
     - $\Delta$ is the hypothesized difference between the population means (0 if testing for equal means)
     - $s_1$ and $s_2$ are the sample standard deviations
     - $n_1$ and $n_2$ are the sample sizes  
    
    
  3. Determine the **degrees of freedom** ($n_1 + n_2 - 2$ for the classic Student's t-test with pooled variance; Welch's t-test uses an approximation instead).
  4. Compare the t-statistic to the critical t-value based on the desired significance level.
  
- **Output**: A p-value indicating whether the means are significantly different.
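
The steps above can be sketched by hand with the daily-total summary statistics from Option 2. This is an illustration only: the hypothesized difference is set to 0, the variable names are hypothetical, and because both samples have the same size the standard error in the formula equals the pooled one.

```python
import math
from scipy import stats

# step 1: sample means (plus standard deviations and sizes) from Option 2
mean_night, sd_night, n_night_obs = 61.251333, 19.589261, 30
mean_day, sd_day, n_day_obs = 51.39335, 23.168388, 30
delta = 0.0  # hypothesized difference under H0

# step 2: t-statistic
se = math.sqrt(sd_night**2 / n_night_obs + sd_day**2 / n_day_obs)
t_stat = (mean_night - mean_day - delta) / se

# step 3: degrees of freedom
df = n_night_obs + n_day_obs - 2

# step 4: compare with the one-sided critical value at alpha = 0.05
t_crit = stats.t.ppf(1 - 0.05, df)
p_value = stats.t.sf(t_stat, df)  # one-sided p-value

print(f"t = {t_stat:.2f}, critical t = {t_crit:.2f}, p-value = {p_value:.3f}")
print("reject H0" if t_stat > t_crit else "cannot reject H0")
```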

</details>

<details><summary><font size=4, color="skyblue">more about Z-Test (Z-Score Test):</font></summary>

- **Purpose**: The z-test is used to compare a **sample mean** to a known population mean or to compare proportions.
- **Scenarios**:
  - When the population variance is **known**.
  - When dealing with large sample sizes.
- **Steps**:
  1. Calculate the sample mean.
  2. Compute the **z-score** using the formula:
    
     $$ z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}} $$

     where:
     - $\bar{x}$ is the sample mean.
     - $\mu$ is the population mean.
     - $\sigma$ is the population standard deviation.
     - $n$ is the sample size.
  3. Compare the z-score to the critical z-value based on the desired significance level.
- **Output**: A p-value indicating whether the sample mean is significantly different from the population mean.
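
A minimal sketch of these steps with the standard library is shown below. All four input numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

# hypothetical inputs: sample mean, known population mean and sd, sample size
x_bar, mu, sigma, n = 5.5, 5.0, 1.5, 36

# z-score: distance of the sample mean from mu in standard-error units
z = (x_bar - mu) / (sigma / math.sqrt(n))

# one-sided p-value from the standard normal distribution
p_one_sided = 1 - NormalDist().cdf(z)

# critical value for a one-sided test at alpha = 0.05
z_crit = NormalDist().inv_cdf(0.95)

print(f"z = {z:.2f}, critical z = {z_crit:.3f}, p-value = {p_one_sided:.4f}")
```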

In summary, both tests help us make statistical inferences about means, but the choice depends on the available information and assumptions about the population. The t-test is more versatile, while the z-test assumes known population parameters. 

</details>

