# AB Testing
- An A/B test compares two variations, so we formulate two competing hypotheses:
- The **Null Hypothesis** states that there is no difference between variations A and B. In other words, the mean performance of the two variations is the same:

$$H_0: \mu_0 = \mu_1$$

- The **Alternative Hypothesis** states that there is a real difference between the variations. We usually hope the new variation performs better, but the two-sided form used here only claims that the means differ:

$$H_1: \mu_0 \neq \mu_1$$

- Once the hypotheses are formulated, statistical evidence is required to support them. Statistical evidence is obtained through the analysis of the collected data and the application of appropriate statistical methods.
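
The decision step that follows from this evidence can be sketched as a small helper. This is a minimal illustration, not part of the original analysis: the function name `decide` and the default significance level `alpha=0.05` are assumptions made for the example.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha.

    `decide` and `alpha` are hypothetical names used only for this sketch.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # A small p-value means the observed data would be unlikely if H0 were true.
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))  # below alpha
print(decide(0.51))  # above alpha
```

Note that "fail to reject" is not the same as proving the null hypothesis; it only means the data do not provide enough evidence against it.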


In [41]:
import pandas as pd
import numpy as np

import statsmodels.stats.api as sms
from scipy.stats import shapiro, mannwhitneyu
np.set_printoptions(legacy='1.25')


## A. Case Study - I

<b>Business Problem:</b> Increasing Revenue through Variant Optimization

<b>Background:</b>

A company launched an A/B test with two variants on its website in order to increase revenue. The test randomly assigned users to either Variant A or Variant B and tracked the revenue generated by each user. The experiment data is stored in a CSV file and includes user IDs, the variant they were exposed to, and the revenue brought by each user.

<b>Objective:</b>

The objective of the A/B test is to determine which variant (Variant A or Variant B) leads to higher revenue. By identifying the more effective variant, the company aims to optimize the website design or user experience to increase overall revenue.

<b>Business Questions:</b>

- Which variant generated higher revenue during the A/B test?

- Is the difference in revenue between the variants statistically significant?

- What insights can be derived from the revenue data to optimize the website and drive revenue growth?

In [18]:
df = pd.read_csv("../data/ab-revenue-tests.csv")
df = df.drop_duplicates(keep="first")
df.reset_index(drop=True, inplace=True)

### 1. Data Cleaning and Analysis
- Outlier handling:
    - The Revenue value jumps sharply between the 99th percentile (about 2.23) and the 100th percentile (196.01).
    - So that a handful of extreme values do not dominate the statistical tests, values beyond a threshold are capped at that threshold.
    - To check the effect, the t-based confidence interval of the "Revenue" values is compared before and after handling the outliers.
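
The reasoning behind this before/after comparison can be illustrated on synthetic data. The sketch below is an assumption-laden example, not the case-study data: the helper `t_conf_int`, the simulated revenue values, and the cap of 6.0 are all made up for illustration, and the interval is computed with `scipy.stats.t` rather than statsmodels.

```python
import numpy as np
from scipy import stats


def t_conf_int(values, confidence=0.95):
    """t-based confidence interval for the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    half_width = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean - half_width, mean + half_width


# Synthetic, mostly-zero revenue with one extreme value (assumed data).
rng = np.random.default_rng(0)
revenue = np.zeros(1000)
revenue[:20] = rng.uniform(1, 5, size=20)
revenue[0] = 200.0

# Cap values above an assumed upper threshold of 6.0.
capped = np.clip(revenue, None, 6.0)

lo_raw, hi_raw = t_conf_int(revenue)
lo_cap, hi_cap = t_conf_int(capped)
print(f"width before capping: {hi_raw - lo_raw:.4f}")
print(f"width after capping:  {hi_cap - lo_cap:.4f}")
```

A single extreme value inflates the standard deviation, which widens the interval; capping it shrinks the interval but also shifts the estimated mean, so the threshold should be chosen deliberately.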

In [32]:
df["REVENUE"].describe([0, 0.05, 0.50, 0.95, 0.99, 1])

count    7933.000000
mean        0.125359
std         2.602527
min         0.000000
0%          0.000000
5%          0.000000
50%         0.000000
95%         0.000000
99%         2.233600
100%      196.010000
max       196.010000
Name: REVENUE, dtype: float64

In [33]:
# Before outlier handling
conf_int = sms.DescrStatsW(df['REVENUE']).tconfint_mean()

print(f'95% confidence interval for Revenue: {conf_int}')

95% confidence interval for Revenue: (0.06808022416157984, 0.1826370328660264)


In [35]:
class OutlierHandler:
    """Compute quantile-based outlier thresholds and cap values that fall outside them."""

    def get_outlier_thresholds(self, df, col_name, lower_quantile=0.25, upper_quantile=0.75):
        """
        Calculate the lower and upper outlier thresholds for a column of the dataframe.

        Parameters:
            df (pandas.DataFrame): The dataframe containing the column.
            col_name (str): The name of the column for which thresholds will be calculated.
            lower_quantile (float): Quantile used as the lower reference point.
            upper_quantile (float): Quantile used as the upper reference point.

        Returns:
            tuple: The lower and upper outlier thresholds, rounded to the nearest integer.
        """
        quantile_low = df[col_name].quantile(lower_quantile)
        quantile_high = df[col_name].quantile(upper_quantile)
        interquantile_range = quantile_high - quantile_low
        # Tukey-style fences: 1.5 times the interquantile range beyond each quantile
        up_limit = quantile_high + 1.5 * interquantile_range
        low_limit = quantile_low - 1.5 * interquantile_range
        return low_limit.round(), up_limit.round()

    def replace_with_thresholds(self, df, col_name, low_limit, up_limit):
        """
        Cap the values of a column at the lower and upper thresholds.

        Parameters:
            df (pandas.DataFrame): The dataframe containing the column (modified in place).
            col_name (str): The name of the column whose outliers will be capped.
            low_limit (float): Values below this limit are set to it.
            up_limit (float): Values above this limit are set to it.

        Returns:
            pandas.DataFrame: The same dataframe, with the column capped.
        """
        df.loc[(df[col_name] < low_limit), col_name] = low_limit
        df.loc[(df[col_name] > up_limit), col_name] = up_limit
        return df


out_handler = OutlierHandler()
low_limit, up_limit = out_handler.get_outlier_thresholds(df, "REVENUE", lower_quantile=0.01, upper_quantile=0.99)
print(low_limit, up_limit)
df = out_handler.replace_with_thresholds(df, "REVENUE", low_limit, up_limit)

-3.0 6.0
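
The printed thresholds can be reproduced by hand from the percentiles reported earlier. The sketch below assumes the 1st percentile is 0.0 (the 0% and 5% values in the `describe` output are both 0) and uses the displayed 99th percentile of 2.2336; the variable names are illustrative.

```python
# Quantiles taken from the describe() output above (1% assumed to be 0.0).
q_low, q_high = 0.0, 2.2336

spread = q_high - q_low            # interquantile range between 1% and 99%
up_limit = q_high + 1.5 * spread   # 2.2336 + 3.3504 = 5.584
low_limit = q_low - 1.5 * spread   # 0.0 - 3.3504 = -3.3504

print(round(low_limit), round(up_limit))
```

Since revenue is never negative, only the upper threshold has any effect on this dataset.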


In [36]:
# After outlier handling
conf_int = sms.DescrStatsW(df['REVENUE']).tconfint_mean()

print(f'95% confidence interval for Revenue: {conf_int}')

95% confidence interval for Revenue: (0.04618289998844694, 0.06783701681478008)


With the outliers capped, the t-based confidence interval for Revenue narrows from roughly (0.068, 0.183) to (0.046, 0.068), so a few extreme values no longer dominate the estimate of the mean. Handling outliers appropriately is important so that extreme values do not unduly influence the statistical results. With this issue addressed, we can proceed to hypothesis definition and testing.

### 2. Defining the Hypothesis for the A/B Test

In this section, we will define the hypothesis for the A/B test. As mentioned earlier, the A/B test aims to compare the revenue of two variations, A and B, and determine if there is a statistically significant difference between them.

In [37]:
df.groupby('VARIANT_NAME').agg({'REVENUE': 'mean'})

| VARIANT_NAME | REVENUE |
|---|---|
| control | 0.065094 |
| variant | 0.048899 |


Based on the above results, the mean revenue of the control group (0.065) is higher than that of the variant group (0.049). However, at this point, we do not know whether this difference is merely the result of random chance. To test this, we define our hypotheses as follows:

$$ H_0: \text{There is no statistically significant difference between the revenue of} \\
\text{the control and variant groups.} $$

$$ H_1: \text{There is a statistically significant difference between the revenue of} \\
\text{the control and variant groups.} $$

### 3. Conducting the Hypothesis Test

- To test the hypotheses, we will use an appropriate statistical test based on the nature of the data and assumptions. 
    - One of the assumptions is the normality assumption. To test this assumption, we will apply the **Shapiro-Wilk** test.
- The **Shapiro-Wilk** test is a widely used test for assessing the **normality of a distribution**. It tests the null hypothesis that the data is normally distributed against the alternative hypothesis that the data does not follow a normal distribution. If the p-value obtained from the Shapiro-Wilk test is greater than the chosen significance level (typically 0.05), we fail to reject the null hypothesis and treat the data as consistent with a normal distribution; otherwise, we reject normality.
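
To see how the test behaves, the sketch below applies `scipy.stats.shapiro` to two synthetic samples: one drawn from a normal distribution and one that is mostly zeros with a few positive values, loosely resembling revenue data. The samples, sizes, and seed are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Two synthetic samples (assumed data, not the case-study dataset).
zero_inflated = np.zeros(500)
zero_inflated[:25] = rng.exponential(scale=3.0, size=25)
samples = {
    "normal": rng.normal(loc=10.0, scale=2.0, size=500),
    "zero_inflated": zero_inflated,
}

results = {}
for name, sample in samples.items():
    stat, p_value = shapiro(sample)
    results[name] = (stat, p_value)
    print(f"{name}: W = {stat:.4f}, p-value = {p_value:.4f}")
```

The zero-inflated sample yields a very small W statistic and p-value, which is the pattern to expect from revenue columns where most users spend nothing.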

In [39]:
test_stat, pvalue = shapiro(df.loc[df["VARIANT_NAME"] == "variant", "REVENUE"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

test_stat, pvalue = shapiro(df.loc[df["VARIANT_NAME"] == "control", "REVENUE"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))


Test Stat = 0.0877, p-value = 0.0000
Test Stat = 0.0992, p-value = 0.0000


- The Shapiro-Wilk test produced a p-value that rounds to 0.0000 for both the control and variant groups, so we reject the null hypothesis of normality. The revenue data for both groups does not follow a normal distribution.
- As the normality assumption is violated, we opt for a **non-parametric** test, specifically the **Mann-Whitney U** test, to compare the control and variant groups. Note that this test compares the two distributions through the ranks of the observations rather than comparing the means directly.
- The Mann-Whitney U test is suitable for data that does not meet the normality assumption and is used to assess whether there is a statistically significant difference between two independent groups.
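
The selection logic described above can be sketched as a small helper. This is an illustrative assumption rather than the notebook's own method: the function `compare_groups` is hypothetical, and Welch's t-test (`ttest_ind` with `equal_var=False`) is assumed as the parametric option when both groups pass the normality check.

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro, ttest_ind


def compare_groups(a, b, alpha=0.05):
    """Pick a two-sample test based on Shapiro-Wilk results (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat the data as normal only if neither group rejects normality.
    both_normal = all(shapiro(x).pvalue > alpha for x in (a, b))
    if both_normal:
        result = ttest_ind(a, b, equal_var=False)  # Welch's t-test (assumed choice)
        test_name = "Welch t-test"
    else:
        result = mannwhitneyu(a, b, alternative="two-sided")
        test_name = "Mann-Whitney U"
    return test_name, result.statistic, result.pvalue


# Synthetic zero-inflated revenue for two groups (assumed data).
rng = np.random.default_rng(7)


def zero_inflated(n, n_paying, scale):
    values = np.zeros(n)
    values[:n_paying] = rng.exponential(scale=scale, size=n_paying)
    return values


control = zero_inflated(400, 20, 3.0)
variant = zero_inflated(400, 20, 3.0)

test_name, stat, p_value = compare_groups(control, variant)
print(f"{test_name}: statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

Passing `alternative="two-sided"` explicitly matches the two-sided hypotheses defined earlier and avoids relying on the library default.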

In [42]:
test_stat, pvalue = mannwhitneyu(df.loc[df["VARIANT_NAME"] == "variant", "REVENUE"],
                                 df.loc[df["VARIANT_NAME"] == "control", "REVENUE"])

print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = 7850692.0000, p-value = 0.5129


- The Mann-Whitney U test resulted in a p-value of 0.5129. Since this is well above the 0.05 significance level, we fail to reject the null hypothesis: there is no statistically significant difference in revenue between the two variants.
- Regarding which variant generated higher revenue during the A/B test, the control group had the higher sample mean (0.065 versus 0.049), but the statistical test does not provide evidence that this difference is more than random variation. Both groups appear to perform similarly in terms of revenue.