# Product Ranking Optimization | A/B Testing Project

## Problem description
Suppose that an online grocery store called “Rimi” wants to test a new ranking algorithm to provide products more relevant to customers.

![rimi.png](images/rimi.png)

## Methodology

1. **Problem statement** - What is the goal of the experiment?
    - Understanding the nature of the product
    - Asking clarifying questions:
        - What is the user journey?
        - What is the success metric? It should be:
            - Measurable
            - Attributable
            - Sensitive
            - Timely
2. **Hypothesis testing** - What result do you hypothesize from the experiment?
    - Set up: 
        - Null hypothesis 
        - Alternative hypothesis 
        - Significance level
        - Statistical power
        - Minimum detectable effect (MDE)
3. **Design the Experiment** - What are your experiment parameters?
    - Determine:
        - Randomization unit
        - Target population in the experiment
        - Sample size
        - Duration of the experiment
4. **Data Generation** - What are the requirements for running an experiment?
    - Determine: 
        - Key columns
        - Probability distributions 
    - Write code to generate data
5. **Validity Checks** - Did the experiment run soundly without errors or bias?
    - Check for:
        - Instrumentation Effect
        - External Factors
        - Selection Bias
        - Sample Ratio Mismatch
        - Novelty Effect
6. **Interpret Results** - Is the observed change in the metric both statistically and practically significant?
    - Run statistical tests
    - Assess the observed lift:
        - P-value
        - Confidence intervals
7. **Launch Decision** - Based on the results and trade-offs, should the change be launched?
    - Consider:
        - Metric Trade-Offs
        - Cost of Launching
        - Risk of committing a false positive (Type I error)

## Step 1 - Problem Statement

### Understanding the Nature of the Product

Rimi is an online grocery store that offers a wide range of products, including fresh produce, meat, dairy, baked goods, and more. The store uses a product ranking system or recommendation algorithm.

When a user enters keywords such as "meat" or "fruits," this algorithm generates a list of products that could be relevant to that customer, based on factors like their profile, purchase history, and other data.

If we modify this ranking algorithm, the suggested products may become more relevant to customers, which in turn should **boost sales** for the online store.


### User Journey 

![user_funnel.drawio.png](images/user_funnel.drawio.png)

Considering the user journey is crucial because it helps determine key factors later on, such as defining the success metric, identifying the target user population, and deciding at which stage of the journey a user should be considered as a participant in the experiment.

### Define the Success Metric

To define the success metric, we need to consider the following guiding principles:
1. **Measurable**
    - Is it a type of user behavior that can be accurately captured through your instrumentation or platform?
2. **Attributable**
    - "Attributable" means establishing a clear link between the experiment and the observed changes in metrics.
    - Example: If you are testing a new website design (treatment) and notice an increase in conversions (metric), for the result to be considered "attributable," you need to be sure that the increase is specifically due to the design change, and not, for example, due to an increase in traffic or a marketing campaign that occurred during the same period.
3. **Sensitive**
    - A metric is considered "sensitive" if it is responsive enough to detect significant effects from the applied modification.
    - You want to identify a metric with low variability to increase the likelihood of detecting true effects.
4. **Timely**
    - A/B testing is a fast, iterative way to improve the product, so experiments need to deliver results quickly.
    - Therefore, consider what short-term behavior can serve as a proxy for the long-term desired behavior.


Our success metric is **Conversion Rate**, which we aim to increase. However, it's crucial that this improvement does not come at the expense of the **Average Revenue Per User (ARPU)**, which should remain stable or improve.


## Step 2 - Hypothesis testing


### State the Hypothesis Statement

**Null Hypothesis (H0)**: The conversion rate is the same under the old and new ranking algorithms.

**Alternative Hypothesis (Ha)**: The conversion rate differs between the old and new ranking algorithms.



### Set the Significance Level

**Alpha** = 0.05 <br> 
- If the p-value is less than 0.05, reject H0 in favor of Ha; otherwise, fail to reject H0.



### Set the Statistical Power

**Statistical Power** = 0.95 <br> 
- Statistical power is the probability of detecting an effect when the alternative hypothesis is true. A common default is 0.8; this experiment uses a stricter 0.95.



### Set the Minimum Detectable Effect (MDE)

**MDE** = 0.3% <br> 
- If the conversion rate changes by at least 0.3 percentage points (absolute), the change is considered practically significant.

## Step 3 - Design the Experiment

### Set the Randomization Unit

**Randomization Unit** = User <br>
- This unit determines how participants are randomly assigned to groups (control and test) for the experiment. The individual user is the most common randomization unit, especially in digital A/B tests.
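
One common way to randomize at the user level is to hash the user ID, so that the same user always lands in the same group. The sketch below is a minimal illustration, not the assignment logic used for this project; the function name `assign_group`, the salt `ranking_v2`, and the 50/50 split are assumptions made for the example.

```python
import hashlib

def assign_group(user_id: str, experiment_name: str = "ranking_v2") -> str:
    """Deterministically map a user to 'control' or 'experiment' (50/50 split)."""
    # Salting with the experiment name keeps splits independent across experiments.
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    # Interpret the SHA-256 digest as an integer and reduce it to a bucket in 0..99.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "control" if bucket < 50 else "experiment"

# Assign 10,000 hypothetical user IDs and check that the split is roughly even.
groups = [assign_group(str(uid)) for uid in range(1, 10_001)]
control_share = groups.count("control") / len(groups)
print(f"control share: {control_share:.3f}")
```

Because the assignment depends only on the user ID and the salt, a returning user sees the same ranking algorithm in every session.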


### Target Population in the Experiment

**Users** = Visitors who search for a product

- ![user_funnel.drawio.png](images/user_funnel.drawio.png)


### Determine the Sample Size

We can use this formula to estimate the sample size:

$$n = \frac{2(Z_{\alpha/2} + Z_\beta)^2 \cdot p(1-p)}{\delta^2}$$

Where:
- $n$: the required sample size for each group (control and experimental).
- $Z_{\alpha/2}$: the critical value of the standard normal distribution for the significance level ($\alpha$). It uses $\alpha/2$ because the test is two-tailed. For example, for a significance level of 0.05, $Z_{\alpha/2}$ is approximately 1.96.
- $Z_\beta$: the critical value corresponding to the test power ($1-\beta$, where $\beta$ is the Type II error rate). For example, for a power of 0.8, $Z_\beta$ is approximately 0.84; for a power of 0.95, it is approximately 1.645.
- $p$: the current baseline conversion rate (e.g. 4%).
- $\delta$: the minimum detectable effect (MDE), i.e. the absolute difference in conversion rate between the control and experimental groups that you want to detect. The smaller $\delta$, the larger the sample size needed to detect this difference.
<br>
<br>
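
As a quick check, the formula can be evaluated directly with the standard library. This is a minimal sketch that plugs in the values stated in this write-up (α = 0.05, power = 0.95, p = 4%) and treats δ as an absolute difference of 0.003, which is how the formula defines it; the function name `sample_size_per_group` is introduced here for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p: float, delta: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Evaluate n = 2 (Z_{alpha/2} + Z_beta)^2 p (1 - p) / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # critical value for the power
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return math.ceil(n)

# Inputs taken from Step 2 and the assumptions in this step.
n_required = sample_size_per_group(p=0.04, delta=0.003, alpha=0.05, power=0.95)
print(n_required)
```

With these inputs the formula yields a per-group size on the order of 10^5. A library result that is far from this usually means the effect size was standardized differently, so it is worth comparing the two before fixing the sample size.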

#### Assumptions
Since we don’t have real data, we’ll estimate what it could look like based on industry averages.

##### Estimating Conversion Rate
1. The conversion rate for online grocery stores is the percentage of users who complete a purchase out of the total number of website visitors.
2. Typical industry data:
    - Based on ChatGPT’s response, on average, the conversion rate for online grocery stores can range from 2% to 5%. However, grocery stores have a certain specificity — if a customer visits with the intent to buy groceries, the conversion rate might be higher compared to apparel or electronics stores.
    - For large retailers like Rimi, the conversion rate may be closer to the upper end of this range.
3. Assumption:
    - **Conversion rate** = 4% (which corresponds to the conversion rate for a typical online grocery store).

##### Estimating ARPU
1. The average revenue per user (ARPU) in online grocery stores can vary significantly depending on how often customers place orders, their average basket size, and other factors.
2. Typical industry data:
    - ChatGPT suggests that ARPU for online grocery retailers often ranges from 20 to 100 euros, depending on the region and shopping frequency. The standard deviation, on average, can range from 20% to 50% of the average ARPU.
3. Assumption:
    - **Average ARPU** = 50 euros

#### Calculations

We can easily calculate this using Python and the `statsmodels` library.

In [74]:
from statsmodels.stats.power import NormalIndPower

# Define parameters
alpha = 0.05  # Significance level
power = 0.95   # Test power
baseline_conversion = 0.04  # Current conversion rate (4%)
mde = 0.003  # Minimum detectable effect (e.g., 0.3%)
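# Effect size: relative lift (mde / baseline = 0.075), passed to solve_power as the
# standardized effect size. Note: this is not Cohen's h, the usual standardized effect
# for two proportions, which is much smaller for 4% vs 4.3% and implies a larger sample.
effect_size = mde / baseline_conversion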

sample_size = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=alpha, 
                                           power=power, 
                                           alternative='two-sided')

# Truncate to an integer (int() drops the fraction; math.ceil would be the conservative choice)
sample_size = int(sample_size)

sample_size

4620

### Duration of the Experiment

**Duration** = 1 to 2 weeks
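
The duration follows from the required sample size and the daily traffic that reaches the search step. The sketch below is a rough illustration: the figure of 1,000 eligible users per day is hypothetical (this write-up does not state Rimi's traffic), and the 7-day minimum is an assumption meant to cover a full weekly shopping cycle.

```python
import math

def experiment_duration_days(sample_size_per_group: int,
                             daily_eligible_users: int,
                             n_groups: int = 2,
                             min_days: int = 7) -> int:
    """Days needed to collect the full sample, never shorter than min_days."""
    total_needed = sample_size_per_group * n_groups
    days_for_sample = math.ceil(total_needed / daily_eligible_users)
    return max(days_for_sample, min_days)

# 4,620 users per group (from the sample size step); traffic is a hypothetical value.
days = experiment_duration_days(4620, daily_eligible_users=1000)
print(days)
```

Under this hypothetical traffic level the result falls inside the 1 to 2 week window stated above.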


## Step 4 - Data Generation

### Dataset Description

Since we do not have access to real-world data, we have generated a **synthetic dataset** that simulates user behavior based on realistic assumptions and probability distributions. This dataset allows us to mimic an online grocery store scenario for testing purposes.

The script used to generate this dataset can be found in the `data_generation.ipynb` file.

#### Key Columns
- **user_id**: A unique identifier for each user.
- **group**: Either 'control' or 'experiment', indicating whether the user belongs to the control group or the experiment group.
- **session_date**: The date and time of the user's session.
- **product_views**: The number of products viewed by the user during the session.
- **cart_adds**: The number of items added to the cart.
- **purchase_amount**: The total amount spent by the user in the session (if any purchase was made).
- **session_duration**: The duration of the session in minutes.
- **device_type**: The type of device used by the user (mobile, desktop, or tablet).
- **traffic_source**: The source of traffic that brought the user to the site (organic, paid ad, or direct).
- **region**: The region where the user is located (Estonia, Latvia, Lithuania).
- **visitor_type**: Whether the user is a "new" or "old" visitor (new or returning customer).
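
The actual generator lives in `data_generation.ipynb` and is not reproduced here. As a rough idea of how such a dataset could be built, the sketch below draws each column from a simple distribution using only the standard library. Every distribution, parameter, and date in it is an assumption chosen for illustration (for example, a 4.3% conversion rate for the experiment group and log-normal purchase amounts centred near 50 euros); `generate_sessions` is a hypothetical helper.

```python
import random
from datetime import datetime, timedelta

def generate_sessions(n_per_group: int, conv_rates: dict, seed: int = 42) -> list:
    """Generate one synthetic session per user for each group."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    start = datetime(2024, 8, 5)       # hypothetical experiment start
    rows, user_id = [], 1
    for group, rate in conv_rates.items():
        for _ in range(n_per_group):
            converted = rng.random() < rate  # Bernoulli draw for conversion
            rows.append({
                "user_id": user_id,
                "group": group,
                "session_date": start + timedelta(hours=rng.randrange(7 * 24)),
                "product_views": rng.randint(1, 10),
                "cart_adds": rng.randint(0, 5),
                # Log-normal amounts (right-skewed) for buyers, 0 otherwise.
                "purchase_amount": round(rng.lognormvariate(3.8, 0.5), 2) if converted else 0.0,
                "session_duration": round(rng.expovariate(1 / 5), 2),  # mean of 5 minutes
                "device_type": rng.choice(["mobile", "desktop", "tablet"]),
                "traffic_source": rng.choice(["organic", "paid_ad", "direct"]),
                "region": rng.choice(["Estonia", "Latvia", "Lithuania"]),
                "visitor_type": rng.choice(["new", "old"]),
            })
            user_id += 1
    return rows

rows = generate_sessions(4620, {"control": 0.04, "experiment": 0.043})
print(len(rows), rows[0]["group"], rows[-1]["group"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` and saved to CSV if a tabular file is needed.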

### Import Libraries

In [62]:
from scipy import stats
import numpy as np
import pandas as pd

In [63]:
df = pd.read_csv('rimi_ab_test.csv')

# Create a 'conversion' column based on the presence of a purchase
df['conversion'] = df['purchase_amount'].apply(lambda x: 1 if x > 0 else 0)

df.head()


| Unnamed: 0 | user_id | group | session_date | product_views | cart_adds | purchase_amount | session_duration | device_type | traffic_source | region | visitor_type | conversion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | control | 2024-08-10 13:00:00 | 3 | 5 | 0.0 | 30.483349 | mobile | direct | Estonia | old | 0 |
| 1 | 2 | control | 2024-08-09 10:00:00 | 4 | 3 | 0.0 | 4.519226 | desktop | organic | Latvia | old | 0 |
| 2 | 3 | control | 2024-08-10 18:00:00 | 3 | 1 | 0.0 | 0.819504 | desktop | paid_ad | Latvia | old | 0 |
| 3 | 4 | control | 2024-08-10 12:00:00 | 7 | 1 | 0.0 | 2.164647 | mobile | organic | Estonia | old | 0 |
| 4 | 5 | control | 2024-08-10 12:00:00 | 5 | 1 | 0.0 | 1.322115 | desktop | paid_ad | Latvia | old | 0 |


## Step 5 - Validity Checks

Before conducting any statistical tests, it is essential to perform validity checks to ensure that the experiment results are reliable. We will evaluate the following aspects:

- Instrumentation Effect
- External Factors
- Selection Bias
- Sample Ratio Mismatch
- Novelty Effect


### Instrumentation Effect

This aspect is crucial when working with real data from a platform (e.g., a website). It is necessary to verify whether any bugs or glitches could potentially impact the experiment results.

In our case, since we are using synthetic data, we have thoroughly checked our dataset, and everything appears to be in order.

**Verdict**: Pass


### External Factors

External factors can influence experiment results, such as running an experiment during holidays or during significant economic events like COVID-19 or recessions. Ideally, experiments should avoid these periods to reduce external variability.

As we are using synthetic data, we do not have to worry about such factors.

**Verdict**: Pass


### Selection Bias

Selection Bias occurs when there are significant differences between the control and experiment groups before the experiment begins. We need to confirm that the underlying distributions between the groups are **homogeneous**, ensuring they are comparable.

In [64]:
# Function to perform statistical test on a given metric
def test_metric(group1, group2, metric_name, print_result=True):
    """
    Function to perform normality, variance homogeneity, and appropriate statistical test
    between two groups for a given metric.

    Parameters:
    - group1, group2: DataFrames or Series representing the two groups to compare.
    - metric_name: String, the name of the metric to test.
    - print_result: Boolean, if True, will print the test results.

    Returns:
    - p_value: p-value of the chosen statistical test.
    - significant: Boolean, True if significant difference is found, otherwise False.
    """
    # Perform normality test
    if len(group1) > 5000 or len(group2) > 5000:
        # Use Anderson-Darling test for large sample sizes
        normality_group1 = stats.anderson(group1)
        normality_group2 = stats.anderson(group2)
        normal_group1 = normality_group1.statistic < normality_group1.critical_values[2]
        normal_group2 = normality_group2.statistic < normality_group2.critical_values[2]
    else:
        # Use Shapiro-Wilk test for smaller samples
        normality_group1 = stats.shapiro(group1)
        normality_group2 = stats.shapiro(group2)
        normal_group1 = normality_group1.pvalue > 0.05
        normal_group2 = normality_group2.pvalue > 0.05

    # Check variance homogeneity using Levene's test
    levene_test = stats.levene(group1, group2)

    # Choose appropriate test based on assumptions
    if normal_group1 and normal_group2:
        if levene_test.pvalue > 0.05:
            # T-test with equal variance
            t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=True)
        else:
            # Welch's T-test (unequal variances)
            t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
    else:
        # Mann-Whitney U test (non-parametric)
        t_stat, p_value = stats.mannwhitneyu(group1, group2)

    # Determine if there's a significant difference
    significant = p_value < 0.05

    # Output results
    if print_result:
        print(f"{metric_name}: p-value = {p_value:.4f}")
        if significant:
            print(f"Significant difference in {metric_name} between groups.")
        else:
            print(f"No significant difference in {metric_name} between groups.")

    return p_value, significant

In [71]:
# Split into groups
control_group = df[df['group'] == 'control']
experiment_group = df[df['group'] == 'experiment']

# Define metrics for analysis
metrics = ['product_views', 'cart_adds', 'purchase_amount', 'session_duration']

# Selection Bias Check
print("\nSelection Bias Check:")
for metric in metrics:
    test_metric(control_group[metric], experiment_group[metric], metric_name=metric)


Selection Bias Check:
product_views: p-value = 0.9132
No significant difference in product_views between groups.
cart_adds: p-value = 0.4195
No significant difference in cart_adds between groups.
purchase_amount: p-value = 0.0843
No significant difference in purchase_amount between groups.
session_duration: p-value = 0.1677
No significant difference in session_duration between groups.


There is no evidence of selection bias in our groups.

**Verdict**: Pass


### Sample Ratio Mismatch Check

In a well-designed experiment, approximately 50% of participants should be assigned to the control group, and 50% to the experiment group. Due to the randomization algorithm, the actual ratio may differ slightly, such as 49/51%.

Our synthetic dataset is generated with an equal number of participants in both groups. However, to ensure complete accuracy, we conducted a Chi-Square Goodness of Fit Test.

In [66]:
# Sample Ratio Mismatch Check
print("\nSample Ratio Mismatch Check:")

control_expected = 4620
experiment_expected = 4620
expected = [control_expected, experiment_expected]
observed = [len(df[df['group'] == 'control']), len(df[df['group'] == 'experiment'])]

chi2, p_value_srm = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Chi-Square Statistic: {chi2:.4f}, p-value: {p_value_srm:.4f}")
if p_value_srm < 0.05:
    print("Sample Ratio Mismatch detected.")
else:
    print("No Sample Ratio Mismatch.")


Sample Ratio Mismatch Check:
Chi-Square Statistic: 0.0000, p-value: 1.0000
No Sample Ratio Mismatch.


The p-value of 1.0 indicates that there is no significant difference between the observed and expected counts of users in each group. This confirms that the sample ratio is perfectly balanced, and there is no Sample Ratio Mismatch.

**Verdict**: Pass
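
The same check can be written so that the expected counts are derived from the observed total and the intended split, rather than hard-coded. This is a minimal standard-library sketch; `srm_check` is a hypothetical helper, and the second call uses made-up counts purely to show what a mismatch looks like.

```python
import math

def srm_check(n_control: int, n_experiment: int,
              expected_control_share: float = 0.5, alpha: float = 0.05):
    """Chi-square goodness-of-fit test for a two-group split."""
    total = n_control + n_experiment
    expected = [total * expected_control_share, total * (1 - expected_control_share)]
    observed = [n_control, n_experiment]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two groups there is 1 degree of freedom, where the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

print(srm_check(4620, 4620))  # the split observed in this dataset
print(srm_check(4500, 4740))  # hypothetical imbalanced split
```

Deriving the expected counts from the total keeps the check valid even when the collected sample differs from the planned 4,620 per group.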


### Novelty Effect Check

The Novelty Effect refers to a temporary increase in user engagement simply due to the presence of something new. To detect this, we compare key metrics between new and returning visitors.

In [72]:
# Novelty Effect Check
print("\nNovelty Effect Check:")
new_visitors = df[df['visitor_type'] == 'new']
recurrent_visitors = df[df['visitor_type'] == 'old']

for metric in metrics:
    test_metric(new_visitors[metric], recurrent_visitors[metric], metric_name=metric)



Novelty Effect Check:
product_views: p-value = 0.1774
No significant difference in product_views between groups.
cart_adds: p-value = 0.4957
No significant difference in cart_adds between groups.
purchase_amount: p-value = 0.3580
No significant difference in purchase_amount between groups.
session_duration: p-value = 0.6010
No significant difference in session_duration between groups.


There is no evidence of a Novelty Effect in our samples.

**Verdict**: Pass

### Conclusion
After conducting the validity checks, we can confidently state that our experimental setup is robust and free from significant biases:
- **Instrumentation Effect**: No issues were detected as the synthetic dataset was thoroughly validated.
- **External Factors**: Not applicable to our synthetic data, ensuring no impact from outside influences such as economic conditions or holidays.
- **Selection Bias**: Both control and experiment groups were found to be homogeneous, with no significant differences in key metrics before the experiment began.
- **Sample Ratio Mismatch**: The Chi-Square test confirmed a perfect 50/50 split between the control and experiment groups, eliminating any concerns about unequal sample sizes.
- **Novelty Effect**: There were no significant differences in behavior between new and returning visitors, indicating that the observed results are not due to temporary engagement spikes.

Overall, the experiment data has passed all validity checks, confirming its suitability for further statistical analysis and interpretation. This lays a strong foundation for drawing reliable conclusions from our A/B test results.

## Step 6 - Interpret Results

### Conversion Rate Analysis
To assess the impact of our experiment on conversion rates, we performed a Chi-Square test to compare the conversion rates between the control and experiment groups.

#### Conversion Rate Calculation

In [68]:
# Calculate conversion rates for both groups
control_conversion_rate = control_group['conversion'].mean()
experiment_conversion_rate = experiment_group['conversion'].mean()

# Perform Chi-Square test for conversion
conversion_table = pd.crosstab(df['group'], df['conversion'])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(conversion_table)
print(f"Chi-Square test results: \nStatistic = {chi2_stat}, \np-value = {p_value}")

print(f"\nControl group conversion rate: {control_conversion_rate * 100:.2f}%")
print(f"Experiment group conversion rate: {experiment_conversion_rate * 100:.2f}%")


Chi-Square test results: 
Statistic = 2.7794271192995055, 
p-value = 0.09548231666134402

Control group conversion rate: 4.07%
Experiment group conversion rate: 4.81%


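With p ≈ 0.095, the difference in conversion rates between the control group (4.07%) and the experiment group (4.81%) is not statistically significant at the 0.05 level, so we fail to reject H0. The observed lift of about 0.74 percentage points exceeds the MDE of 0.3 percentage points, but this test does not provide sufficient evidence that it reflects a real effect.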
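
The methodology lists confidence intervals alongside the p-value. A minimal sketch of a 95% Wald interval for the difference in conversion rates is shown below. The conversion counts (188 and 222 out of 4,620 users per group) are not printed in the output above; they are inferred from the reported rates and should be treated as an assumption. `diff_in_proportions_ci` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def diff_in_proportions_ci(x_c: int, n_c: int, x_t: int, n_t: int,
                           confidence: float = 0.95):
    """Wald confidence interval for p_treatment - p_control."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se

# Counts inferred from the reported 4.07% and 4.81% rates (assumption).
diff, lo, hi = diff_in_proportions_ci(188, 4620, 222, 4620)
print(f"lift = {diff:.4%}, 95% CI = [{lo:.4%}, {hi:.4%}]")
```

An interval that contains zero tells the same story as a p-value above alpha, and comparing its bounds with the 0.3 percentage point MDE helps judge practical significance.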

### ARPU Analysis

To determine if there is a significant difference in ARPU (Average Revenue Per User) between the control and experiment groups, we need to perform a statistical test. However, before conducting a t-test, two key assumptions must be verified: normality and homogeneity of variances.

Roadmap:
1. Check for Normality using the **Shapiro-Wilk test**.
2. Evaluate Normality Results:
    - If at least one sample is not normally distributed:
        - Use the **Mann-Whitney U test**, which is a non-parametric alternative to the t-test and does not require the assumption of normality.
    - If all samples are normally distributed:
        1. Check Homogeneity of Variances using **Bartlett's test**.
        2. Evaluate Homogeneity Results:
            - If variances are homogeneous:
                - Use the standard **t-test**.
            - If variances are not homogeneous:
                - Use **Welch's t-test**, which is an adaptation of the t-test that does not assume equal variances between the groups.


#### Normality Check
To conduct a t-test, we first need to ensure that the data is normally distributed. The Shapiro-Wilk test was employed to test for normality in both the control and experiment groups.

- p > 0.05: The data is considered to be normally distributed.
- p $\leq$ 0.05: The data is not normally distributed.

In [69]:
from scipy.stats import shapiro

# Filter out only rows with purchases (purchase_amount > 0) if that's the criterion you want to use
purchase_data = df[df['purchase_amount'] > 0]

# Separate control and experiment group data
control_group_purchases = purchase_data[purchase_data['group'] == 'control']['purchase_amount']
experiment_group_purchases = purchase_data[purchase_data['group'] == 'experiment']['purchase_amount']

# Perform Shapiro-Wilk test for normality
shapiro_control = shapiro(control_group_purchases)
shapiro_experiment = shapiro(experiment_group_purchases)

# Output the results
print(f"Shapiro-Wilk Test for Control Group: \nStatistic = {shapiro_control.statistic}, \np-value = {shapiro_control.pvalue}")
print(f"\nShapiro-Wilk Test for Experiment Group: \nStatistic = {shapiro_experiment.statistic}, \np-value = {shapiro_experiment.pvalue}")

Shapiro-Wilk Test for Control Group: 
Statistic = 0.8149945017406718, 
p-value = 3.536063695897804e-14

Shapiro-Wilk Test for Experiment Group: 
Statistic = 0.86739478314963, 
p-value = 5.620025143444494e-13


The Shapiro-Wilk test results indicate that the data in both the control and experiment groups do not follow a normal distribution (p < 0.05).


#### Non-parametric Test (Mann-Whitney U Test)
Since the normality assumption is violated, we used the **Mann-Whitney U test** (also known as the Wilcoxon rank-sum test), which does not require the assumption of normality.

- p-value < 0.05: There is a statistically significant difference in ARPU between the control and experiment groups.
- p-value ≥ 0.05: There is no statistically significant difference in ARPU between the two groups.

In [70]:
from scipy.stats import mannwhitneyu

# Perform Mann-Whitney U test
u_statistic, p_value_mw = mannwhitneyu(control_group_purchases, experiment_group_purchases, alternative='two-sided')

# Output the result
print(f"Mann-Whitney U Test: \nU-statistic = {u_statistic}, \np-value = {p_value_mw}")

print(f"\nARPU for Control Group (excluding non-purchasers): {control_group_purchases.mean()}")
print(f"ARPU for Experiment Group (excluding non-purchasers): {experiment_group_purchases.mean()}")

Mann-Whitney U Test: 
U-statistic = 20441.0, 
p-value = 0.7212978176638938

ARPU for Control Group (excluding non-purchasers): 54.574651377890426
ARPU for Experiment Group (excluding non-purchasers): 52.22510914677488


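The Mann-Whitney U test indicates that there is no statistically significant difference in purchase amounts between the control and experiment groups (p > 0.05). Note that both samples contain purchasers only, so the means above are average revenue per purchasing user rather than ARPU over all users.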
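
To make the distinction between revenue per user and revenue per purchaser concrete, the sketch below computes both on a small made-up sample. The numbers are illustrative only and do not come from the dataset; `revenue_metrics` is a hypothetical helper.

```python
from statistics import mean

def revenue_metrics(purchase_amounts: list) -> dict:
    """Compute conversion rate, ARPU (all users), and mean purchase value (buyers only)."""
    purchases = [a for a in purchase_amounts if a > 0]
    return {
        "conversion_rate": len(purchases) / len(purchase_amounts),
        "arpu": mean(purchase_amounts),  # zeros for non-purchasers are included
        "avg_purchase_value": mean(purchases) if purchases else 0.0,
    }

# Toy sample: 100 users, 4 of whom buy (made-up amounts).
toy = [0.0] * 96 + [40.0, 50.0, 60.0, 50.0]
m = revenue_metrics(toy)
print(m)
```

ARPU over all users equals the conversion rate multiplied by the average purchase value, so it reflects changes in both conversion and basket size.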

### Summary

- **Conversion Rate**: The experiment group converted at a higher observed rate (4.81% vs. 4.07%), but the difference is not statistically significant at the 0.05 level (p ≈ 0.095).
- **ARPU**: No significant difference in ARPU was observed between the control and experiment groups.

Overall, the experiment shows a positive but statistically inconclusive lift in conversion rate, while ARPU remains stable across both groups.

# What to do:
- Rewrite the methodology. It should look like it does in the document.
- Step 4: data collection in our case. Run the experiment in the global sense.
- Step 5: validity check.
- Step 6: interpret results.