# Product Ranking Optimization | A/B Testing Project

## Problem description
Suppose that an online grocery store called “Rimi” wants to test a new ranking algorithm that shows customers more relevant products.

![user_funnel.drawio.png](images/rimi.png)

## Methodology

1. **Problem statement** - What is the goal of the experiment?
    - Understanding the nature of the product
    - Asking clarifying questions:
        - What is the user journey?
        - What is the success metric? It should be:
            - Measurable
            - Attributable
            - Sensitive
            - Timely
2. **Hypothesis testing** - What result do you hypothesize from the experiment?
    - Set up: 
        - Null hypothesis 
        - Alternative hypothesis 
        - Significance level
        - Statistical power
        - Minimum detectable effect (MDE)
3. **Design the Experiment** - What are your experiment parameters?
    - Determine:
        - Randomization unit
        - Target population in the experiment
        - Sample size
        - Duration of the experiment
4. **Run the Experiment** - What are the requirements for running an experiment?
    - Set up the necessary instrumentation to:
        - Collect data 
        - Analyze the results
    - Avoid peeking at p-values before the planned sample size is reached
5. **Validity Checks** - Did the experiment run soundly without errors or bias?
    - Check for:
        - Instrumentation Effect
        - External Factors
        - Selection Bias
        - Sample Ratio Mismatch
        - Novelty Effect
6. **Interpret Results** - Is the observed change in the metric both statistically and practically significant?
    - Assess the observed lift:
        - P-value
        - Confidence intervals
7. **Launch Decision** - Based on the results and trade-offs, should the change be launched?
    - Consider:
        - Metric Trade-Offs
        - Cost of Launching
        - Risk of committing a false positive (Type I error)

## Step 1 - Problem Statement

### Understanding the Nature of the Product

Rimi is an online grocery store that offers a wide range of products, including fresh produce, meat, dairy, baked goods, and more. The store uses a product ranking system or recommendation algorithm.

When a user enters keywords such as "meat" or "fruits," this algorithm generates a list of products that could be relevant to that customer, based on factors like their profile, purchase history, and other data.

If we modify this ranking algorithm, the suggested products may become more relevant to customers, which in turn should **boost sales** for the online store.


### User Journey 

![user_funnel.drawio.png](images/user_funnel.drawio.png)

Considering the user journey is crucial because it helps determine key factors later on, such as defining the success metric, identifying the target user population, and deciding at which stage of the journey a user should be considered as a participant in the experiment.

### Define the Success Metric

To define the success metric, we need to consider the following guiding principles:
1. **Measurable**
    - Is it a type of user behavior that can be accurately captured through your instrumentation or platform?
2. **Attributable**
    - "Attributable" means establishing a clear link between the experiment and the observed changes in metrics.
    - Example: If you are testing a new website design (treatment) and notice an increase in conversions (metric), for the result to be considered "attributable," you need to be sure that the increase is specifically due to the design change, and not, for example, due to an increase in traffic or a marketing campaign that occurred during the same period.
3. **Sensitive**
    - A metric is considered "sensitive" if it is responsive enough to detect significant effects from the applied modification.
    - You want to identify a metric with low variability to increase the likelihood of detecting true effects.
4. **Timely**
    - A/B experiments need to be quick, because testing is an iterative process for improving the product rapidly.
    - Therefore, consider what short-term behavior can serve as a proxy for the long-term desired behavior.


Our success metric is **Conversion Rate**, which we aim to increase. However, it's crucial that this improvement does not come at the expense of the **Average Revenue Per User (ARPU)**, which should remain stable or improve.


## Step 2 - Hypothesis testing


### State the Hypothesis Statement

**Null Hypothesis (H0)**: The conversion rate is the same under the old and new ranking algorithms.

**Alternative Hypothesis (Ha)**: The conversion rate differs between the old and new ranking algorithms.



### Set the Significance Level

**Alpha** = 0.05 <br> 
- If the p-value is less than 0.05, reject H0 in favor of Ha.



### Set the Statistical Power

**Statistical Power** = 0.95 <br> 
- Statistical power is the probability of detecting an effect when the alternative hypothesis is true. A common default is 0.8; this experiment uses a stricter 0.95.



### Set the Minimum Detectable Effect (MDE)

**MDE** = 0.3% <br> 
- A change in conversion rate of at least 0.3 percentage points (for example, from 4.0% to 4.3%) is considered practically significant.
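
An MDE for a rate can be read as an absolute change (percentage points) or as a relative lift, and the two readings differ by a large factor. The short sketch below converts between them, assuming the 0.3% is an absolute change and using the 4% baseline conversion rate that this document assumes later on.

```python
baseline = 0.04   # baseline conversion rate (assumption taken from later in this document)
mde_abs = 0.003   # MDE read as an absolute change: 0.3 percentage points

# The same effect expressed as a relative lift over the baseline
mde_rel = mde_abs / baseline

# Conversion rate the new ranking must reach to meet the MDE
target = baseline + mde_abs

print(f"Relative lift: {mde_rel:.1%}")
print(f"Target conversion rate: {target:.1%}")
```

Stating which reading is meant matters, because the required sample size depends on it.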

## Step 3 - Design the Experiment

### Set the Randomization Unit

**Randomization Unit** = User <br>
- This unit determines how participants are randomly assigned to groups (control and test) for the experiment. The individual user is the most common randomization unit, especially in digital A/B tests.
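
One common way to randomize at the user level is to hash the user ID, so that the same user always lands in the same group across sessions. The sketch below is illustrative only: the function name `assign_group`, the experiment name `ranking_v2`, and the 50/50 split are assumptions, not part of the Rimi setup.

```python
import hashlib


def assign_group(user_id: str, experiment_name: str = "ranking_v2") -> str:
    """Deterministically map a user to 'control' or 'experiment' (50/50 split).

    Both the function and the default experiment name are hypothetical.
    """
    # Salting with the experiment name keeps assignments independent across experiments
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "experiment" if bucket < 50 else "control"


# Usage: the same user always gets the same group
groups = [assign_group(str(user_id)) for user_id in range(10_000)]
experiment_share = groups.count("experiment") / len(groups)
print(f"Share assigned to experiment: {experiment_share:.3f}")
```

Because the assignment depends only on the user ID and the experiment name, no lookup table is needed to keep a user's experience consistent.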


### Target Population in the Experiment

**Users** = Visitors who search for a product

- ![user_funnel.drawio.png](images/user_funnel.drawio.png)


### Determine the Sample Size

We can use this formula to estimate the sample size:

$$n = \frac{2(Z_{\alpha/2} + Z_\beta)^2 \cdot p(1-p)}{\delta^2}$$

Where:
- $n$: the required sample size for each group (control and experimental).
- $Z_{\alpha/2}$: the critical value of the standard normal distribution for the significance level $\alpha$. It uses $\alpha/2$ because the test is two-tailed. For example, for a significance level of 0.05, $Z_{\alpha/2}$ is approximately 1.96.
- $Z_\beta$: the critical value for the desired power $1-\beta$, where $\beta$ is the Type II error rate. For example, for a power of 0.8, $Z_\beta$ is approximately 0.84.
- $p$: the current baseline conversion rate (e.g. 4%).
- $\delta$: the minimum detectable effect (MDE), expressed as an absolute difference between the conversion rates of the control and experimental groups. The smaller $\delta$, the larger the sample size needed to detect the difference.
<br>
<br>
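
The formula can be evaluated directly. The sketch below uses only the Python standard library; the function name `sample_size_per_group` is hypothetical, and $\delta$ is taken as an absolute difference in proportions (0.003), which is an assumption about how the MDE is meant to be read.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p: float, delta: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided test on a proportion.

    Implements n = 2 * (Z_{alpha/2} + Z_beta)^2 * p * (1 - p) / delta^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return ceil(n)  # round up so the requested power is not undershot


# Parameters used in this document; delta as an absolute change is an assumption
n = sample_size_per_group(p=0.04, delta=0.003, alpha=0.05, power=0.95)
print(n)
```

The result is very sensitive to $\delta$: halving the MDE roughly quadruples the required sample size, and reading $\delta$ as a relative rather than absolute change gives a very different $n$.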

#### Assumptions
Since we don’t have real data, we’ll estimate what it could look like based on industry averages.

##### Estimating Conversion Rate
1. The conversion rate for online grocery stores is the percentage of users who complete a purchase out of the total number of website visitors.
2. Typical industry data:
    - Based on ChatGPT’s response, on average, the conversion rate for online grocery stores can range from 2% to 5%. However, grocery stores have a certain specificity — if a customer visits with the intent to buy groceries, the conversion rate might be higher compared to apparel or electronics stores.
    - For large retailers like Rimi, the conversion rate may be closer to the upper end of this range.
3. Assumption:
    - **Conversion rate** = 4% (which corresponds to the conversion rate for a typical online grocery store).
    - **Standard Deviation** = 0.1%

##### Estimating ARPU
1. The average revenue per user (ARPU) in online grocery stores can vary significantly depending on how often customers place orders, their average basket size, and other factors.
2. Typical industry data:
    - ChatGPT suggests that ARPU for online grocery retailers often ranges from 20 to 100 euros, depending on the region and shopping frequency. The standard deviation, on average, can range from 20% to 50% of the average ARPU.
3. Assumption:
    - **Average ARPU** = 50 euros
    - **Standard Deviation** = 15 euros (which corresponds to 30% of the average).

#### Calculations

We can easily calculate this using Python and the `statsmodels` library.

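```python
from statsmodels.stats.power import NormalIndPower

# Define parameters
alpha = 0.05  # Significance level
power = 0.95  # Test power
baseline_conversion = 0.04  # Current conversion rate (4%)
mde = 0.003  # Minimum detectable effect (0.3 percentage points)

# The relative lift (mde / baseline) is used here as the effect size.
# Note: NormalIndPower expects a standardized effect size (for proportions,
# e.g. Cohen's h via statsmodels.stats.proportion.proportion_effectsize),
# so this shortcut gives a much smaller n than the formula above.
effect_size = mde / baseline_conversion

# Required sample size per group
sample_size = NormalIndPower().solve_power(effect_size=effect_size,
                                           alpha=alpha,
                                           power=power,
                                           alternative='two-sided')

# Truncate to an integer
sample_size = int(sample_size)

sample_size
```

```
4620
```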

### Duration of the Experiment

**Duration** = 1 to 2 weeks
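
A rough way to sanity-check the duration is to divide the total required sample by the daily number of eligible users. In the sketch below, `daily_eligible_users` is a purely hypothetical traffic figure, since no real traffic data is available; the per-group sample size is the 4620 computed above.

```python
from math import ceil

n_per_group = 4620           # per-group sample size from the calculation above
n_groups = 2                 # control and experiment
daily_eligible_users = 1000  # HYPOTHETICAL: users per day who reach the search step

# Days needed to collect the full sample across both groups
days_needed = ceil(n_per_group * n_groups / daily_eligible_users)

# Round up to whole weeks so that every weekday is represented equally
weeks_needed = ceil(days_needed / 7)

print(f"Days needed: {days_needed}, rounded up to {weeks_needed} week(s)")
```

Running for whole weeks helps avoid bias from day-of-week shopping patterns, even when the sample size is reached sooner.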


## Step 4 - Run the Experiment

### Dataset Description

Since we don't have real data, we've made a **synthetic dataset** that simulates user behavior based on realistic assumptions and probability distributions.

You can find the script that generated this dataset in the `data_generation.ipynb` file.

#### Key Columns
- **user_id**: A unique identifier for each user.
- **group**: Either 'control' or 'experiment', indicating whether the user belongs to the control group or the experiment group.
- **session_date**: The date and time of the user's session.
- **product_views**: The number of products viewed by the user during the session.
- **cart_adds**: The number of items added to the cart.
- **purchase_amount**: The total amount spent by the user in the session (if any purchase was made).
- **session_duration**: The duration of the session in minutes.
- **device_type**: The type of device used by the user (mobile, desktop, or tablet).
- **traffic_source**: The source of traffic that brought the user to the site (organic, paid ad, or direct).
- **region**: The region where the user is located (Estonia, Latvia, Lithuania).
- **visitor_type**: Whether the user is a "new" or "old" visitor (new or returning customer).

```python
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

df = pd.read_csv('rimi_ab_test.csv')

# Create a 'conversion' column: 1 if the session includes a purchase, else 0
df['conversion'] = (df['purchase_amount'] > 0).astype(int)

control_group = df[df['group'] == 'control']
experiment_group = df[df['group'] == 'experiment']
```
Conversion rate

```python
# Chi-square test for conversions
conversion_table = pd.crosstab(df['group'], df['conversion'])
chi2_stat, p_value, dof, expected = chi2_contingency(conversion_table)
print(f"Chi-Square test: \nstatistic = {chi2_stat}, \np-value = {p_value}")

# Observed conversion rate in each group
control_conversion_rate = control_group['conversion'].mean()
experiment_conversion_rate = experiment_group['conversion'].mean()

print(f"\nControl group conversion rate: {control_conversion_rate * 100:.2f}%")
print(f"Experiment group conversion rate: {experiment_conversion_rate * 100:.2f}%")
```

```
Chi-Square test: 
statistic = 5.314168250574651, 
p-value = 0.021152689102897446

Control group conversion rate: 4.00%
Experiment group conversion rate: 5.02%
```


ARPU 

A t-test can be used to compare ARPU between the two groups, provided the data is approximately normally distributed.