# **A/B Testing**

- Evaluating the performance of a Deep Neural Network (DNN) model using **A/B testing** is a powerful technique, particularly when deploying a model into a live production environment.

- Unlike offline evaluation metrics (like accuracy, precision, recall, AUC, or loss) calculated on a static test dataset, A/B testing measures the model's performance based on its *actual impact on user behavior and business metrics* in a real-world setting.

# **A/B Testing in DNN**

- A/B testing for DNNs involves **comparing two versions of a system or feature** where one version (Version B) incorporates the new or updated DNN model, while the other version (Version A - the control group) uses the existing model, a simpler baseline, or no model at all.

- Real users or traffic are randomly split into two (or more) groups, with each group exposed to one of the versions.

- Key metrics are then measured and compared between the groups to determine whether the new DNN model (Version B) has a statistically significantly different impact from the control (Version A).

- A/B testing matters for DNN evaluation for several reasons:

1.  **Real-World Performance:** Offline metrics on a fixed dataset might not fully reflect how the model will perform on live, dynamic data or how its predictions will interact with users.
2.  **Measures Business Impact:** A/B testing directly links the model's performance to key business objectives and user engagement metrics (e.g., click-through rate, conversion rate, time spent on a page, revenue).
3.  **Captures User Behavior:** Deploying a new model can subtly change user behavior in ways not predicted by offline evaluation. A/B tests capture these effects.
4.  **Identifies Unforeseen Issues:** Performance bottlenecks, latency issues, or negative user experiences caused by the new model are revealed in a controlled live environment before a full rollout.

# **How to Conduct an A/B Test for DNN Performance:**

1.  **Define Your Goal and Key Metric:** Clearly state what you want to achieve and how you will measure it. This **metric should be a business or user-facing outcome influenced by the model's predictions** (e.g., "Increase click-through rate on recommended items," "Reduce the number of fraudulent transactions that go undetected," "Increase the average session duration").

2.  **Define Your Variations (A and B):**
    * **Version A (Control):** This is the **baseline**. It could be the current model in production, a heuristic, or a random approach.

    * **Version B (Treatment):** This is the system incorporating the new or updated DNN model. The DNN's predictions are used to influence the user experience or system outcome.

3.  **Split Your Traffic Randomly:** Divide incoming users or requests randomly into at least two groups (A and B). The split percentage (e.g., 50/50, 90/10, depending on traffic volume and risk) is crucial. **Randomization** ensures that the two groups are as similar as possible, minimizing confounding factors.

4.  **Implement the Variations in Production:** Set up your system to route **Group A's traffic to the control logic** and **Group B's traffic to the logic using the new DNN model.**

5.  **Collect Data:** **Log the interactions, outcomes, and the defined key metric** for users in both Group A and Group B over a sufficient period. This period needs to be long enough to gather statistically significant data, accounting for daily/weekly cycles or other temporal variations.

6.  **Analyze the Results:** Compare the key metric between Group A and Group B. **Perform statistical analysis (e.g., t-tests, z-tests, chi-squared tests)** to determine if the observed difference in the metric is statistically significant at a chosen confidence level (e.g., 95%). This tells you if the difference is likely due to the model change or just random chance.

7.  **Make a Decision:** Based on the statistical significance and the magnitude of the change in the key metric, decide whether to:
    * Roll out Version B to all users (if it's significantly better).
    * Stick with Version A (if B is worse or not significantly better).
    * Iterate on the DNN model and run another A/B test.
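
Step 3 (random traffic splitting) is often implemented with a deterministic hash, so that a returning user always lands in the same group. The sketch below is one possible way to do this, not the method of any particular experimentation platform; the function name `assign_group`, the experiment label `dnn_ab_test`, and the `user-<n>` ID format are hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "dnn_ab_test",
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Hash the experiment name together with the user ID so that different
    # experiments produce independent splits for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


# Assign 10,000 hypothetical users and check the realized split.
groups = [assign_group(f"user-{i}") for i in range(10_000)]
share_b = groups.count("B") / len(groups)
print(f"Share of users routed to B: {share_b:.3f}")
```

Because the assignment depends only on the user ID and the experiment name, no assignment table has to be stored, and changing `treatment_share` (for example to 0.1 for a cautious 90/10 split) is a one-line change.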

# **Key Metrics to Monitor (Examples):**


The specific metrics depend on the application of the DNN:

* **Recommender Systems:** Click-Through Rate (CTR), Conversion Rate, Session Duration, Number of Items Purchased, User Retention.

* **Ad Targeting:** CTR, Conversion Rate, Cost Per Acquisition (CPA), Return on Ad Spend (ROAS).

* **Search Ranking:** Click-Through Rate of top results, Dwell Time, Number of Queries.

* **Fraud Detection:** Number of False Positives (impact on legitimate users), True Positive Rate (actual fraud caught) in a live setting.

# **Advantages of Using A/B Testing:**


* Provides the most reliable measure of real-world performance and business impact.
* Reduces the risk of negative consequences from deploying a poorly performing model to the entire user base.
* Offers strong evidence for the value of the new model.

# **Disadvantages and Considerations:**




* Requires a production environment and the infrastructure to split traffic and collect data.

* Can be time-consuming, as you need to wait for enough data to achieve statistical significance.

* Requires careful planning and execution to ensure valid results (proper randomization, choosing the right metrics, determining sufficient sample size and duration).

* If the new model performs significantly worse, it can negatively impact the user experience and business metrics for the duration of the test in the treatment group.
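
To make "sufficient sample size" concrete, the sketch below applies the standard normal-approximation formula for comparing two proportions. The 10% and 12% click rates are the ones used in the simulation later in this notebook; the 80% power target and the helper name `sample_size_per_group` are assumptions made for illustration.

```python
import math

from scipy.stats import norm


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile for the desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control)
                             + p_treatment * (1 - p_treatment))
    ) ** 2
    return math.ceil(numerator / (p_treatment - p_control) ** 2)


n_required = sample_size_per_group(0.10, 0.12)
print(f"Users needed per group: {n_required}")
```

Under these assumptions the formula asks for roughly 3,800 users per group, so detecting a two-point lift in CTR reliably takes noticeably more traffic than a quick test might suggest. Larger expected effects need far fewer users.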

In conclusion, while offline metrics are essential for model development and initial evaluation, A/B testing is the gold standard for evaluating the true impact and performance of a DNN model in a live production setting. It provides the confidence needed to make informed decisions about deploying new machine learning models.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout, BatchNormalization
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [3]:

# Normalize the pixel values from 0-255 to 0-1
x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype('float32') / 255.0

In [4]:
# Convert labels to one-hot encoding
num_classes = 10
y_train_one_hot = to_categorical(y_train, num_classes)
y_test_one_hot = to_categorical(y_test, num_classes)

print(f"Training data shape: {x_train.shape}")
print(f"Training labels shape: {y_train_one_hot.shape}")
print(f"Testing data shape: {x_test.shape}")
print(f"Testing labels shape: {y_test_one_hot.shape}")

Training data shape: (60000, 784)
Training labels shape: (60000, 10)
Testing data shape: (10000, 784)
Testing labels shape: (10000, 10)


In [5]:

# --- Define and Train Model A (Our Baseline Candidate) ---
print("\n--- Training Model A (Baseline) ---")
model_a = Sequential([
    Flatten(input_shape=(28 * 28,)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])


--- Training Model A (Baseline) ---




In [6]:
model_a.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])


In [7]:
history_a = model_a.fit(x_train, y_train_one_hot,
                        epochs=10, # Using a fixed number of epochs for this example
                        batch_size=32,
                        validation_data=(x_test, y_test_one_hot))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - accuracy: 0.8684 - loss: 0.4373 - val_accuracy: 0.9594 - val_loss: 0.1335
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9667 - loss: 0.1082 - val_accuracy: 0.9726 - val_loss: 0.0875
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - accuracy: 0.9786 - loss: 0.0714 - val_accuracy: 0.9734 - val_loss: 0.0839
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.9837 - loss: 0.0514 - val_accuracy: 0.9772 - val_loss: 0.0772
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - accuracy: 0.9863 - loss: 0.0432 - val_accuracy: 0.9779 - val_loss: 0.0768
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.9893 - loss: 0.0343 - val_accuracy: 0.9756 - val_loss: 0.0857
Epoch 7/10


In [9]:
print("\n--- Evaluating Model A ---")
loss_a, accuracy_a = model_a.evaluate(x_test, y_test_one_hot, verbose=0)
print(f"Model A Test Loss: {loss_a:.4f}")
print(f"Model A Test Accuracy: {accuracy_a:.4f}")


--- Evaluating Model A ---
Model A Test Loss: 0.0829
Model A Test Accuracy: 0.9812


In [8]:
# --- Define and Train Model B (Our New Candidate with BN and Dropout) ---
print("\n--- Training Model B (New Candidate) ---")
model_b = Sequential([
    Flatten(input_shape=(28 * 28,)),
    Dense(128),
    BatchNormalization(), # Added Batch Normalization
    keras.layers.Activation('relu'),
    Dropout(0.3), # Added Dropout

    Dense(64),
    BatchNormalization(), # Added Batch Normalization
    keras.layers.Activation('relu'),
    Dropout(0.3), # Added Dropout

    Dense(num_classes, activation='softmax')
])


--- Training Model B (New Candidate) ---


In [10]:
# Using a different optimizer for Model B as another difference
model_b.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

In [11]:
history_b = model_b.fit(x_train, y_train_one_hot,
                        epochs=10, # Using the same number of epochs for comparison
                        batch_size=32,
                        validation_data=(x_test, y_test_one_hot))

Epoch 1/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 13s 6ms/step - accuracy: 0.8113 - loss: 0.6302 - val_accuracy: 0.9554 - val_loss: 0.1423
Epoch 2/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - accuracy: 0.9163 - loss: 0.2813 - val_accuracy: 0.9634 - val_loss: 0.1260
Epoch 3/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.9303 - loss: 0.2352 - val_accuracy: 0.9696 - val_loss: 0.1087
Epoch 4/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 5ms/step - accuracy: 0.9387 - loss: 0.2098 - val_accuracy: 0.9723 - val_loss: 0.1041
Epoch 5/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 11s 6ms/step - accuracy: 0.9441 - loss: 0.2039 - val_accuracy: 0.9738 - val_loss: 0.0973
Epoch 6/10
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.9462 - loss: 0.1904 - val_accuracy: 0.9746 - val_loss: 0.0939
Epoch 7/10

In [12]:
print("\n--- Evaluating Model B ---")
loss_b, accuracy_b = model_b.evaluate(x_test, y_test_one_hot, verbose=0)
print(f"Model B Test Loss: {loss_b:.4f}")
print(f"Model B Test Accuracy: {accuracy_b:.4f}")


--- Evaluating Model B ---
Model B Test Loss: 0.0928
Model B Test Accuracy: 0.9770


In [13]:
model_a.save('model_a.keras')

In [14]:
model_b.save('model_b.keras')

In [None]:
# --- How these models fit into A/B Testing ---

# After training model_a and model_b and evaluating them offline
# using metrics like accuracy, loss, etc., you would then
# consider deploying one or both of them.

# In a real-world A/B test scenario:
# 1. Deploy both `model_a` and `model_b` to your serving infrastructure.
# 2. In the live application, randomly split incoming user traffic (e.g., 50% to Group A, 50% to Group B).
# 3. For users in Group A, predictions would be made using `model_a`.
# 4. For users in Group B, predictions would be made using `model_b`.
# 5. Measure real-world metrics (e.g., click-through rate, conversion rate, time on site)
#    associated with the predictions from each model for their respective user groups.
# 6. Collect data on these real-world metrics over a period of time.
# 7. Perform statistical analysis on the collected data to determine if
#    there's a statistically significant difference in the real-world metric between Group A and Group B.
# 8. Based on the A/B test results, decide which model to roll out to all users.

# The A/B testing logic itself lives in the deployment and experimentation platform, not in the training code.

# You could save these models for deployment:
# model_a.save('model_a.keras')
# model_b.save('model_b.keras')

# Sample code: validate whether a new design for a recommendation widget (Version B) leads to a higher click-through rate (CTR) than the current design (Version A).



In [15]:
import numpy as np
import pandas as pd
from scipy import stats # For statistical test
import statsmodels.api as sm # More specific stats models

In [16]:
# --- Sample Details / Simulation Parameters ---
np.random.seed(42) # for reproducibility

In [17]:
n_users = 2000 # Total number of users in the A/B test simulation
split_ratio = 0.5 # 50/50 split between Group A and Group B

In [18]:
# Assume the true underlying click probabilities for each version (unknown in a real test)
# Set the probabilities for Version A and Version B to simulate a potential difference
true_prob_a = 0.10  # True click probability for Version A (Control)
true_prob_b = 0.12  # True click probability for Version B (Treatment - hoping for 12%)

In [19]:
alpha = 0.05 # Significance level (commonly 0.05)

In [20]:

# --- Simulate Data Collection ---

# 1. Assign users to groups (A or B)
n_group_a = int(n_users * split_ratio)
n_group_b = n_users - n_group_a

# Create group assignments
group_assignments = ['A'] * n_group_a + ['B'] * n_group_b
np.random.shuffle(group_assignments) # Randomly shuffle assignments

In [22]:
group_assignments.count('A')

1000

In [23]:
group_assignments.count('B')

1000

In [24]:
group_assignments

['B',
 'A',
 'B',
 'A',
 'B',
 'B',
 'A',
 'B',
 'A',
 'B',
 'A',
 'B',
 'B',
 'A',
 'A',
 'A',
 'A',
 'A',
 'B',
 'B',
 'A',
 'B',
 'B',
 'B',
 'B',
 'A',
 'B',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'B',
 'A',
 'B',
 'B',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'A',
 'A',
 'B',
 'B',
 'A',
 'B',
 'A',
 'A',
 'B',
 'B',
 'B',
 'B',
 'A',
 'B',
 'A',
 'B',
 'B',
 'B',
 'B',
 'A',
 'B',
 'A',
 'B',
 'B',
 'A',
 'B',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'B',
 'B',
 'B',
 'B',
 'A',
 'A',
 'B',
 'B',
 'B',
 'A',
 'A',
 'B',
 'B',
 'A',
 'B',
 'B',
 'B',
 'A',
 'A',
 'A',
 'B',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'B',
 'B',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'B',
 'B',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'B',
 'B',
 'A',
 'A',
 'A',
 'B',
 'B',
 'B',
 'B',
 'B',
 'B',
 'A',
 'B',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'B',
 'A',
 'A',
 'A',
 'B',
 'A',
 'B',
 'A',
 'A',
 'A',
 'B',
 'B',
 'B',
 'B'

In [25]:
# 2. Simulate clicks based on true probabilities
# For each user, generate a click (1) or no-click (0) outcome
clicks = []
for group in group_assignments:
    if group == 'A':
        # Simulate click with probability true_prob_a
        clicks.append(np.random.rand() < true_prob_a)
    else:
        # Simulate click with probability true_prob_b
        clicks.append(np.random.rand() < true_prob_b)

In [30]:
clicks

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 Fals

In [33]:

# Create a pandas DataFrame to hold the simulated data
simulated_data = pd.DataFrame({
    'user_id': range(n_users),
    'group': group_assignments,
    'clicked': clicks # True/False values, which can be treated as 1/0
})


In [34]:
simulated_data


Unnamed: 0,user_id,group,clicked
0,0,B,False
1,1,A,False
2,2,B,False
3,3,A,False
4,4,B,False
...,...,...,...
1995,1995,B,False
1996,1996,B,False
1997,1997,A,False
1998,1998,B,True


In [31]:
# Convert boolean clicks to integers (1 for True, 0 for False)
simulated_data['clicked'] = simulated_data['clicked'].astype(int)

print("--- Simulated Data Head ---")
print(simulated_data.head())
print("\n--- Simulated Data Info ---")
simulated_data.info()

--- Simulated Data Head ---
   user_id group  clicked
0        0     B        0
1        1     A        0
2        2     B        0
3        3     A        0
4        4     B        0

--- Simulated Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  2000 non-null   int64 
 1   group    2000 non-null   object
 2   clicked  2000 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 47.0+ KB


## Statistical Analysis

In [35]:

# --- Analyze Simulated Results ---

print("\n--- Analyzing Simulated A/B Test Results ---")

# Calculate observed metrics for each group
observed_results = simulated_data.groupby('group')['clicked'].agg(['count', 'sum', 'mean']).reset_index()
observed_results.columns = ['group', 'total_users', 'total_clicks', 'observed_ctr']

print("\n--- Observed Results per Group ---")
print(observed_results)


--- Analyzing Simulated A/B Test Results ---

--- Observed Results per Group ---
  group  total_users  total_clicks  observed_ctr
0     A         1000            98         0.098
1     B         1000           134         0.134


In [36]:
# Prepare data for statistical test
# We need the count of successes (clicks) and the total number of trials (users) for each group.
clicks_a = observed_results[observed_results['group'] == 'A']['total_clicks'].iloc[0]
total_a = observed_results[observed_results['group'] == 'A']['total_users'].iloc[0]

clicks_b = observed_results[observed_results['group'] == 'B']['total_clicks'].iloc[0]
total_b = observed_results[observed_results['group'] == 'B']['total_users'].iloc[0]

In [39]:
# Perform a Z-test for proportions
# Null Hypothesis (H0): The true click-through rates for Group A and Group B are equal.
# Alternative Hypothesis (H1): The true click-through rate for Group B is different from Group A.
# (Can perform one-sided test to check if B is *better* than A)

# Using statsmodels' proportions_ztest
count = np.array([clicks_a, clicks_b])
nobs = np.array([total_a, total_b])

In [40]:
count

array([ 98, 134])

In [41]:
nobs

array([1000, 1000])

## Two tailed Z-test

In [42]:
z_statistic, p_value = sm.stats.proportions_ztest(count, nobs, alternative='two-sided') # count/nobs are ordered [A, B]; for a one-sided test of CTR_B > CTR_A use alternative='smaller'

print(f"\n--- Statistical Test Results (Z-test for Proportions) ---")
print(f"Z-statistic: {z_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance level (alpha): {alpha}")


--- Statistical Test Results (Z-test for Proportions) ---
Z-statistic: -2.5138
P-value: 0.0119
Significance level (alpha): 0.05


In [43]:

# --- Interpretation ---

print("\n--- Interpretation ---")
if p_value < alpha:
    print(f"The p-value ({p_value:.4f}) is less than the significance level ({alpha}).")
    print("Reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in click-through rates between Group A and Group B.")
    if observed_results[observed_results['group'] == 'B']['observed_ctr'].iloc[0] > observed_results[observed_results['group'] == 'A']['observed_ctr'].iloc[0]:
        print("Version B (New Design) appears to have a statistically significantly HIGHER CTR than Version A (Control).")
    else:
         print("Version B (New Design) appears to have a statistically significantly LOWER CTR than Version A (Control).")

else:
    print(f"The p-value ({p_value:.4f}) is greater than or equal to the significance level ({alpha}).")
    print("Fail to reject the null hypothesis.")
    print(f"Conclusion: There is NOT enough statistical evidence to conclude a significant difference in click-through rates between Group A and Group B at the {alpha} significance level.")
    print("The observed difference could be due to random chance.")


--- Interpretation ---
The p-value (0.0119) is less than the significance level (0.05).
Reject the null hypothesis.
Conclusion: There is a statistically significant difference in click-through rates between Group A and Group B.
Version B (New Design) appears to have a statistically significantly HIGHER CTR than Version A (Control).


## One tailed Z-test

In [44]:
# One-sided test to check if B is *better* than A.
# count and nobs are ordered [A, B], so statsmodels compares p_A against p_B.
# H1: CTR_B > CTR_A is therefore expressed as p_A < p_B, i.e. alternative='smaller'.
# (alternative='larger' would test CTR_A > CTR_B instead.)

z_statistic_oneside, p_value_oneside = sm.stats.proportions_ztest(count, nobs, alternative='smaller')

In [45]:
print(f"\n--- Statistical Test Results (One-sided Z-test for Proportions) ---")
print(f"Null Hypothesis (H0): CTR_B <= CTR_A")
print(f"Alternative Hypothesis (H1): CTR_B > CTR_A")
print(f"Z-statistic: {z_statistic_oneside:.4f}")
print(f"P-value: {p_value_oneside:.4f}")
print(f"Significance level (alpha): {alpha}")


--- Statistical Test Results (One-sided Z-test for Proportions) ---
Null Hypothesis (H0): CTR_B <= CTR_A
Alternative Hypothesis (H1): CTR_B > CTR_A
Z-statistic: -2.5138
P-value: 0.0060
Significance level (alpha): 0.05


In [46]:


# --- Interpretation ---

print("\n--- Interpretation ---")
if p_value_oneside < alpha:
    print(f"The one-sided p-value ({p_value_oneside:.4f}) is less than the significance level ({alpha}).")
    print("Reject the null hypothesis.")
    print("Conclusion: There is statistically significant evidence to conclude that the true click-through rate for Group B is LARGER than for Group A.")
    print("Based on this test, you can be reasonably confident that Version B leads to a higher CTR.")

else:
    print(f"The one-sided p-value ({p_value_oneside:.4f}) is greater than or equal to the significance level ({alpha}).")
    print("Fail to reject the null hypothesis.")
    print("Conclusion: There is NOT enough statistically significant evidence to conclude that the true click-through rate for Group B is LARGER than for Group A.")
    print(f"The observed difference, if any, could reasonably be due to random chance at the {alpha} significance level.")




--- Interpretation ---
The one-sided p-value (0.0060) is less than the significance level (0.05).
Reject the null hypothesis.
Conclusion: There is statistically significant evidence to conclude that the true click-through rate for Group B is LARGER than for Group A.
Based on this test, you can be reasonably confident that Version B leads to a higher CTR.


In [47]:
# For t-test, we need the raw click data for each group
clicks_a_data = simulated_data[simulated_data['group'] == 'A']['clicked']
clicks_b_data = simulated_data[simulated_data['group'] == 'B']['clicked']


In [48]:
clicks_a_data

Unnamed: 0,clicked
1,False
3,False
6,False
8,False
10,False
...,...
1985,False
1987,False
1989,False
1990,False


In [49]:
clicks_b_data

Unnamed: 0,clicked
0,False
2,False
4,False
5,False
7,True
...,...
1994,True
1995,False
1996,False
1998,True


In [51]:
observed_results

Unnamed: 0,group,total_users,total_clicks,observed_ctr
0,A,1000,98,0.098
1,B,1000,134,0.134


In [50]:
# For Chi-square test, we need a contingency table (counts of clicked/not clicked per group)

# Count not clicked = total users - total clicks
not_clicked_a = observed_results[observed_results['group'] == 'A']['total_users'].iloc[0] - observed_results[observed_results['group'] == 'A']['total_clicks'].iloc[0]
not_clicked_b = observed_results[observed_results['group'] == 'B']['total_users'].iloc[0] - observed_results[observed_results['group'] == 'B']['total_clicks'].iloc[0]



In [53]:
not_clicked_a, not_clicked_b

(np.int64(902), np.int64(866))

In [54]:
contingency_table = np.array([[observed_results[observed_results['group'] == 'A']['total_clicks'].iloc[0], not_clicked_a],
                              [observed_results[observed_results['group'] == 'B']['total_clicks'].iloc[0], not_clicked_b]])



In [55]:
contingency_table

array([[ 98, 902],
       [134, 866]])

In [56]:
print("\n--- Contingency Table (for Chi-square test) ---")
print("                Clicked | Not Clicked")
print(f"Group A:        {contingency_table[0][0]}     | {contingency_table[0][1]}")
print(f"Group B:        {contingency_table[1][0]}     | {contingency_table[1][1]}")




--- Contingency Table (for Chi-square test) ---
                Clicked | Not Clicked
Group A:        98     | 902
Group B:        134     | 866


In [57]:
# --- Perform Statistical Tests ---

print(f"\n--- Statistical Test Results (alpha={alpha}) ---")

# 1. Proportions Z-test (from previous example - testing B > A)
# Null Hypothesis (H0): CTR_B <= CTR_A
# Alternative Hypothesis (H1): CTR_B > CTR_A
# count and nobs are ordered [A, B], so H1: CTR_B > CTR_A means p_A < p_B, i.e. alternative='smaller'.
count = np.array([observed_results[observed_results['group'] == 'A']['total_clicks'].iloc[0], observed_results[observed_results['group'] == 'B']['total_clicks'].iloc[0]])
nobs = np.array([observed_results[observed_results['group'] == 'A']['total_users'].iloc[0], observed_results[observed_results['group'] == 'B']['total_users'].iloc[0]])
z_statistic, p_value_ztest = sm.stats.proportions_ztest(count, nobs, alternative='smaller')

print(f"\nProportions Z-test (One-sided: H1: CTR_B > CTR_A):")
print(f"  Z-statistic: {z_statistic:.4f}")
print(f"  P-value: {p_value_ztest:.4f}")
print(f"  Conclusion: {'Reject H0' if p_value_ztest < alpha else 'Fail to reject H0'}")




--- Statistical Test Results (alpha=0.05) ---

Proportions Z-test (One-sided: H1: CTR_B > CTR_A):
  Z-statistic: -2.5138
  P-value: 0.0060
  Conclusion: Reject H0


In [59]:
clicks_a_data

Unnamed: 0,clicked
1,False
3,False
6,False
8,False
10,False
...,...
1985,False
1987,False
1989,False
1990,False


In [60]:
# Convert boolean data to numeric type (int or float) for t-test
clicks_a_data_num = clicks_a_data.astype(int) # or .astype(float)
clicks_b_data_num = clicks_b_data.astype(int) # or .astype(float)

## Independent t-test

In [61]:
# 2. Independent Samples T-test
# Compares the means of two independent groups. In binary data, the mean is the proportion/rate.
# Null Hypothesis (H0): Mean_A = Mean_B (i.e., CTR_A = CTR_B)
# Alternative Hypothesis (H1): Mean_A != Mean_B (i.e., CTR_A != CTR_B) (two-sided test)
t_statistic, p_value_ttest = stats.ttest_ind(clicks_a_data_num, clicks_b_data_num) # Default is two-sided

print(f"\nIndependent Samples T-test (Two-sided: H1: CTR_A != CTR_B):")
print(f"  T-statistic: {t_statistic:.4f}")
print(f"  P-value: {p_value_ttest:.4f}")
print(f"  Conclusion: {'Reject H0' if p_value_ttest < alpha else 'Fail to reject H0'}")




Independent Samples T-test (Two-sided: H1: CTR_A != CTR_B):
  T-statistic: -2.5165
  P-value: 0.0119
  Conclusion: Reject H0


## One Tailed t-test - Less

In [63]:
# Note: ttest_ind compares the first sample against the second. With the samples passed as (A, B),
# H1: Mean_B > Mean_A is expressed as alternative='less' (mean of A is less than mean of B).
# Passing the samples as (B, A) with alternative='greater' is equivalent.
t_statistic_one_sided, p_value_ttest_one_sided = stats.ttest_ind(clicks_a_data_num, clicks_b_data_num, alternative='less') # 'less' tests if mean of first sample (A) is less than second (B)
print(f"\nIndependent Samples T-test (One-sided: H1: CTR_B > CTR_A):")
print(f"  T-statistic: {t_statistic_one_sided:.4f}")
print(f"  P-value: {p_value_ttest_one_sided:.4f}")
print(f"  Conclusion: {'Reject H0' if p_value_ttest_one_sided < alpha else 'Fail to reject H0'}")




Independent Samples T-test (One-sided: H1: CTR_B > CTR_A):
  T-statistic: -2.5165
  P-value: 0.0060
  Conclusion: Reject H0


## Chi-square Test for Independence

In [66]:
# 3. Chi-square Test for Independence
# Tests for association between two categorical variables (Group and Clicked).
# Null Hypothesis (H0): Group and Clicked are independent (i.e., CTR is the same across groups).
# Alternative Hypothesis (H1): Group and Clicked are not independent (i.e., CTR differs across groups).
chi2_statistic, p_value_chi2, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi-square Test for Independence (Two-sided: H1: CTR differs between groups):")
print(f"  Chi2-statistic: {chi2_statistic:.4f}")
print(f"  P-value: {p_value_chi2:.4f}")
print(f"  Degrees of Freedom (dof): {dof}")
print(f"  Conclusion: {'Reject H0' if p_value_chi2 < alpha else 'Fail to reject H0'}")




Chi-square Test for Independence (Two-sided: H1: CTR differs between groups):
  Chi2-statistic: 5.9730
  P-value: 0.0145
  Degrees of Freedom (dof): 1
  Conclusion: Reject H0


In [67]:
# --- Overall Interpretation ---
print("\n--- Overall Interpretation Summary ---")
print(f"Significance level (alpha): {alpha}")

if p_value_ztest < alpha and p_value_ttest < alpha and p_value_chi2 < alpha:
     print("\nAll tests indicate a statistically significant difference in click-through rates between Group A and Group B.")
     print("Specifically, the one-sided Z-test supports that Group B's CTR is significantly LARGER than Group A's.")
elif p_value_ztest < alpha:
     print("\nThe one-sided Z-test indicates a statistically significant improvement (B > A). Other tests might not be designed for one-sided checks or have different assumptions.")
else:
     print(f"\nBased on the tests performed, there is not enough statistically significant evidence at the {alpha} level to conclude that Group B has a significantly higher CTR than Group A.")
     print("The observed difference could be due to random chance.")




--- Overall Interpretation Summary ---
Significance level (alpha): 0.05

All tests indicate a statistically significant difference in click-through rates between Group A and Group B.
Specifically, the one-sided Z-test supports that Group B's CTR is significantly LARGER than Group A's.


# Independent Samples T-test (scipy.stats.ttest_ind)

- This test is typically used to compare the means of a continuous variable between two independent groups.

- In this case, our variable (`clicked`) is binary (0 or 1). The mean of a binary variable is equivalent to the proportion of 1s (the click-through rate).

- `ttest_ind(clicks_a_data_num, clicks_b_data_num)` performs a two-sided test by default, checking whether the mean of Group A is significantly different from the mean of Group B.

- The p-value is the probability of observing a difference in sample means at least as extreme as the one measured, assuming the true population means are equal.
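
The points above can be checked directly. The sketch below rebuilds 0/1 click arrays from the observed counts in this notebook (98 of 1000 clicks in Group A, 134 of 1000 in Group B) and runs both the pooled-variance t-test and Welch's variant. The array construction is an illustration only; the order of the individual clicks does not affect the test.

```python
import numpy as np
from scipy import stats

# Rebuild binary outcomes from the observed counts (1 = click, 0 = no click).
clicks_a = np.array([1] * 98 + [0] * 902)
clicks_b = np.array([1] * 134 + [0] * 866)

# The mean of a 0/1 array is the click-through rate.
print(f"CTR A: {clicks_a.mean():.3f}, CTR B: {clicks_b.mean():.3f}")

# Pooled-variance t-test (the scipy default) and Welch's t-test.
t_pooled, p_pooled = stats.ttest_ind(clicks_a, clicks_b)
t_welch, p_welch = stats.ttest_ind(clicks_a, clicks_b, equal_var=False)

print(f"Pooled t-test:  t = {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch's t-test: t = {t_welch:.4f}, p = {p_welch:.4f}")
```

With equal group sizes the two variants produce the same t-statistic and differ only marginally in degrees of freedom. With unequal groups, `equal_var=False` is usually the safer choice.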

# Chi-square Test for Independence (scipy.stats.chi2_contingency):

- This test is used to determine if there is a **significant association between two categorical variables**.

- In an A/B test comparing proportions, the two categorical variables are the Group (A or B) and the Outcome (Clicked or Not Clicked).

- It requires a **contingency table** which shows the counts of observations in each combination of categories (e.g., Number of Group A users who clicked, Number of Group B users who did not click, etc.). Construct this table from our simulated_data.

- The **null hypothesis** is that the variables are independent (meaning the click rate is the same regardless of the group). The **alternative hypothesis** is that they are not independent (meaning the click rate differs between the groups).

- Like the default t-test, the standard chi-square test is two-sided.
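
One detail worth knowing: `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default (`correction=True`). The sketch below uses the contingency table from this notebook to compare the corrected and uncorrected statistics with a pooled two-proportion Z-statistic computed by hand. The manual Z computation is an illustration written for this comparison, not a call into statsmodels.

```python
import numpy as np
from scipy import stats

# Rows: Group A, Group B. Columns: clicked, not clicked.
table = np.array([[98, 902],
                  [134, 866]])

# Default call: Yates' continuity correction is applied for 2x2 tables.
chi2_yates, p_yates, dof, _ = stats.chi2_contingency(table)
# Same test without the continuity correction (plain Pearson chi-square).
chi2_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

# Pooled two-proportion Z-statistic, computed by hand.
clicks = table[:, 0]
totals = table.sum(axis=1)
p_pool = clicks.sum() / totals.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / totals).sum())
z = (clicks[0] / totals[0] - clicks[1] / totals[1]) / se
p_z = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(f"Chi-square with Yates correction: {chi2_yates:.4f} (p = {p_yates:.4f})")
print(f"Chi-square without correction:    {chi2_plain:.4f} (p = {p_plain:.4f})")
print(f"Z squared:                        {z**2:.4f} (p = {p_z:.4f})")
```

The uncorrected chi-square statistic matches the squared Z-statistic, while the corrected version is slightly smaller and therefore gives a slightly larger p-value.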

# Choosing the Right Test:

- For comparing proportions in A/B testing, the **Proportions Z-test** is often the most direct and appropriate, especially for large sample sizes.
  - It can easily be adapted for one-sided tests (which is common when you're only looking for improvement).

- The **Chi-square test** is also very common and suitable for comparing proportions as it tests for the **association between the group and the outcome counts.**

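- The **T-test** can be used for proportions, but the **Z-test** is designed specifically for this purpose. For a 2x2 table, the squared pooled Z-statistic equals the Pearson Chi-square statistic computed without continuity correction.

- In practice, all three tests yield very similar p-values in standard A/B testing scenarios with large sample sizes. In the results above, the two-sided Z-test and T-test both give p = 0.0119, while the Chi-square test gives p = 0.0145 because `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default. The choice might depend on convention, the specific question being asked (one-sided vs. two-sided), and the software being used.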
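
Whichever test is chosen, the p-value says nothing about how large the effect is, and the rollout decision also depends on the magnitude of the change. As a complement, the sketch below computes a Wald (normal-approximation) confidence interval for the difference in CTR, using the observed counts from this notebook. The helper name `ctr_difference_ci` is hypothetical, and the Wald interval is only one of several possible interval methods.

```python
import math

from scipy.stats import norm


def ctr_difference_ci(clicks_a: int, n_a: int, clicks_b: int, n_b: int,
                      confidence: float = 0.95):
    """Wald confidence interval for CTR_B - CTR_A."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


diff, lower, upper = ctr_difference_ci(98, 1000, 134, 1000)
print(f"Observed lift: {diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero agrees with the significant two-sided tests, and its width shows how uncertain the size of the lift still is at this sample size.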

