# Evaluating the Impact of a New Landing Page on Conversions

## Introduction  
In today’s data-driven business landscape, understanding what drives customer behavior is crucial for optimizing key performance indicators (KPIs) such as conversion rates, customer acquisition costs, and overall revenue. Techniques like A/B testing are invaluable when businesses seek to determine whether changes, such as a new landing page, positively impact conversions.

In this notebook, I present a straightforward three-step approach to assess whether introducing a new landing page leads to an increase in user conversions:
1. Simple Conversion Analysis: We begin by comparing conversion rates between the treatment (new landing page) and control groups to see if there's an initial indication that the new page is performing better.
2. One-Tailed Z-Test (A/B Test): Next, we perform a one-tailed z-test on the conversion proportions to statistically assess whether the observed differences in conversion rates between the two groups are significant.
3. Regression Analysis: Finally, we use both univariate and multivariate logistic regression to examine the relationship between landing page types and conversion rates. This approach allows us to control for additional variables, providing a more nuanced understanding of the factors influencing customer behavior.

In [46]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # basic visualizations 
import seaborn as sns # advanced visualizations
import scipy.stats as stats
import random
random.seed(42)     # seed Python's built-in RNG so results are reproducible
np.random.seed(42)  # DataFrame.sample draws from NumPy's global RNG when random_state is not given
import warnings
warnings.filterwarnings('ignore')

## Data Overview
The dataset used in this analysis includes the following columns:
- user_id: Unique identifier for each user.
- time: Timestamp of the user's interaction.
- group: Indicates whether the user was part of the control group or the treatment group.
- landing_page: Specifies whether the user landed on the old page or the new page.
- converted: Indicates whether the user converted or not.   


In [2]:
# read the data (the first rows are displayed by check_df below)
df = pd.read_excel("C:/Users/lkaur/Desktop/ab_test.xlsx")

In [3]:
def check_df(dataframe):
    print("*********Shape*********")
    print(dataframe.shape)
    print("*********Column names*********")
    print(dataframe.columns)
    print("*********Datatypes*********")
    print(dataframe.dtypes)
    print("*********Head*********")
    print(dataframe.head())
    print("*********Tail*********")
    print(dataframe.tail())
    print("*********Description*********")
    print(dataframe.describe())


check_df(df)
    

*********Shape*********
(294478, 5)
*********Column names*********
Index(['user_id', 'time', 'group', 'landing_page', 'converted'], dtype='object')
*********Datatypes*********
user_id          int64
time            object
group           object
landing_page    object
converted        int64
dtype: object
*********Head*********
   user_id             time      group landing_page  converted
0   851104  00:11:48.600000    control     old_page          0
1   804228  00:01:45.200000    control     old_page          0
2   661590  00:55:06.200000  treatment     new_page          0
3   853541  00:28:03.100000  treatment     new_page          0
4   864975  00:52:26.200000    control     old_page          1
*********Tail*********
        user_id             time      group landing_page  converted
294473   751197  00:28:38.600000    control     old_page          0
294474   945152  00:51:57.100000    control     old_page          0
294475   734608  00:45:03.400000    control     old_page          0

### Missing Values and Duplicates: 
We then assess the presence of missing values and duplicate records that could affect our analysis.

In [4]:
df.isnull().sum()

user_id         0
time            0
group           0
landing_page    0
converted       0
dtype: int64

In [5]:
df.duplicated().sum()

0

### Data Preparation: 
- Re-labeling the group column and encoding it to a binary numerical column with 0 for control and 1 for treatment.
- Re-labeling the landing_page column and encoding it to a binary numerical column with 0 for old_page and 1 for new_page.

In [8]:
#encoding it
labels = {"treatment": 1, "control":0}
df["group"] = df["group"].map(labels)
#converting it into integer
df["group"] = df["group"].astype(int)

#encoding it
labels = {"new_page": 1, "old_page":0}
df["landing_page"] = df["landing_page"].map(labels)
#converting it into integer
df["landing_page"] = df["landing_page"].astype(int)

In [9]:
df

Unnamed: 0,user_id,time,group,landing_page,converted
0,851104,00:11:48.600000,0,0,0
1,804228,00:01:45.200000,0,0,0
2,661590,00:55:06.200000,1,1,0
3,853541,00:28:03.100000,1,1,0
4,864975,00:52:26.200000,0,0,1
...,...,...,...,...,...
294473,751197,00:28:38.600000,0,0,0
294474,945152,00:51:57.100000,0,0,0
294475,734608,00:45:03.400000,0,0,0
294476,697314,00:20:29,0,0,0


### Initial Exploration
I started by examining the dataset for any discrepancies between the users assigned to the treatment group and those who actually landed on the new page. A mismatch was identified, indicating potential issues in the data that required further exploration. This step was crucial to ensure the validity of subsequent analysis.

In [11]:
#Does the number of new_page and treatment match?
n_treat = df[df["group"] == 1].shape[0]
n_new_page = df[df["landing_page"] == 1].shape[0]
difference = n_treat - n_new_page

pd.DataFrame({
    'N treatment': [n_treat],
    'N new_page': [n_new_page],
    'Difference': [difference]
})

Unnamed: 0,N treatment,N new_page,Difference
0,147276,147239,37


There is a mismatch between the number of users assigned to the treatment group and the number who landed on the new page. This may indicate a problem with the data and needs further exploration.
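
The difference of 37 is only a net figure: mismatches in the two directions (treatment shown the old page, control shown the new page) partly cancel each other out. A cross-tabulation makes both directions visible at once. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the notebook's dataset), using the same 0/1 encoding as above.

```python
import pandas as pd

# Toy stand-in for the encoded data (0 = control/old_page, 1 = treatment/new_page).
toy = pd.DataFrame({
    "group":        [0, 0, 1, 1, 1, 0],
    "landing_page": [0, 0, 1, 1, 0, 1],
})

# Rows = assigned group, columns = page actually shown.
# Off-diagonal cells count the mismatches in each direction.
table = pd.crosstab(toy["group"], toy["landing_page"])
print(table)

# With both columns encoded as 0/1, a mismatch is any row where they disagree.
n_mismatch = int((toy["group"] != toy["landing_page"]).sum())
print(f"Mismatched rows: {n_mismatch}")
```

On the real data, the same two lines applied to `df` would give the full mismatch count rather than the net difference.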

### Cleaning and Preparing the Data
After identifying the mismatched rows, in the following steps I filtered the dataset to focus on users who correctly landed on the pages they were assigned to. I also removed any duplicate entries to ensure that each user's conversion outcome was only counted once. This cleaning process resulted in a refined dataset that accurately reflected the intended experimental design.

In [12]:
df[(df["group"] == 1) & (df["landing_page"] == 0)]

Unnamed: 0,user_id,time,group,landing_page,converted
308,857184,00:34:59.800000,1,0,0
327,686623,00:26:40.700000,1,0,0
357,856078,00:29:30.400000,1,0,0
685,666385,00:11:54.800000,1,0,0
713,748761,00:47:44.400000,1,0,0
...,...,...,...,...,...
293773,688144,00:34:50.500000,1,0,1
293817,876037,00:15:09,1,0,1
293917,738357,00:37:55.700000,1,0,0
294014,813406,00:25:33.200000,1,0,0


In [13]:
df_mismatch = df[(df["group"] == 1) & (df["landing_page"] == 0)
               |(df["group"] == 0) & (df["landing_page"] == 1)]

n_mismatch = df_mismatch.shape[0]

percent_mismatch = round(n_mismatch / len(df) * 100, 2)
print(f'Number of mismatched rows: {n_mismatch} rows')
print(f'Percent of mismatched rows: {percent_mismatch} percent')

Number of mismatched rows: 3893 rows
Percent of mismatched rows: 1.32 percent


As shown above, there are 3893 rows where treatment is not paired with new_page or control is not paired with old_page. For these rows, I cannot be sure whether the user actually received the new or the old page, so I will remove them.

In [14]:
df2 = df[(df["group"] == 1) & (df["landing_page"] == 1)
        |(df["group"] == 0) & (df["landing_page"] == 0)]

len(df2)

290585

In [15]:
df2.head()

Unnamed: 0,user_id,time,group,landing_page,converted
0,851104,00:11:48.600000,0,0,0
1,804228,00:01:45.200000,0,0,0
2,661590,00:55:06.200000,1,1,0
3,853541,00:28:03.100000,1,1,0
4,864975,00:52:26.200000,0,0,1


In [16]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 1) == (df2['landing_page'] == 1)) == False].shape[0]

0

In [18]:
# Additionally, there should be exactly one row per user_id in the dataframe
df2.user_id.nunique()

290584

In [19]:
# number of repeated ids in df2
len(df2) - df2.user_id.nunique()

1

In [21]:
#drop the duplicated row
df2 = df2.drop_duplicates("user_id") 

In [22]:
# Double-check that it was actually dropped
len(df2) - df2.user_id.nunique()

0
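
Before dropping a repeated user_id, it can be worth looking at the duplicated rows, because `drop_duplicates` keeps the first occurrence by default and the copies may disagree on `converted`. The sketch below uses a small made-up frame (an assumption for illustration) to show how to list every row involved in a duplicate.

```python
import pandas as pd

# Toy frame with one repeated user_id whose rows disagree on the outcome.
toy = pd.DataFrame({
    "user_id":   [101, 102, 102, 103],
    "converted": [0, 1, 0, 0],
})

# keep=False flags every row that shares a user_id, not only the later copies.
dupes = toy[toy["user_id"].duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default (keep="first").
deduped = toy.drop_duplicates("user_id")
print(deduped)
```

If the duplicated rows carry different outcomes, the choice of which one to keep is an analysis decision rather than a purely technical one.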

## Conversion Rate Analysis 1 - Probability
The first approach would be to assess if the treatment was better than control using the conversion rates.

In [26]:
# The probability of an individual converting regardless of the page they receive
df2.converted.mean() * 100

11.959708724499627

In [27]:
#Given that an individual was in the control group, what is the probability they converted?
#Given that an individual was in the treatment group, what is the probability they converted?
# Calculate mean conversion rate for control group
control_mean_conversion = df2[df2["group"] == 0]["converted"].mean() * 100

# Calculate mean conversion rate for treatment group
treatment_mean_conversion = df2[df2["group"] == 1]["converted"].mean() * 100

print(f"Control group mean conversion rate: {control_mean_conversion:.2f}%")
print(f"Treatment group mean conversion rate: {treatment_mean_conversion:.2f}%")

Control group mean conversion rate: 12.04%
Treatment group mean conversion rate: 11.88%


In [28]:
#What is the probability that an individual received the new page?
pd.DataFrame(df2.landing_page.value_counts(normalize = True) * 100)

Unnamed: 0_level_0,proportion
landing_page,Unnamed: 1_level_1
1,50.006194
0,49.993806


**Findings**  
The overall conversion rate was approximately 11.96%, with the control group converting at 12.04% and the treatment group at 11.88% (users were almost equally likely to receive the old or the new page). The treatment rate is about 0.16 percentage points lower than the control rate, so these preliminary figures give no indication that the new page performs better. Whether the gap is statistically significant is assessed in the A/B test that follows.
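
To put the gap between the two rates in context, one option is a confidence interval for the difference in proportions. The sketch below uses a standard Wald interval; the helper name `diff_ci` is hypothetical, and the inputs are approximations reconstructed from the rounded outputs above (group sizes from the 290,584 rows and the roughly 50/50 split), so exact values from `df2` should be substituted in practice.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(p_c, n_c, p_t, n_t, level=0.95):
    """Wald confidence interval for p_t - p_c, using the unpooled standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Approximate inputs taken from the rounded figures printed above (assumption).
lo, hi = diff_ci(p_c=0.1204, n_c=145_274, p_t=0.1188, n_t=145_310)
print(f"95% CI for (treatment - control): [{lo:.4f}, {hi:.4f}]")
```

An interval that straddles zero is consistent with the reading that the two pages convert at similar rates.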

## Conversion Rate Analysis 2 - A/B Test
To rigorously assess whether the new landing page was better than the old one, I conducted an A/B test. Using a one-tailed hypothesis test, I checked if the new page led to significantly higher conversions.

### One-tailed hypothesis test

In [30]:
# Calculate conversion rates for old and new pages
control_group = df2[df2["landing_page"] == 0]
treatment_group = df2[df2["landing_page"] == 1]

p_old = control_group["converted"].mean()
p_new = treatment_group["converted"].mean()

# Perform one-tailed z-test (right-tailed); the standard error here uses the control group's variance only
alpha = 0.05
z_score = (p_new - p_old) / (p_old * (1 - p_old) / len(control_group))**0.5
p_value = 1 - stats.norm.cdf(z_score)
print("P_value is", p_value)
if p_value < alpha:
    print("Reject null hypothesis: The new page is better.")
else:
    print("Fail to reject null hypothesis: No significant evidence that the new page is better.")

P_value is 0.9677388921534111
Fail to reject null hypothesis: No significant evidence that the new page is better.
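
The cell above divides by a standard error built from the control group alone. A common alternative for comparing two independent groups is the pooled two-proportion z-test, which accounts for sampling variability in both groups. The sketch below is one way to write it; the function name `two_proportion_ztest` is hypothetical, and the conversion counts are approximations reconstructed from the rounded rates reported earlier, not exact values from `df2`.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x_c, n_c, x_t, n_t):
    """Right-tailed pooled z-test of H0: p_t <= p_c against H1: p_t > p_c."""
    p_c, p_t = x_c / n_c, x_t / n_t
    # Under H0 both groups share one rate, estimated by pooling the two samples.
    p_pool = (x_c + x_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value

# Approximate counts (assumption): conversions = rounded rate * group size.
z, p_value = two_proportion_ztest(x_c=17_491, n_c=145_274, x_t=17_263, n_t=145_310)
print(f"z = {z:.3f}, one-tailed p-value = {p_value:.3f}")
```

For a library implementation, statsmodels offers `proportions_ztest` in `statsmodels.stats.proportion`, which accepts an `alternative` argument for one-sided tests.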


### One-tailed hypothesis test with bootstrap sampling 
To further validate the above conclusion, I repeated the test on 1,000 bootstrap resamples of the data and counted how often the result was significant.

In [31]:
num_samples = 1000

# Initialize an empty list to store p-values
p_values = []

for _ in range(num_samples):
    # Generate a bootstrap sample
    sample = df2.sample(len(df2), replace=True)
    
    # Calculate conversion rates for old and new pages
    control_group = sample[sample["landing_page"] == 0]
    treatment_group = sample[sample["landing_page"] == 1]
    p_old = control_group["converted"].mean()
    p_new = treatment_group["converted"].mean()
    
    # Perform one-tailed hypothesis test
    z_score = (p_new - p_old) / (p_old * (1 - p_old) / len(control_group))**0.5
    p_value = 1 - stats.norm.cdf(z_score)
    p_values.append(p_value)

# Count how many p-values are less than 0.05
significant_samples = sum(p < 0.05 for p in p_values)

print(f"Significant samples (p < 0.05): {significant_samples}/{num_samples}")

Significant samples (p < 0.05): 10/1000
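
Counting significant p-values is one way to summarize the resamples. Another common summary is a percentile bootstrap confidence interval for the difference in conversion rates itself. The sketch below illustrates the idea on synthetic 0/1 outcomes (the data, sizes, and rates are assumptions for illustration, not the notebook's dataset), resampling each group separately so that group sizes stay fixed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 0/1 outcomes standing in for the two groups (assumed, not the real data).
control = rng.binomial(1, 0.12, size=5_000)
treatment = rng.binomial(1, 0.12, size=5_000)
observed = treatment.mean() - control.mean()

n_boot = 2_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement, keeping its size unchanged.
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = t.mean() - c.mean()

# Percentile interval: the middle 95% of the bootstrap differences.
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Observed difference: {observed:.4f}")
print(f"95% bootstrap CI: [{lo:.4f}, {hi:.4f}]")
```

With the real data, `control` and `treatment` would be the `converted` columns of the two groups in `df2`.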


**Findings**  
The A/B test results (with and without bootstrap resampling) consistently showed no significant improvement in conversions with the new page: only 10 of the 1,000 bootstrap samples produced p < 0.05.

## Conversion Rate Analysis 3 - Regression

We now move from classical A/B testing to a regression-based approach: logistic regression. This method allows us to explore the relationship between the page a user lands on and their likelihood of conversion, while also accounting for potential confounding factors such as the country in which a user resides.


### Logistic Regression - Univariate
Given that the outcome variable, conversion, is binary (converted or not), logistic regression is the appropriate choice. Because the statsmodels Logit model does not add a constant term automatically, I first added an intercept column to the dataset. The intercept represents the baseline log-odds of conversion when all predictors are zero, which here corresponds to the control group. The independent variable is the group column.

In [32]:
# Create the intercept
df2["intercept"] = 1
df2.head()

Unnamed: 0,user_id,time,group,landing_page,converted,intercept
0,851104,00:11:48.600000,0,0,0,1
1,804228,00:01:45.200000,0,0,0,1
2,661590,00:55:06.200000,1,1,0,1
3,853541,00:28:03.100000,1,1,0,1
4,864975,00:52:26.200000,0,0,1,1


In [33]:
# Instantiate and fit the regression model
!pip install statsmodels
import statsmodels.api as sm
model = sm.Logit(df2['converted'], df2[['intercept','group']])
result = model.fit()
result.summary()






Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6


0,1,2,3
Dep. Variable:,converted,No. Observations:,290584.0
Model:,Logit,Df Residuals:,290582.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 19 Aug 2024",Pseudo R-squ.:,8.077e-06
Time:,16:56:08,Log-Likelihood:,-106390.0
converged:,True,LL-Null:,-106390.0
Covariance Type:,nonrobust,LLR p-value:,0.1899

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-1.9888,0.008,-246.669,0.000,-2.005,-1.973
group,-0.0150,0.011,-1.311,0.190,-0.037,0.007


**Findings**  
After fitting the logistic regression model, the p-value for the group variable (treatment versus control) was 0.190. This is above the conventional threshold of 0.05, so there is no significant evidence that the page a user lands on (new or old) affects their likelihood of conversion. The estimated coefficient is in fact slightly negative (-0.0150). This aligns with the earlier A/B test results: the data provide no evidence that the new landing page drives more conversions.
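
Logistic regression coefficients are on the log-odds scale, so it helps to translate them back into probabilities. The sketch below applies the inverse logit to the coefficients copied from the summary table above; the helper name `inv_logit` is mine. The results should land close to the group conversion rates computed in the first analysis, which is a useful sanity check on the fit.

```python
from math import exp

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1 / (1 + exp(-x))

# Coefficients copied from the univariate summary table above.
intercept, group_coef = -1.9888, -0.0150

# group = 0 (control): only the intercept contributes.
p_control = inv_logit(intercept)
# group = 1 (treatment): intercept plus the group coefficient.
p_treatment = inv_logit(intercept + group_coef)

print(f"Predicted conversion, control:   {p_control:.2%}")
print(f"Predicted conversion, treatment: {p_treatment:.2%}")
```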

### Logistic Regression - Multivariate
Recognizing that user behavior could be influenced by factors beyond just the landing page, I introduced the user’s country as an additional variable in the model, based on the hypothesis that cultural or regional differences might play a role in conversion rates. After merging the country data with the primary dataset, I created dummy variables for the three countries: Canada (CA), the United Kingdom (UK), and the United States (US).

In [34]:
# Read the country data
countries = pd.read_csv("C:/Users/lkaur/Downloads/countries_abtest/countries_ab.csv")
countries.head()

Unnamed: 0,id,country
0,834778,UK
1,928468,US
2,822059,UK
3,711597,UK
4,710616,UK


In [35]:
# Merge the countries dataframe with df2 
countries.columns = ["user_id", "country"]
countries["user_id"] = countries["user_id"].astype(int)
df3 = df2.merge(countries, on = "user_id", how = "left")
df3.head()

In [42]:
# creating dummies for the country column
df3[['CA','UK','US']] = pd.get_dummies(df3['country'])
df3[['CA', 'UK', 'US']] = df3[['CA', 'UK', 'US']].astype(int)
df3.head()

Unnamed: 0,user_id,time,group,landing_page,converted,intercept,country,CA,UK,US
0,851104,00:11:48.600000,0,0,0,1,US,0,0,1
1,804228,00:01:45.200000,0,0,0,1,US,0,0,1
2,661590,00:55:06.200000,1,1,0,1,US,0,0,1
3,853541,00:28:03.100000,1,1,0,1,US,0,0,1
4,864975,00:52:26.200000,0,0,1,1,US,0,0,1
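
A left merge silently leaves `country` empty for any user missing from the country file, and `get_dummies` then encodes such a row as all zeros, which would be indistinguishable from the baseline country in the regression. The sketch below, on small made-up frames (assumptions for illustration), shows one way to guard against this with the `validate` and `indicator` options of `merge`, and how `drop_first=True` builds the dummies with the first category as baseline.

```python
import pandas as pd

# Toy frames: user 4 has no entry in the country data.
users = pd.DataFrame({"user_id": [1, 2, 3, 4], "converted": [0, 1, 0, 0]})
countries = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "UK", "CA"]})

# validate raises if user_id is not unique on either side;
# indicator adds a _merge column recording where each row came from.
merged = users.merge(countries, on="user_id", how="left",
                     validate="one_to_one", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Users without a country: {n_unmatched}")

# drop_first=True drops the first category (CA here), making it the baseline.
dummies = pd.get_dummies(merged["country"], drop_first=True, dtype=int)
print(dummies)
```

If `n_unmatched` is greater than zero on the real data, those rows would need to be dropped or handled explicitly before fitting the model.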


In [43]:
# let's see if there is a relation between country and conversion
pd.pivot_table(data = df3, index = "country", values = "converted").sort_values(by = "converted", ascending = False) * 100

Unnamed: 0_level_0,converted
country,Unnamed: 1_level_1
UK,12.059449
US,11.95468
CA,11.53183


**Initial Findings**  
Interestingly, conversion rates differ only slightly across countries (from about 11.5% in CA to about 12.1% in the UK). This suggests that country might have a negligible impact on conversion, but this should be confirmed through regression.
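
Conversion rates per country are easier to judge alongside the number of users behind each rate, since a rate based on few users is noisier. A minimal sketch using `groupby` with named aggregation on a small made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "country":   ["US", "US", "US", "UK", "UK", "CA"],
    "converted": [1, 0, 0, 1, 0, 0],
})

# One row per country: conversion rate and the number of users behind it.
summary = (
    toy.groupby("country")["converted"]
       .agg(rate="mean", n="size")
       .sort_values("rate", ascending=False)
)
summary["rate"] *= 100  # express the rate as a percentage
print(summary)
```

Applied to `df3`, this would show the conversion rate and sample size for each country side by side.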

In [44]:
# Instantiate and fit the regression model with country as an additional variable: 'CA' is a baseline
model = sm.Logit(df3['converted'], df3[['intercept','landing_page', 'UK','US']])
result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6


0,1,2,3
Dep. Variable:,converted,No. Observations:,290584.0
Model:,Logit,Df Residuals:,290580.0
Method:,MLE,Df Model:,3.0
Date:,"Mon, 19 Aug 2024",Pseudo R-squ.:,2.323e-05
Time:,17:04:00,Log-Likelihood:,-106390.0
converged:,True,LL-Null:,-106390.0
Covariance Type:,nonrobust,LLR p-value:,0.176

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-2.0300,0.027,-76.249,0.000,-2.082,-1.978
landing_page,-0.0149,0.011,-1.307,0.191,-0.037,0.007
UK,0.0506,0.028,1.784,0.074,-0.005,0.106
US,0.0408,0.027,1.516,0.130,-0.012,0.093


In [45]:
# exponentiate the parameters to interpret the results as odds ratios
np.exp(result.params)

intercept       0.131332
landing_page    0.985168
UK              1.051944
US              1.041599
dtype: float64

**Findings**  
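The results were consistent: none of the coefficients, except the intercept, was statistically significant at the 0.05 level. The odds ratio for landing_page is about 0.985, meaning the estimated odds of converting on the new page are roughly 1.5% lower than on the old page, a difference that is not statistically distinguishable from no effect. This reinforces the earlier conclusion that the new page had no substantial impact on conversion rates, even when accounting for the user’s country.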
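
The same exponentiation can be applied to the confidence bounds, which gives an interval for each odds ratio; an interval that contains 1 corresponds to a coefficient that is not significant. The sketch below uses the coefficients and 95% bounds copied from the summary table above (rounded as printed there). In the notebook itself, `np.exp(result.conf_int())` should give the same bounds without manual copying.

```python
from math import exp

# (coef, ci_low, ci_high) copied from the multivariate summary table above.
summary_rows = {
    "landing_page": (-0.0149, -0.037, 0.007),
    "UK":           (0.0506, -0.005, 0.106),
    "US":           (0.0408, -0.012, 0.093),
}

# Exponentiating maps log-odds coefficients and their bounds to odds ratios.
odds_ratios = {
    name: tuple(exp(v) for v in row) for name, row in summary_rows.items()
}

for name, (ratio, lo, hi) in odds_ratios.items():
    print(f"{name:>12}: OR = {ratio:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```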

## Conclusion
This analysis used three distinct methods: a simple conversion analysis, a one-tailed z-test (A/B test), and logistic regression. All three reached the same conclusion: the data provide no evidence that the new landing page leads to a higher conversion rate. This gives clear guidance for decision-making: at least in this context, the new landing page does not offer an advantage over the existing one.

## Shortcomings
While this analysis offers valuable insights, it's important to acknowledge certain shortcomings. P-values were examined only once, on the complete dataset, rather than tracked as the data accumulated. As a result, the analysis cannot show when, or how, the evidence evolved over the course of the experiment. Note that monitoring p-values repeatedly would itself require a correction for multiple looks, since unadjusted interim checks inflate the false-positive rate.

Moreover, the models here are deliberately simple. Adding variables such as the time of conversion might have provided additional insights, but would also have made the model more complicated and less interpretable, with a greater risk of overfitting and potentially less actionable outcomes.

Nonetheless, this comprehensive approach provides a solid foundation for making informed decisions about landing page designs and their potential impact on key business metrics.