In [72]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import warnings
warnings.filterwarnings('ignore')

paths = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        x = os.path.join(dirname, filename)
        paths.append(x)

df = pd.read_csv(paths[1])

## **1. Cleaning and Organising Data**

**<p style="font-size: 18px;">a) Quick Look at the Data</p>**

In [73]:
df.head(5)

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


<p style="font-size: 16px;">The data comes from a website that ran an A/B test on its users to assess the effectiveness of two landing pages. The columns show that users are already separated into two groups, control and treatment, and that each row records the user id, the landing page shown, and whether the user converted.</p>

In [74]:
df.shape

(294480, 5)

<p style="font-size: 16px;">We have nearly 300,000 rows of data for the 5 columns.</p>

In [75]:
df.describe()

Unnamed: 0,user_id,converted
count,294480.0,294480.0
mean,787973.538896,0.119658
std,91210.917091,0.324562
min,630000.0,0.0
25%,709031.75,0.0
50%,787932.5,0.0
75%,866911.25,0.0
max,945999.0,1.0


<p style="font-size: 16px;">The describe method provides summary statistics for the numerical columns. The user id is treated as a numerical column, but we can simply ignore its statistics because they have no bearing on the analysis. The converted column ranges from 0 to 1 with a mean of about 0.12. This makes sense because a conversion is recorded as 1 and a non-conversion as 0, so the mean is the overall conversion rate.</p>

**<p style="font-size: 18px;"> Handling Null and Duplicate Values</p>**

<p style="font-size: 16px;">This code finds the columns that contain null values and prints the number of nulls in each.</p>

In [76]:
#Check for NA Values - Print list of columns and number of nan values.
df_columns_mask = df.isna().any(axis=0)
columns = df.columns[df_columns_mask]

if len(columns) == 0:
    print("No NaN values found in the DataFrame.")
else:
    for col in columns:
        print(f"Column {col} has {df[col].isna().sum()} NaN values")


No NaN values found in the DataFrame.
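
<p style="font-size: 16px;">As a minimal sketch of the same check, <code>isna().sum()</code> returns the null count for every column in one call. The small frame below is made-up data, not the A/B test dataset, and the name <code>toy</code> is only for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values in 'converted'.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "converted": [0, np.nan, 1, np.nan],
})

# Null count per column, keeping only the columns that have any nulls.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```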


<p style="font-size: 16px;">For this test, we want to analyse the user experience the first time a user lands on one of the two pages, so the code below makes sure that no user id appears more than once. Because <code>keep=False</code> is used, every row belonging to a repeated user id is dropped, not just the later occurrences.</p>

In [77]:
# Drop every row whose user_id appears more than once (keep=False removes all copies).
print(df.shape)
df = df.drop_duplicates(subset= 'user_id', keep= False)
print(df.shape)

(294480, 5)
(286690, 5)


<p style="font-size: 16px;">The number of rows decreased from 294,480 to 286,690, so 7,790 rows belonging to repeated user ids were removed.</p>
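
<p style="font-size: 16px;">The <code>keep</code> argument of <code>drop_duplicates</code> decides which copies survive. The sketch below contrasts <code>keep=False</code> with <code>keep="first"</code> on made-up data. The frame <code>toy</code> and its values are hypothetical.</p>

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice, user 3 three times.
toy = pd.DataFrame({
    "user_id":   [1, 2, 2, 3, 3, 3, 4],
    "converted": [0, 1, 0, 0, 0, 1, 1],
})

# keep=False drops every row of a repeated user_id.
dropped_all = toy.drop_duplicates(subset="user_id", keep=False)

# keep="first" retains the first row of each user_id.
kept_first = toy.drop_duplicates(subset="user_id", keep="first")

print(dropped_all["user_id"].tolist())
print(kept_first["user_id"].tolist())
```

<p style="font-size: 16px;">Dropping all copies is the more conservative choice when it is unclear which page a repeated user actually saw first.</p>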

**<p style="font-size: 18px;">Rearranging Table by Landing Pages and Groups</p>**

<p style="font-size: 16px;">In order to see how the users were split among the different test groups in the AB test experiment, we can group the data according to landing pages and group type.</p>

In [78]:
#count observations for each landing page.
grouped = df.groupby(['landing_page', 'group']).agg({'landing_page': 'size'})
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,landing_page
landing_page,group,Unnamed: 2_level_1
new_page,treatment,143397
old_page,control,143293


<p style="font-size: 16px;"> We can see that the groups were pretty evenly split.</p>

In [79]:
grouped = df.groupby(['landing_page','group']).agg({'converted':'sum'})
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,converted
landing_page,group,Unnamed: 2_level_1
new_page,treatment,17025
old_page,control,17220


<p style="font-size: 16px;">The old page had slightly more conversions. The code below shows the frequencies in percentage format.</p>

In [80]:
grouped = df.groupby('landing_page').agg({'landing_page': 'size'}) / len(df) * 100
grouped


Unnamed: 0_level_0,landing_page
landing_page,Unnamed: 1_level_1
new_page,50.018138
old_page,49.981862


<p style="font-size: 16px;">The percentages reflect the previous count. There was a near 50% split between the two pages.</p>

In [81]:
grouped = df.groupby(['group','landing_page']).agg({'converted': 'mean'})
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,converted
group,landing_page,Unnamed: 2_level_1
control,old_page,0.120173
treatment,new_page,0.118726


<p style="font-size: 16px;">The conversion rates show only a slight difference in performance. The old page had a conversion rate of 12.0%, while the new page achieved 11.9%. If this difference proves to be statistically significant, it might suggest that more work is needed to optimise the new page for conversions.</p>
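
<p style="font-size: 16px;">The rates above can be reproduced by hand from the counts in the two grouped tables. This is only an arithmetic sketch. The variable names are made up for illustration.</p>

```python
# Counts taken from the grouped tables above.
control_n, control_conv = 143293, 17220
treatment_n, treatment_conv = 143397, 17025

control_rate = control_conv / control_n
treatment_rate = treatment_conv / treatment_n

# Absolute gap in conversion rate, and the gap relative to the control rate.
abs_diff = treatment_rate - control_rate
rel_diff = abs_diff / control_rate

print(f"control: {control_rate:.6f}, treatment: {treatment_rate:.6f}")
print(f"absolute difference: {abs_diff:.6f}, relative difference: {rel_diff:.2%}")
```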

## **2. Testing for Significance**

**<p style="font-size: 18px;">H0: There is no difference between the conversion rates of the old and new landing pages.</p>**

<p style="font-size: 16px;">This means any observed difference is due to random sampling variability.</p>

**<p style="font-size: 18px;">H1: There is a difference between the conversion rates of the old and new landing pages.</p>**

<p style="font-size: 16px;">This indicates that any observed difference is not just due to chance, suggesting a true effect or difference exists.</p>
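
<p style="font-size: 16px;">For the two-sample comparison carried out in this section, the hypotheses can be written in terms of the conversion rates of the two pages. The symbols below are notation introduced here for illustration.</p>

```latex
H_0 : p_{\text{old}} = p_{\text{new}}
\qquad\text{versus}\qquad
H_1 : p_{\text{old}} \neq p_{\text{new}}
```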

**<p style="font-size: 18px;">Using Power Analysis to Compare Two Means</p>**

<p style="font-size: 16px;">We choose to perform a two-sample Z-test because of the large sample size.
The Z-test checks whether the difference between the conversion rates of the two groups is larger than we would expect from random sampling alone.</p>

<p style="font-size: 16px;">According to the Central Limit Theorem, the sampling distribution of the mean approximates a normal distribution for large samples, even if the population distribution is not normal.</p>

<p style="font-size: 16px;">Using power analysis, we can find a sample size for the Z-test that is large enough that, if a real difference of the chosen size exists between the two groups, our test is likely to detect it.</p>

**<p style="font-size: 18px;">What is Power Analysis?</p>**

<p style="font-size: 16px;">Power analysis is a statistical technique used to determine the likelihood that a test will detect an effect, assuming that the effect truly exists. It helps researchers decide whether a test is adequate to detect a statistically significant effect in a hypothesis test.</p>
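
<p style="font-size: 16px;">To make the definition concrete, here is a minimal sketch of the power of a two-sided, two-sample z-test under the normal approximation with equal group sizes. The function name <code>two_sample_power</code> and the input values are hypothetical and are not part of this analysis.</p>

```python
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Illustrative values only: a standardised effect of 0.2 at two sample sizes.
for n in (100, 400):
    print(n, round(two_sample_power(0.2, n), 3))
```

<p style="font-size: 16px;">With an effect size of zero the function returns alpha, and the power grows as the sample size increases.</p>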

**Parameters required are:**

- **Power (1 - β):** The probability of correctly rejecting the null when it is false (we choose a power of 80%).

- **Effect size:** Using Cohen's formula we can calculate the effect size, given that we want to detect a difference of 1 percentage point in conversion rate.
- **Sample size:** The number of participants or observations (we do not yet know the sample size).

- **Significance level (α):** The probability of rejecting the null when it is actually true (alpha = 0.05).

<p style="font-size: 16px;">Knowing any three of these parameters allows us to determine the fourth. Here we know the power, effect size, and significance level, so we only need to solve for the sample size.</p>
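
<p style="font-size: 16px;">Under the normal approximation, the relation between the four parameters can be solved for the sample size in closed form: n per group is roughly 2 * ((z for alpha/2 + z for power) / effect size) squared. The sketch below applies it to the baseline rates used in this notebook (0.13 and 0.12). The helper <code>required_n_per_group</code> is a hypothetical name, and the formula ignores the negligible far-tail term of the two-sided test, so it should land close to, not exactly on, the statsmodels result.</p>

```python
from statistics import NormalDist

def required_n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile matching the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# Rates used in this notebook for a 1 percentage point difference.
p1, p2 = 0.13, 0.12
effect_size = (p1 - p2) / ((p1 * (1 - p1) + p2 * (1 - p2)) / 2) ** 0.5

n = required_n_per_group(effect_size)
print(f"Approximate sample size per group: {n:.0f}")
```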

In [82]:
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.power import NormalIndPower

# Input parameters: the minimum detectable effect is a 1 percentage point difference in conversion rate (p1 - p2).
p1 = 0.13
p2 = 0.12
#power parameter.
power = 0.80
#alpha parameter
alpha = 0.05

In [83]:
# Calculating the effect size with Cohen's formula.
effect_size = (p1 - p2) / ((p1 * (1 - p1) + p2 * (1 - p2)) / 2) ** 0.5
# Calculate the required sample size
analysis = NormalIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha, ratio=1.0, alternative='two-sided')
print(f'Effect size: {effect_size:.2f}')

print(f'Required sample size per group: {sample_size:.2f}')

Effect size: 0.03
Required sample size per group: 17165.46


<p style="font-size: 16px;">Using Cohen's formula we found that our desired effect size is 0.03.</p>
<p style="font-size: 16px;">Our required sample size is about 17,165 users per group.</p>
<p style="font-size: 16px;">We are now able to perform our Z-Test to test for significance.</p>
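
<p style="font-size: 16px;">As a side check, the effect size computed above is a standardised difference of the two rates. Another common effect size for proportions is Cohen's h, which uses an arcsine transform. The sketch below computes both for the same rates. For proportions this close together the two measures should nearly coincide.</p>

```python
import math

p1, p2 = 0.13, 0.12

# Standardised difference, as computed in the notebook.
d = (p1 - p2) / ((p1 * (1 - p1) + p2 * (1 - p2)) / 2) ** 0.5

# Cohen's h: difference of the arcsine-transformed proportions.
h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

print(f"d = {d:.5f}, h = {h:.5f}")
```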

In [84]:
# Set the random seed for reproducibility.
np.random.seed(45)

# Draw 17,165 rows from each group without replacement and reset the index to get one dataframe holding both samples.
sample_df = (df.groupby(['group'])
         .apply(lambda x: x.sample(n=17165, replace=False))
         .reset_index(drop=True))

In [85]:
sample_df

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,807689,47:50.8,control,old_page,0
1,817401,00:36.8,control,old_page,1
2,652424,28:57.7,control,old_page,0
3,912117,41:07.2,control,old_page,0
4,671687,03:04.9,control,old_page,0
...,...,...,...,...,...
34325,724420,47:05.8,treatment,new_page,0
34326,646390,07:56.9,treatment,new_page,0
34327,838051,35:19.1,treatment,new_page,0
34328,702806,07:35.5,treatment,new_page,0


<p style="font-size: 16px;">Instead of nearly 300,000 rows, we now have 34,330 rows (17,165 per group) on which to run the test.</p>
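
<p style="font-size: 16px;">An equivalent way to draw a fixed number of rows per group is <code>DataFrameGroupBy.sample</code>, available in pandas 1.1 and later, which takes <code>random_state</code> directly instead of relying on the global NumPy seed. The sketch below uses made-up data. The frame <code>toy</code> and the sample size of 4 are hypothetical.</p>

```python
import pandas as pd

# Hypothetical toy data standing in for the deduplicated dataframe.
toy = pd.DataFrame({
    "user_id": range(20),
    "group": ["control"] * 10 + ["treatment"] * 10,
    "converted": [0, 1] * 10,
})

# Draw 4 rows per group without replacement; random_state fixes the draw.
toy_sample = (toy.groupby("group")
                 .sample(n=4, replace=False, random_state=45)
                 .reset_index(drop=True))

print(toy_sample["group"].value_counts())
```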

In [86]:
# Collect the total observations and total converted per group
sample_df = (sample_df.groupby('group')
       .agg(total_observations=('user_id', 'size'),
            total_converted=('converted', 'sum'))
       .reset_index())

In [91]:
# Extract counts
conv = sample_df['total_converted'].values
n = sample_df['total_observations'].values

# Run a two-sample z-test for proportions to check whether the two conversion rates differ significantly
z_stat, p_value = proportions_ztest(count=conv, nobs=n)

In [92]:
print("Z-statistic:", z_stat)
print("P-value:", p_value)

Z-statistic: -0.43399652860477944
P-value: 0.664290961882086


<p style="font-size: 16px;">The p-value is 0.66, which is well above our significance level of 0.05, so we cannot reject the null hypothesis.</p>
<p style="font-size: 16px;">This means the difference we observed may simply be due to random sampling variability.</p>
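
<p style="font-size: 16px;">For readers who want to see what the test computes, here is a minimal hand-rolled sketch of a pooled two-proportion z-test, which is assumed to mirror the default behaviour of <code>proportions_ztest</code>. It is run on the full deduplicated counts from the grouped tables earlier, not on the 17,165-row samples, so its numbers will differ from the cell above. The function name <code>two_proportion_ztest</code> is hypothetical.</p>

```python
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equal proportions using the pooled estimate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null of equal rates.
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Full-data counts from the grouped tables (control first, then treatment).
z, p = two_proportion_ztest(17220, 143293, 17025, 143397)
print(f"z = {z:.3f}, p = {p:.3f}")
```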

## **Hypothesis Results**

<p style="font-size: 16px;">We found no statistically significant difference between the two landing pages in terms of conversion rate. We should keep the original landing page until further work on the new landing page produces a significant improvement in conversions.</p>