# A/B Test Analysis
We're going to conduct an Independent Samples T-test to analyse our A/B test. An Independent Samples T-test compares the means of two independent samples.
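To make the comparison of two means concrete, here is a minimal sketch that computes the pooled-variance t statistic by hand and checks it against `scipy.stats.ttest_ind`. The helper name `pooled_t_statistic` and the scores are made up for illustration; they are not part of the course data.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples (equal variances assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances (ddof=1).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means.
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err


# Made-up Likert scores, for illustration only
scores_a = [5, 7, 6, 5, 6, 4, 6]
scores_b = [5, 7, 7, 6, 6, 7, 5]

t_manual = pooled_t_statistic(scores_a, scores_b)
t_scipy, p_scipy = stats.ttest_ind(scores_a, scores_b)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two t values should agree, which shows that the test is simply the difference in means divided by its standard error.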

In [17]:
import pandas as pd
import numpy as np
import scipy.stats as stats

Export your results to a .csv file and save it to your GitHub repository. Import your .csv file, inspect it, and clean it where necessary.

In [18]:
# If on GitHub, load your data
df_A = pd.read_csv('./group_A.csv', sep=';')
df_B = pd.read_csv('./group_B.csv', sep=';')

df_columns = ['q1', 'q2', 'q3', 'q4', 'q5']

# Drop the metadata columns; note the ID column is spelled differently in each export
drop_columns = ['Start time', 'Completion time']
df_A = df_A.drop(columns=[*drop_columns, 'ID'])
df_B = df_B.drop(columns=[*drop_columns, 'Id'])

# Give both groups the same short question names
df_A.columns = df_B.columns = df_columns
 



In [19]:
# EDA A
df_A.info()  # Is your data in the right format? If not, clean it:
df_A.head()  # keep only the rows and columns containing Likert-score data, saved as integers.



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   q1      26 non-null     int64
 1   q2      26 non-null     int64
 2   q3      26 non-null     int64
 3   q4      26 non-null     int64
 4   q5      26 non-null     int64
dtypes: int64(5)
memory usage: 1.1 KB


   q1  q2  q3  q4  q5
0   5   5   6   6   5
1   7   6   6   5   6
2   6   4   7   5   5
3   5   6   5   5   6
4   6   5   4   5   6


In [20]:
# EDA B
df_B.info()  # Is your data in the right format? If not, clean it:
df_B.head()  # keep only the rows and columns containing Likert-score data, saved as integers.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   q1      25 non-null     int64
 1   q2      25 non-null     int64
 2   q3      25 non-null     int64
 3   q4      25 non-null     int64
 4   q5      25 non-null     int64
dtypes: int64(5)
memory usage: 1.1 KB


   q1  q2  q3  q4  q5
0   5   5   6   6   5
1   7   7   7   6   6
2   7   4   6   7   5
3   6   5   6   6   5
4   6   5   4   5   6
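If the inspection shows text answers, blanks, or out-of-range values, a cleaning step could look roughly like the sketch below. The helper `clean_likert`, the 1 to 7 scale, and the `raw` example frame are assumptions for illustration; adjust them to your own survey export.

```python
import pandas as pd


def clean_likert(df, question_columns, low=1, high=7):
    """Keep only the question columns, as integers within the Likert range."""
    # Turn anything that is not a number (e.g. "n/a") into NaN
    cleaned = df[question_columns].apply(pd.to_numeric, errors="coerce")
    # Values outside the assumed scale are treated as invalid
    cleaned = cleaned.where((cleaned >= low) & (cleaned <= high))
    # Drop respondents with any missing or invalid answer
    cleaned = cleaned.dropna()
    return cleaned.astype(int).reset_index(drop=True)


# Hypothetical raw export with a text answer and an out-of-range value
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "q1": ["5", "7", "n/a", "6"],
    "q2": [6, 9, 4, 5],
})
clean = clean_likert(raw, ["q1", "q2"])
print(clean)
```

Dropping a whole row when one answer is invalid is only one possible choice; you might prefer to keep partial responses and handle missing values per question.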


The rest we leave for tomorrow when we actually have our data. But if you are eager to play around a bit you can simply refresh the survey and fill in a couple of responses to create an A and a B version.

Now, let's start analysing the data we gathered! We won't dive into inferential statistics this block, since it gets complex quite fast; we'll do that in Year 2, block A. For now, you just need to know that we have to test whether the data is normally distributed and whether the variances of both samples are equal. If these assumptions are violated, our statistical tests are not valid and we cannot claim that the results we're seeing are not due to chance. What we want to establish is whether there is a statistically significant difference in the mean of a given variable between version A and version B.
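Before running any test, it can help to look at the means we are about to compare. The sketch below builds a small per-question summary; the helper `compare_means` is hypothetical, and the demo frames hold only the first five rows of q1 and q2 from the `head()` outputs above, so the numbers say nothing about the full samples.

```python
import pandas as pd


def compare_means(df_a, df_b):
    """Per-question mean, standard deviation and mean difference (B minus A)."""
    summary = pd.DataFrame({
        "mean_A": df_a.mean(),
        "mean_B": df_b.mean(),
        "std_A": df_a.std(),
        "std_B": df_b.std(),
    })
    summary["diff_B_minus_A"] = summary["mean_B"] - summary["mean_A"]
    return summary.round(2)


# First five rows of q1 and q2 from the head() outputs, for illustration only
demo_a = pd.DataFrame({"q1": [5, 7, 6, 5, 6], "q2": [5, 6, 4, 6, 5]})
demo_b = pd.DataFrame({"q1": [5, 7, 7, 6, 6], "q2": [5, 7, 4, 5, 5]})

summary = compare_means(demo_a, demo_b)
print(summary)
```

With your real data you would call `compare_means(df_A, df_B)`. The t-test then tells you whether a difference of that size is plausibly just noise.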

In [21]:
from scipy import stats

# Run Shapiro-Wilk test for normality for each question in df_A and df_B
shapiro_results_a = {}
shapiro_results_b = {}
for column in df_A.columns:
    shapiro_results_a[column] = stats.shapiro(df_A[column])
    shapiro_results_b[column] = stats.shapiro(df_B[column])

# Run Levene's test per question to check equality of variances between A and B
levene_results = {}
for column in df_A.columns:
    levene_results[column] = stats.levene(df_A[column], df_B[column])

# Print the Shapiro-Wilk and Levene's test results
for column in shapiro_results_a.keys():
    print(f"Shapiro-Wilk test for {column} in df_A:")
    print(f"  p-value: {round(shapiro_results_a[column].pvalue, 4)}")
    if shapiro_results_a[column].pvalue > 0.05:
        print("  The data is normally distributed.")
    else:
        print("  The data is not normally distributed. Consider a bootstrapped version.")

for column in shapiro_results_b.keys():
    print(f"Shapiro-Wilk test for {column} in df_B:")
    print(f"  p-value: {round(shapiro_results_b[column].pvalue, 4)}")
    if shapiro_results_b[column].pvalue > 0.05:
        print("  The data is normally distributed.")
    else:
        print("  The data is not normally distributed. Consider a bootstrapped version.")

for column, result in levene_results.items():
    print(f"Levene's test for equality of variances for {column}:")
    print(f"  p-value: {round(result.pvalue, 4)}")
    if result.pvalue > 0.05:
        print("  The groups have equal variances.")
    else:
        print("  The groups do not have equal variances. Consider a bootstrapped version.")


Shapiro-Wilk test for q1 in df_A:
  p-value: 0.0089
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q2 in df_A:
  p-value: 0.0018
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q3 in df_A:
  p-value: 0.0169
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q4 in df_A:
  p-value: 0.0006
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q5 in df_A:
  p-value: 0.0098
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q1 in df_B:
  p-value: 0.0
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q2 in df_B:
  p-value: 0.0009
  The data is not normally distributed. Consider a bootstrapped version.
Shapiro-Wilk test for q3 in df_B:
  p-value: 0.0001
  The data is not normally distributed. Consider a bootstrapped version.
Sha

Now that the data is in the right format and we know the column names, we can run the test. Replace 'A' with the column name that holds your original baseline version (A), and replace 'B' with the column name that holds the results of your improved version (B).

In [22]:
# Run Independent Samples T-test when assumptions are not violated.



for question in df_columns:
    # Run Independent Samples T-test
    t_stat, p_value = stats.ttest_ind(df_A[question], df_B[question])
    
    # Print the results for each question
    print(f"Results for {question}: T-statistic = {t_stat}, P-value = {p_value}")
    if p_value < 0.05:
        print("The results are significant; the versions are different enough to exclude chance as the driver.")
        print("This indicates that the version (A or B) with a higher/lower average score works better/worse, statistically speaking.")
    else:
        print("The results are not significant; the changes don't have a real measurable effect.")
        print("This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.")
    print("\n")
# # Print the results
# print(f"The results are significant if the p-value is smaller than 0.05",
# results,
# "\n",
# "If the results are significant, the versions are different enough to exclude chance as the driver. So if your version has a higher/lower average score and the difference is statistically significant, it works better/worse. If the results are not significant, the changes don't have a real measurable effect: maybe the version is no better, or maybe the questions don't really measure the effect and you should consider rephrasing or removing them. There's more to it, but that is for inferential statistics in Year 2.")

Results for q1: T-statistic = -1.828533995513482, P-value = 0.07355962674200757
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q2: T-statistic = 0.23357135135475796, P-value = 0.8162908423033319
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q3: T-statistic = -1.8117504102457893, P-value = 0.07615559139945592
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q4: T-statistic = 0.3523108509902171, P-value = 0.7261159989065984
The results are not significant; the changes don't have a real measurable effect.
This
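If Levene's test suggests that the variances of A and B differ, one common alternative (not used in the original notebook) is Welch's t-test, which `scipy.stats.ttest_ind` runs when you pass `equal_var=False`. The sketch below compares both variants on made-up scores with clearly different spreads.

```python
import numpy as np
from scipy import stats

# Made-up scores with visibly different spreads and sample sizes, for illustration only
scores_a = np.array([5, 5, 6, 5, 6, 5, 6, 5])
scores_b = np.array([2, 7, 4, 7, 3, 6])

# Student's t-test: assumes both groups share the same variance
student = stats.ttest_ind(scores_a, scores_b)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(scores_a, scores_b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

In your own loop, the change would be limited to adding `equal_var=False` to the `stats.ttest_ind` call for the affected questions.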

In [23]:
# Run Bootstrapped Independent Samples T-test when assumptions are violated
rng = np.random.default_rng()  # random number generator used for resampling

# Note: ttest_ind only uses random_state when `permutations` is also set;
# without it, the call below is the same standard t-test as in the previous cell.
for question in df_columns:
    # Run Independent Samples T-test
    t_stat, p_value = stats.ttest_ind(df_A[question], df_B[question], random_state=rng)
    
    # Print the results for each question
    print(f"Results for {question}: T-statistic = {t_stat}, P-value = {p_value}")
    if p_value < 0.05:
        print("The results are significant; the versions are different enough to exclude chance as the driver.")
        print("This indicates that the version (A or B) with a higher/lower average score works better/worse, statistically speaking.")
    else:
        print("The results are not significant; the changes don't have a real measurable effect.")
        print("This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.")
    print("\n")



Results for q1: T-statistic = -1.828533995513482, P-value = 0.07355962674200757
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q2: T-statistic = 0.23357135135475796, P-value = 0.8162908423033319
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q3: T-statistic = -1.8117504102457893, P-value = 0.07615559139945592
The results are not significant; the changes don't have a real measurable effect.
This might mean the version is no better, or perhaps the questions don't effectively measure the intended effect.


Results for q4: T-statistic = 0.3523108509902171, P-value = 0.7261159989065984
The results are not significant; the changes don't have a real measurable effect.
This
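A resampling-based test that does not rely on normality can be sketched with plain NumPy. Note that this is a permutation test rather than a bootstrap in the strict sense: it repeatedly shuffles the group labels and counts how often the shuffled difference in means is at least as large as the observed one. The function name, `n_resamples`, `seed`, and the scores are assumptions for illustration.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two groups."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_resamples):
        # Shuffle all scores, then split them back into two groups of the original sizes
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        # Small tolerance so floating-point ties count as "at least as extreme"
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1
    # Add-one correction keeps the p-value strictly above zero
    p_value = (count + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up Likert scores, for illustration only
scores_a = [5, 7, 6, 5, 6, 4, 6]
scores_b = [5, 7, 7, 6, 6, 7, 5]

diff, p = permutation_test_mean_diff(scores_a, scores_b)
print(f"mean difference (A - B) = {diff:.3f}, permutation p-value = {p:.4f}")
```

With your own data you would call it per question, for example `permutation_test_mean_diff(df_A['q1'], df_B['q1'])`, and interpret the p-value the same way as in the t-test cells.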

Great, that was our first t-test. Save the results to your learning log for week 8 and interpret them there. Were they what you expected? What are you going to change to improve your design, if necessary?