# A/B Test Analysis
We're going to conduct an Independent Samples T-test to analyse our A/B test. An Independent Samples T-test compares the means of two independent samples.
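
To make the idea concrete, here is a minimal sketch of how the t-statistic can be computed by hand and checked against SciPy. It uses the responses to the first survey question that are hard-coded in a later cell of this notebook; the helper name `pooled_t` is our own and not part of any library. It assumes equal variances (the classic pooled version of the test).

```python
import numpy as np
from scipy import stats

# Responses to "I understood what I could use the app for" (versions A and B)
a = np.array([7, 5, 5, 5, 6, 6, 4, 7, 7, 6])
b = np.array([6, 7, 7, 5, 6, 7, 6, 6, 6, 7])


def pooled_t(a, b):
    """Pooled-variance t-statistic for two independent samples (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / se


t_manual = pooled_t(a, b)
t_scipy = stats.ttest_ind(a, b).statistic
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```

Note that the statistic is computed as mean(A) minus mean(B), so a negative value means version B scored higher on average.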

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as stats

Export your results to a .csv file and save it to your GitHub repository. Import your .csv file, inspect it, and clean it where necessary.
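
If your export is a .csv file, the import and cleaning step could look roughly like the sketch below. The column names and the inline sample data are made up for illustration (an in-memory string stands in for your file), so swap in your own file path and questions.

```python
import io

import pandas as pd

# Hypothetical survey export; in practice use pd.read_csv("your_file.csv")
raw_csv = io.StringIO(
    "question_a,question_b\n"
    "7,5\n"
    ",6\n"       # missing answer
    "six,4\n"    # answer typed as text instead of a number
    "5,7\n"
)

df = pd.read_csv(raw_csv)

# Inspect the data types and the first rows
df.info()
print(df.head())

# Turn anything that is not a number into NaN, then drop incomplete responses
clean = df.apply(pd.to_numeric, errors="coerce").dropna().astype(int)
print(clean)
```

Dropping incomplete rows is only one possible choice; with few respondents you may prefer to keep partial responses and clean per question.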

In [14]:
import pandas as pd

# Load data from Excel files
df_A = pd.read_excel("A_B_Test_A.xlsx")
df_B = pd.read_excel("A_B_Test_B.xlsx")

# EDA for dataframe A
print("Info for dataframe A:")
print(df_A.info())  # Check data format
print("\nFirst few rows of dataframe A:")
print(df_A.head())  # Quick EDA

# EDA for dataframe B
print("\nInfo for dataframe B:")
print(df_B.info())  # Check data format
print("\nFirst few rows of dataframe B:")
print(df_B.head())  # Quick EDA


Info for dataframe A:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                                       Non-Null Count  Dtype
---  ------                                       --------------  -----
 0   I understood what I could use the app for    10 non-null     int64
 1   I found the application intuitive to use     10 non-null     int64
 2   I thought the application was useful         10 non-null     int64
 3   I enjoyed the application                    10 non-null     int64
 4   The buttons in the app are self explanatory  10 non-null     int64
dtypes: int64(5)
memory usage: 528.0 bytes
None

First few rows of dataframe A:
   I understood what I could use the app for  \
0                                          7   
1                                          5   
2                                          5   
3                                          5   
4                                          6   

   I

The rest we leave for tomorrow when we actually have our data. But if you are eager to play around a bit you can simply refresh the survey and fill in a couple of responses to create an A and a B version.

Now, let's start analysing our gathered data! We won't dive into inferential statistics in this block since it can get complex quite fast; we'll do that in Year 2, block A. For now, you just need to know that we need to test whether the data is normally distributed and whether the variances of both samples are equal. Otherwise, our statistical test would not be valid, and we could not say that the results we're seeing are unlikely to be due to chance. What we are going to ascertain is whether there is a statistically significant difference in the mean of a given variable between version A and version B.
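
One way to turn these assumption checks into a decision is sketched below. The helper names `choose_test` and `compare` are hypothetical, and the fallback rules are an assumption on our part: Welch's t-test (`equal_var=False`) when only the variances differ, and the Mann-Whitney U test when normality is violated.

```python
from scipy import stats

ALPHA = 0.05  # significance threshold used throughout this notebook


def choose_test(p_normal_a, p_normal_b, p_levene, alpha=ALPHA):
    """Pick a test name from the assumption-check p-values (hypothetical helper)."""
    if p_normal_a <= alpha or p_normal_b <= alpha:
        return "mann-whitney"  # normality violated in at least one group
    if p_levene <= alpha:
        return "welch"         # normal, but variances differ
    return "student"           # both assumptions hold


def compare(a, b, alpha=ALPHA):
    """Run the assumption checks, then the test that matches them."""
    p_a = stats.shapiro(a).pvalue
    p_b = stats.shapiro(b).pvalue
    p_lev = stats.levene(a, b).pvalue
    choice = choose_test(p_a, p_b, p_lev, alpha)
    if choice == "mann-whitney":
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
    else:
        result = stats.ttest_ind(a, b, equal_var=(choice == "student"))
    return choice, result


# Example with two small made-up samples
choice, result = compare([4, 3, 4, 3, 7, 7, 5, 6, 6, 3], [6, 6, 7, 7, 6, 7, 6, 6, 6, 5])
print(choice, result.pvalue)
```

With only ten respondents per group these checks have little power, so treat the outcome as a guide rather than a verdict.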

In [18]:
import pandas as pd
from scipy import stats

# Load data from Excel files
df_A = pd.read_excel("A_B_Test_A.xlsx")
df_B = pd.read_excel("A_B_Test_B.xlsx")

# List of questions
questions = [
    "I understood what I could use the app for",
    "I found the application intuitive to use",
    "I thought the application was useful",
    "I enjoyed the application",
    "The buttons in the app are self explanatory"
]

# Iterate through each question
for question in questions:
    print(f"Analyzing question: {question}")
    
    # Run the Shapiro-Wilk test for normality on both groups
    normal_a = stats.shapiro(df_A[question])
    normal_b = stats.shapiro(df_B[question])

    # Check the equality of variances using Levene's test
    homogeneity = stats.levene(df_A[question], df_B[question])

    # Print the results
    print(f"Shapiro-Wilk test for normality - Group A: {normal_a}")
    print(f"Shapiro-Wilk test for normality - Group B: {normal_b}")
    print(f"Levene's test for equality of variances: {homogeneity}")

    # Perform further analysis based on the results of the tests
    # (e.g., t-test or bootstrapped version)
    # Add your analysis code here
    print("\n")  # For better readability between questions


Analyzing question: I understood what I could use the app for
Shapiro-Wilk test for normality - Group A: ShapiroResult(statistic=0.8946101069450378, pvalue=0.19099095463752747)
Shapiro-Wilk test for normality - Group B: ShapiroResult(statistic=0.8588257431983948, pvalue=0.0739126205444336)
Levene's test for equality of variances: LeveneResult(statistic=0.0, pvalue=1.0)


Analyzing question: I found the application intuitive to use
Shapiro-Wilk test for normality - Group A: ShapiroResult(statistic=0.8731635212898254, pvalue=0.10880515724420547)
Shapiro-Wilk test for normality - Group B: ShapiroResult(statistic=0.7940640449523926, pvalue=0.012278728187084198)
Levene's test for equality of variances: LeveneResult(statistic=12.32876712328767, pvalue=0.0024937323027699816)


Analyzing question: I thought the application was useful
Shapiro-Wilk test for normality - Group A: ShapiroResult(statistic=0.8782697916030884, pvalue=0.12464962154626846)
Shapiro-Wilk test for normality - Group B: Shap

Now that the data is in the right format and we know the column names, we can run the test. Replace 'A' with the column name which holds your original baseline version, A. Replace 'B' with the column name which holds the result of your improved version, B.

In [20]:
import pandas as pd
from scipy import stats

# Load data from Excel files
df_A = pd.read_excel("A_B_Test_A.xlsx")
df_B = pd.read_excel("A_B_Test_B.xlsx")

# List of questions
questions = [
    "I understood what I could use the app for",
    "I found the application intuitive to use",
    "I thought the application was useful",
    "I enjoyed the application",
    "The buttons in the app are self explanatory"
]

# Iterate through each question
for question in questions:
    print(f"Analyzing question: {question}")
    
    # Run the Shapiro-Wilk test for normality on both groups
    normal_a = stats.shapiro(df_A[question])
    normal_b = stats.shapiro(df_B[question])

    # Check the equality of variances using Levene's test
    homogeneity = stats.levene(df_A[question], df_B[question])

    # Print the results
    print(f"Shapiro-Wilk test for normality - Group A: {normal_a}")
    print(f"Shapiro-Wilk test for normality - Group B: {normal_b}")
    print(f"Levene's test for equality of variances: {homogeneity}")

    # Check if assumptions are met for the t-test
    if normal_a.pvalue > 0.05 and normal_b.pvalue > 0.05 and homogeneity.pvalue > 0.05:
        # Conduct the t-test
        results = stats.ttest_ind(df_A[question], df_B[question])
        
        # Print the results of the t-test
        print(f"T-test results: {results}")
        
        # Interpret the results
        if results.pvalue < 0.05:
            print("The results are significant, indicating a statistically significant difference between versions A and B.")
            # ttest_ind(A, B) gives a positive statistic when mean(A) > mean(B)
            if results.statistic > 0:
                print("Version A has a higher average score than version B.")
            else:
                print("Version B has a higher average score than version A.")
        else:
            print("The results are not significant, suggesting no statistically significant difference between versions A and B.")
            # Add further interpretation here based on the lack of significance
    else:
        print("Assumptions for t-test are violated. Consider running a non-parametric test or examining the data further.")
    
    print("\n")  # For better readability between questions


Analyzing question: I understood what I could use the app for
Shapiro-Wilk test for normality - Group A: ShapiroResult(statistic=0.8946101069450378, pvalue=0.19099095463752747)
Shapiro-Wilk test for normality - Group B: ShapiroResult(statistic=0.8588257431983948, pvalue=0.0739126205444336)
Levene's test for equality of variances: LeveneResult(statistic=0.0, pvalue=1.0)
T-test results: TtestResult(statistic=-0.4285714285714289, pvalue=0.6733199381664248, df=18.0)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I found the application intuitive to use
Shapiro-Wilk test for normality - Group A: ShapiroResult(statistic=0.8731635212898254, pvalue=0.10880515724420547)
Shapiro-Wilk test for normality - Group B: ShapiroResult(statistic=0.7940640449523926, pvalue=0.012278728187084198)
Levene's test for equality of variances: LeveneResult(statistic=12.32876712328767, pvalue=0.0024937323027699816)
Assumptions for 

In [23]:
import pandas as pd
from scipy import stats

# Create dataframes for tests A and B
data_A = {
    "I found the application intuitive to use": [4, 3, 4, 3, 7, 7, 5, 6, 6, 3],
    "I thought the application was useful": [5, 1, 4, 2, 6, 6, 5, 7, 6, 5],
    "I enjoyed the application": [3, 1, 3, 2, 5, 7, 4, 6, 5, 4],
    "The buttons in the app are self explanatory": [3, 7, 6, 2, 5, 6, 4, 6, 7, 6]
}

data_B = {
    "I found the application intuitive to use": [6, 6, 7, 7, 6, 7, 6, 6, 6, 5],
    "I thought the application was useful": [5, 5, 7, 5, 5, 7, 6, 6, 6, 6],
    "I enjoyed the application": [5, 4, 7, 7, 6, 7, 6, 6, 6, 6],
    "The buttons in the app are self explanatory": [6, 5, 7, 5, 5, 7, 6, 6, 6, 6]
}

df_A = pd.DataFrame(data_A)
df_B = pd.DataFrame(data_B)

# List of questions
questions = [
    "I found the application intuitive to use",
    "I thought the application was useful",
    "I enjoyed the application",
    "The buttons in the app are self explanatory"
]

# Iterate through each question
for question in questions:
    print(f"Analyzing question: {question}")
    
    # Conduct the Mann-Whitney U test
    results = stats.mannwhitneyu(df_A[question], df_B[question], alternative='two-sided')
        
    # Print the results of the Mann-Whitney U test
    print(f"Mann-Whitney U test results: {results}")
    
    # Interpret the results
    if results.pvalue < 0.05:
        print("The results are significant, indicating a statistically significant difference between versions A and B.")
    else:
        print("The results are not significant, suggesting no statistically significant difference between versions A and B.")
    
    print("\n")  # For better readability between questions


Analyzing question: I found the application intuitive to use
Mann-Whitney U test results: MannwhitneyuResult(statistic=25.5, pvalue=0.05819135594349073)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I thought the application was useful
Mann-Whitney U test results: MannwhitneyuResult(statistic=33.0, pvalue=0.1917180227189862)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I enjoyed the application
Mann-Whitney U test results: MannwhitneyuResult(statistic=17.0, pvalue=0.012134269282682535)
The results are significant, indicating a statistically significant difference between versions A and B.


Analyzing question: The buttons in the app are self explanatory
Mann-Whitney U test results: MannwhitneyuResult(statistic=41.5, pvalue=0.5226099817831149)
The results are not significant, suggesting no statistically significant d

In [24]:
import pandas as pd
from scipy import stats

# Create dataframes for tests A and B
data_A = {
    "I understood what I could use the app for": [7, 5, 5, 5, 6, 6, 4, 7, 7, 6],
    "I found the application intuitive to use": [4, 3, 4, 3, 7, 7, 5, 6, 6, 3],
    "I thought the application was useful": [5, 1, 4, 2, 6, 6, 5, 7, 6, 5],
    "I enjoyed the application": [3, 1, 3, 2, 5, 7, 4, 6, 5, 4],
    "The buttons in the app are self explanatory": [3, 7, 6, 2, 5, 6, 4, 6, 7, 6]
}

data_B = {
    "I understood what I could use the app for": [6, 7, 7, 5, 6, 7, 6, 6, 6, 7],
    "I found the application intuitive to use": [6, 6, 7, 7, 6, 7, 6, 6, 6, 5],
    "I thought the application was useful": [5, 5, 7, 5, 5, 7, 6, 6, 6, 6],
    "I enjoyed the application": [5, 4, 7, 7, 6, 7, 6, 6, 6, 6],
    "The buttons in the app are self explanatory": [6, 5, 7, 5, 5, 7, 6, 6, 6, 6]
}

df_A = pd.DataFrame(data_A)
df_B = pd.DataFrame(data_B)

# List of questions
questions = [
    "I understood what I could use the app for",
    "I found the application intuitive to use",
    "I thought the application was useful",
    "I enjoyed the application",
    "The buttons in the app are self explanatory"
]

# Iterate through each question
for question in questions:
    print(f"Analyzing question: {question}")
    
    # Conduct the Mann-Whitney U test
    results = stats.mannwhitneyu(df_A[question], df_B[question], alternative='two-sided')
        
    # Print the results of the Mann-Whitney U test
    print(f"Mann-Whitney U test results: {results}")
    
    # Interpret the results
    if results.pvalue < 0.05:
        print("The results are significant, indicating a statistically significant difference between versions A and B.")
    else:
        print("The results are not significant, suggesting no statistically significant difference between versions A and B.")
    
    print("\n")  # For better readability between questions


Analyzing question: I understood what I could use the app for
Mann-Whitney U test results: MannwhitneyuResult(statistic=36.0, pvalue=0.2786190401034454)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I found the application intuitive to use
Mann-Whitney U test results: MannwhitneyuResult(statistic=25.5, pvalue=0.05819135594349073)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I thought the application was useful
Mann-Whitney U test results: MannwhitneyuResult(statistic=33.0, pvalue=0.1917180227189862)
The results are not significant, suggesting no statistically significant difference between versions A and B.


Analyzing question: I enjoyed the application
Mann-Whitney U test results: MannwhitneyuResult(statistic=17.0, pvalue=0.012134269282682535)
The results are significant, indicating a statistically significant dif
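
The Mann-Whitney U test tells us whether the two versions differ, but not in which direction. A minimal sketch of one way to read the direction from the U statistic is shown below, using the "I enjoyed the application" responses from the cell above. Dividing U by the number of A/B pairs gives the share of pairs in which the version A response is higher than the version B response (ties count as half); the variable name `share_a_higher` is our own.

```python
from scipy import stats

# Responses to "I enjoyed the application"
a = [3, 1, 3, 2, 5, 7, 4, 6, 5, 4]
b = [5, 4, 7, 7, 6, 7, 6, 6, 6, 6]

res = stats.mannwhitneyu(a, b, alternative="two-sided")

# U for the first sample divided by the number of (A, B) pairs
share_a_higher = res.statistic / (len(a) * len(b))

print(f"U = {res.statistic}, p = {res.pvalue:.4f}")
print(f"Share of pairs where A scores higher: {share_a_higher:.2f}")
if share_a_higher < 0.5:
    print("Version B tends to score higher than version A.")
else:
    print("Version A tends to score higher than version B.")
```

A share well below 0.5 points towards version B, a share well above 0.5 towards version A.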

In [25]:
import numpy as np
from scipy import stats

# List of questions
questions = [
    "I understood what I could use the app for",
    "I found the application intuitive to use",
    "I thought the application was useful",
    "I enjoyed the application",
    "The buttons in the app are self explanatory"
]

# Create a random number generator for bootstrapping
rng = np.random.default_rng()

# Iterate through each question
for question in questions:
    print(f"Analyzing question: {question}")
    
    # Run the Independent Samples T-test. Note: random_state only has an effect
    # when a resampling option such as permutations is also passed, so as written
    # this is the standard (non-bootstrapped) t-test.
    results = stats.ttest_ind(df_A[question],
                              df_B[question],
                              random_state=rng)
    
    # Print the results
    print(f"The results are significant if the p-value is less than 0.05: {results.pvalue < 0.05}")

    if results.pvalue < 0.05:
        # ttest_ind(A, B) gives a positive statistic when mean(A) > mean(B)
        if results.statistic > 0:
            print("Version A has a significantly higher average score than version B.")
        else:
            print("Version B has a significantly higher average score than version A.")
    else:
        print("The results are not significant, indicating no statistically significant difference between versions A and B for this question.")
    
    print("\n")  # For better readability between questions


Analyzing question: I understood what I could use the app for
The results are significant if the p-value is less than 0.05: False
The results are not significant, indicating no statistically significant difference between versions A and B for this question.


Analyzing question: I found the application intuitive to use
The results are significant if the p-value is less than 0.05: True
Version B has a significantly higher average score than version A.


Analyzing question: I thought the application was useful
The results are significant if the p-value is less than 0.05: False
The results are not significant, indicating no statistically significant difference between versions A and B for this question.


Analyzing question: I enjoyed the application
The results are significant if the p-value is less than 0.05: True
Version B has a significantly higher average score than version A.


Analyzing question: The buttons in the app are self explanatory
The results are significant if the p-value
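
If you want a resampling-based version of the comparison that does not rely on the normality assumption, one option is a permutation test on the difference in means. The sketch below writes it out with plain NumPy so every step is visible; the function name `permutation_pvalue`, the number of resamples, and the fixed seed are our own choices, not something prescribed by this course.

```python
import numpy as np


def permutation_pvalue(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the group labels by permuting the pooled responses
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        # Count shuffles at least as extreme as what we observed
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add one to both counts so the p-value is never exactly zero
    return observed, (extreme + 1) / (n_resamples + 1)


# Responses to "I enjoyed the application"
a = [3, 1, 3, 2, 5, 7, 4, 6, 5, 4]
b = [5, 4, 7, 7, 6, 7, 6, 6, 6, 6]

observed, p_value = permutation_pvalue(a, b)
print(f"mean(A) - mean(B) = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

A negative difference means version B scored higher on average; the p-value is the share of random relabelings that produce a difference at least as large as the observed one.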

Great, that was our first t-test. Save the results to your learning log for week 8 and interpret them there. Were they what you expected? What are you going to change to improve your design, if necessary?