In [1]:
# Module imports
import pandas as pd  # used explicitly below as `pd`
from Functions import *

# Brief Intro

## Normality tests

Run a normality check on every series before testing, because the result determines whether parametric or non-parametric methods are appropriate. Two checks are used: a heuristic, visual inspection of a QQ plot, followed by a Shapiro-Wilk test. If either check indicates that any series is not normally distributed, non-parametric methods are selected.
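
The selection rule above can be sketched with SciPy. This is a minimal illustration, not the code behind this notebook's helper functions: the names `looks_gaussian` and `choose_family`, the 0.05 threshold, and the synthetic samples are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def looks_gaussian(values, alpha=0.05):
    """Return True when the Shapiro-Wilk test fails to reject normality."""
    statistic, p_value = stats.shapiro(values)
    return p_value > alpha

def choose_family(series_list, alpha=0.05):
    """Pick 'parametric' only if every series passes the Shapiro-Wilk check."""
    if all(looks_gaussian(s, alpha) for s in series_list):
        return "parametric"
    return "non-parametric"

# Synthetic data for illustration only (not the notebook's datasets).
probs = (np.arange(1, 101) - 0.5) / 100
bell_shaped = stats.norm.ppf(probs)    # evenly spaced normal quantiles
right_skewed = stats.expon.ppf(probs)  # evenly spaced exponential quantiles

# QQ data: theoretical vs. ordered sample quantiles, plus a straight-line fit.
# An r close to 1 means the points lie close to the reference line.
(theoretical, ordered), (slope, intercept, r) = stats.probplot(bell_shaped, dist="norm")

print(choose_family([bell_shaped, bell_shaped]))
print(choose_family([bell_shaped, right_skewed]))
```

To draw the QQ plot itself, the `theoretical` and `ordered` arrays can be passed to any plotting library.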

## Parametric methods

### Pearson r correlation: 
The most widely used correlation statistic. It measures the degree of the relationship between two linearly related variables.
Note: For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve).
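
As a small sketch of the statistic, `scipy.stats.pearsonr` returns the coefficient together with a p-value. The paired values below are hypothetical and chosen to be roughly linear; the 0.05 threshold is an assumption.

```python
from scipy import stats

# Hypothetical paired measurements with a roughly linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r={r:.3f}, p={p_value:.3f}")

if p_value < 0.05:
    print("The two variables are linearly correlated (reject H0)")
else:
    print("No significant linear correlation (fail to reject H0)")
```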

### Two sample t-test: 
Tests for a difference between two groups of users (e.g. males vs females). The t-test determines whether the means of two sets of data are significantly different from each other.
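
A minimal sketch with `scipy.stats.ttest_ind`, using two hypothetical groups. By default the function assumes equal variances; passing `equal_var=False` switches to Welch's t-test, which may be the safer choice when the group variances differ.

```python
from scipy import stats

# Hypothetical measurements for two independent groups.
group_a = [20.1, 21.3, 19.8, 22.0, 20.7, 21.1]
group_b = [25.2, 24.8, 26.1, 25.5, 24.9, 25.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.3f}, p={p_value:.3f}")

if p_value < 0.05:
    print("The group means differ significantly (reject H0)")
else:
    print("No significant difference in means (fail to reject H0)")
```

The sign of the statistic shows the direction: a negative value means the first group has the smaller mean.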

### One way analysis of variance (ANOVA) test: 
A technique for comparing the means of two or more samples.
Typically, however, the one-way ANOVA is used to test for differences among at least three groups,
since the two-group case can be covered by a t-test.
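
The sketch below runs `scipy.stats.f_oneway` on three hypothetical groups. It also checks the two-group case: with equal variances assumed, the ANOVA F statistic equals the square of the t statistic, which is why a t-test covers that case.

```python
import math
from scipy import stats

# Hypothetical measurements for three independent groups.
low = [20.1, 21.3, 19.8, 22.0, 20.7]
medium = [23.4, 22.9, 24.1, 23.7, 23.0]
high = [26.2, 25.8, 27.1, 26.5, 25.9]

f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F={f_stat:.3f}, p={p_value:.3f}")

# Two-group case: F from ANOVA matches t squared from the pooled t-test.
f_two, _ = stats.f_oneway(low, high)
t_two, _ = stats.ttest_ind(low, high)
print(math.isclose(f_two, t_two ** 2, rel_tol=1e-6))
```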

## Non-parametric methods

### Spearman rank correlation: 
A non-parametric test that measures the degree of association between two variables.
Note: Unlike the Pearson correlation, the Spearman rank correlation test does not carry any assumptions about the distribution of the data.
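
Because Spearman works on ranks, it reports a perfect association for any strictly increasing relationship, even a non-linear one. The hypothetical data below illustrate the contrast with Pearson.

```python
from scipy import stats

# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [value ** 3 for value in x]

rho, p_rho = stats.spearmanr(x, y)
r, p_r = stats.pearsonr(x, y)

print(f"Spearman rho={rho:.3f}, p={p_rho:.3f}")
print(f"Pearson r={r:.3f}, p={p_r:.3f}")
```

Here the ranks of `x` and `y` agree exactly, so Spearman's rho is 1, while Pearson's r stays below 1 because the points do not fall on a straight line.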

### Wilcoxon-Mann-Whitney (WMW) rank sum test: 
A non-parametric method to test for differences between two groups of users (e.g. males vs females).
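
A minimal sketch with `scipy.stats.mannwhitneyu` on two hypothetical groups. The two-sided alternative is an assumption made for the example.

```python
from scipy import stats

# Hypothetical measurements; every value in group_a is below every value in group_b.
group_a = [1.1, 2.3, 2.9, 3.4, 4.0, 4.6, 5.2, 5.8]
group_b = [6.1, 7.4, 8.0, 8.8, 9.5, 10.2, 11.0, 12.3]

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U={u_stat:.1f}, p={p_value:.4f}")

if p_value < 0.05:
    print("The two groups differ significantly (reject H0)")
else:
    print("No significant difference between the groups (fail to reject H0)")
```

The U statistic counts the pairs in which a value from the first group exceeds a value from the second, so fully separated groups like these give U = 0.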

### Kruskal–Wallis test: 
A non-parametric method for testing whether samples originate from the same distribution. Kruskal–Wallis can accommodate more than two groups, extending Wilcoxon-Mann-Whitney. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA).
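
A minimal sketch with `scipy.stats.kruskal` on three hypothetical groups. A significant result only says that at least one group differs; identifying which one requires follow-up pairwise comparisons.

```python
from scipy import stats

# Hypothetical measurements for three independent groups.
group_a = [1.2, 2.5, 3.1, 4.4, 5.0]
group_b = [6.3, 7.1, 8.8, 9.2, 10.5]
group_c = [11.4, 12.0, 13.6, 14.1, 15.9]

h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H={h_stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    print("At least one group differs (reject H0)")
else:
    print("No significant difference among the groups (fail to reject H0)")
```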

# Examples

### Example workflow for difference in means

In [2]:
#import the Bounce Rate Datasets
xls1 = pd.ExcelFile('Data/Bounce_Rate_2018.xlsx')
xls2 = pd.ExcelFile('Data/Bounce_Rate_2019.xlsx')
Bounce_Rate_2018 = pd.read_excel(xls1, 'Dataset1')['Bounce Rate']
Bounce_Rate_2019 = pd.read_excel(xls2, 'Dataset1')['Bounce Rate']

  warn("Workbook contains no default style, apply openpyxl's default")


In [3]:
# Name each Series so the test output can tell the two years apart.
# (Passing a dict to Series.rename relabels the index and leaves the name unchanged.)
Bounce_Rate_2018.name = 'Bounce Rate 2018'
Bounce_Rate_2019.name = 'Bounce Rate 2019'

In [4]:
means_test(Bounce_Rate_2018, Bounce_Rate_2019)


Shapiro Test Statistics=0.988, p=0.593
Variable Bounce Rate looks Gaussian (fail to reject H0)

Shapiro Test Statistics=0.968, p=0.023
Variable Bounce Rate does not look Gaussian (reject H0)
Statistics=14.142, p=0.000
Bounce Rate  mean is bigger than that of  Bounce Rate
The difference between the datasets is significant (reject H0)


In [5]:
correlation_test(Bounce_Rate_2018,Bounce_Rate_2019)


Shapiro Test Statistics=0.988, p=0.593
Variable Bounce Rate looks Gaussian (fail to reject H0)

Shapiro Test Statistics=0.968, p=0.023
Variable Bounce Rate does not look Gaussian (reject H0)
Statistics=-0.184, p=0.081
The two datasets are not corelated


(-0.18383289657199248, 0.0811049728206936)

In [6]:
normality_test(Bounce_Rate_2018)


Shapiro Test Statistics=0.988, p=0.593
Variable Bounce Rate looks Gaussian (fail to reject H0)


In [7]:
#import the User Datasets and convert to Series
xls = pd.ExcelFile('Data/Channel_Sessions.xlsx')
Organic_Search = pd.read_excel(xls, 'Organic Search').rename(columns={"Sessions" :'Organic Search'}).drop('Day Index', axis=1).squeeze()
Direct = pd.read_excel(xls, 'Direct').rename(columns={"Sessions" :'Direct'}).drop('Day Index', axis=1).squeeze()
Referral = pd.read_excel(xls, 'Referral').rename(columns={"Sessions" :'Referral'}).drop('Day Index', axis=1).squeeze()
Social = pd.read_excel(xls, 'Social').rename(columns={"Sessions" :'Social'}).drop('Day Index', axis=1).squeeze()
Affiliates = pd.read_excel(xls, 'Affiliates').rename(columns={"Sessions" :'Affiliates'}).drop('Day Index', axis=1).squeeze()
Paid_Search = pd.read_excel(xls, 'Paid Search').rename(columns={"Sessions" :'Paid Search'}).drop('Day Index', axis=1).squeeze()

In [8]:
# Create a dict of means and names
means_dict = {"Organic_Search" :Organic_Search, "Direct":Direct, "Referral":Referral, "Social":Social, "Affiliates":Affiliates, "Paid_Search": Paid_Search}

In [9]:
multiple_means_test(means_dict)


Shapiro Test Statistics=0.095, p=0.000
Variable Organic Search does not look Gaussian (reject H0)
 

Shapiro Test Statistics=0.095, p=0.000
Variable Direct does not look Gaussian (reject H0)
 

Shapiro Test Statistics=0.102, p=0.000
Variable Sessions does not look Gaussian (reject H0)
 

Shapiro Test Statistics=0.095, p=0.000
Variable Social does not look Gaussian (reject H0)
 

Shapiro Test Statistics=0.108, p=0.000
Variable Affiliates does not look Gaussian (reject H0)
 

Shapiro Test Statistics=0.123, p=0.000
Variable Paid Search does not look Gaussian (reject H0)
 
The difference between the datasets is significant (reject H0)
Making Pairwise tests:
Pair:  ('Organic_Search', 'Direct')

Shapiro Test Statistics=0.095, p=0.000
Variable Organic Search does not look Gaussian (reject H0)

Shapiro Test Statistics=0.095, p=0.000
Variable Direct does not look Gaussian (reject H0)
Statistics=11.384, p=0.000
Organic Search  mean is bigger than that of  Direct
The difference between the datas