# Chapter 3: Statistical Experiments and Significance Testing


The goal of a statistical experiment is to test a hypothesis by gathering data and applying statistical tools to draw inferences. This chapter covers the following concepts and tools for designing experiments, analyzing data, and drawing conclusions:
* A/B testing
* Hypothesis Test
* Resampling
* Statistical significance and p-Values
* t-Tests
* Multiple testing
* Degrees of freedom
* ANOVA 
* Chi-square test
* Multi-arm bandit algorithm
* Power and sample size

### A/B testing

A/B testing is a method for comparing two versions of a variable to determine which one performs better according to a certain metric. For example, testing whether a red jacket performs better than a blue jacket in terms of the number of jackets bought. Here are the key elements of an A/B test:

* Variable 
    * The treatment under study: color of the jacket, drugs, web headlines, the color of a layout.
* Control Group and Treatment Group
    * The subjects (buyers, patients, etc.) are randomly divided into two groups: the control group (A) receives the original version, while the treatment group (B) receives the variant.
* Test statistic (or metric)
    * The metric or set of metrics used to evaluate the performance of each version. It can be a binary variable (click or no click, conversion or no conversion) or a continuous variable (profit, pages visited).
* Statistical Analysis
    * After the test is conducted, statistical methods are used to analyze the results and determine whether the differences in outcomes between the two versions are statistically significant.
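
The elements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `assign_groups` and `conversion_rate`, the subject ids, and the purchase outcomes are all made up for the example and do not come from a real experiment.

```python
import random

def assign_groups(subjects, seed=0):
    """Randomly split subjects into a control group (A) and a treatment group (B)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # random order removes any selection bias
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def conversion_rate(outcomes):
    """Share of positive binary outcomes (1 = bought, 0 = did not buy)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical subject ids, split at random into groups A and B.
group_a, group_b = assign_groups(range(10))

# Hypothetical outcomes recorded after each subject saw their version.
purchased = {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 1, 6: 1, 7: 0, 8: 1, 9: 1}
rate_a = conversion_rate([purchased[s] for s in group_a])
rate_b = conversion_rate([purchased[s] for s in group_b])
print(f"A: {rate_a:.2f}, B: {rate_b:.2f}, difference (B - A): {rate_b - rate_a:.2f}")
```

The observed difference is the test statistic; whether it is statistically significant is a separate question, addressed by the methods in the rest of this chapter.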


#### Applications
A/B testing is mainly applied in a web context:

* Marketing
    * Optimizing campaign strategies: email content, and sending times.
* Website Optimization
    * Improving user experience.
* Pricing Strategies
    * Evaluating customer response to different pricing models.

#### Limitations

* Sample Size
    * A/B tests require sufficiently large sample sizes: with too few subjects, a real difference between the two groups may not reach statistical significance.


### Hypothesis tests (significance tests)

Hypothesis tests are statistical methods used to determine whether an observed result could plausibly be due to random chance. In other words, the primary role of significance tests is to prevent us from being fooled by random variation. The process involves comparing two hypotheses, the null hypothesis and the alternative hypothesis:

* Null Hypothesis (H0)
    * It posits that there is no difference, and that any observed difference is solely due to chance. Based on sample evidence, the test either rejects the null hypothesis (the difference is too extreme to be due to chance alone) or fails to reject it (the difference could plausibly originate from random chance).

* Alternative Hypothesis (H1)
    * It is the hypothesis retained when the null hypothesis is rejected. The null and alternative hypotheses must account for all possibilities.

* One-way Test
    * The test looks for a difference in one direction only. Is A greater than B? If the null hypothesis is rejected, chance is unlikely to produce a result where A exceeds B by as much as observed, so A appears to be better than B. It is useful for drug testing, where only a better result matters.

* Two-way Test
    * The test looks for any significant difference, regardless of direction. Is A greater or lower than B? If the null hypothesis is rejected, chance is unlikely to produce a result where A is as far above or below B as the one observed.

#### Applications
Hypothesis testing finds applications in numerous fields such as:

* A/B test
    * A hypothesis test serves as the next step in analyzing the results of an A/B test, or any randomized experiment, to rigorously evaluate the observed differences between groups A and B.
* Medical research
    * Testing whether a new drug is more effective than existing treatments. This is a one-way test where the null hypothesis states that the new drug is no better than the existing one, while the alternative hypothesis states that the new drug does better.
* Sales optimization
    * Determining if a change in packaging leads to improved sales. 


#### Limitations
Despite its broad utility, hypothesis testing comes with certain limitations:

* Risk of errors
    * There is a risk of making Type I errors (falsely rejecting the null hypothesis) or Type II errors (failing to reject a false null hypothesis).

* Directional bias in one-way tests
    * While one-way tests can be powerful for testing specific hypotheses, they may overlook significant effects in the opposite direction.





### Resampling

Resampling involves drawing repeated samples from observed data, mainly through:

* Bootstrap: Estimating the reliability of statistics by sampling with replacement (see chapter 2).
* Permutation Tests: Testing hypotheses by shuffling data labels to simulate the null hypothesis

#### Permutation Test

A permutation test is a non-parametric statistical method used for hypothesis testing: it assesses the significance of a difference between groups without assuming a specific underlying data distribution. Here are the main steps:

1. Gather the values from the different groups (e.g. A and B) into one dataset.
2. Randomly draw values without replacement to form new groups (A', B') of the same sizes as the original groups (A, B).
3. Measure the test statistic under study (e.g. the mean difference: mean(A')-mean(B')) and record it.
4. Repeat steps 2 and 3 R times to obtain a permutation distribution of the test statistic.
5. Compare the original metric mean(A)-mean(B) to the permutation distribution.

If the original metric lies well within the permutation distribution, the difference mean(A)-mean(B) could have been obtained by a random process. Otherwise, the difference is too extreme to be plausibly generated by chance, and it is statistically significant.
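
The five steps can be sketched as follows. This is a minimal illustration using only the standard library; the function name `permutation_test`, its parameters, and the sample values are assumptions made for the example. It returns both a one-way p-value (A greater than B) and a two-way p-value (any direction).

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Permutation test on the difference of means, mean(a) - mean(b)."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)                 # step 1: one combined dataset
    count_greater = 0
    count_extreme = 0
    for _ in range(n_resamples):               # step 4: repeat R times
        rng.shuffle(pooled)                    # step 2: shuffle = draw without replacement
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = mean(new_a) - mean(new_b)       # step 3: resampled test statistic
        if diff >= observed:
            count_greater += 1
        if abs(diff) >= abs(observed):
            count_extreme += 1
    # step 5: where does the observed value fall in the permutation distribution?
    p_one_way = count_greater / n_resamples
    p_two_way = count_extreme / n_resamples
    return observed, p_one_way, p_two_way

# Hypothetical session times (in seconds) for two page versions.
page_a = [21, 35, 67, 95, 73, 128, 87, 60]
page_b = [15, 22, 51, 48, 90, 33, 41, 29]
observed, p_one, p_two = permutation_test(page_a, page_b)
print(f"observed difference = {observed:.2f}, one-way p = {p_one:.3f}, two-way p = {p_two:.3f}")
```

The number of resamples is a trade-off between precision of the p-value and computation time.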


#### Applications
Permutation tests can be applied to various types of data, whether numeric or binary, and are not constrained by equal sample sizes or the assumption of normal distribution in data. This flexibility positions resampling as a more universally applicable strategy for hypothesis testing.


#### Limitations

Permutation tests are insightful but come with limitations compared to conventional statistical tests:

* Computational demands
    * They can be computationally intensive, especially with large datasets.
* Theoretical Insights
    * They give no insight into the theoretical distribution of the data, limiting broader inferences beyond the observed dataset.









### Statistical Significance and P-Values
Statistical significance is a measure of whether an effect observed in a dataset is more extreme than what chance alone would plausibly produce. Thus, it helps to decide if the results are unusual enough to reject the null hypothesis. For that, we often use the p-value, which quantifies the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is true. There are different tools to obtain it:

* Permutation Test
    * Under the null hypothesis, it generates values of the metric of interest by random shuffling. The p-value is the proportion of times that the permutation test produces a metric equal to or more extreme than the observed one.
* Chance Model
    * A chance model, such as a binomial or chi-square distribution, describes how the data would be distributed if they were purely due to chance. The p-value is the probability that the chance model produces a value equal to or more extreme than the observed one.

To decide whether to reject the null hypothesis based on the p-value, we often use a threshold named the significance level, denoted as alpha (α). The threshold is commonly set at 5% or 1%, a choice that is arbitrary. If the p-value is below the threshold, we reject the null hypothesis and call the effect statistically significant. If the p-value is above it, we conclude that the observed values could plausibly be produced by chance and the effect is not statistically significant.
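
As a small illustration of a chance model, the sketch below computes a one-way p-value from a binomial distribution and compares it to α. The function name `binomial_p_value` and the scenario (9 successes out of 10 trials under a null success probability of 0.5) are assumptions chosen for the example.

```python
from math import comb

def binomial_p_value(successes, trials, p_null):
    """One-way p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

alpha = 0.05  # significance level, chosen before looking at the data
p_value = binomial_p_value(successes=9, trials=10, p_null=0.5)
print(f"p-value = {p_value:.4f}")  # p-value = 0.0107
print("reject H0" if p_value < alpha else "fail to reject H0")  # reject H0
```

Here the p-value is 11/1024: only 11 of the 1,024 equally likely outcomes under the null hypothesis contain 9 or more successes.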

#### Applications
Significance testing and p-values are mainly used to prevent researchers and data scientists from being misled by random chance. It is a metric among others that can help in the decision process, but it should not be the sole factor relied upon.

#### Limitations
* Practical Significance
    * P-values do not provide any information about the importance of the effect under study.

* Misinterpretation
    * There's a common misconception that a p-value tells you the probability that the null hypothesis is true or false. However, it only indicates the probability of observing data at least as extreme as the data obtained if the null hypothesis were true.

* Arbitrary Thresholds
    * The common threshold of 5% for declaring statistical significance is arbitrary and can lead to the neglect of potentially important findings that don't meet this cutoff.

### t-tests

The t-test is used to determine if there is a significant difference between the means of two groups. It has been shown to be a good approximation of the permutation test for calculating the p-value of an A/B test.

The t-test relies on the t-statistic, which is calculated from sample data as:
t-statistic = [mean(A)-mean(B)]/[s*sqrt(2/n)]
where s is the pooled standard deviation of the two samples and n is the size of each sample (assuming equal sizes for simplicity). Under the null hypothesis, this statistic follows a t-distribution with 2n-2 degrees of freedom. Thus, for a one-tailed test, the p-value is the probability that the t-distribution produces a value greater than the calculated t-statistic; for a two-tailed test, it is the probability of a value whose absolute value exceeds the absolute value of the t-statistic.
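
A minimal sketch of the calculation, using only the standard library. The function name `pooled_t_statistic` and the sample values are assumptions for illustration. The standard error used here is s multiplied by sqrt(1/n_A + 1/n_B), the usual pooled two-sample form, which also covers unequal group sizes. The p-value is not computed because it requires the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t-statistic with pooled variance, plus its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_error, df

# Hypothetical measurements for two groups.
group_a = [2, 4, 6]
group_b = [1, 2, 3]
t_stat, df = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {df}")  # t = 1.549, degrees of freedom = 4
```

The t-statistic would then be compared to a t-distribution with the reported degrees of freedom to obtain the p-value.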


#### Applications
* Academic Research
    * Understanding how far a sample mean deviates from a hypothesized population mean can help researchers in fields from psychology to economics validate theories and models.

* Product Testing 
    * Compare the effectiveness or quality of different product versions.



#### Limitations with a Focus on the T-Statistic
* Distribution assumptions
    * The accuracy of the t-test depends on the assumption that the data follow a normal distribution. Deviations from normality, especially with small samples, can lead to inaccurate p-values and incorrect conclusions.

* Sensitivity to sample size
    * The t-statistic's reliability increases with sample size. With very small samples, the t-statistic might not accurately reflect the population parameters, leading to Type I or Type II errors.

* Dependency on Variance Homogeneity
    * When comparing two groups, the calculation of the t-statistic assumes homogeneity of variances. If this assumption is violated, the resulting t-statistic might not accurately reflect the true difference between groups, although adjustments like Welch's correction exist to mitigate this issue.

* Multiple testing
    * In research studies or data mining projects where there is a high degree of multiplicity, such as multiple comparisons, analysis of numerous variables, or the use of various models, there is an increased risk of mistakenly identifying findings as statistically significant merely due to random chance.
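
The multiple testing risk can be quantified with simple arithmetic. The sketch below assumes the tests are independent; the function names are made up for the example, and the Bonferroni adjustment shown is one common correction that is not discussed elsewhere in these notes.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a Bonferroni adjustment."""
    return alpha / n_tests

# With 20 independent tests at alpha = 0.05, a false positive is more likely than not.
print(f"{familywise_error_rate(0.05, 20):.4f}")  # 0.6415
print(f"{bonferroni_alpha(0.05, 20):.4f}")       # 0.0025
```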

### ANOVA

ANOVA (analysis of variance) is a statistical method used to compare the means of two or more samples to understand if at least one of the sample means significantly differs from the others. ANOVA operates under the null hypothesis that all group means are equal, and any observed differences are due to chance. The method extends the t-test, which is limited to comparing two groups. Here are the main methods:

* F-tests
    * It calculates an F-statistic based on the ratio of the variance between the groups to the variance within the groups. A significant F-statistic suggests that at least one group mean significantly differs from the others. The F-test is advantageous because it handles multiple groups in a single test, avoiding the inflated risk of Type I errors that comes with running many pairwise tests.

* Two-Way ANOVA
    * While a basic (one-way) ANOVA compares means across a single factor, Two-Way ANOVA allows for the examination of two independent factors simultaneously. 
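
For a one-way ANOVA, the F-statistic can be sketched as below. The function name `f_statistic` and the sample groups are assumptions for illustration; the p-value is left out because it requires the F-distribution, which the standard library does not provide.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F-statistic with its two degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f, df_between, df_within = f_statistic(groups)
print(f"F = {f:.2f}, df = ({df_between}, {df_within})")  # F = 13.00, df = (2, 6)
```

A large F means the group means differ much more than the spread inside each group would suggest.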


#### Applications

* Feature Selection
    * It can help in identifying which variables significantly impact a target variable, allowing for the reduction of model complexity.

* Quality Control
    * ANOVA can compare the effects of different process parameters or strategies on product quality or operational efficiency.

#### Limitations
* Assumption of Normality
    * It assumes that the data for each group are drawn from a normal distribution. 

* Homogeneity of Variances
    * ANOVA requires that the variances within each of the groups being compared are approximately equal. 

* Independence of Observations
    * The method assumes that the observations within each group are independent of each other. In some study designs, such as repeated measures or clustered samples, this assumption may not be met.

* Limited to Mean Comparisons
    * ANOVA focuses on differences in means among groups. It does not provide information on median differences or other aspects of distribution shapes, which might be of interest in some analyses.

* Multiple Comparisons Issue
    * While ANOVA can indicate that there is a significant difference among group means, it does not specify where these differences lie. 

### Chi-Square test

The Chi-Square test is used to evaluate whether the observed frequencies of a categorical variable deviate significantly from the expected frequencies. The test calculates a Chi-Square statistic, which measures the discrepancy between observed and expected frequencies. From it, the p-value can be computed in two different ways:
* Using the Chi-Square distribution
    * The p-value corresponds to the probability that the chi-square distribution, with the appropriate degrees of freedom, produces a value greater than the calculated chi-square statistic.
* Using a permutation test
    * The p-value corresponds to the proportion of resamples in which the resampled sum of squared deviations is at least as large as the observed one.

The chi-square statistic is mainly used for two tests:
* Goodness-of-fit test
    * It determines if a sample data matches a population with a specific distribution. For instance, it can test if the number of individuals with certain characteristics in a sample is consistent with the expected distribution of those characteristics in the general population.
* Test of Independence
    * It assesses whether there is a significant association between two categorical variables. For example, it can be used to determine if there is a relationship between gender (male/female) and preference for a particular type of product (Product A/Product B). It is mainly used with contingency tables.
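
A test of independence on a contingency table can be sketched as follows. The function names and the counts in the table are assumptions for illustration. The p-value helper is valid only for 1 degree of freedom (a 2x2 table), where the chi-square tail probability can be written with the complementary error function; other table sizes need the general chi-square distribution.

```python
from math import erfc, sqrt

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

# Hypothetical counts: rows = two customer groups, columns = Product A / Product B.
table = [[30, 10], [20, 20]]
stat, df = chi_square_statistic(table)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value_df1(stat):.4f}")
```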


#### Applications

* Experimental Research
    * It is used for establishing statistical significance between categorical variables in experimental research.
* Data Science
    * It can be used for determining appropriate sample sizes for web experiments, such as A/B testing. By analyzing pilot studies using the Chi-Square test, we can estimate the minimum number of participants needed to achieve statistically significant results.


#### Limitations
* Requirement for categorical data
    * It cannot be applied to continuous data.

* Sample size restrictions
    * The test requires a sufficient sample size to ensure reliable results.
    
* Assumption of independence
    * The observations need to be independent of each other. 

* No information on direction
    * While the Chi-Square test can indicate whether variables are related, it does not provide information on the direction or strength of the relationship. 

* Sensitivity to sample size
    * Extremely large sample sizes may lead to significant Chi-Square values for trivial differences, while small sample sizes may lack the power to detect meaningful associations.

### Multi-Arm Bandit Algorithm

The Multi-Arm Bandit Algorithm (MABA) is a procedure that optimizes the testing of new possibilities. MABA uses the information gathered during the experiment to steer the allocation process, limiting waste (of time and material). For example, in the context of a medical experiment aiming to test the efficiency of four different drugs across 100 subjects, a traditional approach might divide the subjects equally among the drugs, with each drug being administered to 25 subjects. In contrast, the MABA starts by distributing the drugs among subjects but then continuously re-adjusts the allocation based on the observed outcomes, often called the reward rate (e.g., improvement in health indicators, side effects), from the initial allocations.
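
One simple way to implement this kind of adaptive allocation is an epsilon-greedy strategy, sketched below. Epsilon-greedy is chosen here purely for illustration and is an assumption; the notes do not say which bandit strategy is meant. The function name, the parameters, and the true success rates of the four hypothetical drugs are also made up, and in a real trial the true rates would be unknown.

```python
import random

def epsilon_greedy(true_rates, n_subjects, epsilon=0.1, seed=0):
    """Allocate subjects to arms, favoring the arm with the best observed reward rate."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms      # number of subjects allocated to each arm
    rewards = [0] * n_arms    # number of successes observed on each arm
    for _ in range(n_subjects):
        untried = [i for i in range(n_arms) if pulls[i] == 0]
        if untried:
            arm = untried[0]                    # try every arm at least once
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)         # explore: pick an arm at random
        else:
            # exploit: pick the arm with the highest observed reward rate so far
            arm = max(range(n_arms), key=lambda i: rewards[i] / pulls[i])
        reward = 1 if rng.random() < true_rates[arm] else 0  # simulated outcome
        pulls[arm] += 1
        rewards[arm] += reward
    return pulls, rewards

# Four hypothetical drugs with simulated success rates, 100 subjects.
pulls, rewards = epsilon_greedy([0.2, 0.3, 0.5, 0.7], n_subjects=100)
print("subjects per drug:", pulls)
print("successes per drug:", rewards)
```

The parameter `epsilon` controls the share of subjects used for exploration; the remaining subjects receive whichever option currently looks best.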

#### Applications

* Clinical Trials
    * Dynamically allocating patients to different treatments to find the most effective one while treating patients effectively during the trial.
* Website Optimization
    * Continuously testing different versions of web page elements such as headlines or layouts.


#### Limitations

* Stationary Assumptions
    * MABA assumes stationary reward rate distributions, which may not hold in dynamic environments where the probabilities of rewards change over time.





### Power and sample size

Power is the probability of correctly rejecting the null hypothesis or, in other words, the probability of detecting an effect when there is one. This probability depends on:

* Effect size
    * It is the magnitude of the effect (or difference). The larger the effect, the easier it is to detect, and thus the higher the probability of correctly detecting it. In contrast, statistical significance tells us that there is a difference but gives no information about the size of the effect.
* Sample size
    * The larger the sample size, the easier it is to observe smaller differences. For instance, a 10% difference observed in a sample size of 200 yields a numerical difference of 20 (10% of 200), whereas the same percentage difference in a sample size of 20,000 yields a numerical difference of 2,000 (10% of 20,000).
* Significance level (α)
    * The threshold for rejecting the null hypothesis. A higher α increases the risk of Type I error (rejecting the null hypothesis when it is true) but also increases power. The most common α level is 0.05.

#### Applications

* Designing Studies
    * Before conducting a study, power analysis makes it possible to determine the necessary sample size to detect an effect of interest with a desired power, typically 0.80 or 80%. This ensures that the study is neither overpowered (wasting resources) nor underpowered (risking missing a true effect).
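
As a rough sketch of such a power analysis, the function below estimates the sample size per group for comparing two means with a two-way test. It relies on a normal approximation and on a standardized effect size (difference in means divided by the standard deviation); both are assumptions of this example, as is the function name `sample_size_per_group`. An exact calculation based on the t-distribution gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a standardized effect size."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for a two-way test
    z_power = z.inv_cdf(power)             # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size**2)

print(sample_size_per_group(0.5))  # 63
print(sample_size_per_group(0.2))  # 393
```

The two calls show how quickly the required sample grows as the effect to detect gets smaller.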

