## Effect Size


An effect size quantifies the difference between two groups. A p-value tells us whether the difference between the groups is statistically significant, but it gives no insight into the magnitude of that difference. Larger sample sizes make it more likely that a difference is found to be statistically significant, even if the real-world effect is small. Hence, it's crucial to consider effect sizes in addition to p-values, as they give a clearer picture of the true difference between the groups and are more valuable in practical applications.
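To make the point about sample size concrete, here is a minimal sketch. It assumes a two-sample z-test with equal group sizes and a known standard deviation, which is a simplification chosen for illustration; the function name `two_sided_p_value` and the effect of 0.02 standard deviations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(d, n_per_group):
    """Two-sided p-value of a two-sample z-test (assumed: known SD, equal
    group sizes) when the observed standardized mean difference is d."""
    z = d * sqrt(n_per_group / 2)  # z statistic for equal group sizes
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02  # hypothetical effect: 2% of a standard deviation

p_small_n = two_sided_p_value(tiny_effect, 100)
p_large_n = two_sided_p_value(tiny_effect, 100_000)

print(f"n = 100 per group:    p = {p_small_n:.3f}")
print(f"n = 100000 per group: p = {p_large_n:.2g}")
```

The effect is identical in both cases; only the sample size changes, yet the second p-value falls far below the usual 5% threshold.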

There are different measures for effect sizes. The most common effect sizes are Cohen's d and Pearson's r.   

Cohen's d measures the size of the difference between two groups while Pearson's r measures the strength of the relationship between two variables.

### Cohen's d - Standardized Mean Difference
Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.

$$ d =\frac{ \overline x_1 - \overline x_2 }{S}$$

where $\overline x_1$ and $\overline x_2$ are the means of group 1 and group 2, respectively, and $S$ is a standard deviation.

The choice of standard deviation in the equation depends on your research design. We can use:
+ the pooled standard deviation, which is based on data from both groups,
+ the standard deviation from a control group, or
+ the standard deviation from the pretest or posttest data.
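As a minimal sketch of the first option, the snippet below computes Cohen's d with the pooled standard deviation using only the Python standard library. The function name `cohens_d` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation of both groups."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Hypothetical measurements for two groups
treatment = [5.1, 5.8, 6.2, 6.0, 5.5, 6.4]
control = [4.9, 5.0, 5.6, 5.2, 4.7, 5.3]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

Swapping in a different standard deviation (for example, the control group's) only changes the denominator.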

### Pearson's r - Correlation Coefficient

Pearson's $r$, or the correlation coefficient, measures the extent of a linear relationship between two variables.

The formula is rather complex, so it's best to use statistical software to calculate Pearson's r accurately from the raw data.

$$ r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$

The main idea of the formula is to measure how strongly the two variables vary together, relative to how much each one varies on its own. The squared value, $r^2$, gives the proportion of the variability of one variable that is linearly accounted for by the other. Because Pearson's r is standardized, it is unit-free, so you can directly compare the strengths of different correlations with each other.
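The formula can be translated term by term into code. This is a sketch for illustration only, with a hypothetical function name `pearson_r` and made-up data; in practice a statistics library is the safer choice.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed directly from the sums in the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical paired observations
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```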

### Interpreting Values

+ Cohen's $d$ can be positive or negative depending on which group has the larger mean, so its magnitude $|d|$ ranges from 0 to infinity. In general, the greater the magnitude of Cohen's d, the larger the effect size.
+ Pearson's $r$ ranges between -1 and 1. The closer the value is to 0, the smaller the effect size. A value closer to -1 or 1 indicates a larger effect size.

General rule of thumb to quantify whether an effect size is small, medium, or large:

**Cohen's d:**

+ A d of 0.2 or smaller is considered to be a small effect size.
+ A d of 0.5 is considered to be a medium effect size.
+ A d of 0.8 or larger is considered to be a large effect size.


**Pearson Correlation Coefficient:**

+ An absolute value of r around 0.1 is considered a small effect size.
+ An absolute value of r around 0.3 is considered a medium effect size.
+ An absolute value of r of 0.5 or greater is considered a large effect size.

# Statistical Power

Statistical power, or sensitivity, is the likelihood of a significance test detecting an effect when there actually is one. In other words, power is the probability that we will correctly reject the null hypothesis.

Let's look at an example to understand this concept. Suppose we have two distributions with minimal overlap, as shown in the first picture below. If we collect a small set of samples from both the green and red distributions and compare their means using hypothesis testing, we might get a small p-value, say 0.0004. This would cause us to correctly reject the null hypothesis that both sample sets came from the same distribution. In other words, if the blue distribution represents the null hypothesis that all data points came from a single distribution, we would reject that hypothesis.

If we keep repeating this experiment multiple times, there's a high probability that each statistical test will correctly give us a small p-value. In other words, there is a high probability that the null hypothesis that all the data came from the same distribution will be correctly rejected.

However, occasionally, we might get a trial like the one in the second picture below, where the two sample sets appear to come from the same distribution due to overlapping sample points, resulting in a high p-value, like 0.08. In such a trial, even though we know that the data came from two different distributions, we fail to reject the null hypothesis that all the data came from the same distribution. Such trials are rare, though: since these two distributions are far apart and have very little overlap, the probability of correctly rejecting the null hypothesis is high. Thus, power, being the probability that we will correctly reject the null hypothesis, is high in this example.

In summary, when distributions have minimal overlap, the statistical power is high, meaning there is a high likelihood of correctly rejecting the null hypothesis.


<center><img src="./data/p1.png"/></center>
<center><img src="./data/p2.png"/></center>

# Statistical Power: Overlapping Distributions

Now, let's consider a different scenario where we have a large overlap in the distributions, as shown in the first picture below. Most of the time, when we compare the means of these two distributions, we get a high p-value and fail to reject the null hypothesis that the data comes from the same distribution.

However, occasionally, when the sample data points are from the far extremes of the distributions, as shown in the second picture below, we get a small p-value and can correctly reject the null hypothesis that the data comes from the same distribution. Due to the overlap, the probability of correctly rejecting the null hypothesis is low, meaning we have relatively low power.

The good news is that we can increase the power by increasing the number of samples we collect. Power analysis tells us how many measurements we need to collect to reach a target level of power, such as 80%.

In summary, when distributions have a large overlap, the statistical power is low, meaning there is a low likelihood of correctly rejecting the null hypothesis. By increasing the sample size, we can improve the power of our test.
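Both scenarios can be checked with a small simulation that repeats the experiment many times and counts how often the null hypothesis is rejected. This is a sketch under stated assumptions: normal distributions with a known standard deviation of 1, a two-sided z-test in place of a t-test, and hypothetical means (0 vs. 3 for little overlap, 0 vs. 0.5 for large overlap). The function name `simulated_power` is made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulated_power(mean1, mean2, n, sd=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments in which a two-sided z-test
    (assumed: known SD, equal group sizes) rejects the null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample1 = [rng.gauss(mean1, sd) for _ in range(n)]
        sample2 = [rng.gauss(mean2, sd) for _ in range(n)]
        z = (mean(sample1) - mean(sample2)) / (sd * sqrt(2 / n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

power_little_overlap = simulated_power(0.0, 3.0, n=5)
power_large_overlap = simulated_power(0.0, 0.5, n=5)
power_large_overlap_big_n = simulated_power(0.0, 0.5, n=100)

print(f"little overlap, n=5:   {power_little_overlap:.2f}")
print(f"large overlap,  n=5:   {power_large_overlap:.2f}")
print(f"large overlap,  n=100: {power_large_overlap_big_n:.2f}")
```

With these settings the estimated power should be close to 1 for well-separated distributions, low for heavily overlapping ones at n = 5, and high again once the sample size is raised to 100.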

<center><img src="./data/p3.png"/></center>

<center><img src="./data/p4.png"/></center>

Before we learn how to do a power analysis, let's understand in detail why we need to perform one.

### Need for Power Analysis

In hypothesis testing, we start with a null hypothesis of no effect and an alternative hypothesis of a true effect. The goal is to collect enough data from a sample to statistically test whether we can reasonably reject the null hypothesis in favor of the alternative hypothesis. In doing so, there's always a risk of making one of two decision errors when interpreting study results:

- **Type I error**: Rejecting the null hypothesis of no effect when it is actually true.
- **Type II error**: Not rejecting the null hypothesis of no effect when it is actually false.

Power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error. Power is usually set at 80%. This means that if there are true effects to be found in 100 different studies, each with 80% power, we expect about 80 of the 100 statistical tests to detect them. If we don't ensure sufficient power, our study may fail to detect a true effect. Resources like time and money are then wasted, and it may even be unethical to collect data from participants for a study that is unlikely to yield a conclusive result.

On the flip side, too much power means our tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world. To balance these pros and cons of low versus high statistical power, we should use a **Power Analysis** to set an appropriate level.


# Power Analysis

Power is mainly influenced by sample size, effect size, and significance level. A power analysis can be used to determine the necessary sample size for a study. Having enough statistical power is necessary to draw accurate conclusions about a population using sample data.

Two of these factors are especially easy to visualize:

- **Overlap:** How much the two distributions we want to distinguish with our study overlap. This is what the effect size captures.
- **Sample Size:** The number of samples we collect from each group.

If we want power to be 80% and there is very little overlap, a small sample size will suffice. However, if the overlap between the two distributions is greater, we need a larger sample size to achieve 80% power.

To understand the relationship between overlap and sample size, we need to realize that when we do a statistical test, we usually compare sample means rather than individual measurements. So let's see what happens when we calculate means with different sample sizes.

- If the sample size is small, there is a lot of variation in the estimated means for a distribution. This makes it hard to be confident that any single estimated mean is a good estimate of the population mean, and the estimated means of the two distributions overlap.
- But if the sample size is large, the estimated means cluster tightly around their population means, so the estimates from the two distributions overlap far less. This suggests a high probability that we correctly reject the null hypothesis that both samples came from the same distribution. With a large sample size, we can achieve high power. Additionally, by the central limit theorem, this behavior of sample means holds for a wide range of underlying distributions (those with finite variance), not only the normal distribution.
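The shrinking spread of estimated means can be observed directly. The sketch below draws many samples from a standard normal distribution (an assumption for illustration) and measures how much the sample means vary; in theory this spread, the standard error, equals $\sigma / \sqrt{n}$. The function name `spread_of_sample_means` is hypothetical.

```python
import random
from statistics import mean, stdev

def spread_of_sample_means(n, trials=2000, seed=0):
    """Standard deviation of the means of `trials` samples of size n,
    each drawn from a standard normal distribution."""
    rng = random.Random(seed)
    sample_means = [
        mean([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return stdev(sample_means)

spread_small_n = spread_of_sample_means(5)
spread_large_n = spread_of_sample_means(100)

print(f"n = 5:   spread of means = {spread_small_n:.3f} (theory: {1 / 5 ** 0.5:.3f})")
print(f"n = 100: spread of means = {spread_large_n:.3f} (theory: {1 / 100 ** 0.5:.3f})")
```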

A power analysis consists of four main components. If you know or have estimates for any three of these, you can calculate the fourth component:

- **Statistical Power:** The likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
- **Sample Size:** The minimum number of observations needed to observe an effect of a certain size with a given power level.
- **Significance Level (alpha):** The maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
- **Expected Effect Size:** A standardized measure of the expected difference between the two groups. It combines the difference in means with the standard deviation and so reflects how much the two distributions overlap. Cohen's d is one such measure, and there are many others.

Before starting a study, we can use a power analysis to calculate the minimum sample size for a desired power level and significance level, along with an expected effect size. Traditionally, the significance level is set to 5% and the desired power level to 80%. That means we only need to figure out an expected effect size to calculate a sample size from a power analysis.
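As a rough illustration of this calculation, the sketch below uses the normal approximation for comparing two group means, $n \approx 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 / d^2$ per group. This approximation is an assumption made here for simplicity; an exact calculation based on the t-distribution gives slightly larger numbers. The function name `approx_n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means, using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# Conventional small, medium, and large values of Cohen's d
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {ceil(approx_n_per_group(d))} observations per group")
```

Halving the expected effect size roughly quadruples the required sample size, which is why a realistic effect size estimate matters so much.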

The `stats.power` module of the statsmodels package in Python contains tools for carrying out power analysis for the most commonly used statistical tests, such as t-tests, normal-based tests, F-tests, and the chi-square goodness-of-fit test. The `solve_power` method of its power classes (for example, `TTestIndPower` for the independent two-sample t-test) takes three of the four components mentioned above as input parameters and solves for the remaining one, typically the sample size.
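A minimal usage sketch follows, assuming statsmodels is installed and that an independent two-sample t-test is the planned analysis. The effect size of 0.5 and the group size of 40 are illustrative values, not recommendations.

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Leave nobs1 unspecified to solve for the sample size of group 1
n_per_group = analysis.solve_power(
    effect_size=0.5,      # expected Cohen's d (illustrative)
    alpha=0.05,           # significance level
    power=0.80,           # desired power
    ratio=1.0,            # equal group sizes
    alternative="two-sided",
)

# Leave power unspecified to solve for the power of a fixed design
achieved_power = analysis.solve_power(
    effect_size=0.5,
    nobs1=40,             # hypothetical sample size per group
    alpha=0.05,
    ratio=1.0,
    alternative="two-sided",
)

print(f"required sample size per group: {ceil(n_per_group)}")
print(f"power with 40 per group: {achieved_power:.2f}")
```

Whichever of the parameters is left as `None` is the one the method solves for, so the same call pattern answers both "how many samples do I need?" and "how much power does my design have?".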
