# Effect Size 

## SWBATs

* Illustrate a clear understanding of the terms "Effect" and "Effect Size" in a statistical context.
* Calculate simple (unstandardized) effect size using Python and SciPy
* Interpret results of simple effect size and identify shortcomings of this approach
* Calculate standardized effect size using Cohen's d statistic
* Visualize and Interpret the $d$ value as size of effect


## Introduction 

'Effect size' is used to quantify the *size of the difference* between two groups under observation. Effect sizes are easy to calculate and understand, can be applied to any measured outcome, and are applicable to a multitude of study domains. They are highly valuable for quantifying the *effectiveness of a particular intervention, relative to some comparison*. Measuring effect size allows scientists to go beyond the obvious and simplistic *'Does it work or not?'* to the far more sophisticated *'How well does it work in a range of contexts?'*.

Effect size measurement places its emphasis on the size of the effect alone, unlike statistical significance, which combines effect size and sample size. This promotes a more scientific approach to accumulating knowledge. Effect size is therefore routinely used in **meta-analysis**, i.e. for combining and comparing estimates from different studies conducted on different samples.


### Why do data scientists need to know about 'Effect Size'?

Consider the experiment conducted by Dowson (2000) to investigate time-of-day effects on children's learning: do children learn better in the morning or in the afternoon? A group of 38 children were included in the experiment. Half were randomly allocated to listen to a story and answer questions about it at 9am, the other half to hear exactly the same story and answer the same questions at 3pm. Their comprehension was measured by the number of questions answered correctly out of 20.

The average score was 15.2 for the morning group and 17.9 for the afternoon group, giving a difference of 2.7. 
**How big a difference is this?**

If the results were measured on a standard scale, such as GCSE grades, interpreting the difference would not be a problem. If the average difference was, say, half a grade or a full grade, most people would have a fair idea of the educational significance of the effect of reading a story at different times of day. However, in many experiments there is no familiar scale available on which to record the outcomes, e.g. student comprehension in this case. The experimenter often has to invent a scale or use (or adapt) an existing one, but most people would be unfamiliar with how to interpret such a scale.

One way to deal with this problem is to look at the amount of variation in the results (i.e. the standard deviation) alongside the means, and use it to contextualise the difference, as shown below with two possible scenarios.

<img src="images/effectsize-a.gif" width = "400"/>


As seen above, if there is little or no overlap, the majority of students in the afternoon group have done better on the test than the majority of students in the morning group, and this would seem like a very substantial difference. On the other hand, if the spread of scores were large and the overlap much bigger than the difference between the groups, then the effect might seem less substantial, as shown below.


<img src="images/effectsize-b.gif" width = "400"/>


*Because the standard deviation gives us an idea of the amount of variation found within a group, we can use it as a yardstick against which to compare the difference.* This idea is quantified in the calculation of the effect size. If the difference were as in the first graph, it would be very substantial. In the case of the second graph, on the other hand, the difference might hardly be noticeable.

In the data analytics domain, effect size serves three primary goals:

* Communicate the **practical significance** of results. An effect might be statistically significant, but does it matter in practical scenarios?

* Draw **meta-analytical** conclusions. Effect sizes allow you to group together a number of existing studies, calculate the meta-analytic effect size and get the best estimate of the true effect size of the population.

* Perform **power analysis**, which helps determine the number of participants (sample size) that a study would require to achieve a certain probability of finding a true effect, if there is one.


## Calculating effect size in Python 

### Using SciPy for measuring effect size

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. For our experiment we will use the `scipy.stats` module, which contains statistical tools and probabilistic descriptions of random processes. Detailed documentation of SciPy is available [here](https://docs.scipy.org/doc/scipy/reference/index.html).

In [None]:
# Import necessary modules 
from __future__ import print_function, division
import numpy as np

# Import SciPy stats and matplotlib for calculating and visualising effect size
import scipy.stats
import matplotlib.pyplot as pyplot

%matplotlib inline

# seed the random number generator so we all get the same results
np.random.seed(10)

### Example: 
To explore statistics that quantify effect size, let's first look at the difference in height between men and women in the USA, based on the mean and standard deviation of male and female heights (in centimeters) as given in the Behavioral Risk Factor Surveillance System (BRFSS).

**Male Height** (Mean = 178, Standard Deviation = 7.7)

**Female Height** (Mean = 163, Standard Deviation = 7.3)

We can use `scipy.stats.norm()` to represent each height distribution by passing the mean and standard deviation as arguments, which gives a normal distribution with those parameters.

In [None]:
#Mean height and sd for males
male_mean = 178
male_sd = 7.7

male_height = scipy.stats.norm(male_mean, male_sd)

> The result `male_height` is a SciPy `rv` object which represents a **normal continuous random variable**. 

In [None]:
male_height

> **Exercise:** Use the mean and standard deviation for female height and repeat calculations shown above to calculate `female_height` as an `rv` object.
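
As a hint for the exercise, here is a minimal sketch of one possible solution. It assumes the female mean and standard deviation quoted above and simply mirrors the male cell; the variable names `female_mean` and `female_sd` are chosen here for illustration.

```python
import scipy.stats

# Population parameters quoted above (heights in cm)
female_mean = 163
female_sd = 7.3

# Frozen normal distribution representing female height
female_height = scipy.stats.norm(female_mean, female_sd)

# The frozen object reports the parameters it was built with
print(female_height.mean(), female_height.std())
```

Printing the mean and standard deviation is a quick sanity check that the arguments were passed in the intended order.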

###  Evaluate Probability Density Function (PDF)

A continuous random variable, as calculated above, takes on an uncountably infinite number of possible values. For a discrete random variable X that takes on a finite or countably infinite number of possible values, we determine P(X = x) for all of the possible values of X, and call it the probability mass function (PMF). 

For continuous random variables, as in the case of heights, the probability that X takes on any particular value x is 0. That is, finding P(X = x) for a continuous random variable X is not going to work. Instead, we'll need to find the probability that X falls in some interval (a, b), i.e. we'll need to find **P(a < X < b)**. We'll do that using a **probability density function (PDF)**: the probability is the area under the PDF between a and b.
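
To make the interval idea concrete, the sketch below computes P(a < X < b) for male height in two ways: by numerically integrating the PDF with `scipy.integrate.quad`, and by differencing the CDF. The interval endpoints 170 and 186 are arbitrary values picked for illustration only.

```python
import scipy.stats
from scipy.integrate import quad

male_height = scipy.stats.norm(178, 7.7)

# Hypothetical interval, chosen only for this illustration
a, b = 170, 186

# Area under the PDF between a and b (numerical integration)
p_area, _ = quad(male_height.pdf, a, b)

# The same probability from the cumulative distribution function
p_cdf = male_height.cdf(b) - male_height.cdf(a)

print(p_area, p_cdf)
```

The two numbers should agree to many decimal places, which is the sense in which the PDF "contains" interval probabilities even though P(X = x) is 0 for any single x.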

The following function evaluates the normal (Gaussian) probability density function within 4 standard deviations of the mean. The function ingests an rv object and returns a pair of NumPy arrays.

In [None]:
def evaluate_PDF(rv, x=4):
    """Evaluate the PDF of `rv` within `x` standard deviations of its mean."""

    # Identify the mean and standard deviation of the random variable
    mean, std = rv.mean(), rv.std()

    # Generate 100 evenly spaced points covering x standard deviations either side of the mean
    xs = np.linspace(mean - x*std, mean + x*std, 100)

    # Evaluate the probability density at each of those points
    ys = rv.pdf(xs)

    return xs, ys # Return the points and their densities

> **Exercise: Use the function above to calculate xs and ys for male and female heights (pass the `rv` object as an argument) and plot the resulting xs and ys for both distributions to visualise the effect size.**
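
One way to approach this exercise is sketched below. The helper is repeated so the snippet runs on its own, and the plotting choices (line plots, axis labels) are assumptions rather than the lesson's prescribed solution.

```python
import numpy as np
import scipy.stats
import matplotlib.pyplot as pyplot

def evaluate_PDF(rv, x=4):
    """Same helper as above, repeated so this sketch is self-contained."""
    mean, std = rv.mean(), rv.std()
    xs = np.linspace(mean - x*std, mean + x*std, 100)
    return xs, rv.pdf(xs)

male_height = scipy.stats.norm(178, 7.7)
female_height = scipy.stats.norm(163, 7.3)

# Points and densities for each distribution
male_xs, male_ys = evaluate_PDF(male_height)
female_xs, female_ys = evaluate_PDF(female_height)

# Draw both curves on the same axes so the gap between them is visible
pyplot.plot(male_xs, male_ys, label='male')
pyplot.plot(female_xs, female_ys, label='female')
pyplot.xlabel('height (cm)')
pyplot.ylabel('probability density')
pyplot.legend()
```

The horizontal distance between the two peaks is the difference in means, and the width of each curve reflects its standard deviation.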

Let's assume for now that those are the true distributions for the population. As you studied earlier, in real life we never observe the true population distribution.  We generally have to work with a random sample.

## Un-standardized or Simple Effect Size Calculation

An unstandardized effect size simply measures the difference between two groups by calculating the difference between the distribution means. Here's how we can do it in Python.

We can use the `rvs` method of a `scipy.stats` `rv` object to generate 1000 random samples from each population distribution. Note that these are totally random, totally representative samples, with no measurement error, following our assumption of a true distribution. Visit [this link](https://docs.scipy.org/doc/scipy-1.0.0/reference/tutorial/stats.html) for more details.

In [None]:
male_sample = male_height.rvs(1000)

The resulting samples are NumPy arrays, so we can now easily calculate the mean and sd of the random samples.

In [None]:
mean1, std1 = male_sample.mean(), male_sample.std()
mean1, std1

The sample mean is close to the population mean, but not exact, as expected.
> **Exercise: Perform above calculation for female heights to calculate mean and sd of random samples from `female_height` `rv` object**

And the results are similar for the female sample.
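
A minimal sketch of the female-sample calculation might look like this. It rebuilds `female_height` from the parameters above so that it runs on its own, and it uses the names `female_sample`, `mean2` and `std2` that the later cells refer to. The exact numbers depend on the state of the random number generator.

```python
import numpy as np
import scipy.stats

np.random.seed(10)  # same seeding convention as the setup cell

female_height = scipy.stats.norm(163, 7.3)

# Draw 1000 random heights from the female population distribution
female_sample = female_height.rvs(1000)

# Sample mean and standard deviation
mean2, std2 = female_sample.mean(), female_sample.std()
print(mean2, std2)
```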

Now, there are many ways to describe the magnitude of the difference between these distributions. An obvious one is the difference in the means. 
> **Exercise: Calculate the difference in means of both distributions identified above.**

On average, men are around 15 centimeters taller. For some applications, that would be a good way to describe the difference, but there are still a few problems:

* Without knowing more about the distributions (like the standard deviations, i.e. the spread of each distribution) it's hard to interpret whether a difference like 15 cm is a lot or not.

* The magnitude of the difference depends on the units of measure, making it hard to compare across different studies that may be conducted with different units of measurement.

There are a number of ways to quantify the difference between distributions.  A simple option is to express the difference as a percentage of the mean.
> **Exercise: what is the relative difference in means, expressed as a percentage?**

But a problem with relative differences is that you have to choose which mean to express them relative to.

In [None]:
relative_difference = difference_in_means / female_sample.mean()
relative_difference * 100    # percent
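
To see how the choice of reference mean changes the answer, the sketch below computes the difference in means and expresses it relative to each group in turn. It regenerates both samples so that it is self-contained; the names `relative_to_male` and `relative_to_female` are introduced here for illustration.

```python
import numpy as np
import scipy.stats

np.random.seed(10)
male_sample = scipy.stats.norm(178, 7.7).rvs(1000)
female_sample = scipy.stats.norm(163, 7.3).rvs(1000)

# Simple (unstandardized) effect size: difference between the sample means
difference_in_means = male_sample.mean() - female_sample.mean()

# The same difference as a percentage of each group's mean
relative_to_male = difference_in_means / male_sample.mean() * 100
relative_to_female = difference_in_means / female_sample.mean() * 100

print(difference_in_means, relative_to_male, relative_to_female)
```

Because the female mean is the smaller denominator, the percentage relative to female height comes out larger than the percentage relative to male height, even though the underlying difference is identical.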

### Overlap threshold

As you can see above, the result differs depending on whether we express the relative difference in terms of male height or female height. Alternatively, we can look at the amount of overlap between the two distributions. To define overlap, we choose a threshold between the two means. The simplest threshold is the midpoint between the means:

In [None]:
simple_thresh = (mean1 + mean2) / 2
simple_thresh

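A better, but slightly more complicated, threshold is (approximately) the place where the PDFs cross. The formula below picks the point that lies the same number of standard deviations from each mean, which coincides with the crossing point when the two standard deviations are equal.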

In [None]:
thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)
thresh
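
As a check on this threshold, the following sketch compares the formula with the point where the two PDFs actually cross, found numerically with `scipy.optimize.brentq`. It assumes the population parameters rather than the sample estimates, and the names `approx_thresh` and `exact_thresh` are introduced here for illustration.

```python
import scipy.stats
from scipy.optimize import brentq

# Population parameters (assumed here instead of sample estimates)
mean1, std1 = 178, 7.7   # male
mean2, std2 = 163, 7.3   # female

male_height = scipy.stats.norm(mean1, std1)
female_height = scipy.stats.norm(mean2, std2)

# Threshold from the formula: equally many standard deviations from each mean
approx_thresh = (std1 * mean2 + std2 * mean1) / (std1 + std2)

# Point between the two means where the PDFs are equal
exact_thresh = brentq(
    lambda x: male_height.pdf(x) - female_height.pdf(x), mean2, mean1
)

print(approx_thresh, exact_thresh)
```

With standard deviations as similar as these, the two values should differ only slightly.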

In this example, there's not much difference between the two thresholds.
Now we can count how many men are below the threshold:

In [None]:
male_below_thresh = sum(male_sample < thresh)
male_below_thresh

And how many women are above it:

In [None]:
female_above_thresh = sum(female_sample > thresh)
female_above_thresh

The "overlap" is the total **AUC (Area Under the Curves)** that ends up on the wrong side of the threshold.

![auc](images/auc.png)

We can calculate the amount of overlap as shown below.

In [None]:
overlap = male_below_thresh / len(male_sample) + female_above_thresh / len(female_sample)
overlap

Or in more practical terms, you might report the fraction of people who would be misclassified if you tried to use height to guess sex:

In [None]:
misclassification_rate = overlap / 2
misclassification_rate
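
The sample-based overlap can be cross-checked against the population distributions directly. The sketch below is an illustration under the assumption that the population parameters are known: it uses the CDF and the survival function (`sf`, which is 1 minus the CDF) in place of counting samples.

```python
import scipy.stats

male_height = scipy.stats.norm(178, 7.7)
female_height = scipy.stats.norm(163, 7.3)

# Threshold from the formula above, using population parameters
thresh = (7.7 * 163 + 7.3 * 178) / (7.7 + 7.3)

# Area of each curve on the "wrong" side of the threshold
male_below = male_height.cdf(thresh)       # P(male height < thresh)
female_above = female_height.sf(thresh)    # P(female height > thresh)

overlap_exact = male_below + female_above
misclassification_exact = overlap_exact / 2

print(overlap_exact, misclassification_exact)
```

Because this threshold sits one standard deviation from each mean, each tail holds roughly 16% of its distribution, so the overlap is close to 0.32 and the misclassification rate close to 0.16.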

### Probability of superiority

Another way to quantify the difference between distributions is what's called **"probability of superiority"**. The term can be problematic elsewhere, but in this context it is simply the probability that a randomly chosen man is taller than a randomly chosen woman, which makes perfect sense.

> **Exercise: If we choose a male and a female at random, what is the probability that the male is taller than the female?**
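
One possible way to estimate this from samples is sketched below: pair the i-th male with the i-th female and count how often the male is taller. The pairing is valid because both samples are independent random draws; the name `prob_superiority` is introduced here for illustration.

```python
import numpy as np
import scipy.stats

np.random.seed(10)
male_sample = scipy.stats.norm(178, 7.7).rvs(1000)
female_sample = scipy.stats.norm(163, 7.3).rvs(1000)

# Element-wise comparison gives a boolean array; its mean is the
# fraction of random male/female pairs in which the male is taller
prob_superiority = np.mean(male_sample > female_sample)

print(prob_superiority)
```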

Overlap (or misclassification rate) as shown above, and "probability of superiority" have two good properties:

* As probabilities, they don't depend on units of measure, so they are comparable between studies.

* They are expressed in operational terms, so a reader has a sense of what practical effect the difference makes.

There is one other common way to express the difference between distributions i.e. the difference in means, standardized by dividing by the standard deviation.



## Standardized effect size

When analysts talk about effect size, they are generally referring to some method of calculating a *standardized* effect size. The standardized effect size statistic divides the effect size by some standardizer:

Effect Size / Standardizer

You would interpret that statistic in terms of standard deviations, e.g. the mean height of males in the USA is about 2 standard deviations higher than the mean height of females.

The effect size measure we will be learning about in this lesson is Cohen's $d$. This measure expresses the size of an effect as a number of standard deviations, similar to a z-score in statistics.

### Cohen's $d$

Cohen's $d$ is one of the most common ways to measure effect size. As an effect size, Cohen's $d$ is typically used to represent the magnitude of the difference between two groups on a given variable, with larger values representing a greater differentiation between the two groups on that variable. When comparing means in a scientific study, the reporting of an effect size such as Cohen's $d$ is considered complementary to the reporting of results from a test of statistical significance.

The basic formula to calculate Cohen’s $d$ is:

> **$d$ = effect size (difference of means) / pooled standard deviation**

The denominator is sometimes referred to as the **standardiser**, and it is important to select the most appropriate one for a given dataset. The pooled standard deviation measures the average spread of all data points about their own group mean (not the overall mean). It is computed by taking a weighted average of the group variances and then the square root, so larger groups have a proportionally greater effect on the overall estimate.



> **Exercise: Write a Python function named `CohenEffectSize` that takes two samples (NumPy arrays, such as the male and female height samples drawn above) and returns Cohen's $d$ using the formula *(mean(1) - mean(2)) / sd(pooled)*.**

Computing the denominator is a little complicated; in fact, people have proposed several ways to do it. A common choice is the "pooled standard deviation", which combines the standard deviations of the two groups, weighted by group size.
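
A minimal sketch of such a function is shown below. It is one possible implementation, not the only one: it assumes the inputs are NumPy arrays and pools the variances weighted by sample size, $s_{pooled} = \sqrt{(n_1 s_1^2 + n_2 s_2^2) / (n_1 + n_2)}$. Other conventions (for example weighting by $n - 1$) give slightly different values for small samples.

```python
import numpy as np
import scipy.stats

def CohenEffectSize(group1, group2):
    """Cohen's d for two samples, using a size-weighted pooled variance."""
    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()

    # Weighted average of the two group variances
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)

    # Standardize the difference in means by the pooled standard deviation
    return diff / np.sqrt(pooled_var)

# Example with samples drawn from the height distributions used above
np.random.seed(10)
male_sample = scipy.stats.norm(178, 7.7).rvs(1000)
female_sample = scipy.stats.norm(163, 7.3).rvs(1000)
d = CohenEffectSize(male_sample, female_sample)
print(d)
```

Swapping the two arguments flips the sign of the result, so it is worth stating which group is treated as the reference when reporting $d$.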

And here's the result for the difference in height between men and women.

In [None]:
CohenEffectSize(male_sample, female_sample)

### Interpreting $d$
Most people don't have a good sense of how big $d=2.0$ is. If you are having trouble visualizing what the result of Cohen's $d$ means, use these general “rule of thumb” guidelines (which Cohen said should be used cautiously):

* Small effect = 0.2
* Medium effect = 0.5
* Large effect = 0.8


“Small” effects are difficult to see with the naked eye. For example, Cohen reported that the height difference between 15-year-old and 16-year-old girls in the US is about this effect size. “Medium” is probably big enough to be discerned with the naked eye, while effects that are “large” can definitely be seen with the naked eye (Cohen calls this “grossly perceptible and therefore large”). For example, the difference in heights between 13-year-old and 18-year-old girls is 0.8. An effect under 0.2 can be considered trivial, even if your results are statistically significant.

Bear in mind that a “large” effect isn’t necessarily better than a “small” effect, especially in settings where small differences can have a major impact. For example, an increase in academic scores or health grades by an effect size of just 0.1 can be very important in the real world. It's always advisable to consult prior research in order to get an idea of where your findings fit into the bigger context.

Here is an excellent online visualisation tool developed by [Kristoffer Magnusson](http://rpsychologist.com/so) to help interpret the results of Cohen's $d$ statistic.

> **Exercise: visit the following link and interpret Cohen's $d$ statistic seen above by moving the slider: 
[Interpreting Cohen's d effect size: An interactive visualization](http://rpsychologist.com/d3/cohend/)**


## Putting it all together

Here's a function that encapsulates the code we already saw for computing overlap and probability of superiority.

In [None]:
def overlap_superiority(control, treatment, n=1000):
    """Estimates overlap and superiority based on a sample.
    
    control: scipy.stats rv object
    treatment: scipy.stats rv object
    n: sample size
    """
    control_sample = control.rvs(n)
    treatment_sample = treatment.rvs(n)
    thresh = (control.mean() + treatment.mean()) / 2
    
    control_above = sum(control_sample > thresh)
    treatment_below = sum(treatment_sample < thresh)
    overlap = (control_above + treatment_below) / n
    
    superiority = sum(x > y for x, y in zip(treatment_sample, control_sample)) / n

    return overlap, superiority

Here's the function that takes Cohen's $d$, plots normal distributions with the given effect size, and prints their overlap and superiority.

In [None]:
def plot_pdfs(cohen_d=2):
    """Plot PDFs for distributions that differ by some number of stds.
    
    cohen_d: number of standard deviations between the means
    """
    control = scipy.stats.norm(0, 1)
    treatment = scipy.stats.norm(cohen_d, 1)
    xs, ys = evaluate_PDF(control)
    pyplot.fill_between(xs, ys, label='control', color='#ff2289', alpha=0.7)

    xs, ys = evaluate_PDF(treatment)
    pyplot.fill_between(xs, ys, label='treatment', color='#376cb0', alpha=0.7)
    pyplot.legend()  # display the 'control' and 'treatment' labels
    
    o, s = overlap_superiority(control, treatment)
    print('overlap', o)
    print('superiority', s)

Here's an example that demonstrates the function:

In [None]:
plot_pdfs(1)
# Try changing the d value and observe the effect on the outcome below

> Use the functions above to calculate and interpret the effect size with Cohen's $d$ statistic for two groups of your own choice.

Cohen's $d$ has a few nice properties:

* Because mean and standard deviation have the same units, their ratio is dimensionless, so we can compare $d$ across different studies.

* In fields that commonly use $d$, people are calibrated to know what values should be considered big, surprising, or important.

* Given $d$ (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics.
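
The last property can be sketched in code. Assuming two normal distributions with equal standard deviations and the midpoint threshold used in `overlap_superiority`, both quantities follow from the standard normal CDF. The function names `overlap_from_d` and `superiority_from_d` are hypothetical helpers introduced for this illustration.

```python
import numpy as np
import scipy.stats

def overlap_from_d(d):
    """Overlap as defined in this lesson: the two tail areas on the wrong
    side of the midpoint threshold, assuming equal standard deviations."""
    return 2 * scipy.stats.norm.cdf(-abs(d) / 2)

def superiority_from_d(d):
    """Probability that a random treatment value exceeds a random control
    value, assuming normal distributions with equal standard deviations."""
    return scipy.stats.norm.cdf(d / np.sqrt(2))

for d in (0.2, 0.5, 0.8, 2.0):
    print(d, overlap_from_d(d), superiority_from_d(d))
```

These closed-form values can be compared with the simulated output of `plot_pdfs` for the same $d$; the simulated numbers should fluctuate around them.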

## Summary and Conclusion

In this lesson, we highlighted the importance of calculating and interpreting effect size in Python as a measure of the real-world difference between two groups. You learnt about simple (unstandardized) effect size calculation as a difference of means, as well as the standardization of this calculation using the standard deviation as a standardizer. You also learnt what Cohen's $d$ statistic is and how to use it for practical purposes. The best way to report effect size often depends on the audience, goals and subjects of study. There is often a tradeoff between summary statistics that have good technical properties and statistics that are meaningful to a general audience.