## Experimental Design

### 1. Make an Observation

The first step of the scientific method is to observe something that we want to test. During this step, we observe phenomena to help us refine the question that we want to answer. This might be anything from "does this drug have an effect on headaches?" to "does the color of this button affect the number of sales our website makes in a day?". Before we can test these things, we need to notice they exist and then come up with a specific question to answer.

### 2. Examine the Research

Good data scientists work smart before they work hard. In the case of the scientific method, this means seeing what research already exists that may help us answer our question, directly or indirectly. It could be that someone else has already done an experiment that answers our question--if that's the case, we should be aware of that experiment before starting our own, as it could inform our approach to structuring our experiment, or maybe even answer our question outright!

### 3. Form a Hypothesis

This is the stage that most people remember from learning the scientific method in grade school. In school, we learned that a hypothesis is just an educated guess that we will try to prove by structuring our experiment. In reality, it's a bit more complicated than that. During this stage, we'll formulate 2 hypotheses to test--our educated guess about the outcome is called our **_Alternative Hypothesis_**, while the opposite of it is called the **_Null Hypothesis_**. This is where the language behind experimental design (and the idea of what we can actually **_prove_** using an experiment) gets a bit complicated--more on this below.

### 4. Conduct an Experiment

This step is the part of the scientific method that we're most concerned with for this section. We can only test our hypothesis by gathering data from a well-structured experiment. A well-structured experiment is one that accounts for all of the mistakes and randomness that could give us mistaken signals as to the effect an intervention has. Simply running an experiment doesn't prove that A causes B, or that there's even a relationship between them! A poorly designed experiment will lead to false conclusions caused by factors that we haven't considered or controlled for--a well-designed experiment leaves us no choice but to acknowledge that the effects seen in a dependent variable are related to our independent variable. The world is messy and random--we have to account for this messiness and randomness in experiments, so that we can filter it out and be left only with the things we're actively trying to measure.

### 5. Analyze Experimental Results

Whether you realize it or not, you've already gotten pretty good at this step! All the work we've done with statistics is usually in service of this goal--looking at the data and understanding what happened. During this step, we tease out relationships, filter out noise, and try to determine if something that happened is **_statistically significant_** or not. 

### 6. Draw Conclusions

This step is the logical end point for our experiment.  We've asked a question, looked at experimental results from others that could be related to our question, made an educated guess, designed an experiment, collected data and analyzed the results.  All that is left is to use the results of our analysis step to evaluate whether we believe our hypothesis was correct or not! While the public generally oversimplifies this step to determining causal relationships (e.g. "my experiment showed that {x} causes {y}"), true scientists rarely make claims so bold.  The reality of this step is that we use our analysis of the data to do one of two things: either **_reject the null hypothesis, or fail to reject the null hypothesis_**.  This is a tricky concept, so we'll explore it in much more detail in a future lesson. 

### The Foundations of a Sound Experiment

Not all experiments are created equal--simply following the steps outlined above does not guarantee that the results of any experiment will be meaningful. For instance, there's nothing stopping a person from testing the hypothesis that "wearing a green shirt will make it rain tomorrow!", seeing rain the next day, and rejecting the null hypothesis, thereby incorrectly "proving" that their choice of wardrobe affected the weather. Good experiments show us that our independent variable {X} has an effect on our dependent variable {Y} because we control for all the other things that could be affecting {Y}, until we are forced to conclude that the only thing that explains what happened to {Y} is {X}!

Although there are many different kinds of experiments, there are some fundamental aspects of experimental design that all experiments have:

#### 1. A Control Group/Random Controlled Trials

One of the most important aspects of a sound experiment is the use of a **_Control Group_**. A Control Group is a cohort that receives no treatment or intervention--for them, it's just business as usual. In a medical test, the control group might receive a **_placebo_**, such as a sugar pill. In the example of testing the color of a button on a website, this would be customers that are shown a version of the website with the button color unchanged. Using a control group allows us to compare the results of doing nothing (our control) with the effects of doing something (our **_intervention_**). Without a control group, we have no way of knowing how much of the results we see can be attributed to our intervention, and how much would have happened anyway.

To make this more obvious, let's consider what we can actually know with confidence after an experiment that doesn't use a control. Let's say that a pharmaceutical company decides to test a new drug that is supposed to reduce the amount of time someone has the flu. The company gives the drug to all participants in the study. After analyzing the data, we find that the average length of time a person had the flu was 12 days. Was the drug effective, or not? Without a control, we don't know how long this flu would have lasted if these people were never given a drug. It could be that our drug reduced the time of infection down to 12 days. Then again, it could be that these people would have gotten better on their own after 12 days, and our drug didn't really do anything--or maybe they would have gotten better in 10 days, and our drug made it worse! By using a control group that gets no drugs and recovers naturally, we can compare the results of our treatment (people that received the experimental flu drug) to our control group (people that recovered naturally).

Note that a control group is only a control group if they are sampled from the same population as our treatment groups! If they aren't the same, then we have no way of knowing how much of the difference in results should be attributed to our flu drug, and how much should be attributed to the way(s) in which the control group is different. For instance, our experiment would not be very effective if the average age of one group was much higher or lower than another--if that was the case, how do we know the age difference isn't actually causing the difference in results (or lack thereof) between our control and our treatment groups, instead of our drug?

The main way scientists deal with this is through **_Random Controlled Trials_** (also known as randomized controlled trials). In a Random Controlled Trial, we have a control group and an intervention (also called treatment) group, where subjects are **_randomly assigned to each_**. You may have heard the terms **_Single-Blind_** and **_Double-Blind_**--these refer to who knows which group each participant is in. In a sound experiment, people should not know if they are in the treatment group or the control group, as that could potentially affect the outcome of the trial!

A **_Single-Blind_** or **_Blind Trial_** is one where the participant does not know if they are receiving the treatment or a placebo. 

A **_Double-Blind Trial_** is one where the participant does not know if they are receiving the treatment or a placebo, and neither does the person administering the experiment (because their bias could affect the outcomes, too!).  Instead, knowing whether someone received the treatment or a placebo is kept hidden from everyone until after the experiment is over (obviously, _someone_ has to know for recordkeeping purposes, but that person stays away from the actual experiment to avoid contaminating it with bias from that knowledge). 
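
The random assignment described above can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `assign_groups` and the integer subject IDs are hypothetical, and a real trial would typically also stratify or block on important covariates.

```python
import random

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into (roughly) equal control and treatment groups."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}

# Hypothetical study with 20 subjects identified by the integers 0..19
groups = assign_groups(range(20), seed=42)
print(groups)
```

Because the shuffle ignores everything about the subjects, characteristics such as age should be balanced across the two groups on average, which is what lets us attribute differences in outcome to the intervention.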

#### 2. Appropriate Sample Sizes

Randomness is a big problem in experiments--it can lead us to false conclusions by making us think that something doesn't matter when it does, or vice versa. Small sample sizes make us susceptible to the problem of randomness. Large sample sizes protect us from it.  The following scenario illustrates this point:

A person tells you that they can predict the outcome of a fair coin flip. You flip a coin, they call "tails", and they are correct.  Is this enough evidence to accept or reject this person's statement?  What if they got it right 2 times in a row? 5 times in a row? 55 times out of 100?  

This situation illustrates two things that are important for us to understand and acknowledge:

1. No matter how large our sample size, there's always a chance that our results can be attributed to randomness or luck.

2. At some point, we cross a threshold where the probability of random chance is small enough that we say "this probably isn't random", and are okay with accepting the results as the result of something other than randomness or luck.

With the situation above, we probably wouldn't assume that this person can predict coin flips after only seeing them get 1 correct.  However, if this person got 970 out of 1000 correct, we probably believe very strongly that this person _can_ predict coin flips, because the odds of guessing randomly and getting 970/1000 correct are very, very small--but not 0!  
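
To put numbers on the coin-flip scenario, the sketch below computes the chance of getting at least `k` calls right out of `n` purely by guessing, using the binomial distribution. The helper name `prob_at_least` is an illustrative choice, not something from the original lesson.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct calls by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The cases discussed above: 1 of 1, 2 of 2, 5 of 5, 55 of 100, and 970 of 1000
for k, n in [(1, 1), (2, 2), (5, 5), (55, 100), (970, 1000)]:
    print(f"{k}/{n} or better by luck: {prob_at_least(k, n):.3g}")
```

One correct call happens by luck half the time, and five in a row about 3% of the time, while 970 out of 1000 is astronomically unlikely under guessing (but, as noted above, still not 0).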

Large sample sizes protect us from randomness and variance. A more realistic example would be testing a treatment for HIV. Less than 1% of the global population carry a protective mutation that makes them resistant to HIV infection. If our sample size is only 1 person randomly selected from the population, there is a ~1% chance that we may mistakenly attribute successful prevention to the drug we're testing, when the results really happened because we randomly selected a person with this mutation. However, if our sample size is 100 people per sample, our odds of randomly selecting 100 people with that mutation are $0.01^{100}$. The larger our sample size, the more unlikely it is that we randomly draw people that happen to affect our study.

#### 3. Reproducibility

This one is a big one, and is a bit of a crisis in some parts of the scientific community right now.  Good scientific experiments have **_Reproducible Results_**! This means that if someone else follows the steps you outline for your experiment and performs it themselves, they should get pretty much the same results as you did (allowing for natural variance and randomness). If many different people try reproducing your experiment and don't get the same results, this might suggest that your results may be due to randomness, or to a **_lurking variable_** that was present in your samples that wasn't present in others. Either way, lack of reproducibility often casts serious doubts on the results of a study or experiment. 

This is less of a problem for data scientists, since reproducibility for us usually just means providing the dataset we worked with and the corresponding Jupyter notebook. However, this isn't always the case! Luckily, we can use code to easily run our experiments multiple times and show reproducibility. When planning experiments, consider running them multiple times to help show that your results are sound, and not due to randomness!

## Hypothesis Types

**_Null Hypothesis_**: There is no relationship between A and B 
Example: "There is no relationship between this flu medication and a reduced recovery time from the flu".

The _Null Hypothesis_ is usually denoted as $H_0$

**_Alternative Hypothesis_**: The hypothesis we traditionally think of when thinking of a hypothesis for an experiment
Example: "This flu medication reduces recovery time for the flu."

The _Alternative Hypothesis_ is usually denoted as $H_a$



## P-Values and Alpha Values

No matter what you're experimenting on, good experiments come down to one question: Is our p-value less than our alpha value? Let's dive into what each of these values represents, and why they're so important to experimental design.

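**_P-value_**: The probability of seeing results at least as extreme as the ones we observed, assuming the null hypothesis is true (that is, assuming only random chance is at work).

If we calculate a p-value and it comes out to 0.03, we can interpret this as saying "If there were really no effect, there would only be a 3% chance of seeing results at least this extreme due to randomness or pure luck".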

$\alpha$ **_(alpha value)_**: The marginal threshold at which we're okay with rejecting the null hypothesis.

An alpha value can be any value we set between 0 and 1. However, the most common alpha value in science is 0.05 (although this is somewhat of a controversial topic in the scientific community, currently).  

If we set an alpha value of $\alpha = 0.05$, we're essentially saying "I'm okay with rejecting the null hypothesis in favor of my alternative hypothesis if, were the null hypothesis true, there would be less than a 5% chance of seeing results like mine due to randomness".

When we conduct an experiment, our goal is to calculate a p-value and compare it to our alpha value. If $p < \alpha$, then we **_reject the null hypothesis_** and accept that there is not "no relationship" between our independent and dependent variables. Note that any good scientist will admit that this doesn't prove that there is a _direct relationship_ between our dependent and independent variables--just that we have enough evidence to the contrary to show that we can no longer believe that there is no relationship between them.

In simple terms:

$p < \alpha$: Reject the _Null Hypothesis_ and accept the _Alternative Hypothesis_

$p \geq \alpha$: Fail to reject the _Null Hypothesis_.
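
The comparison above is simple enough to express as a small helper. This is a minimal sketch; the function name `decide` and the wording of the returned strings are illustrative choices.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value and alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.03))         # p below the default alpha of 0.05
print(decide(0.03, 0.01))   # same p-value, stricter alpha
```

Note that the same p-value can lead to different decisions depending on the alpha chosen, which is why alpha should be fixed before looking at the data.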

## Structuring a Hypothesis Statement

There are many different ways that we can structure a hypothesis statement, but they always come down to this comparison in the end.  In normally distributed data, we calculate p-values from z-scores. This is done a bit differently with discrete data. We may also have **_One-Tail_** and **_Two-Tail_** tests.  

A **_One-Tail Test_** is when we want to know if a parameter from our treatment group is greater than (or less than) a corresponding parameter from our control group.

**_Example One-Tail Hypothesis_**

$H_a: \mu_1 < \mu_2$ "The treatment group given this weight loss drug will lose more weight on average than the control group that was given a competitor's weight loss drug."

$H_0: \mu_1 \geq \mu_2$ "The treatment group given this weight loss drug will not lose more weight on average than the control group that was given a competitor's weight loss drug."

The inequalities match the wording if $\mu_1$ and $\mu_2$ are read as the mean final weights of the treatment and control groups, so that a smaller mean corresponds to more weight lost.

A **_Two-Tail Test_** is for when we want to know whether a parameter differs from a reference value in either direction, that is, whether it falls outside of a range around that value.

**_Example Two-Tail Hypothesis_**

$H_a: \mu_1 \neq \mu_2$ "People in the experimental group that are administered this drug will not lose the same amount of weight as the people in the control group. They will be heavier or lighter".

$H_0: \mu_1 = \mu_2$ "People in the experimental group that are administered this drug will lose the same amount of weight as the people in the control group."
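
As a rough illustration of the difference between the two kinds of test, the sketch below runs both on the same made-up data using the `alternative` keyword of `scipy.stats.ttest_ind` (available in SciPy 1.6 and later). The weights are simulated purely for illustration and are not from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical final weights (kg); values are simulated for illustration only
treatment = rng.normal(loc=78, scale=5, size=40)
control = rng.normal(loc=82, scale=5, size=40)

# Two-tail: H_a is mu_1 != mu_2
two_tail = stats.ttest_ind(treatment, control, alternative="two-sided")
# One-tail: H_a is mu_1 < mu_2 (treatment group ends up lighter)
one_tail = stats.ttest_ind(treatment, control, alternative="less")

print("two-tail p-value:", two_tail.pvalue)
print("one-tail p-value:", one_tail.pvalue)
```

When the observed difference lies in the direction named by the one-tail alternative, the one-tail p-value is half of the two-tail p-value, which is why the direction has to be chosen before seeing the data.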

#### Steps for a one-sample t-test

##### Step 1: Write your null hypothesis statement
##### Step 2: Write your alternate hypothesis.
##### Step 3: Import necessary libraries and calculate sample statistics:
- The population mean ($\mu$). 
- The sample mean ($\bar{x}$). Calculate from the sample data
- The sample standard deviation ($s$). Calculate from sample data
- Number of observations ($n$). This can be calculated from the sample data.
- Degrees of Freedom ($df$). Calculate from the sample as df = total no. of observations - 1

##### Step 4: Calculate the t value from given data
    # sigma here is the sample standard deviation s from Step 3 (computed with ddof=1)
    t = (x_bar - mu) / (sigma / np.sqrt(n))
    t
##### Step 5: Find the critical t value.
##### Step 6: Compare t-value with critical t-value to reject or fail to reject the Null hypothesis.
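
The six steps above can be walked through by hand as follows. This is a sketch under stated assumptions: the sample values, the population mean of 65, and the one-sided (upper-tail) alternative are all made up for illustration.

```python
import numpy as np
from scipy import stats

# Steps 1-2 (assumed for illustration): H_0: mu <= 65, H_a: mu > 65
sample = np.array([71.2, 68.5, 66.0, 73.4, 64.9, 70.1, 69.3, 67.8, 72.6, 65.5])
mu = 65        # hypothesized population mean
alpha = 0.05

# Step 3: sample statistics
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)   # sample standard deviation (divides by n - 1)
df = n - 1

# Step 4: t value
t = (x_bar - mu) / (s / np.sqrt(n))

# Step 5: critical t value for a one-sided test at level alpha
t_crit = stats.t.ppf(1 - alpha, df)

# Step 6: compare (equivalently, compare the upper-tail p-value with alpha)
p = stats.t.sf(t, df)
print("t =", t, "t_crit =", t_crit, "p =", p)
print("reject H_0" if t > t_crit else "fail to reject H_0")
```

The hand-computed `t` should agree with the statistic returned by `stats.ttest_1samp(sample, mu)`; note that SciPy's default p-value for that function is two-sided.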

#### code for one sample t-test

    import numpy as np
    import seaborn as sns
    from scipy import stats

    def one_sample_ttest(sample, popmean, alpha):
        """Run a two-sided one-sample t-test of `sample` against `popmean`."""
        sample = np.asarray(sample)

        # Visualize sample distribution for normality
        # (histplot replaces the deprecated distplot in seaborn >= 0.11)
        sns.set(color_codes=True)
        sns.set(rc={'figure.figsize': (12, 10)})
        sns.histplot(sample, kde=True)

        # Population mean
        mu = popmean

        # Sample mean (x̄) using NumPy mean()
        x_bar = sample.mean()

        # Sample standard deviation (s); ddof=1 divides by n - 1
        s = np.std(sample, ddof=1)

        # Degrees of freedom
        df = len(sample) - 1

        # Calculate the two-sided critical t-value (matches ttest_1samp's two-sided p-value)
        t_crit = stats.t.ppf(1 - alpha / 2, df=df)

        # Calculate the t-value and p-value
        t_stat, p_val = stats.ttest_1samp(a=sample, popmean=mu)

        # |t| > t_crit is equivalent to p < alpha for the two-sided test
        if p_val < alpha:
            print("Null hypothesis rejected. Results are statistically significant with t-value =",
                  round(t_stat, 2), ", critical t-value =", t_crit, "and p-value =", np.round(p_val, 10))
        else:
            print("Failed to reject the null hypothesis with t-value =",
                  round(t_stat, 2), ", critical t-value =", t_crit, "and p-value =", np.round(p_val, 10))
                    
#### Two-sample t-test

    '''
    Calculates the T-test for the means of *two independent* samples of scores.

    This is a two-sided test for the null hypothesis that 2 independent samples
    have identical average (expected) values. This test assumes that the
    populations have identical variances by default.
    '''

    stats.ttest_ind(experimental, control)
    
**The SciPy library offers several single-line t-test functions:**
https://docs.scipy.org/doc/scipy/reference/stats.html

#### Welch's t-test (for unequal variances)
    def welch_t(a, b):

        """ Calculate Welch's t statistic for two samples. """

        numerator = a.mean() - b.mean()

        # ddof = "Delta Degrees of Freedom": the divisor used in the calculation is N - ddof,
        # where N represents the number of elements. By default ddof is zero in NumPy.

        denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)

        return numerator/denominator

    welch_t(a,b)
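
For comparison, SciPy can compute the same statistic (plus a p-value) via `stats.ttest_ind` with `equal_var=False`. The sketch below checks a hand-computed Welch statistic against SciPy on simulated data; the two samples are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated samples with different sizes and variances (illustrative values only)
a = rng.normal(loc=10, scale=1, size=30)
b = rng.normal(loc=11, scale=3, size=50)

# Welch's t statistic by hand, same formula as welch_t above
t_manual = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)

# equal_var=False tells SciPy not to pool the variances (Welch's test)
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print("manual t:", t_manual)
print("scipy  t:", t_scipy, "p-value:", p_scipy)
```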
    
#### Permutation test for population similarity (null hypothesis - both samples come from the same population)
Suppose we calculate the mean of both samples (here, one of 37 and one of 45 observations) and wish to perform a hypothesis test at a 5% significance level for whether the two samples belong to the same overall population. In our previous work, we would use a t-test to perform this comparison. The permutation test alternative would be to compare the difference in these sample means to the difference in sample means of all possible combinations of 37-45 splits between our 82 data points. In other words, we compare the difference between our actual sample means to the difference in sample means between all variations of all those 82 points in order to calculate our p-value and determine whether we reject or fail to reject the null hypothesis.

    diff_mu_a_b = np.mean(a) - np.mean(b)
    combos = permT(a, b)  # permT (defined elsewhere) is assumed to return every (ai, bi) split of the pooled data
    print("There are {} possible sample variations.".format(len(combos)))
    num = 0 #Initialize numerator
    for ai, bi in combos:
        diff_mu_ai_bi = np.mean(ai) - np.mean(bi)
        if diff_mu_ai_bi >= diff_mu_a_b:
            num +=1
    p_val = num / len(combos)
    print('P-value: {}'.format(p_val))
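
Since the helper that enumerates the splits is not shown here, the following is one possible self-contained sketch of an exact permutation test using `itertools.combinations`. The function name `permutation_p_value` and the tiny samples are hypothetical; enumeration is only practical for very small samples.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """One-sided exact permutation p-value for the difference mean(a) - mean(b)."""
    pooled = list(a) + list(b)
    observed = mean(a) - mean(b)
    indices = range(len(pooled))
    count = total = 0
    # Enumerate every way of choosing which pooled points play the role of sample a
    for a_idx in combinations(indices, len(a)):
        chosen = set(a_idx)
        ai = [pooled[i] for i in a_idx]
        bi = [pooled[i] for i in indices if i not in chosen]
        if mean(ai) - mean(bi) >= observed:
            count += 1
        total += 1
    return count / total

# Tiny made-up samples: 4 + 4 points gives only 70 possible splits
a = [12, 15, 14, 16]
b = [9, 11, 10, 13]
print("P-value:", permutation_p_value(a, b))
```

Working with indices rather than values keeps the split correct even when the pooled data contains duplicate values.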

#### Permutation Tests and Exploding Combination Sizes - Using Monte Carlo Simulations
When conducting permutation tests, the number of potential combinations quickly explodes as our original sample sizes grow. As a result, even with modern computers, it is often infeasible or egregiously resource-expensive to attempt to generate these permutation spaces. To cope with this, Monte Carlo simulations are often used in practice in order to simulate samples from the permutation space.

    import numpy as np

    diff_mu_a_b = np.mean(a) - np.mean(b)
    num = 0
    denom = 0
    # Pool the two samples (np.concatenate works for lists and arrays alike)
    union = np.concatenate([a, b])
    checkpoints = {10, 100, 500, 1000, 10**4, 10**5, 10**6, 2*10**6, 5*10**6}
    for i in range(1, 5*10**6 + 1):
        # Shuffle the pooled data: the first len(a) values form ai,
        # and the remainder is its complement bi
        shuffled = np.random.permutation(union)
        ai, bi = shuffled[:len(a)], shuffled[len(a):]
        # Compute difference in means
        diff_mu_ai_bi = np.mean(ai) - np.mean(bi)
        if diff_mu_ai_bi >= diff_mu_a_b:
            num += 1
        denom += 1
        # Report the running p-value estimate at selected iteration counts
        if i in checkpoints:
            print("After {} iterations p-value is: {}".format(i, num/denom))

## Effect Size

In a data analytics domain, effect size calculation serves three primary goals:

* Communicate **practical significance** of results. An effect might be statistically significant, but does it matter in practical scenarios?

* Effect size calculation and interpretation allows you to draw **Meta-Analytical** conclusions. This allows you to group together a number of existing studies, calculate the meta-analytic effect size and get the best estimate of the true effect size of the population.

* Perform **Power Analysis**, which helps determine the number of participants (sample size) that a study requires to achieve a certain probability of finding a true effect - if there is one.

#### Un-standardized or Simple Effect Size Calculation
An unstandardized effect size simply measures the difference between two groups by calculating the difference between distribution means. Here is how you can do it in Python.

    mean1, std1 = sample1.mean(), sample1.std()
    mean1, std1
    mean2, std2 = sample2.mean(), sample2.std()
    mean2, std2
    difference_in_means = sample1.mean() - sample2.mean()
    difference_in_means
    
#### Overlap threshold to determine possibility of misclassification (how many samples overlap)
    # Assumes mean1 > mean2, so sample1 values below the midpoint and
    # sample2 values above it are the ones that would be misclassified
    simple_thresh = (mean1 + mean2) / 2
    simple_thresh
    sample1_below_thresh = sum(sample1 < simple_thresh)
    sample1_below_thresh
    sample2_above_thresh = sum(sample2 > simple_thresh)
    sample2_above_thresh
    overlap = sample1_below_thresh / len(sample1) + sample2_above_thresh / len(sample2)
    overlap
    misclassification_rate = overlap / 2
    misclassification_rate
   
#### Probability of superiority (likelihood of x > y)
    sum(x > y for x, y in zip(sample1, sample2)) / len(sample1)
    
#### Cohen's $d$

Cohen's $d$ is one of the most common ways to measure effect size. As an effect size, Cohen's $d$ is typically used to represent the magnitude of differences between two groups on a given variable, with larger values representing a greater differentiation between the two groups on that variable.

The basic formula to calculate Cohen’s $d$ is:

> **$d$ = effect size (difference of means) / pooled standard deviation**

    def Cohen_d(group1, group2):

        # Compute Cohen's d.

        # group1: Series or NumPy array
        # group2: Series or NumPy array

        # returns a floating point number 

        diff = group1.mean() - group2.mean()

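        n1, n2 = len(group1), len(group2)
        # Use sample variances (ddof=1) so that Series and NumPy arrays give the same result
        var1 = np.var(group1, ddof=1)
        var2 = np.var(group2, ddof=1)

        # Calculate the pooled variance
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)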

        # Calculate Cohen's d statistic
        d = diff / np.sqrt(pooled_var)

        return d

#### Interpreting $d$
Most people don't have a good sense of how big $d=2.0$ is. If you are having trouble visualizing what the result of Cohen's $d$ means, use these general "rule of thumb" guidelines (which Cohen said should be used cautiously):

>**Small effect = 0.2**

>**Medium Effect = 0.5**

>**Large Effect = 0.8**   
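
These rules of thumb can be wrapped in a small lookup. This is only a sketch: the function name `interpret_cohens_d`, the use of the thresholds as lower bounds, and the "negligible" label for values below 0.2 are assumptions made for illustration, not part of Cohen's guidelines as quoted above.

```python
def interpret_cohens_d(d):
    """Map Cohen's d to a rough rule-of-thumb label (use cautiously)."""
    size = abs(d)   # the sign only indicates direction, not magnitude
    if size < 0.2:
        return "negligible"   # assumed label for effects below the 'small' threshold
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

for d in (0.1, 0.3, -0.6, 2.0):
    print(d, "->", interpret_cohens_d(d))
```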

## Type I and Type II errors

#### Type I (alpha)
When conducting hypothesis testing, you must choose a significance level, alpha ($\alpha$), which you will use as the threshold for rejecting or failing to reject the null hypothesis. This significance level is also the probability that you reject the null hypothesis when it is actually true. This scenario is a Type 1 error, more commonly known as a **False Positive**.

#### Type II (beta)
Another type of error is beta ($\beta$), which is the probability that you fail to reject the null hypothesis when it is actually false. Type 2 errors are also referred to as **False Negatives**.

#### Balancing Type I and Type II Errors 
Different scenarios call for scientists to minimize one type of error over another. The two error types are inversely related to one another; for a fixed sample size, reducing type 1 errors will increase type 2 errors and vice versa.
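
The trade-off can be seen in a small simulation. The sketch below assumes a one-sided z-test of $H_0$: mean = 0 with known standard deviation 1, a sample size of 30, and a true effect of 0.3 when the null is false; all of these values, and the function name `rejection_rate`, are illustrative assumptions.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, alpha, n=30, trials=2000, seed=0):
    """Fraction of simulated one-sided z-tests (H_0: mean = 0, sigma = 1) that reject H_0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5   # z = x_bar / (sigma / sqrt(n)) with sigma = 1
        if z > z_crit:
            rejections += 1
    return rejections / trials

for alpha in (0.10, 0.05, 0.01):
    type_1 = rejection_rate(true_mean=0.0, alpha=alpha)       # H_0 true, but rejected
    type_2 = 1 - rejection_rate(true_mean=0.3, alpha=alpha)   # H_0 false, but not rejected
    print(f"alpha={alpha}: Type I rate ~ {type_1:.3f}, Type II rate ~ {type_2:.3f}")
```

As alpha shrinks, the simulated Type I rate falls with it while the Type II rate climbs, as long as the sample size stays fixed.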

## Statistical Power

The power of a statistical test is defined as the probability of rejecting the null hypothesis, given that it is indeed false. As with any probability, the power of a statistical test therefore ranges from 0 to 1, with 1 being a perfect test that guarantees rejecting the null hypothesis when it is indeed false.
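
For a simple case, power can be computed in closed form. The sketch below assumes a one-sided z-test with known standard deviation `sigma`; the function name `z_test_power` and the example numbers are illustrative assumptions.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H_0
    # Under H_a the test statistic is shifted by effect * sqrt(n) / sigma
    return 1 - std_normal.cdf(z_crit - effect * n ** 0.5 / sigma)

# Power grows with sample size for a fixed (made-up) effect of 0.3 and sigma of 1
for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.3, sigma=1, n=n), 3))
```

With no true effect (`effect=0`) the function returns alpha itself, since rejecting is then exactly a Type I error.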

#### code to visualize power as sample size increases

    import numpy as np
    import scipy.stats as st
    import matplotlib.pyplot as plt

    # How does the power increase as we increase sample size?
    powers = []
    cutoff = .99  # Threshold on the normal CDF of |z|; equivalent to a two-sided alpha of .02
    unfair_coin_prob = .75
    # Iterate through various sample sizes
    for n in range(1, 50):
        # Do multiple runs for that number of samples to compare
        p_val = []
        for i in range(200):
            n_heads = np.random.binomial(n, unfair_coin_prob)
            mu = n / 2  # Expected number of heads for a fair coin
            sigma = np.sqrt(n*.5*(1-.5))  # Std. dev. of the number of heads for a fair coin
            z = (n_heads - mu) / sigma  # sigma already describes the count, so no extra sqrt(n)
            p_val.append(st.norm.cdf(np.abs(z)))
        # Power estimate: share of runs in which the fair-coin null is rejected
        cur_power = sum([1 if p >= cutoff else 0 for p in p_val])/200
        powers.append(cur_power)
    plt.plot(list(range(1, 50)), powers)
    plt.title('Power of Statistical Tests of a .75 Unfair Coin by Number of Trials using .99 threshold')
    plt.ylabel('Power')
    plt.xlabel('Number of Coin Flips')

## A/B Testing to assist in Experiment Design

### Step 1: State the Null Hypothesis, $H_0$
### Step 2: State the Alternative Hypothesis, $H_1$
### Step 3: Define Alpha and Beta
- start at .01 or .05 for both and adjust as needed
### Step 4: Calculate N (sample size)
    import scipy.stats as st
    def compute_n(alpha, beta, mu_0, mu_1, var):
        z_alpha = st.norm.ppf(alpha)
        z_beta = st.norm.ppf(beta)
        num = ((z_alpha+z_beta)**2)*var
        den = (mu_1 - mu_0)**2
        return num/den

    alpha = .01 #Part of A/B test design
    beta = .01 #Part of A/B test design
    mu_0 = .76 #Part of A/B test design
    mu_1 = .8 #Part of A/B test design
    var = .1 #sample variance

    compute_n(alpha, beta, mu_0, mu_1, var)
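
As a sanity check on the sample-size formula, the sketch below recomputes it with the standard library's `NormalDist` and then verifies that the resulting sample size achieves roughly the intended power of $1 - \beta$ for a one-sided z-test. The names `required_n` and `achieved_power` are hypothetical, and the check assumes the same normal approximation as the formula itself.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(alpha, beta, mu_0, mu_1, var):
    """Same formula as compute_n above, using the standard library's NormalDist."""
    z = NormalDist()
    z_alpha = z.inv_cdf(alpha)
    z_beta = z.inv_cdf(beta)
    return ((z_alpha + z_beta) ** 2) * var / (mu_1 - mu_0) ** 2

def achieved_power(n, alpha, mu_0, mu_1, var):
    """Power of a one-sided z-test with n observations (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)
    return 1 - z.cdf(z_crit - (mu_1 - mu_0) * sqrt(n / var))

# Same design values as in the example above
n = ceil(required_n(alpha=.01, beta=.01, mu_0=.76, mu_1=.8, var=.1))
power = achieved_power(n, alpha=.01, mu_0=.76, mu_1=.8, var=.1)
print("n =", n, "power =", power)
```

Rounding the sample size up rather than down keeps the achieved power at or slightly above the target.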

## ANOVA Testing

ANOVA (Analysis of Variance) is a method for generalizing our previous discussion of statistical tests to multiple groups. As we will see, ANOVA partitions our total sum of squared deviations (from the mean) into a sum of squares for each of these groups and a sum of squares for error.
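
To make the partition concrete for the simplest case, the sketch below works through a one-way ANOVA by hand, splitting the total sum of squares into a between-group part and an error (within-group) part, and compares the resulting F-statistic with `scipy.stats.f_oneway`. The three groups are made-up numbers used only for illustration.

```python
import numpy as np
from scipy import stats

# Three hypothetical groups (values invented for illustration)
groups = [
    np.array([23.0, 25.5, 21.8, 24.1, 22.7]),
    np.array([27.2, 26.4, 28.9, 25.8, 27.5]),
    np.array([22.1, 23.3, 21.5, 24.0, 22.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares and its two components
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between groups / mean square error
df_between = len(groups) - 1
df_error = all_values.size - len(groups)
f_manual = (ss_between / df_between) / (ss_error / df_error)

f_scipy, p_value = stats.f_oneway(*groups)
print("SS total:", ss_total, "= between + error:", ss_between + ss_error)
print("F (manual):", f_manual, "F (scipy):", f_scipy, "p-value:", p_value)
```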

#### ANOVA table

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Formula pattern: Control_Column ~ C(factor_col1) + factor_col2 + C(factor_col3) + ... + X
    # C() marks a column as categorical; the formula must be passed as a string
    formula = 'Control_Column ~ C(factor_col1) + factor_col2 + C(factor_col3)'
    lm = ols(formula, df).fit()
    table = sm.stats.anova_lm(lm, typ=2)
    print(table)

**Higher values of the F-statistic (and correspondingly lower p-values) indicate stronger evidence that the factor is influential.**


## Goodhart's Law and Its Importance for Data Scientists

[Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) is an observation made by the British Economist Charles Goodhart in 1975.  Charles Goodhart famously said:

> "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."  --Charles Goodhart

Goodhart's Law matters a great deal to Data Scientists because it is our findings and experiments that often drive the policies and decisions made by a company. Data Science is complex, and often, project managers, CEOs, and other decision makers don't want to know about experimental methodologies or confidence intervals--they just want to know what the best decision they can make is, based on what the data says! It's quite common for decision makers to not realize that setting a target for one metric can negatively affect other metrics in ways that aren't immediately obvious--for instance, pushing employees at a call center to reduce call times could reduce customer satisfaction, because employees hustle to get off the phone to meet the shorter call time "target" handed down from management.

As a data scientist, it is important to communicate your results clearly to stakeholders--but it is also important to be the voice of reason at times. This is why communication with stakeholders is important throughout the process of any data science project. The sooner you know how they plan on using your results, the more you can help them avoid ugly unforeseen problems that come from Goodhart's Law--always remember that massive amounts of data are no substitute for _critical thinking_! At the very least, you should get a bit nervous when you see targets being set for certain metrics. Note that this doesn't necessarily mean "don't set targets"--instead, seek to encourage decision makers to think critically about any unintended consequences these targets could have, and track changes in metrics early and often when new policies or targets are put in place to ensure that unintended consequences are caught early!