# Selecting the right test for the distribution

> Commonly, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis that proposes no relationship between two data sets. The comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to a threshold probability—the significance level. Hypothesis tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance. The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by identifying two conceptual types of errors. The first type occurs when the null hypothesis is falsely rejected. The second type of error occurs when the null hypothesis is falsely assumed to be true (type 1 and type 2 errors). By specifying a threshold probability ('alpha') on, e.g., the admissible risk of making a type 1 error, the statistical decision process can be controlled.
>
> -- [Wikipedia: Hypothesis Testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing)

## Identify the Distribution & Test

### Bernoulli Distribution -> Binomial Test

Bernoulli distributions model boolean outcomes: click or not, churn or not, purchase or not,  yes/no, on/off, true/false, heads/tails, 1/0.

A common test for this distribution is the binomial test.

- Excel: BINOM.DIST. The function takes the parameters (Number of successes, Trials, Probability of Success, Cumulative). The "Cumulative" parameter takes a boolean TRUE or FALSE: TRUE gives the cumulative probability of finding at most this many successes (the p-value of a left-tailed test), and FALSE gives the probability of finding exactly this many successes.

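- Python: scipy.stats.binomtest(successes, trials, probability_of_success, alternative). This function is available from SciPy 1.7 and replaces the older scipy.stats.binom_test, which was deprecated and has been removed from recent SciPy releases.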

- Example: A product manager claims the product take-rate is at least 40% (H0) when customers are presented with the product on the new web page. 50 people viewed the page. 10 people signed up. Test the product manager's claim.
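
The example above can be worked through in Python. This is a minimal sketch, assuming SciPy 1.7 or newer (for `binomtest`) and a significance level of 0.05, which the example itself does not specify. Because H0 says the take-rate is at least 40%, the test is left-tailed.

```python
from scipy import stats

views = 50            # trials: people who viewed the page
signups = 10          # successes: people who signed up
claimed_rate = 0.40   # H0: the take-rate is at least 40%
alpha = 0.05          # assumed significance level (not given in the example)

# Left-tailed test: is the observed take-rate significantly below the claim?
result = stats.binomtest(signups, n=views, p=claimed_rate, alternative="less")

# Same quantity as Excel's BINOM.DIST(10, 50, 0.4, TRUE): P(X <= 10) under H0.
cumulative = stats.binom.cdf(signups, views, claimed_rate)

reject_h0 = result.pvalue < alpha
print(f"p-value = {result.pvalue:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With 10 sign-ups out of 50 views the p-value is well below 0.05, so under these assumptions the product manager's claim is rejected.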

### Normal (a.k.a. Gaussian) Distribution -> Student's T-Test

Normal distributions are bell-shaped like the standard normal, but with a mean of $\mu$ and a standard deviation of $\sigma$. (The standard normal has a mean of 0 and a standard deviation of 1.) Examples: birthweights of newborn babies, distribution of blood pressure.

A common test for this distribution is the Student's t-test.

The t-test is used to test whether the means of two samples are significantly different, i.e. it lets you know whether the difference could have happened by chance or whether the result is likely to hold for the entire population. The test assumes the sampled data are approximately normally distributed.

- Types of t-test:

    - Independent Samples t-test (aka between-samples and unpaired-samples):
        - compares the means of 2 independent groups that are sampled randomly and separately and are subject to different conditions.
        - H0 (null hypothesis): $\mu_1 = \mu_2$
        - Example: Is the average test score of males and females different?

    - Paired Samples t-test (aka dependent samples):
        - compares means from the same group at different times, or from 2 dependent samples. Choose this test if you have two measurements on the same item, person, thing, or unique condition.
        - H0: $\mu_d = 0$, i.e. the mean of the pairwise differences is zero. A pairwise difference is the difference between each paired observation in sample 1 and sample 2, e.g. the difference between the measurement for patient A before treatment and for patient A after treatment.  **compute in excel**
        - Examples: before and after treatment, MPG on 2 different cars driven by the same driver, knee MRI costs at 2 different hospitals, running cars of different manufacturers through the same series of crash tests.

    - One Sample t-test: compares the mean of a single group against a known mean.

- Examples:

    - Do customers who churn pay more each month than those who do not churn (H1)?  (H0: there is no difference in spend between those who churn and those who don't)

    - Test a drug (test group) against a placebo (control group) on 2 different groups.  Is there a difference in the life expectancy of the 2 groups?

- T Statistic: a ratio between the difference between two groups and the difference within the groups. A t score of 5 means the groups are 5 times as different from each other as they are within each other.

    - The larger the absolute value of the t statistic, the more difference there is between the groups and the more likely it is that the results are repeatable, i.e. that the difference is significant.

    - The smaller the score, the more similar the groups are.

- P-Value: For a two-tailed t-test, it is the probability, assuming the null hypothesis is true, of a t-variable falling at least as far from the mean (zero) as the value of t that we observed: $P(|T| \ge |t|)$. It can be thought of as the probability of seeing a difference at least this large by chance alone when there is no real difference.

    - The smaller the p-value, the stronger the evidence that the results are significant. The most common cutoff (significance level) between significance and chance is 0.05: results with a p-value below it are deemed significant.

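- Excel: T.TEST(array1, array2, tails, type), where tails is 1 (one-tailed) or 2 (two-tailed) and type is 1 (paired), 2 (two-sample, equal variance) or 3 (two-sample, unequal variance)

- Python: stat, p = scipy.stats.ttest_ind(data1, data2) for independent samples; scipy.stats.ttest_rel(data1, data2) for paired samples; scipy.stats.ttest_1samp(data, popmean) for one sample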
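
The three kinds of t-test can be sketched in Python as follows. All numbers are made up purely for illustration, and the variable names (`churned`, `retained`, `before`, `after`) are hypothetical. The block also recomputes the independent-samples t statistic and its two-sided p-value by hand, assuming equal variances (the default of `ttest_ind`), to show what the library call returns.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data (not from any real study).
churned = np.array([80.0, 95.0, 90.0, 85.0, 100.0, 92.0])   # monthly spend
retained = np.array([60.0, 65.0, 70.0, 62.0, 68.0, 66.0])

# Independent samples t-test: H0 is mu_1 = mu_2.
t_ind, p_ind = stats.ttest_ind(churned, retained)

# The same t statistic by hand: difference between the groups divided by
# the variation within the groups (pooled variance, equal-variance assumption).
n1, n2 = len(churned), len(retained)
pooled_var = ((n1 - 1) * churned.var(ddof=1)
              + (n2 - 1) * retained.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (churned.mean() - retained.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value: P(|T| >= |t|) for a t distribution with n1 + n2 - 2 df.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# Paired samples t-test: H0 is that the mean pairwise difference is zero.
before = np.array([140.0, 150.0, 138.0, 160.0, 155.0])
after = np.array([132.0, 145.0, 135.0, 150.0, 148.0])
t_rel, p_rel = stats.ttest_rel(before, after)

# One sample t-test: compare a single group against a known mean (65 here).
t_one, p_one = stats.ttest_1samp(retained, popmean=65.0)

print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.2f}, p = {p_rel:.4f}")
print(f"one sample:  t = {t_one:.2f}, p = {p_one:.4f}")
```

With this toy data the independent and paired tests give p-values below 0.05, while the one-sample test does not, because the mean of `retained` is very close to 65.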

## Skewness

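1. Symmetric: A set of data values in which the mean and the median are approximately equal. The left and right tails of the distribution have similar length.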

2. Left-skewed: A set of data values in which the mean is generally less than the median. The left tail of the distribution is longer than the right tail of the distribution.

3. Right-skewed: A set of data values in which the mean is generally greater than the median. The right tail of the distribution is longer than the left tail of the distribution.
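
The relationship between mean, median and skew can be checked numerically. This is a small sketch on made-up arrays. It uses `scipy.stats.skew`, which returns a positive value for a long right tail, a negative value for a long left tail, and zero for perfectly symmetric data.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data.
datasets = {
    "symmetric": np.array([1, 2, 3, 4, 5, 6, 7]),
    "right-skewed": np.array([1, 2, 2, 3, 3, 3, 4, 20]),    # one large outlier
    "left-skewed": -np.array([1, 2, 2, 3, 3, 3, 4, 20]),    # mirror image
}

summary = {}
for name, data in datasets.items():
    summary[name] = {
        "mean": data.mean(),
        "median": np.median(data),
        "skew": stats.skew(data),
    }
    s = summary[name]
    print(f"{name:>12}: mean={s['mean']:.2f}  median={s['median']:.2f}  skew={s['skew']:.2f}")
```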

## Central Limit Theorem

Regardless of the shape of the distribution of the individual values in the population, as the sample size gets larger, the sampling distribution of the mean can be approximated by a normal distribution.
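
A quick simulation illustrates the theorem. This is a sketch under assumed settings: the population is an exponential distribution with mean 1 and standard deviation 1 (strongly right-skewed), and the sample sizes (2 and 100), the number of repetitions and the seed are arbitrary choices. As the sample size grows, the skew of the sample means shrinks towards 0 and their spread approaches $\sigma / \sqrt{n}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_repeats = 10_000   # number of samples drawn for each sample size


def sample_means(sample_size):
    """Draw n_repeats samples from Exponential(1) and return each sample's mean."""
    draws = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
    return draws.mean(axis=1)


means_small = sample_means(2)     # tiny samples: still clearly skewed
means_large = sample_means(100)   # larger samples: close to a normal shape

skew_small = stats.skew(means_small)
skew_large = stats.skew(means_large)

print(f"n=2:   skew of sample means = {skew_small:.2f}")
print(f"n=100: skew of sample means = {skew_large:.2f}")
print(f"n=100: std of sample means  = {means_large.std(ddof=1):.3f} "
      f"(sigma / sqrt(n) = {1 / np.sqrt(100):.3f})")
```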

## Further Reading

Here are some further resources if you are curious about hypothesis testing and inferential statistics:

- [Statistics How To: Statistics for the rest of us](https://www.statisticshowto.datasciencecentral.com/)
- [Percentile vs quantile vs quartile](https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-v-quartile)
- [Twitter thread on interpreting p-values](https://twitter.com/methodsmanmd/status/997482408973922305)
- [Youtube: Z vs t test](https://www.youtube.com/watch?v=5ABpqVSx33I)
- [Youtube: What is a p-value](https://www.youtube.com/watch?v=HTZ8YKgD0MI)