# Inference and Hypothesis testing


![image-3.png](attachment:image-3.png)



<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Inferential-Statistics" data-toc-modified-id="Inferential-Statistics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Inferential Statistics</a></span><ul class="toc-item"><li><span><a href="#Sampling-distributions" data-toc-modified-id="Sampling-distributions-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Sampling distributions</a></span><ul class="toc-item"><li><span><a href="#Standard-Error" data-toc-modified-id="Standard-Error-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Standard Error</a></span></li><li><span><a href="#Confidence-interval" data-toc-modified-id="Confidence-interval-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Confidence interval</a></span></li></ul></li><li><span><a href="#The-central-limit-Theorem" data-toc-modified-id="The-central-limit-Theorem-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>The central limit Theorem</a></span></li><li><span><a href="#The-Bootstrap" data-toc-modified-id="The-Bootstrap-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>The Bootstrap</a></span></li></ul></li><li><span><a href="#Hypothesis-testing" data-toc-modified-id="Hypothesis-testing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Hypothesis testing</a></span><ul class="toc-item"><li><span><a href="#Is-this-coin-fair?" 
data-toc-modified-id="Is-this-coin-fair?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Is this coin fair?</a></span></li></ul></li><li><span><a href="#Choosing-statistical-tests" data-toc-modified-id="Choosing-statistical-tests-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Choosing statistical tests</a></span><ul class="toc-item"><li><span><a href="#Z-test" data-toc-modified-id="Z-test-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Z-test</a></span></li><li><span><a href="#t-test:-Brexit" data-toc-modified-id="t-test:-Brexit-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>t-test: Brexit</a></span></li><li><span><a href="#Were-older-people-more-in-favour-of-Brexit?" data-toc-modified-id="Were-older-people-more-in-favour-of-Brexit?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Were older people more in favour of Brexit?</a></span><ul class="toc-item"><li><span><a href="#Is-my-data-Normal?" data-toc-modified-id="Is-my-data-Normal?-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Is my data Normal?</a></span></li></ul></li><li><span><a href="#t-test:-another-one" data-toc-modified-id="t-test:-another-one-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>t-test: another one</a></span></li></ul></li><li><span><a href="#ANOVA" data-toc-modified-id="ANOVA-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>ANOVA</a></span></li><li><span><a href="#p-hacking" data-toc-modified-id="p-hacking-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>p-hacking</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#References-&amp;-further-materials" data-toc-modified-id="References-&amp;-further-materials-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>References &amp; further materials</a></span></li></ul></div>

In [None]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')

import pandas as pd
np.random.seed(123)

from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest
from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway

## Inferential Statistics

In real-life situations we only have access to samples of data, not to the entire population. How, then, can we draw conclusions about the underlying population as a whole? How confident can we be in these conclusions? The answer lies in *inferential statistics*.

### Sampling distributions

Imagine the "real" distribution of salaries in Spain is as follows: 

What if you had to *infer* the mean salary based on a sample?

Repeating this over and over would give us the **sampling distribution** (of the mean, in this case). With that, we can have an idea of how good our estimate is.

**Sampling distribution step-by-step**
* **Draw 1000 samples** of size 25 from the population of salaries.
* Record the **average of each sample** in a list
* Plot the **distribution of the averages**
* What is the average of the averages?
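The steps above can be sketched as follows. The population here is an assumption: a synthetic log-normal distribution stands in for the "real" salaries, and names such as `population` and `sample_means` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(123)

# Hypothetical population: 100,000 right-skewed "salaries" (log-normal, in euros).
population = rng.lognormal(mean=10.0, sigma=0.5, size=100_000)

n_samples, sample_size = 1000, 25

# Draw 1000 samples of size 25 and record the mean of each one.
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(n_samples)
])

print(f"Population mean:     {population.mean():,.0f}")
print(f"Average of averages: {sample_means.mean():,.0f}")
print(f"Spread of averages:  {sample_means.std(ddof=1):,.0f}")
```

Plotting `sample_means` (for instance with `sns.histplot(sample_means)`) shows the sampling distribution of the mean; the average of the averages should land close to the population mean.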

#### Standard Error

Theory establishes that:

$$SE = \hat\sigma /\sqrt{n} $$


The standard error (SE) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation.
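As a rough check of this definition, the sketch below compares the formula applied to a single sample with the standard deviation of many sample means. The normal population and its parameters are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population with known spread (values are illustrative).
population = rng.normal(loc=50.0, scale=10.0, size=200_000)
n = 25

# One observed sample: estimate the standard error as sigma_hat / sqrt(n).
sample = rng.choice(population, size=n)
se_estimate = sample.std(ddof=1) / np.sqrt(n)

# Reference: standard deviation of many sample means (the sampling distribution).
means = rng.choice(population, size=(5000, n)).mean(axis=1)
se_empirical = means.std(ddof=1)

print(f"SE from one sample:         {se_estimate:.2f}")
print(f"SD of 5000 sample means:    {se_empirical:.2f}")
print(f"Theoretical sigma/sqrt(n):  {10.0 / np.sqrt(n):.2f}")
```

The single-sample estimate fluctuates from sample to sample, while the spread of the simulated means should sit close to $\sigma/\sqrt{n}$.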

![Screenshot%202022-08-08%20at%2012.06.04.png](attachment:Screenshot%202022-08-08%20at%2012.06.04.png)

#### Confidence interval


It is a range that includes a given fraction of the sampling distribution. Typical values are 90%, 95%, and 99%. It proposes a range of plausible values for an unknown parameter (for example, the mean). 

The interval has an associated confidence level. The confidence level represents the frequency (i.e. the proportion) of confidence intervals that contain the true value of the unknown population parameter across many independent experiments. 


**Remember:** a 95% confidence interval does not mean that the parameter lies in the computed interval with 95% probability. (Bayesian credible intervals can be interpreted that way, though.)

Taken mostly from [Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval).
See also this blog from [Towardsdatascience](https://towardsdatascience.com/a-complete-guide-to-confidence-interval-and-examples-in-python-ff417c5cb593).

See [this section from Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval#Basic_steps) to get the basic steps:





Theory establishes that the 95% CI can be obtained as:

$$\left[ \hat\mu - 1.96 \cdot SE,\ \hat\mu + 1.96 \cdot SE \right]$$


Other typical two-sided confidence levels are obtained by replacing the 1.96 factor with another one:

| CI  | $z^*$ |
|-----|-------|
| 99% | 2.576 |
| 98% | 2.326 |
| 95% | 1.960 |
| 90% | 1.645 |
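The factors in the table can be recovered from the standard normal distribution. The sketch below uses `scipy.stats.norm.ppf`; the helper names `z_critical` and `mean_ci` and the example numbers are illustrative, not part of the original notebook.

```python
from scipy.stats import norm


def z_critical(confidence):
    """Two-sided critical value z* for a given confidence level."""
    alpha = 1.0 - confidence
    return norm.ppf(1.0 - alpha / 2.0)


def mean_ci(mu_hat, se, confidence=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = z_critical(confidence)
    return mu_hat - z * se, mu_hat + z * se


for level in (0.99, 0.98, 0.95, 0.90):
    print(f"{level:.0%}: z* = {z_critical(level):.3f}")

# Illustrative numbers only: estimated mean 100 with standard error 2.
low, high = mean_ci(100.0, 2.0, confidence=0.95)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```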

**Exercise**

Imagine you want to know the success rate of calling a client to sign up for your product. You have called 100 individuals and 10 of them did sign up.

For every individual signing up you make 10 euros. Every call costs you 0.4 euros.



Hint: the confidence interval for a proportion is similar to the one for the mean, but uses this formula for the standard error:

$$SE = \sqrt{\hat{p}(1-\hat{p})/n} $$

where $\hat{p}$ is the empirical fraction of success cases and $n$ is the number of trials
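One possible way to apply the hint to the figures in the exercise is sketched below. The translation into profit per call (revenue times success rate, minus the call cost) is an assumed reading of the exercise, and with only 10 successes the normal approximation is rough.

```python
import math

successes, n = 10, 100          # figures from the exercise
revenue, call_cost = 10.0, 0.4  # euros, from the exercise

p_hat = successes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
z = 1.96                        # 95% two-sided factor

p_low, p_high = p_hat - z * se, p_hat + z * se

# Assumed reading of the exercise: expected profit per call = revenue * p - cost.
profit_low = revenue * p_low - call_cost
profit_high = revenue * p_high - call_cost

print(f"p_hat = {p_hat:.2f}, SE = {se:.3f}")
print(f"95% CI for the success rate: [{p_low:.4f}, {p_high:.4f}]")
print(f"Implied profit per call: [{profit_low:.3f}, {profit_high:.3f}] euros")
```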

It is important to remember that confidence intervals and standard errors
only quantify sampling error; that is, error due to measuring only part of the
population. The sampling distribution does not account for other sources of
error, notably **sampling bias** and **measurement error**.

**NOTE**: Read this blog from [Towardsdatascience](https://towardsdatascience.com/a-complete-guide-to-confidence-interval-and-examples-in-python-ff417c5cb593) to know how to calculate CI for other statistics.

### The central limit Theorem

Code snippet extracted from [Geeks for Geeks](https://www.geeksforgeeks.org/python-central-limit-theorem/#:~:text=The%20definition%3A,from%20which%20we%20are%20sampling.&text=The%20distribution%20of%20the%20sample,as%20the%20sample%20size%20increases.)

**Central Limit Theorem (CLT)**


Let $X_1,...,X_n$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Define
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

The CLT states that, as $n$ goes to infinity, $\bar{X}$ is approximately distributed as
$$\bar{X} \sim \mbox{N}(\mu,\sigma^2/n).$$

Or, put in simpler words: **regardless of the shape of
the population distribution** of $X$, as the sample size $n$ gets larger,
the sampling distribution of $\bar{X}$ becomes increasingly closer to
normal, with mean $\mu$ and variance $\sigma^2 / n$. (As long as $\mu$ and $\sigma$ are finite quantities.)
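A small simulation can illustrate the statement. This sketch assumes an exponential population (strongly skewed, with $\mu = 1$ and $\sigma = 1$) and tracks the mean, variance, and skewness of $\bar{X}$ as $n$ grows; it is not the Geeks for Geeks snippet.

```python
import numpy as np

rng = np.random.default_rng(42)


def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3


# Exponential population: strongly right-skewed, with mu = 1 and sigma = 1.
results = {}
for n in (1, 5, 30, 200):
    # 20,000 independent samples of size n, reduced to their means.
    means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    results[n] = (means.mean(), means.var(), skewness(means))
    print(f"n={n:>3}: mean={means.mean():.3f}  var={means.var():.4f}  "
          f"(sigma^2/n={1 / n:.4f})  skew={skewness(means):.2f}")
```

As $n$ increases, the variance of the sample means should track $\sigma^2/n$ and the skewness should shrink towards 0, the value for a normal distribution.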


### The Bootstrap

We can infer the values of some statistics with tricks such as the ones described above and the CLT. But can we do something more general for *any* statistic?

Bootstrapping consists of recreating a *fake* sampling distribution from just one sample! Let's use this to calculate, in a different way, the above estimate of the mean and its CI.

1. Take values *with replacement* from your original sample until you get a new *bootstrapped* sample with the same size as the original.

2. Calculate the statistic you want to compute with this new sample

3. Repeat the process enough times (10000) so that you create a distribution of your statistic.

4. Get the estimate and its CI 
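The four steps can be sketched for the mean as follows. The observed sample is hypothetical (25 synthetic log-normal values standing in for the salary sample), and the percentile method is only one of several ways to turn the bootstrap distribution into a CI.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical observed sample of 25 salaries (stand-in for the notebook's data).
sample = rng.lognormal(mean=10.0, sigma=0.5, size=25)

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Step 1: resample with replacement, same size as the original sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    # Step 2: compute the statistic on the bootstrapped sample.
    boot_means[i] = resample.mean()

# Step 4: point estimate and 95% percentile confidence interval.
estimate = sample.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Estimate of the mean: {estimate:,.0f}")
print(f"95% bootstrap CI:     [{ci_low:,.0f}, {ci_high:,.0f}]")
```

Any other statistic can be substituted in step 2 without changing the rest of the procedure.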

**Exercise**
Use the bootstrap method to estimate the Median of the distribution above and its 90% confidence interval.

In [None]:
# your code here

**Note** Some [libraries in Python](https://github.com/cgevans/scikits-bootstrap) implement more sophisticated versions of Bootstrapping, but the idea is as simple and powerful as it looks!

## Hypothesis testing

Adapted from [Think Stats](http://greenteapress.com/thinkstats2/thinkstats2.pdf):


The goal of classical hypothesis testing is to answer the question, “Given a
sample and an apparent effect, what is the probability of seeing such an effect
by chance?” Here’s how we answer that question:


* The first step is to quantify the size of the apparent effect by choosing a **test statistic**. 
* The second step is to define a **null hypothesis**, which is a model of the system based on the assumption that the apparent effect is not real (i.e that it can be due to chance).
* The third step is to compute a **p-value**, which is the probability of seeing the apparent effect if the null hypothesis is true.

* The last step is to interpret the result. If the p-value is low, the effect is said to be **statistically significant**, which means that it is unlikely to have occurred by chance. In that case we infer that the effect is more likely to appear in the larger population.

The logic of this process is similar to a proof by contradiction. To prove a mathematical statement, $A$, you assume temporarily that $A$ is false. If that assumption leads to a contradiction, you conclude that $A$ must actually be
true. Similarly, to test a hypothesis like, “This effect is real,” we assume, temporarily, that it is not. That’s the null hypothesis.
***

### Is this coin fair?

We toss a coin 250 times and see 140 heads and 110 tails...  
![image.png](attachment:image.png)


* **Null hypothesis**: The coin is fair (i.e. $P(\text{heads}) = 0.5$)
* **Alternative hypothesis**: The coin is biased towards heads (i.e. $P(\text{heads}) > 0.5$)

Flipping a coin 250 times in a row, 10000 times:

In [None]:
from collections import Counter

# Count the simulated experiments that show at least 140 heads.
at_least_140_heads = 0
num_experiments = 10000

for _ in range(num_experiments):
    # One experiment: 250 tosses of a fair coin.
    experiment = np.random.choice(['heads', 'tails'], 250)
    num_heads = Counter(experiment)['heads']
    at_least_140_heads += 1 if (num_heads >= 140) else 0

print(f'We have observed at least 140 heads in {at_least_140_heads} '
      f'of {num_experiments} experiments')
print('The probability (p-value) of finding such an extreme result with a fair '
      f'coin is {at_least_140_heads / num_experiments}')

Typical significance levels are 0.01, 0.05, and 0.1: the result is called significant when the p-value falls below the chosen level.

Beware that you can still be wrong in your decision!! In fact, the errors you can make in hypothesis testing have names:

|                         | In fact H0 is true | In fact H0 is false |
|-------------------------|--------------------|---------------------|
| Test does not reject H0 | Great!             | Type II error       |
| Test rejects H0         | Type I error       | Great!              |


Relatedly, you will hear the term `statistical power`, which refers to $$\text{power}=\Pr{\big (}{\text{reject }}H_{0}\mid H_{1}{\text{ is true}}{\big )}$$

(It indicates the probability of avoiding a type II error). See also: https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/
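To make the definition concrete, the sketch below computes the power of the one-sided coin test for 250 tosses. The alternative `p_alt = 0.56` is a hypothetical "true" bias picked only for illustration; power always depends on which alternative is assumed.

```python
from scipy.stats import binom

n, alpha = 250, 0.05
p_null, p_alt = 0.5, 0.56   # p_alt is a hypothetical "true" bias

# Smallest head count k whose upper-tail probability under H0 is <= alpha.
k = next(c for c in range(n + 1) if binom.sf(c - 1, n, p_null) <= alpha)

type_i = binom.sf(k - 1, n, p_null)   # P(X >= k | H0): actual Type I error rate
power = binom.sf(k - 1, n, p_alt)     # P(X >= k | H1): probability of rejecting H0

print(f"Reject H0 when heads >= {k}")
print(f"Type I error rate: {type_i:.3f}")
print(f"Power against p = {p_alt}: {power:.3f}")
print(f"Type II error rate: {1 - power:.3f}")
```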

In this case, we did not need to simulate because we know that the number of `heads` (success cases) when tossing a coin follows a `binomial distribution` 

Scipy version: `binomial test`
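A minimal sketch of the exact calculation for 140 heads in 250 tosses, using the survival function of the binomial distribution. Recent SciPy versions (1.7 and later) also provide `scipy.stats.binomtest(140, 250, 0.5, alternative='greater')`, which should report the same one-sided p-value.

```python
from scipy.stats import binom

n_tosses, n_heads = 250, 140

# One-sided p-value under H0 (fair coin): P(X >= 140) = P(X > 139).
p_value = binom.sf(n_heads - 1, n_tosses, 0.5)
print(f"Exact one-sided p-value: {p_value:.4f}")
```

This value should be close to the fraction obtained from the simulation, which only approximates it.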

* **Null hypothesis**: The difference in means is due to chance
* **Alternative hypothesis**: It is not due to chance


### Hypothesis testing: steps

Hypothesis testing is a way to test the results of an experiment and see if you have meaningful results.

* **Null hypothesis:** Denoted with H0, a null hypothesis is an **assumption that the population average is identical to a specific value**. The typical notation is μ = μ0, where μ refers to the population mean and μ0 refers to the hypothesized value.
<br><br>
* **Alternative hypothesis:** An alternative hypothesis is the opposite of the null hypothesis. We compare this hypothesis with the null hypothesis to decide whether or not we reject the null hypothesis. We denote the alternative hypothesis with H1 or Ha.
<br><br>
* **Significance level:** The threshold, denoted α, below which the p-value leads us to reject the null hypothesis. It is the probability of a Type I error that we are willing to accept.
<br><br>
* **Test statistic:** Once we have determined the type of hypothesis test and checked that its assumptions are met, we compute from our data a single number (for example a z-score or a t-score) that we use to decide whether or not to reject the null hypothesis.
<br><br>
* **p-value:** The probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained. We reject the null hypothesis when it is below the significance level.


## Choosing statistical tests

- **z-test**: one sample, population standard deviation known, N greater than 30.
- **t-test**: population standard deviation unknown, N less than 30, or two groups to compare.
- **ANOVA**: three or more groups; tests whether all group means are equal or at least one differs.

![](http://www.ttable.org/uploads/2/1/7/9/21795380/441541708.gif)

- The Z distribution is a special case of the normal distribution with a mean of 0 and standard deviation of 1. 

- The t-distribution is sensitive to sample size: used for small or moderate samples when the population standard deviation is unknown.

### Z-test

Boys of a certain age are known to have a mean weight of 85 pounds. A complaint is made that the boys living in a municipal children's home are underfed. As one bit of evidence, 35 boys (of the same age) are weighed and found to have a mean weight of 80.94 pounds. It is known that the population standard deviation is 11.6 pounds. Based on the available data, what should be concluded concerning the complaint? 
 
 
How to reason about the problem:

It is assumed that the population mean weight is 85, but we do not have the complete data from the population; otherwise we would have calculated the actual mean directly. We only have sample data from 35 subjects, so we will use this sample to test our assumption with a statistical test.

**Step 1:** Define the null hypothesis - This is our assumption about the population. It is defined by H0 and in this case H0: μ = 85;

**Step 2:** Define the alternative hypothesis - This means, what if our assumption is not true. It is defined by Ha and in this case Ha: μ < 85. 

**Step 3:** Determine if it is a one-tailed or a two-tailed test. A test is two-tailed when the mean under the alternative hypothesis can be either greater or smaller than the mean of the population. In this case we are checking if the mean weight of the boys in the home is smaller than the mean of the population of boys, so we will consider it a one-tailed test.

**Step 4:** Decide on a test statistic based on the information available. Assuming the data are normally distributed, the number of observations is more than 30, and the population variance is known (since the population standard deviation is provided), we will use a z-test. If the population variance were not known or the sample size were less than 30, we would use a t-test. The t-test is based on Student's t-distribution, which is very similar to a standard normal distribution except that it is flatter, with heavier tails; it approaches the normal distribution as the sample size grows.

<img src=https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.03/7.03-t_distribution.png width="500">


**Step 5:** Level of significance: This defines the rejection region (critical region); it is the probability of making the wrong decision when the null hypothesis is true. Usually it is 0.05. It is denoted by the Greek letter α (alpha). In the medical field this often goes down to 0.01.
 
**Step 6:** Calculate the test statistic based on the given information.

**Step 7:** Check the table to determine the critical value.
<br> For z-test you have fixed values according to Confidence Level.
<br> For t-test you have to calculate according to the degrees of freedom (df), which is the *sample_size - 1*.

**Step 8:** Make conclusions:
* If the test statistic falls in the critical region, then we reject the Null Hypothesis
* If the test statistic falls outside the critical region, then we fail to reject the Null Hypothesis.
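Steps 6 to 8 can be worked through for the weights example as sketched below, using only the summary figures given in the problem statement. Since only summary statistics are available (not the 35 individual weights), the z statistic is computed by hand rather than with `ztest`; the 0.05 level is the usual default assumed in Step 5.

```python
import math
from scipy.stats import norm

mu_0, sigma = 85.0, 11.6      # hypothesized mean and known population sd
x_bar, n = 80.94, 35          # sample mean and sample size
alpha = 0.05

# Step 6: test statistic z = (x_bar - mu_0) / (sigma / sqrt(n)).
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Step 7: critical value for a left-tailed test, plus the matching p-value.
z_crit = norm.ppf(alpha)
p_value = norm.cdf(z)

# Step 8: reject H0 when z falls in the critical region (z < z_crit).
reject_h0 = z < z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Under these assumptions the statistic falls in the critical region, so the sample supports the complaint at the 5% level.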

### t-test: Brexit

If we assume a number of things about our underlying data, we can perform a t-test. The most relevant assumptions are:


* Our observations must be independent of each other. 

* Normality of data: we assume that the sample is drawn from a normally distributed population.



### Were older people more in favour of Brexit?

Every year, the Hansard Society sponsors a survey on political engagement in the UK. They put topical questions in each survey. For the 2016 / 17 survey, they asked about how people voted in the Brexit referendum.


In [None]:
remain_leave = pd.read_csv('../datasets/remain_leave.csv')
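The column layout of `remain_leave.csv` is not shown here, so the sketch below uses synthetic ages instead of the survey data. It illustrates how a two-sample t-test could answer the question; the group means, spreads, and sizes are made up, and Welch's variant is an assumed choice.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2016)

# Synthetic stand-in data: ages are made up, not taken from the survey.
ages_leave = rng.normal(loc=52.0, scale=15.0, size=500)
ages_remain = rng.normal(loc=47.0, scale=15.0, size=500)

# Welch's t-test (no equal-variance assumption); SciPy returns a two-sided p-value.
t_stat, p_two_sided = ttest_ind(ages_leave, ages_remain, equal_var=False)

# One-sided p-value for H1: mean age of Leave voters > mean age of Remain voters.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.2f}, one-sided p-value = {p_one_sided:.3g}")
```

With the real data, the two arrays would be the age columns of the Leave and Remain voters.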

#### Is my data Normal?

One of the great things about hypothesis testing is that once you understand the general procedure, you can simply look for the right test to perform depending on the effect you want to test --- of course, it is much better to really understand what you are doing... 

**Example**

You give your results from the t-test on Brexit ages to your boss, but she doesn't buy the hypothesis that the underlying data is normally distributed... 

Let's convince her with another test!

You find `scipy.stats.normaltest` [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html). It says:

"Test whether a sample differs from a normal distribution"

* **null hypothesis** that a sample comes from a normal distribution

Then, if the **p-value** returned by the function is smaller than 0.05 (the significance level that our boss wants), we can reject the null hypothesis: the data do not come from a normal distribution. Otherwise we fail to reject it, which is consistent with normality but does not prove it.
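A short sketch of how `scipy.stats.normaltest` behaves, using two synthetic samples (one normal, one clearly skewed) rather than the Brexit ages. The sample sizes and parameters are assumptions for illustration.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(1)
alpha = 0.05

# Synthetic samples for illustration: one normal, one clearly skewed.
samples = {
    "normal": rng.normal(loc=50.0, scale=15.0, size=500),
    "exponential": rng.exponential(scale=15.0, size=500),
}

p_values = {}
for name, data in samples.items():
    statistic, p_value = normaltest(data)
    p_values[name] = p_value
    verdict = "reject normality" if p_value < alpha else "fail to reject normality"
    print(f"{name:>11}: p = {p_value:.3g} -> {verdict}")
```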

### t-test: another one

A psychologist was interested in exploring whether or not male and female college students have different driving behaviors. There were a number of ways that she could quantify driving behaviors. She opted to focus on the fastest speed ever driven by an individual. Therefore, the particular statistical question she framed was as follows:

* Is the mean fastest speed driven by male college students different than the mean fastest speed driven by female college students?
* She conducted a survey of a random n = 34 male college students and a random m = 29 female college students. Here is a descriptive summary of the results of her survey:


- We reject the null hypothesis. <br>
- We can say with 95% confidence that we have enough evidence to reject the null hypothesis. <br>
- The average fastest speeds of males and females are not the same. <br>
- Comparing the means and checking the positive t-statistic, we can say that males drive faster. <br>
- Remember that here we are talking about the statistical significance of the value. Here are some tests on p-value behavior:
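One way to probe p-value behavior is to keep the difference in means fixed and vary the sample sizes. The summary statistics below are hypothetical (they are not the psychologist's survey results); only the group sizes 34 and 29 come from the text. The sketch uses `scipy.stats.ttest_ind_from_stats` with Welch's correction.

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics (mph); NOT the psychologist's survey results.
mean_m, sd_m = 105.0, 20.0
mean_f, sd_f = 91.0, 12.0

p_values = {}
for n_m, n_f in [(5, 5), (15, 15), (34, 29)]:
    # Same means and standard deviations each time; only the group sizes change.
    t_stat, p_value = ttest_ind_from_stats(
        mean_m, sd_m, n_m, mean_f, sd_f, n_f, equal_var=False
    )
    p_values[(n_m, n_f)] = p_value
    print(f"n={n_m:>2}, m={n_f:>2}: t = {t_stat:.2f}, p = {p_value:.4f}")
```

The same observed difference becomes more significant as the samples grow, which is why a small p-value says nothing by itself about the size of the effect.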

## ANOVA

(Analysis of Variance)

What if we want to compare 3 or more population means?

* **Null hypothesis**: $\mu_1 = \mu_2 = \dots=\mu_n$
* **Alternative hypothesis**: at least one mean is different


Imagine you want to choose a school for your kids. You manage to get the exam scores of some kids in 4 different schools.
This is the data you have.

Does it make a difference to choose one over the other?
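A minimal sketch of a one-way ANOVA with `scipy.stats.f_oneway`. The exam scores are synthetic: four hypothetical schools with 30 pupils each, where school "D" is given a higher mean on purpose.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)

# Synthetic exam scores for 4 hypothetical schools (30 pupils each).
school_means = {"A": 70.0, "B": 72.0, "C": 71.0, "D": 80.0}
scores = {
    name: rng.normal(loc=mean, scale=8.0, size=30)
    for name, mean in school_means.items()
}

# H0: all four schools share the same mean score.
f_stat, p_value = f_oneway(*scores.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value only says that at least one school differs; it does not say which one.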

In [None]:
## Further reading: check what groups are different from each other

## p-hacking

https://www.youtube.com/watch?v=HDCOUXE3HMM

## Summary 




## References & further materials


* [Think Stats](https://greenteapress.com/wp/think-stats-2e/)
* [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/)
* [Online Statistics Book](http://onlinestatbook.com/2/sampling_distributions/sampling_distributions.pdf)