<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Statistical Inference

---

# Icebreaker

To discuss with your neighbour:

What is the weirdest scientific study you've heard of?

*OR*

What do you think is humanity's greatest scientific discovery (so far)?

# Housekeeping

- Unit 2 assignments ideally handed in by end of this week

- project lightning talks are **next Tuesday, 5th June**: prepare 3-5 minutes each to present 2-3 ideas

- if you're struggling for ideas, talk to us!

## Learning Objectives / Agenda

#### How much does my data actually tell me about the world?

- Define a confidence interval and a p-value
- Understand the theory of hypothesis testing
- Know how to perform hypothesis tests and how to calculate confidence intervals and p-values using Python
- Articulate the main considerations of study design and the problem with p-values

#### Why do things happen and how do we know?

- Define correlation and calculate it using Python
- Create appropriate plots to visualise correlations with Python
- Describe the difference between correlation and causation
- Articulate some ways to test for causation

# Part 1

## How much does my data actually tell me about the world?

Imagine you want to know the average height of people.

You sample 100 people at random and measure their heights. The mean and standard deviation of these heights are 1.5m and 0.1m.

**How confident are you that people are 1.5m tall on average?**

Confidence intervals give you a tool to measure that

They're "intervals" because your confidence is tied to a **range** of values

You'd report something like *"based on my sample, people are on average between 1.3m and 1.7m tall, with 95% confidence."*

Where do those numbers come from?

In [1]:
from scipy import stats

stats.norm.interval(0.95, loc=1.5, scale=0.1)

(1.3040036015459946, 1.6959963984540054)
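One caveat about the call above: `scale=0.1` is the standard deviation of *individual* heights, so the interval it returns is the range expected to hold about 95% of individual people. A confidence interval for the *mean* is normally built from the standard error, i.e. the standard deviation divided by the square root of the sample size. The sketch below shows that version using only the standard library; the helper name `mean_confidence_interval` is made up for illustration, and it assumes the sample is large enough for the normal approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Standard error: the spread of the sample mean, not of individuals
    standard_error = sample_sd / sqrt(n)
    # Number of standard errors that covers the central `confidence` mass
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


low, high = mean_confidence_interval(sample_mean=1.5, sample_sd=0.1, n=100)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```

With 100 people the standard error is 0.01m, so the interval for the mean is roughly 1.48m to 1.52m. Passing `scale=0.01` to `stats.norm.interval` should give the same result.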

What does this interval mean exactly?

Specifically, this says:

If you repeated the study 100 times, each time drawing a new sample of people and computing a 95% confidence interval from it,

then you would expect about 95 of those intervals to contain the **true** population mean.

http://rpsychologist.com/d3/CI/
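The same idea can be checked with a small simulation. This is a sketch under stated assumptions: we pretend to know the true population (mean 1.5m, standard deviation 0.1m, normally distributed), repeat the "study" many times, and count how often the interval built from each sample contains the true mean. All constants are chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(0)

TRUE_MEAN, TRUE_SD = 1.5, 0.1    # assumed "true" population, known only to the simulation
SAMPLE_SIZE, N_STUDIES = 100, 1000
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

hits = 0
for _ in range(N_STUDIES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    standard_error = stdev(sample) / sqrt(SAMPLE_SIZE)
    low = mean(sample) - z * standard_error
    high = mean(sample) + z * standard_error
    if low <= TRUE_MEAN <= high:
        hits += 1

coverage = hits / N_STUDIES
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The printed share should land close to 95%, but not exactly on it, because the simulation itself is subject to chance.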

What does changing the 95% to 90% or 99% do?

In [2]:
from scipy import stats

print("90% CI:", stats.norm.interval(0.9, loc=1.5, scale=0.1))
print("99% CI:", stats.norm.interval(0.99, loc=1.5, scale=0.1))

90% CI: (1.3355146373048528, 1.6644853626951472)
99% CI: (1.24241706964511, 1.75758293035489)


In [3]:
print("10% CI:", stats.norm.interval(0.10, loc=1.5, scale=0.1))
print("99.999% CI:", stats.norm.interval(0.99999, loc=1.5, scale=0.1))

10% CI: (1.4874338653144925, 1.5125661346855075)
99.999% CI: (1.0582826586529994, 1.9417173413467606)


## How sure can I be of my findings?

In doing science, we always want to err on the side of being sceptical.

If you measure a difference between things, or an effect of X on Y, your starting assumption should be that it's due to chance.

Then you have tools to try and suggest otherwise.

### Example

I want to find out if there's a significant height difference between horse jockeys and players from the NBA.

In hypothesis testing we frame this with **two** hypotheses.

The **null** hypothesis $H_0$, which assumes there is **no** difference (or no effect of X on Y)

The **alternate** hypothesis $H_1$, which assumes there **is** a difference (or an effect)

What are my hypotheses in this case?

$H_0$: there is no difference in average height between jockeys and NBA players.

$H_1$: there **is** such a difference.

Then we need to decide on a **significance level**.

i.e. "how unlikely must my result be under pure chance before I'm willing to trust it?"

Typically 5% (i.e. 0.05)

For comparing the means of two groups we can use a t-test.

https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind
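For reference, with its default settings `scipy.stats.ttest_ind` assumes the two groups have equal variance and computes the pooled (Student's) t statistic. The notation here is ours: $\bar{x}_i$, $s_i$ and $n_i$ are the sample mean, sample standard deviation and size of group $i$.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The p-value is then the two-sided probability of a $|t|$ at least this large under a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.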

In [9]:
import pandas as pd
import numpy as np

df = pd.DataFrame()

np.random.seed(42)

df["jockeys"] = np.random.normal(150, 10, 100)
df["jockeys_2"] = np.random.normal(150, 10, 100)
df["basketball_players"] = np.random.normal(190, 10, 100)

df.describe()

| | jockeys | jockeys_2 | basketball_players |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 148.961535 | 150.223046 | 190.648963 |
| std | 9.081684 | 9.53669 | 10.842829 |
| min | 123.802549 | 130.812288 | 157.587327 |
| 25% | 143.990943 | 141.943395 | 183.445565 |
| 50% | 148.730437 | 150.841072 | 190.976957 |
| 75% | 154.059521 | 155.381704 | 197.044374 |
| max | 168.522782 | 177.201692 | 228.527315 |


In [12]:
from scipy import stats

t_statistic, p_value = stats.ttest_ind(df["jockeys"],df["basketball_players"])

print(p_value)

2.452490316264101e-74


That's a very small number. If there were truly no height difference between the two groups, a gap this large would almost never show up by chance.

Let's try with another random set of jockeys.

In [13]:
t_statistic_2, p_value_2 = stats.ttest_ind(df["jockeys"], df["jockeys_2"])

print(p_value_2)

0.3392652865361483


In the first case, the p-value was tiny.

That means that **assuming the null hypothesis**, i.e. "there is no difference between the groups" (which is always our starting point)...

it would be **extremely unlikely** to get two samples with such different means **purely by chance**.

We therefore call the difference **statistically significant**, and we **reject the null hypothesis**.

In the second case, the p-value was 0.34: if there were truly no difference, we'd still see a gap at least this large about 34% of the time.

That's not enough evidence to conclude a difference, so we **fail to reject the null hypothesis**.
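The "assume the null, then ask how surprising the data is" reasoning can also be simulated directly with a permutation test. This is a different technique from the t-test used above, shown only to make the logic visible: if the null hypothesis holds, the group labels are interchangeable, so we shuffle them many times and count how often chance alone produces a gap as large as the observed one. The function name and the regenerated data are assumptions for illustration, and the p-values will not match the scipy ones exactly.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under H0 the group labels carry no information
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (extreme + 1) / (n_permutations + 1)


rng = random.Random(42)
jockeys = [rng.gauss(150, 10) for _ in range(100)]
jockeys_2 = [rng.gauss(150, 10) for _ in range(100)]
basketball_players = [rng.gauss(190, 10) for _ in range(100)]

p_different = permutation_p_value(jockeys, basketball_players)
p_same = permutation_p_value(jockeys, jockeys_2)
print("jockeys vs basketball players:", p_different)
print("jockeys vs jockeys_2:", p_same)
```

The first p-value can never drop below 1/1001 with 1000 shuffles, which is a limit of the method rather than of the data.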

Important wording!

Case 1: *"we reject the null hypothesis"* and **not** *"we proved a difference"* or *"we proved the alternate hypothesis"*

Case 2: *"we fail to reject the null hypothesis"* and **not** *"we proved there is no difference"*

Remember, we're always cautious about our findings

The word **prove** is banned from a Data Scientist's vocabulary

# Study Design

#### Exercise

Your stakeholder wants to know if it's better to sell cars in the morning auction or the afternoon auction.

- What are your null and alternate hypotheses?
- If you could design the auction in a way to test this, what would you do?
    - how would you design the two auctions?
    - what would you be measuring?

Discuss in groups of 3-4.

#### Elements of good study design

- **control** for variables you're not interested in testing
    - e.g. make sure you don't sell Fiestas in the morning and Porsches in the afternoon

- make sure your samples are **representative**
    - remember the different sampling biases?
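One common way to control for a variable like car model is to let chance decide which session each car goes into, separately within each model. The sketch below does that for a made-up stock list; the function name, the car ids and the stock itself are hypothetical, not part of the exercise.

```python
import random
from collections import defaultdict


def assign_sessions(cars, seed=0):
    """Randomly split cars between sessions, balancing each model across both."""
    rng = random.Random(seed)
    by_model = defaultdict(list)
    for car_id, model in cars:
        by_model[model].append(car_id)

    assignment = {}
    for model, car_ids in by_model.items():
        rng.shuffle(car_ids)  # chance, not the seller, decides who goes where
        half = len(car_ids) // 2
        for car_id in car_ids[:half]:
            assignment[car_id] = "morning"
        for car_id in car_ids[half:]:
            assignment[car_id] = "afternoon"
    return assignment


# Hypothetical stock: (car id, model)
cars = [(i, "Fiesta") for i in range(10)] + [(i, "Porsche") for i in range(10, 14)]
assignment = assign_sessions(cars)
for car_id, model in cars:
    print(car_id, model, assignment[car_id])
```

Because every model is split evenly, a price difference between sessions can no longer be explained by one session simply having the more expensive cars.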

### Example study

#### "Eating ice cream for breakfast may improve mental performance" - study says

*"...found that people who had consumed ice cream for breakfast showed better reaction time and were able to process information better than those who did not have the ice cream"*

What do we think? Is this an example of good study design?

- two separate groups (ice cream vs. no ice cream) = good

- control for other variables? = unclear

- placebo? = NO

The study itself highlights possible other explanations:

- *"also hoping to determine if ice cream is a trigger for **positive emotion**"*

- *"Subjects were tested a second time, during which they were **given cold water instead of ice cream** [...] that particular test did show higher levels of alertness and mental capacity"*

- *"[...] the **sugar high** that may come along after eating ice cream for breakfast"*

### Errors

#### Type I

False positives, i.e. concluding there is an effect/difference when there isn't one

![](assets/images/xkcd_jelly_beans_2.png)

![](assets/images/xkcd_jelly_beans_3.png)

#### Type II

False negatives, i.e. concluding there is no effect/difference when there is one

Remember the "boy who cried wolf":

- first the boy claimed there was a wolf when there wasn't one (Type I)

- then there was no response from the villagers when there actually was a wolf (Type II)

What's worse? Type I or Type II?

It depends...!

In law, a false positive (jailing an innocent person) is worse than a false negative (a guilty person goes free)

In medicine, a false negative (missing a diagnosis) is worse than a false positive (sending healthy people for follow ups)

Always think of the context when thinking about the cost of false positives and false negatives.
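Both error rates can be estimated by simulation. The sketch below runs many imaginary studies with a simple two-sample z-test, first with no real difference (every rejection is a Type I error) and then with a small real difference (every non-rejection is a Type II error). It assumes normally distributed data with a known standard deviation of 1, and the effect size of 0.2 and group size of 50 are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejection_rate(true_difference, n=50, alpha=0.05, n_trials=2000, seed=0):
    """Fraction of simulated studies in which a two-sample z-test rejects H0."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
        mean_b = sum(rng.gauss(true_difference, 1) for _ in range(n)) / n
        # Known standard deviation of 1 in both groups (an assumption of this sketch)
        z = (mean_b - mean_a) / sqrt(2 / n)
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_trials


# H0 is true: every rejection is a false positive
type_1_rate = rejection_rate(true_difference=0.0)
# H0 is false: every failure to reject is a false negative
type_2_rate = 1 - rejection_rate(true_difference=0.2)

print(f"Type I rate:  {type_1_rate:.3f}")
print(f"Type II rate: {type_2_rate:.3f}")
```

The Type I rate should sit near the chosen significance level of 5%, while the Type II rate depends on the size of the effect and on the sample size: small effects in small samples are missed most of the time.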

# The Problem with P-Values

What are some reasons it may not be good to rely on p-values?

- the 5% cutoff is arbitrary

- the more hypotheses you test, the more likely it is that at least one looks significant purely by chance, so a 5% cutoff per test becomes too lenient
    - one solution is the [Bonferroni correction](http://www.statsmakemecry.com/smmctheblog/bonferroni-correction-in-regression-fun-to-say-important-to.html)

- it is unintuitive even to scientists: [http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values](http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/)

- p-hacking
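To put numbers on the multiple-testing point, here is a small sketch. It assumes the tests are independent and that every null hypothesis is true, and it computes the chance of at least one false positive with and without the Bonferroni correction (dividing the significance level by the number of tests). The function names are ours.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


for n_tests in (1, 5, 20, 100):
    uncorrected = family_wise_error_rate(n_tests)
    corrected = family_wise_error_rate(n_tests, bonferroni_alpha(n_tests))
    print(f"{n_tests:>3} tests: {uncorrected:.1%} uncorrected, {corrected:.1%} with Bonferroni")
```

With 20 tests at the usual 5% level, the chance of at least one spurious "finding" is already about 64%; the correction brings it back below 5%.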

### P-hacking

https://projects.fivethirtyeight.com/p-hacking/

### The Salmon Study

*"One mature Atlantic Salmon participated in the fMRI study."*

*"The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence."*

*"The salmon was asked to determine what emotion the individual in the photo must have been experiencing."*

*"The salmon was approximately 18 inches long, weighed 3.8 lbs, and **was not alive** at the time of scanning."*

Result: *"Out of a search volume of 8064 voxels a total of 16 voxels were significant."*

http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf
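Each voxel is effectively its own hypothesis test, so with thousands of voxels some will cross the threshold by chance alone. A rough back-of-the-envelope sketch of how many to expect when nothing is going on and no correction is applied; the thresholds below are illustrative and not taken from the paper.

```python
N_VOXELS = 8064  # search volume quoted above


def expected_false_positives(n_tests, alpha):
    """Expected number of tests that look significant when every H0 is true."""
    return n_tests * alpha


for alpha in (0.05, 0.01, 0.001):  # illustrative thresholds, not taken from the paper
    expected = expected_false_positives(N_VOXELS, alpha)
    print(f"alpha = {alpha}: about {expected:.0f} voxels expected by chance")
```

Even at a strict per-voxel threshold, a handful of "active" voxels in a dead fish is about what chance would produce.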

*"If you torture the data long enough, it will confess."* - Ronald Coase

The takeaway message is: test enough things and you will **always** find false associations, so beware!