# Statistics

> Much of basic statistics is not intuitive (or, at least, not taught in an intuitive fashion), and the opportunity for misunderstanding and error is massive. 
>
> A strong course in applied statistics should cover basic hypothesis testing, regression, statistical power calculation, model selection, and a statistical programming language like R. 
> 
> *Statistics Done Wrong: The Woefully Complete Guide - Alex Reinhart*

## This notebook

This notebook is an introduction to a number of foundational concepts in statistics:

- law of large numbers,
- IID,
- prediction versus inference,
- what is a statistical analysis,
- 80/20 data science,
- Simpson's paradox,
- survivorship bias,
- pseudoreplication,
- data fallacies.


## Law of Large Numbers

[Wikipedia](https://en.wikipedia.org/wiki/Law_of_large_numbers)

The average of the results obtained from a large number of trials will be close to the expected value:
- will tend to become closer as more trials are performed,
- a small number of observations may not coincide with the expected value.
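A minimal sketch of the law of large numbers, assuming a fair coin simulated with Python's `random` module (the function name and seed are illustrative, not from the original notebook). The running mean drifts towards the expected value of 0.5 as the number of flips grows:

```python
import random


def running_mean_of_coin_flips(n_flips, seed=0):
    """Return the running mean after each of n_flips fair coin flips (heads=1, tails=0)."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_flips + 1):
        total += rng.randint(0, 1)
        means.append(total / i)
    return means


means = running_mean_of_coin_flips(100_000)

# Print the running mean at increasing sample sizes.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, means[n - 1])
```

With only 10 flips the mean can sit well away from 0.5; after 100,000 flips it should be very close.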

[The Gambler's Fallacy](https://en.wikipedia.org/wiki/Gambler%27s_fallacy) 

- the mistaken belief that if something happens more frequently than normal during a given period, it will happen less frequently in the future (or vice versa),
- a streak will not immediately be "balanced",
- this is a misunderstanding of independent probabilities.

For independent events, the joint probability of a & b is the product of their probabilities:

$P(a, b) = P(a) \cdot P(b)$
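A small sketch of why a streak does not get "balanced", assuming fair and independent coin flips. It enumerates every outcome of five flips exactly, so no simulation noise is involved:

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of 5 fair coin flips.
outcomes = list(product("HT", repeat=5))

# P(5 heads in a row): one outcome out of 32, i.e. the product (1/2)^5.
p_five_heads = Fraction(sum(o == tuple("HHHHH") for o in outcomes), len(outcomes))

# P(5th flip is heads | first 4 flips were heads).
streaks = [o for o in outcomes if o[:4] == tuple("HHHH")]
p_heads_after_streak = Fraction(sum(o[4] == "H" for o in streaks), len(streaks))

print(p_five_heads)          # 1/32
print(p_heads_after_streak)  # 1/2
```

The streak itself is rare, but once four heads have happened the fifth flip is still 50/50.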

## Independent and Identically Distributed (IID)

Fundamental assumption made by statistical learning
- assuming that the training set is **independently drawn** from a **fixed distribution**
- **independent** = our samples are not correlated/related to each other
- **identically distributed** = every sample is drawn from the same distribution (the 'true' data-generating distribution)

Random sampling is required to make a sample representative (unbiased)
- the probability of selecting one sample does not depend on which other samples were selected

## Parametric versus non-parametric models

Fixed number of parameters = parametric
- linear models, neural nets

Variable number of parameters = non-parametric
- kernel support vector machines, K nearest neighbours
- often the number of parameters grows with the number of samples

### Question to class 

Are decision tree ensembles parametric or non-parametric?

## Prediction versus inference

[Stack exchange](https://stats.stackexchange.com/questions/244017/what-is-the-difference-between-prediction-and-inference)

Prediction
- predict what target will be for future features
- given a new measurement, you want to use an existing data set to build a model that reliably chooses the correct identifier from a set of outcomes.

Given some information on a Titanic passenger, you want to choose from the set {lives,dies} and be correct as often as possible. (See bias-variance tradeoff for prediction in case you wonder how to be correct as often as possible.)

Inference
- to infer how nature is associating the target to the features
- given a set of data you want to infer how the output is generated as a function of the data.

You want to find out what effect Age, Passenger Class and Gender have on surviving the Titanic disaster. You can fit a logistic regression and infer the effect each passenger characteristic has on survival rates.


## Probability

Chapter 3 of [Deep Learning](https://www.deeplearningbook.org/)

Probability
- extension of logic to deal with uncertainty
- nearly all activities require reasoning under uncertainty

Three sources of uncertainty
1. stochastic environment
2. incomplete observability - can't observe all relevant variables 
3. incomplete models that discard information

It can be more practical to use **simple (yet inexact)** rules
- 'most birds can fly' = cheap, simple & broadly applicable
- 'birds fly, except for young birds & kiwis' = expensive, exact & brittle

Heuristics reduce dimensionality --> low dimensional features 
- rule for crossing the street = is it big, is it moving 

Machine learning also reduces dimensionality
- from the full state of a business to a prediction that we can act on
- acting on predictions is how we control a business objective

## Frequentist or Bayesian perspective on probability

**Frequentist** = long-run frequency with which an event occurs over repeated trials
- probability is measured by counting outcomes

**Bayesian** = degree of belief that an event will happen
- prior + data -> posterior
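A minimal sketch contrasting the two views on the same coin-flip data. The data (7 heads in 10 flips) and the uniform Beta(1, 1) prior are illustrative assumptions, and the Beta-binomial update is one standard way to implement "prior + data -> posterior", not the only one:

```python
def frequentist_estimate(successes, trials):
    """Observed relative frequency of the event."""
    return successes / trials


def bayesian_posterior(prior_a, prior_b, successes, trials):
    """Beta(prior_a, prior_b) prior + binomial data -> Beta posterior parameters."""
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


heads, flips = 7, 10  # hypothetical data

print(frequentist_estimate(heads, flips))  # 0.7

post_a, post_b = bayesian_posterior(1, 1, heads, flips)
print(post_a, post_b)                      # 8 4
print(beta_mean(post_a, post_b))           # about 0.667
```

The posterior mean is pulled slightly towards the prior mean of 0.5. With more data, the two estimates get closer.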

## Question to class

Examples of a 
- frequentist probability
- Bayesian probability 
 
## Decisions made in a statistical analysis

Chapter 9 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/)

- what to measure (labels)
- what variables to use (features)
- samples to drop (cleaning)
- how to group
- how to deal with missing data
- how much data do I need

Too much freedom = allows bias to creep in
- commit to the experiment setup before seeing the data, e.g. a p-value threshold

## Mistakes made in a statistical analysis

Chapter 10 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

> Surveys of statistically significant results reported in medical and psychological trials suggest that many p values are wrong and some statistically insignificant results are actually significant when computed correctly. 
>
> Even the prestigious journal Nature isn’t perfect, with roughly **38% of papers making typos and calculation errors in their p values**. 
>
> Other reviews find examples of misclassified data, erroneous duplication of data, inclusion of the wrong dataset entirely, and other mix-ups, all concealed by papers that did not describe their analysis in enough detail for the errors to be easily noticed. 

## What to look for in a statistical analysis

Chapter 12 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

What is the statistical power of the study?

How were features selected / discarded?

Effect-size estimates and confidence intervals accompanying significance tests

Showing whether the results have practical importance 

Whether appropriate statistical tests were used and how they were corrected for multiple comparisons 

## Relationship between quality & sharing

Share your code & data!

> Next Wicherts and his colleagues looked for a correlation between these errors and an unwillingness to share data. There was a clear relationship. 
>
> Authors who refused to share their data were more likely to have committed an error in their paper, and their statistical evidence tended to be weaker. 
> 
> Because most authors refused to share their data, Wicherts could not dig for deeper statistical errors, and many more may be lurking. 

## 80/20 of data science

Knowing when to use the mean versus median
- always report both
- the difference is a measure of skew
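A quick sketch of the mean versus the median on skewed data, using Python's `statistics` module. The salary figures are made up for illustration:

```python
import statistics

# Hypothetical salaries: five typical values and one extreme outlier.
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 1_000_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(mean_salary)                  # pulled up by the outlier
print(median_salary)                # 36500.0
print(mean_salary - median_salary)  # large positive gap = right skew
```

The single outlier drags the mean far above what a typical person earns, while the median barely moves.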

**Knowing when the ranks matter versus when the absolute values matter**
- can allow you to change a regression problem into a classification problem

## How does bias change with more data?

[Statistical Thinking for Data Science | SciPy 2015 | Chris Fonnesbeck](https://www.youtube.com/watch?v=TGGGDpb04Yc)

Bias = systematic error 
- wrong in the same direction all the time

How does sample size change bias?
- imagine we induce bias intentionally by ignoring negative values 50% of the time
- more data does not remove the bias: the estimate converges on the wrong value
- the result is a highly precise wrong answer
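The thought experiment above can be simulated. This is a sketch under stated assumptions: the data is standard normal (true mean 0), and each negative value is dropped with probability 0.5. The function name and sample sizes are illustrative:

```python
import random
import statistics


def biased_sample_mean(n, seed=0):
    """Mean of n standard normal draws where each negative draw is ignored half the time."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        if x < 0 and rng.random() < 0.5:
            continue  # induce bias: ignore this negative value
        kept.append(x)
    return statistics.fmean(kept)


# The true mean is 0, but the estimate settles near 0.27 instead.
estimates = {n: biased_sample_mean(n) for n in (100, 10_000, 200_000)}
for n, estimate in estimates.items():
    print(n, round(estimate, 3))
```

Larger samples make the estimate more stable, but it stabilises around the biased value rather than the true mean of 0.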

## Simpson's Paradox

[Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)

Trend appearing in several different groups that reverses when groups are combined

![](./assets/si.gif)

### Kidney stone treatment

This is a real-life example from a medical study comparing the success rates of two treatments for kidney stones.


The table below shows the success rates and numbers of treatments for treatments involving both small and large kidney stones, where Treatment A includes all open surgical procedures and Treatment B involves only a small puncture. The numbers in parentheses indicate the number of success cases over the total size of the group.

![](./assets/simpson-kidney.png)

The paradoxical conclusion is that treatment A is more effective when used on small stones, and also when used on large stones, yet treatment B is more effective when considering both sizes at the same time. In this example, the "lurking" variable is the severity of the case (represented by the doctors' treatment decision trend of favoring B for less severe cases).
When the less effective treatment (B) is applied more frequently to less severe cases, it can appear to be a more effective treatment.
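The reversal can be checked with a few lines of arithmetic. The counts below are an assumption: they are the figures commonly quoted for this kidney stone study (for example on the linked Wikipedia page) and should be checked against the table image above.

```python
# (successes, patients) for each treatment and stone size.
# Assumed figures: the commonly quoted kidney stone counts.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(successes, patients):
    return successes / patients


def combined_rate(by_size):
    """Success rate after pooling small and large stones together."""
    successes = sum(s for s, _ in by_size.values())
    patients = sum(n for _, n in by_size.values())
    return successes / patients


for treatment, by_size in results.items():
    small = success_rate(*by_size["small"])
    large = success_rate(*by_size["large"])
    combined = combined_rate(by_size)
    print(f"{treatment}: small={small:.1%} large={large:.1%} combined={combined:.1%}")
```

Treatment A has the higher rate within each stone size, yet B has the higher combined rate, because B was used mostly on the easier small-stone cases.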

## Survivorship bias

[Statistical Thinking for Data Science | SciPy 2015 | Chris Fonnesbeck](https://www.youtube.com/watch?v=TGGGDpb04Yc)

Cats that fall from greater heights appear more likely to survive: [NY Times](https://www.nytimes.com/1989/08/22/science/on-landing-like-a-cat-it-is-a-fact.html)

>From June 4 through Nov. 4, 1984, for instance, 132 such victims were admitted to the Animal Medical Center on 62d Street in Manhattan.
Most of the cats landed on concrete. Most survived. Experts believe they were able to do so because of the laws of physics, superior balance and what might be called the flying-squirrel tactic.
Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories. The cat that fell 32 stories on concrete, Sabrina, suffered a mild lung puncture and a chipped tooth. She was released from the hospital after 48 hours.

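Survivorship
- found data
- convenient data
- only see winners, so the odds are distorted because failures are excluded
- in the cat example, only cats brought to the hospital are counted, and cats that died on impact are unlikely to have been brought in
- is the missing data related to the target?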

Self selection bias
- can choose to report data or not

## The Baltimore stockbroker

Chapter 6 of [How Not to Be Wrong: The Power of Mathematical Thinking - Jordan Ellenberg](https://en.wikipedia.org/wiki/How_Not_to_Be_Wrong).

You receive mail from a stockbroker who correctly predicts the movement of stock prices
- this continues for 10 weeks

We can calculate the probability of this happening:

```python
0.5 ** 10  # 0.0009765625
```

If we instead look at it from the stockbroker's perspective
- send 10,240 letters in the first week
- send 5,120 in the second (to those who received a correct prediction in the first week)
- continue for 10 weeks, and we have 10 people who have received perfect predictions

The key is to understand how many chances the stockbroker had
- not only the probability of success

Improbable things happen a lot

```python
10240 * 0.5 ** 10  # 10.0
```
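The week-by-week halving can also be written out as a loop. This is only a sketch of the scheme described above, assuming each week exactly half of the remaining recipients get the prediction that turns out to be correct:

```python
letters = 10_240       # recipients of the first letter
perfect_record = []    # recipients with only correct predictions after each week

remaining = letters
for week in range(1, 11):
    remaining //= 2    # half were told "up", half "down"; only one half was right
    perfect_record.append(remaining)
    print(week, remaining)
```

After week 10 there are 10 recipients left, each of whom has seen ten correct predictions in a row.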

## Pseudoreplication

Chapter 3 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/)

Counting the same sample multiple times
- the problem is dependence between samples (non-independent sampling)

Additional measurements that depend on previous data don't prove your results generalize 
- they only increase certainty about the specific sample studied

Eliminate hidden sources of correlation between measurements
- measure 1,000 patients rather than 100 patients 10 times
- recording hundreds of neurons in just two animals is pseudoreplication
- so is comparing growth rates of different crops in different fields (the field, not the crop, may drive the difference)

Solutions
- average dependent data points 
- analyze each point separately: don't combine, analyze only a subset (e.g. day 5)
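A minimal sketch of the first solution, averaging dependent data points. The patients and readings are hypothetical. Pooling all readings pretends there are 9 independent samples, while averaging per patient leaves the 3 that are actually independent:

```python
import math
import statistics

# Hypothetical repeated measurements: 3 readings for each of 3 patients.
measurements = {
    "patient_1": [120, 122, 121],
    "patient_2": [135, 134, 136],
    "patient_3": [110, 111, 109],
}

# Pseudoreplicated: every reading is treated as an independent sample (n = 9).
pooled = [x for readings in measurements.values() for x in readings]
naive_se = statistics.stdev(pooled) / math.sqrt(len(pooled))

# Averaged: one value per patient, so n matches the independent units (n = 3).
per_patient_means = [statistics.fmean(r) for r in measurements.values()]
honest_se = statistics.stdev(per_patient_means) / math.sqrt(len(per_patient_means))

print(len(pooled), naive_se)
print(len(per_patient_means), honest_se)
```

The pooled standard error comes out smaller, which overstates how certain we are about the population.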

Check for batch effects by doing PCA on different batches of results
- if the batch number (e.g. a `batch_index` feature) turns out to be important, the batches were not drawn from the same distribution

## Data fallacies

[Survivorship bias](https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d)
- Abraham Wald in WW2
- intuition to place armor on planes where the most bullet holes are = wrong!
- place armor where there are no bullet holes - because these planes didn't come back
- *How Not To Be Wrong by Jordan Ellenberg*

[Cobra effect](https://en.wikipedia.org/wiki/Cobra_effect)
- bounty on dead cobras in British colonial India
- people bred cobras for the bounty

<img src="assets/fallacies.jpg" alt="" width="900"/>

## Quiz

What does IID stand for?

What is a parametric model?  What is a non-parametric model?

Three sources of uncertainty?

Difference between Frequentist & Bayesian probability?

How does bias change as we add data?

What is Simpson's Paradox?

What is survivorship bias?

What is pseudoreplication?