# Statistics

This notebook outlines a few background concepts before we dig into the course proper.

> Much of basic statistics is not intuitive (or, at least, not taught in an intuitive fashion), and the opportunity for misunderstanding and error is massive. 
>
> A strong course in applied statistics should cover basic hypothesis testing, regression, statistical power calculation, model selection, and a statistical programming language like R. 
> 
> *Statistics Done Wrong: The Woefully Complete Guide - Alex Reinhart*

## Law of Large Numbers

[Wikipedia](https://en.wikipedia.org/wiki/Law_of_large_numbers)

The average of the results obtained from a large number of trials should be close to the expected value
- will tend to become closer as more trials are performed
- a small number of observations may not coincide with the expected value
- a streak of one value will not immediately be "balanced" by the others
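
A minimal sketch of the law of large numbers, assuming a fair coin simulated with Python's `random` module (the function name and seed are made up for illustration). The running proportion of heads wanders for small samples and settles near the expected value of 0.5 as the number of flips grows:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def running_mean_of_coin_flips(n_flips):
    """Return the running proportion of heads after each flip of a fair coin."""
    heads = 0
    means = []
    for i in range(1, n_flips + 1):
        heads += random.random() < 0.5  # True counts as 1
        means.append(heads / i)
    return means


means = running_mean_of_coin_flips(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} flips: proportion of heads = {means[n - 1]:.4f}")
```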

[The gambler's fallacy](https://en.wikipedia.org/wiki/Gambler%27s_fallacy) 
- the mistaken belief that if something happens more frequently than normal during a given period
- it will happen less frequently in the future (or vice versa)
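
The fallacy can be checked with a small simulation. This is a sketch under the assumption of independent fair coin flips: it looks at the flip that follows every run of three heads and estimates how often that flip is heads. If streaks were "balanced", the estimate would sit well below 0.5.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

flips = [random.random() < 0.5 for _ in range(200_000)]  # True = heads

# Collect the flip that follows every run of three consecutive heads.
after_streak = [
    flips[i]
    for i in range(3, len(flips))
    if flips[i - 3] and flips[i - 2] and flips[i - 1]
]

p_heads_after_streak = sum(after_streak) / len(after_streak)
print(f"streaks found: {len(after_streak)}")
print(f"P(heads | three heads in a row) ~ {p_heads_after_streak:.3f}")
```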

## Independent and Identically Distributed (IID)

Fundamental assumption made in statistical learning
- assuming that the training set is independently drawn from a fixed distribution
- **independent** = our samples are not correlated/related to each other
- **identically distributed** = every sample is drawn from the same distribution, so the distribution across our data set matches the 'true' distribution

Random sampling is required to make the sample representative

## Parametric versus non-parametric models

Fixed number of parameters = parametric
- neural nets

Variable number of parameters = non-parametric
- support vector machines, K nearest neighbours
- the number of parameters often grows with the number of samples
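
A rough illustration of the difference, using a least-squares line as a stand-in for a parametric model (not one of the models listed above) and 1-nearest-neighbour as the non-parametric one. The data and function names are hypothetical. The line always stores two numbers, while the nearest-neighbour model stores every training pair:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def fit_line(xs, ys):
    """Least-squares line: always two parameters (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


def fit_nearest_neighbour(xs, ys):
    """1-nearest-neighbour 'fit' just memorises every training pair."""
    return list(zip(xs, ys))


def predict_nearest_neighbour(model, x):
    """Return the y value of the stored point whose x is closest."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


for n_samples in (10, 100, 1_000):
    xs = [random.uniform(0, 10) for _ in range(n_samples)]
    ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]
    line = fit_line(xs, ys)
    knn = fit_nearest_neighbour(xs, ys)
    print(f"n={n_samples:>5}: line stores {len(line)} numbers, 1-NN stores {2 * len(knn)}")
```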

## Probability

Chapter 3 of [Deep Learning](https://www.deeplearningbook.org/).

Probability = extension of logic to deal with uncertainty.  Nearly all activities require reasoning under uncertainty.  

Three sources of uncertainty
1. stochastic environment
2. incomplete observation
3. incomplete models (that discard information)

More practical to use simple (yet inexact) rules
- 'most birds can fly' = cheap, simple & broadly applicable
- 'birds fly, except for young birds & kiwis' = expensive, exact & brittle

**Frequentist** probability = represents the frequency with which an event would occur
- rolling dice, counting cards

**Bayesian** probability = represents a degree of belief
- quantifying a level of certainty
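
The two views can be put side by side on the same data. This is a sketch with hypothetical numbers (7 heads in 10 flips) and an assumed uniform Beta(1, 1) prior: the frequentist estimate is the observed relative frequency, while the Bayesian answer is a whole distribution over the probability of heads, summarised here by its mean.

```python
heads, flips = 7, 10  # hypothetical data

# Frequentist point estimate: the observed relative frequency.
frequentist_estimate = heads / flips

# Bayesian update: Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails).
prior_a, prior_b = 1, 1  # uniform prior, an assumption made for this sketch
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior_mean = post_a / (post_a + post_b)

print(f"frequentist estimate of P(heads): {frequentist_estimate:.3f}")
print(f"posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
```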

## Decisions made in a statistical analysis

Chapter 9 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

- what to measure (labels)
- what variables to use (features)
- samples to drop (cleaning)
- how to group
- how to deal with missing data
- how much data do I need

Too much freedom = allows bias to creep in
- commit to decisions before seeing the data
- e.g. a p-value threshold

## Mistakes made in a statistical analysis

Chapter 10 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

> Surveys of statistically significant results reported in medical and psychological trials suggest that many p values are wrong and some statistically insignificant results are actually significant when computed correctly. 
>
> Even the prestigious journal Nature isn't perfect, with roughly **38% of papers making typos and calculation errors in their p values**. 
>
> Other reviews find examples of misclassified data, erroneous duplication of data, inclusion of the wrong dataset entirely, and other mix-ups, all concealed by papers that did not describe their analysis in enough detail for the errors to be easily noticed. 

## What to look for in a statistical analysis

Chapter 12 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

What is the statistical power of the study?

How were features selected / discarded?

Effect-size estimates and confidence intervals accompanying significance tests, showing whether the results have practical importance 

Whether appropriate statistical tests were used and how they were corrected for multiple comparisons 
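
One common correction for multiple comparisons is Bonferroni, which divides the significance threshold by the number of tests. The sketch below uses made-up p-values and is only one of several possible corrections; it also shows how quickly the chance of at least one false positive grows when nothing is corrected (assuming independent tests).

```python
alpha = 0.05

# Hypothetical p-values from five separate tests on the same data set.
p_values = [0.003, 0.012, 0.021, 0.040, 0.300]

# Without correction, every p-value is compared against alpha.
uncorrected = [p < alpha for p in p_values]

# Bonferroni: compare each p-value against alpha / number of tests.
bonferroni_threshold = alpha / len(p_values)
corrected = [p < bonferroni_threshold for p in p_values]

# Chance of at least one false positive if every null hypothesis were true
# and the tests were independent.
family_wise_error = 1 - (1 - alpha) ** len(p_values)

print(f"significant without correction: {sum(uncorrected)} of {len(p_values)}")
print(f"significant with Bonferroni:    {sum(corrected)} of {len(p_values)}")
print(f"family-wise error rate without correction: {family_wise_error:.3f}")
```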

## Relationship between quality & sharing

Share your code & data!

> Next Wicherts and his colleagues looked for a correlation between these errors and an unwillingness to share data. There was a clear relationship. 
>
> Authors who refused to share their data were more likely to have committed an error in their paper, and their statistical evidence tended to be weaker. 
> 
> Because most authors refused to share their data, Wicherts could not dig for deeper statistical errors, and many more may be lurking. 

## 80/20 of data science

Knowing when to use the mean versus median
- always report both
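
A small example of why both are worth reporting, using made-up salary figures. A single extreme value drags the mean far from the typical observation while the median barely moves:

```python
import statistics

# Hypothetical salaries (in thousands); one extreme value skews the data.
salaries = [30, 32, 35, 36, 38, 40, 42, 45, 48, 1_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"mean:   {mean_salary:.1f}")
print(f"median: {median_salary:.1f}")
```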

Knowing when the ranks matter versus when the absolute values matter
- can allow you to change a regression problem into a classification problem

## How does bias change with more data?

[Statistical Thinking for Data Science | SciPy 2015 | Chris Fonnesbeck](https://www.youtube.com/watch?v=TGGGDpb04Yc)

Bias = systematic error (wrong in the same direction all the time)

The talk runs an experiment on how sample size changes bias (around minute 8)
- induces bias intentionally by ignoring negative values 50% of the time
- shows that bias gets worse with sample size!
- highly precise wrong answers
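
A rough reconstruction of this kind of experiment, not the code from the talk. It assumes the true values come from a standard normal distribution (true mean 0) and that each negative value is lost with probability 0.5. In this simplified sketch the average error does not shrink as the sample grows, while the spread of the estimates does, so the wrong answer looks increasingly certain:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def biased_sample(n):
    """Draw n values from N(0, 1), discarding each negative value with probability 0.5."""
    kept = []
    while len(kept) < n:
        x = random.gauss(0, 1)
        if x < 0 and random.random() < 0.5:
            continue  # the measurement process silently loses this value
        kept.append(x)
    return kept


results = {}
for n in (10, 100, 1_000, 10_000):
    # Repeat the experiment 50 times to see both the average error and the spread.
    estimates = [statistics.fmean(biased_sample(n)) for _ in range(50)]
    results[n] = (statistics.fmean(estimates), statistics.stdev(estimates))
    print(f"n={n:>6}: mean estimate = {results[n][0]:.3f}, spread = {results[n][1]:.3f}")
```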

## Simpson's paradox

[Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)

Trend appearing in several different groups that reverses when groups are combined

![](./assets/si.gif)

Study of gender bias among graduate school admissions to University of California, Berkeley. The admission figures for the fall of 1973:

![](./assets/simpson1.png)

The figures indicate that women are less likely to be admitted than men.  The difference is unlikely to be due to chance.

But if we examine the six largest departments:

![](./assets/simpson2.png)

Bickel et al. (1975) concluded
- women tended to apply for competitive departments with lower admission rates (English)
- men tended to apply for less competitive departments with high rates of admission (engineering, chemistry)

I think it is more a commentary on lack of funding (more funding = less competition)
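
The reversal is easy to reproduce with invented numbers. The figures below are hypothetical and are NOT the Berkeley data: women have the higher admission rate in each department, but because most women apply to the department with the low admission rate, their overall rate is lower.

```python
# Hypothetical figures, NOT the Berkeley data: (applicants, admitted).
applications = {
    "easy department": {"men": (800, 480), "women": (100, 65)},
    "hard department": {"men": (200, 40), "women": (900, 225)},
}

totals = {"men": [0, 0], "women": [0, 0]}
department_rates = {}

for department, groups in applications.items():
    department_rates[department] = {}
    for gender, (applied, admitted) in groups.items():
        department_rates[department][gender] = admitted / applied
        totals[gender][0] += applied
        totals[gender][1] += admitted
    rates = department_rates[department]
    print(f"{department}: men {rates['men']:.0%}, women {rates['women']:.0%}")

# Combine the departments: the ordering of the two groups flips.
overall = {gender: admitted / applied for gender, (applied, admitted) in totals.items()}
print(f"overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```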

## Survivorship bias

[Statistical Thinking for Data Science | SciPy 2015 | Chris Fonnesbeck](https://www.youtube.com/watch?v=TGGGDpb04Yc)

Cats that fall from greater heights appear to survive more often ([NY Times](https://www.nytimes.com/1989/08/22/science/on-landing-like-a-cat-it-is-a-fact.html))

> From June 4 through Nov. 4, 1984, for instance, 132 such victims were admitted to the Animal Medical Center on 62d Street in Manhattan.
>
> Most of the cats landed on concrete. Most survived. Experts believe they were able to do so because of the laws of physics, superior balance and what might be called the flying-squirrel tactic.
>
> Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories. The cat that fell 32 stories on concrete, Sabrina, suffered a mild lung puncture and a chipped tooth. She was released from the hospital after 48 hours.

Survivorship
- found data
- convenient data
- is the missing data related to the target?

Self selection bias
- can choose to report data or not
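
A sketch of how missing data that is related to the target distorts an estimate. One commonly suggested reading of the cat figures is that cats killed by the fall are rarely brought to a clinic. All rates below are invented for illustration; the point is only that the survival rate among admitted cats can sit far above the true one.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SURVIVAL_RATE = 0.6   # hypothetical
P_ADMITTED_IF_ALIVE = 0.9  # hypothetical
P_ADMITTED_IF_DEAD = 0.1   # hypothetical: owners rarely bring in a dead cat

falls = 100_000
survived = admitted_alive = admitted_dead = 0

for _ in range(falls):
    alive = random.random() < TRUE_SURVIVAL_RATE
    survived += alive
    p_admitted = P_ADMITTED_IF_ALIVE if alive else P_ADMITTED_IF_DEAD
    if random.random() < p_admitted:  # only admitted cats end up in the data set
        if alive:
            admitted_alive += 1
        else:
            admitted_dead += 1

true_rate = survived / falls
observed_rate = admitted_alive / (admitted_alive + admitted_dead)

print(f"true survival rate:               {true_rate:.3f}")
print(f"survival rate among admitted cats: {observed_rate:.3f}")
```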

## The Baltimore stockbroker

Chapter 6 of [How Not to Be Wrong: The Power of Mathematical Thinking - Jordan Ellenberg](https://en.wikipedia.org/wiki/How_Not_to_Be_Wrong).

You receive mail from a stockbroker who correctly predicts the movement of stock prices
- this continues for 10 weeks

We can calculate the probability of this happening:

In [None]:
0.5 ** 10

If we instead look at it from the stockbroker's perspective
- send 10,240 letters in the first week
- send 5,120 in the second (to those for whom the first week's prediction was correct)
- continue for 10 weeks, and we have 10 people who have received perfect predictions

The key is to understand how many chances the stockbroker had
- not only the probability

Improbable things happen a lot

In [None]:
10240 * 0.5 ** 10
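
The whole scheme can be simulated in a few lines. In this sketch the market direction is a coin flip each week and the mailing list is just a list of integers (both assumptions for illustration). Whatever the market does, half of the remaining recipients saw a correct call, so ten people are always left holding ten perfect predictions:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

recipients = list(range(10_240))  # hypothetical mailing list
weeks = 10

for week in range(1, weeks + 1):
    # Tell one half the stock will go up and the other half it will go down.
    half = len(recipients) // 2
    told_up, told_down = recipients[:half], recipients[half:]

    market_went_up = random.random() < 0.5  # the broker has no real foresight

    # Keep writing only to the people who received the correct prediction.
    recipients = told_up if market_went_up else told_down
    print(f"week {week:>2}: {len(recipients):>5} recipients have seen only correct calls")
```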

## Pseudoreplication

Chapter 3 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

Counting the same sample multiple times
- dependence is the problem here (non-independent sampling)

Additional measurements that depend on previous data don't prove your results generalize 
- they only increase certainty about the specific sample studied

Eliminate hidden sources of correlation between variables
- measure 1,000 patients rather than 100 patients 10 times
- hundreds of neurons recorded in two animals are not hundreds of independent samples
- comparing growth rates of different crops in different fields confounds the crop with the field

Solutions
- average dependent data points 
- analyze each dependent data point separately: rather than combining them, analyze only a subset (e.g. only day 5)
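
A sketch of the first solution, assuming 100 hypothetical patients measured 10 times each, with most of the variation coming from differences between patients. Pooling all 1,000 numbers as if they were independent gives a standard error that is too small; averaging each patient's measurements first leaves 100 independent values and a more honest one:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n_patients, n_repeats = 100, 10

# Each patient has their own underlying level; repeated measurements add a little noise.
measurements = []
for _ in range(n_patients):
    patient_level = random.gauss(0, 1)
    measurements.append([patient_level + random.gauss(0, 0.1) for _ in range(n_repeats)])

# Naive: pool all 1,000 numbers as if they were independent samples.
pooled = [m for patient in measurements for m in patient]
naive_se = statistics.stdev(pooled) / len(pooled) ** 0.5

# Averaging dependent points: one number per patient, so n = 100.
patient_means = [statistics.fmean(patient) for patient in measurements]
averaged_se = statistics.stdev(patient_means) / len(patient_means) ** 0.5

print(f"naive standard error (n={len(pooled)}):       {naive_se:.3f}")
print(f"per-patient standard error (n={len(patient_means)}): {averaged_se:.3f}")
```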

Running PCA on different batches of results
- if the batch number turns out to be important, the distribution differs from batch to batch


## Data fallacies

<img src="assets/fallacies.jpg" alt="" width="900"/>