# Statistics

This notebook outlines a few background concepts before we dig into the course proper.

> Much of basic statistics is not intuitive (or, at least, not taught in an intuitive fashion), and the opportunity for misunderstanding and error is massive. 
>
> A strong course in applied statistics should cover basic hypothesis testing, regression, statistical power calculation, model selection, and a statistical programming language like R. 
> 
> *Statistics Done Wrong: The Woefully Complete Guide - Alex Reinhart*

## Law of Large Numbers

[Wikipedia](https://en.wikipedia.org/wiki/Law_of_large_numbers)

The average of the results obtained from a large number of independent trials should be close to the expected value, and will tend to get closer as more trials are performed.

There is no principle that a small number of observations will coincide with the expected value, or that a streak of one value will immediately be "balanced" by the others.
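A minimal simulation sketch of the idea, assuming a fair six-sided die (expected value 3.5). The function name `running_means` and the fixed seed are illustrative choices, not part of the course material.

```python
import random


def running_means(n_trials, seed=0):
    """Roll a fair six-sided die n_trials times and return the running average after each roll."""
    rng = random.Random(seed)
    total = 0
    means = []
    for i in range(1, n_trials + 1):
        total += rng.randint(1, 6)  # randint is inclusive on both ends
        means.append(total / i)
    return means


means = running_means(100_000)
expected = 3.5  # (1 + 2 + 3 + 4 + 5 + 6) / 6

# Print the running average and its distance from the expected value
for n in (10, 100, 1_000, 100_000):
    print(n, round(means[n - 1], 3), round(abs(means[n - 1] - expected), 3))
```

The running average after a handful of rolls can sit well away from 3.5. Only the long-run average is pinned down.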

[The gambler's fallacy](https://en.wikipedia.org/wiki/Gambler%27s_fallacy) = the mistaken belief that, if something happens more frequently than normal during a given period, it will happen less frequently in the future (or vice versa)
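One way to check this empirically is to look at what happens right after a streak. The sketch below assumes a fair coin and independent flips; `heads_rate_after_streak` is a hypothetical helper written for this example.

```python
import random


def heads_rate_after_streak(n_flips, streak_len=3, seed=0):
    """Fraction of heads on flips that immediately follow streak_len heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True = heads
    followers = [
        flips[i]
        for i in range(streak_len, n_flips)
        if all(flips[i - streak_len:i])  # the previous streak_len flips were all heads
    ]
    return sum(followers) / len(followers), len(followers)


rate, count = heads_rate_after_streak(200_000)
print(f"heads rate after 3 heads in a row: {rate:.3f} (from {count} streaks)")
```

If the coin "owed" us tails after a run of heads, the rate would fall below 0.5. With independent flips it stays close to 0.5.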

## IID

Fundamental assumption in statistical learning
- **independent and identically distributed**
- assuming that the training set is independently drawn from a fixed distribution

**Independent** = our samples are not related to each other (knowing one sample tells us nothing about another)

**Identically distributed** = every sample is drawn from the same distribution (the 'true' distribution we want to learn about)

Random sampling is required to make the sample representative of the population
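A small sketch of why random sampling matters. The population here is synthetic (heights drawn from a normal distribution with mean 170 and standard deviation 10, then sorted), and all of these numbers are assumptions made up for the illustration.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic population, stored in sorted order (e.g. a table ordered by height)
population = sorted(rng.gauss(170, 10) for _ in range(10_000))
pop_mean = statistics.mean(population)

first_100 = population[:100]              # not random: the 100 smallest values
random_100 = rng.sample(population, 100)  # simple random sample without replacement

print("population mean :", round(pop_mean, 1))
print("first 100 rows  :", round(statistics.mean(first_100), 1))
print("random 100 rows :", round(statistics.mean(random_100), 1))
```

Taking the first rows of an ordered table gives a sample that is not drawn from the same distribution as the population, so its mean is badly biased. The random sample lands much closer.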

## Parametric versus non-parametric models

Fixed number of parameters (independent of the amount of training data) = parametric
- neural nets

Number of parameters grows with the amount of training data = non-parametric
- kernel support vector machines, K nearest neighbours
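To make the distinction concrete, the sketch below compares a least-squares line (a simple parametric model, used here as a stand-in because it is easy to write from scratch) with a 1-nearest-neighbour model. The helper names and the toy data `y = 2x + 1` are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares line: the fitted model is always two numbers (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_nearest_neighbour(xs, ys):
    """1-nearest-neighbour 'fit': the model is the stored training set itself."""
    return list(zip(xs, ys))


def predict_nearest_neighbour(model, x):
    """Return the target of the stored point whose input is closest to x."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


for n in (10, 1000):
    xs = list(range(n))
    ys = [2 * x + 1 for x in xs]
    line = fit_line(xs, ys)
    knn = fit_nearest_neighbour(xs, ys)
    print(f"n={n}: line stores {len(line)} numbers, 1-NN stores {len(knn)} points")
```

The line is described by two numbers however much data it sees, while the nearest-neighbour model grows with the training set.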

## Probability

Chapter 3 of [Deep Learning](https://www.deeplearningbook.org/).

Probability = extension of logic to deal with uncertainty.  Nearly all activities require reasoning under uncertainty.  

Three sources of uncertainty
1. stochastic environment
2. incomplete observation
3. incomplete models (that discard information)

More practical to use simple (yet inexact) rules
- 'most birds can fly' = cheap, simple & broadly applicable
- 'birds fly, except for young birds & kiwis' = expensive, exact & brittle

**Frequentist** probability = the rate at which an event would occur over many repeated trials
- rolling dice, counting cards

**Bayesian** probability = represents a degree of belief
- quantifying a level of certainty
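The two views can be sketched side by side. The frequentist part estimates a probability as a long-run relative frequency. The Bayesian part updates a degree of belief about a coin's bias using a Beta prior, which is one common textbook choice and is assumed here rather than taken from the source. The observed counts (7 heads, 3 tails) are made up.

```python
import random

# Frequentist: probability as long-run relative frequency of rolling a six
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(60_000)]
freq_six = rolls.count(6) / len(rolls)
print("relative frequency of a six:", round(freq_six, 4))


# Bayesian: probability as degree of belief, updated by evidence
def update_beta(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior plus coin-flip counts gives a Beta posterior."""
    return alpha + heads, beta + tails


# Start from a uniform Beta(1, 1) prior, then observe 7 heads and 3 tails
alpha, beta = update_beta(1, 1, heads=7, tails=3)
posterior_mean = alpha / (alpha + beta)
print("posterior belief that the coin lands heads:", round(posterior_mean, 3))
```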

## Decisions made in a statistical analysis

Chapter 9 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

- what to measure (labels)
- what variables to use (features)
- samples to drop (cleaning)
- how to group
- how to deal with missing data
- how much data do I need

Too much freedom = allows bias to creep in
- commit to decisions before seeing the data
- e.g. a p-value threshold
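A rough simulation of how analysis freedom inflates false positives. Every dataset below has no true effect. A "committed" analyst runs one pre-specified test, while a "flexible" analyst tries 10 analyses and reports the best p-value. Treating the 10 analyses as independent z-tests is a simplifying assumption (real analysis choices are usually correlated), and the function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def p_value_null_experiment(rng, n=30):
    """Two-sided z-test p-value for H0: mean = 0, on data that truly has mean 0 and sd 1."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(n_choices, n_studies=2000, alpha=0.05, seed=0):
    """Share of null studies declared significant when the best of n_choices analyses is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(p_value_null_experiment(rng) for _ in range(n_choices))
        hits += best_p < alpha
    return hits / n_studies


committed = false_positive_rate(n_choices=1)
flexible = false_positive_rate(n_choices=10)
print("one pre-registered analysis:", committed)
print("best of ten analyses       :", flexible)
```

Under these assumptions the committed analyst's false positive rate stays near the chosen threshold, while the flexible analyst's rate is several times higher.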

## Mistakes made in a statistical analysis

Chapter 10 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

> Surveys of statistically significant results reported in medical and psychological trials suggest that many p values are wrong and some statistically insignificant results are actually significant when computed correctly. 
>
> Even the prestigious journal Nature isn’t perfect, with roughly **38% of papers making typos and calculation errors in their p values**. 
>
> Other reviews find examples of misclassified data, erroneous duplication of data, inclusion of the wrong dataset entirely, and other mix-ups, all concealed by papers that did not describe their analysis in enough detail for the errors to be easily noticed. 

## What to look for in a statistical analysis

Chapter 12 of [Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/).

What is the statistical power of the study?
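As a rough guide to what a power calculation looks like, the sketch below uses the normal approximation for a two-sided, two-sample comparison of means with equal group sizes. The function `two_sample_power` is written for this example and is not taken from the book. An exact calculation based on the t distribution would give slightly different numbers.

```python
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power to detect a standardised mean difference (Cohen's d).

    Normal approximation, two-sided test, equal group sizes.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability the test statistic falls in either rejection region
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


for n in (16, 64, 256):
    print(f"d=0.5, n={n} per group: power = {two_sample_power(0.5, n):.2f}")
```

Power is the probability that the study detects an effect of the given size if it really exists. A study with low power that reports no effect tells us very little.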

How were features selected / discarded?

Are effect-size estimates and confidence intervals reported alongside significance tests, showing whether the results have practical importance?
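A sketch of reporting an effect size and a confidence interval rather than only a p-value. The two groups below are made-up numbers, and the interval uses a normal approximation for simplicity. With samples this small a t-based interval would be wider and more appropriate.

```python
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    diff = mean(a) - mean(b)
    return diff - z * se, diff + z * se


# Hypothetical measurements for two groups
treatment = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7]
control = [4.8, 4.6, 5.0, 5.1, 4.7, 4.9, 4.5, 5.2]

d = cohens_d(treatment, control)
low, high = mean_diff_ci(treatment, control)
print(f"difference in means: {mean(treatment) - mean(control):.2f}")
print(f"Cohen's d: {d:.2f}")
print(f"95% CI for the difference: ({low:.2f}, {high:.2f})")
```

The interval shows the range of differences compatible with the data, which is what lets a reader judge practical importance.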

Were appropriate statistical tests used, and how were they corrected for multiple comparisons?
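One simple correction for multiple comparisons is the Bonferroni correction, shown here as an example (other corrections exist and the source does not prescribe one). The p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests remain significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical p-values from five tests run on the same study
p_values = [0.001, 0.012, 0.030, 0.047, 0.200]

uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni(p_values)

print("significant without correction:", sum(uncorrected))
print("significant with Bonferroni   :", sum(corrected))
```

Bonferroni is conservative: it controls the chance of any false positive across the whole family of tests, at the cost of statistical power.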

## Relationship between quality & sharing

Share your code & data!

> Next Wicherts and his colleagues looked for a correlation between these errors and an unwillingness to share data. There was a clear relationship. 
>
> Authors who refused to share their data were more likely to have committed an error in their paper, and their statistical evidence tended to be weaker. 
> 
> Because most authors refused to share their data, Wicherts could not dig for deeper statistical errors, and many more may be lurking. 