# Statistics
[Outline of statistics (Wikipedia)](https://en.wikipedia.org/wiki/Outline_of_statistics)
## Describing a single set of data

### Central Tendencies
* mean
  * Sensitive to outliers
* median
  * Insensitive to outliers
  * Can use Quickselect for efficient calculation.
* Quantiles
  * Generalization of the median
  * The *p*-th quantile is the value below which a fraction *p* of the data lies (the median is the 0.5 quantile: 50% of the data lies below it).
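
The quantile idea above can be sketched in a few lines. This is a minimal version that assumes the simplest convention (sort, then index without interpolation); libraries such as NumPy interpolate between data points by default, so their results can differ slightly. The sample `data` list is made up for illustration.

```python
def quantile(x, p):
    """Return the value below which roughly a fraction p of x lies.

    Assumes x is non-empty and 0 <= p < 1; no interpolation is done.
    """
    p_index = int(p * len(x))
    return sorted(x)[p_index]


def median(x):
    """Middle value of x (mean of the two middle values when len(x) is even)."""
    ordered = sorted(x)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative data with one extreme value at the end.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 100]
print(quantile(data, 0.25), quantile(data, 0.75), median(data))
```

Sorting costs O(n log n); Quickselect avoids the full sort and finds the same element in O(n) on average.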
 
### Dispersion

* The spread of our data.
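* range
  * Depends only on the largest and smallest values, so it ignores the rest of the data set.

```python
def data_range(x):
    """difference between the largest and smallest values in x"""
    return max(x) - min(x)
```
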
* variance
  * More complex than range but better at describing the *spread*
  * Has units that are the square of the original units which can be hard to intuit.


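```python
# mean and sum_of_squares are assumed helpers (not shown in the original notes);
# minimal versions are included here so the block runs on its own.
def mean(x):
    return sum(x) / len(x)

def sum_of_squares(v):
    return sum(v_i ** 2 for v_i in v)

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """sample variance; assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)
```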

* Standard deviation
  * Square root of the variance, so it has the same units as the original data and is easier to interpret.

```python
import math

def standard_deviation(x):
    return math.sqrt(variance(x))
```

Both the range and the standard deviation are heavily affected by outliers.

A more robust alternative is the interquartile range, the difference between the 75th and 25th percentile values:

```python
def interquartile_range(x):
    # assumes a quantile(x, p) helper that returns the p-th quantile of x
    return quantile(x, 0.75) - quantile(x, 0.25)
```
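
To make the outlier claim concrete, here is a small sketch that compares the three measures on made-up data before and after one value is replaced by an extreme one. It uses the standard library's `statistics.quantiles` (Python 3.8+) rather than the `quantile` helper from these notes; its default method interpolates, so the quartiles may differ slightly from other conventions.

```python
import statistics


def data_range(x):
    return max(x) - min(x)


def interquartile_range(x):
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(x, n=4)
    return q3 - q1


# Illustrative data: the second list swaps the largest value for an outlier.
clean = [10, 12, 13, 15, 16, 18, 19, 21]
with_outlier = clean[:-1] + [210]

for name, measure in [
    ("range", data_range),
    ("stdev", statistics.stdev),
    ("IQR", interquartile_range),
]:
    print(name, measure(clean), measure(with_outlier))
```

On this data the range and standard deviation grow sharply, while the interquartile range does not move, because only the largest value changed.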

## Correlation

* **Covariance**
  * Hard to interpret since units are a product of component units
  * Changing the scale changes the covariance
  

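```python
def dot(v, w):  # assumed helper: sum of element-wise products
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)
```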

* **Correlation**
  * Divides the covariance by the standard deviations of both variables
  * Unitless 
  * lies between -1 and 1
  

```python
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0  # no variation in at least one variable: report 0
```
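
The difference between the two measures shows up when the units change. The sketch below is self-contained (it re-implements covariance and correlation compactly rather than reusing the helpers above) and uses made-up height and weight values: converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math


def mean(x):
    return sum(x) / len(x)


def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    # The covariance of a variable with itself is its variance.
    sd_x = math.sqrt(covariance(x, x))
    sd_y = math.sqrt(covariance(y, y))
    if sd_x > 0 and sd_y > 0:
        return covariance(x, y) / (sd_x * sd_y)
    return 0


# Hypothetical measurements, for illustration only.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55, 65, 72, 78, 90]
height_cm = [h * 100 for h in height_m]

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(correlation(height_m, weight_kg), correlation(height_cm, weight_kg))
```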

## Simpson's paradox

# The MAGIC Criteria
[source](http://drafts.jsvine.com/the-magic-criteria/)

“There are several properties of data, and its analysis and presentation, that govern its persuasive force,” Abelson writes. “We label these by the acronym MAGIC, which stands for magnitude, articulation, generality, interestingness, and credibility.” In slightly more detail:

## Magnitude
“The strength of a statistical argument is enhanced in accord with the quantitative magnitude of support for its qualitative claim.” Oversimplified: Bigger is better. Magnitude, of course, is relative. The difference between two outcomes in an experiment might seem large on its own (“effect size”) or because the intervention seemed small (“cause size”).

## Articulation
“By articulation, we refer to the degree of comprehensible detail in which conclusions are phrased.” Under another acronym, Abelson might have called this precision or detail. An argument that says, “Trains A and B run at different speeds,” is less compelling than one that says, “Train A runs faster than Train B between Chicago and New York, but slower between New York and Boston.”

## Generality
“Generality denotes the breadth of applicability of the conclusions.” One way to accrue generality: Take multiple approaches to answering the same question. Another: Apply the same approach in different contexts.

## Interestingness
“For a statistical story to be theoretically interesting, it must have the potential, through empirical analysis, to change what people believe about an important issue.” Change what people believe. About an important issue.

## Credibility
“Credibility refers to the believability of a research claim. It requires both methodological soundness, and theoretical coherence.” The burden of proof, at least at the outset, is on the investigator.

# Effect Size
[wiki](https://en.wikipedia.org/wiki/Effect_size)
* Error of effect-size

> Always present effect sizes for primary outcomes...If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
—  L. Wilkinson and APA Task Force on Statistical Inference (1999, p. 599)

* Cohen's *d*

Cohen's *d* is defined as the difference between two means divided by a standard deviation for the data, i.e.
$$
d = \frac{\bar{x}_1 - \bar{x}_2}{s}
$$

where $s$ is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_standard_deviation).
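
As a rough sketch of the formula, the following computes Cohen's *d* for two groups using the usual pooled sample standard deviation, which assumes the two groups have similar variances. The function names and the `treatment` and `control` values are made up for illustration.

```python
import math
import statistics


def pooled_standard_deviation(a, b):
    """Pooled sample standard deviation of two groups."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))


def cohens_d(a, b):
    """Difference in means expressed in units of the pooled standard deviation."""
    return (statistics.mean(a) - statistics.mean(b)) / pooled_standard_deviation(a, b)


# Hypothetical scores for two groups of five.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]
print(cohens_d(treatment, control))
```

Here both groups have sample variance 2.5 and the means differ by 2, so *d* is 2 / sqrt(2.5), roughly 1.26.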

# Power analysis
# Statistical Hypothesis testing
# Exploratory Data Analysis

The goals of EDA are to:
* Suggest hypotheses about the causes of observed phenomena
* Assess assumptions on which statistical inference will be based
* Support the selection of appropriate statistical tools and techniques
* Provide a basis for further data collection through surveys or experiments

# Summary statistics
[wiki](https://en.wikipedia.org/wiki/Summary_statistics)

Some of the characteristics we might want to report are:
* central tendency: Do the values tend to cluster around a particular point?
* modes: Is there more than one cluster?
* spread: How much variability is there in the values?
* tails: How quickly do the probabilities drop off as we move away from the modes?
* outliers: Are there extreme values far from the modes?

Make sure you visualize your data in addition to looking at descriptive stats. [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) is a set of four datasets that have nearly identical descriptive statistics yet look very different when graphed.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/638px-Anscombe%27s_quartet_3.svg.png)
All four sets are nearly identical when examined using simple summary statistics, but vary considerably when graphed.
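
A quick numerical check of this point, using only sets I and II of the quartet (they share the same x values). The numbers are transcribed from the commonly published table, so verify them against the linked article before relying on them.

```python
import statistics

# Sets I and II of Anscombe's quartet; both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Print the mean and sample variance of y for each set.
for name, y in [("I", y1), ("II", y2)]:
    print(name, round(statistics.mean(y), 2), round(statistics.variance(y), 2))
```

The means and variances agree to two decimal places, even though set I is roughly linear with noise and set II follows a smooth curve.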

# Probability mass function

A probability mass function (PMF) maps each value to its probability.
A PMF can be built from a histogram by dividing each bin count by the total number of samples.

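If `ts` is a pandas Series, the PMF can be computed with:
```python
# relative frequency of each distinct value, ordered by value
pmf = ts.value_counts().sort_index() / len(ts)
```
To plot the PMF, draw one bar per value (a `'hist'` plot would instead bin the probabilities themselves):
```python
pmf.plot(kind='bar')
```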
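
For comparison, the same normalization can be sketched without pandas. This is a minimal pure-Python version built on `collections.Counter`; the function name `pmf` and the `rolls` data are illustrative.

```python
from collections import Counter


def pmf(values):
    """Map each distinct value to its relative frequency in values."""
    counts = Counter(values)
    total = len(values)
    # Sort by value so the result is ordered like sort_index() in pandas.
    return {value: count / total for value, count in sorted(counts.items())}


# Illustrative sample.
rolls = [1, 2, 2, 3, 3, 3]
probabilities = pmf(rolls)
print(probabilities)
```

Because every count is divided by the same total, the probabilities sum to 1.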

## The class size paradox (aka Friendship paradox)
* [wiki](https://en.wikipedia.org/wiki/Friendship_paradox)
* [original paper](http://cs.marlboro.edu/courses/spring2010/statistics/wiki/wiki.attachments/Why_Your_Friends_Have_More_Friends_Than_You_Do.pdf)
* [class-size paradox](http://www.umasocialmedia.com/socialnetworks/glossary/class-size-paradox/)


