# Basic statistics concepts

In [19]:
import pandas as pd
from scipy.stats import trim_mean
import numpy as np
import wquantiles

In [20]:
state = pd.read_csv('../data/state.csv')
state.sort_values(by="Population")


Unnamed: 0,State,Population,Murder.Rate,Abbreviation
49,Wyoming,563626,2.7,WY
44,Vermont,625741,1.6,VT
33,North Dakota,672591,3.0,ND
1,Alaska,710231,5.6,AK
40,South Dakota,814180,2.3,SD
7,Delaware,897934,5.8,DE
25,Montana,989415,3.6,MT
38,Rhode Island,1052567,2.4,RI
28,New Hampshire,1316470,0.9,NH
18,Maine,1328361,1.6,ME


In [26]:
print(f"mean: {state['Population'].mean():,}")
print(f"trimmed mean: {trim_mean(state['Population'], 0.1):,}") # drop the lowest 10% and highest 10% of values before averaging
print(f"median: {state['Population'].median():,}")
# In this case mean > trimmed mean > median: a few very populous states pull the mean upward

mean: 6,162,876.3
trimmed mean: 4,783,697.125
median: 4,436,369.5
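
To make the trimming step explicit, here is a minimal sketch on a small made-up sample (the values and the helper name `trimmed_mean` are illustrative assumptions, not taken from `state.csv`). It sorts the data, drops the same fraction from each end, and averages what is left, which is the idea behind `trim_mean`.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))  # number of values dropped at each end
    return x[k:len(x) - k].mean()

# Made-up sample with two large values at the top
sample = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100]

print(np.mean(sample))            # pulled up by 50 and 100
print(trimmed_mean(sample, 0.1))  # drops 1 and 100, then averages
print(np.median(sample))          # middle of the sorted values
```

As with the population data, the ordering here is mean > trimmed mean > median, because the large values affect the mean the most and the median the least.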


# Robust vs non-robust

In [29]:
print("unweighted")
print(np.average(state["Murder.Rate"]))
print(state["Murder.Rate"].median())

print("WEIGHTED")
print(np.average(state["Murder.Rate"], weights=state["Population"]))
print(wquantiles.median(state["Murder.Rate"], weights=state["Population"]))

unweighted
4.066
4.0
WEIGHTED
4.445833981123393
4.4
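
As a rough illustration of what the weighted versions compute, the sketch below uses made-up rates and population sizes (not values from `state.csv`). The weighted mean is the sum of value times weight divided by the sum of weights. The `weighted_median` helper is a simplified, hypothetical definition (the first sorted value at which the cumulative weight reaches half of the total); libraries such as `wquantiles` may interpolate between values, so their results can differ slightly.

```python
import numpy as np

rates = np.array([2.0, 4.0, 10.0])       # made-up rates
weights = np.array([1_000, 3_000, 500])  # made-up population sizes

# Weighted mean: sum(x_i * w_i) / sum(w_i)
weighted_mean = (rates * weights).sum() / weights.sum()

def weighted_median(values, weights):
    """Simplified weighted median: no interpolation between values."""
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cumulative = np.cumsum(w)
    # First position where the running weight reaches half of the total
    return v[np.searchsorted(cumulative, cumulative[-1] / 2)]

print(rates.mean(), weighted_mean)      # unweighted vs weighted mean
print(weighted_median(rates, weights))  # value holding the middle weight
```

The small group with the rate of 10 counts for less once weights are applied, so the weighted mean is lower than the unweighted one in this made-up case.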


# Deviations

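For the dataset {1, 4, 4}, the mean is 3 and the median is 4. \
The deviations from the mean are \
$1-3=-2$ \
$4-3=1$ \
$4-3=1$ \
These deviations describe how dispersed the data are around a central value. The deviations themselves always sum to zero, so variability is summarized using their absolute or squared values. One such metric is the mean absolute deviation.
### Mean absolute deviation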
$$ \frac{\sum_{i=1}^{n} |x_i-\bar{x}|}{n} $$
where $\bar{x}$ is the mean
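
A short sketch of this calculation for the dataset {1, 4, 4}, using only NumPy (the variable names are illustrative):

```python
import numpy as np

x = np.array([1, 4, 4])

deviations = x - x.mean()        # deviations from the mean: -2, 1, 1
print(deviations.sum())          # the raw deviations cancel out

# Mean absolute deviation: average of the absolute deviations
mean_abs_dev = np.abs(deviations).mean()
print(mean_abs_dev)              # (2 + 1 + 1) / 3
```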
### Variance
$$ \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1} $$
### Standard deviation
$$ \sqrt{\text{variance}} $$
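
The two formulas can be checked on the dataset {1, 4, 4} with a few lines of NumPy. This is a sketch; note that `np.var` and `np.std` divide by $n$ unless `ddof=1` is passed, whereas the pandas methods `.var()` and `.std()` divide by $n-1$ by default.

```python
import numpy as np

x = np.array([1, 4, 4])

# Variance with the n - 1 denominator: (4 + 1 + 1) / 2
variance = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
std_dev = np.sqrt(variance)

print(variance, std_dev)
print(np.var(x, ddof=1), np.std(x, ddof=1))  # same values via NumPy
print(np.var(x))                             # default ddof=0 divides by n
```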

### Median absolute deviation from the median (MAD)
The MAD is a robust estimate of variability: like the median, it is not influenced by extreme values.
$$ \text{Median}(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|) $$
where $m$ is the median
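
A minimal sketch of the MAD, written directly from the formula (the helper name `mad` and the sample values are illustrative assumptions). It also compares the MAD with the standard deviation when one extreme value is added.

```python
import numpy as np

def mad(values):
    """Median of the absolute deviations from the median."""
    x = np.asarray(values, dtype=float)
    return np.median(np.abs(x - np.median(x)))

print(mad([1, 4, 4]))  # 0.0: the absolute deviations are 3, 0, 0

data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # made-up extreme value

print(mad(data), mad(with_outlier))                        # 1.0 1.0
print(np.std(data, ddof=1), np.std(with_outlier, ddof=1))  # grows a lot
```

The MAD is unchanged by the extreme value, while the standard deviation increases sharply.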
## Percentile-based estimates
A common measure of variability is the interquartile range (IQR): the difference between the 75th percentile and the 25th percentile.
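
A short sketch of the IQR on a made-up sample, using `np.percentile` with its default settings:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up sample

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
print(q25, q75, iqr)  # 3.0 7.0 4.0
```

On a pandas column the same idea could be written as `s.quantile(0.75) - s.quantile(0.25)`, where `s` stands for any numeric Series.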
### Percentile
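Computing an exact percentile can be expensive on large datasets because the data must be sorted first. Formally, a percentile is a weighted average of two adjacent values in the sorted data:
$$ (1-w)x_{(j)} + wx_{(j+1)} $$
where $x_{(j)}$ is the $j$-th smallest value and $w$ is a weight between 0 and 1. Software tools differ in how they choose $w$, and some use approximations to avoid a full sort, so results can differ slightly depending on the tool.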
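
The sketch below applies the weighted-average formula by hand. It assumes one particular rule for choosing $j$ and $w$ (position $h = (n-1)p/100$ in the sorted data, which is the rule behind NumPy's default linear interpolation); the function names are hypothetical. A second helper fixes $w = 0$ to show how a different convention changes the answer.

```python
import math
import numpy as np

def percentile_linear(values, p):
    """(1 - w) * x_(j) + w * x_(j+1), assuming h = (n - 1) * p / 100."""
    x = sorted(values)
    h = (len(x) - 1) * p / 100  # fractional 0-based position
    j = math.floor(h)           # index of the lower neighbour
    w = h - j                   # weight given to the upper neighbour
    if j + 1 >= len(x):         # p == 100: no upper neighbour
        return float(x[-1])
    return (1 - w) * x[j] + w * x[j + 1]

def percentile_lower(values, p):
    """Same position rule, but always takes the lower neighbour (w = 0)."""
    x = sorted(values)
    return float(x[math.floor((len(x) - 1) * p / 100)])

data = [1, 2, 3, 4]  # made-up sample
print(percentile_linear(data, 25))  # 1.75
print(np.percentile(data, 25))      # 1.75, NumPy's default
print(percentile_lower(data, 25))   # 1.0, a different convention
```

Both conventions are valid instances of the formula; they only differ in the weight $w$, which is why two tools can report different values for the same percentile.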

