## Python statistics essential training - 03_02_distributions

Standard imports

In [24]:
import numpy as np
from scipy.stats import percentileofscore
import pandas as pd

In [2]:
import matplotlib
import matplotlib.pyplot as plt

from IPython import display
from ipywidgets import interact, widgets

%matplotlib inline

In [3]:
import re
import mailbox
import csv

In [5]:
ch65 = pd.read_csv('income-1965-china.csv')
ch15 = pd.read_csv('income-2015-china.csv')
us65 = pd.read_csv('income-1965-usa.csv')
us15 = pd.read_csv('income-2015-usa.csv')

In [6]:
ch65.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
income          1000 non-null float64
log10_income    1000 non-null float64
dtypes: float64(2)
memory usage: 15.8 KB


In [7]:
ch65.head()

| | income | log10_income |
|---|---|---|
| 0 | 1.026259 | 0.011257 |
| 1 | 0.912053 | -0.03998 |
| 2 | 0.110699 | -0.955857 |
| 3 | 0.469659 | -0.328217 |
| 4 | 0.374626 | -0.426402 |
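
The `log10_income` column looks like the base-10 logarithm of `income`. Here is a small sketch that checks this on the five rows printed above. The frame is typed in by hand for illustration (it is an assumption that the rest of the file follows the same rule), and `sample`, `recomputed` and `matches` are names made up for this check.

```python
import numpy as np
import pandas as pd

# The five rows shown by ch65.head(), copied by hand (not read from the CSV).
sample = pd.DataFrame({
    "income": [1.026259, 0.912053, 0.110699, 0.469659, 0.374626],
    "log10_income": [0.011257, -0.03998, -0.955857, -0.328217, -0.426402],
})

# Recompute the logarithm and compare it with the stored column,
# allowing for the rounding in the printed values.
recomputed = np.log10(sample["income"])
matches = np.allclose(recomputed, sample["log10_income"], atol=1e-4)
print(matches)
```

If this prints `True`, the two columns carry the same information on different scales, at least for these rows.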


In [9]:
# here we're going to look at the range of incomes
print(f'China 65 min: {ch65.min()}\n')
print(f'China 65 max: {ch65.max()}')

China 65 min: income          0.041968
log10_income   -1.377078
dtype: float64

China 65 max: income          5.426802
log10_income    0.734544
dtype: float64


In [10]:
# the mean is the average: the sum of all observations divided by the number of observations
ch65.mean()

income          0.660597
log10_income   -0.274157
dtype: float64
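
To make the definition concrete, here is a tiny sketch that computes the mean by hand and compares it with pandas. The four values are made up for illustration and are not taken from the income files.

```python
import pandas as pd

# Made-up observations, not the course data.
incomes = pd.Series([1.0, 2.0, 3.0, 6.0])

# Sum of all observations divided by the number of observations.
manual_mean = incomes.sum() / len(incomes)

print(manual_mean, incomes.mean())
```

Both numbers should agree, since `Series.mean()` applies the same formula.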

# [Variance](https://www.investopedia.com/terms/v/variance.asp)

In [21]:
# variance measures how far a set of numbers is spread out from its mean
# ddof=0 gives the population variance (divide by n); pandas defaults to ddof=1 (divide by n - 1)
ch65.var(ddof = 0)

income          0.208846
log10_income    0.088610
dtype: float64
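
The `ddof` argument is easy to overlook, so here is a minimal sketch of what it changes. The data is made up, and `population_var` and `sample_var` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Made-up data with mean 5, not the course data.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Squared distances from the mean.
squared_dev = (values - values.mean()) ** 2

population_var = squared_dev.sum() / len(values)        # same as var(ddof=0)
sample_var = squared_dev.sum() / (len(values) - 1)      # same as var(), i.e. ddof=1

print(population_var, values.var(ddof=0))
print(sample_var, values.var())
```

With 1000 rows the two versions differ only slightly, which is why the `std` reported by `describe()` (which uses `ddof=1`) is close to, but not exactly, the square root of the variance computed above.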

# Quantiles and Percentiles

In [22]:
# A quantile is the value below which a given percentage of the data points lie
# here we compute the 25% and 75% quantiles of the distribution

# in this case 25% of the China income data points are smaller than about $0.34
# and 75% are smaller than about $0.86
ch65.quantile([.25, .75])

| | income | log10_income |
|---|---|---|
| 0.25 | 0.34413 | -0.463277 |
| 0.75 | 0.863695 | -0.06364 |
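
One way to convince yourself of what these numbers mean is to count how many points actually fall at or below each quantile. The sketch below does that on synthetic data: the log-normal sample, its parameters and the seed are assumptions made for illustration, not the course files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic "incomes" drawn from a log-normal distribution; NOT the course data.
incomes = pd.Series(rng.lognormal(mean=-0.6, sigma=0.7, size=1000))

q25, q75 = incomes.quantile([0.25, 0.75])

# Fraction of observations at or below each quantile.
frac_below_q25 = (incomes <= q25).mean()
frac_below_q75 = (incomes <= q75).mean()

print(q25, q75)
print(frac_below_q25, frac_below_q75)
```

The two fractions should come out at (or very near) 0.25 and 0.75, whatever the actual quantile values are.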


### [Quantiles](https://www.youtube.com/watch?v=IFKQLDmRK0Y)
A quantile is a statistic that tells you the value below which a given fraction of your data falls. For example, if I have the numbers 1-15, the 50% quantile (aka the median) is the value 8. This means that about half of the data is equal to or less than 8.

``` python
import numpy as np
values = np.arange(1, 16)    # [1, 2, ..., 15]
np.quantile(values, 0.5)     # 8.0
```

In [23]:
# if we pass in 1 (100%) we get back the maximum income, because every income is less than or equal to the max
ch65.quantile([.25, .75, 1])

| | income | log10_income |
|---|---|---|
| 0.25 | 0.34413 | -0.463277 |
| 0.75 | 0.863695 | -0.06364 |
| 1.0 | 5.426802 | 0.734544 |


### [Percentiles](https://www.mathsisfun.com/data/percentiles.html)
This is the inverse of a quantile. If a quantile gives you a value based on a percentage, a percentile score (what `percentileofscore` computes) gives you a percentage based on a value. Put the other way around, the nth percentile of a set of data is the value at which n percent of the data is below it.

Back to our 15 numbers, if we were to input the number 11 we would get back about 73%, because 11 of the 15 values (73%) are equal to or less than 11.

``` python
from scipy.stats import percentileofscore
values = list(range(1, 16))      # [1, 2, ..., 15]
percentileofscore(values, 11)    # about 73.3
```
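
To see the "inverse" relationship in action, here is a rough round trip: take a quantile, then ask for the percentile of the value that comes back. The data is synthetic (a seeded log-normal sample, which is an assumption for illustration), and on small samples like the 15 numbers above the two functions use slightly different conventions, so the round trip is only approximate there.

```python
import numpy as np
from scipy.stats import percentileofscore

rng = np.random.default_rng(1)

# Synthetic sample of 1000 positive values; NOT the course data.
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Percentage -> value.
value_at_75 = np.quantile(data, 0.75)

# Value -> percentage (should land back near 75).
pct_back = percentileofscore(data, value_at_75)

print(value_at_75, pct_back)
```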

__Alternate Example:__ You are the fourth tallest person in a group of 20.

16 of the 20 people (80%) are shorter than you:
![link](https://www.mathsisfun.com/data/images/percentile-80.svg)

That means you are at the 80th percentile.

If your height is 1.85m then "1.85m" is the 80th percentile height in that group.
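
The height example can be reproduced with `percentileofscore`. The 20 heights below are hypothetical, chosen so that 1.85 m is the fourth tallest. `kind="strict"` counts only people strictly shorter than you, which matches the wording of the example; the default `kind="rank"` treats the score itself differently and would give another number.

```python
from scipy.stats import percentileofscore

# Hypothetical heights in metres for 20 people, sorted; 1.85 is the fourth tallest.
heights = [1.55, 1.58, 1.60, 1.62, 1.63, 1.65, 1.66, 1.68, 1.70, 1.71,
           1.73, 1.74, 1.76, 1.78, 1.80, 1.82, 1.85, 1.88, 1.90, 1.93]

# Percentage of people strictly shorter than 1.85 m.
pct_shorter = percentileofscore(heights, 1.85, kind="strict")

print(pct_shorter)
```

Sixteen of the twenty heights are below 1.85, so the result corresponds to the 80th percentile.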

In [26]:
# in this case $1.50 sits at about the 95.5th percentile, so roughly 95.5% of the incomes are below $1.50
percentileofscore(ch65['income'],1.5)

95.5

In [27]:
ch65.describe()

| | income | log10_income |
|---|---|---|
| count | 1000.0 | 1000.0 |
| mean | 0.660597 | -0.274157 |
| std | 0.457226 | 0.297822 |
| min | 0.041968 | -1.377078 |
| 25% | 0.34413 | -0.463277 |
| 50% | 0.557477 | -0.253773 |
| 75% | 0.863695 | -0.06364 |
| max | 5.426802 | 0.734544 |


In [28]:
us65.describe()

| | income | log10_income |
|---|---|---|
| count | 1000.0 | 1000.0 |
| mean | 31.587965 | 1.418835 |
| std | 22.101531 | 0.2622 |
| min | 4.177852 | 0.620953 |
| 25% | 17.498592 | 1.243003 |
| 50% | 26.069531 | 1.416133 |
| 75% | 39.017113 | 1.591255 |
| max | 246.030397 | 2.390989 |
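
With both summaries side by side, one thing worth trying is comparing the medians (the `50%` rows). The sketch below copies the four median values by hand from the two `describe()` outputs above and shows that a difference on the log10 scale corresponds to a ratio on the original scale. The variable names are made up for this example.

```python
import numpy as np

# Medians (the 50% rows) copied by hand from the describe() outputs above.
ch65_median, ch65_log_median = 0.557477, -0.253773
us65_median, us65_log_median = 26.069531, 1.416133

# Ratio of the medians on the original scale.
ratio_direct = us65_median / ch65_median

# The same ratio recovered from the difference of the log10 medians.
ratio_from_logs = 10 ** (us65_log_median - ch65_log_median)

print(round(ratio_direct, 1), round(ratio_from_logs, 1))
```

Both routes give a ratio of roughly 47, which is one reason the `log10_income` column is handy when comparing distributions that differ by orders of magnitude.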
