# What are the measures of spread?

Some important concepts for understanding measures of spread are:

* Quantiles: values that split a sorted group of data into equal-sized parts, which helps show how the data is spread out.
  Example: for [1, 2, 3, 4, 5]
    * 25% quantile = 2
    * 50% quantile = 3 (commonly referred to as the median)
    * 75% quantile = 4
      
* Standard deviation (std): a measure of how much individual numbers in a group differ from the average, which gives an idea of how spread out the data in the group is. Since std is expressed in the same unit of measurement as the data, it can be compared directly with individual numbers in the group. The formulas for std are as follows; pandas uses the sample formula (dividing by n - 1, i.e. `ddof=1`) by default:

  !["Standard deviation formulas"](standard_deviation.jpeg)

  Example: for [1, 2, 3, 4, 5]
    * Mean = 3
    * [1 - 3, 2 - 3, 3 - 3, 4 - 3, 5 - 3] (subtract mean from each value)
    * [-2, -1, 0, 1, 2]
    * [4, 1, 0, 1, 4] (square each value)
    * 10 / 4 = 2.5 (sum values in list, divide by list length - 1, for sample)
    * 2.5 ^ 0.5 = 1.581 (take the square root)

* Variance: a measure of how much the numbers in a group differ from their average, showing how scattered or clustered the data is. Variance is the standard deviation squared.

  Example: for [1, 2, 3, 4, 5]
    * Standard deviation = 1.581
    * 1.581 ^ 2 = 2.5
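
The worked examples above can be reproduced with pandas. This is a minimal sketch on the same list `[1, 2, 3, 4, 5]`; the variable names are illustrative only. It also shows the difference between the sample formula (`ddof=1`, the pandas default) and the population formula (`ddof=0`).

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Quantiles (pandas interpolates linearly between data points by default)
quartiles = values.quantile([0.25, 0.5, 0.75])

# Sample standard deviation and variance: divide by n - 1 (pandas default, ddof=1)
sample_std = values.std()
sample_var = values.var()

# Population standard deviation and variance: divide by n (ddof=0)
population_std = values.std(ddof=0)
population_var = values.var(ddof=0)

print(quartiles.tolist())        # [2.0, 3.0, 4.0]
print(round(sample_std, 3))      # 1.581
print(round(sample_var, 3))      # 2.5
print(round(population_std, 3))  # 1.414
print(round(population_var, 3))  # 2.0
```

The sample results match the hand calculation above, while the population versions come out slightly smaller because they divide by 5 instead of 4.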

# Import Libraries

In [1]:
import pandas as pd

# Load Dataset

In [2]:
df = pd.read_csv('census_income_data.csv')

# Finding Quantiles on a DataFrame

Pandas can be used to answer questions like this:

    "What is the 25%, 50%, 75%, 90%, 95% quantile of the capital gained and lost in the dataset?"

The `quantile()` function answers it:

In [5]:
df[['capital-gain','capital-loss']].quantile([0.25,0.5,0.75,0.9,0.95])

quantile,capital-gain,capital-loss
0.25,0.0,0.0
0.5,0.0,0.0
0.75,0.0,0.0
0.9,0.0,0.0
0.95,5013.0,0.0


# Finding Quantiles, Standard Deviation and Variance on a group

Quantiles, standard deviation and variance can also be calculated separately for each group in the data.

Doing this with groups based on the `workclass` column would look something like this.

In [6]:
df['workclass'].value_counts()

workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64

## Finding the 50%, 90%, 95% quantiles for `age`, `capital-gain`, `capital-loss` and `hours-per-week`

In [10]:
df.groupby('workclass')[["age","capital-gain","capital-loss",'hours-per-week']].quantile([0.5,0.9,0.95])

workclass,quantile,age,capital-gain,capital-loss,hours-per-week
Federal-gov,0.5,43.0,0.0,0.0,40.0
Federal-gov,0.9,58.0,0.0,0.0,50.0
Federal-gov,0.95,61.0,7298.0,1579.55,60.0
Local-gov,0.5,41.0,0.0,0.0,40.0
Local-gov,0.9,58.0,0.0,0.0,52.0
Local-gov,0.95,63.0,5178.0,1583.4,60.0
Never-worked,0.5,18.0,0.0,0.0,35.0
Never-worked,0.9,25.8,0.0,0.0,40.0
Never-worked,0.95,27.9,0.0,0.0,40.0
Private,0.5,35.0,0.0,0.0,40.0


## Calculating the Standard Deviation

Looking at the standard deviation alongside the mean is a good way to see how widely the data is spread around the mean.

In [11]:
df.groupby('workclass')[["age","capital-gain","capital-loss",'hours-per-week']].mean()

workclass,age,capital-gain,capital-loss,hours-per-week
Federal-gov,42.590625,833.232292,112.26875,41.379167
Local-gov,41.751075,880.20258,109.854276,40.9828
Never-worked,20.571429,0.0,0.0,28.428571
Private,36.797585,889.217792,80.008724,40.267096
Self-emp-inc,46.017025,4875.693548,155.138889,48.8181
Self-emp-not-inc,44.969697,1886.061787,116.631641,44.421881
State-gov,39.436055,701.699538,83.256549,39.031587
Without-pay,47.785714,487.857143,0.0,32.714286


In [12]:
df.groupby('workclass')[["age","capital-gain","capital-loss",'hours-per-week']].std()

workclass,age,capital-gain,capital-loss,hours-per-week
Federal-gov,11.509171,4101.966767,453.504623,8.838605
Local-gov,12.272856,5775.043442,439.513203,10.771559
Never-worked,4.613644,0.0,0.0,15.186147
Private,12.827721,6424.267599,384.157003,11.256298
Self-emp-inc,12.553194,17976.548086,549.488497,13.900417
Self-emp-not-inc,13.338162,10986.233506,467.611687,16.674958
State-gov,12.431065,3777.749185,394.469789,11.697014
Without-pay,21.07561,1300.780467,0.0,17.3579
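
To read the mean and standard deviation side by side, both can be requested in a single call with `agg()`. The sketch below uses a small made-up DataFrame (`toy`, not the census file) in which two groups share the same mean but have very different spreads.

```python
import pandas as pd

# Toy data standing in for the census file (values are made up for illustration)
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private",
                  "State-gov", "State-gov", "State-gov"],
    "age": [20, 40, 60, 38, 40, 42],
})

# Mean and sample standard deviation per group, in one table
summary = toy.groupby("workclass")["age"].agg(["mean", "std"])
print(summary)

# Both groups average 40, but "Private" ages are spread much more widely
private_std = summary.loc["Private", "std"]      # 20.0
state_gov_std = summary.loc["State-gov", "std"]  # 2.0
```

The mean alone would suggest the two groups are alike; the standard deviation shows that they are not.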


## Calculating the variance

Because variance is the standard deviation squared, large deviations from the mean are amplified, so it can be useful for spotting distributions that contain outliers. However, there are better ways to find outliers, such as using the interquartile range (IQR) or the median absolute deviation (MAD).

Variance is more important for other areas like calculating confidence intervals, data normalization, testing hypotheses and assessing quality of statistical models.

In [13]:
df.groupby('workclass')[["age","capital-gain","capital-loss",'hours-per-week']].var()

workclass,age,capital-gain,capital-loss,hours-per-week
Federal-gov,132.461017,16826130.0,205666.442818,78.120942
Local-gov,150.622997,33351130.0,193171.855905,116.026473
Never-worked,21.285714,0.0,0.0,230.619048
Private,164.550434,41271210.0,147576.60314,126.704246
Self-emp-inc,157.58267,323156300.0,301937.608495,193.221591
Self-emp-not-inc,177.906562,120697300.0,218660.689455,278.05423
State-gov,154.531374,14271390.0,155606.414471,136.820127
Without-pay,444.181319,1692030.0,0.0,301.296703
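
The IQR and MAD mentioned in this section can be computed with the same pandas building blocks. This is a minimal sketch on made-up values with one obvious outlier; the 1.5 × IQR cutoff is a common rule of thumb assumed here for illustration, not something taken from the census analysis.

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 100])

# Interquartile range: distance between the 75% and 25% quantiles
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Rule of thumb (an assumption here): flag points more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Median absolute deviation: the median distance from the median
mad = (values - values.median()).abs().median()

print(iqr)                       # 2.0
print(outliers.tolist())         # [100]
print(mad)                       # 1.0
print(round(values.var(), 1))    # 1902.5
```

The single outlier inflates the variance to 1902.5, while the IQR and MAD stay small and the IQR rule points directly at the offending value.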


# A quicker way to get all the calculations in the notebook

`describe()` can be used to get most of the measures of center and measures of spread in one call: count, mean, standard deviation, minimum, maximum and the requested quantiles (variance is not included). The `percentiles` parameter specifies which quantiles should be reported.

In [14]:
df.groupby('workclass')[["age","capital-gain","capital-loss",'hours-per-week']].describe(percentiles=[0.5,0.9,0.95])

,age,age,age,age,age,age,age,age,capital-gain,capital-gain,...,capital-loss,capital-loss,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week,hours-per-week
workclass,count,mean,std,min,50%,90%,95%,max,count,mean,...,95%,max,count,mean,std,min,50%,90%,95%,max
Federal-gov,960.0,42.590625,11.509171,17.0,43.0,58.0,61.0,90.0,960.0,833.232292,...,1579.55,3683.0,960.0,41.379167,8.838605,4.0,40.0,50.0,60.0,99.0
Local-gov,2093.0,41.751075,12.272856,17.0,41.0,58.0,63.0,90.0,2093.0,880.20258,...,1583.4,2444.0,2093.0,40.9828,10.771559,2.0,40.0,52.0,60.0,99.0
Never-worked,7.0,20.571429,4.613644,17.0,18.0,25.8,27.9,30.0,7.0,0.0,...,0.0,0.0,7.0,28.428571,15.186147,4.0,35.0,40.0,40.0,40.0
Private,22696.0,36.797585,12.827721,17.0,35.0,55.0,60.0,90.0,22696.0,889.217792,...,0.0,4356.0,22696.0,40.267096,11.256298,1.0,40.0,50.0,60.0,99.0
Self-emp-inc,1116.0,46.017025,12.553194,17.0,45.0,63.0,67.0,84.0,1116.0,4875.693548,...,1902.0,2559.0,1116.0,48.8181,13.900417,1.0,50.0,65.0,70.0,99.0
Self-emp-not-inc,2541.0,44.969697,13.338162,17.0,44.0,63.0,68.0,90.0,2541.0,1886.061787,...,1672.0,2824.0,2541.0,44.421881,16.674958,1.0,40.0,63.0,72.0,99.0
State-gov,1298.0,39.436055,12.431065,17.0,39.0,57.0,61.0,81.0,1298.0,701.699538,...,0.0,3683.0,1298.0,39.031587,11.697014,1.0,40.0,50.0,60.0,99.0
Without-pay,14.0,47.785714,21.07561,19.0,57.0,67.7,69.4,72.0,14.0,487.857143,...,0.0,0.0,14.0,32.714286,17.3579,10.0,27.5,53.5,58.5,65.0
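
The table returned by `describe()` on a group has two-level column labels (column name, then statistic), so it quickly gets wide. As a sketch, a single statistic can be pulled out with a `(column, statistic)` tuple; the `toy` DataFrame below is made up for illustration and is not the census data.

```python
import pandas as pd

# Toy data standing in for the census file (values are made up for illustration)
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private",
                  "State-gov", "State-gov", "State-gov"],
    "age": [20, 40, 60, 38, 40, 42],
})

desc = toy.groupby("workclass")[["age"]].describe(percentiles=[0.5, 0.9])

# Select one statistic for one column using a (column, statistic) tuple
age_std = desc[("age", "std")]
age_p90 = desc[("age", "90%")]

print(age_std)
print(age_p90)
```

Selecting `desc["age"]` instead returns every statistic for that one column as a narrower table.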
