# Example: Variability Estimates of State Population

The data set contains the population and murder rate for each U.S. state.

In [10]:
import pandas as pd
from statsmodels import robust

In [2]:
# Load the data
state = pd.read_csv('data/state.csv')

In [3]:
state.head()

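|   | State | Population | Murder.Rate | Abbreviation |
|---|-------|------------|-------------|--------------|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |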


In [4]:
state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   State         50 non-null     object 
 1   Population    50 non-null     int64  
 2   Murder.Rate   50 non-null     float64
 3   Abbreviation  50 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ KB


#### (a) pandas DataFrames provide methods for calculating the standard deviation and quantiles

In [5]:
# Sample standard deviation (pandas uses ddof=1 by default)
state['Population'].std()

6848235.347401142

In [6]:
# 25th percentile (0.25 quantile)
state['Population'].quantile(0.25)

1833004.25

In [7]:
# 50th percentile (0.50 quantile, i.e. the median)
state['Population'].quantile(0.50)

4436369.5

In [8]:
# 75th percentile (0.75 quantile)
state['Population'].quantile(0.75)

6680312.25

In [9]:
# Interquartile range (IQR): 75th percentile minus 25th percentile
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)

4847308.0
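As a rough illustration of what `quantile` computes, the sketch below reimplements linear interpolation between order statistics, which is the default interpolation in pandas, and applies it to the five populations shown in the `state.head()` output. The helper name `linear_quantile` is hypothetical and not part of pandas. The five-row sample only keeps the example self-contained, so its IQR differs from the 50-state value above.

```python
import math

def linear_quantile(values, q):
    """Quantile q (0 <= q <= 1) via linear interpolation between order statistics."""
    data = sorted(values)
    pos = (len(data) - 1) * q          # fractional index into the sorted data
    lo = math.floor(pos)               # order statistic just below the position
    hi = math.ceil(pos)                # order statistic just above the position
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

# Populations of the five states shown by state.head()
populations = [4779736, 710231, 6392017, 2915918, 37253956]

q1 = linear_quantile(populations, 0.25)
q3 = linear_quantile(populations, 0.75)
iqr = q3 - q1                          # spread of the middle 50% of the sample
print(q1, q3, iqr)
```

Because the IQR depends only on the two quartiles, California's very large population does not affect it in this sample.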

#### (b) For the robust MAD (median absolute deviation from the median), use the function robust.scale.mad from the statsmodels package

In [11]:
robust.scale.mad(state['Population'])

3849876.1459979336

**Note**:<br>
    The standard deviation (about 6.85 million) is almost twice as large as the MAD (about 3.85 million).<br>
    This is because the standard deviation is sensitive to outliers, while the MAD is not.
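To make the definition concrete, here is a minimal sketch of the MAD computed by hand with the standard library. By default, `robust.scale.mad` divides the raw median absolute deviation by about 0.6745 (the 0.75 quantile of the standard normal distribution), which puts it on the same scale as the standard deviation for normally distributed data. The sample is the five populations from the `state.head()` output, so the numbers differ from the 50-state results above. The variable names are illustrative.

```python
from statistics import NormalDist, median, stdev

# Populations of the five states shown by state.head()
populations = [4779736, 710231, 6392017, 2915918, 37253956]

# Step 1: median of the data
center = median(populations)

# Step 2: median of the absolute deviations from that median
raw_mad = median([abs(x - center) for x in populations])

# Step 3: rescale so the MAD is comparable to the standard deviation
scale = NormalDist().inv_cdf(0.75)     # approximately 0.6745
scaled_mad = raw_mad / scale

print("raw MAD:   ", raw_mad)
print("scaled MAD:", scaled_mad)
print("std dev:   ", stdev(populations))
```

Even in this tiny sample, the standard deviation is several times larger than the scaled MAD, because California's population dominates the squared deviations.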

---
---

**Key Ideas**

* Variance and standard deviation are the most widespread and routinely reported statistics of variability.
* Both are sensitive to outliers.
* More robust metrics include the mean absolute deviation, the median absolute deviation from the median, and percentiles (quantiles).
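
The sensitivity to outliers can be checked with a small experiment. The sketch below, which uses only the five populations from the `state.head()` output, measures how much each statistic grows when California is added to the other four states. The helper names `mean_abs_dev` and `median_abs_dev` are hypothetical, and the MAD here is unscaled.

```python
from statistics import mean, median, stdev

def mean_abs_dev(values):
    """Mean absolute deviation from the mean."""
    m = mean(values)
    return mean([abs(x - m) for x in values])

def median_abs_dev(values):
    """Median absolute deviation from the median (unscaled)."""
    m = median(values)
    return median([abs(x - m) for x in values])

# Alabama, Alaska, Arizona, Arkansas
without_outlier = [4779736, 710231, 6392017, 2915918]
# The same four states plus California
with_outlier = without_outlier + [37253956]

# Ratio of each statistic with the outlier to the value without it
ratio_sd = stdev(with_outlier) / stdev(without_outlier)
ratio_mean_ad = mean_abs_dev(with_outlier) / mean_abs_dev(without_outlier)
ratio_median_ad = median_abs_dev(with_outlier) / median_abs_dev(without_outlier)

print("std dev growth:            ", round(ratio_sd, 2))
print("mean abs deviation growth: ", round(ratio_mean_ad, 2))
print("median abs deviation growth:", round(ratio_median_ad, 2))
```

In this sample, the standard deviation grows more than fivefold when the outlier is added, while the median absolute deviation barely changes.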

---
---