# Advanced Pandas Lab

Let's return to the unemployment rate data in the Atlas of Rural and Small Town America.  Previously, we looked at the unemployment data for the entire country at once.  Sometimes it is adequate to look at the broad properties of a dataset, but often we want to dig deeper to understand what's going on at a finer level.  With this data, it is much more informative to look at the unemployment rates at the state level: both how states compare to each other and how counties vary within a given state.

As we did before, ``import`` pandas, load the Excel spreadsheet, and pull the ``Jobs`` sheet.

In [1]:
import pandas as pd

df = pd.read_excel('../data/RuralAtlasData18.xlsx', sheet_name='Jobs')

Let's use the ``.head()`` method to familiarize ourselves with the DataFrame again and make sure it was read correctly.

In [2]:
df.head()

Unnamed: 0,FIPS,State,County,UnempRate2017,UnempRate2016,UnempRate2015,UnempRate2014,UnempRate2010,UnempRate2007,PctEmpChange1017,...,NumUnemployed2013,NumEmployed2013,NumCivLaborforce2013,NumUnemployed2007,NumEmployed2007,NumUnemployed2012,NumEmployed2012,UnempRate2012,NumCivLaborForce2012,NumUnemployed2014
0,0,US,United States,,,,,,,,...,,,,,,,,,,
1,1000,AL,Alabama,4.4,5.9,6.1,6.8,10.5,4.0,5.52526,...,156957.0,2017043.0,2174000.0,86485.0,2089127.0,173047.0,2003290.0,8.0,2176337.0,146531.0
2,1001,AL,Autauga,3.9,5.1,5.2,5.8,8.9,3.3,6.303615,...,1605.0,24205.0,25810.0,806.0,23577.0,1779.0,23961.0,6.9,25740.0,1495.0
3,1003,AL,Baldwin,4.0,5.4,5.5,6.1,10.0,3.1,17.032748,...,5654.0,79626.0,85280.0,2560.0,80099.0,6349.0,78065.0,7.5,84414.0,5300.0
4,1005,AL,Barbour,5.9,8.4,8.9,10.5,12.3,6.3,-13.49481,...,931.0,8168.0,9099.0,650.0,9684.0,1079.0,8283.0,11.5,9362.0,932.0


As we previously found, we need to set the index to the county FIPS and filter out the US, DC, and Puerto Rico.  Let's do that now.

In [3]:
# set_index returns a new DataFrame; df itself is unchanged unless the result is assigned
df.set_index('FIPS')

FIPS,State,County,UnempRate2017,UnempRate2016,UnempRate2015,UnempRate2014,UnempRate2010,UnempRate2007,PctEmpChange1017,PctEmpChange1617,...,NumUnemployed2013,NumEmployed2013,NumCivLaborforce2013,NumUnemployed2007,NumEmployed2007,NumUnemployed2012,NumEmployed2012,UnempRate2012,NumCivLaborForce2012,NumUnemployed2014
0,US,United States,,,,,,,,,...,,,,,,,,,,
1000,AL,Alabama,4.4,5.9,6.1,6.8,10.5,4.0,5.525260,1.343453,...,156957.0,2017043.0,2174000.0,86485.0,2089127.0,173047.0,2003290.0,8.0,2176337.0,146531.0
1001,AL,Autauga,3.9,5.1,5.2,5.8,8.9,3.3,6.303615,1.280852,...,1605.0,24205.0,25810.0,806.0,23577.0,1779.0,23961.0,6.9,25740.0,1495.0
1003,AL,Baldwin,4.0,5.4,5.5,6.1,10.0,3.1,17.032748,2.637293,...,5654.0,79626.0,85280.0,2560.0,80099.0,6349.0,78065.0,7.5,84414.0,5300.0
1005,AL,Barbour,5.9,8.4,8.9,10.5,12.3,6.3,-13.494810,0.649351,...,931.0,8168.0,9099.0,650.0,9684.0,1079.0,8283.0,11.5,9362.0,932.0
1007,AL,Bibb,4.4,6.5,6.6,7.2,11.4,4.1,2.767248,1.031056,...,689.0,8016.0,8705.0,359.0,8432.0,751.0,8047.0,8.5,8798.0,617.0
1009,AL,Blount,4.0,5.4,5.4,6.1,9.8,3.2,4.670525,1.122677,...,1562.0,23325.0,24887.0,849.0,25780.0,1716.0,23244.0,6.9,24960.0,1503.0
1011,AL,Bullock,4.9,6.9,7.9,8.8,11.8,9.4,5.027548,1.825061,...,447.0,4331.0,4778.0,345.0,3308.0,491.0,4245.0,10.4,4736.0,418.0
1013,AL,Butler,5.5,6.9,7.6,8.5,13.6,6.2,5.713928,-1.145797,...,941.0,8206.0,9147.0,560.0,8539.0,1049.0,8039.0,11.5,9088.0,792.0
1015,AL,Calhoun,4.9,6.6,7.0,8.0,11.4,3.9,-5.177356,1.209856,...,4242.0,44163.0,48405.0,2152.0,52709.0,4397.0,45252.0,8.9,49649.0,3773.0
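The two cleaning steps described above (setting the index and dropping the US, DC, and Puerto Rico rows) could be sketched as follows.  This is only one possible approach, shown on a small made-up DataFrame named ``toy`` rather than the real spreadsheet; the ``excluded`` list and the rate values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Jobs sheet; the rates are illustrative, not real data.
toy = pd.DataFrame({
    "FIPS": [0, 1000, 1001, 11001, 72001],
    "State": ["US", "AL", "AL", "DC", "PR"],
    "County": ["United States", "Alabama", "Autauga",
               "District of Columbia", "Adjuntas"],
    "UnempRate2017": [np.nan, 4.4, 3.9, 6.1, 17.3],
})

# set_index returns a new DataFrame, so assign the result to keep it.
toy = toy.set_index("FIPS")

# Build a boolean mask of rows to drop, then invert it with ~.
excluded = ["US", "DC", "PR"]
cleaned = toy[~toy["State"].isin(excluded)]
print(cleaned)
```

Note that the state-level summary row (Alabama, FIPS 1000) survives this filter; whether to drop such rows as well depends on the analysis.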


In the other lab we pulled out the ``UnempRate2017`` column and filtered the ``Null`` values from the resulting Series.  Here, let's leave the column in the DataFrame but filter out the rows that have ``Null`` values in that column.  We used ``.isna()`` to identify ``Nulls`` before.  Here, try ``.notnull()`` instead, which returns ``True`` for every row where the value is not ``Null``.  Not only is this practice with a different way to accomplish the task, it also makes the filtering a bit easier.

In [8]:
print(df.UnempRate2017.notnull())

0       False
1        True
2        True
3        True
4        True
        ...  
3273     True
3274     True
3275     True
3276     True
3277     True
Name: UnempRate2017, Length: 3278, dtype: bool
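The boolean Series printed above can be used directly as a row filter.  A minimal sketch on a made-up DataFrame (``toy``, ``mask``, and ``filtered`` are illustrative names, not part of the lab):

```python
import numpy as np
import pandas as pd

# Toy data: the first row has a missing unemployment rate.
toy = pd.DataFrame({
    "State": ["US", "AL", "AL"],
    "County": ["United States", "Alabama", "Autauga"],
    "UnempRate2017": [np.nan, 4.4, 3.9],
})

# True where the rate is present, False where it is missing.
mask = toy["UnempRate2017"].notnull()

# Indexing with the mask keeps only the rows marked True.
filtered = toy[mask]
print(filtered)
```

Because ``.notnull()`` is already ``True`` for the rows we want to keep, no inversion is needed, which is why it is slightly more convenient than ``.isna()`` here.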


To check that this was done properly, you can count the ``Null`` and non-``Null`` values in the column with ``.isna().sum()`` and ``.notnull().sum()``.

In [19]:
# info is a method: without parentheses this displays the bound method, not the summary
df.info
#print(df.isna().sum())
#print('----------------------')
#print(df.notnull().sum())

<bound method DataFrame.info of        FIPS State         County  UnempRate2017  UnempRate2016  UnempRate2015  \
0         0    US  United States            NaN            NaN            NaN   
1      1000    AL        Alabama            4.4            5.9            6.1   
2      1001    AL        Autauga            3.9            5.1            5.2   
3      1003    AL        Baldwin            4.0            5.4            5.5   
4      1005    AL        Barbour            5.9            8.4            8.9   
...     ...   ...            ...            ...            ...            ...   
3273  72145    PR      Vega Baja           12.4           13.9           13.8   
3274  72147    PR        Vieques           13.7           10.6           11.3   
3275  72149    PR       Villalba           19.6           20.2           19.8   
3276  72151    PR        Yabucoa           16.4           16.9           17.5   
3277  72153    PR          Yauco           17.3           18.8           18.9
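The counting check mentioned above might look like the sketch below.  It runs on a small made-up DataFrame (``toy`` is an illustrative name), and the two counts should always add up to the length of the column.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({"UnempRate2017": [np.nan, 4.4, 3.9, 4.0]})

col = toy["UnempRate2017"]
n_missing = col.isna().sum()      # True counts as 1 when summed
n_present = col.notnull().sum()

print("missing:", n_missing)
print("present:", n_present)
```

After filtering, the missing count for the column should be zero.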

We now have cleaned data, so we can start to perform our calculations.  Start by grouping by state.

In [16]:
dfgroup = df.groupby('State')
print(dfgroup.groups)

{'AK': Int64Index([ 69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,
             82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,
             95,  96,  97,  98,  99, 100, 101, 102],
           dtype='int64'), 'AL': Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
            52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
            68],
           dtype='int64'), 'AR': Int64Index([119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
            132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
            145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157,
            158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
  

Check that the data is grouped as you expect by inspecting the groups.  You can do this with the ``.groups`` attribute, which returns a dictionary mapping each group name to the index values of the rows in that group.  Hint: if the index is the county FIPS, the leading digits are the state code, so they should be the same for every county in a group.  If you really want to be sure, you can look the codes up and check a few to confirm they are correct.

(To make the notebook less cluttered, feel free to only print a selection of the groups using slicing on the groups, or comment out the code line when you have completed your inspection of the groups)
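One way to inspect only a few groups is sketched below, on a made-up DataFrame indexed by FIPS (``toy`` and ``groups`` are illustrative names).  Since ``.groups`` behaves like a dictionary, its sorted keys can be sliced, and integer division by 1000 recovers the state part of each FIPS code.

```python
import pandas as pd

# Toy data indexed by county FIPS; rates are illustrative.
toy = pd.DataFrame(
    {
        "State": ["AL", "AL", "AK", "AK", "AR"],
        "UnempRate2017": [3.9, 4.0, 6.0, 9.5, 4.2],
    },
    index=pd.Index([1001, 1003, 2013, 2016, 5001], name="FIPS"),
)

groups = toy.groupby("State").groups

# Look at only the first two states (alphabetically) to limit the output.
for state in sorted(groups)[:2]:
    fips = list(groups[state])
    state_codes = {code // 1000 for code in fips}
    print(state, fips, "state codes:", state_codes)
```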

In [20]:
df.UnempRate2017.max()

20.1

Now that we have these groups, let's grab the ``UnempRate2017`` column and get the minimum, maximum, and average (mean) for each state.  Save them to the variables below.

In [23]:
state_unemp_mins = df.UnempRate2017.min()
state_unemp_maxs = df.UnempRate2017.max()
state_unemp_mean = df.UnempRate2017.mean()
print(state_unemp_mins)
print(state_unemp_maxs)
print(state_unemp_mean)

1.6
20.1
4.817273005197187
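The cell above calls ``.min()``, ``.max()``, and ``.mean()`` on the whole column, so each returns a single nationwide number.  To get one value per state, select the column from the grouped object instead.  A minimal sketch on made-up data (``toy`` and ``by_state`` are illustrative names):

```python
import pandas as pd

# Toy data; rates are illustrative.
toy = pd.DataFrame({
    "State": ["AL", "AL", "AL", "AK", "AK"],
    "UnempRate2017": [3.9, 4.0, 5.9, 6.0, 8.0],
})

# Selecting a column from a groupby gives a grouped Series.
by_state = toy.groupby("State")["UnempRate2017"]

# Each aggregation returns a Series indexed by state.
state_unemp_mins = by_state.min()
state_unemp_maxs = by_state.max()
state_unemp_mean = by_state.mean()

print(state_unemp_mins)
print(state_unemp_maxs)
print(state_unemp_mean)
```

The same three results can also be obtained in one table with ``by_state.agg(["min", "max", "mean"])``.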


Print the average unemployment rate for each state, ranked from highest to lowest.

In [30]:
df["UnempRate2017"].sort_values(ascending=False).head(10)

85      20.1
3242    20.0
3247    19.8
3275    19.6
3255    19.6
3240    19.4
3262    19.1
208     19.1
3241    18.4
102     18.3
Name: UnempRate2017, dtype: float64
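The output above ranks individual rows rather than state averages.  A possible sketch of the per-state ranking, again on made-up data (``toy`` and ``ranked`` are illustrative names):

```python
import pandas as pd

# Toy data; rates are illustrative.
toy = pd.DataFrame({
    "State": ["AL", "AL", "AL", "AK", "AK"],
    "UnempRate2017": [3.9, 4.0, 5.9, 6.0, 8.0],
})

# Mean rate per state, sorted so the highest average comes first.
ranked = (
    toy.groupby("State")["UnempRate2017"]
    .mean()
    .sort_values(ascending=False)
)
print(ranked)
```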

Now let's find the ten states with the largest disparity between the counties with the highest and lowest unemployment.  There are a few steps here.  What calculation are we doing, and what data manipulations does it require?
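One reading of "disparity" is the per-state range: the maximum county rate minus the minimum.  Under that assumption, the steps are to group, aggregate twice, subtract, and sort.  A sketch on made-up data (``toy``, ``spread``, and ``top`` are illustrative names):

```python
import pandas as pd

# Toy data; rates are illustrative.
toy = pd.DataFrame({
    "State": ["AL", "AL", "AL", "AK", "AK", "AR"],
    "UnempRate2017": [3.9, 4.0, 5.9, 6.0, 9.5, 4.2],
})

by_state = toy.groupby("State")["UnempRate2017"]

# Range within each state: highest county rate minus lowest.
spread = by_state.max() - by_state.min()

# Largest ranges first; head(10) keeps at most ten states.
top = spread.sort_values(ascending=False).head(10)
print(top)
```

The subtraction works state by state because both aggregated Series share the same state index.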

Let's conclude with a challenge: looking at how each county's unemployment rate varies within its state.  In particular, we want to produce a table with each county's name, the state name, the unemployment rate, and a measure of how many standard deviations away from the state mean it lies.  The lecture notes may be helpful.
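A possible approach to the challenge is sketched below.  It uses ``.transform()`` so that each county row receives its own state's mean and standard deviation.  This is only one way to do it: the column name ``ZScore`` and the toy values are assumptions, and the real data carries state abbreviations rather than full names.

```python
import pandas as pd

# Toy data; rates are illustrative.
toy = pd.DataFrame({
    "County": ["Autauga", "Baldwin", "Barbour", "Anchorage", "Juneau"],
    "State": ["AL", "AL", "AL", "AK", "AK"],
    "UnempRate2017": [3.9, 4.0, 5.9, 6.0, 9.5],
})

grouped = toy.groupby("State")["UnempRate2017"]

# transform returns a result aligned with the original rows,
# so every county gets its own state's statistics.
state_mean = grouped.transform("mean")
state_std = grouped.transform("std")  # sample standard deviation (ddof=1)

# Number of standard deviations from the state mean.
toy["ZScore"] = (toy["UnempRate2017"] - state_mean) / state_std

table = toy[["County", "State", "UnempRate2017", "ZScore"]]
print(table)
```

A state with only one county has an undefined sample standard deviation, so its value in this column would be ``NaN``.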

This is a very rich dataset.  If you've finished early, look into how the unemployment has changed over time in these counties, and how it depends on things like incomes and county classifications from the other sheets.