# Basic Pandas and I/O 

To practice using Pandas, we will explore the Atlas of Rural and Small Town America.  The USDA compiles county-level statistics from many different surveys to provide a comprehensive overview of different aspects of America.  This Atlas looks at broad categories of socioeconomic factors including the demographics of the population, economic data on employment, county properties on a rural-urban continuum, income data, and veteran status.  These categories are all stored as different sheets in an Excel file.  For this lab, we will only focus on the ``Jobs`` sheet.  For a full description of the available data, check out the webpage:

https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/

Let's first import the Pandas library and load the data using the ``read_excel`` function.  Remember to specify the ``Jobs`` sheet with the ``sheet_name`` argument.

In [8]:
import pandas as pd

df = pd.read_excel('../data/RuralAtlasData18.xlsx', sheet_name='Jobs')
df.head()
df.columns
print('DataFrame shape is:',df.shape)

DataFrame shape is: (3278, 62)


Before doing any analysis, we first need to know what data fields we have and the type of data we will be working with.  Use ``.head()`` to get the first five rows of data.  If you just want to see a list of the available columns, you can get the column names with ``.columns``.

In [10]:
# list the column names (df.head() would show the first five rows instead)
df.columns

Index(['FIPS', 'State', 'County', 'UnempRate2017', 'UnempRate2016',
       'UnempRate2015', 'UnempRate2014', 'UnempRate2010', 'UnempRate2007',
       'PctEmpChange1017', 'PctEmpChange1617', 'PctEmpChange0717',
       'PctEmpChange0710', 'PctEmpAgriculture', 'PctEmpMining',
       'PctEmpConstruction', 'PctEmpManufacturing', 'PctEmpTrade',
       'PctEmpTrans', 'PctEmpInformation', 'PctEmpFIRE', 'PctEmpServices',
       'PctEmpGovt', 'NumCivEmployed', 'NumEmployed2008',
       'NumCivLaborForce2011', 'UnempRate2011', 'NumUnemployed2010',
       'NumEmployed2010', 'NumCivLaborForce2010', 'NumUnemployed2009',
       'NumEmployed2009', 'NumCivLaborForce2009', 'NumCivLaborForce2008',
       'NumUnemployed2008', 'NumEmployed2011', 'UnempRate2009',
       'UnempRate2008', 'NumCivLaborforce2014', 'NumUnemployed2017',
       'NumEmployed2017', 'NumCivLaborforce2017', 'NumUnemployed2016',
       'NumEmployed2016', 'NumCivLaborforce2016', 'NumCivLaborforce2015',
       'NumEmployed2015', 'NumUnem

This seems like a lot of data.  How much is it exactly?  We can get the size of the DataFrame in terms of rows and columns using ``.shape``.

In [9]:
# use .shape on df
print('DataFrame shape is:', df.shape)


DataFrame shape is: (3278, 62)
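
The inspection steps so far (``.head()``, ``.columns``, ``.shape``) can be tried without the Excel file.  The sketch below uses a small made-up DataFrame; the name ``toy`` and its values are assumptions for illustration and are not taken from the Atlas.

```python
import pandas as pd

# Hypothetical toy data standing in for the Jobs sheet.
toy = pd.DataFrame({
    "FIPS": [1001, 1003, 1005],
    "State": ["AL", "AL", "AL"],
    "UnempRate2017": [3.9, 4.0, 5.9],
})

first_rows = toy.head(2)          # first two rows (the default is five)
column_names = list(toy.columns)  # column labels as a plain Python list
n_rows, n_cols = toy.shape        # .shape is a (rows, columns) tuple

print(first_rows)
print(column_names)
print(n_rows, n_cols)
```

Note that ``.columns`` and ``.shape`` are attributes (no parentheses), while ``.head()`` is a method.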


Pandas has automatically filled in its own index, but the unique identifier for our data is actually the Federal Information Processing Standards (FIPS) county code.  Let's set that as the index using ``.set_index()`` with the column we want specified as an argument.

In [11]:
df.set_index('FIPS')  # returns a new DataFrame; df itself is unchanged unless the result is assigned

FIPS,State,County,UnempRate2017,UnempRate2016,UnempRate2015,UnempRate2014,UnempRate2010,UnempRate2007,PctEmpChange1017,PctEmpChange1617,...,NumUnemployed2013,NumEmployed2013,NumCivLaborforce2013,NumUnemployed2007,NumEmployed2007,NumUnemployed2012,NumEmployed2012,UnempRate2012,NumCivLaborForce2012,NumUnemployed2014
0,US,United States,,,,,,,,,...,,,,,,,,,,
1000,AL,Alabama,4.4,5.9,6.1,6.8,10.5,4.0,5.525260,1.343453,...,156957.0,2017043.0,2174000.0,86485.0,2089127.0,173047.0,2003290.0,8.0,2176337.0,146531.0
1001,AL,Autauga,3.9,5.1,5.2,5.8,8.9,3.3,6.303615,1.280852,...,1605.0,24205.0,25810.0,806.0,23577.0,1779.0,23961.0,6.9,25740.0,1495.0
1003,AL,Baldwin,4.0,5.4,5.5,6.1,10.0,3.1,17.032748,2.637293,...,5654.0,79626.0,85280.0,2560.0,80099.0,6349.0,78065.0,7.5,84414.0,5300.0
1005,AL,Barbour,5.9,8.4,8.9,10.5,12.3,6.3,-13.494810,0.649351,...,931.0,8168.0,9099.0,650.0,9684.0,1079.0,8283.0,11.5,9362.0,932.0
1007,AL,Bibb,4.4,6.5,6.6,7.2,11.4,4.1,2.767248,1.031056,...,689.0,8016.0,8705.0,359.0,8432.0,751.0,8047.0,8.5,8798.0,617.0
1009,AL,Blount,4.0,5.4,5.4,6.1,9.8,3.2,4.670525,1.122677,...,1562.0,23325.0,24887.0,849.0,25780.0,1716.0,23244.0,6.9,24960.0,1503.0
1011,AL,Bullock,4.9,6.9,7.9,8.8,11.8,9.4,5.027548,1.825061,...,447.0,4331.0,4778.0,345.0,3308.0,491.0,4245.0,10.4,4736.0,418.0
1013,AL,Butler,5.5,6.9,7.6,8.5,13.6,6.2,5.713928,-1.145797,...,941.0,8206.0,9147.0,560.0,8539.0,1049.0,8039.0,11.5,9088.0,792.0
1015,AL,Calhoun,4.9,6.6,7.0,8.0,11.4,3.9,-5.177356,1.209856,...,4242.0,44163.0,48405.0,2152.0,52709.0,4397.0,45252.0,8.9,49649.0,3773.0
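
One detail worth checking is that ``.set_index()`` returns a new DataFrame rather than modifying the original.  A minimal sketch on made-up data (the names ``toy`` and ``indexed`` are hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
toy = pd.DataFrame({"FIPS": [1001, 1003], "County": ["Autauga", "Baldwin"]})

indexed = toy.set_index("FIPS")  # the result must be assigned to be kept

print(toy.index.name)                # None: the original is untouched
print(indexed.index.name)            # FIPS
print(indexed.loc[1003, "County"])   # label-based lookup by FIPS code
```

Assigning the result back to the same name (for example ``toy = toy.set_index("FIPS")``) is the usual way to make the change stick.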


Let's dig into the summary statistics a bit.  The DataFrame has 3278 rows, but that can't be the number of counties, since a row for "United States" appears when using ``.head()``.  Let's figure out how many rows there are for each state and see if there is anything else we don't want in the dataset.  The ``.value_counts()`` method is useful here, and we can apply it to the ``State`` column.

In [17]:
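df.count  # without parentheses this only displays the bound method; df.count() would return the non-null count per column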

<bound method DataFrame.count of        FIPS State         County  UnempRate2017  UnempRate2016  UnempRate2015  \
0         0    US  United States            NaN            NaN            NaN   
1      1000    AL        Alabama            4.4            5.9            6.1   
2      1001    AL        Autauga            3.9            5.1            5.2   
3      1003    AL        Baldwin            4.0            5.4            5.5   
4      1005    AL        Barbour            5.9            8.4            8.9   
...     ...   ...            ...            ...            ...            ...   
3273  72145    PR      Vega Baja           12.4           13.9           13.8   
3274  72147    PR        Vieques           13.7           10.6           11.3   
3275  72149    PR       Villalba           19.6           20.2           19.8   
3276  72151    PR        Yabucoa           16.4           16.9           17.5   
3277  72153    PR          Yauco           17.3           18.8           18.

In [70]:
state_counts = df.State.value_counts()  # number of rows per state code, largest first
print(state_counts)


TX    255
GA    160
VA    135
KY    121
MO    116
KS    106
IL    103
NC    101
IA    100
TN     96
NE     94
IN     93
OH     89
MN     88
MI     84
MS     83
PR     79
OK     78
AR     76
WI     73
PA     68
FL     68
AL     68
SD     67
CO     65
LA     65
NY     63
CA     59
MT     57
WV     56
ND     54
SC     47
ID     45
WA     40
OR     37
AK     34
NM     34
UT     30
MD     25
WY     24
NJ     22
NV     18
ME     17
AZ     16
VT     15
MA     15
NH     11
CT      9
HI      6
RI      6
DE      4
DC      2
US      1
Name: State, dtype: int64
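
As a small illustration of what ``.value_counts()`` does, here is a sketch on a made-up Series of state codes (the data and the name ``toy_states`` are assumptions, not Atlas values):

```python
import pandas as pd

# Hypothetical state codes, including a stray 'US' entry.
toy_states = pd.Series(["TX", "TX", "GA", "US", "TX", "GA"], name="State")

counts = toy_states.value_counts()  # sorted by count, descending
print(counts)
```

The result is itself a Series indexed by the distinct values, so a single count can be looked up with ``counts["TX"]``.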


In addition to 'US', it looks like DC and Puerto Rico (PR) are in our data.  Filter all three out, since none of them is technically a state.  The ``!=`` operator means "not equal".

In [140]:
print(df.shape)
newindx = df["State"] != 'DC'  # NOTE: this mask drops only the DC rows; the 'US' and 'PR' rows are still present
newdataset = df[newindx]
print(newdataset)


(3278, 62)
       FIPS State         County  UnempRate2017  UnempRate2016  UnempRate2015  \
0         0    US  United States            NaN            NaN            NaN   
1      1000    AL        Alabama            4.4            5.9            6.1   
2      1001    AL        Autauga            3.9            5.1            5.2   
3      1003    AL        Baldwin            4.0            5.4            5.5   
4      1005    AL        Barbour            5.9            8.4            8.9   
...     ...   ...            ...            ...            ...            ...   
3273  72145    PR      Vega Baja           12.4           13.9           13.8   
3274  72147    PR        Vieques           13.7           10.6           11.3   
3275  72149    PR       Villalba           19.6           20.2           19.8   
3276  72151    PR        Yabucoa           16.4           16.9           17.5   
3277  72153    PR          Yauco           17.3           18.8           18.9   

      UnempRate2
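
When several codes need to be excluded at once, one option is ``.isin()`` combined with ``~`` (logical NOT).  The sketch below uses a made-up frame; ``toy``, ``not_states``, and the county labels are hypothetical.

```python
import pandas as pd

# Hypothetical rows: two real states plus the three non-state codes.
toy = pd.DataFrame({
    "State": ["US", "AL", "DC", "PR", "TX"],
    "County": ["United States", "County A", "County B", "County C", "County D"],
})

not_states = ["US", "DC", "PR"]
keep = ~toy["State"].isin(not_states)  # True where the row is a real state
states_only = toy[keep]

# Equivalent mask built from != comparisons joined with & (element-wise AND).
keep_chained = (
    (toy["State"] != "US") & (toy["State"] != "DC") & (toy["State"] != "PR")
)

print(states_only)
```

Running ``states_only["State"].value_counts()`` afterwards is a quick way to confirm the unwanted codes are gone.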

Print the value counts again to make sure they were taken out correctly before proceeding.

With the dataset restricted to counties in the 50 states, let's look at unemployment.  In particular, we want to focus on the 2017 unemployment rate, which is stored in the ``UnempRate2017`` column.  Let's pull it out into its own Series and assign it to ``UnempRate``.  Recall that the DataFrame's index is preserved when a column is selected.

In [141]:
UnempRate = newdataset.UnempRate2017
print(UnempRate)

0        NaN
1        4.4
2        3.9
3        4.0
4        5.9
        ... 
3273    12.4
3274    13.7
3275    19.6
3276    16.4
3277    17.3
Name: UnempRate2017, Length: 3276, dtype: float64
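
To see what "the index is maintained" means in practice, here is a minimal sketch with made-up values (``toy``, ``subset``, and ``rates`` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with the default 0..3 index.
toy = pd.DataFrame({
    "State": ["US", "AL", "DC", "TX"],
    "UnempRate2017": [float("nan"), 4.4, 6.1, 4.3],
})

subset = toy[toy["State"] != "DC"]  # drops the row labelled 2
rates = subset["UnempRate2017"]     # bracket access; subset.UnempRate2017 also works

print(rates.index.tolist())  # the labels skip 2 instead of being renumbered
```

Because the labels survive filtering, a value in the Series can always be traced back to its row in the original DataFrame.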


Let's see if there are any missing values in our Series.  First, we can use ``.isna()`` to identify Null values with a Boolean flag (True or False) and then we can count them with ``.sum()``.  You can use these individually or chain them together.

In [142]:
mNaN = UnempRate.isna()
print('Total Missing Values:', mNaN.sum())


Total Missing Values: 7
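
The two-step and chained forms described above give the same count.  A short sketch on made-up values (the Series ``rates`` is hypothetical):

```python
import pandas as pd

# Hypothetical rates with two missing entries.
rates = pd.Series([4.4, float("nan"), 3.9, float("nan")])

flags = rates.isna()                     # Boolean Series: True where missing
n_missing = flags.sum()                  # True counts as 1, False as 0
n_missing_chained = rates.isna().sum()   # same thing in one expression

print(n_missing, n_missing_chained)
```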


Okay, it looks like we have 7.  To handle these null values we could do something like filling them with the average value, but let's just drop them with ``.dropna()``.

In [143]:
UnempRate1 = UnempRate.dropna()
print('Total Missing Values:', UnempRate1.isna().sum())

Total Missing Values: 0
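
For comparison, the two options mentioned above (dropping versus filling with the average) can be sketched on made-up data.  The Series ``rates`` and its index labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rates with one missing value, labelled 10..12.
rates = pd.Series([4.0, float("nan"), 6.0], index=[10, 11, 12])

dropped = rates.dropna()             # removes the row labelled 11
filled = rates.fillna(rates.mean())  # .mean() skips NaN by default, so this fills with 5.0

print(dropped)
print(filled)
```

Both calls return new Series; ``rates`` itself still contains the missing value.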


With this cleaned dataset, print the maximum unemployment rate, the minimum unemployment rate, and the average unemployment rate.

In [144]:
print('Maximum unemployment rate is:', UnempRate1.max())
print('Minimum unemployment rate is:', UnempRate1.min())
print('Average unemployment rate is:', UnempRate1.mean())

Maximum unemployment rate is: 20.1
Minimum unemployment rate is: 1.6
Average unemployment rate is: 4.816488222698073
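
As an aside, ``.describe()`` reports these statistics (and a few more) in one call.  A minimal sketch on a made-up Series (``rates`` is hypothetical):

```python
import pandas as pd

# Hypothetical unemployment rates.
rates = pd.Series([1.6, 4.4, 20.1])

summary = rates.describe()  # count, mean, std, min, quartiles, max
print(summary)
print(summary["min"], summary["max"], summary["mean"])
```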


Let's see which counties actually have the maximum and minimum values.  We can do this by matching on the column value.  For bonus points, try to print out only the county and state.

In [151]:
print('The county with maximum unemployment rate is:',df[df['UnempRate2017'] == UnempRate1.max()][['County', 'State', 'UnempRate2017']])


The county with maximum unemployment rate is:       County State  UnempRate2017
85  Kusilvak    AK           20.1


In [153]:
print('The counties with minimum unemployment rate are:', df[df['UnempRate2017'] == UnempRate1.min()][['County', 'State', 'UnempRate2017']])
print()

The counties with minimum unemployment rate are:     County State  UnempRate2017
259   Baca    CO            1.6
318   Yuma    CO            1.6
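
Another way to approach the bonus is ``.idxmax()`` together with ``.loc``; the Boolean-mask approach remains useful when there are ties.  The sketch below uses a made-up frame, and ``toy`` and the county labels are hypothetical.

```python
import pandas as pd

# Hypothetical data with a tie for the minimum rate.
toy = pd.DataFrame({
    "County": ["County A", "County B", "County C", "County D"],
    "State": ["AK", "CO", "CO", "AL"],
    "UnempRate2017": [20.1, 1.6, 1.6, 4.4],
})

# idxmax returns the index label of the (first) maximum.
max_label = toy["UnempRate2017"].idxmax()
print(toy.loc[max_label, ["County", "State"]])

# idxmin would return only the first tied row, so use a mask to get them all.
is_lowest = toy["UnempRate2017"] == toy["UnempRate2017"].min()
lowest = toy.loc[is_lowest, ["County", "State"]]
print(lowest)
```

Passing a list of column names as the second argument to ``.loc`` restricts the output to just those columns.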



Finally, let's sort the entire DataFrame by each county's unemployment rate (in ascending order).

In [161]:
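UnempRate.sort_values()  # returns a new sorted Series; the result is not assigned, so UnempRate is unchanged
print(UnempRate)         # prints the original, unsorted Series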

0        NaN
1        4.4
2        3.9
3        4.0
4        5.9
        ... 
3273    12.4
3274    13.7
3275    19.6
3276    16.4
3277    17.3
Name: UnempRate2017, Length: 3276, dtype: float64
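
To sort a whole DataFrame by one column, ``.sort_values()`` takes the column name and returns a new, sorted DataFrame.  A minimal sketch on made-up data (``toy`` and ``sorted_toy`` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame, including one missing rate.
toy = pd.DataFrame({
    "County": ["County A", "County B", "County C"],
    "UnempRate2017": [5.9, float("nan"), 3.9],
})

# Ascending is the default; rows with NaN are placed last by default.
sorted_toy = toy.sort_values("UnempRate2017")

print(sorted_toy)
print(toy)  # the original order is unchanged
```

Passing ``ascending=False`` reverses the order, and assigning the result keeps the sorted copy.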
