# Basic Pandas and I/O 

To practice using Pandas, we will explore the Atlas of Rural and Small Town America.  The USDA compiles county-level statistics from many different surveys to give a broad socioeconomic picture of counties across the United States.  The Atlas groups these statistics into categories: population demographics, employment, county classifications along the rural-urban continuum, income, and veteran status.  Each category is stored as a separate sheet in an Excel file.  For this lab, we will focus only on the ``Jobs`` sheet.  For a full description of the available data, check out the webpage:

https://www.ers.usda.gov/data-products/atlas-of-rural-and-small-town-america/

Let's first import the Pandas library and load the data using the ``read_excel`` function.  Remember to specify the ``Jobs`` sheet with the ``sheet_name`` argument.

In [1]:
import pandas as pd

df = pd.read_excel('../data/RuralAtlasData18.xlsx', sheet_name='Jobs')

Before doing any analysis, we first need to know what data fields we have and the type of data we will be working with.  Use ``.head()`` to get the first five rows of data.  If you just want to see which columns are available, you can get the column names with ``.columns``.

In [2]:
# use .head on df
df.head()

,FIPS,State,County,UnempRate2017,UnempRate2016,UnempRate2015,UnempRate2014,UnempRate2010,UnempRate2007,PctEmpChange1017,...,NumUnemployed2013,NumEmployed2013,NumCivLaborforce2013,NumUnemployed2007,NumEmployed2007,NumUnemployed2012,NumEmployed2012,UnempRate2012,NumCivLaborForce2012,NumUnemployed2014
0,0,US,United States,,,,,,,,...,,,,,,,,,,
1,1000,AL,Alabama,4.4,5.9,6.1,6.8,10.5,4.0,5.52526,...,156957.0,2017043.0,2174000.0,86485.0,2089127.0,173047.0,2003290.0,8.0,2176337.0,146531.0
2,1001,AL,Autauga,3.9,5.1,5.2,5.8,8.9,3.3,6.303615,...,1605.0,24205.0,25810.0,806.0,23577.0,1779.0,23961.0,6.9,25740.0,1495.0
3,1003,AL,Baldwin,4.0,5.4,5.5,6.1,10.0,3.1,17.032748,...,5654.0,79626.0,85280.0,2560.0,80099.0,6349.0,78065.0,7.5,84414.0,5300.0
4,1005,AL,Barbour,5.9,8.4,8.9,10.5,12.3,6.3,-13.49481,...,931.0,8168.0,9099.0,650.0,9684.0,1079.0,8283.0,11.5,9362.0,932.0
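
As a minimal sketch of ``.head()`` and ``.columns``, the snippet below builds a small stand-in DataFrame.  The name ``toy`` is hypothetical, and only a few values are copied from the preview above; it is not the real ``Jobs`` sheet.

```python
import pandas as pd

# Hypothetical stand-in for the Jobs sheet (a few values copied from the preview).
toy = pd.DataFrame({
    'FIPS': [1001, 1003, 1005],
    'State': ['AL', 'AL', 'AL'],
    'County': ['Autauga', 'Baldwin', 'Barbour'],
    'UnempRate2017': [3.9, 4.0, 5.9],
})

first_two = toy.head(2)            # .head(n) returns the first n rows (default n=5)
column_names = list(toy.columns)   # .columns is an Index; list() gives plain strings

print(first_two)
print(column_names)
```

Note that ``.head()`` is a method and needs parentheses, while ``.columns`` is an attribute and does not.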


This seems like a lot of data.  How much is it exactly?  We can get the size of the DataFrame in terms of rows and columns using ``.shape``.

In [3]:
print('DataFrame shape is:')
# use .shape on df
df.shape

DataFrame shape is:


(3278, 62)
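
Since ``.shape`` is a plain tuple of ``(rows, columns)``, it can be unpacked into two variables.  The sketch below uses a small hypothetical DataFrame (``toy``) rather than the real data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real DataFrame.
toy = pd.DataFrame({
    'State': ['AL', 'AL', 'CO'],
    'UnempRate2017': [3.9, 4.0, 1.6],
})

# .shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_rows, n_cols = toy.shape
print(f'{n_rows} rows x {n_cols} columns')
```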

Pandas has automatically filled in its own index, but the unique identifier for our data is actually the Federal Information Processing Standards (FIPS) county code.  Let's set that as the index using ``.set_index()`` with the column we want specified as an argument.

In [4]:
df = df.set_index('FIPS')
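
To illustrate what ``.set_index()`` does, here is a small sketch on a hypothetical stand-in DataFrame (``toy`` and ``indexed`` are made-up names).  It shows that the method returns a new DataFrame, which is why the cell above assigns the result back to ``df``, and that rows can then be looked up by FIPS code with ``.loc``.

```python
import pandas as pd

# Hypothetical stand-in with a few FIPS codes taken from this lab's outputs.
toy = pd.DataFrame({
    'FIPS': [1001, 8009, 8125],
    'State': ['AL', 'CO', 'CO'],
    'County': ['Autauga', 'Baca', 'Yuma'],
})

# set_index returns a new DataFrame; toy itself is left unchanged.
indexed = toy.set_index('FIPS')

# FIPS is now the index, so it is no longer listed among the columns.
print(list(indexed.columns))

# Label-based lookup: the row whose FIPS code is 8125.
print(indexed.loc[8125, 'County'])
```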

Let's get into the summary statistics a bit.  It looks like there are 3278 counties in our dataset, but that definitely isn't right since there is a row for "United States" that appears when using ``.head()``.  Let's figure out how many counties there are in each state and see if there is anything else we don't want in the dataset.  The ``.value_counts()`` function is useful here and we can apply it to the ``State`` column.

In [5]:
df['State'].value_counts()

TX    255
GA    160
VA    135
KY    121
MO    116
KS    106
IL    103
NC    101
IA    100
TN     96
NE     94
IN     93
OH     89
MN     88
MI     84
MS     83
PR     79
OK     78
AR     76
WI     73
FL     68
AL     68
PA     68
SD     67
CO     65
LA     65
NY     63
CA     59
MT     57
WV     56
ND     54
SC     47
ID     45
WA     40
OR     37
AK     34
NM     34
UT     30
MD     25
WY     24
NJ     22
NV     18
ME     17
AZ     16
VT     15
MA     15
NH     11
CT      9
HI      6
RI      6
DE      4
DC      2
US      1
Name: State, dtype: int64
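
The result of ``.value_counts()`` is itself a Series, indexed by the distinct values and sorted from most to least frequent.  A minimal sketch on a hypothetical Series of state codes (not the real column):

```python
import pandas as pd

# Hypothetical state codes; the real column has thousands of entries.
states = pd.Series(['TX', 'CO', 'TX', 'US', 'CO', 'TX'], name='State')

counts = states.value_counts()   # index = state code, values = number of rows
print(counts)

# Because counts is a Series, a single state's count can be looked up by label.
print(counts['TX'])
```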

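In addition to 'US', it looks like DC and Puerto Rico are in our data.  Filter them out since they aren't technically states.  The ``!=`` operator means "not equal", and ``&`` combines the three conditions so that a row is kept only if it passes all of them.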

In [6]:
df = df[(df['State'] != 'US') & (df['State'] != 'PR') & (df['State'] != 'DC')]

Print the value counts again to make sure they were taken out correctly before proceeding.
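
One way to run this check is sketched below on a hypothetical stand-in DataFrame (``toy``).  It filters with ``.isin()`` and ``~`` (negation), which should be equivalent to chaining the three ``!=`` conditions, and then confirms that none of the excluded codes remain.

```python
import pandas as pd

# Hypothetical stand-in for the State column of the real data.
toy = pd.DataFrame({'State': ['US', 'AL', 'AL', 'PR', 'DC', 'CO']})
excluded = ['US', 'PR', 'DC']

# isin() flags rows whose State is in the list; ~ flips the mask to keep the rest.
filtered = toy[~toy['State'].isin(excluded)]

# Check 1: inspect the remaining value counts by eye.
remaining = filtered['State'].value_counts()
print(remaining)

# Check 2: a single Boolean that is True if any excluded code survived.
still_present = filtered['State'].isin(excluded).any()
print(still_present)
```

On the real data, the equivalent check would be to rerun ``df['State'].value_counts()`` and confirm that ``US``, ``PR``, and ``DC`` no longer appear.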

Now that our dataset only contains records from the 50 states, let's look at unemployment.  (Note that state-level summary rows, such as Alabama with FIPS 1000 in the ``.head()`` output above, are still included alongside the counties.)  In particular, we want to focus on the 2017 unemployment rate, which is stored in the ``UnempRate2017`` column.  Let's pull it out into its own Series and assign it to ``UnempRate``.  Recall that the index is preserved when you select a column.

In [7]:
UnempRate = df['UnempRate2017']
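
A small sketch of what "the index is maintained" means, using a hypothetical two-row DataFrame (``toy``) indexed by FIPS code: the Series selected from it carries the same index, so values can still be looked up by FIPS.

```python
import pandas as pd

# Hypothetical stand-in, already indexed by FIPS code.
toy = pd.DataFrame(
    {'County': ['Baca', 'Yuma'], 'UnempRate2017': [1.6, 1.6]},
    index=pd.Index([8009, 8125], name='FIPS'),
)

rates = toy['UnempRate2017']            # single brackets select one column as a Series

print(type(rates).__name__)             # the selection is a Series, not a DataFrame
print(rates.index.equals(toy.index))    # the Series shares the DataFrame's index
print(rates.loc[8125])                  # look up a rate by FIPS code
```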

Let's see if there are any missing values in our Series.  First, we can use ``.isna()`` to identify Null values with a Boolean flag (True or False) and then we can count them with ``.sum()``.  You can use these individually or chain them together.

In [8]:
print('Total Missing Values:')
UnempRate.isna().sum()

Total Missing Values:


6
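
To see why chaining ``.isna().sum()`` counts missing values, the sketch below runs the two steps separately on a hypothetical Series (``rates``).  The sum works because ``True`` is counted as 1 and ``False`` as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical rates with two missing entries.
rates = pd.Series([3.9, np.nan, 5.9, np.nan])

mask = rates.isna()       # Boolean Series: True where the value is missing
n_missing = mask.sum()    # adding up the Trues gives the number of missing values

print(mask.tolist())
print(n_missing)
```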

Okay, it looks like we have 6.  To handle these Null values we could do something like filling them with the average value, but let's just drop them with ``.dropna()``.

In [9]:
UnempRate = UnempRate.dropna()
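
For comparison, here is a minimal sketch of both options mentioned above, dropping versus filling with the average, on a hypothetical three-value Series.  It assumes the default behavior of ``.mean()``, which skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical rates with one missing entry; the index mimics row labels.
rates = pd.Series([2.0, np.nan, 4.0], index=[10, 20, 30])

# Option 1: drop the missing entry; the remaining labels are kept as they were.
dropped = rates.dropna()

# Option 2: fill the gap with the mean of the non-missing values.
filled = rates.fillna(rates.mean())

print(dropped)
print(filled)
```

Dropping shortens the Series, while filling keeps every row but introduces a value that was never observed, so which one is appropriate depends on the analysis.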

With this cleaned dataset, print the maximum unemployment rate, the minimum unemployment rate, and the average unemployment rate.

In [10]:
print('Maximum unemployment rate is:', UnempRate.max())
print('Minimum unemployment rate is:', UnempRate.min())
print('Average unemployment rate is:', UnempRate.mean())

Maximum unemployment rate is: 20.1
Minimum unemployment rate is: 1.6
Average unemployment rate is: 4.61166144200627
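
If you would rather compute the three statistics in one call, ``.agg()`` accepts a list of function names.  This is a sketch on a hypothetical Series, shown as an alternative and not as the method used in this lab.

```python
import pandas as pd

# Hypothetical cleaned rates (no missing values).
rates = pd.Series([2.0, 4.0, 6.0])

# agg with a list of names returns a Series indexed by those names.
summary = rates.agg(['max', 'min', 'mean'])
print(summary)
```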


Let's actually see which counties have the maximum and minimum values.  We can do this by keeping only the rows whose ``UnempRate2017`` value equals the maximum (or minimum).  For bonus points try to only print out the county and state.

In [11]:
print('The county with maximum unemployment rate is:')
print(df[df['UnempRate2017'] == UnempRate.max()][['County', 'State']])

The county with maximum unemployment rate is:
        County State
FIPS                
2158  Kusilvak    AK


In [12]:
print('The counties with minimum unemployment rate are:')
print(df[df['UnempRate2017'] == UnempRate.min()][['County', 'State']])

The counties with minimum unemployment rate are:
     County State
FIPS             
8009   Baca    CO
8125   Yuma    CO
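
Pandas also offers ``.idxmax()`` and ``.idxmin()``, which return the index label of the extreme value.  The sketch below, on a hypothetical stand-in DataFrame, contrasts them with the Boolean matching used above: ``.idxmin()`` returns only the first label when there is a tie, whereas the Boolean mask keeps every tied row.

```python
import pandas as pd

# Hypothetical stand-in with a tie for the minimum, as in the output above.
toy = pd.DataFrame(
    {
        'County': ['Baca', 'Yuma', 'Kusilvak'],
        'State': ['CO', 'CO', 'AK'],
        'UnempRate2017': [1.6, 1.6, 20.1],
    },
    index=pd.Index([8009, 8125, 2158], name='FIPS'),
)
rates = toy['UnempRate2017']

max_label = rates.idxmax()          # FIPS code of the row with the highest rate
first_min_label = rates.idxmin()    # only the FIRST FIPS code among tied minima

# Boolean matching keeps every row that ties for the minimum.
all_min = toy.loc[rates == rates.min(), ['County', 'State']]

print(max_label, first_min_label)
print(all_min)
```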


Finally, let's sort the entire DataFrame by each county's unemployment rate (in ascending order).

In [13]:
df.sort_values(by=['UnempRate2017'])

FIPS,State,County,UnempRate2017,UnempRate2016,UnempRate2015,UnempRate2014,UnempRate2010,UnempRate2007,PctEmpChange1017,PctEmpChange1617,...,NumUnemployed2013,NumEmployed2013,NumCivLaborforce2013,NumUnemployed2007,NumEmployed2007,NumUnemployed2012,NumEmployed2012,UnempRate2012,NumCivLaborForce2012,NumUnemployed2014
8125,CO,Yuma,1.6,2.1,2.6,3.3,5.6,2.1,21.113739,5.843668,...,225.0,4511.0,4736.0,143.0,6591.0,251.0,4633.0,5.1,4884.0,164.0
8009,CO,Baca,1.6,1.7,2.0,2.8,4.9,2.5,6.104945,2.285992,...,78.0,1817.0,1895.0,60.0,2338.0,91.0,1924.0,4.5,2015.0,53.0
8095,CO,Phillips,1.7,2.0,2.4,3.3,5.3,2.7,22.729457,5.406521,...,98.0,2026.0,2124.0,66.0,2354.0,110.0,2031.0,5.1,2141.0,74.0
38091,ND,Steele,1.7,2.1,2.3,2.2,2.7,2.3,-0.560748,0.757576,...,25.0,1017.0,1042.0,26.0,1093.0,28.0,1021.0,2.7,1049.0,23.0
8017,CO,Cheyenne,1.7,2.2,2.7,2.9,4.1,2.5,12.051793,6.433302,...,41.0,928.0,969.0,34.0,1332.0,46.0,964.0,4.6,1010.0,28.0
38011,ND,Bowman,1.7,2.2,1.9,1.8,2.6,2.0,2.085747,-2.328160,...,40.0,1959.0,1999.0,36.0,1727.0,35.0,1883.0,1.8,1918.0,34.0
38023,ND,Divide,1.7,2.6,1.8,1.5,2.1,3.3,17.872969,-3.096539,...,25.0,1749.0,1774.0,30.0,870.0,27.0,1760.0,1.5,1787.0,28.0
20071,KS,Greeley,1.8,2.1,2.1,2.1,3.4,3.3,8.135169,-3.247480,...,22.0,886.0,908.0,23.0,675.0,27.0,857.0,3.1,884.0,19.0
8061,CO,Kiowa,1.8,2.0,2.7,3.3,6.4,2.6,26.886145,11.580217,...,34.0,730.0,764.0,24.0,887.0,37.0,780.0,4.5,817.0,25.0
8063,CO,Kit Carson,1.8,2.1,2.4,3.0,4.5,2.7,7.923628,1.823914,...,160.0,3921.0,4081.0,120.0,4355.0,188.0,4041.0,4.4,4229.0,131.0
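
Two details of ``.sort_values()`` are worth a quick sketch on a hypothetical stand-in DataFrame: it returns a new, sorted DataFrame and leaves the original untouched, and passing ``ascending=False`` reverses the order so the highest rates come first.

```python
import pandas as pd

# Hypothetical stand-in indexed by FIPS code.
toy = pd.DataFrame(
    {
        'County': ['Autauga', 'Yuma', 'Kusilvak'],
        'UnempRate2017': [3.9, 1.6, 20.1],
    },
    index=pd.Index([1001, 8125, 2158], name='FIPS'),
)

lowest_first = toy.sort_values(by='UnempRate2017')                     # ascending is the default
highest_first = toy.sort_values(by='UnempRate2017', ascending=False)   # reverse the order

print(lowest_first)
print(highest_first)
print(list(toy.index))   # the original row order is unchanged
```

To keep a sorted version around, assign the result to a variable, as done with ``set_index`` earlier.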
