### Unemployment Data Cleanup 
This notebook cleans and organizes the 2020 unemployment data by state and month. The resulting data frame will be compared against overdose rates to see whether the increase in unemployment (due to COVID-19) correlates with a change in overdose rates.

In [1]:
import pandas as pd
import datetime as dt
import warnings 
warnings.filterwarnings('ignore')

### Initial data exploration
Steps
* Load in dataframe
* How many rows/columns are there?
* Define column labels using the metadata
* What does each row represent? 
* Is there missing data?
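
The row/column count and missing-data questions above can be answered with a few pandas calls. The following is a minimal sketch on a made-up stand-in frame (`toy` and its values are illustrative, not the real file), showing leftover header rows turning up as missing data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw file: two leftover header rows, then two data rows.
toy = pd.DataFrame({
    "Unnamed: 1": [np.nan, np.nan, "Alabama", "Alaska"],
    "Year": ["Labor force", np.nan, "1976", "1976"],
    "Rate": ["rate", np.nan, "6.7", "7.1"],
})

n_rows, n_cols = toy.shape                        # (rows, columns)
missing_per_col = toy.isna().sum()                # NaN count in each column
fully_empty_rows = toy.isna().all(axis=1).sum()   # rows with no data at all

print(n_rows, n_cols)
print(missing_per_col)
print(fully_empty_rows)
```

Run against `unemploy_df`, the same calls would probably flag the first two rows, which look like remnants of the multi-line header.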

In [2]:
unemploy_df = pd.read_csv('../data/data_raw/unemployment_rates_2020.csv', skiprows=5)
unemploy_df


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Year,Month,Unnamed: 4,Unnamed: 5,Unnamed: 6,Total,Percent of population,Total.1,Rate
0,,,Labor force,Labor force,,,,Labor force,,level,rate
1,,,,,,,,,,,
2,1.0,Alabama,1976,01,2605000,1492409,57.3,1392154,53.4,100255,6.7
3,2.0,Alaska,1976,01,232000,159154,68.6,147809,63.7,11345,7.1
4,4.0,Arizona,1976,01,1621000,972413,60.0,872738,53.8,99675,10.3
...,...,...,...,...,...,...,...,...,...,...,...
28564,51.0,Virginia,2020,11,6726645,4286658,63.7,4078503,60.6,208155,4.9
28565,53.0,Washington,2020,11,6137465,3839947,62.6,3610482,58.8,229465,6.0
28566,54.0,West Virginia,2020,11,1436925,769685,53.6,722127,50.3,47558,6.2
28567,55.0,Wisconsin,2020,11,4666517,3121636,66.9,2965087,63.5,156549,5.0


The data goes back to 1976, but I only need 2020, so subset on Year (stored as a string):

In [3]:
year_fil = unemploy_df['Year'] == '2020'
unemploy_df = unemploy_df[year_fil]
unemploy_df

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Year,Month,Unnamed: 4,Unnamed: 5,Unnamed: 6,Total,Percent of population,Total.1,Rate
27986,1.0,Alabama,2020,01,3871113,2247721,58.1,2186652,56.5,61069,2.7
27987,2.0,Alaska,2020,01,544147,346278,63.6,325347,59.8,20931,6.0
27988,4.0,Arizona,2020,01,5777677,3604805,62.4,3442732,59.6,162073,4.5
27989,5.0,Arkansas,2020,01,2348957,1366308,58.2,1318240,56.1,48068,3.5
27990,6.0,California,2020,01,31176732,19509644,62.6,18756375,60.2,753269,3.9
...,...,...,...,...,...,...,...,...,...,...,...
28564,51.0,Virginia,2020,11,6726645,4286658,63.7,4078503,60.6,208155,4.9
28565,53.0,Washington,2020,11,6137465,3839947,62.6,3610482,58.8,229465,6.0
28566,54.0,West Virginia,2020,11,1436925,769685,53.6,722127,50.3,47558,6.2
28567,55.0,Wisconsin,2020,11,4666517,3121636,66.9,2965087,63.5,156549,5.0


In [4]:
unemploy_df.columns

Index(['Unnamed: 0', 'Unnamed: 1', 'Year', 'Month', 'Unnamed: 4', 'Unnamed: 5',
       'Unnamed: 6', 'Total', 'Percent of population', 'Total.1', 'Rate'],
      dtype='object')

In [5]:
head_dict = {'Unnamed: 0': 'FIPS_code',
            'Unnamed: 1': 'State',
            'Year': 'Year',
            'Month':'Month',
            'Unnamed: 4':'civ_pop',
            'Unnamed: 5': 'civ_labor_force_tot',
            'Unnamed: 6':'%_pop',
            'Total': 'Employed',
            'Percent of population': '%_employed',
            'Total.1': 'Unemployed',
            'Rate': '%_unemployed'}
employ_rates = unemploy_df.rename(columns = head_dict)

Rename column labels using BLS metadata:
* FIPS_code: geographical code - this includes states and regions; filter regions 
* State: state name; also includes some regions 
* civ_pop: civilian population, 16+ years, not institutionalized and not employed by the armed forces 
* civ_labor_force_tot: civilian labor force; members of civ_pop counted as either employed or unemployed 
* %_pop: percentage of civ_pop that is part of the civilian labor force 
* Employed: total number of employed people 
* %_employed: percentage of civ_pop that is employed 
* Unemployed: total number of unemployed people 
* %_unemployed: unemployment rate; percentage of civ_labor_force_tot that is unemployed 
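
A quick arithmetic check against the Alabama, January 2020 row suggests which denominator each percentage column uses. This is only a sketch: the `pct_*` names are illustrative, and the numbers are copied from the 2020 subset shown earlier.

```python
# Alabama, January 2020, copied from the 2020 subset output.
civ_pop = 3871113
civ_labor_force_tot = 2247721
employed = 2186652
unemployed = 61069

# Labor force participation and employment are shares of the civilian population.
pct_pop = round(100 * civ_labor_force_tot / civ_pop, 1)
pct_employed = round(100 * employed / civ_pop, 1)

# The unemployment rate is a share of the labor force, not of the population.
pct_unemployed = round(100 * unemployed / civ_labor_force_tot, 1)

print(pct_pop, pct_employed, pct_unemployed)
```

The three results match the published 58.1, 56.5 and 2.7 for that row, and Employed plus Unemployed adds up to the civilian labor force.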

In [6]:
cols_to_use = ['State','Year','Month','Employed','%_employed','Unemployed','%_unemployed']
# .copy() makes the subset independent, so later column assignments do not write to a view
employ_rates2 = employ_rates[cols_to_use].copy()
employ_rates2

Unnamed: 0,State,Year,Month,Employed,%_employed,Unemployed,%_unemployed
27986,Alabama,2020,01,2186652,56.5,61069,2.7
27987,Alaska,2020,01,325347,59.8,20931,6.0
27988,Arizona,2020,01,3442732,59.6,162073,4.5
27989,Arkansas,2020,01,1318240,56.1,48068,3.5
27990,California,2020,01,18756375,60.2,753269,3.9
...,...,...,...,...,...,...,...
28564,Virginia,2020,11,4078503,60.6,208155,4.9
28565,Washington,2020,11,3610482,58.8,229465,6.0
28566,West Virginia,2020,11,722127,50.3,47558,6.2
28567,Wisconsin,2020,11,2965087,63.5,156549,5.0


Steps:
1. Combine Year and Month into a single date and convert to datetime 
2. Filter by date - only need data through May to align with the other dfs 
3. Remove non-state rows (county, city, District of Columbia) 
4. Remove commas in the Employed/Unemployed cols and save as int so that I can do calculations 
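
For reference, the four steps can be sketched as one function. This is an illustration under assumptions, not the notebook's actual code: `clean_employment` is a hypothetical helper, the toy rows are made up apart from the Alabama January counts, and the counts are assumed to arrive as comma-formatted strings.

```python
import pandas as pd

def clean_employment(df, last_month="2020-05-01"):
    """Hypothetical helper: apply the four cleanup steps to a renamed frame."""
    out = df.copy()
    # Step 1: build a datetime from the string Year and Month columns.
    out["Date"] = pd.to_datetime(out["Year"] + out["Month"], format="%Y%m")
    # Steps 2 and 3: keep rows up to the cutoff month and drop non-state rows.
    non_states = ["Los Angeles County", "District of Columbia", "New York city"]
    keep = (out["Date"] <= pd.Timestamp(last_month)) & ~out["State"].isin(non_states)
    out = out.loc[keep].copy()
    # Step 4: strip thousands separators and convert the counts to integers.
    for col in ["Employed", "Unemployed"]:
        out[col] = out[col].str.replace(",", "", regex=False).astype(int)
    return out.reset_index(drop=True)

# Made-up rows: one valid state row, one city row, one row past the cutoff.
toy = pd.DataFrame({
    "State": ["Alabama", "New York city", "Alabama"],
    "Year": ["2020", "2020", "2020"],
    "Month": ["01", "01", "06"],
    "Employed": ["2,186,652", "3,900,000", "2,000,000"],
    "Unemployed": ["61,069", "150,000", "160,000"],
})
cleaned = clean_employment(toy)
print(cleaned)
```

Only the Alabama January row survives: the city row is dropped in step 3 and the June row in step 2.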

Step 1: Reformat date

In [7]:
employ_rates2['Date'] = pd.to_datetime(employ_rates2['Year'] + employ_rates2['Month'], format = '%Y%m')
employ_rates2 = employ_rates2.set_index(employ_rates2['Date'])

Step 2: Filter by date 

In [8]:
employment_rate_2020 = employ_rates2.loc['2020-01-01':'2020-05-01']
employment_rate_2020

Date,State,Year,Month,Employed,%_employed,Unemployed,%_unemployed,Date
2020-01-01,Alabama,2020,01,2186652,56.5,61069,2.7,2020-01-01
2020-01-01,Alaska,2020,01,325347,59.8,20931,6.0,2020-01-01
2020-01-01,Arizona,2020,01,3442732,59.6,162073,4.5,2020-01-01
2020-01-01,Arkansas,2020,01,1318240,56.1,48068,3.5,2020-01-01
2020-01-01,California,2020,01,18756375,60.2,753269,3.9,2020-01-01
...,...,...,...,...,...,...,...,...
2020-05-01,Virginia,2020,05,3916764,58.4,389546,9.0,2020-05-01
2020-05-01,Washington,2020,05,3351584,55.0,593883,15.1,2020-05-01
2020-05-01,West Virginia,2020,05,679272,47.2,100757,12.9,2020-05-01
2020-05-01,Wisconsin,2020,05,2726588,58.6,376650,12.1,2020-05-01


Step 3: Filter out non-state rows

In [9]:
employment_rate_2020['State'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Los Angeles County', 'Colorado', 'Connecticut', 'Delaware',
       'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'New York city', 'North Carolina', 'North Dakota', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
       'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin',
       'Wyoming'], dtype=object)

In [10]:
state_filter = employment_rate_2020['State'].isin(['Los Angeles County','District of Columbia','New York city'])
# Use ~ to invert a boolean mask; unary minus on booleans fails in recent numpy/pandas
employment_rate_2020 = employment_rate_2020[~state_filter]
employment_rate_2020['State'].nunique()

50

Step 4: Reformat numbers 

In [11]:
type(employment_rate_2020['Employed'])

pandas.core.series.Series
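
`type()` on a column returns `Series` whatever the column holds, so it does not show whether the values are text or numbers. A minimal sketch of checking the dtype instead, assuming the counts are comma-formatted strings (the `counts` series is illustrative):

```python
import pandas as pd

# Illustrative stand-in for the Employed column before cleaning.
counts = pd.Series(["2,186,652", "325,347"], name="Employed")

print(type(counts))    # a Series regardless of what it contains
print(counts.dtype)    # the element dtype; text shows up as object or a string dtype

# True only for integer, float or other numeric dtypes.
is_numeric = pd.api.types.is_numeric_dtype(counts)

# Strip the thousands separators, then convert to integers.
as_int = counts.str.replace(",", "", regex=False).astype(int)
print(is_numeric, as_int.dtype)
```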

In [12]:
employment_rate_2020['Employed'] = employment_rate_2020['Employed'].replace(',','', regex=True)
employment_rate_2020['Unemployed'] = employment_rate_2020['Unemployed'].replace(',','', regex=True)

In [13]:
employment_rate_2020

Date,State,Year,Month,Employed,%_employed,Unemployed,%_unemployed,Date
2020-01-01,Alabama,2020,01,2186652,56.5,61069,2.7,2020-01-01
2020-01-01,Alaska,2020,01,325347,59.8,20931,6.0,2020-01-01
2020-01-01,Arizona,2020,01,3442732,59.6,162073,4.5,2020-01-01
2020-01-01,Arkansas,2020,01,1318240,56.1,48068,3.5,2020-01-01
2020-01-01,California,2020,01,18756375,60.2,753269,3.9,2020-01-01
...,...,...,...,...,...,...,...,...
2020-05-01,Virginia,2020,05,3916764,58.4,389546,9.0,2020-05-01
2020-05-01,Washington,2020,05,3351584,55.0,593883,15.1,2020-05-01
2020-05-01,West Virginia,2020,05,679272,47.2,100757,12.9,2020-05-01
2020-05-01,Wisconsin,2020,05,2726588,58.6,376650,12.1,2020-05-01


In [14]:
employment_rate_2020['Employed'] = employment_rate_2020['Employed'].astype(int)
employment_rate_2020['Unemployed'] = employment_rate_2020['Unemployed'].astype(int)

In [15]:
employment_rate_2020.to_csv('../data/data_clean/employment_rate_2020.csv')
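
When the cleaned file is loaded again, the date index comes back as plain text unless it is parsed explicitly. Below is a small round-trip sketch that uses an in-memory buffer in place of the real path; the `frame` contents are illustrative.

```python
import io
import pandas as pd

# Small illustrative frame with a datetime index named Date.
frame = pd.DataFrame(
    {"State": ["Alabama", "Alaska"], "Employed": [2186652, 325347]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-01-01"], name="Date"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; parse_dates=True parses it as datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored.index.dtype)
print(restored)
```

Because the frame saved above keeps Date both as the index and as a regular column, the real file will likely hold two Date columns, and pandas would rename the second one on load.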