# City of Boston Payroll Analysis - Part 1: Data Cleaning
Salary data is available for City of Boston employees from at least 2011 to 2017.  Roughly 20,000 employees and salaries are classified by department, job title, “regular” or “overtime”, zip code and a few other criteria.

In [1]:
# import modules
import numpy as np
import pandas as pd
import re, os, glob
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-colorblind')
sns.set()

## Read data and apply basic fixes

In [2]:
# read file(s)
path = os.getcwd()                    
all_files = glob.glob(os.path.join(path, "employee*.csv"))  

column_names = ['name', 'department', 'title', 'regular', 'retro', 'other', 'overtime', 'injured',\
                'detail', 'quinn', 'total', 'zipcode']
df_from_each_file = (pd.read_csv(f, encoding = "ISO-8859-1", header=0, names=column_names, \
                                 index_col=None).assign(year=f) for f in all_files)  
earnings   = pd.concat(df_from_each_file, ignore_index=True)
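Each row is tagged here with the full path of its source file, and the year is parsed out of that path in the next cell. As a hedged alternative, the year could be taken from the file name alone, which avoids picking up digits from parent directories. The helper name `year_from_path` and the example path are hypothetical:

```python
import os
import re

def year_from_path(path):
    """Return the last four-digit run in the file name, or None if there is none."""
    matches = re.findall(r'\d{4}', os.path.basename(path))
    return matches[-1] if matches else None

# the directory name contains a digit, but only the file name is searched
print(year_from_path('/home/user/data2/employee-earnings-report-2014.csv'))
```

With a helper like this, `.assign(year=year_from_path(f))` would store the year directly instead of the path.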

A complete set of employee names, departments, titles and total earnings is available for the years 2011-2017. Some fields are blank because the employee had no earnings in that category. The numeric columns are read as strings rather than numbers because of the empty cells and the currency formatting ("$" and thousands separators). Eight zip codes are missing.

In [3]:
# extract year from file name
earnings['year'] = earnings['year'].replace({r'\D': ''}, regex=True)  # drop all non-digit characters
earnings['year'] = earnings['year'].str[-4:]  # keep the last four digits, ignoring digits earlier in the file path

# switch "department" and "title" columns for 2013 and 2014
earnings.loc[earnings.year.isin(['2013', '2014']),['department','title']] = \
        earnings.loc[earnings.year.isin(['2013', '2014']),['title','department']].values

# ignore zip "+4" and add leading zero
earnings.zipcode = np.where(earnings.zipcode.str.len() == 4, '0' + earnings.zipcode, earnings.zipcode)
earnings['zipcode'] = earnings['zipcode'].str[:5]
 
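# missing zipcodes: sort by name so an employee's rows are adjacent, then forward-fill
# the zip code only (filling every column would also overwrite blank earnings fields)
earnings = earnings.sort_values(by='name')
earnings['zipcode'] = earnings['zipcode'].ffill()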

# convert number strings to numeric dtype
number_columns = ['regular', 'retro', 'other', 'overtime', 'injured', 'detail', 'quinn', 'total']
earnings[number_columns] = earnings[number_columns].replace({r'\$': '', ',': ''}, regex=True)\
                                .apply(pd.to_numeric, errors='coerce').fillna(0, axis=1)
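The zip code and currency fixes above can be sanity-checked on a small hand-made frame before trusting them on the full data. This is a minimal sketch; the sample rows are invented for illustration and are not taken from the payroll files:

```python
import pandas as pd

# Hypothetical sample rows, not taken from the payroll files
sample = pd.DataFrame({
    'zipcode': ['2124', '02131-1234', '02136'],
    'regular': ['$1,234.50', None, '$80,000.00'],
})

# pad four-digit zip codes with a leading zero, then drop any "+4" suffix
zips = sample['zipcode'].astype(str)
zips = zips.where(zips.str.len() != 4, '0' + zips)
sample['zipcode'] = zips.str[:5]

# strip "$" and "," before converting; blank cells become 0
sample['regular'] = pd.to_numeric(
    sample['regular'].str.replace(r'[$,]', '', regex=True), errors='coerce'
).fillna(0)

print(sample)
```

A check like this makes it easy to confirm that blank earnings become zero rather than inheriting a neighbouring row's value.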

## Consolidate department names

There are roughly 1,400 unique job titles per year, a plausible number. The number of department names, however, jumps sharply between 2013 and 2014, from 51 to 239, whereas the city's website lists 72 departments. The jump appears to be due to a detailed breakdown of the public school department.

In [4]:
print('Number of job titles:', earnings.groupby('year')['title'].nunique())
print('')
print('Number of departments:',  earnings.groupby('year')['department'].nunique())

Number of job titles: year
2011    1390
2012    1411
2013    1418
2014    1400
2015    1419
2016    1448
2017    1444
Name: title, dtype: int64

Number of departments: year
2011     49
2012     49
2013     51
2014    239
2015    233
2016    228
2017    227
Name: department, dtype: int64


In [5]:
# combine all school departments into a single "Boston Public Schools" department

earnings['dept_clean'] = earnings['department'] # all others to stay the same

earnings['dept_clean'] = np.where(earnings.department.astype(str).str[:3] == 'BPS', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-3:] == 'K-8', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-3:] == 'EEC', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-3:] == 'ELC', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-6:] == 'Middle', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-6:] == 'School', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-7:] == 'Academy', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[-10:] == 'Elementary', 'Boston Public Schools', earnings.dept_clean)
earnings['dept_clean'] = np.where(earnings.department.astype(str).str[:10] == 'Asst Super', 'Boston Public Schools', earnings.dept_clean)

bps = ['Greenwood, E Leadership Acad', 'UP Academy Dorchester',\
       'UP "Unlocking Potential" Acad', 'Lyon Pilot High 9-12', 'Ellison/Parks EES',\
       'Chief Academic Officer', 'UP Academy Holland', 'Achievement Gap', \
       'BPS Boston Comm Leadership Ac', 'English Language Learn', 'Haley Pilot',\
       'Greater Egleston High', 'Early Learning Services', 'Career & Technical Ed', \
       'Teaching & Learning', 'Unified Student Svc', 'Superintendent',\
       'Student Support Svc', 'Harbor High', 'Fam & Student Engagemt', \
       'Enrollment Services', 'Food & Nutrition Svc', 'HPEC: Com Acd Science & Health',\
       'Institutional Advancemt', 'Legal Advisor', 'Professional Developmnt', \
       'Chief Operating Officer', 'Research Assess & Eval', 'Info & Instr Technology',\
       'BTU Pilot', 'Boston Collaborative High Sch', 'Diplomas Plus', 'Chief Financial Officer']  
earnings['dept_clean'] = np.where(earnings.department.isin(bps), 'Boston Public Schools', earnings.dept_clean)

print('Number of departments:',  earnings.groupby('year')['dept_clean'].nunique())
print('Unique department names (all years): ', earnings.dept_clean.nunique())

Number of departments: year
2011    49
2012    49
2013    51
2014    53
2015    51
2016    53
2017    55
Name: dept_clean, dtype: int64
Unique department names (all years):  71
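The prefix and suffix rules above could also be expressed as a single regular expression, which keeps the rule list in one place. The sketch below is an assumption about how that might look, not the notebook's method; the department names in `dept` are hypothetical apart from 'Boston Police Department':

```python
import re
import pandas as pd

# Hypothetical department names; only the prefixes and suffixes come from the rules above
dept = pd.Series(['BPS Example Office', 'Example K-8', 'Example Elementary',
                  'Asst Superintendent-Example', 'Boston Police Department', None])

prefixes = ('BPS', 'Asst Super')
suffixes = ('K-8', 'EEC', 'ELC', 'Middle', 'School', 'Academy', 'Elementary')

# match any prefix at the start of the name, or any suffix at the end
starts = '|'.join(re.escape(p) for p in prefixes)
ends = '|'.join(re.escape(s) for s in suffixes)
pattern = rf'^(?:{starts})|(?:{ends})$'

is_school = dept.fillna('').str.contains(pattern, regex=True)
dept_clean = dept.where(~is_school, 'Boston Public Schools')
print(dept_clean)
```

Names that match no rule, including missing values, are left unchanged, which mirrors the "all others to stay the same" default.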


Departments that appear in 2017 but not in 2016:

In [6]:
print(set(earnings.dept_clean.loc[earnings.year == '2017'].unique()) - \
      set(earnings.dept_clean.loc[earnings.year == '2016'].unique()))

{'Chief of Staff', 'Advancement & Ext. Affairs'}


Further consolidation of departments would be difficult without more information from the city. Following the Pareto principle, the focus below is on the largest departments.

In [7]:
print(earnings.dept_clean.value_counts().nlargest(20)) # all years combined

Boston Public Schools             89690
Boston Police Department          21607
Boston Fire Department            11932
Boston Cntr - Youth & Families     3970
Boston Public Library              3628
Public Works Department            3250
Parks Department                   2083
Inspectional Services Dept         1625
Traffic Division                   1442
Property Management                1380
Neighborhood Development           1139
Transportation Department          1109
Dpt of Innovation & Technology     1051
Boston City Council                 839
Workers Compensation Service        703
Assessing Department                635
Elderly Commission                  546
ASD Human Resources                 406
Law Department                      375
Transportation-Parking Clerk        356
Name: dept_clean, dtype: int64
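One way to make the Pareto idea concrete is to compute the cumulative share of rows covered by the largest departments. A minimal sketch, with made-up labels standing in for `earnings.dept_clean`; the 80% threshold is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical department labels standing in for earnings.dept_clean
dept_clean = pd.Series(['A'] * 6 + ['B'] * 3 + ['C'])

counts = dept_clean.value_counts()           # sorted from largest to smallest
cum_share = counts.cumsum() / counts.sum()   # running fraction of all rows

# number of departments needed to cover at least 80% of the rows
n_needed = int((cum_share < 0.8).sum()) + 1
print(cum_share)
print('Departments needed for 80% coverage:', n_needed)
```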


## Consolidate Job Titles

A closer look at the job titles shows that some cleaning would help here as well. Fortunately the school department titles look reasonably clean; however, the police, fire and library departments could use some work. The Pareto principle is applied again to limit the investigation to the most common job titles. (Note: the code below combines all years.)

### Police Department Titles

In [8]:
# consolidate police department titles
earnings['title_clean'] = earnings['title'] # all others to stay the same
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:9] == 'Police Of', 'Police Officer', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:9] == 'Police Se', 'Police Sergeant', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:9] == 'PoliceSer', 'Police Sergeant', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:9] == 'Police Li', 'Police Lieutenant', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:9] == 'Police Ca', 'Police Captain', earnings.title_clean)


police_titles = earnings.title_clean.loc[(earnings.dept_clean == 'Boston Police Department')]
print('Number of unique job titles in police department:', len(set(police_titles)))
print(police_titles.value_counts().nlargest(10))

Number of unique job titles in police department: 166
Police Officer                  10951
Police Detective                 2063
Police Sergeant                  2046
School Traffic Supv              1360
Police Lieutenant                 565
CommunEquipOp III, R-13 (CT)      422
Police Clerk And Typist           376
Police Dispatcher                 283
Communic. EquipOp II 9II(SS)      253
Head Clerk & Secretary            209
Name: title_clean, dtype: int64
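The repeated `np.where` lines follow one pattern: if a title starts with a given prefix, replace it with a consolidated name. A hedged sketch of the same idea driven by a dictionary is shown below; the helper name `consolidate_by_prefix` and the raw titles in `raw` are hypothetical, while the prefix map mirrors the police rules above:

```python
import pandas as pd

# prefix -> consolidated title, mirroring the police rules above
police_prefixes = {
    'Police Of': 'Police Officer',
    'Police Se': 'Police Sergeant',
    'PoliceSer': 'Police Sergeant',
    'Police Li': 'Police Lieutenant',
    'Police Ca': 'Police Captain',
}

def consolidate_by_prefix(titles, prefix_map):
    """Replace every title that starts with a known prefix; leave the rest unchanged."""
    cleaned = titles.copy()
    for prefix, clean in prefix_map.items():
        mask = titles.fillna('').str.startswith(prefix)
        cleaned = cleaned.mask(mask, clean)
    return cleaned

# Hypothetical raw titles for illustration
raw = pd.Series(['Police Officer Bicycle', 'PoliceSergeant/Det', 'Police Detective', None])
result = consolidate_by_prefix(raw, police_prefixes)
print(result)
```

Keeping the rules in a dictionary makes duplicates easy to spot and lets the same helper be reused for the fire department prefixes.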


### Fire Department Titles

In [9]:
# consolidate fire department titles
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:2] == 'FF', 'Fire Fighter', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'Fire Fi', 'Fire Fighter', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'FireFig', 'Fire Fighter', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'Fire Ca', 'Fire Captain', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:6] == 'Fire L', 'Fire Lieutenant', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:6] == 'FireLi', 'Fire Lieutenant', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'Distric', 'District Fire Chief', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'Dist Fi', 'District Fire Chief', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'DistFCh', 'District Fire Chief', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'Dep Fir', 'Dep Fire Chief', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[:7] == 'DepFire', 'Dep Fire Chief', earnings.title_clean)
earnings['title_clean'] = np.where((earnings.title.astype(str).str[:10] == 'Sr Admin A') \
                                   & (earnings.dept_clean == 'Boston Fire Department'), 'Sr Admin (Fire)', earnings.title_clean)


fd_titles = earnings.title_clean.loc[(earnings.dept_clean == 'Boston Fire Department')]
print('Number of unique job titles in fire department:', len(set(fd_titles)))
print(fd_titles.value_counts().nlargest(10))

Number of unique job titles in fire department: 104
Fire Fighter                      7971
Fire Lieutenant                   1605
Fire Captain                       545
District Fire Chief                390
Fire Alarm Operator                144
Sr Admin (Fire)                    122
Dep Fire Chief                     117
Head Clerk                          77
Hvy Mtr Equip Repairperson BFD      45
Sr Fire Alarm Operator              43
Name: title_clean, dtype: int64


### Library Titles

In [10]:
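# consolidate library titles
# note: the 'Librarian' rule on the third line overrides the two tier-specific rules before it,
# so every title containing 'Librarian' ends up in the single 'Librarian' category
earnings['title_clean'] = np.where(earnings.title.astype(str).str[-6:] == 'rian I', 'Librarian I', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str[-7:] == 'rian II', 'Librarian II', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Librarian'), 'Librarian', earnings.title_clean)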
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Spec Library Asst'), 'Spec Library Asst', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Sr Library Asst'), 'Sr Library Asst', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Special Library'), 'Spec Library Asst', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Spec Collection L'), 'Librarian', earnings.title_clean)
earnings['title_clean'] = np.where(earnings.title.astype(str).str.contains('Collection Libr'), 'Librarian', earnings.title_clean)

bpl_titles = earnings.title_clean.loc[(earnings.dept_clean == 'Boston Public Library')]
print('Number of unique job titles in Boston Public Library:', len(set(bpl_titles)))
print(bpl_titles.value_counts().nlargest(10))

Number of unique job titles in Boston Public Library: 149
Librarian                      781
Spec Library Asst              575
Sr Library Asst                564
Library Aide                   504
Sr Bldg Custodian              154
Jr Building Custodian          100
Prin Library Asst               45
Generalist II                   40
Technical Support Associate     30
Curator-Professional Lib IV     28
Name: title_clean, dtype: int64
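Because the library rules run in sequence, a later rule overwrites an earlier one: any title containing 'Librarian' ends up as 'Librarian', whatever the first two lines assigned. The output above suggests the merged category is what was intended. If separate tiers were wanted instead (an assumption, not something the notebook states), `np.select` gives first-match-wins semantics. A sketch with hypothetical titles:

```python
import numpy as np
import pandas as pd

# Hypothetical raw titles for illustration
title = pd.Series(['Librarian I', 'Childrens Librarian II', 'Sr Library Asst', 'Library Aide'])

# ordered from most specific to most general
conditions = [
    title.str.endswith('rian II'),
    title.str.endswith('rian I'),
    title.str.contains('Librarian'),
]
choices = ['Librarian II', 'Librarian I', 'Librarian']

# np.select uses the first condition that is True for each row;
# rows matching no condition keep their original title
title_clean = pd.Series(
    np.select(conditions, choices, default=title.to_numpy(dtype=object)),
    index=title.index,
)
print(title_clean)
```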


In [11]:
# save combined data to file
earnings.to_csv('earnings.csv')
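
One thing to watch when reloading the saved file in a later step: `pd.read_csv` infers column types, so zip codes come back as integers and lose their leading zero unless a string dtype is requested. A small sketch using an in-memory buffer and a made-up row in place of earnings.csv:

```python
import io
import pandas as pd

# Hypothetical single row standing in for the saved earnings data
df = pd.DataFrame({'name': ['Example, A'], 'zipcode': ['02124'], 'year': ['2017']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# default type inference turns the zip code into an integer
buffer.seek(0)
naive = pd.read_csv(buffer)

# requesting strings preserves the leading zero
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'zipcode': str, 'year': str})

print(naive['zipcode'].iloc[0], typed['zipcode'].iloc[0])
```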