# Citywide Payroll Data (Fiscal Year)

Data is collected because of public interest in how the City’s budget is being spent on salary and overtime pay for all municipal employees. Data is input into the City's Personnel Management System (“PMS”) by the respective user Agencies. Each record contains the following fields for a city employee: Agency, Last Name, First Name, Middle Initial, Agency Start Date, Work Location Borough, Job Title Description, Leave Status as of the close of the FY (June 30th), Base Salary, Pay Basis, Regular Hours Paid, Regular Gross Paid, Overtime Hours worked, Total Overtime Paid, and Total Other Compensation (i.e. lump sum and/or retro payments). This data can be used to analyze how the City's financial resources are allocated and how much of the City's budget is being devoted to overtime. The reader of this data should be aware that increments of salary increases received over the course of any one fiscal year will not be reflected. All that is captured is the employee's final base and gross salary at the end of the fiscal year.

- Data source: NYC OpenData https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e
- Updated: August 10, 2021
- Data provided by: Office of Payroll Administration (OPA)

## Data Preparation

The Citywide Payroll Data (Fiscal Year) dataset has the unique identifier k397-673e. The identifier is part of the dataset URL https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e. The following code downloads the dataset in tab-delimited format and stores it in a gzip-compressed local file called ./k397-673e.tsv.gz.

In [1]:
# Download the full 'Citywide Payroll Data (Fiscal Year)' dataset.
# Note that the downloaded full dataset file is about 578 MB in size!


import gzip
import humanfriendly
import os
from openclean.data.source.socrata import Socrata

dataset = Socrata().dataset("k397-673e")

datafile = './k397-673e.tsv.gz'


# Download file only if it does not exist already.
if not os.path.isfile(datafile):
    with gzip.open(datafile, 'wb') as f:
        print('Downloading ...\n')
        dataset.write(f)


fsize = humanfriendly.format_size(os.stat(datafile).st_size)
print("Using '{}' in file {} of size {}".format(dataset.name, datafile, fsize))

Using 'Citywide Payroll Data (Fiscal Year)' in file ./k397-673e.tsv.gz of size 89.62 MB


In [2]:
# Due to the size of the full dataset file, we make use of openclean's
# stream operator to avoid having to load the dataset into main-memory.

from openclean.pipeline import stream

ds_full = stream(datafile)

In [3]:
# Count the number of records in the dataset.

print(f'{ds_full.count():,} rows.')

3,923,290 rows.


In [4]:
# Print the first ten rows of the dataset to get a first
# idea of the content.

ds_full.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay
0,2020,17,OFFICE OF EMERGENCY MANAGEMENT,BEREZIN,MIKHAIL,,08/10/2015,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
1,2020,17,OFFICE OF EMERGENCY MANAGEMENT,GEAGER,VERONICA,M,09/12/2016,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
2,2020,17,OFFICE OF EMERGENCY MANAGEMENT,RAMANI,SHRADDHA,,02/22/2016,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
3,2020,17,OFFICE OF EMERGENCY MANAGEMENT,ROTTA,JONATHAN,D,09/16/2013,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
4,2020,17,OFFICE OF EMERGENCY MANAGEMENT,WILSON II,ROBERT,P,04/30/2018,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
5,2020,17,OFFICE OF EMERGENCY MANAGEMENT,WASHINGTON,MORIAH,A,03/18/2019,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,87900.95,0.0,0.0,-3202.74
6,2020,17,OFFICE OF EMERGENCY MANAGEMENT,VAZQUEZ,MARGARET,,09/29/2008,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,94415.0,per Annum,1820,84312.72,0.0,0.0,0.0
7,2020,17,OFFICE OF EMERGENCY MANAGEMENT,KRAWCZYK,AMANDA,N,05/15/2017,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,83976.54,0.0,0.0,0.0
8,2020,17,OFFICE OF EMERGENCY MANAGEMENT,MURRELL,JALEESA,S,12/01/2014,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,83877.36,0.0,0.0,0.0
9,2020,17,OFFICE OF EMERGENCY MANAGEMENT,DE LOS SANTOS,JANIRA,,06/05/2017,BROOKLYN,EMERGENCY PREPAREDNESS SPECIALIST,ACTIVE,67676.0,per Annum,1820,66647.77,348.5,16572.64,144.15


In [5]:
# Create a view on a subset of columns in the dataset.
# Choose the attributes that we are interested in.
COLUMNS = [
    'Fiscal Year',
    'Payroll Number',
    'Agency Name',
    'Last Name',
    'First Name',
    'Mid Init',
    'Agency Start Date',
    'Work Location Borough',
    'Title Description',
    'Leave Status as of June 30',
    'Base Salary',
    'Pay Basis',
    'Regular Hours',
    'Regular Gross Paid',
    'OT Hours',
    'Total OT Paid',
    'Total Other Pay'
]

ds = ds_full.select(columns=COLUMNS)

## Data Profiling

Data profiling is an important first step in many data analytics efforts. Profiling helps users to gain an understanding of the data properties and to uncover data quality flaws. openclean supports a variety of different data profiling operators that can also be used to generate metadata about the data at hand.

We can use the default column profiler to compute basic statistics such as the number of distinct values, missing values, etc. for each of the columns in our dataset. In the example shown below we profile the full dataset view (all 3,923,290 rows) rather than a sample. The result is a list of profiling results (dictionaries). A summary of the results can then be accessed as a data frame using the stats() method.

In [6]:
# Profile the resulting dataset view using the default data profiler.

from openclean.profiling.column import DefaultColumnProfiler

profiles = ds.profile(default_profiler=DefaultColumnProfiler)

In [7]:
# Print overview of profiling results.

profiles.stats()

Unnamed: 0,total,empty,distinct,uniqueness,entropy
Fiscal Year,3923290,0,7,2e-06,2.805614
Payroll Number,3923290,1745440,157,7.2e-05,4.286506
Agency Name,3923290,0,165,4.2e-05,4.365925
Last Name,3923290,2031,157080,0.040059,14.264455
First Name,3923290,2033,88232,0.022501,11.611521
Mid Init,3923290,1596166,43,1.8e-05,4.073274
Agency Start Date,3923290,63,14933,0.003806,11.097847
Work Location Borough,3923290,506226,22,6e-06,1.507244
Title Description,3923290,84,1802,0.000459,6.207524
Leave Status as of June 30,3923290,0,5,1e-06,0.710495
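
To make the columns of this summary concrete, here is a minimal sketch in plain Python. It assumes that uniqueness is the number of distinct values divided by the number of non-empty cells and that entropy is the Shannon entropy in bits; both assumptions are consistent with the numbers shown above, but the helper `column_stats` is hypothetical and is not part of openclean.

```python
import math
from collections import Counter


def column_stats(counts, empty=0):
    """Summarize a column from its value frequencies (hypothetical helper).

    counts: Counter mapping each non-empty value to its frequency.
    empty:  number of empty cells in the column.
    """
    non_empty = sum(counts.values())
    distinct = len(counts)
    uniqueness = None
    entropy = None
    if non_empty:
        # Assumed definition: distinct values relative to non-empty cells.
        uniqueness = distinct / non_empty
        # Shannon entropy (in bits) of the value distribution.
        entropy = -sum(
            (c / non_empty) * math.log2(c / non_empty)
            for c in counts.values()
        )
    return {
        'total': non_empty + empty,
        'empty': empty,
        'distinct': distinct,
        'uniqueness': uniqueness,
        'entropy': entropy,
    }


# Frequencies of 'Leave Status as of June 30' as listed later in this notebook.
leave_status = Counter({
    'ACTIVE': 3355483,
    'CEASED': 485414,
    'ON LEAVE': 42401,
    'SEASONAL': 33451,
    'ON SEPARATION LEAVE': 6541,
})
stats = column_stats(leave_status)
print(stats)
```

A low entropy such as the one for 'Leave Status as of June 30' indicates that a single value (here ACTIVE) dominates the column.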


In [8]:
# Print the most frequent data type for each column.

print('Schema\n------')
for col in ds.columns:
    p = profiles.column(col)
    print("  '{}' ({})".format(col, p['datatypes']['distinct'].most_common(1)[0][0]))

Schema
------
  'Fiscal Year' (int)
  'Payroll Number' (int)
  'Agency Name' (str)
  'Last Name' (str)
  'First Name' (str)
  'Mid Init' (str)
  'Agency Start Date' (date)
  'Work Location Borough' (str)
  'Title Description' (str)
  'Leave Status as of June 30' (str)
  'Base Salary' (float)
  'Pay Basis' (str)
  'Regular Hours' (float)
  'Regular Gross Paid' (float)
  'OT Hours' (float)
  'Total OT Paid' (float)
  'Total Other Pay' (float)


In [9]:
# Print the minimum and maximum value for column 'Fiscal Year'

profiles.minmax('Fiscal Year')

Unnamed: 0,min,max
int,2014,2020


In [10]:
# Print the minimum and maximum value for column 'Payroll Number'

profiles.minmax('Payroll Number')

Unnamed: 0,min,max
int,2,996


In [11]:
# Print the minimum and maximum value for column 'Agency Start Date'

profiles.minmax('Agency Start Date')

Unnamed: 0,min,max
date,1901-01-01,9999-12-31 00:00:00


This looks like a data quality problem: an agency start date cannot lie in the future.

In [12]:
# Print the minimum and maximum value for column 'Base Salary'

profiles.minmax('Base Salary')

Unnamed: 0,min,max
float,0.01,414707.0


In [13]:
# Print the minimum and maximum value for column 'Regular Hours'

profiles.minmax('Regular Hours')

Unnamed: 0,min,max
int,-1260.0,4160.0
float,-730.43,4171.43


The minimum is negative, which is unexpected for regular hours.

In [14]:
# Print the minimum and maximum value for column 'Regular Gross Paid'

profiles.minmax('Regular Gross Paid')

Unnamed: 0,min,max
float,-76223.05,672308.86


The minimum is negative here as well, which is unexpected for gross pay.

In [15]:
# Print the minimum and maximum value for column 'OT Hours'

profiles.minmax('OT Hours')

Unnamed: 0,min,max
int,-209.0,3147.0
float,-66.5,3347.5


Overtime hours also contain negative values.

In [16]:
# Print the minimum and maximum value for column 'Total OT Paid'

profiles.minmax('Total OT Paid')

Unnamed: 0,min,max
float,-26493.88,237389.73


Total overtime pay contains negative values, too.

In [17]:
# Print the minimum and maximum value for column 'Total Other Pay'

profiles.minmax('Total Other Pay')

Unnamed: 0,min,max
float,-281595.04,650000.0


In [18]:
# Print the most frequent values in column 'Agency Name'

profiles.column('Agency Name').get('topValues')

[('DEPT OF ED PEDAGOGICAL', 758360),
 ('DEPT OF ED PER SESSION TEACHER', 608565),
 ('POLICE DEPARTMENT', 367745),
 ('DEPT OF ED PARA PROFESSIONALS', 245259),
 ('BOARD OF ELECTION POLL WORKERS', 235235),
 ('DEPT OF ED HRLY SUPPORT STAFF', 164165),
 ('FIRE DEPARTMENT', 128819),
 ('DEPT OF PARKS & RECREATION', 117212),
 ('DEPARTMENT OF EDUCATION ADMIN', 110936),
 ('HRA/DEPT OF SOCIAL SERVICES', 104331)]

In [19]:
# Print the most frequent values in column 'Work Location Borough'

profiles.column('Work Location Borough').get('topValues')

[('MANHATTAN', 2394979),
 ('QUEENS', 379695),
 ('BROOKLYN', 323565),
 ('BRONX', 177881),
 ('OTHER', 83688),
 ('RICHMOND', 46156),
 ('WESTCHESTER', 3417),
 ('ULSTER', 1953),
 ('Manhattan', 1622),
 ('Bronx', 935)]

In [20]:
# Print the most frequent values in column 'OT Hours'

profiles.column('OT Hours').get('topValues')

[('0', 2923200),
 ('1', 8563),
 ('2', 6718),
 ('8', 6652),
 ('4', 5161),
 ('3', 4871),
 ('5', 4064),
 ('7', 3842),
 ('6', 3408),
 ('16', 2956)]

## Looking for data quality issues

### Mid Init

In [21]:
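# Get the distinct values for column 'Mid Init' and print them in
# decreasing order of frequency.
mid_inits = ds.distinct('Mid Init')
for rank, val in enumerate(mid_inits.most_common()):
    init, freq = val
    print(f'{rank + 1:<3} {init}  {freq:>10,}')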

1      1,596,166
2   A     348,499
3   M     330,479
4   J     210,681
5   L     173,074
6   E     143,374
7   R     128,939
8   C     120,154
9   S     116,505
10  D     113,179
11  P      77,843
12  T      70,156
13  N      61,680
14  B      60,147
15  F      56,090
16  G      55,596
17  K      52,989
18  H      39,774
19  V      37,053
20  I      35,227
21  W      34,275
22  Y      24,374
23  O      23,080
24  Z       5,464
25  U       3,631
26  X       2,231
27  Q       2,181
28  .         131
29  -          86
30  1          71
31  x          39
32  2          30
33  0          26
34  `          19
35  6          13
36  /          10
37  5           6
38  3           4
39  8           4
40  9           4
41  "           2
42  4           2
43  =           1
44  (           1


Besides uppercase letters, the column contains a lowercase letter, digits, and unexpected symbols.
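
The kinds of values in this column can be tallied with a small classifier. The following is a minimal sketch in plain Python; the function `classify_initial` and the category names are hypothetical, and the sample uses only a few (value, frequency) pairs copied from the listing above.

```python
from collections import Counter


def classify_initial(value):
    """Assign a middle-initial value to a coarse category (hypothetical)."""
    if value == '':
        return 'empty'
    if len(value) == 1 and 'A' <= value <= 'Z':
        return 'uppercase letter'
    if len(value) == 1 and 'a' <= value <= 'z':
        return 'lowercase letter'
    if value.isdigit():
        return 'digit'
    return 'symbol'


# A few (value, frequency) pairs taken from the listing above.
sample = [('', 1596166), ('A', 348499), ('x', 39),
          ('1', 71), ('.', 131), ('`', 19)]

summary = Counter()
for value, freq in sample:
    summary[classify_initial(value)] += freq

print(summary)
```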

### Agency Name (kNN cluster method)
We try several kNN clustering configurations to find data problems, using LevenshteinDistance, HammingDistance, and JaroSimilarity as similarity functions. Lowering the similarity threshold (the pred parameter) slightly surfaces more problems, although it also produces a few clusters of valid values that we do not want to merge.
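
The effect of the threshold can be illustrated without openclean. The sketch below is a simplified stand-in, not openclean's implementation: it assumes the edit distance is normalized by the length of the longer string, and the helpers `levenshtein`, `similarity`, and `similar_pairs` are hypothetical names.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similar_pairs(values, threshold):
    """Return all pairs whose similarity is greater than the threshold."""
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if similarity(a, b) > threshold:
                pairs.append((a, b))
    return pairs


names = ['BOARD OF CORRECTION', 'BOARD OF CORRECTIONS',
         'POLICE DEPARTMENT', 'FIRE DEPARTMENT']

print(similar_pairs(names, 0.85))
print(similar_pairs(names, 0.75))
```

With the stricter threshold only the two spellings of the Board of Correction are paired. With the lower threshold, POLICE DEPARTMENT and FIRE DEPARTMENT are paired as well, which is the kind of unwanted match described above.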

In [22]:
agency_names = ds.select('Agency Name').distinct()
#clusters = KeyCollision(func=Fingerprint()).clusters(agency_names)

In [23]:
# Cluster agency names using the kNN clusterer (with the default n-gram
# setting) and the Jaro similarity as the similarity measure. The
# Levenshtein and Hamming alternatives are left commented out below.
# Keep only clusters that contain at least `minsize` distinct values.

from openclean.cluster.knn import knn_clusters
from openclean.function.similarity.base import SimilarityConstraint
from openclean.function.similarity.text import LevenshteinDistance
from openclean.function.similarity.text import HammingDistance
from openclean.function.similarity.text import JaroSimilarity
from openclean.function.similarity.text import StringSimilarityFunction
from openclean.function.value.threshold import GreaterThan

# Minimum cluster size. Use two so that every pair of similar
# values is reported.
minsize = 2

clusters = knn_clusters(
    values=agency_names,
    #sim=SimilarityConstraint(func=LevenshteinDistance(), pred=GreaterThan(0.85)),
    #sim=SimilarityConstraint(func=HammingDistance(), pred=GreaterThan(0.85)),
    sim=SimilarityConstraint(func=JaroSimilarity(), pred=GreaterThan(0.9)),
    minsize=minsize
)

print('{} clusters of size {} or greater'.format(len(clusters), minsize))

13 clusters of size 2 or greater


In [24]:
# Define simple helper method to print the k largest clusters.

def print_k_clusters(clusters, k=5):
    clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
    val_count = sum([len(c) for c in clusters])
    print('Total number of clusters is {} with {} values'.format(len(clusters), val_count))
    for i in range(min(k, len(clusters))):
        print('\nCluster {}'.format(i + 1))
        for key, cnt in clusters[i].items():
            if key == '':
                key = "''"
            print(f'  {key} (x {cnt})')

In [25]:
print_k_clusters(clusters,13)

Total number of clusters is 13 with 77 values

Cluster 1
  BROOKLYN COMMUNITY BOARD #2 (x 27)
  BROOKLYN COMMUNITY BOARD #3 (x 21)
  BROOKLYN COMMUNITY BOARD #4 (x 24)
  BROOKLYN COMMUNITY BOARD #5 (x 30)
  BROOKLYN COMMUNITY BOARD #6 (x 24)
  BROOKLYN COMMUNITY BOARD #7 (x 22)
  BROOKLYN COMMUNITY BOARD #8 (x 22)
  BROOKLYN COMMUNITY BOARD #9 (x 20)
  BROOKLYN COMMUNITY BOARD #10 (x 30)
  BROOKLYN COMMUNITY BOARD #11 (x 24)
  BROOKLYN COMMUNITY BOARD #12 (x 23)
  BROOKLYN COMMUNITY BOARD #13 (x 28)
  BROOKLYN COMMUNITY BOARD #14 (x 21)
  BROOKLYN COMMUNITY BOARD #15 (x 24)
  BROOKLYN COMMUNITY BOARD #16 (x 22)
  BROOKLYN COMMUNITY BOARD #17 (x 28)
  BROOKLYN COMMUNITY BOARD #18 (x 22)
  BROOKLYN COMMUNITY BOARD #1 (x 22)

Cluster 2
  QUEENS COMMUNITY BOARD #2 (x 28)
  QUEENS COMMUNITY BOARD #3 (x 30)
  QUEENS COMMUNITY BOARD #4 (x 22)
  QUEENS COMMUNITY BOARD #5 (x 28)
  QUEENS COMMUNITY BOARD #6 (x 29)
  QUEENS COMMUNITY BOARD #7 (x 32)
  QUEENS COMMUNITY BOARD #8 (x 36)
  QUEENS COM

Cluster 8 contains a typo.

### Agency Name (key collision method)

In [26]:
# Cluster agency names using key collision (with the default key generator).
# Keep only clusters that contain at least `minsize` distinct values.
# Use multiple threads (4) to generate value keys in parallel.

from openclean.cluster.key import key_collision

# Minimum cluster size. Use two so that every pair of colliding
# values is reported.
minsize = 2

clusters = key_collision(values=agency_names, minsize=minsize, threads=4)

print('{} clusters of size {} or greater'.format(len(clusters), minsize))

1 clusters of size 2 or greater


In [27]:
print_k_clusters(clusters)

Total number of clusters is 1 with 2 values

Cluster 1
  POLICE DEPARTMENT (x 367745)
  Police Department (x 55619)


The same agency appears with inconsistent capitalization (all uppercase vs. title case).
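
Key collision groups values that share the same normalized key. The sketch below shows the idea with a simplified fingerprint (lowercase, strip punctuation, sort the unique tokens). It is an illustration only: `fingerprint` and `key_collision_clusters` are hypothetical helpers, and openclean's default key generator may normalize values differently.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: case-, punctuation-, and order-insensitive."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans('', '', string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return ' '.join(tokens)


def key_collision_clusters(counts, minsize=2):
    """Group values with identical keys; keep groups of at least minsize."""
    groups = defaultdict(dict)
    for value, freq in counts.items():
        groups[fingerprint(value)][value] = freq
    return [group for group in groups.values() if len(group) >= minsize]


# Frequencies copied from the outputs shown earlier in this notebook.
agency_counts = {
    'POLICE DEPARTMENT': 367745,
    'Police Department': 55619,
    'FIRE DEPARTMENT': 128819,
}

print(key_collision_clusters(agency_counts))
```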

### Agency Start Date
We look for impossible dates: dates after fiscal year 2020 and dates with an invalid month or day.

In [28]:
import calendar

agency_start_dates = ds.distinct('Agency Start Date')

for dt, freq in agency_start_dates.most_common():
    if dt == '':
        continue
    # Dates are formatted as MM/DD/YYYY; parse the parts once.
    month, day, year = (int(part) for part in dt.split('/'))
    if year > 2020:
        # Start date after the last fiscal year in the dataset.
        print(dt)
    elif month < 1 or month > 12:
        # Month outside the valid range.
        print(dt)
    elif day < 1 or day > calendar.monthrange(year, month)[1]:
        # Day does not exist in the given month (handles leap years).
        print(dt)


12/31/9999
10/16/2049


Both values are dates in the future.

### Work Location Borough (key_collision)

In [29]:
# Get set of distinct values for column 'Work Location Borough'. Print the
# values in decreasing order of frequency.

wlb_counts = ds.distinct('Work Location Borough')
for rank, val in enumerate(wlb_counts.most_common()):
    wlb, freq = val
    print(f'{rank + 1:<3} {wlb}  {freq:>10,}')

1   MANHATTAN   2,394,979
2        506,226
3   QUEENS     379,695
4   BROOKLYN     323,565
5   BRONX     177,881
6   OTHER      83,688
7   RICHMOND      46,156
8   WESTCHESTER       3,417
9   ULSTER       1,953
10  Manhattan       1,622
11  Bronx         935
12  SULLIVAN         822
13  Queens         660
14  DELAWARE         551
15  NASSAU         245
16  PUTNAM         243
17  SCHOHARIE         175
18  DUTCHESS         140
19  Richmond         112
20  ALBANY          95
21  GREENE          61
22  WASHINGTON DC          47
23  ORANGE          22


In [30]:
# Cluster work location values using 'Key Collision' clustering with the
# default fingerprint key generator.

from openclean.cluster.key import KeyCollision
from openclean.function.value.key.fingerprint import Fingerprint

work_locations = ds.select('Work Location Borough').distinct()
clusters = KeyCollision(func=Fingerprint()).clusters(work_locations)

In [31]:
print_k_clusters(clusters)

Total number of clusters is 4 with 8 values

Cluster 1
  BRONX (x 177881)
  Bronx (x 935)

Cluster 2
  MANHATTAN (x 2394979)
  Manhattan (x 1622)

Cluster 3
  QUEENS (x 379695)
  Queens (x 660)

Cluster 4
  RICHMOND (x 46156)
  Richmond (x 112)


The same borough appears with inconsistent capitalization (all uppercase vs. title case).
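
Clusters like these can be turned into a replacement mapping by choosing the most frequent spelling as the canonical one. This is a minimal sketch of that majority rule in plain Python; `majority_mapping` is a hypothetical helper, and the cluster contents are copied from the output above.

```python
def majority_mapping(cluster):
    """Map every value in a cluster to the cluster's most frequent value."""
    canonical = max(cluster, key=cluster.get)
    return {value: canonical for value in cluster if value != canonical}


# Clusters as printed above: value -> frequency.
borough_clusters = [
    {'BRONX': 177881, 'Bronx': 935},
    {'MANHATTAN': 2394979, 'Manhattan': 1622},
    {'QUEENS': 379695, 'Queens': 660},
    {'RICHMOND': 46156, 'Richmond': 112},
]

mapping = {}
for cluster in borough_clusters:
    mapping.update(majority_mapping(cluster))

print(mapping)
```

Here the majority rule gives the same result as converting every value to uppercase, which is the fix applied in the cleaning section.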

### Title Description (kNN clusters)

In [32]:
title = ds.select('Title Description').distinct()

In [33]:
# Minimum cluster size. Use two so that every pair of similar
# values is reported.
minsize = 2

clusters = knn_clusters(
    values=title,
    #sim=SimilarityConstraint(func=LevenshteinDistance(), pred=GreaterThan(0.85)),
    sim=SimilarityConstraint(func=HammingDistance(), pred=GreaterThan(0.85)),
    #sim=SimilarityConstraint(func=JaroSimilarity(), pred=GreaterThan(0.9)),
    minsize=minsize
)

print('{} clusters of size {} or greater'.format(len(clusters), minsize))

39 clusters of size 2 or greater


In [34]:
print_k_clusters(clusters,39)

Total number of clusters is 39 with 83 values

Cluster 1
  NON-TEACHING ADJUNCT III (x 3419)
  NON-TEACHING ADJUNCT I (x 8407)
  NON-TEACHING ADJUNCT V (x 1047)
  NON-TEACHING ADJUNCT IV (x 1279)
  NON-TEACHING ADJUNCT II (x 3774)

Cluster 2
  EDUCATIONAL ADMINISTRATOR UFT (x 58)
  EDUCATIONAL ADMINISTRATOR (x 1056)
  EDUCATIONAL ADMINISTRATOR CSA (x 6281)

Cluster 3
  SUPERVISOR II (x 1503)
  SUPERVISOR I (x 3390)
  SUPERVISOR III (x 490)

Cluster 4
  ADM MANAGER-NON-MGR (x 39)
  ADM MANAGER-NON-MGRL (x 5406)

Cluster 5
  INTELLIGENCE RESEARCH MANAGER (x 6)
  INTELLIGENCE RESEARCH MANAGER-PD (x 39)

Cluster 6
  HEALTH SERVICES MANAGER NON MANAGERIAL LEVEL II (x 92)
  HEALTH SERVICES MANAGER NON MANAGERIAL LEVEL I (x 301)

Cluster 7
  ADM SCHOOL SECURITY MANAGER-U (x 3)
  ADM SCHOOL SECURITY MANAGER (x 15)

Cluster 8
  AGENCY ATTORNEY INTERN (x 110)
  AGENCY ATTORNEY INTERNE (x 698)

Cluster 9
  DIRECTOR OF MANAGEMENT PLANNING (x 3)
  DIRECTOR OF MANAGEMENT PLANNING SS (x 23)

Cluster 

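The clusters reveal quite a few likely typos (e.g. Cluster 4 and Cluster 8), alongside legitimate title variants such as the numbered levels in Cluster 1.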

### Leave Status as of June 30

In [35]:
# Get set of distinct values for column 'Leave Status as of June 30'. Print the
# values in decreasing order of frequency.

leave_statuses = ds.distinct('Leave Status as of June 30')
for rank, val in enumerate(leave_statuses.most_common()):
    status, freq = val
    print(f'{rank + 1:<3} {status}  {freq:>10,}')

1   ACTIVE   3,355,483
2   CEASED     485,414
3   ON LEAVE      42,401
4   SEASONAL      33,451
5   ON SEPARATION LEAVE       6,541


### Pay Basis

In [36]:
# Get set of distinct values for column 'Pay Basis'. Print the
# values in decreasing order of frequency.

pay_bases = ds.distinct('Pay Basis')
for rank, val in enumerate(pay_bases.most_common()):
    basis, freq = val
    print(f'{rank + 1:<3} {basis}  {freq:>10,}')

1   per Annum   2,340,656
2   per Day     864,431
3   per Hour     699,600
4   Prorated Annual      18,603


### Data issues we found so far:

1. Empty values in 'Last Name' and 'First Name'
2. Unexpected characters in 'Mid Init' (digits, symbols, lowercase letters)
3. 'Agency Name': inconsistent values (Police Department vs. POLICE DEPARTMENT, BOARD OF CORRECTION vs. BOARD OF CORRECTIONS)
4. 'Agency Start Date': impossible values 12/31/9999 and 10/16/2049
5. Negative values in 'OT Hours' and 'Regular Hours'
6. 'Work Location Borough': inconsistent capitalization (e.g. MANHATTAN vs. Manhattan)
7. 'Title Description': many typos

## Data Cleaning

So far, we have found several problems in this dataset.
It is now time to clean the data.

In [37]:
from openclean.data.load import dataset

ds = dataset('./k397-673e.tsv.gz')


### Empty values in 'Last Name', 'First Name'
We replace empty values in 'Last Name' and 'First Name' with 'UNKNOWN'.

In [38]:
# Define a callable function to update the target name.
def emptyname_to_unknown(x):
    if x == '':
        return 'UNKNOWN'
    else:
        return x

In [39]:
from openclean.operator.transform.update import update

ds = update(ds, columns='Last Name', func = emptyname_to_unknown)
ds = update(ds, columns='First Name', func = emptyname_to_unknown)

In [40]:
# Use the filter() operator to check whether the data was updated.
# Note: Python's `and` cannot combine openclean predicates (the expression
# `a and b` evaluates to `b` here), so we filter on 'First Name' explicitly.

from openclean.operator.transform.filter import filter
from openclean.function.eval.base import Col

filtering = filter(ds, predicate=Col('First Name') == 'UNKNOWN')
filtering.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay
147570,2020,300,BOARD OF ELECTION POLL WORKERS,PARVATI,UNKNOWN,,01/01/2010,MANHATTAN,ELECTION WORKER,ACTIVE,1.0,per Hour,0,353.0,0.0,0.0,0.0
528088,2020,901,DISTRICT ATTORNEY-MANHATTAN,UNKNOWN,UNKNOWN,,05/06/1991,MANHATTAN,CHIEF RACKETS INVESTIGATOR,ACTIVE,188568.0,per Annum,1820,186592.22,0.0,0.0,4991.97
528216,2020,901,DISTRICT ATTORNEY-MANHATTAN,UNKNOWN,UNKNOWN,,01/26/2004,MANHATTAN,SENIOR RACKETS INVESTIGATOR - START >4-24-08 N...,ACTIVE,84965.0,per Annum,2080,84096.14,661.5,41351.02,16524.03
528222,2020,901,DISTRICT ATTORNEY-MANHATTAN,UNKNOWN,UNKNOWN,,02/14/2005,MANHATTAN,ASSISTANT CHIEF RACKET INVESTIGATOR,ACTIVE,139018.0,per Annum,1820,137310.11,0.0,0.0,3500.0
528238,2020,901,DISTRICT ATTORNEY-MANHATTAN,UNKNOWN,UNKNOWN,,07/05/2005,MANHATTAN,SENIOR RACKETS INVESTIGATOR - START >4-24-08 N...,ACTIVE,84818.0,per Annum,2080,84626.68,526.5,32764.5,16901.62


### Unexpected characters in 'Mid Init'
Previously, we found some unexpected values here: numbers, symbols, and lowercase letters. Since a middle initial should be a single uppercase letter, we change all numbers and symbols in this attribute to 'UNKNOWN' and convert lowercase letters to uppercase. Empty values also become 'UNKNOWN'.

In [41]:
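def midemptyname_to_unknown(x):
    # Keep letters (converted to uppercase); everything else, including
    # empty values, digits, and symbols, becomes 'UNKNOWN'.
    if x.isalpha():
        return x.upper()
    else:
        return "UNKNOWN"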

In [42]:
ds = update(ds, columns='Mid Init', func = midemptyname_to_unknown)

In [43]:
filtering = filter(ds, predicate=Col('Mid Init') == 'UNKNOWN')
filtering.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay
0,2020,17,OFFICE OF EMERGENCY MANAGEMENT,BEREZIN,MIKHAIL,UNKNOWN,08/10/2015,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
2,2020,17,OFFICE OF EMERGENCY MANAGEMENT,RAMANI,SHRADDHA,UNKNOWN,02/22/2016,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,86005.0,per Annum,1820,84698.21,0.0,0.0,0.0
6,2020,17,OFFICE OF EMERGENCY MANAGEMENT,VAZQUEZ,MARGARET,UNKNOWN,09/29/2008,BROOKLYN,EMERGENCY PREPAREDNESS MANAGER,ACTIVE,94415.0,per Annum,1820,84312.72,0.0,0.0,0.0
9,2020,17,OFFICE OF EMERGENCY MANAGEMENT,DE LOS SANTOS,JANIRA,UNKNOWN,06/05/2017,BROOKLYN,EMERGENCY PREPAREDNESS SPECIALIST,ACTIVE,67676.0,per Annum,1820,66647.77,348.5,16572.64,144.15
14,2020,17,OFFICE OF EMERGENCY MANAGEMENT,WILLIAMS,LATOYA,UNKNOWN,02/16/2005,BROOKLYN,EMERGENCY PREPAREDNESS SPECIALIST,ACTIVE,73403.0,per Annum,1820,72287.7,217.75,9636.04,258.44


### 'Agency Start Date' wrong value : 12/31/9999, 10/16/2049
Since an agency start date cannot be in the future, we replace these dates with an empty string ('').

In [44]:
def agency_start_date(x):
    if x == '12/31/9999':
        return ''
    elif x == '10/16/2049':
        return ''
    else:
        return x

In [45]:
ds = update(ds, columns='Agency Start Date', func = agency_start_date)

In [46]:
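from openclean.function.eval.logic import Or

# Combine the predicates with Or; Python's `or` would keep only the first one.
filtering = filter(ds, predicate=Or(
    Col('Agency Start Date') == '12/31/9999',
    Col('Agency Start Date') == '10/16/2049'
))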
filtering.head()

# no data shown here, proving that the wrong values are truly updated.

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay


### Change the negative values in 'Regular Hours' and 'OT Hours' to ''
Hours worked should not be negative, so we replace negative values with an empty string.

In [47]:
def Regular_Hours(x):
    if float(x) < 0:
        return ''
    else:
        return x
    
def OT_Hours(x):
    if float(x) < 0:
        return ''
    else:
        return x

In [48]:
ds = update(ds, columns='Regular Hours', func = Regular_Hours)
ds = update(ds, columns='OT Hours', func = OT_Hours)

In [63]:
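# Python's `or` cannot combine openclean predicates (the expression
# `a or b` evaluates to `a` here), so we filter on 'Regular Hours' explicitly.
filtering = filter(ds, predicate=Col('Regular Hours') == '')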
filtering.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay
2921,2020,25,LAW DEPARTMENT,ST CYR,ANDREW,C,10/22/2018,MANHATTAN,SENIOR STUDENT LEGAL SPECIALIST,CEASED,49157.0,per Annum,,-711.63,0,0.0,0.0
57539,2020,56,POLICE DEPARTMENT,MEDINA,JOSE,M,12/20/1998,MANHATTAN,SCHOOL SAFETY AGENT,ON LEAVE,50207.0,per Annum,,24806.12,0,0.0,252.79
61546,2020,56,POLICE DEPARTMENT,GIBSON,DARRELL,UNKNOWN,12/20/1998,MANHATTAN,SCHOOL SAFETY AGENT,CEASED,48745.0,per Annum,,12910.02,0,122.13,133.64
61719,2020,56,POLICE DEPARTMENT,EZZELL,LATISHA,N,09/25/2006,MANHATTAN,SCHOOL SAFETY AGENT,CEASED,48745.0,per Annum,,10633.94,0,0.0,1804.54
61733,2020,56,POLICE DEPARTMENT,RYAN,ALLAN,J,07/01/2002,MANHATTAN,SCHOOL SAFETY AGENT,CEASED,48745.0,per Annum,,12435.6,0,0.0,-45.45
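
The update functions above call float(x) directly, which raises a ValueError if a cell is empty or not numeric. A more defensive variant is sketched below; `blank_if_negative` is a hypothetical helper, and whether such cells occur in these columns is not verified here.

```python
def blank_if_negative(value):
    """Return '' for negative numbers; leave every other value unchanged."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Empty or non-numeric cells are passed through untouched.
        return value
    return '' if number < 0 else value


for raw in ['1820', '-1260.0', '0', '', '-66.5']:
    print(repr(raw), '->', repr(blank_if_negative(raw)))
```

Because the same rule applies to both columns, a single function like this could be passed to update() for 'Regular Hours' and for 'OT Hours'.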


### 'Agency Name' value : police department, board of correction
1. Convert names to uppercase: "Police Department" -> "POLICE DEPARTMENT".
2. Resolve the inconsistency between "BOARD OF CORRECTIONS" and "BOARD OF CORRECTION". We change "BOARD OF CORRECTIONS" to "BOARD OF CORRECTION" because more records use "BOARD OF CORRECTION".

In [64]:
def agency_name(x):
    if x != '':
        x = x.upper()
    if x == 'BOARD OF CORRECTIONS':
        x ='BOARD OF CORRECTION'
    return x

In [65]:
ds = update(ds, columns='Agency Name', func = agency_name)

In [66]:
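from openclean.function.eval.logic import Or

# Combine the predicates with Or; Python's `or` would keep only the first one.
filtering = filter(ds, predicate=Or(
    Col('Agency Name') == 'BOARD OF CORRECTIONS',
    Col('Agency Name') == 'Police Department'
))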
filtering.head()

# no data : All updated 

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay


### 'Work Location Borough'
Convert all values to uppercase and replace empty values with 'UNKNOWN'.

In [67]:
def work_location_borough(x):
    if x == '':
        return "UNKNOWN"
    else:
        return x.upper()

In [68]:
ds = update(ds, columns='Work Location Borough', func = work_location_borough)

In [69]:
filtering = filter(ds, predicate = Col('Work Location Borough') == 'Manhattan' )
filtering.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay


### 'Title Description'
Fix the typos found by the clustering step.

In [70]:
# Map each misspelled or inconsistent title to its replacement.
TITLE_FIXES = {
    'SERGEANT D/A SPECIAL ASSIGNMENT': 'SERGEANT-D/A SPECIAL ASSIGNMENT',
    'SECRETARY TO ONE DEPUTY COMMISSIONER': 'SECRETARY TO THE DEPUTY COMMISSIONER',
    'SECRETARY TO COMMISSIONER': 'SECRETARY OF COMMISSIONER',
    'SENIOR SYSTEMS ANALYST': 'SENIOR SYSTEMS ANALYSTS',
    'SERGEANT D/A SUPERVISOR DETECTIVE SQUAD': 'SERGEANT-D/A SUPERVISOR DETECTIVE SQUAD',
    'RADIO AND TEVEVISION OPERATOR': 'RADIO AND TELEVISION OPERATOR',
    'SECRETARY TO THE DEPARTMENT': 'SECRETARY OF THE DEPARTMENT',
    'SECRETARY OF DEPARTMENT': 'SECRETARY TO DEPARTMENT',
    'SUPV OF HOUSING EXTERMINATORS': 'SUPV OF HOUSING EXTERMINATOR',
    # NOTE: 'P/T' and 'F/T' may denote distinct titles; verify before merging.
    'P/T SCHOOL AIDE': 'F/T SCHOOL AIDE',
    'SERGEANT-': 'SERGEANT',
    'CHAUFFEUR ATTENDANT': 'CHAUFFEUR-ATTENDANT',
}


def title_description(x):
    # Return the replacement title, or the original value if no fix applies.
    # (Without this fallback, all other titles would be replaced by None.)
    return TITLE_FIXES.get(x, x)

In [71]:
ds = update(ds, columns='Title Description', func = title_description)

In [72]:
filtering = filter(ds, predicate = Col('Title Description') == 'SERGEANT-' )
filtering.head()

Unnamed: 0,Fiscal Year,Payroll Number,Agency Name,Last Name,First Name,Mid Init,Agency Start Date,Work Location Borough,Title Description,Leave Status as of June 30,Base Salary,Pay Basis,Regular Hours,Regular Gross Paid,OT Hours,Total OT Paid,Total Other Pay


## Write the updated dataset to a new csv file

In [73]:
# use openclean stream to load data
new_ds = stream(ds)

In [75]:
# write new data to current directory
file = "./NYC_Payroll_new.csv"

new_ds.write(file)
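
After writing, it can be useful to re-read the file and confirm the header and row count. The sketch below uses only the standard library and demonstrates the check on a tiny temporary file. `count_rows` is a hypothetical helper, and the comma delimiter is an assumption; adjust the `delimiter` argument to match the file that was actually written.

```python
import csv
import os
import tempfile


def count_rows(path, delimiter=','):
    """Return the header and the number of data rows in a delimited file."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, None)
        return header, sum(1 for _ in reader)


# Demonstrate on a tiny temporary file rather than the real output file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    writer = csv.writer(tmp)
    writer.writerow(['Agency Name', 'Work Location Borough'])
    writer.writerow(['POLICE DEPARTMENT', 'MANHATTAN'])
    writer.writerow(['FIRE DEPARTMENT', 'UNKNOWN'])
    tmp_path = tmp.name

header, n_rows = count_rows(tmp_path)
os.remove(tmp_path)
print(header, n_rows)
```

For the real output, the row count returned for ./NYC_Payroll_new.csv should equal the number of rows in the cleaned data frame.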