## TODO

- feature engineer age from DOB
- feature engineer time from date charge was filed to date sent to EEOC
- figure out whether the class label is Closure Code or Closure Type
- determine size of data subset
- fix data types of cols

## Introduction: Injustice at Work

Our analysis explores the relationship between attributes of complainants and their complaints in employment discrimination charges and the outcome of each charge.

Because we are working on personal machines, we randomly sampled 25,000 rows to represent the full dataset. The original dataset is available at: https://github.com/PublicI/employment-discrimination/blob/master/data/complaints_10.txt

According to the Injustice at Work Center, each attribute is defined as follows:

- Unique ID: unique identifier for each case (a case is a collection of related charges)
- State Code: complainant state
- No of Employees Code: code indicating the approximate number of employees working for respondent employer
- No of Employees: approximate number of employees working for respondent employer
- NAICS Code: North American Industry Classification System code of respondent employer
- NAICS Description: North American Industry Classification System description of respondent company (e.g., crude petroleum and natural gas extraction)
- Institution Type Code: classification code of respondent employer
- Institution Type: classification of respondent employer (e.g., private employer)
- CP Date of Birth: complainant’s date of birth
- CP Sex: complainant’s gender
- Date First Office: date charge was filed
- Date FEPA Sent to EEOC: date charge was forwarded to the EEOC
- Closure Date: date investigation of case was closed
- Closure Code: code indicating how case was closed
- Closure Type: description indicating how case was closed (e.g., no cause finding issued)
- Monetary Benefits: monetary benefit complainant received
- Statute Code: code for statute under which charge was filed
- Statute: statute under which charge was filed (e.g., Americans with Disabilities Act)
- Basis Code: code for basis of discrimination
- Basis: basis of discrimination (e.g., race-black/African American)
- Issue Code: type code for adverse action alleged by complainant
- Issue: adverse action alleged by complainant (e.g., harassment)
- Court Filing Date: date complainant filed a lawsuit
- Civil Action Number: case number of lawsuit
- Court: court in which lawsuit was filed
- Litigation Resolution Date: date lawsuit was resolved
- Litigation Monetary Benefits: monetary damages recovered through lawsuit
- Litigation Case Type: case type of lawsuit

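Our analysis aims to classify charges by "Closure Code" (or possibly "Closure Type"; this is still to be decided, see TODO). We have identified the following candidate predictive attributes. Attributes marked with * need feature engineering before use (see TODO):
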
- State Code: complainant state
- No of Employees Code: code indicating the approximate number of employees working for respondent employer
- NAICS Code: North American Industry Classification System code of respondent employer
- Institution Type Code: classification code of respondent employer
- CP Date of Birth: complainant’s date of birth *
- CP Sex: complainant’s gender
- Date First Office: date charge was filed *
- Date FEPA Sent to EEOC: date charge was forwarded to the EEOC *
- Basis Code: code for basis of discrimination
- Issue Code: type code for adverse action alleged by complainant
- Litigation Case Type: case type of lawsuit
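
The two date attributes marked with * could be combined into a single duration feature, as the TODO list suggests. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the helper `days_between` and the column `days_to_eeoc` are hypothetical names, the sample rows are made up, dates are assumed to use the `mm/dd/yy` format seen in the data, and `date_sent_eeoc` is assumed to still be present (that is, the feature is computed before that column is dropped).

```python
import pandas as pd

# Hypothetical sample rows; real values would come from complaints_10.txt.
sample = pd.DataFrame({
    "date_filed": ["02/17/10", "07/09/10", None],
    "date_sent_eeoc": ["03/01/10", "07/09/10", "08/01/10"],
})

def days_between(start, end, fmt="%m/%d/%y"):
    """Return the number of days from start to end; NaN where either date is missing."""
    start_dt = pd.to_datetime(start, format=fmt, errors="coerce")
    end_dt = pd.to_datetime(end, format=fmt, errors="coerce")
    return (end_dt - start_dt).dt.days

# Days between filing the charge and forwarding it to the EEOC.
sample["days_to_eeoc"] = days_between(sample["date_filed"], sample["date_sent_eeoc"])
print(sample)
```

Rows with a missing date yield NaN rather than a misleading zero, so they can be imputed or dropped explicitly later.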


## Pandas Settings

In [224]:
import pandas as pd

pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)
pd.options.display.float_format = "{:,.2f}".format

## Preprocessing
We sample 25,000 rows from the full dataset of roughly 343,000 rows because of the limitations of running this project on personal machines.

In [225]:
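ncols = 28  # number of columns in the raw file (matches the 28 names below)
# Read the tab-separated file, skip its original header row, and assign snake_case names.
# The "age" column initially holds CP Date of Birth; it is converted to an age further down.
data = pd.read_csv("complaints_10.txt", sep="\t", skiprows=1,
                      dtype={1: str},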
                      names=["unique_id", "state_code", "num_employee_code", "num_employees",
                             "naics_code", "naics_desc", "inst_type_code", "inst_type",
                             "age", "sex", "date_filed", "date_sent_eeoc", "date_closed",
                             "closure_code", "closure_action", "monetary_benefits", "statute_code",
                             "statute", "basis_code", "basis", "issue_code", "issue",
                             "court_filing_date", "civil_action_num", "court", "resolution_date",
                             "litigation_monetary_benefits", "litigation_case_type"])

cols_to_drop = ['unique_id', 'num_employees', 'naics_desc', 
                'inst_type', 'date_closed', "closure_action",
                "monetary_benefits", "statute_code", "statute",
                "basis", "issue", "court_filing_date", "date_sent_eeoc",
                "civil_action_num", "court", "resolution_date",
                "litigation_monetary_benefits"
                ]

data = data.drop(cols_to_drop, axis = 1) 
data = data.sample(n = 25_000)
data.head()
data.groupby('litigation_case_type').nunique()



| litigation_case_type | state_code | num_employee_code | naics_code | inst_type_code | age | sex | date_filed | closure_code | basis_code | issue_code | litigation_case_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Class | 4 | 4 | 4 | 1 | 10 | 2 | 8 | 1 | 4 | 5 | 1 |
| Individual | 20 | 5 | 25 | 3 | 43 | 3 | 37 | 3 | 18 | 18 | 1 |
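
Because the preprocessing cell loads every column before dropping most of them and draws an unseeded sample, the result uses more memory than needed and changes on every run. The sketch below shows one possible alternative on a tiny made-up stand-in file: `usecols` keeps only the wanted columns at parse time, and a fixed `random_state` makes the sample repeatable. The stand-in data, the shortened column list, and the seed value 0 are all assumptions for illustration.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for complaints_10.txt (hypothetical values).
raw = "id\tstate\tsex\tnotes\n" + "".join(
    f"{i}\tTN\tF\tx\n" for i in range(100)
)

names = ["unique_id", "state_code", "sex", "notes"]
keep = ["state_code", "sex"]

# usecols skips the unwanted columns while parsing, so they never occupy memory.
frame = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1, names=names, usecols=keep)

# A fixed random_state (0 is an arbitrary choice) makes the sample repeatable.
first = frame.sample(n=10, random_state=0)
second = frame.sample(n=10, random_state=0)
print(first.index.equals(second.index))
```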


Getting age from birth date and cleaning the age column

In [226]:
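def get_year(date):
    """Return the four-digit year of an mm/dd/yy date string, or 0 if missing."""
    if pd.isna(date):
        return 0
    year = int(str(date)[-2:]) + 2000
    # The data ends in 2010, so a two-digit year above 10 must belong to the 1900s.
    return year if year <= 2010 else year - 100

def get_age(date):
    """Return the complainant's age in 2010, or 0 if missing or under 14."""
    if pd.isna(date):
        return 0
    age = 2010 - get_year(date)
    # Ages under 14 are treated as invalid and encoded as 0, like missing values.
    return age if age >= 14 else 0

# The "age" column holds dates of birth until this line replaces them with ages.
data['age'] = data['age'].apply(get_age)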
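
The per-row conversion above could also be written with vectorized pandas operations. The sketch below is one hedged alternative, not the notebook's method: the function name `ages_in_2010` and its `min_age` parameter are hypothetical, the sample dates are made up, and it returns NaN instead of 0 for missing or implausible birth years so that those rows do not pull down a later mean.

```python
import pandas as pd

def ages_in_2010(dob: pd.Series, min_age: int = 14) -> pd.Series:
    """Convert mm/dd/yy birth dates to ages in 2010; NaN if missing or below min_age."""
    two_digit = pd.to_numeric(dob.str[-2:], errors="coerce")
    year = 2000 + two_digit
    # Years after 2010 cannot be birth years in this data, so shift them to the 1900s.
    year = year.where(year <= 2010, year - 100)
    age = 2010 - year
    # Keep only plausible working ages; everything else becomes NaN.
    return age.where(age >= min_age)

# Hypothetical birth dates: born 1957, born 2005 (too young), and missing.
example = pd.Series(["03/15/57", "06/01/05", None])
result = ages_in_2010(example)
print(result)
```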

Cleaning age field

In [227]:
from datetime import timedelta, date
import numpy as np

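# Series.replace returns a new Series. Neither result below is assigned back,
# so data['age'] still contains its zeros (visible in the output).
data['age'].replace(0, np.nan) # intended to keep the zeroes out of the mean
data['age'].replace(np.nan, data['age'].mean())
data.head()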

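| | state_code | num_employee_code | naics_code | inst_type_code | age | sex | date_filed | closure_code | basis_code | issue_code | litigation_case_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143018 | TN | C | 561720.0 | E | 53 | F | 02/17/10 | M3 | OR | D2 | |
| 274176 | NC | C | 923130.0 | E | 50 | M | 07/09/10 | | GM | H1 | |
| 240731 | OH | A | 921140.0 | G | 0 | F | 05/26/10 | M3 | OR | D3 | |
| 311324 | AL | A | 561990.0 | U | 34 | F | 08/27/10 | | RB | L1 | |
| 315076 | MA | D | 611699.0 | N | 58 | F | 08/24/10 | | AO | D3 | |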
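
`Series.replace` returns a new Series rather than modifying the column, so the two `replace` calls in the cell above leave `data['age']` unchanged, which is why a 0 still appears in the output. A minimal sketch of the assigned form follows, using a small stand-in frame (named `ages` here purely for illustration) built from the five ages shown above. It is one way the intended mean imputation might be written, not a confirmed fix for the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame holding the five ages from the head() output above.
ages = pd.DataFrame({"age": [53, 50, 0, 34, 58]})

# Mark the placeholder zeros as missing so they are excluded from the mean.
ages["age"] = ages["age"].replace(0, np.nan)

# Fill the missing ages with the mean of the known ages.
ages["age"] = ages["age"].fillna(ages["age"].mean())
print(ages)
```

Assigning each result back to the column is what makes the change persist. Computing the mean after the zeros become NaN keeps the placeholders from dragging the average down.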


In [228]:
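# Rows without a lawsuit have no case type; label them explicitly.
data["litigation_case_type"] = data["litigation_case_type"].fillna("No Litigation")
# Fill missing categorical values with the most frequent value (the mode).
data["state_code"] = data["state_code"].fillna(data["state_code"].mode()[0])
data["sex"] = data["sex"].fillna(data["sex"].mode()[0])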
data.isna().sum()
data.groupby('age').count()

| age | state_code | num_employee_code | naics_code | inst_type_code | sex | date_filed | closure_code | basis_code | issue_code | litigation_case_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2484 | 2330 | 2484 | 2482 | 2484 | 2484 | 1653 | 2468 | 2484 | 2484 |
| 14 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 17 | 9 | 9 | 9 | 9 | 9 | 9 | 2 | 9 | 9 | 9 |
| 18 | 24 | 22 | 24 | 24 | 24 | 24 | 12 | 24 | 24 | 24 |
| 19 | 57 | 55 | 57 | 57 | 57 | 57 | 45 | 57 | 57 | 57 |
| 20 | 75 | 74 | 75 | 75 | 75 | 75 | 51 | 75 | 75 | 75 |
| 21 | 109 | 102 | 109 | 109 | 109 | 109 | 71 | 109 | 109 | 109 |
| 22 | 141 | 136 | 141 | 141 | 141 | 141 | 92 | 138 | 141 | 141 |
| 23 | 202 | 191 | 202 | 202 | 202 | 202 | 128 | 201 | 202 | 202 |
| 24 | 211 | 201 | 211 | 211 | 211 | 211 | 153 | 211 | 211 | 211 |
