This notebook works with anonymized data on sexual harassment charges filed with the [U.S. Equal Employment Opportunity Commission](https://www.eeoc.gov/) (EEOC) and aggregated data from the [Bureau of Labor Statistics](https://www.bls.gov). The code below does the following:

* Combines separate spreadsheets supplied by the EEOC into one data frame and eliminates duplicates
* Merges this data with data on gender, wages and total employment from the Bureau of Labor Statistics
* Aggregates the data for each industry
* Aggregates the data for each sector

## Load and combine EEOC data

In [1]:
import pandas as pd

The data supplied by the EEOC came in three separate sheets, with a maximum of 65,000 rows each. We were instructed to combine them.

In [2]:
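def parse_eeoc(path, sheet_name):
    """Read one sheet of the EEOC workbook into a data frame."""
    df = pd.read_excel(
        path,
        sheet_name=sheet_name,
        # Keep NAICS codes as strings so they can be sliced and joined later
        dtype={
            "R_NAICS_CODE": str
        },
        parse_dates=[ "CHARGE_FILING_DATE" ]
    )
    # Report the number of rows read from this sheet
    print(len(df))
    # Trim stray whitespace so identical titles group together
    df["R_NAICS_DESCRIPTION"] = df["R_NAICS_DESCRIPTION"].str.strip()
    return df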

In [3]:
all_eeoc_data = pd.concat([ parse_eeoc("../data/SH Charge Receipts.xlsx", name )
    for name in [ "Sheet 1", "Sheet 2", "Sheet 3" ] ])

print(len(all_eeoc_data))
all_eeoc_data.head()

64999
64999
40024
170022


|   | CHARGE_FILING_DATE | CP_SEX | CP_NATIONAL_ORIGIN | CP_DOB | HISPANIC_CP | CP_RACE_STRING | R_NAICS_CODE | R_NAICS_DESCRIPTION | R_NUMBER_OF_EMPLOYEES | R_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1995-10-01 | Female | Other National Origin - Obsolete | 1969-10-21 00:00:00 |  | B | 311612 | Meat Processed from Carcasses | 201 - 500 Employees | Private Employer |
| 1 | 1995-10-02 | Female | Other National Origin - Obsolete | 2001-01-01 00:00:00 |  | O | 541990 | All Other Professional, Scientific, and Techni... | 15 - 100 Employees | Private Employer |
| 2 | 1995-10-02 | Female | Other National Origin - Obsolete | 1960-07-30 00:00:00 |  | W | 722110 | Full-Service Restaurants | 15 - 100 Employees | Private Employer |
| 3 | 1995-10-02 | Male | Other National Origin - Obsolete | 1957-06-02 00:00:00 |  | B | 422990 | Other Miscellaneous Nondurable Goods Wholesalers | 15 - 100 Employees | Private Employer |
| 4 | 1995-10-02 | Female | Other National Origin - Obsolete | 1959-04-15 00:00:00 |  | W | 523999 | Miscellaneous Financial Investment Activities | 501+ Employees | Private Employer |
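
One side effect of stacking sheets is worth knowing: `pd.concat` keeps each frame's own row labels by default, so if every sheet is read with a default integer index, the combined index restarts at 0 for each sheet. The sketch below uses two tiny invented frames (not EEOC data) to show the difference `ignore_index=True` makes. Nothing later in this notebook relies on the index, so this is optional.

```python
import pandas as pd

# Two tiny stand-ins for separate sheets; the values are invented.
sheet_1 = pd.DataFrame({"CP_SEX": ["Female", "Male"]})
sheet_2 = pd.DataFrame({"CP_SEX": ["Female"]})

# Default behaviour: each frame's row labels are carried over unchanged.
stacked = pd.concat([sheet_1, sheet_2])

# ignore_index=True renumbers the rows from 0 in the combined frame.
renumbered = pd.concat([sheet_1, sheet_2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```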


Next, we drop any rows that are exact duplicates. In this run the row count stays at 170,022, so no rows were removed:

In [4]:
claims = all_eeoc_data.drop_duplicates().copy()
print(len(claims))

170022
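
To see how many rows `drop_duplicates` would remove before actually dropping them, one option is `DataFrame.duplicated`. The following is a minimal sketch on made-up rows; the `toy` frame and its values are illustrative, not EEOC data.

```python
import pandas as pd

# Toy stand-in for the combined EEOC frame; values are invented.
toy = pd.DataFrame({
    "CHARGE_FILING_DATE": pd.to_datetime(["1995-10-01", "1995-10-02", "1995-10-02"]),
    "CP_SEX": ["Female", "Male", "Male"],
    "R_NAICS_CODE": ["311612", "722110", "722110"],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(n_duplicates, len(deduped))
```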


## Aggregate claims by NAICS codes

In [5]:
grp = claims.fillna("No category")\
    .groupby([ "R_NAICS_CODE", "R_NAICS_DESCRIPTION" ])

claims_by_naics = pd.DataFrame({
    "claims_total": grp.size(),
    "claims_2016": grp["CHARGE_FILING_DATE"].apply(lambda x: (x.dt.year == 2016).sum())
}).reset_index()

claims_by_naics.head()

|   | R_NAICS_CODE | R_NAICS_DESCRIPTION | claims_2016 | claims_total |
|---|---|---|---|---|
| 0 | 111110 | Soybean Farming | 0 | 1 |
| 1 | 111150 | Corn Farming | 0 | 3 |
| 2 | 111199 | All Other Grain Farming | 0 | 15 |
| 3 | 111219 | Other Vegetable (except Potato) and Melon Farming | 1 | 20 |
| 4 | 111310 | Orange Groves | 0 | 3 |


In [6]:
claims_by_naics["claims_total"].sum()

170022

In [7]:
claims_by_naics["claims_2016"].sum()

5236
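
The grouped total matches the 170,022 rows in `claims` because of the `fillna("No category")` call before `groupby`. By default, pandas leaves out rows whose group key is missing, which would make the grouped totals fall short of the row count. A small sketch with invented rows shows the effect:

```python
import pandas as pd

# Invented rows; the last one has no NAICS information at all.
toy = pd.DataFrame({
    "R_NAICS_CODE": ["722110", "722110", None],
    "R_NAICS_DESCRIPTION": ["Full-Service Restaurants", "Full-Service Restaurants", None],
})
keys = ["R_NAICS_CODE", "R_NAICS_DESCRIPTION"]

# Without filling, the row with missing keys is silently left out.
dropped = toy.groupby(keys).size()

# Filling first keeps it under an explicit placeholder label.
kept = toy.fillna("No category").groupby(keys).size()

print(dropped.sum(), kept.sum())
```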

## Merge BLS data

Here, we read in data that we created in the previous notebook, `01-merge-bls-data.ipynb`.

In [8]:
ces_industry_metrics = pd.read_csv(
    "../output/ces_industry_metrics.csv",
    dtype={ "naics_code": str }
)

bls_sector_metrics = pd.read_csv(
    "../output/bls_sector_metrics.csv",
    dtype={ "naics_code": str, "naics_sector": str, "naics_sector_rollup": str }
)

Some NAICS titles appear under more than one NAICS code. Setting aside the "No category" placeholder, the codes for each such title all belong to the same two-digit NAICS sector:

In [9]:
grp = claims_by_naics[
    claims_by_naics["R_NAICS_DESCRIPTION"].isin(
        claims_by_naics["R_NAICS_DESCRIPTION"].value_counts().pipe(lambda x: x[x > 1]).index
    )
].groupby("R_NAICS_DESCRIPTION")
grp["R_NAICS_CODE"].apply(lambda x: x.str[:2].unique())

R_NAICS_DESCRIPTION
All Other Information Services                                                   [51]
Cable and Other Program Distribution                                             [51]
Cafeterias, Grill Buffets, and Buffets                                           [72]
Cellular and Other Wireless Telecommunications                                   [51]
Commercial and Institutional Building Construction                               [23]
Employment Placement Agencies                                                    [56]
Food Product Machinery Manufacturing                                             [33]
Full-Service Restaurants                                                         [72]
Household Appliance Stores                                                       [44]
Limited-Service Restaurants                                                      [72]
No category                                                      [42, 44, 51, 56, na]
Other Commercial and Service Indus

In [10]:
grp = claims_by_naics.sort_values("R_NAICS_CODE").groupby("R_NAICS_DESCRIPTION")
claims_by_naics_desc = pd.DataFrame({
    "claims_2016": grp["claims_2016"].sum(),
    "claims_total": grp["claims_total"].sum(),
    "R_NAICS_CODE": grp["R_NAICS_CODE"].apply(",".join),
    "n_codes": grp.size()
}).drop("No category")\
    .reset_index()
claims_by_naics_desc.head()

|   | R_NAICS_DESCRIPTION | R_NAICS_CODE | claims_2016 | claims_total | n_codes |
|---|---|---|---|---|---|
| 0 | Adhesive Manufacturing | 325520 | 0 | 2 | 1 |
| 1 | Administration of Air and Water Resource and S... | 924110 | 2 | 97 | 1 |
| 2 | Administration of Conservation Programs | 924120 | 2 | 4 | 1 |
| 3 | Administration of Education Programs | 923110 | 1 | 79 | 1 |
| 4 | Administration of General Economic Programs | 926110 | 0 | 45 | 1 |
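
For titles that map to several codes, the cell above sums the counts and joins the codes into one comma-separated string. The sketch below reproduces that step on invented counts; the second restaurant code is illustrative only. Because the codes are sorted before joining and share a sector, the first two characters of the joined string still give the sector. A joined string such as `"722110,722511"` would only match an industry-level lookup that held the same joined value, so such rows presumably depend on the sector-level figures.

```python
import pandas as pd

# Invented counts; the second restaurant code is illustrative only.
toy = pd.DataFrame({
    "R_NAICS_CODE": ["722511", "722110", "325520"],
    "R_NAICS_DESCRIPTION": [
        "Full-Service Restaurants",
        "Full-Service Restaurants",
        "Adhesive Manufacturing",
    ],
    "claims_total": [5, 7, 2],
})

# Sort by code first so the joined string lists codes in ascending order.
grp = toy.sort_values("R_NAICS_CODE").groupby("R_NAICS_DESCRIPTION")
by_desc = pd.DataFrame({
    "claims_total": grp["claims_total"].sum(),
    "R_NAICS_CODE": grp["R_NAICS_CODE"].apply(",".join),
    "n_codes": grp.size(),
}).reset_index()

# The sector is read from the first two characters of the joined string.
by_desc["naics_sector"] = by_desc["R_NAICS_CODE"].str[:2]
print(by_desc)
```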


In [11]:
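# First attach industry-level BLS metrics where the NAICS code matches,
# then attach sector-level metrics by two-digit sector; sector-level
# columns that clash with industry-level ones get the "_meta" suffix.
claims_by_desc_with_meta = claims_by_naics_desc.pipe(pd.merge,
    ces_industry_metrics,
    left_on="R_NAICS_CODE",
    right_on="naics_code",
    how="left"
)\
.assign(naics_sector=lambda x: x["R_NAICS_CODE"].str[:2])\
.pipe(pd.merge,
    bls_sector_metrics,
    how="left",
    on="naics_sector",
    suffixes=[ "", "_meta" ]
)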

claims_by_desc_with_meta.head().T

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| R_NAICS_DESCRIPTION | Adhesive Manufacturing | Administration of Air and Water Resource and S... | Administration of Conservation Programs | Administration of Education Programs | Administration of General Economic Programs |
| R_NAICS_CODE | 325520 | 924110 | 924120 | 923110 | 926110 |
| claims_2016 | 0 | 2 | 2 | 1 | 0 |
| claims_total | 2 | 97 | 4 | 79 | 45 |
| n_codes | 1 | 1 | 1 | 1 | 1 |
| industry_code |  |  |  |  |  |
| avg_hrly_earnings |  |  |  |  |  |
| total_employment |  |  |  |  |  |
| women_employment |  |  |  |  |  |
| naics_supersector |  |  |  |  |  |
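
In the output above, the industry-level columns are empty for the first five rows, which means those codes found no match in the industry metrics. One way to count matched and unmatched rows after a left merge is the `indicator=True` option of `merge`. The sketch below uses invented frames; `claims_toy` and `metrics_toy` are hypothetical stand-ins, and the earnings figure is made up.

```python
import pandas as pd

claims_toy = pd.DataFrame({
    "R_NAICS_CODE": ["325520", "722110"],
    "claims_total": [2, 7],
})
# Hypothetical industry-level metrics; the number is invented.
metrics_toy = pd.DataFrame({
    "naics_code": ["722110"],
    "avg_hrly_earnings": [15.0],
})

merged = claims_toy.merge(
    metrics_toy,
    left_on="R_NAICS_CODE",
    right_on="naics_code",
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Count how many claim rows found an industry-level match.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```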


In [12]:
assert claims_by_desc_with_meta["women_percentage_meta"].isnull().sum() == 0

In [13]:
for col in [ "women_percentage", "avg_hrly_earnings", "total_employment" ]:
    claims_by_desc_with_meta[col + "_best"] = claims_by_desc_with_meta[col]\
        .fillna(claims_by_desc_with_meta[col + "_meta"])

In [14]:
claims_by_desc_with_meta["does_not_have_detailed"] = claims_by_desc_with_meta["women_percentage"].isnull()
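
The two cells above implement a fallback: use the industry-level figure when it exists, otherwise use the sector-level figure, and record which rows needed the fallback. A minimal sketch of the same pattern on invented values:

```python
import pandas as pd

# Invented values: the first row has its own industry-level figure,
# the second has only the sector-level ("_meta") one.
toy = pd.DataFrame({
    "women_percentage": [0.52, None],
    "women_percentage_meta": [0.48, 0.30],
})

# Prefer the industry-level value; fall back to the sector-level one.
toy["women_percentage_best"] = toy["women_percentage"].fillna(
    toy["women_percentage_meta"]
)

# Record which rows relied on the fallback.
toy["does_not_have_detailed"] = toy["women_percentage"].isnull()
print(toy)
```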

## Merge graphics info

In [15]:
graphics_info = pd.read_csv(
    "../data/graphics_info.csv",
    dtype={
        "naics_code": str,
        "focus": str,
        "index_num": str,
        "industry_code": str
    }
)

graphics_info.head()

|   | industry | naics_code | grouping | industry_class | grouping_class | color | focus | index_num | info_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accommodation and Food Services | 72 | Service and sales-related jobs | accommodationandfoodservices | serviceandsalesrelatedjobs | #af2469 | 1 | 1 | By far, the most claims were filed by service-... |
| 1 | Retail Trade | 44 | Service and sales-related jobs | retailtrade | serviceandsalesrelatedjobs | #f43192 | 1 | 2 | By far, the most claims were filed by service-... |
| 2 | Other Services (except Public Administration) | 81 | Service and sales-related jobs | otherservices | serviceandsalesrelatedjobs | #efb4d1 | 1 | 3 | By far, the most claims were filed by service-... |
| 3 | Wholesale Trade | 42 | Service and sales-related jobs | wholesaletrade | serviceandsalesrelatedjobs | #efdce7 | 1 | 4 | By far, the most claims were filed by service-... |
| 4 | Manufacturing | 31 | Manual Labor | manufacturing | manuallabor | #096c5f | 2 | 1 | Manual labor jobs like construction, warehousi... |


In [16]:
for_graphic = pd.merge(
    claims_by_desc_with_meta,
    graphics_info.drop(["info_text", "naics_code"], axis=1),
    left_on="naics_sector_name",
    right_on="industry"
)

In [17]:
for_graphic.to_csv("../output/d3_claims_by_industry.csv", index=False)

## Aggregate claims by NAICS sector

In [18]:
claims["naics_sector"] = claims["R_NAICS_CODE"].str[:2]

In [19]:
grp = claims.fillna("No category")\
    .groupby([ "naics_sector" ])

claims_by_naics_sector = pd.DataFrame({
    "claims_total": grp.size(),
    "claims_2016": grp["CHARGE_FILING_DATE"].apply(lambda x: (x.dt.year == 2016).sum())
}).reset_index()

claims_by_naics_sector.head()

|   | naics_sector | claims_2016 | claims_total |
|---|---|---|---|
| 0 | 11 | 23 | 950 |
| 1 | 21 | 8 | 700 |
| 2 | 22 | 9 | 693 |
| 3 | 23 | 46 | 3070 |
| 4 | 31 | 51 | 3410 |


In [20]:
claims_by_naics_sector["claims_total"].sum()

170022

In [21]:
claims_by_naics_sector["claims_2016"].sum()

5236

In [22]:
sectors = pd.read_csv("../data/naics_sectors.csv", dtype=str)

In [23]:
pd.merge(
    claims_by_naics_sector,
    sectors,
    on="naics_sector",
    how="left"
).fillna("n/a")

|   | naics_sector | claims_2016 | claims_total | naics_sector_rollup | naics_supersector | naics_sector_name |
|---|---|---|---|---|---|---|
| 0 | 11 | 23 | 950 | 11.0 |  | Agriculture, Forestry, Fishing and Hunting |
| 1 | 21 | 8 | 700 | 21.0 | 10.0 | Mining, Quarrying, and Oil and Gas Extraction |
| 2 | 22 | 9 | 693 | 22.0 | 40.0 | Utilities |
| 3 | 23 | 46 | 3070 | 23.0 | 20.0 | Construction |
| 4 | 31 | 51 | 3410 | 31.0 | 30.0 | Manufacturing |
| 5 | 32 | 48 | 3004 | 31.0 | 30.0 | Manufacturing |
| 6 | 33 | 162 | 7639 | 31.0 | 30.0 | Manufacturing |
| 7 | 42 | 44 | 2287 | 42.0 | 40.0 | Wholesale Trade |
| 8 | 44 | 140 | 8358 | 44.0 | 40.0 | Retail Trade |
| 9 | 45 | 77 | 5761 | 44.0 | 40.0 | Retail Trade |


In [24]:
grp = pd.merge(
    claims_by_naics_sector,
    sectors,
    on="naics_sector",
    how="left"
).fillna("No category").groupby("naics_sector_rollup")

sector_for_graphics = pd.DataFrame({
    "claims_2016": grp["claims_2016"].sum().astype(int),
    "claims_total": grp["claims_total"].sum().astype(int),
    "naics_sector_name": grp["naics_sector_name"].first()
}).reset_index().pipe(pd.merge,
    graphics_info,
    how="right",
    left_on="naics_sector_name",
    right_on="industry"
).drop("naics_sector_name", axis=1)

sector_for_graphics.head()

|   | naics_sector_rollup | claims_2016 | claims_total | industry | naics_code | grouping | industry_class | grouping_class | color | focus | index_num | info_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 23 | 950 | Agriculture, Forestry, Fishing and Hunting | 11 | Manual Labor | agricultureforestryfishingandhunting | manuallabor | #a4ffe6 | 2 | 4 | Manual labor jobs like construction, warehousi... |
| 1 | 21 | 8 | 700 | Mining, Quarrying, and Oil and Gas Extraction | 21 | Manual Labor | miningquarryingandoilandgasextraction | manuallabor | #bfec89 | 2 | 5 | Manual labor jobs like construction, warehousi... |
| 2 | 22 | 9 | 693 | Utilities | 22 | Manual Labor | utilities | manuallabor | #e7f9d2 | 2 | 6 | Manual labor jobs like construction, warehousi... |
| 3 | 23 | 46 | 3070 | Construction | 23 | Manual Labor | construction | manuallabor | #0dccb0 | 2 | 3 | Manual labor jobs like construction, warehousi... |
| 4 | 31 | 261 | 14053 | Manufacturing | 31 | Manual Labor | manufacturing | manuallabor | #096c5f | 2 | 1 | Manual labor jobs like construction, warehousi... |
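
The Manufacturing row illustrates the roll-up: its 14,053 total is the sum of the three two-digit sectors 31, 32 and 33 from the earlier sector table. The sketch below repeats that arithmetic for the sectors shown above. The `rollup` dict is a hand-written, hypothetical stand-in for the lookup in `naics_sectors.csv` and covers only these five codes.

```python
import pandas as pd

# Sector-level totals copied from the output shown earlier.
by_sector = pd.DataFrame({
    "naics_sector": ["31", "32", "33", "44", "45"],
    "claims_total": [3410, 3004, 7639, 8358, 5761],
})

# Hypothetical stand-in for the naics_sector -> naics_sector_rollup lookup.
rollup = {"31": "31", "32": "31", "33": "31", "44": "44", "45": "44"}

# Map each sector to its roll-up code, then sum within each roll-up.
by_rollup = (
    by_sector
    .assign(naics_sector_rollup=by_sector["naics_sector"].map(rollup))
    .groupby("naics_sector_rollup")["claims_total"]
    .sum()
)
print(by_rollup)
```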


In [25]:
len(sector_for_graphics)

21
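
The merge in cell 24 uses `how="right"`, so every row of `graphics_info` appears in the result whether or not it has matching claims. A small sketch of that behaviour follows; the frames are hypothetical stand-ins, with one styling row deliberately left without a claims match.

```python
import pandas as pd

sectors_toy = pd.DataFrame({
    "naics_sector_name": ["Construction"],
    "claims_total": [3070],
})
# Hypothetical styling table; "Utilities" has no row in sectors_toy.
graphics_toy = pd.DataFrame({
    "industry": ["Construction", "Utilities"],
    "color": ["#0dccb0", "#e7f9d2"],
})

# A right merge keeps every row of the right-hand frame.
merged = sectors_toy.merge(
    graphics_toy,
    how="right",
    left_on="naics_sector_name",
    right_on="industry",
)

# Rows without a claims match carry missing values in the claims columns.
n_unmatched = int(merged["claims_total"].isnull().sum())
print(len(merged), n_unmatched)
```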

In [26]:
sector_for_graphics.to_csv("../output/d3_claims_by_sector.csv")

### Additional aggregation for article

In [27]:
claims.fillna("[missing]")["CP_SEX"].value_counts().to_frame("claims")\
    .assign(prop=lambda x: x["claims"] / len(claims))

|   | claims | prop |
|---|---|---|
| Female | 141380 | 0.831539 |
| Male | 25503 | 0.149998 |
| CP Sex Not Available/Applicable | 3087 | 0.018156 |
| [missing] | 52 | 0.000306 |
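
The proportions above are each category's count divided by the total number of claims. The sketch below repeats that calculation on invented values and shows an alternative that should give the same shares in one call, `value_counts(normalize=True, dropna=False)`, which keeps missing values as their own row instead of relabelling them.

```python
import pandas as pd

# Invented values; None stands in for a missing CP_SEX entry.
toy = pd.DataFrame({"CP_SEX": ["Female", "Female", "Male", None]})

# Same approach as the cell above: label missing values, count, then divide.
counts = toy["CP_SEX"].fillna("[missing]").value_counts()
summary = counts.to_frame("claims").assign(prop=lambda x: x["claims"] / len(toy))

# Alternative: let value_counts compute the shares, keeping missing values.
props = toy["CP_SEX"].value_counts(normalize=True, dropna=False)

print(summary)
print(props)
```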

