### Prepping Data Challenge: Gender Pay Gap (Week 12)

We're using the data currently available on the Gender Pay Gap Service, covering reports from 2017 to 2022.

### Requirements
 - Input the data
 - Combine the files
 - Keep only relevant fields
 - Extract the Report years from the file paths
 - Create a Year field based on the first year in the Report name
 - Some companies have changed names over the years. For each EmployerId, find the most recent report they submitted and apply this EmployerName across all reports they've submitted
 - Create a Pay Gap field to explain the pay gap in plain English
   - You may encounter floating point inaccuracies. Find out more about how to resolve them here
   - In this dataset, a positive DiffMedianHourlyPercent means the women's pay is lower than the men's pay, whilst a negative value indicates the other way around
   - The phrasing should be as follows:
      - In this organisation, women's median hourly pay is X% higher/lower than men's.
      - In this organisation, men's and women's median hourly pay is equal.
 - Output the data
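
The floating point note in the requirements can be illustrated with a short sketch. The numbers below are made up rather than taken from the dataset, and `format_percent` is a hypothetical helper introduced only for this example. Rounding before formatting is one common way to deal with the issue, not necessarily the one the challenge has in mind.

```python
# Binary floating point cannot represent most decimal fractions exactly,
# so simple arithmetic can leave unexpected trailing digits.
raw = 0.1 + 0.2
print(raw)  # 0.30000000000000004


def format_percent(value, decimals=1):
    """Hypothetical helper: absolute value rounded to `decimals` places, as text."""
    return f"{round(abs(value), decimals)}%"


print(format_percent(raw))    # 0.3%
print(format_percent(-13.0))  # 13.0%
```

Taking the absolute value with `abs` does not itself introduce any inaccuracy; the rounding only matters when the value has been through arithmetic first.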

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
#Input the data
#Combine the files
all_file = ['UK Gender Pay Gap Data - 2017 to 2018.csv','UK Gender Pay Gap Data - 2018 to 2019.csv',
           'UK Gender Pay Gap Data - 2019 to 2020.csv','UK Gender Pay Gap Data - 2020 to 2021.csv',
           'UK Gender Pay Gap Data - 2021 to 2022.csv']
df = pd.concat([pd.read_csv(X).assign(New=os.path.basename(X).split('.')[0])
                 for X in all_file])
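
As a self-contained sketch of the same read, tag, and concatenate pattern, the example below writes two tiny made-up CSV files to a temporary folder and combines them. The function name `combine_csvs`, the sample rows, and the use of `ignore_index=True` are assumptions for illustration and are not part of the notebook above.

```python
import os
import tempfile

import pandas as pd


def combine_csvs(paths):
    """Read each CSV, tag its rows with the file name (minus extension), and stack them."""
    frames = [
        pd.read_csv(p).assign(New=os.path.splitext(os.path.basename(p))[0])
        for p in paths
    ]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny made-up files, written to a temporary folder purely for illustration
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for label, name in [("2017 to 2018", "OLD NAME LTD"), ("2018 to 2019", "NEW NAME LTD")]:
        path = os.path.join(folder, f"UK Gender Pay Gap Data - {label}.csv")
        pd.DataFrame({"EmployerId": [1], "EmployerName": [name]}).to_csv(path, index=False)
        paths.append(path)
    combined = combine_csvs(paths)

print(combined["New"].tolist())
```

`os.path.splitext` drops only the final extension, so it behaves the same as `split('.')[0]` for these file names while also coping with names that contain extra dots.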

In [3]:
df.columns

Index(['EmployerName', 'EmployerId', 'Address', 'PostCode', 'CompanyNumber',
       'SicCodes', 'DiffMeanHourlyPercent', 'DiffMedianHourlyPercent',
       'DiffMeanBonusPercent', 'DiffMedianBonusPercent', 'MaleBonusPercent',
       'FemaleBonusPercent', 'MaleLowerQuartile', 'FemaleLowerQuartile',
       'MaleLowerMiddleQuartile', 'FemaleLowerMiddleQuartile',
       'MaleUpperMiddleQuartile', 'FemaleUpperMiddleQuartile',
       'MaleTopQuartile', 'FemaleTopQuartile', 'CompanyLinkToGPGInfo',
       'ResponsiblePerson', 'EmployerSize', 'CurrentName',
       'SubmittedAfterTheDeadline', 'DueDate', 'DateSubmitted', 'New'],
      dtype='object')

In [4]:
#Extract the Report years from the file paths
#A raw string keeps \s as a regex token; expand=False returns a Series
df['Report'] = df['New'].str.extract(r"-\s([0-9\sa-z]+)", expand=False)

In [5]:
#Keep only relevant fields
df = df[['Report','EmployerName','EmployerId','DiffMedianHourlyPercent','EmployerSize']]

In [6]:
df.sample(n=10, random_state = 40)

| | Report | EmployerName | EmployerId | DiffMedianHourlyPercent | EmployerSize |
|---|---|---|---|---|---|
| 9041 | 2017 to 2018 | The Provost and Scholars of the Queen's Colleg... | 16312 | 4.9 | 250 to 499 |
| 1980 | 2020 to 2021 | City of Bristol College | 14816 | 22.1 | 500 to 999 |
| 1005 | 2020 to 2021 | BELONG LIMITED | 1968 | 5.0 | 500 to 999 |
| 994 | 2018 to 2019 | BELL DECORATING GROUP LIMITED | 1954 | 1.0 | 1000 to 4999 |
| 2433 | 2017 to 2018 | DCS INC LIMITED | 3959 | -13.0 | 250 to 499 |
| 9897 | 2020 to 2021 | VESTAS OFFSHORE WIND BLADES UK LTD | 8404 | 0.0 | 500 to 999 |
| 4781 | 2017 to 2018 | KAINOS SOFTWARE LIMITED | 16344 | 20.0 | 250 to 499 |
| 198 | 2017 to 2018 | Advanced Travel Partners UK Ltd | 936 | 26.0 | 250 to 499 |
| 8088 | 2018 to 2019 | SIEMENS PUBLIC LIMITED COMPANY | 155 | 21.4 | 5000 to 19,999 |
| 6864 | 2018 to 2019 | PAYDENS LIMITED | 9638 | 0.9 | 1000 to 4999 |


In [7]:
#Create a Year field based on the first year in the Report name
df['Year'] = df['Report'].str.extract(r"([0-9]+)\s[a-z]+", expand=False)
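
To see what the two regular expressions capture, here is a small sketch that applies them to a single file label with the standard `re` module. The patterns are the ones used in the notebook, written as raw strings; the constant names `REPORT_PATTERN` and `YEAR_PATTERN` are made up for this example.

```python
import re

# Everything after the first "- " made of digits, spaces and lowercase letters
REPORT_PATTERN = re.compile(r"-\s([0-9\sa-z]+)")
# The first run of digits that is followed by a space and a lowercase word
YEAR_PATTERN = re.compile(r"([0-9]+)\s[a-z]+")

label = "UK Gender Pay Gap Data - 2017 to 2018"

report = REPORT_PATTERN.search(label).group(1)
year = YEAR_PATTERN.search(report).group(1)

print(report)  # 2017 to 2018
print(year)    # 2017
```

Note that `year` is a string here, just as the `Year` column is after `str.extract`; it would need an explicit conversion if a numeric year were wanted.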

In [8]:
#Some companies have changed names over the years. 
#For each EmployerId, find the most recent report they submitted and apply this EmployerName 
#across all reports they've submitted
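#Group by EmployerId only: grouping by Year as well would leave every name unchanged
#A stable sort by Year puts each employer's most recent report last within its group
df = df.sort_values('Year', kind='stable')
df['EmployerName'] = df.groupby('EmployerId')['EmployerName'].transform('last')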
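
A minimal sketch of the "latest name per EmployerId" requirement on made-up rows may help here. The employer names and IDs below are invented for illustration. The idea is to order the reports so that the most recent one comes last for each employer, then broadcast that last name to all of the employer's rows.

```python
import pandas as pd

# Made-up reports: employer 1 changed its name in the 2019 report
reports = pd.DataFrame({
    "EmployerId":   [1, 2, 1, 1],
    "Year":         ["2018", "2017", "2017", "2019"],
    "EmployerName": ["OLD NAME LTD", "OTHER LTD", "OLD NAME LTD", "NEW NAME LTD"],
})

# Sort so each employer's most recent report is the last row of its group,
# then copy that last name onto every row belonging to the employer
reports = reports.sort_values(["EmployerId", "Year"])
reports["EmployerName"] = (
    reports.groupby("EmployerId")["EmployerName"].transform("last")
)

print(reports)
```

Grouping on `EmployerId` alone is what lets the name travel across years; if `Year` were part of the grouping key, each group would hold a single report and its name would never change.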

In [9]:
#Create a Pay Gap field to explain the pay gap in plain English
#A positive DiffMedianHourlyPercent means women's pay is lower than men's
#abs(x) stops a negative sign from appearing inside the sentence
df["Pay Gap"] = df['DiffMedianHourlyPercent'].apply(lambda x:
    f"In this organisation, women's median hourly pay is {abs(x)}% lower than men's."
    if x > 0 else
    (f"In this organisation, women's median hourly pay is {abs(x)}% higher than men's."
     if x < 0 else
     "In this organisation, men's and women's median hourly pay is equal."))
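
The sign convention from the requirements is easy to get backwards, so here is a small standalone sketch of the sentence logic. `pay_gap_sentence` is a hypothetical helper written for this example; the three test values (4.9, -13.0 and 0.0) are DiffMedianHourlyPercent values that appear in the sample rows shown earlier.

```python
def pay_gap_sentence(diff_median_hourly_percent):
    """Hypothetical helper: describe a DiffMedianHourlyPercent value in plain English.

    A positive value means women's median hourly pay is lower than men's,
    a negative value means it is higher.
    """
    x = diff_median_hourly_percent
    if x > 0:
        direction = "lower"
    elif x < 0:
        direction = "higher"
    else:
        return "In this organisation, men's and women's median hourly pay is equal."
    # abs() keeps the minus sign out of the sentence
    return (
        f"In this organisation, women's median hourly pay is "
        f"{abs(x)}% {direction} than men's."
    )


for value in (4.9, -13.0, 0.0):
    print(pay_gap_sentence(value))
```

A named function like this could be passed to `apply` in place of a nested conditional expression, which tends to be easier to read and to test.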

In [10]:
df = df[['Year','Report','EmployerName','EmployerId','EmployerSize','DiffMedianHourlyPercent', 'Pay Gap']]
df.head()

| | Year | Report | EmployerName | EmployerId | EmployerSize | DiffMedianHourlyPercent | Pay Gap |
|---|---|---|---|---|---|---|---|
| 0 | 2017 | 2017 to 2018 | "Bryanston School",Incorporated | 676 | 500 to 999 | 28.2 | In this organisation, women's median hourly pa... |
| 1 | 2017 | 2017 to 2018 | "RED BAND" CHEMICAL COMPANY, LIMITED | 16879 | 250 to 499 | -2.7 | In this organisation, women's median hourly pa... |
| 2 | 2017 | 2017 to 2018 | 123 EMPLOYEES LTD | 17677 | 250 to 499 | 36.0 | In this organisation, women's median hourly pa... |
| 3 | 2017 | 2017 to 2018 | 1610 LIMITED | 682 | 250 to 499 | -34.0 | In this organisation, women's median hourly pa... |
| 4 | 2017 | 2017 to 2018 | 1879 EVENTS MANAGEMENT LIMITED | 17101 | 250 to 499 | 8.1 | In this organisation, women's median hourly pa... |


In [11]:
#output the data
df.to_csv('wk12-output.csv', index=False)