## MODELLING - GPG - Government Equalities Office

### Target: "DiffMedianHourlyPercent"

#### 1 - General Preparation
- Unnecessary Columns
- Missing Values
- Imputing Values

#### 2 - Feature Extension
- Quantize Company Size
- Include Company Sector

#### 3 - Linear Model
- Feature Engineering

In [28]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [29]:
path = "data/gender-pay-gap-uk-gov/2017.csv.gz"
df = pd.read_csv(path, compression='gzip')
rows = df.shape[0]
cols = df.shape[1]
print(f"Rows: {rows}")
print(f"Cols: {cols}")

Rows: 10562
Cols: 25


### 1 - General Preparation

In [30]:
df.head(3)

| | EmployerName | Address | CompanyNumber | SicCodes | DiffMeanHourlyPercent | DiffMedianHourlyPercent | DiffMeanBonusPercent | DiffMedianBonusPercent | MaleBonusPercent | FemaleBonusPercent | ... | FemaleUpperMiddleQuartile | MaleTopQuartile | FemaleTopQuartile | CompanyLinkToGPGInfo | ResponsiblePerson | EmployerSize | CurrentName | SubmittedAfterTheDeadline | DueDate | DateSubmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "Bryanston School",Incorporated | Bryanston House,\r\nBlandford,\r\nDorset,\r\nD... | 00226143 | 85310 | 18.0 | 28.2 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 50.8 | 51.5 | 48.5 | https://www.bryanston.co.uk/employment | Nick McRobb (Bursar and Clerk to the Governors) | 500 to 999 | BRYANSTON SCHOOL INCORPORATED | False | 05/04/2018 00:00:00 | 27/03/2018 11:42:49 |
| 1 | "RED BAND" CHEMICAL COMPANY, LIMITED | 19 Smith's Place,\r\nLeith Walk,\r\nEdinburgh,... | SC016876 | 47730 | 2.3 | -2.7 | 15.0 | 37.5 | 15.6 | 66.7 | ... | 89.7 | 18.1 | 81.9 | NaN | Philip Galt (Managing Director) | 250 to 499 | "RED BAND" CHEMICAL COMPANY, LIMITED | False | 05/04/2018 00:00:00 | 28/03/2018 16:44:25 |
| 2 | 118 LIMITED | 3 Alexandra Gate Ffordd Pengam,\r\nGround Floo... | 03951948 | 61900 | 1.7 | 2.8 | 13.1 | 13.6 | 70.0 | 57.0 | ... | 50.0 | 58.0 | 42.0 | NaN | Emma Crowe (VP, Human Resources) | 500 to 999 | 118 LIMITED | False | 05/04/2018 00:00:00 | 27/03/2018 19:10:41 |


In [31]:
print(df.nunique())

EmployerName                 10561
Address                       9039
CompanyNumber                 9203
SicCodes                      1943
DiffMeanHourlyPercent          829
DiffMedianHourlyPercent        891
DiffMeanBonusPercent          1763
DiffMedianBonusPercent        1734
MaleBonusPercent               978
FemaleBonusPercent             979
MaleLowerQuartile              980
FemaleLowerQuartile            980
MaleLowerMiddleQuartile        986
FemaleLowerMiddleQuartile      986
MaleUpperMiddleQuartile        974
FemaleUpperMiddleQuartile      974
MaleTopQuartile                944
FemaleTopQuartile              944
CompanyLinkToGPGInfo          6589
ResponsiblePerson             7304
EmployerSize                     7
CurrentName                  10561
SubmittedAfterTheDeadline        2
DueDate                          2
DateSubmitted                10474
dtype: int64
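
The distinct-value counts above can also be turned into a ratio, which makes identifier-like and constant columns easier to spot. The following is a minimal sketch on a made-up toy frame; the `0.99` threshold and the names `toy`, `id_like` and `constant` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "EmployerName": ["A Ltd", "B Ltd", "C Ltd", "D Ltd"],
    "EmployerSize": ["250 to 499", "250 to 499", "500 to 999", "250 to 499"],
    "DueDate": ["05/04/2018"] * 4,
})

# Share of distinct values per column (1.0 means every row is different).
unique_ratio = toy.nunique() / len(toy)

# Columns that behave like identifiers (assumed threshold: more than 99% distinct).
id_like = unique_ratio[unique_ratio > 0.99].index.tolist()

# Columns with a single value carry no information for a model.
constant = toy.columns[toy.nunique() <= 1].tolist()

print("identifier-like:", id_like)
print("constant:", constant)
```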


In [32]:
# Not displayed: a notebook cell only echoes its last expression
df['DueDate'].value_counts()
df['CompanyLinkToGPGInfo'].sample(10)

3600    http://www.fairhurst.co.uk/site-information/ge...
7670                                                  NaN
6116                                                  NaN
5905    https://www.middevon.gov.uk/your-council/equal...
4985                                                  NaN
4881                           http://www.jzflowers.co.uk
6729                                                  NaN
6892    http://www.penspen.com/wp-content/uploads/2018...
4455    https://www.iwt.co.uk/about-us/gender-pay-report/
3259                                                  NaN
Name: CompanyLinkToGPGInfo, dtype: object

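**Dropping Data:**
- EmployerName / CurrentName: almost unique per row (10561 distinct values in 10562 rows), so they behave like identifiers
- CompanyLinkToGPGInfo: contains free-form URLs
- ResponsiblePerson: free-text name and job title. A possible follow-up is to check whether the companies that left this field empty also skipped publishing a PDF report, submitted incomplete data, or show some other pattern.
- DueDate / DateSubmitted / SubmittedAfterTheDeadline: submission metadata rather than pay data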
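
The follow-up idea for `ResponsiblePerson` could be explored with a cross-tabulation of missingness flags. This is only a sketch on a made-up toy frame, and it assumes that an empty `CompanyLinkToGPGInfo` is a usable proxy for "no report published", which the data does not confirm.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are made up.
toy = pd.DataFrame({
    "ResponsiblePerson": ["A. Smith", None, "B. Jones", None, "C. Lee"],
    "CompanyLinkToGPGInfo": ["http://a.example", None, None, None, "http://c.example"],
})

# Flag the rows where each field is missing.
missing_person = toy["ResponsiblePerson"].isna()
missing_link = toy["CompanyLinkToGPGInfo"].isna()

# Count how often the two fields are missing together or separately.
overlap = pd.crosstab(
    missing_person, missing_link,
    rownames=["person_missing"], colnames=["link_missing"],
)
print(overlap)
```

A large count in the cell where both flags are `True`, relative to the other cells, would hint that the two fields tend to be skipped together.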

In [33]:
columns_to_drop = [
    'EmployerName','CurrentName',
    'CompanyLinkToGPGInfo','ResponsiblePerson',
    'DueDate','DateSubmitted','SubmittedAfterTheDeadline'
]
df_cols = df.drop(columns_to_drop, axis='columns')
df_cols.shape

(10562, 18)

#### Imputing values

In [34]:
def check_presence(df):
    return df.notnull().sum() / rows
check_presence(df_cols)

Address                      0.974153
CompanyNumber                0.871331
SicCodes                     0.946317
DiffMeanHourlyPercent        1.000000
DiffMedianHourlyPercent      1.000000
DiffMeanBonusPercent         0.998106
DiffMedianBonusPercent       0.998106
MaleBonusPercent             1.000000
FemaleBonusPercent           1.000000
MaleLowerQuartile            1.000000
FemaleLowerQuartile          1.000000
MaleLowerMiddleQuartile      1.000000
FemaleLowerMiddleQuartile    1.000000
MaleUpperMiddleQuartile      1.000000
FemaleUpperMiddleQuartile    1.000000
MaleTopQuartile              1.000000
FemaleTopQuartile            1.000000
EmployerSize                 1.000000
dtype: float64

In [36]:
# Mean imputation, because the column itself reports a mean difference
mean_bonus_percent = df_cols['DiffMeanBonusPercent'].mean()
df_cols['DiffMeanBonusPercent'] = df_cols['DiffMeanBonusPercent'].fillna(mean_bonus_percent)

# Median imputation, because the column itself reports a median difference
median_bonus_percent = df_cols['DiffMedianBonusPercent'].median()
df_cols['DiffMedianBonusPercent'] = df_cols['DiffMedianBonusPercent'].fillna(median_bonus_percent)

# Mode imputation, because the column is categorical.
# mode() returns a Series, so take its first entry to get a scalar.
employer_size_mode = df_cols['EmployerSize'].mode()[0]
df_cols['EmployerSize'] = df_cols['EmployerSize'].fillna(employer_size_mode)

# employer_size_mode
check_presence(df_cols)

Address                      0.974153
CompanyNumber                0.871331
SicCodes                     0.946317
DiffMeanHourlyPercent        1.000000
DiffMedianHourlyPercent      1.000000
DiffMeanBonusPercent         1.000000
DiffMedianBonusPercent       1.000000
MaleBonusPercent             1.000000
FemaleBonusPercent           1.000000
MaleLowerQuartile            1.000000
FemaleLowerQuartile          1.000000
MaleLowerMiddleQuartile      1.000000
FemaleLowerMiddleQuartile    1.000000
MaleUpperMiddleQuartile      1.000000
FemaleUpperMiddleQuartile    1.000000
MaleTopQuartile              1.000000
FemaleTopQuartile            1.000000
EmployerSize                 1.000000
dtype: float64
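
One detail of mode imputation is easy to miss: `Series.mode()` returns a Series, and `fillna` with a Series aligns on the index instead of broadcasting a single value. The sketch below illustrates this on a made-up column; `sizes` and the other names are hypothetical.

```python
import pandas as pd

# Toy column with one missing value; the labels mirror EmployerSize categories.
sizes = pd.Series(["250 to 499", "250 to 499", None, "500 to 999"])

mode_series = sizes.mode()        # a Series holding the most frequent label(s)
mode_value = mode_series.iloc[0]  # the scalar label

# Filling with the Series aligns on index, so the gap at index 2 stays unfilled.
filled_with_series = sizes.fillna(mode_series)

# Filling with the scalar replaces every missing value.
filled_with_scalar = sizes.fillna(mode_value)

print("missing after Series fill:", filled_with_series.isna().sum())
print("missing after scalar fill:", filled_with_scalar.isna().sum())
```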

In [38]:
# Treat 'Not Provided' as missing and replace it with the most frequent size band
df_cols['EmployerSize'] = df_cols['EmployerSize'].replace('Not Provided', '250 to 499')
df_cols['EmployerSize'].value_counts()
# check_presence(df)

250 to 499        5026
500 to 999        2534
1000 to 4999      2208
5000 to 19,999     441
Less than 250      288
20,000 or more      65
Name: EmployerSize, dtype: int64

### 2 - Feature Extension: EmployerSize range to numeric value

In [46]:
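# df_cols is modified in place; take a copy first if the string labels are still needed.
# Each size band is replaced by a rough representative headcount.
cat_to_quant = {
    'Less than 250': 125,
    '250 to 499': 375,
    '500 to 999': 750,
    '1000 to 4999': 2500,
    '5000 to 19,999': 15000,
    '20,000 or more': 35000
}
# Map only while the column still holds the string labels:
# mapping already-numeric values a second time would turn them all into NaN.
if not pd.api.types.is_numeric_dtype(df_cols['EmployerSize']):
    df_cols['EmployerSize'] = df_cols['EmployerSize'].map(cat_to_quant)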
df_cols['EmployerSize'].value_counts()

# check_presence(df)

375      5026
750      2534
2500     2208
15000     441
125       288
35000      65
Name: EmployerSize, dtype: int64
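
Because re-running a plain `map` on an already encoded column produces NaN, the encoding step can be wrapped so that it is safe to run more than once. The helper `encode_employer_size` below is a hypothetical sketch, not part of the original notebook; it reuses the same representative values as the mapping above.

```python
import pandas as pd

# Same representative headcounts as in the mapping cell.
CAT_TO_QUANT = {
    "Less than 250": 125,
    "250 to 499": 375,
    "500 to 999": 750,
    "1000 to 4999": 2500,
    "5000 to 19,999": 15000,
    "20,000 or more": 35000,
}

def encode_employer_size(column: pd.Series) -> pd.Series:
    """Map size labels to numbers; values without a mapping pass through unchanged."""
    mapped = column.map(CAT_TO_QUANT)
    # Where no mapping was found (for example the column was already encoded),
    # keep the original value instead of NaN.
    return mapped.fillna(column)

labels = pd.Series(["250 to 499", "20,000 or more", "Less than 250"])
once = encode_employer_size(labels)
twice = encode_employer_size(once)  # second run leaves the numbers as they are

print(once.tolist())
print(twice.tolist())
```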

### 3 - Linear Model

In [64]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

features = [
#     'DiffMeanBonusPercent', 'DiffMedianBonusPercent',
    'MaleBonusPercent', 'FemaleBonusPercent',
    'MaleLowerQuartile', 'FemaleLowerQuartile',
    'MaleLowerMiddleQuartile', 'FemaleLowerMiddleQuartile',
    'MaleUpperMiddleQuartile', 'FemaleUpperMiddleQuartile',
    'MaleTopQuartile', 'FemaleTopQuartile', 'EmployerSize'
]

In [59]:
X = df_cols[features]
y = df_cols['DiffMedianHourlyPercent']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7921, 11) (7921,)
(2641, 11) (2641,)


In [62]:
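# The `normalize` argument was removed in scikit-learn 1.2. For ordinary least
# squares it does not change the predictions, so the default estimator is used.
linear_reg = LinearRegression()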
linear_reg.fit(X_train,y_train)

In [69]:
y_preds_test = linear_reg.predict(X_test)
y_preds_train = linear_reg.predict(X_train)

train_score = r2_score(y_train, y_preds_train, multioutput='raw_values')
test_score = r2_score(y_test, y_preds_test, multioutput='raw_values')

print("Train Score:", train_score)
print("Test Score:", test_score)

Train Score: [0.49317162]
Test Score: [0.50060095]
