## 3. Gender Guesser

### Step 1: I installed a Python package, gender-guesser (pypi.org/project/gender-guesser), to classify each officer's name as male or female. I then used the resulting counts of male and female officers to calculate the proportion of female officers for each company.

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time, random
import re

In [2]:
officers_18_19 = pd.read_csv('officers_18_19_v2.csv', parse_dates=['date_appointed','date_resigned'])

In [3]:
officers_18_19.drop(columns='Unnamed: 0', inplace=True)

In [4]:
officers_18_19.head()

Unnamed: 0,company_no,name,date_appointed,date_resigned
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,
2,SC016876,"CUMMING, Alexander Coulson",,
3,SC016876,"CUMMING, Charles Nigel Coulson",,
4,SC016876,"GALT, Philip Callaghan",2014-08-19,


### Step 2: I used a regex to extract the part of each name after the comma, which gives the first and middle names. A few company names were listed as officers. These entries contain no comma, so the split raised an error when I first tried it. The code below therefore records them as NaN so that they can be removed in Step 4.

In [5]:
name_split = []
for line in officers_18_19.name:
    if ',' in line:
        for i in re.findall(r',([^;]*)', line):
            name_split.append(i)
    else:
        name_split.append(np.nan)

In [6]:
# I checked the lists were the same length:
print(len(officers_18_19.name))
print(len(name_split))

178420
178420
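
The comma split in Step 2 can also be written without an explicit loop. The following is a minimal sketch, assuming every individual's name follows the "SURNAME, Forenames" pattern seen in the `head()` output. The `sample` frame is illustration data only, and "ACME HOLDINGS LIMITED" is a hypothetical corporate officer rather than a row from the real file.

```python
import pandas as pd

# Illustration data: two individuals and one corporate officer (no comma).
sample = pd.DataFrame({
    "name": ["VALENTINE, Dianne Davidson",
             "CLUBB, Malcolm Richard",
             "ACME HOLDINGS LIMITED"],
})

# Split once on the first comma; expand=True gives one column per piece.
parts = sample["name"].str.split(",", n=1, expand=True)

# Column 1 holds everything after the comma. Rows without a comma have no
# second piece, so they come back as missing values.
sample["name_split"] = parts[1].str.strip()

print(sample)
```

Unlike the loop above, this sketch strips the leading space, so the first name would then sit at position 0 of a later split on spaces rather than position 1.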


### Step 3: I created a column for the first name. Because the text after the comma starts with a space, splitting on spaces leaves an empty string at index 0, so the first name is at index 1. The regex then removes any character that is not a letter (a-z, A-Z) or a hyphen. Note that some names appear as "SURNAME, Firstname, Middlename" instead of "SURNAME, Firstname Middlename".

### I then created a column for the middle name in the same way.

In [7]:
first_name = []
for x in name_split:
    try:
        splitter = (x.split(' ')[1])
        splitter = re.sub(r'[^a-zA-Z-]', '', splitter)
        first_name.append(splitter)
    except (AttributeError, IndexError):  # NaN entry, or no name at this position
        first_name.append(np.nan)

In [8]:
middle_name = []
for x in name_split:
    try:
        splitter_2 = (x.split(' ')[2])
        splitter_2 = re.sub(r'[^a-zA-Z-]', '', splitter_2)
        middle_name.append(splitter_2)
    except (AttributeError, IndexError):  # NaN entry, or no middle name present
        middle_name.append(np.nan)

In [9]:
# add columns to the dataframe
officers_18_19['name_split'] = name_split
officers_18_19['first_name'] = first_name
officers_18_19['middle_name'] = middle_name
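
As a possible alternative to the two loops in Step 3, the first and middle names can be pulled out with pandas string methods. This is a sketch only: the values in `name_split` below are made up to mimic the column's shape (leading space kept, one "Firstname, Middlename" variant, one missing value).

```python
import numpy as np
import pandas as pd

# Made-up values shaped like the notebook's name_split column.
name_split = pd.Series([" Dianne Davidson", " Moira", " Anne-Marie, Jane", np.nan])

# Treat stray commas as separators, then split on runs of whitespace.
tokens = name_split.str.replace(",", " ", regex=False).str.split()

# .str[i] returns NaN when a row has no token at position i.
first_name = tokens.str[0].str.replace(r"[^a-zA-Z-]", "", regex=True)
middle_name = tokens.str[1].str.replace(r"[^a-zA-Z-]", "", regex=True)

print(pd.DataFrame({"first_name": first_name, "middle_name": middle_name}))
```

Splitting on whitespace without an argument discards the empty leading piece, so the first name is token 0 here.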

### Step 4: I ran gender-guesser on the first name and the middle name, creating two lists stating the gender of each, and added these lists to the dataframe.

### I also removed duplicates and removed lines where the 'name_split' column (named 'test' in the saved CSV used later) was NaN, as these lines appeared when the officer was a company rather than an individual.


In [10]:
import gender_guesser.detector as gender
d = gender.Detector()
gender_first_name = []
for name in officers_18_19.first_name:
    name_input = u"{}"
    gender_first_name.append(d.get_gender(name_input.format(name)))

In [11]:
d = gender.Detector()
gender_middle_name = []
for name in officers_18_19.middle_name:
    name_input = u"{}"
    gender_middle_name.append(d.get_gender(name_input.format(name)))

In [12]:
officers_18_19['gender_first_name'] = gender_first_name
officers_18_19['gender_middle_name'] = gender_middle_name
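
Many officers share a first name, so the classifier could be called once per distinct name instead of once per row. The sketch below illustrates that idea. `guess_genders` and `toy_get_gender` are hypothetical helpers introduced here so the example runs without the package installed. In the notebook the `get_gender` argument would be `d.get_gender`.

```python
import numpy as np
import pandas as pd

def guess_genders(names, get_gender):
    """Label each name by calling get_gender once per distinct value."""
    distinct = names.dropna().unique()
    lookup = {name: get_gender(name) for name in distinct}
    # Missing names are absent from the lookup and are labelled 'unknown'.
    return names.map(lookup).fillna("unknown")

# Stand-in classifier for illustration only.
toy_labels = {"Dianne": "female", "Malcolm": "male"}

def toy_get_gender(name):
    return toy_labels.get(name, "unknown")

first_names = pd.Series(["Dianne", "Malcolm", "Dianne", np.nan, "Nitin"])
print(guess_genders(first_names, toy_get_gender).tolist())
```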

In [13]:
officers_18_19.drop_duplicates(inplace=True)

In [14]:
officers_18_19.dropna(subset=['name_split'], inplace=True)

In [15]:
officers_18_19.gender_first_name.value_counts()

male             130226
female            27542
unknown            5088
mostly_male        2361
mostly_female      2328
andy                631
Name: gender_first_name, dtype: int64

In [16]:
officers_18_19[officers_18_19['gender_first_name']=='unknown'].head()

Unnamed: 0,company_no,name,date_appointed,date_resigned,name_split,first_name,middle_name,gender_first_name,gender_middle_name
15,03951948,"KHATTAR, Nitin",2020-03-30,,Nitin,Nitin,,unknown,unknown
18,03951948,"RAVAL, Pradipkumar",2014-05-30,,Pradipkumar,Pradipkumar,,unknown,unknown
19,03951948,"CHANDRANI, Rupesh Sandeep",2012-10-01,2014-06-25,Rupesh Sandeep,Rupesh,Sandeep,unknown,mostly_male
28,03951948,"STONE, Deidrie Alexandria",2019-07-31,2020-03-26,Deidrie Alexandria,Deidrie,Alexandria,unknown,female
138,SC156515,"CURLE, Tolla Joanne",2015-10-02,,Tolla Joanne,Tolla,Joanne,unknown,female


In [17]:
officers_18_19[officers_18_19['date_appointed']=='None'].head()

Unnamed: 0,company_no,name,date_appointed,date_resigned,name_split,first_name,middle_name,gender_first_name,gender_middle_name
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,male,unknown
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,male,male
6,SC016876,"BARGH, Frederick Charles",,1990-06-13,Frederick Charles,Frederick,Charles,male,male
7,SC016876,"O'TOOLE, Moira",,2015-05-11,Moira,Moira,,female,unknown
8,SC016876,"CLARK, David Robertson",,1995-01-19,David Robertson,David,Robertson,male,male


### Step 5: This left me with around 5,000 'unknown' values and around 600 listed as 'andy'. I wasn't quite sure what the 'andy' items were (in gender-guesser, 'andy' marks androgynous names), so I went through each one manually; this is detailed further below. I decided to remove the unknown values.

### I also wanted to remove officers appointed after 2019, as this was after the gender pay gap data period. I therefore removed all lines where the date appointed was in 2020 or 2021.

In [18]:
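officers_18_19['after_2019'] = 0
# Caution: enumerate() yields row positions, while ['after_2019'][i] selects by index label;
# the two can differ because rows were dropped above without resetting the index.
for i, line in enumerate(officers_18_19['date_appointed']):
    if '2020' in line:
        officers_18_19['after_2019'][i] = np.nan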

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19['after_2019'][i] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [19]:
for i, line in enumerate(officers_18_19['date_appointed']):
    if '2021' in line:
        officers_18_19['after_2019'][i] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19['after_2019'][i] = np.nan


In [20]:
officers_18_19.dropna(subset=['after_2019'], inplace=True)

In [21]:
officers_18_19.gender_first_name[officers_18_19.gender_first_name == "unknown"] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19.gender_first_name[officers_18_19.gender_first_name == "unknown"] = np.nan


In [22]:
officers_18_19.gender_first_name.value_counts()

male             125151
female            26499
mostly_male        2271
mostly_female      2245
andy                607
Name: gender_first_name, dtype: int64
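
The year filter and the 'unknown' masking in this step can also be expressed with boolean masks, which avoids both the chained assignment warnings and any reliance on row positions matching index labels. This is a minimal sketch, assuming `date_appointed` holds strings such as '2020-03-30' or 'None' (as the comparison in In [17] suggests). The `officers` frame is a small made-up sample.

```python
import numpy as np
import pandas as pd

# Made-up rows; dates are strings, with 'None' for a missing date.
officers = pd.DataFrame({
    "name": ["KHATTAR, Nitin", "CLUBB, Malcolm Richard",
             "CUMMING, Alexander Coulson", "RAVAL, Pradipkumar"],
    "date_appointed": ["2020-03-30", "2018-05-09", "None", "2014-05-30"],
    "gender_first_name": ["unknown", "male", "male", "unknown"],
}, index=[15, 1, 2, 18])

# Unparseable values such as 'None' become NaT, whose year is NaN.
year = pd.to_datetime(officers["date_appointed"], errors="coerce").dt.year

# Keep rows appointed in 2019 or earlier, plus rows with no appointment date.
kept = officers[(year <= 2019) | year.isna()].copy()

# Replace 'unknown' with NaN through .loc rather than chained indexing.
kept.loc[kept["gender_first_name"] == "unknown", "gender_first_name"] = np.nan
print(kept)
```

Rows without an appointment date are kept here, matching the loops above, which only flag rows whose date text contains '2020' or '2021'.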

In [23]:
# Save to a CSV file
# officers_18_19.to_csv('detailed_officers_18_19.csv', index=False)

detailed_officers_18_19 = pd.read_csv('detailed_officers_18_19.csv')

In [24]:
detailed_officers_18_19.dropna(subset=['first_name'], inplace=True)
detailed_officers_18_19.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161654 entries, 0 to 161655
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   company_no          161654 non-null  object 
 1   name                161654 non-null  object 
 2   date_appointed      161654 non-null  object 
 3   date_resigned       161654 non-null  object 
 4   test                161654 non-null  object 
 5   first_name          161654 non-null  object 
 6   middle_name_1       118000 non-null  object 
 7   gender_first_name   156773 non-null  object 
 8   gender_middle_name  161654 non-null  object 
 9   after_2019          161654 non-null  float64
dtypes: float64(1), object(9)
memory usage: 13.6+ MB


### Step 6: For the roughly 600 names listed as 'andy', I iterated through each row, googled the individual, and typed in 'male' or 'female'. Individuals I was unable to classify were marked as NA. I used this to generate a column with the manual gender input, removing the NA values.

In [25]:
# Could create 2 columns or just dummify manual gender below:
# detailed_officers_18_19['manual_female'] = 0
# detailed_officers_18_19['manual_male'] = 0

In [26]:
# use iterrows to loop through the rows and then print out the row name, input the replacement (m or f) to go
# into the manual gender line. 
# This method will give you the index of the line and the manual gender column - you can use this to join the data 
# together (https://stackoverflow.com/questions/40468069/merge-two-dataframes-by-index) then dummify the column. 
manual_gender = []
index_manual = []
for index, row in detailed_officers_18_19.iterrows():
    if row['gender_first_name']=='andy':
        print(row['name'])
        manual = input('male or female?')
        index_manual.append(index)
        manual_gender.append(manual)
#     detailed_officers_18_19.name[detailed_officers_18_19['gender_first_name']=='andy']

NOVAKOVIC, Novica
male or female?m
NOVAKOVIC, Novica


KeyboardInterrupt: Interrupted by user
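
The output above prints the same name twice, so one possible refinement is to prompt once per distinct name and copy the answer to every matching row. The sketch below assumes that rows sharing an identical full name refer to the same person, which may not always hold. `label_manually` and its `ask` parameter are hypothetical names introduced for illustration, and the scripted answer stands in for typing at the prompt.

```python
import pandas as pd

def label_manually(df, ask=input):
    """Prompt once per distinct 'andy' name; return labels on df's own index."""
    andy_rows = df[df["gender_first_name"] == "andy"]
    answers = {}
    for full_name in andy_rows["name"].unique():
        answers[full_name] = ask(f"{full_name} - male or female? ")
    # Map the answers back onto every matching row, keeping the row labels.
    return andy_rows["name"].map(answers).rename("manual_gender")

# Made-up rows with non-consecutive row labels.
demo = pd.DataFrame(
    {"name": ["NOVAKOVIC, Novica", "CLUBB, Malcolm Richard", "NOVAKOVIC, Novica"],
     "gender_first_name": ["andy", "male", "andy"]},
    index=[761, 762, 780],
)
scripted = iter(["male"])
manual = label_manually(demo, ask=lambda prompt: next(scripted))
print(manual)
```

Because the result keeps the original row labels as its index, it can be joined back to the main dataframe by index.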

In [27]:
manual_gender

['m']

In [28]:
index_manual

[761]

In [30]:
andy_gender = pd.DataFrame(index=index_manual)
andy_gender['manual_gender'] = manual_gender

In [31]:
andy_gender.loc[andy_gender['manual_gender'] == 'NA', 'manual_gender'] = np.nan

In [None]:
# andy_gender.to_csv('andy_gender_18_19.csv')

In [32]:
andy_gender = pd.read_csv('andy_gender_18_19.csv')

In [33]:
detailed_officers_18_19_v2 = pd.concat([detailed_officers_18_19, andy_gender], axis=1)

In [34]:
detailed_officers_18_19_v2.head()

Unnamed: 0.1,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,gender_first_name,gender_middle_name,after_2019,Unnamed: 0,manual_gender
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson,female,male,0.0,761.0,male
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard,male,male,0.0,780.0,male
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,male,unknown,0.0,923.0,female
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,male,male,0.0,1111.0,female
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan,male,unknown,0.0,1251.0,female
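
In the `head()` output above, the `Unnamed: 0` column holds values such as 761, which matches the row label recorded in `index_manual`. That suggests the saved index was read back as an ordinary column, in which case `pd.concat(..., axis=1)` would align the manual answers with rows 0, 1, 2, ... rather than with the rows they were entered for. The sketch below shows one way to align on the saved labels instead, assuming `Unnamed: 0` really does hold the original row labels. The data is made up.

```python
import pandas as pd

# Made-up officer rows keyed by their original row labels.
officers = pd.DataFrame(
    {"name": ["VALENTINE, Dianne Davidson", "NOVAKOVIC, Novica", "CLUBB, Malcolm Richard"],
     "gender_first_name": ["female", "andy", "male"]},
    index=[0, 761, 1],
)

# What read_csv returns for the saved answers: the old index arrives as an
# ordinary column called 'Unnamed: 0' and the new index restarts at 0.
andy_gender = pd.DataFrame({"Unnamed: 0": [761], "manual_gender": ["male"]})

# Restore the saved labels as the index, then join on it.
andy_gender = andy_gender.set_index("Unnamed: 0")
merged = officers.join(andy_gender, how="left")
print(merged)
```

Passing `index_col=0` to `pd.read_csv` would restore the index at load time and make the `set_index` call unnecessary.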


### Step 7: I dummified the gender columns. Where the first name was 'mostly_female' and the middle name was female, I changed the officer to female. Similarly, where the first name was 'mostly_male' and the middle name was male, I changed the officer to male.

### I then combined all the gender columns to generate a count of male officers and female officers per company. This was then used to calculate the proportion of officers who were female.

In [35]:
detailed_officers_18_19_v2 = pd.get_dummies(detailed_officers_18_19_v2, columns = ['gender_first_name', 
                                                                                       'gender_middle_name', 
                                                                                       'manual_gender'])

In [36]:
detailed_officers_18_19_v2['middle_check_manual_female'] = 0
detailed_officers_18_19_v2['middle_check_manual_male'] = 0

In [37]:
for i, line in enumerate(detailed_officers_18_19_v2['gender_first_name_mostly_female']):
    if line == 1 and detailed_officers_18_19_v2['gender_middle_name_female'][i] == 1:
        detailed_officers_18_19_v2['middle_check_manual_female'][i] = 1
    else:
        detailed_officers_18_19_v2['middle_check_manual_female'][i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_v2['middle_check_manual_female'][i] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_v2['middle_check_manual_female'][i] = 1


In [38]:
for i, line in enumerate(detailed_officers_18_19_v2['gender_first_name_mostly_male']):
    if line == 1 and detailed_officers_18_19_v2['gender_middle_name_male'][i] == 1:
        detailed_officers_18_19_v2['middle_check_manual_male'][i] = 1
    else:
        detailed_officers_18_19_v2['middle_check_manual_male'][i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_v2['middle_check_manual_male'][i] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_v2['middle_check_manual_male'][i] = 1


In [39]:
detailed_officers_18_19_v2['middle_check_manual_female'].value_counts()

0    161214
1       440
Name: middle_check_manual_female, dtype: int64

In [40]:
detailed_officers_18_19_v2['middle_check_manual_male'].value_counts()

0    160518
1      1136
Name: middle_check_manual_male, dtype: int64
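
The two row-by-row loops for the middle-name check can be replaced by a single boolean expression per column, which also avoids the chained assignment warnings. This is a sketch on made-up dummy columns that use the same names as the `get_dummies` output.

```python
import pandas as pd

# Made-up dummy columns shaped like the get_dummies output.
dummies = pd.DataFrame({
    "gender_first_name_mostly_female": [1, 0, 1],
    "gender_middle_name_female":       [1, 1, 0],
    "gender_first_name_mostly_male":   [0, 1, 0],
    "gender_middle_name_male":         [0, 1, 1],
}, index=[10, 20, 30])

# A row is promoted only when both indicators are set.
dummies["middle_check_manual_female"] = (
    (dummies["gender_first_name_mostly_female"] == 1)
    & (dummies["gender_middle_name_female"] == 1)
).astype(int)
dummies["middle_check_manual_male"] = (
    (dummies["gender_first_name_mostly_male"] == 1)
    & (dummies["gender_middle_name_male"] == 1)
).astype(int)
print(dummies)
```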

In [41]:
detailed_officers_18_19_v2.head()

Unnamed: 0.1,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,after_2019,Unnamed: 0,gender_first_name_andy,...,gender_middle_name_andy,gender_middle_name_female,gender_middle_name_male,gender_middle_name_mostly_female,gender_middle_name_mostly_male,gender_middle_name_unknown,manual_gender_female,manual_gender_male,middle_check_manual_female,middle_check_manual_male
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson,0.0,761.0,0,...,0,0,1,0,0,0,0,1,0,0
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard,0.0,780.0,0,...,0,0,1,0,0,0,0,1,0,0
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,0.0,923.0,0,...,0,0,0,0,0,1,1,0,0,0
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,0.0,1111.0,0,...,0,0,1,0,0,0,1,0,0,0
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan,0.0,1251.0,0,...,0,0,0,0,0,1,1,0,0,0


In [42]:
detailed_officers_18_19_v2['female_final'] = detailed_officers_18_19_v2.gender_first_name_female + \
detailed_officers_18_19_v2.manual_gender_female + detailed_officers_18_19_v2.middle_check_manual_female

In [43]:
detailed_officers_18_19_v2['male_final'] = detailed_officers_18_19_v2.gender_first_name_male + \
detailed_officers_18_19_v2.manual_gender_male + detailed_officers_18_19_v2.middle_check_manual_male

In [44]:
detailed_officers_18_19_v2.female_final.value_counts()

0    134700
1     26775
2       179
Name: female_final, dtype: int64
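
The value counts above include 179 rows where `female_final` is 2, meaning more than one source column flagged the same officer, so a plain sum would count that officer twice when totalled per company. Whether those rows reflect a genuine overlap or a misalignment would need checking. The sketch below only shows one way to cap the flag at 1, using made-up indicator columns.

```python
import pandas as pd

# Made-up indicator columns; the last row is flagged by two sources.
flags = pd.DataFrame({
    "gender_first_name_female":   [1, 0, 1],
    "manual_gender_female":       [0, 0, 1],
    "middle_check_manual_female": [0, 0, 0],
})

# Summing counts the last row twice.
summed = flags.sum(axis=1)

# The row-wise maximum is 1 if any source says female, else 0.
female_final = flags.max(axis=1)
print(summed.tolist(), female_final.tolist())
```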

In [45]:
detailed_officers_18_19_v2['filter'] = 0

In [None]:
# for i, line in enumerate(detailed_officers_18_19_v2['female_final']):
#     if line == 1 or detailed_officers_18_19_v2['male_final'][i] == 1:
#         detailed_officers_18_19_v2['filter'][i] = 1
#     else:
#         detailed_officers_18_19_v2['filter'][i] = np.nan

In [46]:
detailed_officers_18_19_v2.dropna(subset=['filter'], inplace=True)

In [47]:
grouped_by_company = detailed_officers_18_19_v2.groupby('company_no')
final_gendercount_18_19 = grouped_by_company.sum()

In [48]:
final_gendercount_18_19.columns

Index(['after_2019', 'Unnamed: 0', 'gender_first_name_andy',
       'gender_first_name_female', 'gender_first_name_male',
       'gender_first_name_mostly_female', 'gender_first_name_mostly_male',
       'gender_middle_name_andy', 'gender_middle_name_female',
       'gender_middle_name_male', 'gender_middle_name_mostly_female',
       'gender_middle_name_mostly_male', 'gender_middle_name_unknown',
       'manual_gender_female', 'manual_gender_male',
       'middle_check_manual_female', 'middle_check_manual_male',
       'female_final', 'male_final', 'filter'],
      dtype='object')

In [49]:
final_gendercount_18_19.drop(columns = ['after_2019', 'gender_first_name_andy',
       'gender_first_name_female', 'gender_first_name_male',
       'gender_first_name_mostly_female', 'gender_first_name_mostly_male',
       'gender_middle_name_andy', 'gender_middle_name_female',
       'gender_middle_name_male', 'gender_middle_name_mostly_female',
       'gender_middle_name_mostly_male', 'gender_middle_name_unknown',
       'manual_gender_female', 'manual_gender_male',
       'middle_check_manual_female', 'middle_check_manual_male', 'filter'], inplace=True)

In [50]:
final_gendercount_18_19['percent_female_officer'] = \
    final_gendercount_18_19.female_final/(final_gendercount_18_19.female_final+final_gendercount_18_19.male_final)
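
The per-company aggregation and ratio can be sketched compactly as follows. The company numbers "A1" and "B2" and all counts are made up. Note that `percent_female_officer` is a proportion between 0 and 1 rather than a percentage, as in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up per-officer flags for two hypothetical company numbers.
officers = pd.DataFrame({
    "company_no":   ["A1", "A1", "A1", "B2"],
    "female_final": [1, 0, 0, 0],
    "male_final":   [0, 1, 1, 0],
})

# Sum only the two count columns so text columns are never aggregated.
counts = officers.groupby("company_no")[["female_final", "male_final"]].sum()

# Proportion female; a company with no classified officers gives NaN
# because its zero total is replaced before dividing.
total = counts["female_final"] + counts["male_final"]
counts["percent_female_officer"] = counts["female_final"] / total.replace(0, np.nan)
print(counts)
```

Selecting the two count columns before `sum()` also removes the need to drop the other summed columns afterwards.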

In [51]:
# final_gendercount_18_19.to_csv('final_gendercount_18_19.csv', index_label='company_no')

In [52]:
final_gendercount_18_19 = pd.read_csv('final_gendercount_18_19.csv')

In [53]:
final_gendercount_18_19.set_index(keys = 'company_no', inplace=True)

In [54]:
final_gendercount_18_19.describe()

Unnamed: 0,female_final,male_final,percent_female_officer
count,7810.0,7810.0,7810.0
mean,3.521767,16.173367,0.187136
std,3.484793,8.873243,0.163375
min,0.0,0.0,0.0
25%,1.0,9.0,0.066667
50%,3.0,16.0,0.151515
75%,5.0,24.0,0.266667
max,24.0,35.0,1.0


In [57]:
final_gendercount_18_19[final_gendercount_18_19['percent_female_officer'] >= 0.5].head()

Unnamed: 0_level_0,female_final,male_final,percent_female_officer
company_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6400,23,12,0.657143
231824,18,13,0.580645
232081,17,14,0.548387
353341,17,13,0.566667
430051,5,3,0.625


### Interestingly, only around 500 of the roughly 7,800 companies have 50% or more female officers, whereas around 7,300 have more than 50% male officers.