## 3. Gender Guesser

I installed the Python package gender-guesser (pypi.org/project/gender-guesser) to classify each officer's name as male or female and to count the male and female officers. I then used these counts to calculate the proportion of female officers for each company.

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time, random

In [2]:
officers_18_19 = pd.read_csv('officers_18_19_v2', parse_dates=['date_appointed','date_resigned'])

In [3]:
officers_18_19.drop(columns='Unnamed: 0', inplace=True)

In [4]:
officers_18_19

Unnamed: 0,company_no,name,date_appointed,date_resigned
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,
2,SC016876,"CUMMING, Alexander Coulson",,
3,SC016876,"CUMMING, Charles Nigel Coulson",,
4,SC016876,"GALT, Philip Callaghan",2014-08-19,
...,...,...,...,...
178415,01032611,"HIGGINSON, Anthony Mitchell",2002-09-19,2020-12-31
178416,01032611,"HIGGINSON, Julie Elizabeth",2007-08-12,2020-12-31
178417,01032611,"LOTGERINK, Ronald Emmanuel Maria",1998-06-05,2018-07-26
178418,01032611,"STEVERS-VAN DER LAAN, Johanna Clazina Maria",,2016-06-30


<font color='red'>

#### I used regex to capture the part of each name after the comma, which gives the first and middle names. I noticed that a few company names were listed as officers, so I removed these. They were easy to identify because they contain no comma, which caused an error when I first tried to split on it. The code below therefore records them as NaN.
    
</font>



In [5]:
import re
test = []
for line in officers_18_19.name:
    if ',' in line:
        for i in re.findall(r',([^;]*)', line):
            test.append(i)
    else:
        test.append(np.nan)

In [6]:
# I checked the lists were the same length:
print(len(officers_18_19.name))
print(len(test))
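The comma split above can also be sketched without regex, using `str.partition`. This is only an illustration: the helper name `given_names` and the sample company entry are made up, and stripping the leading space is an assumption (the cell above keeps it).

```python
import math

def given_names(full_name):
    """Return the text after the first comma, or NaN when there is no comma.

    Hypothetical helper: entries without a comma (for example a company
    listed as an officer) become NaN, mirroring the loop above.
    """
    surname, comma, rest = full_name.partition(',')
    if not comma:
        return math.nan
    # Assumption: drop the space that follows the comma
    return rest.strip()

# Two names taken from the data shown here, plus one invented company entry
names = ["VALENTINE, Dianne Davidson", "HAIG, Lesley Jane, Professor", "SOME COMPANY LIMITED"]
parsed = [given_names(n) for n in names]
print(parsed)
```

Because only the first comma is used, a trailing title such as ", Professor" stays in the result, which is the same behaviour as the regex `,([^;]*)`.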

<font color='red'>

#### I created a column for the first name. Because each entry in 'test' starts with a space, splitting on a space leaves an empty string at index 0, so the first name is the item at index 1, not index 0. The regex then removes any character that is not a letter (a-z or A-Z) or a hyphen by replacing it with nothing. Note that some names appear as "SURNAME, Firstname, Middlename" instead of "SURNAME, Firstname Middlename".
    
#### I then similarly created a column for the middle name.
    
</font>

In [9]:
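first_name = []
for x in test:
    try:
        # x starts with a space, so index 0 is '' and index 1 is the first name
        splitter = x.split(' ')[1]
        # keep only letters and hyphens
        splitter = re.sub(r'[^a-zA-Z-]', '', splitter)
        first_name.append(splitter)
    except (AttributeError, IndexError):
        # NaN entries have no .split(); very short entries have no index 1
        first_name.append(np.nan)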

In [11]:
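middle_name_1 = []
for x in test:
    try:
        # index 2 is the first middle name, if there is one
        splitter_2 = x.split(' ')[2]
        splitter_2 = re.sub(r'[^a-zA-Z-]', '', splitter_2)
        middle_name_1.append(splitter_2)
    except (AttributeError, IndexError):
        # NaN entries have no .split(); names without a middle name have no index 2
        middle_name_1.append(np.nan)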

In [13]:
# add columns to the dataframe
officers_18_19['test'] = test
officers_18_19['first_name'] = first_name
officers_18_19['middle_name_1'] = middle_name_1
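The two loops above could also be written with the pandas `.str` accessor. The sketch below is a possible alternative rather than the code that produced the outputs in this notebook. The small `given` Series is invented for illustration, and stripping the leading space first (so the first name is at index 0) is an assumption.

```python
import pandas as pd

# Invented sample: text after the comma, with a missing entry for a company
given = pd.Series([" Dianne Davidson", " Moira", None])

# Strip the leading space, then split on whitespace into lists of tokens
tokens = given.str.strip().str.split()

# .str[i] returns NaN when a list is too short or the entry is missing
first = tokens.str[0].str.replace(r"[^a-zA-Z-]", "", regex=True)
middle = tokens.str[1].str.replace(r"[^a-zA-Z-]", "", regex=True)

print(pd.DataFrame({"first_name": first, "middle_name_1": middle}))
```

Missing values pass through every step, so no `try`/`except` is needed.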

In [14]:
officers_18_19

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan
...,...,...,...,...,...,...,...
178415,01032611,"HIGGINSON, Anthony Mitchell",2002-09-19,2020-12-31,Anthony Mitchell,Anthony,Mitchell
178416,01032611,"HIGGINSON, Julie Elizabeth",2007-08-12,2020-12-31,Julie Elizabeth,Julie,Elizabeth
178417,01032611,"LOTGERINK, Ronald Emmanuel Maria",1998-06-05,2018-07-26,Ronald Emmanuel Maria,Ronald,Emmanuel
178418,01032611,"STEVERS-VAN DER LAAN, Johanna Clazina Maria",,2016-06-30,Johanna Clazina Maria,Johanna,Clazina


<font color='red'>

#### I used gender-guesser on the first name and middle name, creating two lists stating the gender of each, and added these lists to the dataframe. 
#### I also removed duplicates and removed lines where the 'test' column was NaN, as these lines appeared when the officer was a company rather than an individual. 
    
</font>

In [15]:
import gender_guesser.detector as gender
d = gender.Detector()
gender_first_name = []
for name in officers_18_19.first_name:
    # str() turns a missing name (NaN) into the string 'nan', as the original format call did
    gender_first_name.append(d.get_gender(str(name)))

In [16]:
d = gender.Detector()
gender_middle_name = []
for name in officers_18_19.middle_name_1:
    # str() turns a missing name (NaN) into the string 'nan', as the original format call did
    gender_middle_name.append(d.get_gender(str(name)))
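Many officers share a first name, so one possible refinement is to look each distinct name up only once and reuse the answer. The sketch below shows the idea in plain Python. `classify_names`, `toy_lookup` and `toy_get_gender` are hypothetical names, and the toy lookup stands in for the detector so that the example runs without the package installed.

```python
def classify_names(names, get_gender):
    """Call get_gender once per distinct name and reuse the result."""
    cache = {}
    result = []
    for name in names:
        if name not in cache:
            cache[name] = get_gender(name)
        result.append(cache[name])
    return result

# Stand-in lookup (an assumption for illustration only); in the notebook,
# d.get_gender from the cells above would be passed instead.
toy_lookup = {"Dianne": "female", "Malcolm": "male", "Lesley": "andy"}
calls = []

def toy_get_gender(name):
    calls.append(name)  # record each real lookup
    return toy_lookup.get(name, "unknown")

labels = classify_names(["Dianne", "Malcolm", "Dianne", "Nitin"], toy_get_gender)
print(labels)
print(len(calls), "lookups for", len(labels), "names")
```

In the notebook this would be called as `classify_names(officers_18_19.first_name.astype(str), d.get_gender)`, which should give the same labels as the loop above.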

In [17]:
officers_18_19['gender_first_name'] = gender_first_name
officers_18_19['gender_middle_name'] = gender_middle_name

In [19]:
officers_18_19.drop_duplicates(inplace=True)

In [22]:
officers_18_19.dropna(subset=['test'], inplace=True)

In [23]:
officers_18_19.gender_first_name.value_counts()

male             130226
female            27542
unknown            5088
mostly_male        2361
mostly_female      2328
andy                631
Name: gender_first_name, dtype: int64

In [24]:
officers_18_19[officers_18_19['gender_first_name']=='unknown'].head()

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,gender_first_name,gender_middle_name
15,03951948,"KHATTAR, Nitin",2020-03-30,,Nitin,Nitin,,unknown,unknown
18,03951948,"RAVAL, Pradipkumar",2014-05-30,,Pradipkumar,Pradipkumar,,unknown,unknown
19,03951948,"CHANDRANI, Rupesh Sandeep",2012-10-01,2014-06-25,Rupesh Sandeep,Rupesh,Sandeep,unknown,mostly_male
28,03951948,"STONE, Deidrie Alexandria",2019-07-31,2020-03-26,Deidrie Alexandria,Deidrie,Alexandria,unknown,female
138,SC156515,"CURLE, Tolla Joanne",2015-10-02,,Tolla Joanne,Tolla,Joanne,unknown,female


In [25]:
officers_18_19[officers_18_19['date_appointed']=='None']

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,gender_first_name,gender_middle_name
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,male,unknown
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,male,male
6,SC016876,"BARGH, Frederick Charles",,1990-06-13,Frederick Charles,Frederick,Charles,male,male
7,SC016876,"O'TOOLE, Moira",,2015-05-11,Moira,Moira,,female,unknown
8,SC016876,"CLARK, David Robertson",,1995-01-19,David Robertson,David,Robertson,male,male
...,...,...,...,...,...,...,...,...,...
178347,00243883,"LEITCH, Alexander Park",,1991-12-10,Alexander Park,Alexander,Park,male,unknown
178407,01032611,"VAN DER LAAN, Arnoldus Theodorus Maria",,,Arnoldus Theodorus Maria,Arnoldus,Theodorus,unknown,male
178409,01032611,"MCDONALD, Peter John",,2019-05-31,Peter John,Peter,John,male,male
178418,01032611,"STEVERS-VAN DER LAAN, Johanna Clazina Maria",,2016-06-30,Johanna Clazina Maria,Johanna,Clazina,female,female


<font color='red'>

#### This left me with around 5,000 unknown values and about 600 listed as 'andy'. I wasn't quite sure what the 'andy' items were, so I went through each one manually; this is detailed further below. I decided to remove the unknown values.

#### I also wanted to remove officers appointed after 2019, as this was after the gender pay gap data period. I therefore removed all lines where the date appointed was in 2020 or 2021. 
</font>

In [27]:
officers_18_19['after_2019'] = 0
for i, line in enumerate(officers_18_19['date_appointed']):
    if '2020' in line:
        officers_18_19['after_2019'][i] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19['after_2019'][i] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [28]:
for i, line in enumerate(officers_18_19['date_appointed']):
    if '2021' in line:
        officers_18_19['after_2019'][i] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19['after_2019'][i] = np.nan


In [30]:
officers_18_19.dropna(subset=['after_2019'], inplace=True)
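The loops above pair a positional counter from `enumerate` with a label-based lookup. The two only line up when the index runs 0, 1, 2, ... without gaps, which may no longer be the case after `drop_duplicates` and `dropna`. A boolean mask avoids both this and the chained-assignment warning. The sketch below uses an invented four-row frame and assumes the dates are strings that start with the year, such as '2020-03-30' or 'None'.

```python
import pandas as pd

# Invented sample with a deliberately non-contiguous index
officers = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "date_appointed": ["2015-05-11", "2020-03-30", "None", "2021-01-04"],
    },
    index=[0, 15, 18, 28],
)

# Assumption: the first four characters hold the year
years = officers["date_appointed"].str[:4]
appointed_late = years.isin(["2020", "2021"])

# Keep everything that was not appointed in 2020 or 2021
kept = officers[~appointed_late].copy()
print(kept)
```

With this approach the helper column 'after_2019' is not needed, and rows with no appointment date are kept.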

In [31]:
officers_18_19.gender_first_name[officers_18_19.gender_first_name == "unknown"] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  officers_18_19.gender_first_name[officers_18_19.gender_first_name == "unknown"] = np.nan
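The warning above comes from chained indexing (selecting a column, then assigning into a filtered part of it). The pandas documentation linked in the warning recommends a single `.loc` call instead. A minimal sketch on an invented frame:

```python
import numpy as np
import pandas as pd

# Invented sample of gender-guesser labels
df = pd.DataFrame({"gender_first_name": ["male", "unknown", "andy", "female"]})

# One .loc call: row selector first, column name second
df.loc[df["gender_first_name"] == "unknown", "gender_first_name"] = np.nan

print(df["gender_first_name"].value_counts())
```

The same pattern would apply to the other chained assignments in this notebook.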


In [32]:
officers_18_19.gender_first_name.value_counts()

male             125151
female            26499
mostly_male        2271
mostly_female      2245
andy                607
Name: gender_first_name, dtype: int64

In [38]:
# Save to a CSV file
# officers_18_19.to_csv('detailed_officers_18_19', index=False)

detailed_officers_18_19 = pd.read_csv('detailed_officers_18_19')

detailed_officers_18_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161656 entries, 0 to 161655
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   company_no          161656 non-null  object 
 1   name                161656 non-null  object 
 2   date_appointed      161656 non-null  object 
 3   date_resigned       161656 non-null  object 
 4   test                161656 non-null  object 
 5   first_name          161654 non-null  object 
 6   middle_name_1       118000 non-null  object 
 7   gender_first_name   156773 non-null  object 
 8   gender_middle_name  161656 non-null  object 
 9   after_2019          161656 non-null  float64
dtypes: float64(1), object(9)
memory usage: 12.3+ MB


In [None]:
# detailed_officers_18_19.dropna(subset=['first_name'], inplace=True)
# detailed_officers_18_19.info()

In [None]:
# Could create 2 columns or just dummify manual gender below:
# detailed_officers_18_19['manual_female'] = 0
# detailed_officers_18_19['manual_male'] = 0

In [None]:
# just to check it worked 
detailed_officers_18_19['manual_female'].sum()

<font color='red'>

#### To deal with the roughly 600 names listed as 'andy', I iterated through each row and entered 'm' or 'f' after googling the individual to determine whether they were male or female. I marked the individuals I was unable to classify as NA. I used these entries to generate a column with the manual gender input, with the NA entries set to missing values. 
</font>

In [39]:
# use iterrows to loop through the rows and then print out the row name, input the replacement (m or f) to go
# into the manual gender line. 
# This method will give you the index of the line and the manual gender column - you can use this to join the data 
# together (https://stackoverflow.com/questions/40468069/merge-two-dataframes-by-index) then dummify the column. 
manual_gender = []
index_manual = []
for index, row in detailed_officers_18_19.iterrows():
    if row['gender_first_name']=='andy':
        print(row['name'])
        manual = input('male or female?')
        index_manual.append(index)
        manual_gender.append(manual)
#     detailed_officers_18_19.name[detailed_officers_18_19['gender_first_name']=='andy']

NOVAKOVIC, Novica
male or female?m
NOVAKOVIC, Novica
male or female?m
COLE, Pat
male or female?f
RYDER, Dominique Michelle
male or female?f
OOI, Shu Mei
male or female?f
BALE, Lesley Christine
male or female?f
REYNOLDS, Lesley
male or female?f
BROWN, Lesley Ann
male or female?f
KAY, Lesley
male or female?f
PARK, Jong An
male or female?m
ALLISON, Calum
male or female?m
HAIG, Lesley Jane, Professor
male or female?f
VERHAGEN, Adri
male or female?m
HARADA, Jo
male or female?f
FAN, Li
male or female?f
GUO, Jing
male or female?f
HARRIS, Lesley Ann
male or female?f
BERTONCINI, Dominique
male or female?f
CAI, Yuan, Finance Manager
male or female?m
COWLEY, Lesley Ruth
male or female?f
BROWN, Calum John Mcdowall
male or female?m
DENG, Ying
male or female?f
JAP, Chee Miau
male or female?m
GRAEME, Lesley Joyce
male or female?f
KNOX, Lesley Mary Samuel
male or female?f
DAVIES, Ceri Thomas
male or female?m
GRAEME, Lesley Joyce
male or female?f
LONGSTONE, Lesley Carol
male or female?f
WILSON, Averil 

LIBOYI, Jackie Awinja
male or female?f
LANGRIDGE, Hui Min
male or female?NA
TING, Hui Tzu Katja
male or female?NA
HASKINS, Lesley Erica
male or female?f
MACDONAGH, Lesley Anne
male or female?f
YAMAMOTO, Soichi
male or female?NA
POON, Chee Wah
male or female?NA
LEE, Sang Hoon David
male or female?m
WILSON, Siân
male or female?NA
DELSOL, Ashton
male or female?NA
CHOE, Peng Sum
male or female?NA
DAVIES, Wyn Meredith
male or female?f
BRADBURY, Ashton Charles
male or female?m
CHEN, Yun
male or female?NA
ZHANG, Wei
male or female?NA
LUO, Gang
male or female?NA
SUN, Ming
male or female?NA
DENG, Tao
male or female?NA
LIU, Kang
male or female?NA
LUO, Gang
male or female?NA
ZHANG, Hui
male or female?NA
HSIEH, An-Ping
male or female?NA
CORBETT, Lesley Jane
male or female?f
CHENG, Chun Fun Clemence
male or female?NA
FOK, Kin Ning Canning
male or female?NA
COOPER, Calum
male or female?m
GALVIN, Lesley
male or female?NA
HOU, Qian
male or female?NA
WU, Gang
male or female?NA
BATTY, Lesley Anne
male o

DAVIS, Lesley Ann
male or female?f
WONG, Choon Wah
male or female?NA
ABBOTT, Chai Ming
male or female?NA
ABBOTT, Chai Ming
male or female?NA
KUOK, Meng Xiong
male or female?NA
TEO, Ching Leun
male or female?NA
BRAK, Anthonie Jacob Carel
male or female?m
CAPEL, Aubrey John
male or female?NA
LIU, Chao
male or female?NA
ZIERING, Sigi, Dr
male or female?NA
SHAH, Bharat Kumar
male or female?m
SHAH, Bharat Kumar Hansraj Devraj
male or female?m
FUNG, Wai -Yan
male or female?NA
XIONG, Xing
male or female?NA
PATERSON, Calum Macdonald
male or female?NA
TANG, Wai Foon
male or female?NA
QUAYLE, Huan
male or female?NA
DRAY, Lesley Jane
male or female?f
HOWELLS, Ceri
male or female?NA
OOI, Shu Mei
male or female?NA
GLOVER, Lesley Anne, Professor Dame
male or female?f
GLOVER, Lesley Anne, Professor Dame
male or female?f
GLOVER, Lesley Anne, Professor Dame
male or female?f
LAW, Kin Fat
male or female?NA
SMITH, Innes
male or female?NA
SMITH, Innes
male or female?NA
BEHARRELL, Lesley
male or female?NA
B

In [40]:
manual_gender

['m',
 'm',
 'f',
 'f',
 'f',
 'f',
 'f',
 'f',
 'f',
 'm',
 'm',
 'f',
 'm',
 'f',
 'f',
 'f',
 'f',
 'f',
 'm',
 'f',
 'm',
 'f',
 'm',
 'f',
 'f',
 'm',
 'f',
 'f',
 'f',
 'f',
 'm',
 'm',
 'm',
 'f',
 'm',
 'm',
 'm',
 'm',
 'm',
 'm',
 'm',
 'f',
 'f',
 'f',
 'NA',
 'NA',
 'NA',
 'NA',
 'm',
 'f',
 'f',
 'f',
 'f',
 'NA',
 'f',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'm',
 'm',
 'm',
 'm',
 'f',
 'm',
 'm',
 'm',
 'f',
 'NA',
 'm',
 'm',
 'm',
 'NA',
 'f',
 'f',
 'm',
 'm',
 'f',
 'f',
 'f',
 'NA',
 'NA',
 'NA',
 'f',
 'm',
 'f',
 'NA',
 'NA',
 'f',
 'f',
 'NA',
 'NA',
 'NA',
 'f',
 'm',
 'm',
 'f',
 'f',
 'f',
 'NA',
 'NA',
 'f',
 'NA',
 'f',
 'NA',
 'm',
 'm',
 'f',
 'NA',
 'NA',
 'NA',
 'm',
 'NA',
 'm',
 'NA',
 'NA',
 'NA',
 'f',
 'm',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'm',
 'm',
 'NA',
 'm',
 'NA',
 'f',
 'f',
 'NA',
 'm',
 'm',
 'm',
 'm',
 'm',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'NA',
 'f',
 'f',
 'NA',
 'f',
 'NA',
 'm',
 'm',
 'f',
 'NA',
 'f',


In [41]:
index_manual

[761,
 780,
 923,
 1111,
 1251,
 1257,
 2103,
 2202,
 2297,
 2346,
 2666,
 3163,
 3743,
 4213,
 4360,
 4361,
 4433,
 4475,
 4480,
 4688,
 5461,
 5604,
 5713,
 5781,
 5863,
 5984,
 6179,
 6659,
 8025,
 8751,
 11173,
 11813,
 12275,
 12445,
 12717,
 12822,
 12823,
 12843,
 13118,
 14066,
 14078,
 14232,
 14419,
 14423,
 14424,
 14426,
 14438,
 14792,
 15863,
 15954,
 16718,
 16728,
 17086,
 17726,
 17850,
 17986,
 17987,
 17992,
 17996,
 18017,
 18157,
 18402,
 18403,
 18404,
 18695,
 18723,
 18775,
 18916,
 18986,
 20263,
 21696,
 21837,
 21892,
 21990,
 22198,
 22325,
 22767,
 23227,
 23230,
 23391,
 23552,
 23976,
 24631,
 24745,
 25336,
 25350,
 25383,
 25402,
 25937,
 27000,
 27083,
 27346,
 27588,
 27781,
 28154,
 29012,
 29109,
 29110,
 29941,
 30480,
 30588,
 30616,
 31008,
 31294,
 31632,
 31695,
 31712,
 32129,
 33151,
 33284,
 33340,
 33361,
 34514,
 34972,
 35040,
 35252,
 35581,
 35582,
 35583,
 35649,
 36447,
 37969,
 38624,
 38810,
 39228,
 39412,
 39420,
 39624,
 39857,
 

In [65]:
andy_gender = pd.DataFrame(index=index_manual)
andy_gender['manual_gender'] = manual_gender
andy_gender.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 607 entries, 761 to 161638
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   manual_gender  607 non-null    object
dtypes: object(1)
memory usage: 9.5+ KB


In [71]:
andy_gender['manual_gender'][andy_gender['manual_gender'] == 'NA'] = np.nan

In [72]:
andy_gender.value_counts()

manual_gender
female           194
male             105
dtype: int64
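The cells shown here do not include the step that turns the typed 'm' and 'f' answers into the 'female' and 'male' labels seen in the counts above. One way it could be done, purely as an assumption about that missing step, is `Series.map` with a dictionary. Any answer not in the dictionary, such as 'NA', becomes NaN automatically.

```python
import pandas as pd

# Invented answers keyed by the row index they belong to
answers = pd.Series(["m", "f", "NA", "f"], index=[761, 780, 923, 1111])

# Values missing from the mapping (here 'NA') are returned as NaN
manual = answers.map({"m": "male", "f": "female"})

print(manual.value_counts())
```

`value_counts` ignores NaN by default, which is why only the two labels are listed.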

In [106]:
# andy_gender.to_csv('andy_gender_18_19')

In [73]:
detailed_officers_18_19_test = pd.concat([detailed_officers_18_19, andy_gender], axis=1)

In [74]:
detailed_officers_18_19_test

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,gender_first_name,gender_middle_name,after_2019,manual_gender
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson,female,male,0.0,
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard,male,male,0.0,
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,male,unknown,0.0,
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,male,male,0.0,
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan,male,unknown,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
161651,01032611,"HIGGINSON, Anthony Mitchell",2002-09-19,2020-12-31,Anthony Mitchell,Anthony,Mitchell,male,male,0.0,
161652,01032611,"HIGGINSON, Julie Elizabeth",2007-08-12,2020-12-31,Julie Elizabeth,Julie,Elizabeth,female,female,0.0,
161653,01032611,"LOTGERINK, Ronald Emmanuel Maria",1998-06-05,2018-07-26,Ronald Emmanuel Maria,Ronald,Emmanuel,male,male,0.0,
161654,01032611,"STEVERS-VAN DER LAAN, Johanna Clazina Maria",,2016-06-30,Johanna Clazina Maria,Johanna,Clazina,female,female,0.0,
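`pd.concat(..., axis=1)` lines rows up by index label, which is why the manual answers were stored together with their row index. The toy example below (invented data) shows that rows without a manual answer simply receive NaN.

```python
import pandas as pd

# Invented officers frame with a default 0..n-1 index
officers = pd.DataFrame({"name": ["X", "Y", "Z"]}, index=[0, 1, 2])

# A manual answer exists only for the row labelled 1
manual = pd.DataFrame({"manual_gender": ["female"]}, index=[1])

# axis=1 places the columns side by side, matching rows on their index labels
combined = pd.concat([officers, manual], axis=1)
print(combined)
```

This relies on the index of the main dataframe being the same one that was in place when the answers were recorded.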


<font color='red'>

#### I dummified the gender columns. Then, for officers whose first name was 'mostly female' and whose middle name was female, I changed the classification to female. Similarly, if the first name was 'mostly male' and the middle name was male, I changed the classification to male. 
    
#### I then combined all column data on gender to generate a count of male officers and female officers per company. This was then used to calculate the proportion of officers who were female. 
    
</font>

In [75]:
detailed_officers_18_19_test = pd.get_dummies(detailed_officers_18_19_test, columns = ['gender_first_name', 'gender_middle_name', 'manual_gender'])

In [76]:
detailed_officers_18_19_test.head()

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,after_2019,gender_first_name_andy,gender_first_name_female,...,gender_first_name_mostly_female,gender_first_name_mostly_male,gender_middle_name_andy,gender_middle_name_female,gender_middle_name_male,gender_middle_name_mostly_female,gender_middle_name_mostly_male,gender_middle_name_unknown,manual_gender_female,manual_gender_male
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson,0.0,0,1,...,0,0,0,0,1,0,0,0,0,0
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard,0.0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,0.0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [86]:
detailed_officers_18_19_test['middle_check_manual_female'] = 0
detailed_officers_18_19_test['middle_check_manual_male'] = 0

In [87]:
for i, line in enumerate(detailed_officers_18_19_test['gender_first_name_mostly_female']):
    if line == 1 and detailed_officers_18_19_test['gender_middle_name_female'][i] == 1:
        detailed_officers_18_19_test['middle_check_manual_female'][i] = 1
    else:
        detailed_officers_18_19_test['middle_check_manual_female'][i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['middle_check_manual_female'][i] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['middle_check_manual_female'][i] = 1


In [88]:
for i, line in enumerate(detailed_officers_18_19_test['gender_first_name_mostly_male']):
    if line == 1 and detailed_officers_18_19_test['gender_middle_name_male'][i] == 1:
        detailed_officers_18_19_test['middle_check_manual_male'][i] = 1
    else:
        detailed_officers_18_19_test['middle_check_manual_male'][i] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['middle_check_manual_male'][i] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['middle_check_manual_male'][i] = 1


In [89]:
detailed_officers_18_19_test['middle_check_manual_female'].value_counts()

0    160844
1       812
Name: middle_check_manual_female, dtype: int64

In [90]:
detailed_officers_18_19_test['middle_check_manual_male'].value_counts()

0    160598
1      1058
Name: middle_check_manual_male, dtype: int64
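The row-by-row loops above can be expressed as a single boolean expression on whole columns, which avoids the chained-assignment warnings and is much faster on a frame of this size. A sketch on invented data, using the same column names:

```python
import pandas as pd

# Invented dummy columns for three officers
df = pd.DataFrame({
    "gender_first_name_mostly_female": [1, 0, 1],
    "gender_middle_name_female": [1, 1, 0],
})

# 1 only where both conditions hold, 0 otherwise
df["middle_check_manual_female"] = (
    (df["gender_first_name_mostly_female"] == 1)
    & (df["gender_middle_name_female"] == 1)
).astype(int)

print(df)
```

The male check would follow the same pattern with the 'mostly_male' and 'male' columns.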

In [91]:
detailed_officers_18_19_test.head()

Unnamed: 0,company_no,name,date_appointed,date_resigned,test,first_name,middle_name_1,after_2019,gender_first_name_andy,gender_first_name_female,...,gender_middle_name_andy,gender_middle_name_female,gender_middle_name_male,gender_middle_name_mostly_female,gender_middle_name_mostly_male,gender_middle_name_unknown,manual_gender_female,manual_gender_male,middle_check_manual_female,middle_check_manual_male
0,SC016876,"VALENTINE, Dianne Davidson",2015-05-11,,Dianne Davidson,Dianne,Davidson,0.0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,SC016876,"CLUBB, Malcolm Richard",2018-05-09,,Malcolm Richard,Malcolm,Richard,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,SC016876,"CUMMING, Alexander Coulson",,,Alexander Coulson,Alexander,Coulson,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,SC016876,"CUMMING, Charles Nigel Coulson",,,Charles Nigel Coulson,Charles,Nigel,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,SC016876,"GALT, Philip Callaghan",2014-08-19,,Philip Callaghan,Philip,Callaghan,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [93]:
detailed_officers_18_19_test['female_final'] = detailed_officers_18_19_test.gender_first_name_female + detailed_officers_18_19_test.manual_gender_female + detailed_officers_18_19_test.middle_check_manual_female

In [94]:
detailed_officers_18_19_test['male_final'] = detailed_officers_18_19_test.gender_first_name_male + detailed_officers_18_19_test.manual_gender_male + detailed_officers_18_19_test.middle_check_manual_male

In [98]:
detailed_officers_18_19_test.female_final.value_counts()

0    134151
1     27505
Name: female_final, dtype: int64

In [107]:
detailed_officers_18_19_test[detailed_officers_18_19_test.female_final == 0 & detailed_officers_18_19_test.male_final == 0]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
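The error above is caused by operator precedence: `&` binds more tightly than `==`, so Python reads the condition as `female_final == (0 & male_final) == 0`, a chained comparison that needs a single True or False from a whole Series. Wrapping each comparison in parentheses gives the intended filter. A sketch on invented data:

```python
import pandas as pd

# Invented sample: the third officer is neither male nor female
df = pd.DataFrame({"female_final": [1, 0, 0], "male_final": [0, 1, 0]})

# Each comparison in its own parentheses, combined with &
unclassified = df[(df["female_final"] == 0) & (df["male_final"] == 0)]
print(unclassified)
```

Inverting the mask with `~` would keep only the classified rows, so the 'filter' column and loop below are one way, but not the only way, to get the same result.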

In [108]:
detailed_officers_18_19_test['filter'] = 0

In [116]:
for i, line in enumerate(detailed_officers_18_19_test['female_final']):
    if line == 1 or detailed_officers_18_19_test['male_final'][i] == 1:
        detailed_officers_18_19_test['filter'][i] = 1
    else:
        detailed_officers_18_19_test['filter'][i] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['filter'][i] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_officers_18_19_test['filter'][i] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [117]:
detailed_officers_18_19_test['filter'].value_counts()

1.0    153819
Name: filter, dtype: int64

In [118]:
detailed_officers_18_19_test.dropna(subset=['filter'], inplace=True)

In [154]:
detailed_officers_18_19_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 153819 entries, 0 to 161655
Data columns (total 26 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   company_no                        153819 non-null  object 
 1   name                              153819 non-null  object 
 2   date_appointed                    153819 non-null  object 
 3   date_resigned                     153819 non-null  object 
 4   test                              153819 non-null  object 
 5   first_name                        153819 non-null  object 
 6   middle_name_1                     114432 non-null  object 
 7   after_2019                        153819 non-null  float64
 8   gender_first_name_andy            153819 non-null  uint8  
 9   gender_first_name_female          153819 non-null  uint8  
 10  gender_first_name_male            153819 non-null  uint8  
 11  gender_first_name_mostly_female   153819 non-null  u

In [224]:
grouped_by_company = detailed_officers_18_19_test.groupby('company_no')
final_gendercount_18_19 = grouped_by_company.sum()

In [225]:
final_gendercount_18_19.columns

Index(['after_2019', 'gender_first_name_andy', 'gender_first_name_female',
       'gender_first_name_male', 'gender_first_name_mostly_female',
       'gender_first_name_mostly_male', 'gender_middle_name_andy',
       'gender_middle_name_female', 'gender_middle_name_male',
       'gender_middle_name_mostly_female', 'gender_middle_name_mostly_male',
       'gender_middle_name_unknown', 'manual_gender_female',
       'manual_gender_male', 'middle_check_manual_female',
       'middle_check_manual_male', 'female_final', 'male_final', 'filter'],
      dtype='object')

In [226]:
final_gendercount_18_19.drop(columns = ['after_2019', 'gender_first_name_andy',
       'gender_first_name_female', 'gender_first_name_male',
       'gender_first_name_mostly_female', 'gender_first_name_mostly_male',
       'gender_middle_name_andy', 'gender_middle_name_female',
       'gender_middle_name_male', 'gender_middle_name_mostly_female',
       'gender_middle_name_mostly_male', 'gender_middle_name_unknown',
       'manual_gender_female', 'manual_gender_male',
       'middle_check_manual_female', 'middle_check_manual_male', 'filter'], inplace=True)

In [227]:
final_gendercount_18_19['percent_female_officer'] = final_gendercount_18_19.female_final/(final_gendercount_18_19.female_final+final_gendercount_18_19.male_final)
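As a worked example of the calculation above, the sketch below uses five invented officers in two invented companies. Selecting the two final columns before summing is an assumption that would make the later `drop` unnecessary. Note that, despite its name, 'percent_female_officer' is a proportion between 0 and 1.

```python
import pandas as pd

# Invented data: company A1 has 1 woman and 2 men, B2 has 2 women
officers = pd.DataFrame({
    "company_no": ["A1", "A1", "A1", "B2", "B2"],
    "female_final": [1, 0, 0, 1, 1],
    "male_final": [0, 1, 1, 0, 0],
})

# Sum only the two columns needed for the proportion
counts = officers.groupby("company_no")[["female_final", "male_final"]].sum()

counts["percent_female_officer"] = counts["female_final"] / (
    counts["female_final"] + counts["male_final"]
)
print(counts)
```

For A1 this gives 1 / (1 + 2), about 0.33, and for B2 it gives 2 / (2 + 0) = 1.0.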

In [228]:
# final_gendercount_18_19.to_csv('final_gendercount_18_19', index='company_no')

In [229]:
final_gendercount_18_19 = pd.read_csv('final_gendercount_18_19')

In [230]:
final_gendercount_18_19.set_index(keys = 'company_no', inplace=True)

In [233]:
final_gendercount_18_19.describe()

Unnamed: 0,female_final,male_final,percent_female_officer
count,7810.0,7810.0,7810.0
mean,3.521767,16.173367,0.187136
std,3.484793,8.873243,0.163375
min,0.0,0.0,0.0
25%,1.0,9.0,0.066667
50%,3.0,16.0,0.151515
75%,5.0,24.0,0.266667
max,24.0,35.0,1.0


In [246]:
final_gendercount_18_19[final_gendercount_18_19['percent_female_officer'] >= 0.5]

Unnamed: 0_level_0,female_final,male_final,percent_female_officer
company_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
00006400,23,12,0.657143
00231824,18,13,0.580645
00232081,17,14,0.548387
00353341,17,13,0.566667
00430051,5,3,0.625000
...,...,...,...
SC402097,3,2,0.600000
SC412450,6,3,0.666667
SC421599,1,1,0.500000
SC423096,1,1,0.500000


In [None]:
# Interesting that only around 500 of the roughly 7,800 companies have 50% or more female officers, whereas around 7,300 have more than 50% male officers.