# Analysis of Gender Bias in Tech Job Postings

### CSE 184: Fall 2019 - Nikhil Dodd, Jerico Factor, Kyle O’Brien, Bryan Jimenez

## Motivation

It's common knowledge that a majority of software engineers identify as male. Despite gains in gender diversity across other STEM professions, software engineering remains largely stagnant. There is a myriad of factors influencing this reality.

We're interested in exploring the role the wording of job postings play in potentially discouraging non-male candidates from applying. This project is inspired by Textio. ----------------FIX

## Questions

1. Out of X randomly collected software engineering jobs, what percentage have gender-specific pronouns?
2. What’s the percentage breakdown by state/region?
3. What are the most common keywords that encourage/discourage applicants?
4. Does the company size or location influence a company's inclusion initiatives? ----------------FIX

## Data Source

As our data source, we'll be using hte LinkedIn job search tool. ----------------FIX

### Import Dependencies 

In [13]:
import numpy as np
import pandas as pd
import nltk
from matplotlib import pyplot as plt
import csv
import requests
import collections
from bs4 import BeautifulSoup
import geopandas as gpd
import pandas_bokeh
from shapely.geometry import Point, Polygon
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib.pyplot as plt
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from math import pi
from bokeh.plotting import figure, output_file, show
import plotly
import plotly.figure_factory as ff
import plotly.graph_objects as go
import numpy as np
import pandas as pd
import matplotlib as plt
import csv
import requests
import collections
from bs4 import BeautifulSoup
import re

%matplotlib inline
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jrko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Corpus

----- Reasoning and sources

In [16]:
gender_pronouns = [
    "he",
    "him",
    "she",
    "her",
    "his",
    "hers"
]

masculine_themed_wording = [
    "competitive",
    "dominate",
    "leader",
    "rock star",
    "rockstar",
    "guru",
    "ninja",
    "hacker",
    "superhero",
    "prove themselves",
    "analyze",
    "determine",
    "crush it",
    "world class",
    "superior",
    "ambitious",
    "aggressive",
    "leader", 
]

states = [
    'AL',
    'AK',
    'AZ',
    'AR',
    'CA',
    'CO',
    'CT',
    'DE',
    'DC',
    'FL',
    'GA',
    'HI',
    'ID',
    'IL',
    'IN',
    'IA',
    'KS',
    'KY',
    'LA',
    'ME',
    'MD',
    'MA',
    'MI',
    'MN',
    'MS',
    'MO',
    'MT',
    'NE',
    'NV',
    'NH',
    'NJ',
    'NM',
    'NY',
    'NC',
    'ND',
    'OH',
    'OK',
    'OR',
    'PA',
    'RI',
    'SC',
    'SD',
    'TN',
    'TX',
    'UT',
    'VT',
    'VA',
    'WA',
    'WV',
    'WI',
    'WY' 
]

job_positions = [
    "developer",
    "engineer",
    "software engineer",
    "software developer",
    "product manager",
    "manager",
    "product developer",
    "tech lead",
    "lead",
    "analyst",
    "head",
    "designer"
]

common_job_roles = [
    "engineer", 
    "designer", 
    "manager",
    "analyst",
    "scientist",
    "director"
    "chief officer",
    "sales",
    "marketing",
]



## Load Dataset
Note: Breifly explain data collection process

In [14]:
all_jobs = pd.read_csv("../data/derived_job_data.csv")
all_jobs.head(15)

Unnamed: 0,job_title,company,location,description
0,MicroStrategy Developers HERE IN McLean VA (Fa...,Advansys Inc,VA,HelloPlease find the below requirement and do ...
1,Security Architect - Palo Alto Firewalls,Alagen,AZ,Alagen has combined over 20 years of recruitin...
2,Dynamics AX Senior System/Security Admin/Globa...,ConsultantFriends.com,IL,MS Dynamics System and Security Admin Global I...
3,IAM Consultant,Collabera,IL,Job ID5457_IAM_ILJob TitleIAM ConsultantJob Lo...
4,.Net Developer,Network Objects Inc.,CT,Net DeveloperHartford CTFULL TIMERequirements5...
5,DevOps Windows/ Linux Engineer,Associated Press,NJ,Position Application EngineerJob SummaryProvid...
6,Solution Architect,Caprus IT Inc.,MI,Hi We do have an urgent requirement Job Title...
7,Immediate openings for Physical Design Enginee...,Calsoft Labs,CA,Job Overview Floor planning PR timing closure...
8,Service Delivery Manager,Matlen Silver,NJ,MUST HAVE ENCRYPTION EXPERIENCE Looking for a ...
9,Management Consultant,Windsor Partners,CA,Duties and ResponsibilitiesConsultants play a ...


### Collect Gendered Jobs

In [15]:
gendered_jobs = pd.DataFrame()

for gendered_word in (gender_pronouns + masculine_themed_wording):
    results = all_jobs.where(all_jobs["description"].str.find(gendered_word) != -1)
    gendered_jobs = pd.concat([gendered_jobs, results])
    
gendered_jobs = gendered_jobs.drop_duplicates(keep=False)
gendered_jobs.shape[0]

NameError: name 'gender_pronouns' is not defined

## Question 1: What percentage has gender pronouns in the job posting?

Here we will find which job postings contain the actual pronouns in the job listings. We check if they are more biased towards masculine gender pronouns. 

In [17]:
def phraseCount(phrase, description): 
    #this counts the amount of times phrase shows up in description
    s = '\\b(' + phrase + ')\\b'
    regex = re.compile(s,re.IGNORECASE)
    return len(re.findall(regex, description))

In [26]:
def findNumberOfJobsWithPronouns(data_set): 
    mpronouns = ['he', 'him' ,'his']
    fpronouns = ['she', 'her', 'hers']
    pronounCounter = 0
    mascPronounCounter = 0
    femPronounCounter = 0
    y = gendered_jobs["job_title"].where(data_set["description"].str.find(role) > -1)
#     for index, row in data_set.iterrows():
#         mcount = 0
#         fcount = 0
#         for word in mpronouns: 
#             mcount += phraseCount(word, row.get('description'))
#         for word in fpronouns: 
#             fcount += phraseCount(word, row.get('description'))
#         if mcount + fcount > 0:
#             pronounCounter +=1 
#         if mcount > fcount: 
#             mascPronounCounter += 1
#         if mcount < fcount: 
#             femPronounCounter += 1

    print('Number of jobs that have gendered pronouns:', pronounCounter, " percentage: ", pronounCounter/data_set.shape[0] * 100)
    print('More masculine pronouns:', mascPronounCounter," percentage: ", mascPronounCounter/pronounCounter * 100)
    print('More feminine pronouns:', femPronounCounter," percentage: ", femPronounCounter/pronounCounter * 100)

In [25]:
findNumberOfJobsWithPronouns(all_jobs)

Number of jobs that have gendered pronouns: 337  percentage:  1.6792904125971695
More masculine pronouns: 123  percentage:  36.49851632047478
More feminine pronouns: 80  percentage:  23.738872403560833


We find that about 1.7 percent of the job listings in our data set have those gendered phrases. But of those listings, 36% contain more masculine pronouns while only 24$ contain more feminine pronouns.

## Question 3: What are the most common roles in tech that have gendered job postings?

Note: Explination and Process

In [115]:
common_roles_occurences = [
    "developer",
    "frontend",
    "backend",
    "UX",
    "engineer",
    "product",
    "manager",
    "lead",
    "analyst",
    "scientist",
    "designer",
    "sales",
    "marketing",
    "accountant",
    "devops"
]

jobs_captured = 0

for role in common_roles_occurences:
    x = all_jobs["job_title"].where(all_jobs["description"].str.find(role) > -1)
    y = gendered_jobs["job_title"].where(gendered_jobs["description"].str.find(role) > -1)
    count = 0
    desc_arr2 = [] # not sure if this array is needed yet, search for the positions

    for item, frame in y.iteritems():
        if pd.notnull(frame):
            count+=1
            desc_arr2.append(item)
            
    jobs_captured += count
    print(f"Uses the word(s) {role}: ", count)
    print(f"percentage {(count / all_jobs.shape[0]) * 100}\n\n")
    
print(jobs_captured)
# NEEEEDS VIZ

Uses the word(s) developer:  260
percentage 1.295594977077935


Uses the word(s) frontend:  65
percentage 0.32389874426948373


Uses the word(s) backend:  92
percentage 0.45844129958142316


Uses the word(s) UX:  82
percentage 0.4086107235399641


Uses the word(s) engineer:  339
percentage 1.6892565278054616


Uses the word(s) product:  559
percentage 2.7855292007175603


Uses the word(s) manager:  97
percentage 0.48335658760215267


Uses the word(s) lead:  290
percentage 1.4450867052023122


Uses the word(s) analyst:  44
percentage 0.21925453458241975


Uses the word(s) scientist:  8
percentage 0.039864460833167234


Uses the word(s) designer:  44
percentage 0.21925453458241975


Uses the word(s) sales:  59
percentage 0.2940003986446083


Uses the word(s) marketing:  39
percentage 0.19433924656169027


Uses the word(s) accountant:  0
percentage 0.0


Uses the word(s) devops:  3
percentage 0.014949172812437711


1981


## Question 4: Out of the gendered words/phrases, which are the most common? 