# Breaking the Glass Firewall: Why Women Leave Tech Careers and Why Those Who Stay Don’t Advance

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
📝 <!-- Answer Below -->

This project explores why women in technology are more likely to experience limited career progression and leave the industry at higher rates than their male counterparts. Understanding these differences in career trajectories between men and women is essential for promoting fairness in the workplace, reducing costs associated with turnover, and improving overall organizational success by retaining diverse talent.

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
📝 <!-- Answer Below -->

How do promotion and retention rates for women compare to those for men at similar career stages in the tech industry?

## What would an answer look like?
*What is your hypothesized answer to your question?*
📝 <!-- Answer Below -->

**Hypothesis:**  
Women in technology experience lower promotion rates and leave the industry at higher rates compared to their male counterparts, even when they have similar qualifications and experience. This disparity is driven by factors that disproportionately affect women, including a higher likelihood of layoffs and gender-based discrimination. As a result, women are less likely to reach senior positions or remain in the tech industry long-term.


## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
📝 <!-- Answer Below -->

**Data Sources**  
*HackerRank 2018 Developer Survey (Kaggle):* Published in 2018, based on 2017 survey responses.  
*Pew Research Center 2017 STEM Survey (zip file):* Based on 2017 responses.  
*NSF's National Center for Science and Engineering Statistics (Web-Scraped Tables):* Spans several years, ending in 2019, with specific data points from 2017 for comparison.

**Relating the Data**  
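The three datasets share no respondent-level identifier, so they cannot be joined row by row. Instead, each dataset will be aggregated to the group level and the resulting summaries will be linked on the shared survey year (2017) and gender, which together act as a composite key.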
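As a minimal sketch of that linking step, the snippet below aggregates one respondent-level table by year and gender and merges it with a second, already aggregated table. The DataFrame names, column names, and every value are hypothetical placeholders, not figures from the actual surveys.

```python
import pandas as pd

# Hypothetical, made-up rows standing in for two of the sources;
# the column names and values are illustrative only, not real survey results.
survey_a = pd.DataFrame({
    "year": [2017, 2017, 2017],
    "gender": ["Female", "Male", "Male"],
    "is_senior": [0, 1, 0],
})
survey_b = pd.DataFrame({
    "year": [2017, 2017],
    "gender": ["Female", "Male"],
    "placeholder_metric": [1.0, 2.0],
})

# Aggregate respondent-level rows to one row per (year, gender) group.
summary_a = survey_a.groupby(["year", "gender"], as_index=False).agg(
    respondents=("is_senior", "size"),
    senior_share=("is_senior", "mean"),
)

# Join the group-level summaries on the composite key (year, gender).
# validate raises an error if either side has duplicate keys.
combined = summary_a.merge(
    survey_b, on=["year", "gender"], how="inner", validate="one_to_one"
)
print(combined)
```

Merging after aggregation keeps the comparison at the group level, which is the only level at which independent surveys can be aligned.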

## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
📝 <!-- Start Discussing the project here; you can add as many code cells as you need -->

After joining the data from the various sources, I will generate specific visualizations to confirm or reject my hypothesis.

**Planned Visualizations to Support the Hypothesis:**
1. *Line Chart: Gender and Age Distribution in Technology*  
A line chart will display the number of employees in the technology sector, separated by gender and age group. The x-axis will represent the age groups (e.g., 20-25, 26-30, etc.), and the y-axis will show the number of employees. If women leave the industry earlier than men, the chart should show their counts dropping off at a younger age group, which would help visualize retention issues.

2. *100% Stacked Column Chart: Men vs Women in Different Tech Roles*  
A 100% stacked column chart will show the proportional representation of men and women across different tech roles (e.g., Junior Developers, Senior Developers, Managers, Executives). Each column will represent a different role, and the stacked segments will show the gender distribution within that role as a percentage. This will show whether, and by how much, women's share shrinks in higher-level positions.

3. *Side-by-Side Column Chart: Workplace Concerns for Men vs Women*  
A side-by-side column chart will compare the key workplace concerns between men and women, such as issues with career progression, work-life balance, pay disparity, and workplace discrimination. Each concern will have two columns—one representing men and one representing women. This will make it easy to see where concerns overlap and where significant differences exist between the genders.
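The data behind the 100% stacked column chart could be prepared roughly as follows. This is a sketch only: the `respondents` table, its column names, and its rows are invented for illustration and do not come from any of the three sources.

```python
import pandas as pd

# Made-up respondent rows for illustration; not real survey data.
respondents = pd.DataFrame({
    "role": ["Junior Developer", "Junior Developer", "Senior Developer",
             "Senior Developer", "Senior Developer", "Manager"],
    "gender": ["Female", "Male", "Female", "Male", "Male", "Male"],
})

# Count respondents per role and gender, then normalize each row so that
# every role sums to 100 (the shape a 100% stacked column chart needs).
role_shares = pd.crosstab(
    respondents["role"], respondents["gender"], normalize="index"
) * 100

print(role_shares.round(1))
```

With matplotlib installed, `role_shares.plot(kind="bar", stacked=True)` would draw the stacked columns from this table. The same `pd.crosstab` call without `normalize` yields the raw counts that the line chart and side-by-side chart would use.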

#### Package Imports

In [3]:
# Import packages
import os  # file paths and data subfolders
import re  # string cleanup for filenames
from urllib.request import urlretrieve
from zipfile import ZipFile

import opendatasets as od
import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Load environment variables from a local .env file
load_dotenv(override=True)


#### Import Dataset 1:

In [5]:
# import dataset from Kaggle using URL 
dataset_url = "https://www.kaggle.com/datasets/hackerrank/developer-survey-2018/data"
od.download(dataset_url, data_dir="./data")

Skipping, found downloaded files in "./data\developer-survey-2018" (use force=True to force download)


#### Convert dataset to a pandas dataframe and inspect data

In [6]:
# Define the data directory
data_dir = './data/developer-survey-2018'

# Read each CSV file into its own DataFrame with meaningful names
country_code_df = pd.read_csv(os.path.join(data_dir, 'Country-Code-Mapping.csv'))
dev_survey_codebook_df = pd.read_csv(os.path.join(data_dir, 'HackerRank-Developer-Survey-2018-Codebook.csv'))
dev_survey_numeric_mapping_df = pd.read_csv(os.path.join(data_dir, 'HackerRank-Developer-Survey-2018-Numeric-Mapping.csv'))
dev_survey_numeric_df = pd.read_csv(os.path.join(data_dir, 'HackerRank-Developer-Survey-2018-Numeric.csv'))
dev_survey_values_df = pd.read_csv(os.path.join(data_dir, 'HackerRank-Developer-Survey-2018-Values.csv'))

# Display the first 5 records from each DataFrame
display(country_code_df.head(5))
display(dev_survey_codebook_df.head(5))
display(dev_survey_numeric_mapping_df.head(5))
display(dev_survey_numeric_df.head(5))
display(dev_survey_values_df.head(5))



Unnamed: 0,Value,Label
0,4,Afghanistan
1,6,Albania
2,7,Algeria
3,8,American Samoa
4,9,Andorra


Unnamed: 0,Data Field,Survey Question,Notes
0,RespondentID,,Respondent ID
1,StartDate,,When did they start (date and time)
2,EndDate,,When did they end (date and time)
3,CountryNumeric2,,see Country-Code-Mapping.csv
4,q1AgeBeginCoding,At what age did you start coding,


Unnamed: 0,Data Field,Value,Label
0,q1AgeBeginCoding,1,5 - 10 years old
1,q1AgeBeginCoding,2,11 - 15 years old
2,q1AgeBeginCoding,3,16 - 20 years old
3,q1AgeBeginCoding,4,21 - 25 years old
4,q1AgeBeginCoding,5,26 - 30 years old


Unnamed: 0,RespondentID,StartDate,EndDate,CountryNumeric2,q1AgeBeginCoding,q2Age,q3Gender,q4Education,q0004_other,q5DegreeFocus,...,q30LearnCodeOther,q0030_other,q31Level3,q32RecommendHackerRank,q0032_other,q33HackerRankChallforJob,q34PositiveExp,q34IdealLengHackerRankTest,q0035_other,q36Level4
0,6464453728,10/19/17 11:51,10/20/17 12:05,148.0,3,3,2,3,,1,...,1,datacamp,1,1,,2,,#NULL!,,2
1,6478031510,10/26/17 6:18,10/26/17 7:49,164.0,3,4,1,7,,2,...,0,,1,1,,2,,#NULL!,,2
2,6464392829,10/19/17 10:44,10/19/17 10:56,98.0,2,2,2,3,,2,...,0,,1,1,,2,,#NULL!,,2
3,6481629912,10/27/17 1:51,10/27/17 2:05,43.0,2,2,1,5,,1,...,0,,1,1,,2,,#NULL!,,3
4,6488385057,10/31/17 11:46,10/31/17 11:59,,3,4,2,5,,0,...,1,Blogs/articles by industry leaders,1,1,,2,,#NULL!,,3


Unnamed: 0,RespondentID,StartDate,EndDate,CountryNumeric2,q1AgeBeginCoding,q2Age,q3Gender,q4Education,q0004_other,q5DegreeFocus,...,q30LearnCodeOther,q0030_other,q31Level3,q32RecommendHackerRank,q0032_other,q33HackerRankChallforJob,q34PositiveExp,q34IdealLengHackerRankTest,q0035_other,q36Level4
0,6464453728,10/19/17 11:51,10/20/17 12:05,South Korea,16 - 20 years old,18 - 24 years old,Female,Some college,,Computer Science,...,Other (please specify),datacamp,num%2 == 0,Yes,,No,,#NULL!,,Queue
1,6478031510,10/26/17 6:18,10/26/17 7:49,Ukraine,16 - 20 years old,25 - 34 years old,Male,"Post graduate degree (Masters, PhD)",,"Other STEM (science, technology, engineering, ...",...,,,num%2 == 0,Yes,,No,,#NULL!,,Queue
2,6464392829,10/19/17 10:44,10/19/17 10:56,Malaysia,11 - 15 years old,12 - 18 years old,Female,Some college,,"Other STEM (science, technology, engineering, ...",...,,,num%2 == 0,Yes,,No,,#NULL!,,Queue
3,6481629912,10/27/17 1:51,10/27/17 2:05,Curaçao,11 - 15 years old,12 - 18 years old,Male,College graduate,,Computer Science,...,,,num%2 == 0,Yes,,No,,#NULL!,,Hashmap
4,6488385057,10/31/17 11:46,10/31/17 11:59,,16 - 20 years old,25 - 34 years old,Female,College graduate,,,...,Other (please specify),Blogs/articles by industry leaders,num%2 == 0,Yes,,No,,#NULL!,,Hashmap
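The previews above contain the literal string `#NULL!` in several columns. One possible way to handle it, shown here as a sketch on an invented two-row CSV rather than the real files, is to tell `pd.read_csv` to treat that string as missing so that coded columns load as numbers.

```python
import io

import pandas as pd

# Tiny stand-in for a survey CSV; the rows are invented for illustration.
sample_csv = io.StringIO(
    "RespondentID,q2Age,q34IdealLengHackerRankTest\n"
    "1,3,#NULL!\n"
    "2,4,4\n"
)

# na_values tells pandas to read the placeholder string as NaN.
sample_df = pd.read_csv(sample_csv, na_values=["#NULL!"])

print(sample_df.isna().sum())
print(sample_df.dtypes)
```

Whether `#NULL!` should count as "question not shown" or "question skipped" is a judgment about the survey design, so this conversion is an assumption worth checking against the codebook.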


In [7]:
# Find United States in the country_code_df to use for filtering purposes

# Rename columns
country_code_df.rename(columns={'Value': 'country_code', 'Label': 'country'}, inplace=True)

# Filter for the United States
us_country_code_df = country_code_df[country_code_df['country'] == 'United States']

# Display the resulting DataFrame
display(us_country_code_df)

Unnamed: 0,country_code,country
0,167,United States


In [14]:
# Filter the DataFrame for United States questionnaire responses
us_dev_survey_numeric_df = dev_survey_numeric_df[dev_survey_numeric_df['CountryNumeric2'] == 167]

# Display the number of records
num_records_numeric = us_dev_survey_numeric_df.shape[0]
display(f"Number of records: {num_records_numeric}")

# Display the resulting DataFrame
display(us_dev_survey_numeric_df.head(5))

'Number of records: 4937'

Unnamed: 0,RespondentID,StartDate,EndDate,CountryNumeric2,q1AgeBeginCoding,q2Age,q3Gender,q4Education,q0004_other,q5DegreeFocus,...,q30LearnCodeOther,q0030_other,q31Level3,q32RecommendHackerRank,q0032_other,q33HackerRankChallforJob,q34PositiveExp,q34IdealLengHackerRankTest,q0035_other,q36Level4
5,6463843138,10/19/17 3:02,10/19/17 3:18,167.0,8,5,1,5,,1,...,1,SoloLearn,1,1,,2,,#NULL!,,2
6,6458326054,10/17/17 3:18,10/17/17 3:33,167.0,3,6,1,7,,1,...,0,,1,1,,1,1.0,4,,2
7,6467198274,10/21/17 8:55,10/21/17 9:06,167.0,3,3,1,5,,2,...,0,,1,1,,2,,#NULL!,,1
24,6460870080,10/18/17 1:18,10/18/17 1:25,167.0,3,3,1,5,,2,...,0,,1,1,,2,,#NULL!,,2
40,6479693460,10/26/17 7:10,10/26/17 7:20,167.0,2,6,1,5,,1,...,1,IRC,1,1,,2,,#NULL!,,2


In [20]:
# Reduce dataframe columns to only those relevant to supporting or disproving my hypothesis

# List of relevant columns to keep
columns_to_keep = [
    'RespondentID', 'q2Age', 'q3Gender', 'q10Industry',
    'q8JobLevel', 'q9CurrentRole',
    'q12JobCritPrefTechStack', 'q12JobCritCompMission',
    'q12JobCritCompCulture', 'q12JobCritWorkLifeBal',
    'q12JobCritCompensation', 'q12JobCritProximity',
    'q12JobCritPerks', 'q12JobCritSmartPeopleTeam',
    'q12JobCritImpactwithProduct', 'q12JobCritInterestProblems',
    'q12JobCritFundingandValuation', 'q12JobCritStability',
    'q12JobCritProfGrowth', 'q16HiringManager',
    'q17HirChaNoDiversCandidates', 'q20CandYearExp',
    'q20CandCompScienceDegree', 'q20CandCodingBootcamp',
    'q20CandSkillCert', 'q20CandHackerRankActivity',
    'q20CandOtherCodingCommAct', 'q20CandGithubPersProj',
    'q20CandOpenSourceContrib', 'q20CandHackathonPart',
    'q20CandPrevWorkExp', 'q20CandPrestigeDegree',
    'q20CandLinkInSkills', 'q20CandGithubPersProj2'
]

# Create a new DataFrame with only the selected columns
filtered_us_dev_survey_numeric_df = us_dev_survey_numeric_df[columns_to_keep]

# Display information about the DataFrame 
filtered_us_dev_survey_numeric_df.info()

# Display the resulting DataFrame
display(filtered_us_dev_survey_numeric_df.head(5))

<class 'pandas.core.frame.DataFrame'>
Index: 4937 entries, 5 to 9892
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   RespondentID                   4937 non-null   int64 
 1   q2Age                          4937 non-null   object
 2   q3Gender                       4937 non-null   object
 3   q10Industry                    4937 non-null   object
 4   q8JobLevel                     4937 non-null   int64 
 5   q9CurrentRole                  4937 non-null   object
 6   q12JobCritPrefTechStack        4937 non-null   int64 
 7   q12JobCritCompMission          4937 non-null   int64 
 8   q12JobCritCompCulture          4937 non-null   int64 
 9   q12JobCritWorkLifeBal          4937 non-null   int64 
 10  q12JobCritCompensation         4937 non-null   int64 
 11  q12JobCritProximity            4937 non-null   int64 
 12  q12JobCritPerks                4937 non-null   int64 
 13  q12JobCr

Unnamed: 0,RespondentID,q2Age,q3Gender,q10Industry,q8JobLevel,q9CurrentRole,q12JobCritPrefTechStack,q12JobCritCompMission,q12JobCritCompCulture,q12JobCritWorkLifeBal,...,q20CandSkillCert,q20CandHackerRankActivity,q20CandOtherCodingCommAct,q20CandGithubPersProj,q20CandOpenSourceContrib,q20CandHackathonPart,q20CandPrevWorkExp,q20CandPrestigeDegree,q20CandLinkInSkills,q20CandGithubPersProj2
5,6463843138,5,1,14,0,0,0,0,1,1,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
6,6458326054,6,1,5,9,19,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0
7,6467198274,3,1,4,2,0,0,0,1,0,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
24,6460870080,3,1,0,1,18,0,0,0,1,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
40,6479693460,6,1,14,10,11,1,0,1,0,...,0,0,0,0,0,0,1,0,0,1


In [17]:
# Filter the DataFrame for CountryNumeric2 = "United States"
us_dev_survey_values_df = dev_survey_values_df[dev_survey_values_df['CountryNumeric2'] == "United States"]

# Display the number of records
num_records_values = us_dev_survey_values_df.shape[0]
display(f"Number of records: {num_records_values}")

# Display the resulting DataFrame
display(us_dev_survey_values_df.head(5))

'Number of records: 4937'

Unnamed: 0,RespondentID,StartDate,EndDate,CountryNumeric2,q1AgeBeginCoding,q2Age,q3Gender,q4Education,q0004_other,q5DegreeFocus,...,q30LearnCodeOther,q0030_other,q31Level3,q32RecommendHackerRank,q0032_other,q33HackerRankChallforJob,q34PositiveExp,q34IdealLengHackerRankTest,q0035_other,q36Level4
5,6463843138,10/19/17 3:02,10/19/17 3:18,United States,41 - 50 years old,35 - 44 years old,Male,College graduate,,Computer Science,...,Other (please specify),SoloLearn,num%2 == 0,Yes,,No,,#NULL!,,Queue
6,6458326054,10/17/17 3:18,10/17/17 3:33,United States,16 - 20 years old,45 - 54 years old,Male,"Post graduate degree (Masters, PhD)",,Computer Science,...,,,num%2 == 0,Yes,,Yes,1.0,2 to 4 hours,,Queue
7,6467198274,10/21/17 8:55,10/21/17 9:06,United States,16 - 20 years old,18 - 24 years old,Male,College graduate,,"Other STEM (science, technology, engineering, ...",...,,,num%2 == 0,Yes,,No,,#NULL!,,Set
24,6460870080,10/18/17 1:18,10/18/17 1:25,United States,16 - 20 years old,18 - 24 years old,Male,College graduate,,"Other STEM (science, technology, engineering, ...",...,,,num%2 == 0,Yes,,No,,#NULL!,,Queue
40,6479693460,10/26/17 7:10,10/26/17 7:20,United States,11 - 15 years old,45 - 54 years old,Male,College graduate,,Computer Science,...,Other (please specify),IRC,num%2 == 0,Yes,,No,,#NULL!,,Queue


In [21]:
# Reduce dataframe columns to only those relevant to supporting or disproving my hypothesis

# List of relevant columns to keep
columns_to_keep = [
    'RespondentID', 'q2Age', 'q3Gender', 'q10Industry',
    'q8JobLevel', 'q9CurrentRole',
    'q12JobCritPrefTechStack', 'q12JobCritCompMission',
    'q12JobCritCompCulture', 'q12JobCritWorkLifeBal',
    'q12JobCritCompensation', 'q12JobCritProximity',
    'q12JobCritPerks', 'q12JobCritSmartPeopleTeam',
    'q12JobCritImpactwithProduct', 'q12JobCritInterestProblems',
    'q12JobCritFundingandValuation', 'q12JobCritStability',
    'q12JobCritProfGrowth', 'q16HiringManager',
    'q17HirChaNoDiversCandidates', 'q20CandYearExp',
    'q20CandCompScienceDegree', 'q20CandCodingBootcamp',
    'q20CandSkillCert', 'q20CandHackerRankActivity',
    'q20CandOtherCodingCommAct', 'q20CandGithubPersProj',
    'q20CandOpenSourceContrib', 'q20CandHackathonPart',
    'q20CandPrevWorkExp', 'q20CandPrestigeDegree',
    'q20CandLinkInSkills', 'q20CandGithubPersProj2'
]

# Create a new DataFrame with only the selected columns
filtered_us_dev_survey_values_df = us_dev_survey_values_df[columns_to_keep]

# Display information about the DataFrame 
filtered_us_dev_survey_values_df.info()

# Display the resulting DataFrame
display(filtered_us_dev_survey_values_df.head(5))

<class 'pandas.core.frame.DataFrame'>
Index: 4937 entries, 5 to 9892
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   RespondentID                   4937 non-null   int64 
 1   q2Age                          4937 non-null   object
 2   q3Gender                       4937 non-null   object
 3   q10Industry                    4416 non-null   object
 4   q8JobLevel                     4661 non-null   object
 5   q9CurrentRole                  4666 non-null   object
 6   q12JobCritPrefTechStack        971 non-null    object
 7   q12JobCritCompMission          743 non-null    object
 8   q12JobCritCompCulture          2095 non-null   object
 9   q12JobCritWorkLifeBal          2688 non-null   object
 10  q12JobCritCompensation         2342 non-null   object
 11  q12JobCritProximity            860 non-null    object
 12  q12JobCritPerks                488 non-null    object
 13  q12JobCr

Unnamed: 0,RespondentID,q2Age,q3Gender,q10Industry,q8JobLevel,q9CurrentRole,q12JobCritPrefTechStack,q12JobCritCompMission,q12JobCritCompCulture,q12JobCritWorkLifeBal,...,q20CandSkillCert,q20CandHackerRankActivity,q20CandOtherCodingCommAct,q20CandGithubPersProj,q20CandOpenSourceContrib,q20CandHackathonPart,q20CandPrevWorkExp,q20CandPrestigeDegree,q20CandLinkInSkills,q20CandGithubPersProj2
5,6463843138,35 - 44 years old,Male,Technology,,,,,Company culture,Good work/life balance,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
6,6458326054,45 - 54 years old,Male,Financial Services,Director / VP of Engineering,Unemployed,,Company mission,,,...,,,,,,,Previous work experience,Prestige of degree,,
7,6467198274,18 - 24 years old,Male,Education,New grad,,,,Company culture,,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
24,6460870080,18 - 24 years old,Male,,Student,Student,,,,Good work/life balance,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!
40,6479693460,45 - 54 years old,Male,Technology,Founder / CEO / CTO,Software Architect,Preferred tech stack,,Company culture,,...,,,,,,,Previous work experience,,,Github or personal projects
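The Numeric and Values files appear to hold the same answers in coded and labeled form, with the numeric mapping table connecting the two. The sketch below shows how a coded column might be decoded with that table. The three mapping rows are copied from the preview earlier in this notebook, while the `numeric` rows and the `decode_column` helper are hypothetical.

```python
import pandas as pd

# Mapping rows copied from the numeric-mapping preview (q1AgeBeginCoding only).
mapping = pd.DataFrame({
    "Data Field": ["q1AgeBeginCoding"] * 3,
    "Value": [1, 2, 3],
    "Label": ["5 - 10 years old", "11 - 15 years old", "16 - 20 years old"],
})

# Invented coded responses for illustration.
numeric = pd.DataFrame({
    "RespondentID": [101, 102, 103],
    "q1AgeBeginCoding": [3, 2, 3],
})


def decode_column(df, mapping_df, field):
    """Return the text labels for one coded column (hypothetical helper)."""
    lookup = (
        mapping_df.loc[mapping_df["Data Field"] == field]
        .set_index("Value")["Label"]
    )
    # Codes without a matching mapping row become NaN.
    return df[field].map(lookup)


numeric["q1AgeBeginCoding_label"] = decode_column(numeric, mapping, "q1AgeBeginCoding")
print(numeric)
```

In the real files the code column and the mapping's `Value` column would need matching dtypes for the lookup to succeed, which is an assumption this sketch does not verify.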


#### Import Dataset 2:

In [5]:
# Download the Pew Research zip file and extract it into ./data
pew_url = "https://www.pewresearch.org/wp-content/uploads/sites/20/2019/04/2017-Pew-Research-Center-STEM-survey.zip"
zip_path, _ = urlretrieve(pew_url)
with ZipFile(zip_path, "r") as archive:  # context manager closes the file even on error
    archive.extractall("./data")
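Before extracting an archive, it can help to list its members and run an integrity check. The sketch below does this on a small in-memory archive so that it runs without a download. The member names are hypothetical and say nothing about the real contents of the Pew zip file.

```python
import io
from zipfile import ZipFile

# Build a small in-memory archive; the member names are hypothetical.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("survey/data.sav", b"placeholder")
    archive.writestr("survey/readme.txt", b"placeholder")

# Inspect the archive before extracting anything.
with ZipFile(buffer, "r") as archive:
    member_names = archive.namelist()
    bad_file = archive.testzip()  # None when every member passes its CRC check

print(member_names)
print("First corrupt member:", bad_file)
```

Listing the members first shows which file formats the archive contains, which in turn determines which reader is needed for the next step.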

#### Import Dataset 3:

In [6]:
# Scrape the NCSES page and save each HTML table as a CSV
from io import StringIO  # recent pandas versions expect literal HTML wrapped in StringIO

os.makedirs("data/ncses", exist_ok=True)  # subfolder for tables scraped from the website

url = "https://ncses.nsf.gov/pubs/nsb20212/participation-of-demographic-groups-in-stem"
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, "html.parser")  # parse the HTML into a searchable tree

tables = soup.find_all("table")

# Loop through each table to extract its title and data
for i, table in enumerate(tables):
    # Use the table's title attribute for the filename, or a generic name if it has none
    title = table.get("title")
    title = title.strip() if title else f"table_{i}"

    # Clean the title: drop apostrophes, then replace other special characters and spaces with hyphens
    title = title.replace("'", "")
    title = re.sub(r"[^\w\s]", "-", title)
    title = title.replace(" ", "-")

    # Truncate titles longer than 50 characters for filename compatibility
    # (note: two truncated titles could collide and overwrite each other)
    title = title[:50]

    file_name = f"data/ncses/{title}.csv"  # output path for this table

    # Convert the table to a pandas DataFrame
    df = pd.read_html(StringIO(str(table)))[0]

    # Save the DataFrame as a CSV
    df.to_csv(file_name, index=False)
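The title-cleaning steps in the loop above could be factored into a helper that also guards against two tables ending up with the same filename after truncation. This is a sketch, not the notebook's method: `table_filename` and its `seen` argument are hypothetical, and it collapses runs of whitespace into a single hyphen, which differs slightly from the one-hyphen-per-space replacement above.

```python
import re


def table_filename(title, index, seen, max_len=50):
    """Build a filesystem-friendly, unique CSV name for a scraped table."""
    base = title.strip() if title else f"table_{index}"
    base = base.replace("'", "")                 # drop apostrophes
    base = re.sub(r"[^\w\s]", "-", base)         # other punctuation -> hyphen
    base = re.sub(r"\s+", "-", base)[:max_len]   # whitespace -> hyphen, then truncate
    name = base
    if name in seen:                             # avoid overwriting an earlier table
        name = f"{base}_{index}"
    seen.add(name)
    return f"{name}.csv"


seen_names = set()
first = table_filename("Women's share: S&E", 0, seen_names)
second = table_filename("Women's share: S&E", 1, seen_names)
untitled = table_filename(None, 2, seen_names)
print(first, second, untitled)
```

Appending the table index only on a collision keeps most filenames readable while still guaranteeing that no CSV is silently overwritten.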



## Resources and References
*What resources and references have you used for this project?*
📝 <!-- Answer Below -->

- https://it4063c.github.io/course-notes/working-with-data/data-sources for methods to import the various data types
- https://www.kaggle.com/datasets/hackerrank/developer-survey-2018/data for the Kaggle dataset
- https://www.pewresearch.org/social-trends/2018/01/09/women-and-men-in-stem-often-at-odds-over-workplace-equity/ for the link to the Pew Research survey
- https://ncses.nsf.gov/pubs/nsb20212/participation-of-demographic-groups-in-stem for the HTML-scraped dataset



In [1]:
# ⚠️ Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb

[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 7308 bytes to source.py
