# 2020 Stack Overflow Survey - Focus: Diversity

This notebook uses the <b>CRISP-DM</b> process to analyze the 2020 Stack Overflow survey, focusing on aspects related to diversity in tech.

## 1. Business Understanding
International Women’s Day 2021 again sparked many discussions about diversity and inclusion. Having just started the Udacity Nanodegree in Data Science as the only female in my company, the day prompted me to investigate the status quo in the tech community and to provide current facts & figures in order to contribute to the overall diversity discussion. The analysis addresses three questions:
1. What is the demographic setup of today’s developer community? What profile is typical for a person writing code these days?
2. How inclusive is the community? Do underrepresented groups feel equally welcome?
3. Are there differences regarding compensation? Is there a gender pay gap?

## 2. Data Understanding
The questions stated above are examined using the results of the largest, most current study of developers globally: the 2020 Stack Overflow Annual Developer Survey with nearly 65,000 respondents. The dataset is readily accessible from Stack Overflow and covers a wide range of topics. The following steps are undertaken to get a better understanding of the data.

#### Import required libraries and main dataset with survey responses

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Styling the visualizations
sns.set()
sns.set_style("whitegrid")

#Read in survey responses dataset
df = pd.read_csv('./survey_results_public.csv')
df.head()

#### Examining original survey questions to gain better understanding of responses

In [None]:
#Adapt pandas default setting to display all survey questions in full length
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

#Read in schema to display survey questions
df_schema = pd.read_csv('./survey_results_schema.csv')
df_schema

#### Examine shape and column names of main dataset

In [None]:
df.shape

In [None]:
df.columns

#### Finding: 
The relevant columns used for answering the posed questions are: Age, ConvertedComp (yearly compensation converted into USD), Country, Ethnicity, Gender, SOAccount (whether a respondent has a Stack Overflow account), SOComm (whether a respondent considers themselves a member of the Stack Overflow community), Sexuality, Trans(gender), YearsCodePro (professional coding experience).

#### Missing values

In [None]:
#Adapt pandas default setting to display all results
pd.set_option('display.max_rows', None)

#Show missing values
round(df.isnull().mean().sort_values(ascending=False), 2)

#### Finding: 
There are missing values in the relevant columns for examining diversity in tech: 
ConvertedComp (46%), Sexuality (32%), Age and Ethnicity (29%), YearsCodePro (28%), Trans (23%), Gender (22%), SOAccount & SOComm (12%) and Country (1%).
As the notebook deals with descriptive statistics, no manipulation such as imputing values will be undertaken at this point. This ensures that there is the maximum data available for the different dimensions that will be analyzed and that there is no bias introduced by just keeping data of respondents that were willing to disclose everything in the survey.
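
To keep the focus on the diversity-related columns, the missing-value shares can also be computed for that subset only. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not survey data); with the real data, the same expression would be applied to `df` and the full list of relevant columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the survey data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Gender": ["Man", "Woman", None, "Man"],
    "ConvertedComp": [np.nan, np.nan, 50000, 72000],
    "Hobbyist": ["Yes", "No", "Yes", "Yes"],
})

# Columns of interest (here only a subset of the relevant columns)
relevant_cols = ["Age", "Gender", "ConvertedComp"]

# Share of missing values per relevant column, largest first
missing_share = (
    toy[relevant_cols].isnull().mean().sort_values(ascending=False).round(2)
)
print(missing_share)
```

Restricting the output this way avoids scanning all survey columns for the few that matter here.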

## 3. Data Preparation

In the following cells, several steps prepare the data for analysis and visualization:
- replacing multiple answers with a single category where adequate (e.g. to display cleaner visualizations)
- shortening responses where adequate (e.g. to display data in a cleaner fashion)
- adding a sort order to ranked categorical data
- converting strings into numeric data where adequate

In [None]:
df['Ethnicity'] = df['Ethnicity'].replace(to_replace=r'(.*;.*)', value='Multiple answers', regex=True)
df['Ethnicity'] = df['Ethnicity'].replace(to_replace=r'(Indigenous.*)', value='Indigenous', regex=True)

df['Gender'] = df['Gender'].replace(to_replace=r'(.*;.*)', value='Multiple answers', regex=True)

df['Sexuality'] = df['Sexuality'].replace(to_replace=r'(.*;.*)', value='Multiple answers', regex=True)

df['SOComm'] = df['SOComm'].replace(
    ['Neutral', 'No, not at all', 'No, not really', 'Not sure', 'Yes, definitely', 'Yes, somewhat'],
    ['3 - Neutral', '5 - No, not at all', '4 - No, not really', '6 - Not sure', '1 - Yes, definitely', '2 - Yes, somewhat'])

df['YearsCodePro'] = df['YearsCodePro'].replace(['Less than 1 year', 'More than 50 years'],['0', '51'])
df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')
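
To make the effect of these preparation steps easier to follow, here is a minimal sketch that applies the same two techniques to small made-up Series (`gender` and `years` are assumptions for illustration, not survey data): collapsing semicolon-separated multiple answers into one category, and converting the experience answers to numbers.

```python
import pandas as pd

# Made-up example responses (illustrative only)
gender = pd.Series(["Man", "Woman;Man", None, "Woman"])
years = pd.Series(["Less than 1 year", "3", "More than 50 years", None])

# Any response containing ';' is treated as a multiple-choice answer
gender_clean = gender.replace(
    to_replace=r"(.*;.*)", value="Multiple answers", regex=True
)

# Map the two text answers to numeric strings, then convert;
# missing answers stay missing (NaN)
years_clean = pd.to_numeric(
    years.replace(["Less than 1 year", "More than 50 years"], ["0", "51"]),
    errors="coerce",
)

print(gender_clean.tolist())
print(years_clean.tolist())
```

Note that mapping "Less than 1 year" to 0 and "More than 50 years" to 51 is a simplification: it lets the column be treated as numeric at the cost of slightly distorting both ends of the range.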

## 4. Data Modeling / Analysis

In the following cells, each question is tackled with the preprocessed data.

### QUESTION 1: Demographic setup

#### Stereotypical developer

In [None]:
#Calculate and print "stereotypical" features of developers, mode for categorical data, median for numerical data
avg_gender = df.Gender.mode()[0]
avg_age = round(df.Age.median())
avg_sexuality = df.Sexuality.mode()[0]
avg_ethnicity = df.Ethnicity.mode()[0]
avg_trans = df.Trans.mode()[0]

print(f'The typical developer is a {avg_age}-year-old, {avg_sexuality}, {avg_trans}-trans, {avg_ethnicity} {avg_gender}.')

#### Gender

In [None]:
#Count values and plot as pie chart
gender_counts = df['Gender'].value_counts()

gender_counts.plot.pie(
    autopct="%.1f%%", 
    explode=(0, 0.1, 0.1, 0.5), 
    figsize=(10, 10));

#### Trans(gender)

In [None]:
#Count values and show as percentages
trans_counts = df.Trans.value_counts()
trans_counts/sum(trans_counts)
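
The manual normalization above can also be expressed with the `normalize` parameter of `value_counts`. A minimal sketch on made-up responses (the `trans` Series is an assumption for illustration, not survey data) shows that both approaches give the same shares:

```python
import pandas as pd

# Made-up responses (illustrative only); None marks a missing answer
trans = pd.Series(["No", "No", "Yes", None, "No"])

# Manual normalization: counts divided by their total
counts = trans.value_counts()
manual_share = counts / counts.sum()

# Built-in equivalent; missing answers are excluded by default
builtin_share = trans.value_counts(normalize=True)

print(builtin_share)
```

In both cases the shares refer to respondents who answered the question, not to all respondents.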

#### Sexuality

In [None]:
#Count values and plot as pie chart
sexuality_counts = df['Sexuality'].value_counts()

sexuality_counts.plot.pie(
    autopct="%.1f%%", 
    explode=(0, 0.1, 0.1, 0.1, 0.3), 
    figsize=(10, 10));

#### Ethnicity

In [None]:
#Count values and plot as pie chart
ethnicity_counts = df['Ethnicity'].value_counts()

ethnicity_counts.plot.pie(
    autopct="%.1f%%", 
    explode=(0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.4, 0.5), 
    figsize=(10, 10));

#### Age by Gender Group

In [None]:
#Create age buckets and gender groups
age_bins = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60]
age_men = df['Age'][df.Gender.eq("Man")].dropna()
age_women = df['Age'][df.Gender.eq("Woman")].dropna()

#Plot as labelled histogram
gender_label = ['Men', 'Women']
plt.hist([age_men, age_women], bins=age_bins, density=True, label=gender_label)
plt.legend();

In [None]:
#Calculate and print median age
median_age_women = round(age_women.median())
median_age_men = round(age_men.median())

print(f'The median age is: {median_age_women} years old (women) and {median_age_men} years old (men).')

### QUESTION 2: Inclusiveness of Stack Overflow community

In [None]:
#Only include respondents with a Stack Overflow account
members_only = df[df.SOAccount.eq("Yes")].groupby(['Gender', 'SOComm'])['Respondent'].count()

#Group members by gender
members_gender = df[df.SOAccount.eq("Yes")].groupby(['Gender'])['Respondent'].count()

#Show percentages of responses based on gender
feeling_member = members_only.div(members_gender, level="Gender")
feeling_member
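
An alternative way to obtain per-gender shares is `pd.crosstab` with `normalize="index"`. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration). One difference to be aware of: `crosstab` leaves out rows where either answer is missing, so each row sums to 1, whereas the cell above divides by all account holders of a gender.

```python
import pandas as pd

# Made-up respondents (illustrative only)
toy = pd.DataFrame({
    "Gender": ["Man", "Man", "Man", "Man", "Woman", "Woman"],
    "SOAccount": ["Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "SOComm": [
        "1 - Yes, definitely", "3 - Neutral", "1 - Yes, definitely",
        "3 - Neutral", "3 - Neutral", "5 - No, not at all",
    ],
})

# Keep account holders only, then compute the share of each answer per gender
members = toy[toy["SOAccount"].eq("Yes")]
share = pd.crosstab(members["Gender"], members["SOComm"], normalize="index")

print(share)
```

The resulting frame is already in wide form, so it could be passed to `plot.barh(stacked=True)` without an extra `unstack()`.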

In [None]:
#Plot as horizontally stacked bar chart and show legend
feeling_member.unstack().plot.barh(stacked=True)
plt.legend(loc="center right");

### QUESTION 3: Salary

#### Salaries across countries

In [None]:
#Create compensation bins
comp_bins = [0, 20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000, 180000, 200000, 220000, 240000]

#Filtering compensation by country
comp_usa = df['ConvertedComp'][df.Country.eq("United States")].dropna()
comp_uk = df['ConvertedComp'][df.Country.eq("United Kingdom")].dropna()
comp_india = df['ConvertedComp'][df.Country.eq("India")].dropna()

#Plotting data as histogram and show legend
country_label = ['USA', 'UK', 'India']
plt.hist([comp_usa, comp_uk, comp_india], bins=comp_bins, density=True, label=country_label);
plt.legend();

In [None]:
#Calculate and print median compensation by country
median_comp_usa = round(comp_usa.median())
median_comp_uk = round(comp_uk.median())
median_comp_india = round(comp_india.median())

print(f'The median compensation in USD is: ${median_comp_usa} (USA), ${median_comp_uk} (UK), ${median_comp_india} (India)')

#### Salary by gender in United States

In [None]:
#Filter dataset for United States
usa = df[df.Country.eq("United States")]

#Create gender groups
comp_usa_men = usa['ConvertedComp'][usa.Gender.eq("Man")].dropna()
comp_usa_women = usa['ConvertedComp'][usa.Gender.eq("Woman")].dropna()

#Plot as labelled histogram and show legend
plt.hist([comp_usa_men, comp_usa_women], bins=comp_bins, density=True, label=gender_label)
plt.legend();

In [None]:
#Calculate median compensation and print
median_comp_usa_women = round(comp_usa_women.median())
median_comp_usa_men = round(comp_usa_men.median())

print(f'The median compensation in USD is: ${median_comp_usa_women} for women and ${median_comp_usa_men} for men.')
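
To compare the two medians more directly, the difference can be expressed relative to the men's median. This is a minimal sketch on made-up compensation values (`toy` and all numbers are assumptions for illustration, not survey results); the definition of the gap used here is one common choice, not the only one.

```python
import pandas as pd

# Made-up compensation values in USD (illustrative only, not survey results)
toy = pd.DataFrame({
    "Gender": ["Man", "Man", "Man", "Woman", "Woman", "Woman"],
    "ConvertedComp": [100000, 120000, 140000, 90000, 108000, 130000],
})

# Median compensation per gender
medians = toy.groupby("Gender")["ConvertedComp"].median()

# Relative gap: how much lower the women's median is,
# as a share of the men's median
gap = (medians["Man"] - medians["Woman"]) / medians["Man"]

print(f"Median gap: {gap:.1%}")
```

A positive value means the men's median is higher; a negative value means the women's median is higher.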

In [None]:
#Plot histogram of professional coding experience
df['YearsCodePro'].plot.hist();

In [None]:
#Print relevant statistics
df['YearsCodePro'].describe()

In [None]:
#Filter experience by gender group
code_men = df['YearsCodePro'][df.Gender.eq("Man")].dropna()
code_women = df['YearsCodePro'][df.Gender.eq("Woman")].dropna()

#Plot as labelled histogram and show legend
plt.hist([code_men, code_women], density=True, label=gender_label)
plt.legend();

In [None]:
#Calculate median professional coding experience
median_code_women = round(code_women.median())
median_code_men = round(code_men.median())

print(f'The median professional coding experience is: {median_code_women} years for women and {median_code_men} years for men.')

In [None]:
#Create experience bins and show counts to evaluate if group sizes are large enough
years_bins = [0, 2, 5, 10, 15, 51]

usa_years_gender = usa.groupby(['Gender', pd.cut(usa.YearsCodePro, years_bins)])
usa_years_gender.size().unstack()
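
One detail worth checking when binning with `pd.cut`: by default the intervals are open on the left, so a value equal to the lowest edge (here 0, which stands for "Less than 1 year") does not fall into any bin. The sketch below illustrates this on made-up values (the `years` Series is an assumption for illustration) and shows `include_lowest=True` as one possible way to keep those respondents.

```python
import pandas as pd

years_bins = [0, 2, 5, 10, 15, 51]

# Made-up experience values; 0 stands for "Less than 1 year"
years = pd.Series([0, 1, 2, 7, 51])

# Default: intervals are open on the left, so 0 falls outside (0, 2]
default_cut = pd.cut(years, years_bins)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(years, years_bins, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```

Whether to include the lowest edge is a design choice; it changes which respondents end up in the first experience group.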

In [None]:
#Cut coding experience into bins
df['ProCodingExperience'] = pd.cut(df['YearsCodePro'], years_bins)

#Filter for United States and remove groups with too low numbers
usa_men_women = df[df['Country'].eq("United States") & ~df['Gender'].isin(["Multiple answers", "Non-binary, genderqueer, or gender non-conforming"])]

#Create seaborn boxplot and hide outliers to improve readability
sns.boxplot(
    x="ProCodingExperience", 
    y="ConvertedComp", 
    hue="Gender", 
    data=usa_men_women, 
    showfliers=False);

In [None]:
usa_men_women.groupby(['ProCodingExperience','Gender'])['ConvertedComp'].median()
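
Because small groups can make medians unstable, it can help to show the group size next to each median. The following is a minimal sketch on a made-up frame (`toy`, its values, and the `min_group_size` threshold are assumptions for illustration):

```python
import pandas as pd

# Made-up data (illustrative only, not survey results)
toy = pd.DataFrame({
    "ProCodingExperience": ["0-2", "0-2", "0-2", "15+", "15+"],
    "Gender": ["Man", "Woman", "Woman", "Man", "Man"],
    "ConvertedComp": [60000, 65000, 75000, 150000, 170000],
})

# Median compensation and number of respondents per group
summary = (
    toy.groupby(["ProCodingExperience", "Gender"])["ConvertedComp"]
    .agg(["median", "count"])
)

# Hypothetical threshold: keep only groups with enough respondents
min_group_size = 2
reliable = summary[summary["count"] >= min_group_size]

print(reliable)
```

With the real data, a considerably larger threshold would be appropriate; the value 2 is only chosen so that the toy example filters something out.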

## 5. Evaluation of results
The results are described primarily in my Medium blog post: https://juttarichter.medium.com/top-facts-about-diversity-inclusion-equ-al-ity-in-todays-tech-community-dd33916f6b11

Here is a quick overview:
1. What is the demographic setup of today’s developer community? What profile is typical for a person writing code these days? 
In all analyzed dimensions, a single group dominates by a large margin. The typical developer is a 29-year-old, heterosexual, non-transgender, white man.

2. How inclusive is the community? Do underrepresented groups feel equally welcome?
Among respondents with a Stack Overflow account, half of all men consider themselves a member of the community; the percentages are much lower for women (29%) and for non-binary, genderqueer, or gender non-conforming respondents (29%).

3. Are there differences regarding compensation? Is there a gender pay gap?
There are differences, but much depends on factors such as country of residence and professional coding experience. When filtering for the US and splitting by experience, women even tend to have a higher median compensation early in their careers; with 15+ years of experience, the median compensation for men is slightly higher. Additional factors such as chosen career, education, or programming language were not examined further, as the data would be split into buckets that are too small.