<a href="https://colab.research.google.com/github/annemarija/storytelling-with-data/blob/master/data-stories/ceo-wage-gap/ceo_wage_gap_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

This demonstration notebook provides a suggested set of libraries that you might find useful in crafting your data stories.  You should comment out or delete libraries that you don't use in your analysis.

In [2]:
!pip install davos
import davos  # provides the "smuggle" keyword, used as a more robust version of "import"

Collecting davos
  Downloading davos-0.1.0-py3-none-any.whl (76 kB)
Installing collected packages: davos
Successfully installed davos-0.1.0


In [3]:
#number crunching
smuggle numpy as np
smuggle pandas as pd
smuggle statsmodels.formula.api as smf


#data visualization
smuggle plotly # pip: plotly==4.14.3
smuggle plotly.express as px
smuggle seaborn as sns
smuggle bokeh as bk
from matplotlib smuggle pyplot as plt
smuggle plotnine as pn
smuggle hypertools as hyp
smuggle folium as fm
from mpl_toolkits.mplot3d smuggle Axes3D





Collecting plotly==4.14.3
  Downloading plotly-4.14.3-py2.py3-none-any.whl (13.2 MB)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11447 sha256=0be6e8295cd07cb400e079d72d1950a9fb3e31bd884fb381237dd981cb11b273
  Stored in directory: /root/.cache/pip/wheels/f9/8d/8d/f6af3f7f9eea3553bc2fe6d53e4b287dad18b06a861ac56ddf
Successfully built retrying
Installing collected packages: retrying, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
Successfully installed plotly-4.14.3 retrying-1.3.3
Collecting hypertools
  Downloading hypertools-0.8.0-py3-none-any.whl (59 kB)

# Project team

Annemarija Apine (annemarija), Elisa Brosera (elisabrosera). Annemarija was mostly responsible for the code and Elisa for the story, but it was a collaborative effort.

# Background and overview

I came across an article with a clickbait-style title claiming that female CEOs out-earn their male counterparts (https://fortune.com/2016/05/10/female-ceos-out-earn-men/). Having taken a class in gender and labour economics, I was curious whether we could replicate the result using Execucomp data.

# Approach

We use descriptive statistics and linear regressions with controls to see what the gender wage gap looks like at the CEO level.
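
As a rough sketch, the regressions in the Analysis section can be written as follows. The symbols are notation introduced here for illustration: $i$ indexes CEOs, $t$ indexes years, and $\gamma_t$ stands for the year dummies that `C(year)` adds. The first regression omits $\gamma_t$.

```latex
\log(\text{real\_tdc}_{it}) = \alpha + \beta \, \text{female}_{i} + \gamma_{t} + \varepsilon_{it}
```

Here $\beta$ is the gap in log compensation between female and male CEOs. A positive $\beta$ means female CEOs earn more, conditional on whatever controls are included.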

# Quick summary

The finding that female CEOs earn more (at least in the Execucomp data set) results from a lack of controls for time and other variables.


# Data

Briefly describe your dataset(s), including links to original sources.  Provide any relevant background information specific to your data sources.

In [4]:
#Share of female CEOs by year
#The data originally come from Compustat Execucomp (1992-2020),
#but were modified and cleaned up in Stata for another project I was working on;
#the same applies to the other files
#Original source: https://wrds-www.wharton.upenn.edu/pages/grid-items/compustat-execucomp-basics/
share=pd.read_excel('https://github.com/annemarija/storytelling-with-data/raw/master/data-stories/ceo-wage-gap/CEOshare.xlsx')
share.head()

Unnamed: 0,Year,Percentage of female CEOs
0,1992.0,0.57269
1,1993.0,0.52016
2,1994.0,0.71219
3,1995.0,0.79554
4,1996.0,0.86304


In [5]:
#Total compensation of executives in real terms, 1000s of USD
tdc1=pd.read_excel('https://github.com/annemarija/storytelling-with-data/raw/master/data-stories/ceo-wage-gap/TDCyear.xlsx')
tdc1.head()


Unnamed: 0,Year,Average Real Total Compensation (1000s USD)
0,1992,2447.927
1,1993,2583.702
2,1994,3016.517
3,1995,3209.516
4,1996,4069.685


In [6]:
#Dataset for running regressions; has been modified to only contain CEOs
exec=pd.read_stata('https://github.com/annemarija/storytelling-with-data/raw/master/data-stories/ceo-wage-gap/execucomp_92_20.dta')
exec.head()

Unnamed: 0,gvkey,execid,year,age,real_tdc,female,CEO,log_r_tdc
0,10553,2,1992.0,,2091.282227,0.0,1.0,7.645533
1,10553,2,1993.0,38.0,3411.758789,0.0,1.0,8.134983
2,10553,2,1994.0,39.0,4850.436523,0.0,1.0,8.486824
3,10553,2,1995.0,40.0,10640.047852,0.0,1.0,9.27238
4,10553,2,1996.0,41.0,5959.479492,0.0,1.0,8.692739


# Analysis

Briefly describe each step of your analysis, followed by the code implementing that part of the analysis and/or producing the relevant figures.  (Copy this text block and the following code block as many times as are needed.)

In [7]:
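#Calculate means by sex (dummy variable 'female') to see whether female CEOs earn more ('real_tdc')
#numeric_only=True skips non-numeric columns; recent pandas versions raise an error without it
exec.groupby('female').mean(numeric_only=True)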

Unnamed: 0_level_0,year,age,real_tdc,CEO,log_r_tdc
female,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0.0,2005.890706,56.37933,5280.220703,1.0,7.925685
1.0,2010.522758,54.319542,5480.783203,1.0,8.073177


In [None]:
#On average, female CEOs earn about 200,000 USD more (5,480.8 vs. 5,280.2 thousand USD)

In [8]:
#Perform a simple OLS regression to check whether being female (female dummy is 1)
#has an impact on log real compensation
result=smf.ols(formula='log_r_tdc ~ female', data=exec).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              log_r_tdc   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     29.25
Date:                Wed, 16 Mar 2022   Prob (F-statistic):           6.38e-08
Time:                        07:37:25   Log-Likelihood:            -1.1587e+05
No. Observations:               71314   AIC:                         2.317e+05
Df Residuals:                   71312   BIC:                         2.318e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.9257      0.005   1697.237      0.0


In [None]:
#The gap equals the difference in mean log_r_tdc above (8.073 - 7.926), i.e. about 15 log points
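
A gap in log points can be converted into an approximate percentage gap. The sketch below only redoes that arithmetic using the two group means of `log_r_tdc` printed by the `groupby` cell; the variable names are ours. Because these are means of logs, the result compares geometric means, so it need not match the gap in the arithmetic means of `real_tdc`.

```python
import math

# Mean log real compensation by gender, copied from the groupby output above.
mean_log_female = 8.073177
mean_log_male = 7.925685

# In a regression of log_r_tdc on the female dummy alone, the female
# coefficient equals this difference in group means.
log_gap = mean_log_female - mean_log_male

# A log difference d corresponds to a proportional difference of exp(d) - 1.
pct_gap = math.exp(log_gap) - 1

print(f"gap in log points: {100 * log_gap:.1f}")
print(f"implied percentage gap: {pct_gap:.1%}")
```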

In [9]:
#Even if female CEOs earn more, far fewer women than men hold CEO positions
#Show how the % of female CEOs has changed over the years
fig = px.line(share, x="Year", y="Percentage of female CEOs", title='Share of Female CEOs (1992-2020)', width=800)
fig.show()

In [11]:
#Could the timing of when women became CEOs explain the difference in compensation?
#Add year dummies, C(year), to control for the year of each observation
result=smf.ols(formula="log_r_tdc ~ female + C(year) ", data=exec).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              log_r_tdc   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.039
Method:                 Least Squares   F-statistic:                     101.1
Date:                Wed, 16 Mar 2022   Prob (F-statistic):               0.00
Time:                        07:38:06   Log-Likelihood:            -1.1445e+05
No. Observations:               71314   AIC:                         2.290e+05
Df Residuals:                   71284   BIC:                         2.292e+05
Df Model:                          29                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             7.3108      0.02

In [12]:
#Why such a strong impact from controlling for years?
#The CEO compensation has increased over the years
#Visually:
fig = px.line(tdc1, x="Year", y="Average Real Total Compensation (1000s USD)", title='Average total compensation of CEOs (1992-2020)', width=800)
fig.show()
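
To see why year controls can matter this much, here is a minimal simulation on synthetic data (not the Execucomp data). It assumes pay rises over time for everyone, the share of women rises over time, and there is no true gender effect at all. All numbers, names and the helper `ols` are hypothetical choices for illustration, and it uses plain NumPy least squares rather than `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic panel: 2,000 CEO observations per year, 1992-2020.
years = np.arange(1992, 2021)
n_per_year = 2000
year = np.repeat(years, n_per_year)
trend = (year - 1992) / 28               # 0 in 1992, 1 in 2020

# Assumption: the share of female CEOs grows from 0.5% to 6% over the period.
p_female = 0.005 + 0.055 * trend
female = (rng.random(year.size) < p_female).astype(float)

# Assumption: log pay rises with time and has NO gender component.
log_pay = 7.3 + 1.2 * trend + rng.normal(0.0, 1.0, year.size)


def ols(X, y):
    """Return OLS coefficients computed by least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Model 1: log_pay ~ female (intercept plus the female dummy).
X_raw = np.column_stack([np.ones(year.size), female])
beta_raw = ols(X_raw, log_pay)[1]

# Model 2: log_pay ~ female + year dummies (one dummy per year, no intercept).
year_dummies = (year[:, None] == years[None, :]).astype(float)
X_year = np.column_stack([female, year_dummies])
beta_year = ols(X_year, log_pay)[0]

print(f"female coefficient without year controls: {beta_raw:.3f}")
print(f"female coefficient with year controls:    {beta_year:.3f}")
```

In this setup the uncontrolled coefficient is positive only because women are concentrated in later, higher-paying years. Once the year dummies absorb the time trend, the coefficient moves toward zero.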

In [13]:
#The female coefficient is no longer significant. Does this mean equality?
#We can try controlling for something else, e.g., firm fixed effects
#This is the only part that we unfortunately did not manage in Python;
#we ran it in Stata instead,
#because the package providing an "absorb" option for regressions was not compatible with our setup
#We used the same regression as before, adding firm fixed effects based on gvkey
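
One way to get firm fixed effects in Python without an extra package is the within transformation: subtract each firm's mean from the outcome and from every regressor, then run OLS on the demeaned data. The sketch below shows this on synthetic data (not the Execucomp data, and not the Stata run we did). Only the column names `gvkey`, `year`, `female` and `log_r_tdc` come from this notebook; everything else is a hypothetical choice for illustration. The true female effect is set to -0.10.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical balanced panel: 500 firms observed for 10 years each.
n_firms, n_years = 500, 10
gvkey = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(2000, 2000 + n_years), n_firms)

# Assumption: high-paying firms are more likely to appoint a female CEO.
firm_effect = rng.normal(0.0, 1.0, n_firms)
switcher = rng.random(n_firms) < (0.1 + 0.5 * (firm_effect > 0))
switch_year = rng.integers(2001, 2010, n_firms)   # years 2001..2009
female = (switcher[gvkey] & (year >= switch_year[gvkey])).astype(float)

true_effect = -0.10
log_r_tdc = (7.5 + firm_effect[gvkey] + 0.03 * (year - 2000)
             + true_effect * female + rng.normal(0.0, 0.2, gvkey.size))
df = pd.DataFrame({"gvkey": gvkey, "year": year,
                   "female": female, "log_r_tdc": log_r_tdc})

# Regressors: the female dummy plus year dummies (one year dropped).
year_dummies = pd.get_dummies(df["year"], prefix="yr", drop_first=True, dtype=float)
X = pd.concat([df[["female"]], year_dummies], axis=1)

# Pooled OLS with year dummies but no firm effects, for comparison.
X_pooled = np.column_stack([np.ones(len(df)), X.to_numpy()])
beta_pooled = np.linalg.lstsq(X_pooled, df["log_r_tdc"].to_numpy(), rcond=None)[0][1]

# Within transformation: demean the outcome and all regressors by firm.
X_within = X - X.groupby(df["gvkey"]).transform("mean")
y_within = df["log_r_tdc"] - df.groupby("gvkey")["log_r_tdc"].transform("mean")
beta_within = np.linalg.lstsq(X_within.to_numpy(), y_within.to_numpy(), rcond=None)[0][0]

print(f"pooled OLS female coefficient:     {beta_pooled:.3f}")
print(f"firm fixed-effects coefficient:    {beta_within:.3f}")
```

The female coefficient here is identified only by firms whose CEO gender changes over time, which is also true of any firm fixed-effects specification on the real data. Standard errors from a plain OLS on demeaned data would need a degrees-of-freedom correction, which this sketch does not attempt.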

# Interpretations and conclusions

Describe and discuss your findings and say how they answer your question (or how they failed to answer your question).  Also describe the current state of your project-- e.g., is this a "complete" story, or is further exploration needed?

Our findings show that although female CEOs out-earn their male counterparts on average over the 1992-2020 period, this raw comparison is not an accurate representation of the gender pay gap among CEOs: once year controls are added, the female coefficient is no longer significant.
More controls are needed to describe the characteristics of female CEOs (education, firm size, etc.) and the gender pay gap more accurately.

# Future directions

How do female CEOs compare to their male counterparts in terms of education, achievements, etc.? Is it the case that female CEOs are more qualified due to discrimination?