# Gender inequality across OECD-countries
In this assignment, we conduct a descriptive analysis of the development in the gender wage gap (referred to as the gender gap below) across a selection of OECD-countries. In addition, we investigate potential causes of differences in the gender gap by plotting the development in net child care costs, access to formal child care and the length of paternity leave. The goal is to gain a deeper understanding of the potential drivers of gender inequality. 

All data has been downloaded from the OECD-data bank (https://stats.oecd.org/) and uploaded to our repository.

# Code structure

a) Import packages and load data. 

b) Gender gap.

c) Net childcare costs. 

d) Parental leave and formal child care. 


# a. Import packages and load data

In [325]:
# Importing packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import ipywidgets as widgets
#from matplotlib_venn import venn2


# Autoreloading modules when code is run
%load_ext autoreload
%autoreload 2



In [326]:
# Loading the data from OECD
df_gg = pd.read_csv('Gender gap.csv')
df_pl = pd.read_csv('Parental leave.csv')
df_f = pd.read_csv('Family.csv')
df_nc = pd.read_csv('netchildcare costs.csv')

# b. Gender gap
1) We display the OECD-data set for the gender gap across countries 
2) We drop irrelevant columns and rename 
3) We clean the data-set 
4) We select some of the countries from OECD
5) We make an interactive plot that illustrates the development in the gender gap over time for the selection of OECD-countries
6) We reshape the data frame in order to create an overview for the selected countries and for the interactive time series figure
7) We create an interactive time series figure with options for each country
8) We create a data set with only 2020-observations for later use

In [327]:
# 1) Displaying the gender gap dataset "uncleaned" by a Pandas DataFrame
df_gg.head(5)

Unnamed: 0,COUNTRY,Country,SEX,Sex,SERIES,Series,TIME,Time,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
0,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2005,2005,,,0,Units,,,41.906664,,
1,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2006,2006,,,0,Units,,,40.365519,,
2,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2007,2007,,,0,Units,,,41.081022,,
3,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2008,2008,,,0,Units,,,41.352529,,
4,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2009,2009,,,0,Units,,,40.141383,,


In [328]:
# 2) Structuring the dataset by dropping columns that we do not need
drop_these = ['COUNTRY','SEX','Sex','Series','TIME','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_gg.drop(drop_these, axis=1, inplace=True)

# Renaming columns
df_gg = df_gg.rename(columns = {"SERIES" : "Type", "Value": "Gender_gap", "Time": "Year"})
df_gg.head(5)

Unnamed: 0,Country,Type,Year,Gender_gap
0,Korea,GWG9,2005,41.906664
1,Korea,GWG9,2006,40.365519
2,Korea,GWG9,2007,41.081022
3,Korea,GWG9,2008,41.352529
4,Korea,GWG9,2009,40.141383


In [329]:
# 3) Keeping only the gender gap at the median 
I = df_gg.Type.str.contains('GWG5')
df_gg = df_gg.loc[I].copy()  # Copy so the in-place operations below act on an independent frame

# Reset old index
df_gg.reset_index(inplace = True, drop = True) # Drop old index too
df_gg.drop('Type', axis=1, inplace=True)

# Sort
df_gg = df_gg.sort_values(['Country','Year'])

# Reset old index
df_gg.reset_index(inplace = True, drop = True) # Drop old index too
df_gg.head(5)

Unnamed: 0,Country,Year,Gender_gap
0,Australia,2005,15.777778
1,Australia,2006,16.666667
2,Australia,2007,15.4
3,Australia,2008,11.937378
4,Australia,2009,16.363636
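
The filter in step 3 relies on `str.contains`, which matches substrings. As a minimal sketch of what that implies, the toy frame below compares a substring match with an exact match. The code `GWG5_ALT` is made up for illustration and is not an OECD series code; the values are placeholders.

```python
import pandas as pd

# Toy data: "GWG5_ALT" is a hypothetical code used only to show the difference
toy = pd.DataFrame({
    "Type": ["GWG5", "GWG9", "GWG5_ALT"],
    "Gender_gap": [15.0, 40.0, 99.0],
})

# Substring match: keeps every code that contains "GWG5"
substring_match = toy[toy["Type"].str.contains("GWG5")]

# Exact match: keeps only the code "GWG5" itself
exact_match = toy[toy["Type"] == "GWG5"]

print(len(substring_match), len(exact_match))
```

If the data only contains one code with `GWG5` in its name, both filters give the same result; the exact match is simply the more defensive choice.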


In [330]:
# 4) Country specific dataset
countries = ['Denmark', 'United States', 'Sweden','Germany','United Kingdom','OECD countries']
df_c_g = df_gg[df_gg['Country'].isin(countries)]
df_c_g = df_c_g.round(1)

# Reset old index
df_c_g.reset_index(inplace = True, drop = True) # Drop old index too
df_c_g

Unnamed: 0,Country,Year,Gender_gap
0,Denmark,2005,10.2
1,Denmark,2006,10.2
2,Denmark,2007,9.9
3,Denmark,2008,10.2
4,Denmark,2009,10.2
...,...,...,...
95,United States,2017,18.2
96,United States,2018,18.9
97,United States,2019,18.5
98,United States,2020,17.7


In [331]:
# 5) Times series plot of the Gender Gap
import plotly.graph_objects as go

fig = go.Figure()

# Use a separate loop variable so that df_c_g is not overwritten by the last group
for country, df_country in df_c_g.groupby('Country'):
    if country == 'OECD countries':  # add condition to change line style for OECD countries
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Gender_gap'], name=country, line=dict(color='black', dash='dash')))
    else:
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Gender_gap'], name=country))

fig.update_layout(title='Development in Gender Gap over time',
                  xaxis_title='Year',
                  yaxis_title='Pct.')

fig.show()

The gender gap has generally been decreasing over the considered time period across all selected OECD-countries. On average, the gender gap has decreased from around 16 pct. in 2005 to approximately 12 pct. in 2021. The gender gap is generally lower in Denmark and Sweden and higher in the United States and the United Kingdom. In Germany, the gender gap is above the OECD-average over the entire period. 
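
The change over time can also be put into numbers. The sketch below computes the change in the gender gap between the first and last available year for each country in a long-format frame. `gap_change` is a hypothetical helper, and the toy values are rounded from the OECD figures shown in this notebook for 2005 and 2014 only, so it is an illustration rather than the full calculation.

```python
import pandas as pd

# Toy long-format data (rounded values for 2005 and 2014)
toy = pd.DataFrame({
    "Country": ["Denmark", "Denmark", "Germany", "Germany"],
    "Year": [2005, 2014, 2005, 2014],
    "Gender_gap": [10.2, 6.3, 16.8, 17.2],
})

def gap_change(df):
    """Change in the gender gap (pct. points) from the first to the last year, per country."""
    ordered = df.sort_values(["Country", "Year"])
    grouped = ordered.groupby("Country")["Gender_gap"]
    return (grouped.last() - grouped.first()).round(1)

change = gap_change(toy)
print(change)
```

A negative number means the gap has narrowed over the period covered by the toy data.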

In [332]:
# 6) Changing the data frame in order to create an overview for the selected countries
countries = ['Denmark', 'United States', 'Sweden','Germany','United Kingdom','OECD countries']
df_c_g_pivot = df_gg[df_gg['Country'].isin(countries)]
df_pivot = df_c_g_pivot.pivot(index='Year', columns='Country', values='Gender_gap')
df_pivot.columns.name = None
df_pivot.reset_index(inplace=True)
df_pivot.head(20)

Unnamed: 0,Year,Denmark,Germany,OECD countries,Sweden,United Kingdom,United States
0,2005,10.16733,16.814159,16.037072,11.304348,22.063958,18.975069
1,2006,10.172061,18.542522,15.724404,11.016949,21.718084,19.246299
2,2007,9.850298,16.959064,15.046713,11.836735,21.634781,19.843342
3,2008,10.183549,16.959064,15.119709,10.588235,21.857869,20.050125
4,2009,10.172005,16.958599,14.56209,9.541985,20.686296,19.78022
5,2010,8.895098,16.694716,14.020081,9.363296,19.231437,18.81068
6,2011,7.947197,16.944444,13.846861,9.157509,18.246005,17.788462
7,2012,6.999845,16.304348,13.969857,9.285714,17.783883,19.086651
8,2013,6.768912,14.323288,13.864665,9.375,17.482014,17.906977
9,2014,6.323866,17.1875,13.827959,9.152542,17.382743,17.451206
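
With the data in wide format, each country can be compared directly with the OECD average. The sketch below subtracts the `OECD countries` column from the other columns. `gap_vs_oecd` is a hypothetical helper, and the toy frame holds three rows rounded from the table above.

```python
import pandas as pd

# Three rows rounded from the pivot table (2005, 2010 and 2014)
wide = pd.DataFrame({
    "Year": [2005, 2010, 2014],
    "Denmark": [10.2, 8.9, 6.3],
    "Germany": [16.8, 16.7, 17.2],
    "OECD countries": [16.0, 14.0, 13.8],
})

def gap_vs_oecd(df, benchmark="OECD countries"):
    """Difference (pct. points) between each country's gender gap and the benchmark column."""
    values = df.set_index("Year")
    return values.drop(columns=benchmark).sub(values[benchmark], axis=0).round(1)

diff = gap_vs_oecd(wide)
print(diff)
```

Positive values indicate a gender gap above the OECD average in that year, negative values a gap below it.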


In [333]:
# 7) Creating an interactive time series figure with options for each country

def _plot_timeseries(df_pivot, country, years):
    
    fig = plt.figure(dpi=100)
    ax = fig.add_subplot(1,1,1)
    
    I = (df_pivot['Year'] >= years[0]) & (df_pivot['Year'] <= years[1])
    
    x = df_pivot.loc[I,'Year']
    y = df_pivot.loc[I, country]
    ax.plot(x,y)
    
    ax.set_xticks(list(range(years[0], years[1] + 1, 5)))
    ax.set_xlabel('Year')
    ax.set_ylabel('Gender Gap')
    ax.set_title(country)    
    
def plot_timeseries(df_pivot):
    
    country_options = list(df_pivot.columns[1:])
    
    widgets.interact(_plot_timeseries, 
    df_pivot = widgets.fixed(df_pivot),
    country = widgets.Dropdown(
        description='Country', 
        options=country_options, 
        value=country_options[0]),
    years=widgets.IntRangeSlider(
        description="Years",
        min=df_pivot['Year'].min(),
        max=df_pivot['Year'].max(),
        value=[df_pivot['Year'].min(), df_pivot['Year'].max()],
        continuous_update=False,
    )                 
); 


In [334]:
plot_timeseries(df_pivot)


In [335]:
# 8) 2020 dataset for later use
df_gg_2020 = df_gg.loc[df_gg['Year']==2020]
df_gg_2020 = df_gg_2020.sort_values(['Country','Year'])
df_gg_2020.reset_index(inplace = True, drop = True) # Drop old index too
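
The same three steps (filter on a year, sort, reset the index) are needed for several data sets in this notebook. One possible way to avoid repeating them is a small helper, sketched below. `year_snapshot` is a hypothetical name, and the toy values are placeholders rather than OECD data.

```python
import pandas as pd

def year_snapshot(df, year):
    """Return the rows for one year, sorted by country, with a fresh index."""
    return (
        df.loc[df["Year"] == year]
        .sort_values(["Country", "Year"])
        .reset_index(drop=True)
    )

# Placeholder values for illustration only
toy = pd.DataFrame({
    "Country": ["Sweden", "Denmark", "Denmark"],
    "Year": [2020, 2019, 2020],
    "Gender_gap": [7.0, 5.0, 5.0],
})

snap = year_snapshot(toy, 2020)
print(snap)
```

Because the helper returns a new frame instead of modifying its input, it can be applied to each data set without side effects.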

# c. Net child care costs
1) We read the OECD-data set for net child care costs across countries. We drop irrelevant columns and rename
2) We clean the data-set 
3) We make an interactive plot that illustrates the development in net child care costs over time for a selection of OECD-countries
4) We create a data set with only 2020-observations for later use

In our inaugural project, we added a disutility of working in the labor market for women in order to make the model predictions fit the data. We interpreted this as norms. According to Kleven et al. (2019), only female labor supply is affected by children, i.e., the child penalty only strikes women, whereas having children is basically a non-event for fathers. We therefore incorporated norms in the model based on an assumption that part of home production is related to child care. A lack of access to child care in society could potentially augment the child penalty, as this could force women to stay at home (or at least work less) when they have children. We investigate this hypothesis by plotting net child care costs (as pct. of household income) across our selection of OECD-countries, since we should expect child care costs, all else equal, to be higher when access to formal child care is low. 

In [336]:
# Net child care costs 
# Drop columns that we do not need
drop_these = ['LOCATION','Type of indicator','Net childcares cost by item','Family type','Earnings of the first adult','HBTOPUPS','TIME','Flag Codes','Flags']
df_nc.drop(drop_these, axis=1, inplace=True)

#Rename columns
df_nc = df_nc.rename(columns = {"TYPE" : "Type", "Value": "Cost", "Time": "Year"})

In [337]:
# Get one observation per year per country. 
# We consider percentage of net household income for two earner family with wages at the average. 
# We include social benefits and housing benefits
I = (
    (df_nc['Type'] == 1)
    & (df_nc['FAMILY'] == '2EARNERC2C_67AW')
    & (df_nc['Include social assistance benefits'] == 'Yes')
    & (df_nc['Include housing benefits'] == 'Yes')
    & (df_nc['COMPONENTS'] == 5)
    & (df_nc['EARNINGS'] == '67AW')
)

df_nc = df_nc.loc[I].copy()  # Copy so the in-place operations below act on an independent frame

# Reset old index
df_nc.reset_index(inplace = True, drop = True) # Drop old index too
df_nc.drop('Type', axis=1, inplace=True)
df_nc.drop('FAMILY', axis=1, inplace=True)

# Sort
df_nc = df_nc.sort_values(['Country','Year'])

# Reset old index
df_nc.reset_index(inplace = True, drop = True) # Drop old index too

In [338]:
# Keep only relevant countries
countries_2 = ['Denmark', 'United States', 'Sweden', 'Germany','United Kingdom','OECD - Total']
df_c_nc = df_nc[df_nc['Country'].isin(countries_2)]
df_c_nc = df_c_nc.round(1)

In [339]:
fig = go.Figure()

# Use a separate loop variable so that df_c_nc is not overwritten by the last group
for country, df_country in df_c_nc.groupby('Country'):
    if country == 'OECD - Total':  # add condition to change line style for OECD countries
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Cost'], name=country, line=dict(color='black', dash='dash')))
    else:
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Cost'], name=country))

fig.update_layout(title='Net costs for a household with two children using childcare',
                  xaxis_title='Year',
                  yaxis_title='Pct. of household income')

fig.show()

As expected, net costs for a household with two children using childcare are generally higher in the UK and the US, where the gender gap is also higher. Interestingly, costs in Denmark are generally higher than in Sweden and Germany even though the gender gap is smallest in Denmark. This may be due to differences in the data and in how countries determine and calculate costs. 
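
The comparison between child care costs and the gender gap could be made more concrete by merging the two series for a single year and computing a correlation. The sketch below is only an illustration: the country labels and numbers are placeholders, not OECD values, and `cost_gap_correlation` is a hypothetical helper.

```python
import pandas as pd

# Placeholder values for illustration only (not OECD data)
costs = pd.DataFrame({"Country": ["A", "B", "C", "D"], "Cost": [5.0, 10.0, 20.0, 30.0]})
gaps = pd.DataFrame({"Country": ["A", "B", "C", "D"], "Gender_gap": [6.0, 9.0, 15.0, 18.0]})

def cost_gap_correlation(df_cost, df_gap):
    """Pearson correlation between net child care costs and the gender gap across countries."""
    merged = pd.merge(df_cost, df_gap, on="Country", how="inner")
    return merged["Cost"].corr(merged["Gender_gap"])

corr = cost_gap_correlation(costs, gaps)
print(round(corr, 2))
```

With only a handful of countries, such a correlation is purely descriptive and says nothing about causality.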

# d. Parental leave, formal child care and gender gap

1) We read the OECD-data sets for parental leave and formal child care across countries. We drop irrelevant columns and rename. We keep only observations for 2020. 
2) We clean the two data-sets
3) We merge the two data-sets and add the gender gap data for 2020 as well 
4) We drop countries without a gender gap observation for 2020
5) We create a bubble diagram that shows formal child care as pct. of GDP and the number of weeks fathers can take parental leave. The bubble size indicates the size of the gender gap.


In [340]:
# Parental leave dataset 
drop_these = ['COU','Indicator','SEX','Sex','AGE','Age Group','TIME','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_pl.drop(drop_these, axis=1, inplace=True)

#Rename columns
df_pl = df_pl.rename(columns = {"IND" : "Type", "Value": "Father_leave", "Time": "Year"})

In [341]:
#Keep only EMP18_PAT variable
I = df_pl.Type.str.contains('EMP18_PAT')
df_pl = df_pl.loc[I==True] 

# Reset old index
df_pl.reset_index(inplace = True, drop = True) # Drop old index too
df_pl.drop('Type', axis=1, inplace=True)

# Sort
df_pl = df_pl.sort_values(['Country','Year'])

# Reset old index
df_pl.reset_index(inplace = True, drop = True) # Drop old index too

In [342]:
# 2020
df_pl_2020 = df_pl.loc[df_pl['Year']==2020]
df_pl_2020 = df_pl_2020.sort_values(['Country','Year'])
df_pl_2020.reset_index(inplace = True, drop = True) # Drop old index too

In [343]:
# Child benefits dataset
drop_these = ['COU','Indicator','SEX','Sex','YEAR','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_f.drop(drop_these, axis=1, inplace=True)

In [344]:
#Keep FAM13
I = df_f.IND.str.contains('FAM13')
df_fam13 = df_f.loc[I==True] 
df_fam13 = df_fam13.rename(columns = {"IND" : "Type", "Value": "Formal_child_care"})

# Reset old index
df_fam13.reset_index(inplace = True, drop = True) # Drop old index too
df_fam13.drop("Type", axis=1, inplace=True)

# Sort
df_fam13 = df_fam13.sort_values(['Country','Year'])

# Reset old index
df_fam13.reset_index(inplace = True, drop = True) # Drop old index too

In [345]:
# 2020
df_fam13_2020 = df_fam13.loc[df_fam13['Year']==2020]
df_fam13_2020 = df_fam13_2020.sort_values(['Country','Year'])
df_fam13_2020.reset_index(inplace = True, drop = True) # Drop old index too

In [346]:
# Merge data-set
df_merged = pd.merge(df_gg_2020, df_pl_2020, on=['Country','Year'], how='outer')
df_merged = pd.merge(df_merged, df_fam13_2020, on=['Country','Year'], how='outer')
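
Since the merges are outer joins, countries missing from one of the data sets end up with NaN values. A minimal sketch of how to see where each row comes from uses the `indicator=True` option of `pd.merge`. The country labels and numbers below are placeholders, not OECD data.

```python
import pandas as pd

# Placeholder frames: country "A" lacks leave data, country "C" lacks gender gap data
gap = pd.DataFrame({"Country": ["A", "B"], "Year": [2020, 2020], "Gender_gap": [10.0, 15.0]})
leave = pd.DataFrame({"Country": ["B", "C"], "Year": [2020, 2020], "Father_leave": [2.0, 8.0]})

# indicator=True adds a "_merge" column with the values left_only, right_only or both
merged = pd.merge(gap, leave, on=["Country", "Year"], how="outer", indicator=True)
coverage = merged.set_index("Country")["_merge"].astype(str)
print(coverage)
```

Rows marked `left_only` or `right_only` are the ones that will carry missing values into the merged data set.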

In [347]:
# Drop observations where gender gap is NaN
df_merged = df_merged.round(1)
df_merged = df_merged.dropna(subset=['Gender_gap'])
# Remaining missing values (paternity leave, formal child care) are set to 0 and appear as 0 in the plot
df_merged = df_merged.fillna(0)
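
When dropping rows with a missing gender gap, filling all missing values with 0 and then filtering on 0 is not equivalent to `dropna(subset=...)`. The sketch below uses made-up values (countries A to C are placeholders) to show the difference: the fill-based filter also removes a genuine gap of 0, while `dropna` only removes rows where the gap itself is missing.

```python
import numpy as np
import pandas as pd

# Placeholder data: B has no gender gap, C has a gap of exactly 0 and no leave data
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Gender_gap": [12.0, np.nan, 0.0],
    "Father_leave": [2.0, 4.0, np.nan],
})

# Fill every NaN with 0, then drop rows where the gender gap equals 0
filled = toy.fillna(0)
via_fill = filled[filled["Gender_gap"] != 0]

# Drop only the rows where the gender gap is missing
via_dropna = toy.dropna(subset=["Gender_gap"])

print(list(via_fill["Country"]), list(via_dropna["Country"]))
```

In the `dropna` variant, country C is kept and its missing leave value stays NaN instead of being recorded as zero weeks.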

In [348]:
# Make bubble plot

highlighted_countries = ['Denmark', 'Sweden', 'Germany','United Kingdom','United States']

# Create a color map for the markers
marker_colors = df_merged['Country'].apply(lambda country: 'red' if country in highlighted_countries else 'blue')

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df_merged['Father_leave'],
    y=df_merged['Formal_child_care'],
    mode='markers',
    marker=dict(
        size=df_merged['Gender_gap'],
        sizemode='area',
        sizeref=0.1,
        sizemin=5,
        color=marker_colors,  # Set the marker color based on the color map
        opacity=0.7,  # Set the opacity to make the markers semi-transparent
        line=dict(width=0.5, color='white'),  # Add a white border around the markers
    ),
    text=df_merged['Country'],
))

# Set the plot layout
fig.update_layout(
    title='Paternal leave and formal child care by country (2020)',
    xaxis_title='Paternity leave (Weeks)',
    yaxis_title='Formal child care (Pct. of GDP)',
)

# Display the plot in an interactive window
fig.show()

It seems as though formal child care and paternity leave are important drivers for lowering the gender gap, although the plot only shows a cross-country association. However, there are some odd outliers (France, Korea and Japan). E.g., in Korea the gender gap is large even though formal child care costs are high and fathers seem to have good access to paternity leave. 

Again, there may be a problem with the validity of the OECD-data and with differences in how countries determine and calculate these figures. 

# Conclusion