# Gender inequality across OECD-countries
In this assignment, we conduct a descriptive analysis of the development in the gender gap across a selection of OECD-countries. In addition, we investigate potential causes of differences in the gender gap by plotting the development in net child care costs, access to formal child care and length of paternity leave. The goal is to gain a deeper understanding of the potential drivers of gender inequality. 

All data has been downloaded from the OECD-data bank (https://stats.oecd.org/) and uploaded to our repository.

# Code structure

a) Import packages and load data. 

b) Gender gap.

c) Net childcare costs. 

d) Parental leave and access to formal child care. 


# a. Import packages and load data

In [114]:
# Importing packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Note: from matplotlib 3.6 onwards this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import ipywidgets as widgets
#from matplotlib_venn import venn2


# Autoreloading modules when code is run
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [115]:
# Loading the data from OECD
df_gg = pd.read_csv('Gender gap.csv')
df_pl = pd.read_csv('Parental leave.csv')
df_f = pd.read_csv('Family.csv')
df_nc = pd.read_csv('netchildcare costs.csv')

# b. Gender gap
1) We display the OECD-data set for the gender gap across countries 
2) We drop irrelevant columns and rename the remaining ones 
3) We clean the data-set and keep only the gender gap at the median 
4) We select some of the countries from OECD
5) We make an interactive plot that illustrates the development in the gender gap over time for the selection of OECD-countries
6) We reshape the data frame in order to create an overview of all countries for the interactive time series figure
7) We create an interactive time series figure with a drop-down menu for each country
8) We create a data set with only 2020-observations for later use

In [116]:
# 1) Displaying the gender gap dataset "uncleaned" by a Pandas DataFrame
df_gg.head(5)

Unnamed: 0,COUNTRY,Country,SEX,Sex,SERIES,Series,TIME,Time,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
0,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2005,2005,,,0,Units,,,41.906664,,
1,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2006,2006,,,0,Units,,,40.365519,,
2,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2007,2007,,,0,Units,,,41.081022,,
3,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2008,2008,,,0,Units,,,41.352529,,
4,KOR,Korea,MW,All persons,GWG9,Gender wage gap at 9th decile (top),2009,2009,,,0,Units,,,40.141383,,


In [117]:
# 2) Structuring the dataset by dropping columns that we do not need
drop_these = ['COUNTRY','SEX','Sex','Series','TIME','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_gg.drop(drop_these, axis=1, inplace=True)

# Renaming columns
df_gg = df_gg.rename(columns = {"SERIES" : "Type", "Value": "Gender_gap", "Time": "Year"})
df_gg.head(5)

Unnamed: 0,Country,Type,Year,Gender_gap
0,Korea,GWG9,2005,41.906664
1,Korea,GWG9,2006,40.365519
2,Korea,GWG9,2007,41.081022
3,Korea,GWG9,2008,41.352529
4,Korea,GWG9,2009,40.141383


In [118]:
# 3) Keeping only the gender gap at the median 
I = df_gg.Type.str.contains('GWG5')
df_gg = df_gg.loc[I].copy()  # copy so the in-place edits below act on an independent frame

# Reset old index
df_gg.reset_index(inplace = True, drop = True) # Drop old index too
df_gg.drop('Type', axis=1, inplace=True)

# Sort
df_gg = df_gg.sort_values(['Country','Year'])

# Reset old index
df_gg.reset_index(inplace = True, drop = True) # Drop old index too
df_gg.head(5)

Unnamed: 0,Country,Year,Gender_gap
0,Australia,2005,15.777778
1,Australia,2006,16.666667
2,Australia,2007,15.4
3,Australia,2008,11.937378
4,Australia,2009,16.363636
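
Note that `str.contains('GWG5')` keeps every series code that contains that substring. If only the median series is wanted, an exact comparison is slightly stricter. The following is a minimal sketch on a toy frame, not the notebook's own cell: the Korea `GWG9` and Australia `GWG5` values echo the tables above, while the other two rows (including the `GWG1` code) are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the renamed gender gap data.
# Rows 2 and 4 are hypothetical and only serve to show the filter.
toy = pd.DataFrame({
    "Country": ["Korea", "Korea", "Australia", "Australia"],
    "Type": ["GWG9", "GWG5", "GWG5", "GWG1"],
    "Year": [2005, 2005, 2005, 2005],
    "Gender_gap": [41.9, 39.6, 15.8, 12.0],
})

# Exact match on the series code, then drop the now constant column,
# sort, and rebuild the index in one chain
median_gap = (
    toy.loc[toy["Type"] == "GWG5"]
       .drop(columns="Type")
       .sort_values(["Country", "Year"])
       .reset_index(drop=True)
)
print(median_gap)
```

Chaining the steps returns a new frame each time, so no `inplace=True` calls are needed.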


In [119]:
# 4) Country specific dataset
countries = ['Denmark', 'United States', 'Sweden','Germany','United Kingdom','OECD countries']
df_c_g = df_gg[df_gg['Country'].isin(countries)]
df_c_g = df_c_g.round(1)

# Reset old index
df_c_g.reset_index(inplace = True, drop = True) # Drop old index too
df_c_g

Unnamed: 0,Country,Year,Gender_gap
0,Denmark,2005,10.2
1,Denmark,2006,10.2
2,Denmark,2007,9.9
3,Denmark,2008,10.2
4,Denmark,2009,10.2
...,...,...,...
95,United States,2017,18.2
96,United States,2018,18.9
97,United States,2019,18.5
98,United States,2020,17.7


In [120]:
# 5) Time series plot of the Gender Gap
import plotly.graph_objects as go

fig = go.Figure()

for country, df_country in df_c_g.groupby('Country'):  # separate loop variable so df_c_g is not overwritten
    if country == 'OECD countries':  # dashed black line for the OECD average
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Gender_gap'], name=country, line=dict(color='black', dash='dash')))
    else:
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Gender_gap'], name=country))

fig.update_layout(title='Development in Gender Gap over time',
                  xaxis_title='Year',
                  yaxis_title='Pct.')

fig.show()

The gender gap has generally been decreasing over the considered time period across all selected OECD-countries. On average, the gender gap has decreased from around 16 pct. in 2005 to approximately 12 pct. in 2021. The gender gap is generally lower in Denmark and Sweden and higher in the United States and the United Kingdom. In Germany, the gender gap is above the OECD-average over the entire period. 
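
The change over the period can also be read off numerically rather than from the figure. The sketch below assumes a long-format frame like `df_c_g`; the helper `gap_change` is a hypothetical name, and the Denmark 2020 value is made up for illustration (the other three values are rounded from the tables above).

```python
import pandas as pd

# Illustrative input; in the notebook this would be df_c_g.
# The Denmark 2020 value is hypothetical.
toy = pd.DataFrame({
    "Country": ["Denmark", "Denmark", "United States", "United States"],
    "Year": [2005, 2020, 2005, 2020],
    "Gender_gap": [10.2, 5.0, 19.0, 17.7],
})

def gap_change(df):
    """First and last observed gap per country and the change in pct. points."""
    ordered = df.sort_values(["Country", "Year"])
    summary = ordered.groupby("Country")["Gender_gap"].agg(
        first_gap="first", last_gap="last"
    )
    summary["change"] = (summary["last_gap"] - summary["first_gap"]).round(1)
    return summary

change = gap_change(toy)
print(change)
```

A negative `change` means that the gap narrowed between the first and the last observed year for that country.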

In [121]:
# 6) Reshaping the data frame to wide format in order to create an overview for all countries
df_pivot1 = df_gg.pivot(index='Year', columns='Country', values='Gender_gap')
df_pivot1.columns.name = None
df_pivot1.reset_index(inplace=True)
df_pivot1.head(5)

Unnamed: 0,Year,Australia,Austria,Belgium,Bulgaria,Canada,Chile,Colombia,Costa Rica,Croatia,...,Portugal,Romania,Slovak Republic,Slovenia,Spain,Sweden,Switzerland,Türkiye,United Kingdom,United States
0,2005,15.777778,22.032086,11.511093,,21.25,,,,,...,,,20.099256,,,11.304348,,,22.063958,18.975069
1,2006,16.666667,21.85513,10.253268,,21.123321,5.555556,,,,...,12.808461,8.396947,17.802768,6.557377,13.533333,11.016949,21.318132,3.206997,21.718084,19.246299
2,2007,15.4,21.634276,9.873618,,20.808235,,6.756667,,,...,,,17.521705,,,11.836735,,,21.634781,19.843342
3,2008,11.937378,20.91732,8.915754,,20.454545,,10.105758,,,...,,,16.44477,,,10.588235,21.307506,,21.857869,20.050125
4,2009,16.363636,19.355966,7.480461,,20.131868,9.090909,10.072222,,,...,,,16.432721,,,9.541985,,,20.686296,19.78022


In [122]:
# 7) Creating an interactive time series figure with drop down menu for each country

def _plot_timeseries(df_pivot1, country, years):
    
    fig = plt.figure(dpi=100)
    ax = fig.add_subplot(1,1,1)
    
    I = (df_pivot1['Year'] >= years[0]) & (df_pivot1['Year'] <= years[1])
    
    x = df_pivot1.loc[I,'Year']
    y = df_pivot1.loc[I, country]
    ax.plot(x,y)
    
    ax.set_xticks(list(range(years[0], years[1] + 1, 5)))
    ax.set_xlabel('Year')
    ax.set_ylabel('Gender Gap')
    ax.set_title(country)    
    
def plot_timeseries(df_pivot1):
    
    country_options = list(df_pivot1.columns[1:])
    
    widgets.interact(
        _plot_timeseries,
        df_pivot1=widgets.fixed(df_pivot1),
        country=widgets.Dropdown(
            description='Country',
            options=country_options,
            value=country_options[0]),
        years=widgets.IntRangeSlider(
            description="Years",
            min=df_pivot1['Year'].min(),
            max=df_pivot1['Year'].max(),
            value=[df_pivot1['Year'].min(), df_pivot1['Year'].max()],
            continuous_update=False,
        ),
    )

In [123]:
plot_timeseries(df_pivot1)

interactive(children=(Dropdown(description='Country', options=('Australia', 'Austria', 'Belgium', 'Bulgaria', …

In [124]:
# 8) 2020 dataset for later use
df_gg_2020 = df_gg.loc[df_gg['Year']==2020]
df_gg_2020 = df_gg_2020.sort_values(['Country','Year'])
df_gg_2020.reset_index(inplace = True, drop = True) # Drop old index too

# c. Net child care costs
1) We display the uncleaned net child care costs for all countries in the OECD-data set
2) We drop irrelevant columns and rename
3) We clean the data-set and focus on the selection of OECD countries as in section b)
4) We display the net child care costs for each selected country
5) We make an interactive plot that illustrates the development in net child care costs over time

In our inaugural project, we added a disutility of working in the labor market for women in order to make the model predictions fit the data. We interpreted this as norms. According to Kleven et al. (2019), only female labor supply is affected by children, i.e., the child penalty only strikes women, whereas having children is basically a non-event for fathers. We therefore incorporated norms in the model based on the assumption that part of home production is related to child care. A lack of access to child care in society could potentially augment the child penalty, as this could force women to stay at home when they have children (or at least work less). We investigate this hypothesis by plotting the net child care costs (as pct. of household income) across our selection of OECD-countries, as we would expect child care costs to be higher, all else equal, when access to formal child care is low. 

In [125]:
# 1) Displaying the uncleaned data set of net child care costs
df_nc.head(5)

Unnamed: 0,LOCATION,Country,TYPE,Type of indicator,COMPONENTS,Net childcares cost by item,FAMILY,Family type,EARNINGS,Earnings of the first adult,SATOPUPS,Include social assistance benefits,HBTOPUPS,Include housing benefits,TIME,Year,Value,Flag Codes,Flags
0,AUS,Australia,0,National currency,1,Gross childcare fees,SINGLE2C,Single person with 2 children,MIN,Minimum Wage,1,Yes,1,Yes,2004,2004,17472,,
1,AUS,Australia,0,National currency,1,Gross childcare fees,SINGLE2C,Single person with 2 children,MIN,Minimum Wage,1,Yes,1,Yes,2008,2008,21632,,
2,AUS,Australia,0,National currency,1,Gross childcare fees,SINGLE2C,Single person with 2 children,MIN,Minimum Wage,1,Yes,1,Yes,2012,2012,28353,,
3,AUS,Australia,0,National currency,1,Gross childcare fees,SINGLE2C,Single person with 2 children,MIN,Minimum Wage,1,Yes,1,Yes,2015,2015,33280,,
4,AUS,Australia,0,National currency,1,Gross childcare fees,SINGLE2C,Single person with 2 children,MIN,Minimum Wage,1,Yes,1,Yes,2018,2018,38272,,


In [126]:
# 2) Dropping columns that we do not need for the analysis
drop_these = ['LOCATION','Type of indicator','Net childcares cost by item','Family type','Earnings of the first adult','HBTOPUPS','TIME','Flag Codes','Flags']
df_nc.drop(drop_these, axis=1, inplace=True)

#Renaming columns
df_nc = df_nc.rename(columns = {"TYPE" : "Type", "Value": "Cost", "Time": "Year"})

In [127]:
# 3) Getting one observation per year per country. 
# We consider net child care costs as a percentage of net household income for a two-earner family
# (FAMILY code '2EARNERC2C_67AW' with EARNINGS code '67AW'). 
# We include social assistance benefits and housing benefits
I = (
    (df_nc['Type'] == 1)
    & (df_nc['FAMILY'] == '2EARNERC2C_67AW')
    & (df_nc['Include social assistance benefits'] == 'Yes')
    & (df_nc['Include housing benefits'] == 'Yes')
    & (df_nc['COMPONENTS'] == 5)
    & (df_nc['EARNINGS'] == '67AW')
)

df_nc = df_nc.loc[I].copy()  # copy so the in-place drops below act on an independent frame

# Resetting the old index
df_nc.reset_index(inplace = True, drop = True) # Drop old index too
df_nc.drop('Type', axis=1, inplace=True)
df_nc.drop('FAMILY', axis=1, inplace=True)

# Sorting
df_nc = df_nc.sort_values(['Country','Year'])

# Resetting the old index
df_nc.reset_index(inplace = True, drop = True) # Drop old index too

In [128]:
# Keeping only relevant countries
countries_2 = ['Denmark', 'United States', 'Sweden', 'Germany','United Kingdom','OECD - Total']
df_c_nc = df_nc[df_nc['Country'].isin(countries_2)]
df_c_nc = df_c_nc.round(1)

In [129]:
# 4) Displaying cleaned data by a pivot table 
df_pivot2 = df_c_nc.pivot(index='Year', columns='Country', values='Cost')
df_pivot2.head()

Year,Denmark,Germany,OECD - Total,Sweden,United Kingdom,United States
2004,12,6,15,9,26,31
2008,12,13,13,6,22,31
2012,12,11,13,5,23,30
2015,11,6,12,5,25,27
2018,11,5,11,5,25,29


In [130]:
# 5) Creating an interactive plot of the development in net child care costs over time
fig = go.Figure()

for country, df_country in df_c_nc.groupby('Country'):  # separate loop variable so df_c_nc is not overwritten
    if country == 'OECD - Total':  # dashed black line for the OECD total
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Cost'], name=country, line=dict(color='black', dash='dash')))
    else:
        fig.add_trace(go.Scatter(x=df_country['Year'], y=df_country['Cost'], name=country))

fig.update_layout(title='Net costs for a household with two children using childcare',
                  xaxis_title='Year',
                  yaxis_title='Pct. of household income')

fig.show()

As expected, net costs for a household with two children using childcare are generally higher in the UK and the US, where the gender gap is also higher. Interestingly, the costs in Denmark are generally higher than in Sweden and Germany, even though the gender gap is smallest in Denmark. This may be due to differences in the data and in how countries determine and calculate costs. 
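
The ranking described above can be checked with a simple average over the years. The sketch below rebuilds only the five rows shown in the pivot table output (2004 to 2018), so it is an illustration on a subset of years and not a replacement for `df_pivot2`.

```python
import pandas as pd

# Values copied from the pivot table output above (pct. of household income)
costs = pd.DataFrame(
    {
        "Denmark": [12, 12, 12, 11, 11],
        "Germany": [6, 13, 11, 6, 5],
        "OECD - Total": [15, 13, 13, 12, 11],
        "Sweden": [9, 6, 5, 5, 5],
        "United Kingdom": [26, 22, 23, 25, 25],
        "United States": [31, 31, 30, 27, 29],
    },
    index=pd.Index([2004, 2008, 2012, 2015, 2018], name="Year"),
)

# Average cost per country over the listed years, highest first
mean_cost = costs.mean().sort_values(ascending=False).round(1)
print(mean_cost)
```

On these rows the United States and the United Kingdom come out on top, and Denmark lies above Germany and Sweden but below the OECD total.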

# d. Parental leave, access to formal childcare and gender gap

1) We read the OECD-data set for parental leave across countries, drop irrelevant columns and rename. We keep only observations for 2020 
2) We do the same for the formal child care (proportion of children aged 0-2 in formal childcare)
3) We merge the two data-sets and add the gender gap data for 2020 as well 
4) We create a bubble diagram that shows the access to formal child care and the number of weeks fathers can take parental leave. The bubble size indicates the size of the gender gap


The following section aims to show the connection between parental leave, access to formal childcare (proportion of children aged 0-2 enrolled in childcare) and the gender gap across all OECD-countries in 2020. 

In [131]:
# 1) Cleaning the parental leave dataset 
drop_these = ['COU','Indicator','SEX','Sex','AGE','Age Group','TIME','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_pl.drop(drop_these, axis=1, inplace=True)

# Renaming columns
df_pl = df_pl.rename(columns = {"IND" : "Type", "Value": "Father_leave", "Time": "Year"})


In [132]:
# Keep only EMP18_PAT variable
I = df_pl.Type.str.contains('EMP18_PAT')
df_pl = df_pl.loc[I].copy()  # copy so the in-place edits below act on an independent frame

# Reset old index
df_pl.reset_index(inplace = True, drop = True) # Drop old index too
df_pl.drop('Type', axis=1, inplace=True)

# Sort
df_pl = df_pl.sort_values(['Country','Year'])

# Reset old index
df_pl.reset_index(inplace = True, drop = True) # Drop old index too

In [133]:
# 2020-data
df_pl_2020 = df_pl.loc[df_pl['Year']==2020]
df_pl_2020 = df_pl_2020.sort_values(['Country','Year'])
df_pl_2020.reset_index(inplace = True, drop = True) # Drop old index too


In [134]:
# 2) Cleaning the family dataset (source of the formal child care indicator)
drop_these = ['COU','Indicator','SEX','Sex','YEAR','Unit','Unit Code','PowerCode Code','PowerCode','Reference Period Code','Reference Period','Flag Codes','Flags']
df_f.drop(drop_these, axis=1, inplace=True)

In [135]:
# Keeping FAM13 (pct. of the aged 0-2 years old that are in formal childcare)
I = df_f.IND.str.contains('FAM13')
df_fam13 = df_f.loc[I]
df_fam13 = df_fam13.rename(columns = {"IND" : "Type", "Value": "Formal_child_care"})

# Reset old index
df_fam13.reset_index(inplace = True, drop = True) # Drop old index too
df_fam13.drop("Type", axis=1, inplace=True)

# Sort
df_fam13 = df_fam13.sort_values(['Country','Year'])

# Reset old index
df_fam13.reset_index(inplace = True, drop = True) # Drop old index too

In [136]:
# 2020-data
df_fam13_2020 = df_fam13.loc[df_fam13['Year']==2020]
df_fam13_2020 = df_fam13_2020.sort_values(['Country','Year'])
df_fam13_2020.reset_index(inplace = True, drop = True) # Drop old index too

In [137]:
# 3) Merging the data-set
df_merged = pd.merge(df_gg_2020, df_pl_2020, on=['Country','Year'], how='outer')
df_merged = pd.merge(df_merged, df_fam13_2020, on=['Country','Year'], how='outer')
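
An outer merge keeps countries that appear in only one of the data sets, and their missing values become `NaN`. One way to see which rows are affected is the `indicator` argument of `pd.merge`. This is a small sketch on toy frames: the Australia and Austria values echo the merged table in this notebook, while `Exampleland` and its value are hypothetical.

```python
import pandas as pd

# Toy 2020 frames; "Exampleland" is a made-up country for illustration
gap = pd.DataFrame({
    "Country": ["Australia", "Austria"],
    "Year": [2020, 2020],
    "Gender_gap": [10.5, 12.4],
})
leave = pd.DataFrame({
    "Country": ["Australia", "Exampleland"],
    "Year": [2020, 2020],
    "Father_leave": [2.0, 4.0],
})

# indicator=True adds a "_merge" column: both, left_only or right_only
merged = pd.merge(gap, leave, on=["Country", "Year"], how="outer", indicator=True)
print(merged[["Country", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones that end up with missing values after the merge.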

In [138]:
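# Drop observations where gender gap is NaN
# Note: fillna(0) also sets missing Father_leave and Formal_child_care values to 0,
# so zeros in these columns may represent missing data rather than true zeros
df_merged = df_merged.round(1)
df_merged.fillna(0, inplace=True)
index_to_drop = df_merged[df_merged['Gender_gap'] == 0].index
df_merged = df_merged.drop(index=index_to_drop)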

# Displaying the merged data set
df_merged.head()

Unnamed: 0,Country,Year,Gender_gap,Father_leave,Formal_child_care
0,Australia,2020,10.5,2.0,44.9
1,Austria,2020,12.4,13.0,20.2
2,Belgium,2020,1.2,19.3,56.9
3,Bulgaria,2020,2.6,0.0,15.0
4,Canada,2020,16.1,5.0,0.0
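
An alternative that leaves missing values in the other columns untouched is to drop rows on the gender gap column only. The sketch below is not the notebook's own cell: the toy frame uses the Australia and Canada values from the table above, Canada's child care value is set to missing here as an assumption for illustration, and `Exampleland` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy merged frame; the NaN for Canada and the "Exampleland" row are assumptions
toy = pd.DataFrame({
    "Country": ["Australia", "Canada", "Exampleland"],
    "Year": [2020, 2020, 2020],
    "Gender_gap": [10.5, 16.1, np.nan],
    "Father_leave": [2.0, 5.0, 3.0],
    "Formal_child_care": [44.9, np.nan, 30.0],
})

# Drop only rows where the gender gap itself is missing;
# NaN in the other columns is kept instead of being turned into 0
cleaned = toy.dropna(subset=["Gender_gap"]).reset_index(drop=True)
print(cleaned)
```

With this approach a country with a true gender gap of zero would be kept, and missing leave or child care values would not be plotted at the origin of an axis.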


In [139]:
# 4) Making the bubble plot

highlighted_countries = ['Denmark', 'Sweden', 'Germany','United Kingdom','United States']

# Create a color map for the markers
marker_colors = df_merged['Country'].apply(lambda country: 'red' if country in highlighted_countries else 'blue')

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df_merged['Father_leave'],
    y=df_merged['Formal_child_care'],
    mode='markers',
    marker=dict(
        size=df_merged['Gender_gap'],
        sizemode='area',
        sizeref=0.1,
        sizemin=5,
        color=marker_colors,  # Set the marker color based on the color map
        opacity=0.7,  # Set the opacity to make the markers semi-transparent
        line=dict(width=0.5, color='white'),  # Add a white border around the markers
    ),
    text=df_merged['Country'],
))

# Set the plot layout
fig.update_layout(
    title='Paternity leave and formal child care by country (2020)',
    xaxis_title='Paternity leave (Weeks)',
    yaxis_title='Formal child care (Pct.)',
)

# Displaying the plot in an interactive window
fig.show()

The bubble plot shows the connection between paternity leave, access to formal childcare and the gender gap. The bubble size indicates the size of the gender gap. We would therefore expect the bubbles to be relatively larger where access to formal childcare and paternity leave are relatively lower. It seems that access to formal child care and paternity leave are important drivers for lowering the gender gap, but there appear to be some odd outliers (France, Korea and Japan). For example, in Korea the gender gap is large even though access to formal child care is high and fathers have good access to paternity leave. This may reflect problems with the validity of the OECD-data and differences in how countries determine and calculate these measures. 
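
A visual impression like this can be complemented with pairwise correlations between the three variables. The sketch below uses only the five rows shown in the merged table above, which is far too few for any conclusion and serves purely to show the call. Zeros in that table may stand for missing values because of the earlier `fillna(0)` step.

```python
import pandas as pd

# The five rows displayed in the merged table above
toy = pd.DataFrame({
    "Country": ["Australia", "Austria", "Belgium", "Bulgaria", "Canada"],
    "Gender_gap": [10.5, 12.4, 1.2, 2.6, 16.1],
    "Father_leave": [2.0, 13.0, 19.3, 0.0, 5.0],
    "Formal_child_care": [44.9, 20.2, 56.9, 15.0, 0.0],
})

# Pairwise Pearson correlations between the three numeric columns
corr = toy[["Gender_gap", "Father_leave", "Formal_child_care"]].corr()
print(corr.round(2))
```

In the notebook the same call could be applied to `df_merged`. A correlation only describes co-movement in the cross-section and does not identify a causal driver.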

# Conclusion

Overall, the average gender gap and net child care costs for all OECD countries have been decreasing during the period 2005-2020. Our data analysis has mainly focused on a selection of OECD countries, for which we find that the selected countries with relatively high net child care costs, all else equal, have a higher gender gap. Our further hypothesis was that a lack of access to formal childcare and shorter paternity leave are "forcing" women to work less and thereby increasing the gender gap. The hypothesis cannot be verified by this analysis, and there appear to be some major outliers, but for most OECD-countries these drivers seem to be of importance for gender inequality. 