# Wage inequality between men and women

In this project we analyze wages for men and women and compare them with Danish GDP.

We use data from Danmarks Statistikbank, covering wages for men and women in the years 2013-2021 and GDP in the years 2010-2022.

Dataset **wagemen.xlsx** shows the wages for men in different age groups. 
Dataset **wagewomen.xlsx** shows the wages for women in different age groups. 
Furthermore, we load and clean the dataset for GDP in Denmark for the period 2010-2022, so that we can see whether the development in men's and women's wages follows the GDP trend.

**The structure of this data project:**
1. We import the three datasets from Excel sheets
2. We clean the three datasets
3. We explore the wage datasets using interactive figures
4. We merge the two wage datasets using a left join
5. We merge the resulting dataset with the GDP dataset using a left join
6. We create a summary statistics table for men's and women's wages
7. We create a figure of men's and women's wages and GDP for the period 2013-2021


Imports and set magics:

In [None]:
# Importing modules
import pandas as pd
import numpy as np

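import matplotlib.pyplot as plt
# Apply the style first so it does not overwrite the custom rcParams below.
# Matplotlib 3.6+ renamed the style to 'seaborn-v0_8-whitegrid'.
styles = plt.style.available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in styles else 'seaborn-whitegrid')
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":0.25,"grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})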

import ipywidgets as widgets
from matplotlib_venn import venn2

import pydst
dst = pydst.Dst(lang='en')  

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


# Read and clean data

We import the three datasets (wagemen, wagewomen and gdp) manually from Excel files.

We first import and clean the dataset for men's wages, and then do the same for women's wages and for GDP.

**Dataset for men's wages:**

In [None]:
# Loading dataset for men's wages and skipping the first two rows
wagemen = pd.read_excel('wagemen.xlsx', skiprows=2)

# Dropping 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3' 'Unnamed: 4'
drop_these = ['Unnamed: ' + str(num) for num in range(5)] 
wagemen.drop(drop_these, axis=1, inplace=True)
print(drop_these)

# Renaming the age column
wagemen.rename(columns = {'Unnamed: 5':'age_intervals'}, inplace=True)

# Renaming the year columns, e.g. '2013' becomes 'wagemen2013'
col_dict = {}
for i in range(2013, 2021+1): 
    col_dict[str(i)] = f'wagemen{i}' 
wagemen.rename(columns = col_dict, inplace=True)

# Showing the cleaned dataset
wagemen

In [None]:
# We now change the dataset from a wide dataset to a long dataset to be able to merge the datasets later on
wagemen_long = pd.wide_to_long(wagemen, stubnames='wagemen', i='age_intervals', j='year')
wagemen_long.head(10)
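
The wide-to-long reshape can be illustrated on a small example. The sketch below uses a made-up two-row table (the age labels and wage values are hypothetical, not taken from `wagemen.xlsx`) with the same column layout as the cleaned dataset.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout as the cleaned wage table
toy_wide = pd.DataFrame({
    'age_intervals': ['20-24 years', '25-29 years'],
    'wagemen2013': [100.0, 150.0],
    'wagemen2014': [105.0, 155.0],
})

# 'stubnames' is the shared column prefix; the numeric suffix after it
# becomes the 'year' level of the resulting index
toy_long = pd.wide_to_long(toy_wide, stubnames='wagemen',
                           i='age_intervals', j='year')
print(toy_long)
```

Each (age group, year) pair now has its own row, which is the shape needed to merge on `year` and `age_intervals` later.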

**Dataset for women's wages:**

In [None]:
# Loading dataset for women's wages and skipping the first two rows
wagewomen = pd.read_excel('wagewomen.xlsx', skiprows=2)

# Dropping 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3' 'Unnamed: 4'
drop_these = ['Unnamed: ' + str(num) for num in range(5)] 
wagewomen.drop(drop_these, axis=1, inplace=True)
print(drop_these)

# Renaming the age column
wagewomen.rename(columns = {'Unnamed: 5':'age_intervals'}, inplace=True)

# Renaming the year columns, e.g. '2013' becomes 'wagewomen2013'
col_dict = {}
for i in range(2013, 2021+1): 
    col_dict[str(i)] = f'wagewomen{i}' 
wagewomen.rename(columns = col_dict, inplace=True)

# Showing the cleaned dataset
wagewomen

In [None]:
# We now change the dataset from a wide dataset to a long dataset to be able to merge the datasets later on
wagewomen_long = pd.wide_to_long(wagewomen, stubnames='wagewomen', i='age_intervals', j='year')
wagewomen_long.head(10)
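
The cleaning steps for the two wage datasets differ only in the column prefix, so they could be collected in one function. The sketch below is one possible way to do that; `clean_wage` is a hypothetical helper (it is not part of the `dataproject` module), and the raw table is a made-up stand-in for the output of `pd.read_excel(..., skiprows=2)`.

```python
import pandas as pd

def clean_wage(raw, prefix, first_year=2013, last_year=2021):
    """Hypothetical helper: clean a raw wage sheet and return it in long format."""
    # Drop the empty leading columns 'Unnamed: 0' ... 'Unnamed: 4'
    df = raw.drop(columns=[f'Unnamed: {n}' for n in range(5)])
    # Name the age column and prefix every year column
    rename = {'Unnamed: 5': 'age_intervals'}
    rename.update({str(y): f'{prefix}{y}' for y in range(first_year, last_year + 1)})
    df = df.rename(columns=rename)
    # Reshape from one column per year to one row per (age group, year)
    return pd.wide_to_long(df, stubnames=prefix, i='age_intervals', j='year')

# Made-up stand-in for the raw Excel sheet (values are not real data)
raw = pd.DataFrame({
    **{f'Unnamed: {n}': [None, None] for n in range(5)},
    'Unnamed: 5': ['Alder i alt', '20-24 years'],
    '2013': [300.0, 200.0],
    '2014': [310.0, 205.0],
})

toy_long = clean_wage(raw, 'wagemen', first_year=2013, last_year=2014)
print(toy_long)
```

Calling the same function with `prefix='wagewomen'` would give the women's dataset, assuming both sheets share the layout used above.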

**Dataset for GDP:**

In [None]:
# Loading dataset for GDP and skipping the first two rows
gdp = pd.read_excel('GDP.xlsx', skiprows=2)

# Dropping 'Unnamed: 0' 
drop_these = ['Unnamed: ' + str(num) for num in range(1)] 
gdp.drop(drop_these, axis=1, inplace=True)
print(drop_these)

# Renaming the year columns, e.g. '2010' becomes 'gdp2010'
col_dict = {}
for i in range(2010, 2022+1):
    col_dict[str(i)] = f'gdp{i}' 
gdp.rename(columns = col_dict, inplace=True)

# Renaming the label column
gdp.rename(columns = {'Unnamed: 1':'GDP'}, inplace=True)

# Keeping only the first row
gdp = gdp.iloc[[0]]

# Showing the cleaned dataset
gdp

In [None]:
# We now change the dataset from a wide dataset to a long dataset to be able to merge the datasets later on
gdp_long = pd.wide_to_long(gdp, stubnames='gdp', i='GDP', j='year')
gdp_long.head(10)

## Exploring each dataset

Before the further analysis, we start off by plotting men's and women's wages for the years 2013 to 2021. We construct an interactive plot in which different age intervals can be selected.

**Interactive plot for men** :

In [None]:
# We start off by resetting the index and showing the rows for all ages ('Alder i alt')
wagemen_long = wagemen_long.reset_index()
wagemen_long.loc[wagemen_long.age_intervals == 'Alder i alt', :]

In [None]:
# Defining our function to construct the interactive plot
def plot_men(df, age_intervals): 
    I = df['age_intervals'] == age_intervals
    ax=df.loc[I,:].plot(x='year', y='wagemen', style='-o', legend=False)
    ax.xaxis.set_ticks(np.arange(2013, 2022, 1))
    ax.set_ylabel('Wage in million DKK')
    ax.set_title('Interactive plot for different age groups for men\'s wage')

In [None]:
# Plotting men's wages
widgets.interact(plot_men, 
    df = widgets.fixed(wagemen_long),
    age_intervals = widgets.Dropdown(description='Age groups', 
                                    options=wagemen_long.age_intervals.unique(), 
                                    value='Alder i alt')
); 

From the interactive plot above we see that most age groups follow an almost linear, increasing trend over the years. However, we notice that the age group "under 20 years" stagnates from 2015 to 2016. Furthermore, the age group "60 years and above" is rather flat from 2013 to 2016, but increases sharply afterwards until 2021.
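
Observations such as "stagnates" or "increases sharply" can also be checked numerically by computing year-over-year growth within each age group. The following is a minimal sketch on made-up numbers; the group labels `A` and `B` and all values are hypothetical, and only the column names mirror `wagemen_long`.

```python
import pandas as pd

# Hypothetical long-format data with the same columns as wagemen_long
toy = pd.DataFrame({
    'age_intervals': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2013, 2014, 2015] * 2,
    'wagemen': [100.0, 110.0, 110.0, 200.0, 210.0, 231.0],
})

# Sort so that pct_change compares each year with the previous one
toy = toy.sort_values(['age_intervals', 'year'])

# Percentage change from the previous year, computed separately per age group
toy['growth_pct'] = toy.groupby('age_intervals')['wagemen'].pct_change() * 100
print(toy)
```

A growth rate close to zero corresponds to a flat segment in the plot, and the first year in each group is `NaN` because it has no previous year to compare with.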

**Interactive plot for women** :

In [None]:
# We start off by resetting the index and showing the rows for all ages ('Alder i alt')
wagewomen_long = wagewomen_long.reset_index()
wagewomen_long.loc[wagewomen_long.age_intervals == 'Alder i alt', :]

In [None]:
# Defining our function to construct the interactive plot
def plot_women(df, age_intervals): 
    I = df['age_intervals'] == age_intervals
    ax=df.loc[I,:].plot(x='year', y='wagewomen', style='-o', legend=False)
    ax.xaxis.set_ticks(np.arange(2013, 2022, 1))
    ax.set_ylabel('Wage in million DKK')
    ax.set_title('Interactive plot for different age groups for women\'s wage')

In [None]:
# Plotting women's wages
widgets.interact(plot_women, 
    df = widgets.fixed(wagewomen_long),
    age_intervals = widgets.Dropdown(description='Age groups', 
                                    options=wagewomen_long.age_intervals.unique(), 
                                    value='Alder i alt')
); 

From the figure above we see that wages trend upward for all age groups. For the age group "under 20 years" we see a considerable drop in women's wages from 2015 to 2016.

# Merge data sets

We start off by merging the two wage datasets.

In [None]:
# Merging the two wage datasets over year and age intervals
mergedwage = pd.merge(wagewomen_long, wagemen_long, how='left', on=['year', 'age_intervals'])
mergedwage.head(10)

Next we add GDP. We use a left join with the merged wage dataset on the left, so that every wage observation is kept and only the GDP years that also appear in the wage data (2013-2021) are included.

In [None]:
# Merging the dataset above for men and women's wages and gdp
mergedall = pd.merge(mergedwage, gdp_long, how='left', on=['year'])
mergedall.head(10)

The left join keeps all observations from the merged wage dataset and takes only the matching years from the GDP dataset, so the GDP observations outside 2013-2021 are dropped.
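
The behaviour of the left join can be verified on a small example. The sketch below uses made-up frames (all values and the labels `A` and `B` are hypothetical) in which GDP covers one year more than the wage data. The `validate` and `indicator` arguments are optional extras that are not used in the merge above; they are included here only as a sanity check.

```python
import pandas as pd

# Hypothetical toy frames: wages cover 2013-2014, GDP covers 2012-2014
wages = pd.DataFrame({
    'year': [2013, 2013, 2014, 2014],
    'age_intervals': ['A', 'B', 'A', 'B'],
    'wagewomen': [1.0, 2.0, 1.1, 2.1],
})
gdp = pd.DataFrame({'year': [2012, 2013, 2014],
                    'gdp': [90.0, 100.0, 102.0]})

# validate='many_to_one' raises an error if a year occurs more than once in gdp;
# indicator=True adds a '_merge' column showing where each row found a match
merged = pd.merge(wages, gdp, how='left', on='year',
                  validate='many_to_one', indicator=True)
print(merged)
```

All four wage rows survive, each with the GDP value of its year, while the GDP row for 2012 has no counterpart on the left and is therefore left out.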

# Analysis

We start off by creating a summary table of men's and women's wages in the given age intervals, where we look at the mean, standard deviation, minimum, maximum and the three quartiles (25%, 50% and 75%).

In [None]:
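# Selecting several columns after groupby requires a list (double brackets) in recent pandas versions
mergedall.groupby(['age_intervals'])[['wagemen', 'wagewomen']].describe().head(11)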

In the table above we see that the mean wage for men is lowest for the age group "under 20 years" and highest for the age group "45-49 years". However, the maximum value of men's wages is found in the age group "50-54 years".

For women, the highest and lowest means are found in the same age groups as for men, but the maximum value of women's wages is found in the age group "45-49 years".

In general, the mean wage is higher for men than for women in all age groups except "under 20 years".
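
The comparison of means can be made explicit by expressing women's mean wage as a percentage of men's in each age group. The sketch below shows the calculation on made-up numbers; the group labels and wage values are hypothetical, and only the column names mirror `mergedall`.

```python
import pandas as pd

# Hypothetical merged data with the same column names as mergedall
toy = pd.DataFrame({
    'age_intervals': ['A', 'A', 'B', 'B'],
    'year': [2013, 2014, 2013, 2014],
    'wagemen': [100.0, 100.0, 200.0, 200.0],
    'wagewomen': [110.0, 110.0, 150.0, 170.0],
})

# Mean wage per age group over all years
means = toy.groupby('age_intervals')[['wagemen', 'wagewomen']].mean()

# Women's mean wage as a percentage of men's; below 100 means men earn more
means['women_pct_of_men'] = means['wagewomen'] / means['wagemen'] * 100
print(means)
```

In this toy example group `A` lies above 100 and group `B` below, which is the same kind of pattern as the exception for "under 20 years" described above.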

In [None]:
# We create a figure
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

# We group by year and plot the mean wage across age groups for men and women
# (the columns are selected with a list, as required by recent pandas versions)
mergedall.groupby('year')[['wagemen', 'wagewomen']].mean().plot(ax=ax,style='-o')

# We set the labels and title
ax.xaxis.set_ticks(np.arange(2013, 2022, 1))
ax.set_ylabel('Wage in million DKK')
ax.set_title('Wage development for men & women');

# We add a second y-axis that shares the x-axis, so GDP can be shown on its own scale
ax2=ax.twinx()

# Creating the plot
ax2.plot(mergedall['year'], mergedall["gdp"],color="maroon",marker="o",label='GDP')
ax2.set_ylabel("GDP")
ax2.legend(loc='center left', bbox_to_anchor=(1.15, 0.63))
ax.legend(loc='center left', bbox_to_anchor=(1.15, 0.75))


Overall, the figure above shows an increasing trend for all three variables.

For the two wage variables we see that men's wages are higher than women's wages throughout the period, but that the gap is narrowing.

GDP decreases from 2019 to 2020, but we see that this has no direct impact on wages.
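
Because wages and GDP are measured on very different scales, a comparison can be easier after indexing each series to a base year or after converting it to growth rates. The sketch below shows both on made-up numbers (all values are hypothetical); it is a descriptive comparison only and does not test for a causal effect.

```python
import pandas as pd

# Hypothetical yearly series, one row per year
toy = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'wagemen': [400.0, 410.0, 420.0],
    'gdp': [2000.0, 1960.0, 2100.0],
}).set_index('year')

# Index every series to 100 in the first year so they share a common scale
indexed = toy / toy.iloc[0] * 100

# Year-over-year growth in percent for each series
growth = toy.pct_change() * 100

print(indexed)
print(growth)
```

In this toy example GDP falls by 2 percent in 2020 while wages still grow by 2.5 percent, which is the kind of pattern described for 2019 to 2020 above.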

# Conclusion

Through this project we have shown that the average wage for men is higher than the average wage for women, and that GDP does not appear to have any direct impact on wages in the period 2013 to 2021.