# Wage inequality between men and women

In this project we analyze the wages of men and women. 

We use data from Danmarks Statistikbank on the wages of men and women in the years 2013-2021.

Dataset **wagemen.xlsx** shows the wages for men in different age groups. 
Dataset **wagewomen.xlsx** shows the wages for women in different age groups. 


Imports and set magics:

In [48]:
# Importing modules
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})
# Note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

import ipywidgets as widgets
from matplotlib_venn import venn2

import pydst
dst = pydst.Dst(lang='en')  

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

We import the two datasets, wagemen and wagewomen, manually.

We first import and clean the dataset with men's wages and then do the same for the dataset with women's wages. 

In [49]:
# Loading dataset for men's wages and skipping the first two rows
wagemen = pd.read_excel('wagemen.xlsx', skiprows=2)

# Dropping 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3' 'Unnamed: 4'
drop_these = ['Unnamed: ' + str(num) for num in range(5)] 
wagemen.drop(drop_these, axis=1, inplace=True)
print(drop_these)

# Renaming variables
wagemen.rename(columns = {'Unnamed: 5':'age_intervals'}, inplace=True)

# Renaming the columns
col_dict = {}
for i in range(2013, 2021+1): # range covers 2013 up to and including 2021
    col_dict[str(i)] = f'wagemen{i}' 
col_dict
wagemen.rename(columns = col_dict, inplace=True)

# Showing the cleaned dataset
wagemen

['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']


Unnamed: 0,age_intervals,wagemen2013,wagemen2014,wagemen2015,wagemen2016,wagemen2017,wagemen2018,wagemen2019,wagemen2020,wagemen2021
0,Alder i alt,309.21,312.75,315.84,318.39,324.98,334.3,340.88,346.28,352.15
1,Under 20 år,116.27,119.26,122.7,122.71,124.81,127.1,128.59,131.26,139.71
2,20-24 år,179.52,182.77,186.73,189.26,194.14,200.92,202.93,205.53,209.94
3,25-29 år,238.14,241.58,244.36,248.56,254.57,263.53,269.17,273.53,279.24
4,30-34 år,284.7,286.74,288.26,291.61,298.79,307.44,314.42,320.42,327.12
5,35-39 år,318.2,320.5,323.47,326.31,333.2,342.0,347.92,352.83,358.49
6,40-44 år,337.04,341.27,346.18,349.73,356.58,366.24,372.72,377.34,383.26
7,45-49 år,343.65,347.78,352.39,357.63,367.18,377.95,386.38,392.18,398.22
8,50-54 år,336.81,341.22,347.06,353.69,362.64,374.81,385.04,390.88,398.58
9,55-59 år,327.33,332.27,336.7,342.28,350.0,361.74,371.4,379.9,388.47


In [50]:
# We now change the dataset from wide format to long format
wagemen_long = pd.wide_to_long(wagemen, stubnames='wagemen', i='age_intervals', j='year')
wagemen_long.head(10)

age_intervals,year,wagemen
Alder i alt,2013,309.21
Under 20 år,2013,116.27
20-24 år,2013,179.52
25-29 år,2013,238.14
30-34 år,2013,284.7
35-39 år,2013,318.2
40-44 år,2013,337.04
45-49 år,2013,343.65
50-54 år,2013,336.81
55-59 år,2013,327.33
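
The reshape above can be illustrated on a tiny table. The following is a minimal sketch that uses two age groups and two years copied from the men's table; the variable names `wide` and `long_df` are only for illustration and are not part of the project code.

```python
import pandas as pd

# Two age groups and two years, copied from the men's table above
wide = pd.DataFrame({
    "age_intervals": ["Alder i alt", "Under 20 år"],
    "wagemen2013": [309.21, 116.27],
    "wagemen2014": [312.75, 119.26],
})

# stubnames is the shared column prefix; the numeric suffix becomes the 'year' index level
long_df = pd.wide_to_long(wide, stubnames="wagemen", i="age_intervals", j="year")

# reset_index turns the (age_intervals, year) MultiIndex back into ordinary columns
long_df = long_df.reset_index()
print(long_df)
```

Each (age group, year) pair ends up on its own row, which is the layout the plotting code relies on when it filters on `age_intervals` and plots `year` on the x-axis.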


In [51]:
# Loading dataset for women's wages and skipping the first two rows
wagewomen = pd.read_excel('wagewomen.xlsx', skiprows=2)

# Dropping 'Unnamed: 0' 'Unnamed: 1' 'Unnamed: 2' 'Unnamed: 3' 'Unnamed: 4'
drop_these = ['Unnamed: ' + str(num) for num in range(5)] 
wagewomen.drop(drop_these, axis=1, inplace=True)
print(drop_these)

# Renaming variables
wagewomen.rename(columns = {'Unnamed: 5':'age_intervals'}, inplace=True)

# Renaming the columns
col_dict = {}
for i in range(2013, 2021+1): # range covers 2013 up to and including 2021
    col_dict[str(i)] = f'wagewomen{i}' 
col_dict
wagewomen.rename(columns = col_dict, inplace=True)

# Showing the cleaned dataset
wagewomen

['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']


Unnamed: 0,age_intervals,wagewomen2013,wagewomen2014,wagewomen2015,wagewomen2016,wagewomen2017,wagewomen2018,wagewomen2019,wagewomen2020,wagewomen2021
0,Alder i alt,270.72,275.58,280.45,283.6,290.11,297.94,304.03,312.18,316.83
1,Under 20 år,117.71,120.68,123.34,122.21,124.53,128.04,131.86,135.8,143.56
2,20-24 år,162.92,166.21,168.62,171.49,175.59,180.75,183.43,188.22,191.8
3,25-29 år,229.47,233.84,238.45,241.88,247.55,253.33,258.32,265.24,269.96
4,30-34 år,266.14,271.25,276.03,278.93,285.57,291.77,297.66,306.5,312.26
5,35-39 år,279.79,284.76,290.93,295.52,302.59,309.95,315.8,324.53,329.08
6,40-44 år,285.87,291.24,297.68,302.51,309.8,318.1,324.72,333.97,339.84
7,45-49 år,288.3,293.39,299.32,304.88,313.24,321.94,329.77,338.43,344.87
8,50-54 år,284.18,289.11,294.85,300.68,309.67,320.18,327.92,336.76,343.19
9,55-59 år,282.09,286.76,291.17,296.24,303.28,311.93,319.44,329.06,335.66


In [52]:
# We now change the dataset from wide format to long format
wagewomen_long = pd.wide_to_long(wagewomen, stubnames='wagewomen', i='age_intervals', j='year')
wagewomen_long.head(10)

age_intervals,year,wagewomen
Alder i alt,2013,270.72
Under 20 år,2013,117.71
20-24 år,2013,162.92
25-29 år,2013,229.47
30-34 år,2013,266.14
35-39 år,2013,279.79
40-44 år,2013,285.87
45-49 år,2013,288.3
50-54 år,2013,284.18
55-59 år,2013,282.09
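
The cleaning cells for men and women repeat the same steps, so they could be collected in one function. The sketch below is one possible way to do that; the function name `clean_wage_table` and its parameters are hypothetical, and the small `raw` frame only mimics the column layout that `pd.read_excel(..., skiprows=2)` is assumed to return.

```python
import pandas as pd

def clean_wage_table(raw, prefix, first_year=2013, last_year=2021):
    """Drop the empty 'Unnamed: 0..4' columns, name the age column
    and prefix the year columns (hypothetical helper)."""
    drop_these = [f"Unnamed: {num}" for num in range(5)]
    df = raw.drop(columns=drop_these)
    df = df.rename(columns={"Unnamed: 5": "age_intervals"})
    # Year columns are assumed to be strings, as in the cells above
    col_dict = {str(year): f"{prefix}{year}" for year in range(first_year, last_year + 1)}
    return df.rename(columns=col_dict)

# Stand-in for the raw Excel sheet (assumed layout, two years only)
raw = pd.DataFrame({
    **{f"Unnamed: {num}": [None, None] for num in range(5)},
    "Unnamed: 5": ["Alder i alt", "Under 20 år"],
    "2013": [309.21, 116.27],
    "2014": [312.75, 119.26],
})

cleaned = clean_wage_table(raw, prefix="wagemen")
print(cleaned.columns.tolist())
```

With such a helper, the same call could be used with `prefix="wagewomen"` for the women's sheet, which keeps the two datasets cleaned in exactly the same way.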


## Explore each data set

To **explore the raw data**, we use **interactive plots** that show how wages develop over time within each age group.

**Interactive plot for men** :

In [53]:
wagemen_long = wagemen_long.reset_index()
wagemen_long.loc[wagemen_long.age_intervals == 'Alder i alt', :]

Unnamed: 0,age_intervals,year,wagemen
0,Alder i alt,2013,309.21
11,Alder i alt,2014,312.75
22,Alder i alt,2015,315.84
33,Alder i alt,2016,318.39
44,Alder i alt,2017,324.98
55,Alder i alt,2018,334.3
66,Alder i alt,2019,340.88
77,Alder i alt,2020,346.28
88,Alder i alt,2021,352.15


In [59]:
import ipywidgets as widgets
def plot_men(df, age_intervals): 
    I = df['age_intervals'] == age_intervals
    ax=df.loc[I,:].plot(x='year', y='wagemen', style='-o', legend=False)

In [60]:
widgets.interact(plot_men, 
    df = widgets.fixed(wagemen_long),
    age_intervals = widgets.Dropdown(description='age_intervals', 
                                    options=wagemen_long.age_intervals.unique(), 
                                    value='Alder i alt')
); 

interactive(children=(Dropdown(description='age_intervals', options=('Alder i alt', 'Under 20 år', '20-24 år',…

**Interactive plot for women** :

In [56]:
wagewomen_long = wagewomen_long.reset_index()
wagewomen_long.loc[wagewomen_long.age_intervals == 'Alder i alt', :]

Unnamed: 0,age_intervals,year,wagewomen
0,Alder i alt,2013,270.72
11,Alder i alt,2014,275.58
22,Alder i alt,2015,280.45
33,Alder i alt,2016,283.6
44,Alder i alt,2017,290.11
55,Alder i alt,2018,297.94
66,Alder i alt,2019,304.03
77,Alder i alt,2020,312.18
88,Alder i alt,2021,316.83


In [61]:
import ipywidgets as widgets
def plot_women(df, age_intervals): 
    I = df['age_intervals'] == age_intervals
    ax=df.loc[I,:].plot(x='year', y='wagewomen', style='-o', legend=False)

In [62]:
widgets.interact(plot_women, 
    df = widgets.fixed(wagewomen_long),
    age_intervals = widgets.Dropdown(description='age_intervals', 
                                    options=wagewomen_long.age_intervals.unique(), 
                                    value='Alder i alt')
); 

interactive(children=(Dropdown(description='age_intervals', options=('Alder i alt', 'Under 20 år', '20-24 år',…

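Choosing an age group in the dropdown redraws the plot for that group. For all ages combined ('Alder i alt'), wages rise every year from 2013 to 2021 for both men (309.21 to 352.15) and women (270.72 to 316.83). 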

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
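
For the two wage tables, a merge on age group and year could look like the sketch below. It is an illustration under assumptions: the small `men` and `women` frames stand in for `wagemen_long` and `wagewomen_long` (after `reset_index()`), and the name `wages` for the merged result is not used elsewhere in the notebook.

```python
import pandas as pd

# Small stand-ins for wagemen_long and wagewomen_long (values from the tables above)
men = pd.DataFrame({
    "age_intervals": ["Alder i alt", "Alder i alt", "Under 20 år"],
    "year": [2013, 2014, 2013],
    "wagemen": [309.21, 312.75, 116.27],
})
women = pd.DataFrame({
    "age_intervals": ["Alder i alt", "Alder i alt", "Under 20 år"],
    "year": [2013, 2014, 2013],
    "wagewomen": [270.72, 275.58, 117.71],
})

# Outer merge with an indicator column shows which rows lack a match;
# validate raises an error if a key combination appears more than once on either side
merged = pd.merge(men, women, on=["age_intervals", "year"],
                  how="outer", validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())

# Keeping only the rows found in both tables is equivalent to an inner merge
wages = merged.loc[merged["_merge"] == "both"].drop(columns="_merge")
print(wages.shape)
```

Checking `_merge` before dropping rows makes it explicit which observations, if any, are thrown away.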

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
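
One possible aggregation is the wage gap between men and women per age group and year. The sketch below assumes a merged table with the columns `wagemen` and `wagewomen`. It uses only the 'Alder i alt' rows for 2013 and 2021 from the tables above. Measuring the gap as a share of men's wage is an assumption made for illustration, not a definition taken from the data source.

```python
import pandas as pd

# 'Alder i alt' rows for the first and last year, copied from the tables above
wages = pd.DataFrame({
    "age_intervals": ["Alder i alt", "Alder i alt"],
    "year": [2013, 2021],
    "wagemen": [309.21, 352.15],
    "wagewomen": [270.72, 316.83],
})

# Absolute gap, in the same unit as the wage columns
wages["gap"] = wages["wagemen"] - wages["wagewomen"]

# Relative gap in percent of men's wage (assumed definition)
wages["gap_pct"] = 100 * wages["gap"] / wages["wagemen"]

print(wages.round(2))
```

On the full merged table, the same two lines would give the gap for every age group and year, which could then be summarized with `wages.groupby("age_intervals")["gap_pct"].describe()`.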

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.