# **DATA PROJECT 2024: DO INTERNATIONAL WORKERS IN DENMARK REMEDY LABOR SHORTAGE?**

By Emma Knippel, Anna Abildsjov and Oscar Nyholm

# Table of contents
* [Setup](#toc0_)   

* [Introduction](#toc1_) 

* [Read and clean data](#toc2_)    

* [Exploring the data sets](#toc3_)    

* [Merging data sets of employment and international labor](#toc4_)   

* [Analysis](#toc5_) 

* [Conclusion](#toc6_) 

## <a id='toc0_'></a>[Setup](#toc0_)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2
import json

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2


In [2]:
# install the API reader that lets us load data from DST (Statistics Denmark)
%pip install git+https://github.com/alemartinello/dstapi
%pip install pandas-datareader

import pandas_datareader # install with `pip install pandas-datareader`
from dstapi import DstApi # install with `pip install git+https://github.com/alemartinello/dstapi`

import dataproject # our own module (dataproject.py) with the cleaning code

Collecting git+https://github.com/alemartinello/dstapi
  Cloning https://github.com/alemartinello/dstapi to /private/var/folders/24/czmv85dj1_3dcc2tc4x8kd0r0000gn/T/pip-req-build-nszwfjnt
  Running command git clone --quiet https://github.com/alemartinello/dstapi /private/var/folders/24/czmv85dj1_3dcc2tc4x8kd0r0000gn/T/pip-req-build-nszwfjnt
  Resolved https://github.com/alemartinello/dstapi to commit d9eeb5a82cbc70b7d63b2ff44d92632fd77123a4
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## <a id='toc1_'></a>[Introduction](#toc1_)

## <a id='toc2_'></a>[Read and clean data](#toc2_)

In [3]:
# importing the actual data from DST
employees = DstApi('LBESK03')
lb_short_service = DstApi('KBS2')
lb_short_manu = DstApi('BARO3')
lb_short_cons = DstApi('KBYG33')
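
The `DstApi` objects above only point at the tables. Data is then requested with a parameter dictionary that selects variables and values. Below is a minimal sketch of how such a dictionary might be built for January 2014 to January 2024. The helpers `dst_months` and `build_params` are hypothetical and not taken from `dataproject.py`, and the dictionary layout and the time variable code `Tid` are assumptions that should be checked against each table's metadata.

```python
def dst_months(first, last):
    """Return DST-style month codes such as '2014M01' from first to last, inclusive."""
    (year, month), (last_year, last_month) = first, last
    codes = []
    while (year, month) <= (last_year, last_month):
        codes.append(f"{year}M{month:02d}")
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return codes


def build_params(table, months, extra=None):
    """Assemble a request dictionary (assumed layout) for one table."""
    variables = [{"code": "Tid", "values": months}]  # 'Tid' is an assumed variable code
    # Any further selections, e.g. an industry code, are passed in by the caller
    variables += [{"code": code, "values": values} for code, values in (extra or {}).items()]
    return {"table": table, "format": "BULK", "lang": "en", "variables": variables}


months = dst_months((2014, 1), (2024, 1))
params = build_params("LBESK03", months)
print(len(months), months[0], months[-1])  # 121 2014M01 2024M01

# With dstapi installed, a call along the lines of
# employees.get_data(params=params) would then fetch the data (not run here).
```

The 121 month codes match the number of monthly observations reported for the cleaned employment and construction data sets.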

**Cleaning the data on international workers from JobIndsats**

In [4]:
int_labor = dataproject.clean_json_data()
int_labor.head(5)

Before cleaning, the JSON datafile from JobIndsats contains 1089 observations and 5 variables.
We have removed two columns and renamed the remaining.
The dataset now contains 1089 observations and 3 variables.
All our observations are of type: <class 'str'>. We want them to be integers.
The observations are now of type: <class 'numpy.float64'> and the first observation is: 2.184
The observations are now of type: <class 'numpy.int64'> and the first observation is: 2184
We convert our time variable into datetime values.
We now reshape the DataFrame with the .pivot method, using time as the index, industries as columns and international labor as the values.
All our industries are in Danish, so we rename them to English.
For our dataset to match the data from DST, we sum over all industries to get the total and combine four of the industries so that they match.
Lastly, we drop the industries that we have just combined to make new ones.
The cleaned dataset now contains 8 columns (indu

| time | hotels_restaurents | information_communictaion | cleaning_etc | transport | research_consultancy | total | finance_real_estate | culture_leisure_other |
|---|---|---|---|---|---|---|---|---|
| 2014M01 | 17609 | 6657 | 26549 | 14777 | 9039 | 86158 | 4684 | 6843 |
| 2014M02 | 17957 | 6815 | 26792 | 14732 | 9137 | 86993 | 4690 | 6870 |
| 2014M03 | 18481 | 7001 | 27667 | 14975 | 9292 | 89550 | 4841 | 7293 |
| 2014M04 | 19209 | 7136 | 28547 | 15177 | 9611 | 92029 | 4928 | 7421 |
| 2014M05 | 19909 | 7342 | 29802 | 15380 | 9539 | 94728 | 5103 | 7653 |
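
The cleaning steps printed above (strings to integers, time codes to datetimes, pivot, total) can be sketched on a toy data set. This is not the code in `dataproject.clean_json_data()`; the column names and the assumption that the raw file stores numbers as strings with `.` as thousands separator are ours. Removing the separator directly, rather than going through a float, is one way to avoid problems with values below 1000.

```python
import pandas as pd

# Toy records in long format; the numbers mimic the first rows shown above,
# written as strings with '.' as thousands separator (an assumption).
raw = pd.DataFrame({
    "time": ["2014M01", "2014M01", "2014M02", "2014M02"],
    "industry": ["transport", "cleaning_etc", "transport", "cleaning_etc"],
    "int_labor": ["14.777", "26.549", "14.732", "26.792"],
})

# Strip the thousands separator and convert to integers
raw["int_labor"] = raw["int_labor"].str.replace(".", "", regex=False).astype(int)

# Convert month codes such as '2014M01' to datetimes
raw["time"] = pd.to_datetime(raw["time"], format="%YM%m")

# Reshape: time as index, industries as columns, international labor as values
wide = raw.pivot(index="time", columns="industry", values="int_labor")

# Sum across industries to get a total column
wide["total"] = wide.sum(axis=1)
print(wide)
```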


In [5]:
empl = dataproject.clean_dst_empl(employees)
empl.head(5)

Since we have extracted all the data from the source on DST, we need to select only the variables that are relevant for our analysis.
For the employment data, we first define our parameters so that we get data only from January 2014 to January 2024 and only for the total across industries.
Then we retrieve the data for the parameters we defined into our DataFrame, drop the industry column since we do not need to split the data by industry, and rename the columns to simple English titles.
The cleaned dataset contains 2 columns and 121 observations.


| | time | employees |
|---|---|---|
| 0 | 2014M01 | 2561675 |
| 1 | 2014M02 | 2563945 |
| 2 | 2014M03 | 2566733 |
| 3 | 2014M04 | 2569268 |
| 4 | 2014M05 | 2570962 |


In [6]:
lab_short_service = dataproject.clean_dst_shortage1(lb_short_service)
lab_short_service.head(5)

Again, as for all the DST data, we need to select only the variables that are relevant for our analysis.
For the labor shortage data, we need to sort through the dataset a bit more when defining our variables:
We need to specify which industries we want to get data from, since the dataset contains both broad and narrow categories.
Furthermore, we want data only for the labor shortage and only from January 2014 to January 2024.
We retrieve the parameters and sort the data by time and industry.
Then we drop the column TYPE, since we only have data for the labor shortage anyway; this column would otherwise split the data into different categories of production limitations.
We also drop the old index and reset it.
We rename the industry codes to industry names, so that they match the industries in the international labor data.
We convert the time variable into datetime variables.
The cleaned dataset contains 3 columns and 968 observations.


| | industry | time | labor_shortage |
|---|---|---|---|
| 0 | culture_leisure | 2014-01-01 | 6 |
| 1 | cleaning_etc | 2014-01-01 | 13 |
| 2 | information_communication | 2014-01-01 | 9 |
| 3 | research_consultancy | 2014-01-01 | 19 |
| 4 | finance_real_estate | 2014-01-01 | 0 |
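
Renaming industry codes and checking that every industry covers the same months can be sketched as follows. The codes `X1` and `X2`, the mapping and the values are made up for illustration; the real codes come from the DST table. Pivoting to wide format gives the same layout as the international labor data (time as index, industries as columns), which may make a later merge easier.

```python
import pandas as pd

# Toy long-format data with hypothetical industry codes and made-up values
long = pd.DataFrame({
    "industry": ["X1", "X2", "X1", "X2"],
    "time": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-02-01", "2014-02-01"]),
    "labor_shortage": [6, 13, 7, 12],
})

# Map codes to readable names (hypothetical mapping)
code_to_name = {"X1": "culture_leisure", "X2": "cleaning_etc"}
long["industry"] = long["industry"].map(code_to_name)

# Unmapped codes become NaN, so this catches gaps in the mapping
assert long["industry"].notna().all()

# Every industry should have the same number of observations
counts = long.groupby("industry").size()
print(counts)

# Wide layout: time as index, industries as columns
wide = long.pivot(index="time", columns="industry", values="labor_shortage")
print(wide)
```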


In [7]:
lab_short_manu = dataproject.clean_dst_shortage2(lb_short_manu)
lab_short_manu.head(5)

Again, as for all the DST data, we need to select only the variables that are relevant for our analysis.
We retrieve the parameters we defined into the DataFrame and sort the variables by time.
We then rename the columns to simple English titles and reset the index.
We drop the industry and type columns, since we only need data for the total industry.
Finally, we convert the time variable to datetime.
The cleaned dataset contains 2 columns and 41 observations.
The number of observations differs because manufacturing labor shortage data is only published once a quarter.


| | time | labor_shortage |
|---|---|---|
| 0 | 2014-01-01 | 1 |
| 1 | 2014-04-01 | 3 |
| 2 | 2014-07-01 | 2 |
| 3 | 2014-10-01 | 3 |
| 4 | 2015-01-01 | 2 |
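
Because the manufacturing series is quarterly while the other series are monthly, the frequencies have to be aligned before the series can be compared month by month. One possible approach, sketched here on the five rows shown above, is to carry each quarterly value forward to the following months. Forward-filling is our assumption, not something the notebook does; aggregating the monthly series to quarters would be an equally reasonable alternative.

```python
import pandas as pd

# The first five quarterly observations from the table above
quarterly = pd.Series(
    [1, 3, 2, 3, 2],
    index=pd.to_datetime(
        ["2014-01-01", "2014-04-01", "2014-07-01", "2014-10-01", "2015-01-01"]
    ),
    name="labor_shortage",
)

# Upsample to month-start frequency and carry the last published value forward
monthly = quarterly.resample("MS").ffill()
print(monthly)
```

Applied to the full 41 quarterly observations, the same step would give one value per month from January 2014 to January 2024.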


In [8]:
lab_short_cons = dataproject.clean_dst_shortage3(lb_short_cons)
lab_short_cons.head(5)

The method for cleaning this dataset is exactly the same as for the manufacturing sector.
The cleaned dataset contains 2 columns and 121 observations.


| | time | labor_shortage |
|---|---|---|
| 0 | 2014-01-01 | 3 |
| 1 | 2014-02-01 | 2 |
| 2 | 2014-03-01 | 3 |
| 3 | 2014-04-01 | 3 |
| 4 | 2014-05-01 | 2 |


## <a id='toc3_'></a>[Exploring the data sets](#toc3_)

To explore the raw data, we provide **static** and **interactive plots** that show important developments.

Static plot of:
- labor shortage in the 3 sectors, January 2014 to January 2024

In [9]:
# Example: converting DST-style month codes such as '2014M01' to datetimes
df = pd.DataFrame({'Time': ['2014M01', '2014M02', '2014M03']})
df['Time'] = pd.to_datetime(df['Time'], format='%YM%m')

# The cleaned data sets use lowercase column names ('time', 'labor_shortage').
# Their time columns were already converted during cleaning, so the
# conversions below are only a safeguard.
for frame in (lab_short_service, lab_short_manu, lab_short_cons):
    frame['time'] = pd.to_datetime(frame['time'])
    frame['labor_shortage'] = pd.to_numeric(frame['labor_shortage'])

# The service data has one row per industry and month, so we collapse it to a
# single series. Using the unweighted mean across industries is an assumption.
service_avg = lab_short_service.groupby('time')['labor_shortage'].mean()

plt.plot(service_avg.index, service_avg.values, label='Service sector (mean across industries)')
plt.plot(lab_short_manu['time'], lab_short_manu['labor_shortage'], label='Manufacturing sector')
plt.plot(lab_short_cons['time'], lab_short_cons['labor_shortage'], label='Construction sector')

plt.xlabel('Time')
plt.ylabel('Labor Shortage')
plt.title('Labor Shortage across Sectors (January 2014 - January 2024)')
plt.legend()

plt.show()

**Interactive plot**:

Here we want to show the development in the number of international workers in each of the service industries, selected with a drop-down.

In [None]:
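def plot_func(industry):
    # Plot the number of international workers over time for the chosen industry
    ax = int_labor[industry].plot(figsize=(8, 4))
    ax.set_xlabel('Time')
    ax.set_ylabel('International workers')
    ax.set_title(f'International workers: {industry}')
    plt.show()

# Let the widget interact with the data through plot_func() via a drop-down
widgets.interact(plot_func,
    industry=widgets.Dropdown(options=list(int_labor.columns), description='Industry'),
);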


Explain what you see when moving elements of the interactive plot around. 

## <a id='toc5_'></a>[Analysis](#toc5_)

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
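
As a sketch of what such summary statistics could look like, the example below aggregates labor shortage by industry. The data frame is a toy stand-in with made-up values and hypothetical industry names, not the project data; the same `groupby` call could be applied to a cleaned data set with the columns `industry`, `time` and `labor_shortage`.

```python
import pandas as pd

# Toy stand-in for a cleaned long-format data set (made-up values)
toy = pd.DataFrame({
    "industry": ["a", "a", "a", "b", "b", "b"],
    "time": pd.to_datetime(["2014-01-01", "2014-02-01", "2014-03-01"] * 2),
    "labor_shortage": [6, 8, 10, 13, 12, 14],
})

# Mean, minimum and maximum labor shortage per industry
summary = toy.groupby("industry")["labor_shortage"].agg(["mean", "min", "max"])
print(summary)
```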

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

## <a id='toc6_'></a>[Conclusion](#toc6_)

ADD CONCISE CONCLUSION.