### Purpose

Decompose changes in Australian employment levels over five-year intervals into the contributions of:
* the Australian born
* migrants who arrived in the last five years, and
* migrants who were already in Australia five years ago.

A five-year period is used because ABS labour force data reports the overseas-born employed by time since arrival in five-year intervals.

### Data

The LM7 datacube from [ABS 6291.0.55.001 (Labour Force, Detailed)](http://www.abs.gov.au/ausstats/abs@.nsf/mf/6291.0.55.001) contains monthly data on the numbers of:
* employed full-time
* employed part-time
* unemployed looked for full-time work
* unemployed looked for only part-time work
* not in the labour force (nilf)

by
* the number of years since arrival for migrants (in 5 year intervals up to 20 years, and > 20 years)
* gender
* place of birth (Australia, main English-speaking countries, other than main English-speaking countries and 'Not Stated / Inadequately Described / Born at sea')
* state

This analysis uses additional derived data from this dataset. The loaded dataframe includes:
* employed_total (sum of full-, and part-time employed)
* labor_force (sum of employed_total and unemployed)
* population (sum of labor_force and nilf)
* COB (where place of birth is mapped to 'Australia', 'overseas' and 'unknown')
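
As a minimal sketch of how the derived columns relate to the source columns, the toy frame below applies the same sums and the place-of-birth mapping. The values are invented for illustration, and `toy` and `cob_map` are hypothetical names rather than objects from this notebook.

```python
import pandas as pd

# Toy rows in the LM7 layout; all values ('000) are invented for illustration
toy = pd.DataFrame({
    'MESC': ['Australia (includes External Territories)',
             'Main English-speaking countries',
             'Not Stated / Inadequately Described / Born at sea'],
    'employed_full_time': [100.0, 20.0, 0.0],
    'employed_part_time': [40.0, 5.0, 0.0],
    'unemployed_looked_full_time': [6.0, 1.0, 0.0],
    'unemployed_looked_part_time_only': [4.0, 0.5, 0.0],
    'nilf': [50.0, 8.0, 2.0],
})

# Hypothetical mapping from place of birth to the three COB groups
cob_map = {
    'Australia (includes External Territories)': 'Australia',
    'Main English-speaking countries': 'overseas',
    'Other than main English-speaking countries': 'overseas',
    'Not Stated / Inadequately Described / Born at sea': 'unknown',
}

toy = toy.assign(
    # employed_total = full-time + part-time employed
    employed_total=lambda x: x.employed_full_time + x.employed_part_time,
    # labor_force = employed_total + both unemployed categories
    labor_force=lambda x: (x.employed_total
                           + x.unemployed_looked_full_time
                           + x.unemployed_looked_part_time_only),
    # population = labor_force + not in the labour force
    population=lambda x: x.labor_force + x.nilf,
    COB=lambda x: x.MESC.map(cob_map),
)

print(toy[['COB', 'employed_total', 'labor_force', 'population']])
```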


### Deriving Australian born and migrant contributions to 5 year employment changes

The contributions of the Australian born, recent arrivals and established migrants can be derived as follows:

\begin{align}
\Delta E_5 & = E_{t} - E_{t-5} \\
& = (aus\_born_{t} + migrant_{t}) - (aus\_born_{t-5} + migrant_{t-5}) \\
& = (aus\_born_{t} - aus\_born_{t-5}) + (migrant_{t} - migrant_{t-5}) \\
& = \Delta aus\_born_5 + \Delta migrant_5
\end{align}

where, in any given month $t$, $E$ is the total number employed, $aus\_born$ is the number of Australian born employed, and $migrant$ is the number of overseas born employed.

$t-5$ refers to the month 5 years prior to the month $t$.

The change in employment levels of migrants, $\Delta migrant_5$, can be separated into additions from recent arrivals and changes in employment levels for established migrants.

\begin{equation}
\Delta migrant_5 = migrant_{arrived\_in\_last\_5\_years} + \Delta migrant_{arrived\_more\_than\_5\_years\_ago}
\end{equation}

That is, the decomposition of changes in 5 yearly total employment levels is:

\begin{equation}
\Delta E_5 = \Delta aus\_born_5 + migrant_{arrived\_in\_last\_5\_years} + \Delta migrant_{arrived\_more\_than\_5\_years\_ago}
\end{equation}

As the LM7 datacube contains the number of people employed by whether they are Australian born or overseas born, together with five-year arrival intervals for the overseas born, $\Delta migrant_{arrived\_more\_than\_5\_years\_ago}$ can be derived by substitution:

\begin{equation}
\Delta migrant_{arrived\_more\_than\_5\_years\_ago} = \Delta E_5 - \Delta aus\_born_5 - migrant_{arrived\_in\_last\_5\_years}
\end{equation}
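
As a quick worked instance of the substitution, the sketch below plugs in the five-year changes to June 2018 that appear in the level-terms table later in this notebook. The variable names are chosen here for illustration only.

```python
# Five-year changes to June 2018 in thousands, copied from the
# "Contribution in level terms" output later in this notebook
delta_total = 1116.82       # change in total employment
delta_aus_born = 474.71     # change in Australian born employed
recent_arrivals = 614.21    # employed migrants who arrived in the last 5 years

# Established migrants are the residual of the identity
delta_established = delta_total - delta_aus_born - recent_arrivals

print(round(delta_established, 2))
```

The residual comes out at roughly 27.9 thousand, which agrees with the `arrived_more_than_5_years` value reported for 2018-06-30.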

### Calculations

#### Required libraries

In [1]:
import pandas as pd

from pathlib import Path


#### Get LM7 data

In [2]:
# Set the file path to the folder containing the data

# Assume the LM7 dataset is in the same folder as this notebook
data_folder = Path('.')

# The path below matches my own folder structure for data sets and can be ignored
# data_folder = Path(f'{Path.home()}/Documents/Analysis/Australian economy/Data/ABS')


In [3]:
# The data in sheet 'Data 1' of the LM7 datacube has been extracted, the additional items defined above
# (such as employed_total, etc.) calculated, and the result stored as a parquet file
# This statement reads in that parquet file
# You could do this directly from the LM7 workbook with pd.read_excel(): see the last cell in this notebook for an example

df = pd.read_parquet(data_folder / 'LM7.parquet')

df.tail(3)

date,sex,MESC,elapsed_years_since_arrival,state,employed_full_time,employed_part_time,unemployed_looked_full_time,unemployed_looked_part_time_only,nilf,COB,labor_force,employed_total,population
2018-09-30,Females,Not Stated / Inadequately Described / Born at sea,Not stated / Inadequately described / Born at sea,Tasmania,0.0,0.0,0.0,0.0,2.96,unknown,0.0,0.0,2.96
2018-09-30,Females,Not Stated / Inadequately Described / Born at sea,Not stated / Inadequately described / Born at sea,Northern Territory,0.0,0.0,0.0,0.0,0.56,unknown,0.0,0.0,0.56
2018-09-30,Females,Not Stated / Inadequately Described / Born at sea,Not stated / Inadequately described / Born at sea,Australian Capital Territory,0.0,0.0,0.0,0.0,1.44,unknown,0.0,0.0,1.44


In [4]:
# Check the data read in
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 69762 entries, 1991-01-31 to 2018-09-30
Data columns (total 13 columns):
sex                                 69762 non-null object
MESC                                69762 non-null object
elapsed_years_since_arrival         69762 non-null object
state                               69762 non-null object
employed_full_time                  69762 non-null float64
employed_part_time                  69762 non-null float64
unemployed_looked_full_time         69762 non-null float64
unemployed_looked_part_time_only    69762 non-null float64
nilf                                69762 non-null float64
COB                                 69762 non-null object
labor_force                         69762 non-null float64
employed_total                      69762 non-null float64
population                          69762 non-null float64
dtypes: float64(8), object(5)
memory usage: 7.5+ MB


In [5]:
# check no 'countries' missed
df.COB.unique()

array(['overseas', 'Australia', 'unknown'], dtype=object)

#### Australian born & migrant contribution to employment growth

In [6]:
def make_employed_by_duration(df, month=6):
    '''
    A function to extract employment levels for Aus. born, and OS born by time in Australia into a simple matrix from the LM7 datacube 
    
    Parameters:
    -----------
        df: the LM7 dataset (ie sheet: Data 1 from LM7 loaded in a dataframe)
        month: integer or None
            the month to use (e.g. 6 for financial years) if doing annual calculations; if None, return all months
        
    Returns
    -------
        employed: pandas dataframe
    '''
    
    # Remove unknown COB
    idx = df.MESC != 'Not Stated / Inadequately Described / Born at sea'  # or idx = df.COB != 'unknown'


    arrived_order = ['Born in Australia',
                     'Arrived within last 5 years',
                     'Arrived 5-9 years ago',
                     'Arrived 10-14 years ago',
                     'Arrived 15-19 years ago',
                     'Arrived 20 or more years ago',
                     'total'
                    ]


    employed = (df.loc[idx]
                  .groupby([df.loc[idx].index, 'elapsed_years_since_arrival'])['employed_total']
                  .sum()
                  .unstack('elapsed_years_since_arrival')
                  .drop(columns=['Not stated / Inadequately described / Born at sea'])
                  .sort_index(axis=1, ascending=False)
                  .reindex(labels=arrived_order, axis='columns')
                  .assign(total = lambda x: x.sum(axis='columns'))
                  .rename_axis(None, axis='columns')
        )

    if month is None:
        return employed
    else:
        idx = employed.index.month == month
        return employed[idx]



In [7]:
month = 6  # use 6 for analysis on an Australian financial year basis, 12 for calendar years, or any month that suits your analysis

employed = make_employed_by_duration(df, month=month)
employed.tail()

date,Born in Australia,Arrived within last 5 years,Arrived 5-9 years ago,Arrived 10-14 years ago,Arrived 15-19 years ago,Arrived 20 or more years ago,total
2014-06-30,8177.87,511.22,664.48,404.53,322.09,1487.54,11568
2015-06-30,8334.24,499.5,702.92,423.82,355.15,1454.74,11770
2016-06-30,8447.64,514.74,686.29,510.12,379.38,1463.92,12002
2017-06-30,8503.27,557.57,709.65,604.48,397.35,1489.18,12261
2018-06-30,8654.89,614.21,697.47,689.63,438.97,1522.47,12618
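
The reshaping inside `make_employed_by_duration` can be illustrated on a small made-up frame. This is only a sketch of the groupby / unstack / reindex pattern, with invented values and a hypothetical `toy` frame that has just two arrival categories and two states.

```python
import pandas as pd

# Two months of invented long-format data, indexed by date like the LM7 frame
dates = pd.to_datetime(['2018-05-31'] * 3 + ['2018-06-30'] * 3)
toy = pd.DataFrame(
    {
        'elapsed_years_since_arrival': ['Born in Australia',
                                        'Arrived within last 5 years',
                                        'Arrived within last 5 years'] * 2,
        'state': ['New South Wales', 'New South Wales', 'Victoria'] * 2,
        'employed_total': [100.0, 10.0, 5.0, 110.0, 12.0, 6.0],
    },
    index=pd.DatetimeIndex(dates, name='date'),
)

column_order = ['Born in Australia', 'Arrived within last 5 years']

wide = (toy
        # sum over states (and any other dimensions) within each date and category
        .groupby(['date', 'elapsed_years_since_arrival'])['employed_total']
        .sum()
        # one column per arrival category
        .unstack('elapsed_years_since_arrival')
        .reindex(columns=column_order)
        .assign(total=lambda x: x.sum(axis='columns'))
        .rename_axis(None, axis='columns'))

# Keep only June observations, as month=6 does in the function above
june = wide[wide.index.month == 6]
print(june)
```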


#### Contribution in level terms

In [8]:
# If the 'employed' dataframe holds one observation per year, set time_delta to 5 (years).
# If 'employed' holds monthly observations, set time_delta to 60 (months == 5 years).

if month is not None:
    # employed has annual data
    time_delta = 5
else:
    # employed has monthly data
    time_delta = 60

idx = ['Born in Australia', 'total' ]

delta = (employed[idx]
             .diff(time_delta)
        )



delta_order = ['Born in Australia',
               'Arrived within last 5 years',
               'arrived_more_than_5_years',
               'total'
              ]


delta = (pd
             .concat([delta, employed['Arrived within last 5 years']], axis='columns')
             .assign(arrived_more_than_5_years = lambda x: x.total - x['Born in Australia'] - x['Arrived within last 5 years'])
             .reindex(labels=delta_order, axis='columns')
        )
         

delta.tail()

date,Born in Australia,Arrived within last 5 years,arrived_more_than_5_years,total
2014-06-30,322.76,511.22,-32.3,801.68
2015-06-30,329.23,499.5,-64.72,764.02
2016-06-30,365.62,514.74,-96.06,784.3
2017-06-30,379.57,557.57,-34.56,902.58
2018-06-30,474.71,614.21,27.9,1116.82
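
The choice between `time_delta = 5` and `time_delta = 60` can be checked on synthetic data. The sketch below assumes a gap-free monthly series (an assumption, since `diff` counts rows rather than calendar time) and shows that differencing June-only rows by 5 gives the same result as differencing all months by 60 and then keeping June.

```python
import numpy as np
import pandas as pd

# Synthetic, gap-free month-end series from Jan 2008 to Dec 2018 (values invented)
index = pd.date_range('2008-01-01', periods=132, freq='MS') + pd.offsets.MonthEnd(0)
monthly = pd.Series(np.arange(len(index), dtype=float) ** 2, index=index)

is_june = monthly.index.month == 6

# Annual basis: keep June rows, then difference 5 rows (5 years)
annual_diff = monthly[is_june].diff(5)

# Monthly basis: difference 60 rows (60 months), then keep June rows
monthly_diff = monthly.diff(60)[is_june]

# True when both routes give identical values (NaN positions included)
print(annual_diff.equals(monthly_diff))
```

If any month were missing from the data, the two routes could diverge, so it may be worth confirming the index is complete before relying on row-based differences.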


#### Contribution in percentage terms

In [9]:
idx = ['Born in Australia', 'Arrived within last 5 years', 'arrived_more_than_5_years']

delta_share = (delta[idx]
                   .divide(delta.total, axis='rows') * 100
              )
       
(delta_share
     .dropna(axis='index', how='any')
     .round(0)
     .astype(int)
     .tail()
)

date,Born in Australia,Arrived within last 5 years,arrived_more_than_5_years
2014-06-30,40,64,-4
2015-06-30,43,65,-8
2016-06-30,47,66,-12
2017-06-30,42,62,-4
2018-06-30,43,55,2
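
Because the three components sum to the total change, their shares should sum to 100 before rounding. The sketch below rebuilds the last two rows of the level-terms table by hand (the frame name `delta_check` is hypothetical) and confirms this.

```python
import pandas as pd

# Last two rows of the level-terms table above, re-entered by hand ('000)
delta_check = pd.DataFrame(
    {
        'Born in Australia': [379.57, 474.71],
        'Arrived within last 5 years': [557.57, 614.21],
        'arrived_more_than_5_years': [-34.56, 27.90],
        'total': [902.58, 1116.82],
    },
    index=pd.to_datetime(['2017-06-30', '2018-06-30']),
)

parts = ['Born in Australia', 'Arrived within last 5 years', 'arrived_more_than_5_years']

# Each component as a percentage of the total five-year change
share = delta_check[parts].divide(delta_check['total'], axis='index') * 100

# Unrounded shares sum to 100 (up to floating point error)
print(share.sum(axis='columns'))

# Rounded shares, as displayed in the table above
print(share.round(0).astype(int))
```

Rounding each share independently does not guarantee that the displayed integers add to exactly 100 in every year, so small discrepancies in the rounded table are not necessarily errors.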


#### How to: read in LM7 as an Excel file

In [10]:
# label definitions

col_names = {'Month': 'date',
             'Sex': 'sex',
             'Main English-speaking countries': 'MESC',
             'Elapsed years since arrival': 'elapsed_years_since_arrival',
             'State and territory (STT): ASGS (2011)': 'state',
             "Employed full-time ('000)": 'employed_full_time',
             "Employed part-time ('000)": 'employed_part_time',
             "Unemployed looked for full-time work ('000)": 'unemployed_looked_full_time',
             "Unemployed looked for only part-time work ('000)": 'unemployed_looked_part_time_only',
             "Not in the labour force (NILF) ('000)": 'nilf',
             }

OSB = {'Main English-speaking countries': 'overseas',
       'Other than main English-speaking countries': 'overseas',
       'Australia (includes External Territories)': 'Australia',
       'Not Stated / Inadequately Described / Born at sea': 'unknown'
       }

idx_labor_force = ['employed_full_time', 'employed_part_time', 'unemployed_looked_full_time',
                   'unemployed_looked_part_time_only']

In [11]:
%%time
# get data
# The top 3 rows in the sheet 'Data 1' of the LM7 workbook should be unmerged; otherwise it will take several minutes to read in the data (as opposed to ~6s on my machine)

df = (pd
          .read_excel(data_folder / 'LM7.xlsx',
                      usecols='A:J',
                      sheet_name='Data 1',
                      skiprows=3,
                      parse_dates=[0], infer_datetime_format=True,
                      )
          .rename(columns=col_names)
          # derive additional variables
          .assign(date=lambda x: x.date + pd.offsets.MonthEnd(0))
          .assign(COB=lambda x: x.MESC.map(OSB))
          .assign(labor_force=lambda x: x[idx_labor_force].sum(axis=1))
          .assign(employed_total=lambda x: x.employed_full_time + x.employed_part_time)
          .assign(population=lambda x: x.nilf + x.labor_force)
          .set_index('date')
          )

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')
        
df.tail()

CPU times: user 6.6 s, sys: 114 ms, total: 6.71 s
Wall time: 6.33 s
