## Investigating Data Distribution

In this notebook, we quantify how severely imbalanced the data distribution is. This has important implications for model training and for the theoretical upper limit on model performance.

#### Set options and import data

In [3]:
import os
os.chdir('..')

In [75]:
import pandas as pd
pd.options.display.max_rows = 100

In [5]:
data = pd.read_csv('Data/Processed/Training/train-aws/train_full.csv')
SSOC_2020 = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')

In [106]:
mcf_final = pd.read_csv('Data/Processed/MCF_Training_Set_Full.csv', low_memory = False)
original_ssocs = pd.read_csv('Data/Processed/Original SSOCs.csv', low_memory = False)

#### Quick rehash

1D SSOC breakdown:

* 1: Legislators, Senior Officials and Managers
* 2: Professionals
* 3: Associate Professionals and Technicians
* 4: Clerical Support Workers
* 5: Service and Sales Workers
* 6: Agricultural and Fishery Workers
* 7: Craftsmen and Related Trades Workers
* 8: Plant and Machine Operators and Assemblers
* 9: Cleaners, Labourers and Related Workers

#### Basic data investigation

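The distribution of 1D SSOCs is highly skewed towards groups 1, 2, and 3 (managers, professionals, and associate professionals), which together account for about 77% of records. Group 6 (agricultural and fishery workers) is almost non-existent, with only 7 records.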

In [16]:
data['SSOC 2020'].astype('str').str.slice(0, 1).value_counts()

2    15882
3     9306
1     7807
4     4039
5     2054
8     1643
9     1103
7     1001
6        7
Name: SSOC 2020, dtype: int64
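
To put a number on the skew, the counts printed above can be turned into shares and a normalised entropy. This is a minimal sketch that uses only the counts shown in the output; the names `counts_1d`, `shares`, and `normalised_entropy` are illustrative and not part of the original notebook. A normalised entropy of 1 would mean perfectly balanced groups, and values closer to 0 mean stronger concentration.

```python
import math

# 1D SSOC counts copied from the value_counts() output above
counts_1d = {"2": 15882, "3": 9306, "1": 7807, "4": 4039, "5": 2054,
             "8": 1643, "9": 1103, "7": 1001, "6": 7}

total = sum(counts_1d.values())
shares = {code: n / total for code, n in counts_1d.items()}

# Share of records in groups 1, 2 and 3 (the three largest groups)
top3_share = sum(shares[c] for c in ("1", "2", "3"))

# Ratio between the largest and the smallest group
imbalance_ratio = max(counts_1d.values()) / min(counts_1d.values())

# Shannon entropy divided by its maximum, log(number of groups)
entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
normalised_entropy = entropy / math.log(len(counts_1d))

print(f"Total records: {total}")
print(f"Share of groups 1-3: {top3_share:.1%}")
print(f"Largest / smallest group: {imbalance_ratio:.0f}x")
print(f"Normalised entropy: {normalised_entropy:.3f}")
```

The same calculation could be applied to the 5D counts, where the long tail of rare codes would be expected to pull the normalised entropy down further.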

Of the 5 most common SSOCs, 4 are catch-all "others" categories. These labels are not very informative for us.

* 21499: Other engineering professionals n.e.c
* 33499: Other administrative and related associate professionals n.e.c
* 13499: Other professional, financial, community, and social services managers n.e.c
* 25121: Software developer
* 41109: Other administrative clerks (e.g. public relations clerk)
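
The "others" check can be made mechanical by flagging descriptions that start with "Other" or contain "n.e.c". The sketch below uses only the five code and description pairs listed above. The `is_residual` helper is hypothetical, and its matching rule is an assumption that may miss catch-all categories worded differently.

```python
# Top 5 SSOCs and their descriptions, as listed above
top5 = {
    "21499": "Other engineering professionals n.e.c",
    "33499": "Other administrative and related associate professionals n.e.c",
    "13499": "Other professional, financial, community, and social services managers n.e.c",
    "25121": "Software developer",
    "41109": "Other administrative clerks (e.g. public relations clerk)",
}


def is_residual(description: str) -> bool:
    """Heuristic (assumption): 'Other ...' and 'n.e.c' titles are catch-all categories."""
    text = description.lower()
    return text.startswith("other") or "n.e.c" in text


residual_codes = [code for code, desc in top5.items() if is_residual(desc)]
print(f"{len(residual_codes)} of {len(top5)} top SSOCs are residual: {residual_codes}")
```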

In [67]:
data['SSOC 2020'].astype('str').str.slice(0, 5).value_counts().sort_values(ascending = False).head(10)

21499    3720
33499    2621
13499    1899
25121    1654
41109    1512
12211    1490
24212    1273
33299    1001
25190     898
43112     841
Name: SSOC 2020, dtype: int64

There are 583 SSOCs that do not appear at all among the predicted labels in our current dataset. This is more than half of the available SSOCs, which will be a problem when fine-tuning the model, since it will never see an example of those classes.

In [120]:
SSOC_2020_counts = SSOC_2020.merge(data.groupby('SSOC 2020').count()['Cleaned_Description'].reset_index(),
                                   on = 'SSOC 2020',
                                   how = 'left').fillna(0)
SSOC_2020_counts.columns = ['SSOC 2020', 'Description', 'Predicted_Count']

In [129]:
print(f"Number of SSOCs with 0 counts: {len(SSOC_2020_counts[SSOC_2020_counts['Predicted_Count'] == 0])}")
print(f"Number of SSOCs with <10 counts: {len(SSOC_2020_counts[SSOC_2020_counts['Predicted_Count'] < 10])}")

Number of SSOCs with 0 counts: 583
Number of SSOCs with <10 counts: 730
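
These counts are easier to compare when expressed as a share of the full SSOC list. The sketch below shows the idea on a small made-up example. The `coverage_summary` function, the toy codes, and the default threshold are all hypothetical, and the toy lists stand in for the real `SSOC_2020` and `data` frames.

```python
from collections import Counter


def coverage_summary(catalogue, observed, threshold=10):
    """Summarise how many catalogue codes are unseen or rare in the observed labels."""
    counts = Counter(observed)
    codes = list(dict.fromkeys(catalogue))  # de-duplicate while keeping order
    n_zero = sum(1 for c in codes if counts[c] == 0)
    # As in the notebook, "rare" also includes codes with zero occurrences
    n_rare = sum(1 for c in codes if counts[c] < threshold)
    return {
        "n_codes": len(codes),
        "n_zero": n_zero,
        "n_rare": n_rare,
        "share_zero": n_zero / len(codes),
        "share_rare": n_rare / len(codes),
    }


# Toy data (made-up codes): four catalogue codes, two of which never occur
toy_catalogue = [10001, 10002, 10003, 10004]
toy_observed = [10001] * 12 + [10002] * 3

summary = coverage_summary(toy_catalogue, toy_observed)
print(summary)
```

Because the function only iterates over its inputs, passing the `'SSOC 2020'` columns of `SSOC_2020` and `data` should also work, assuming both columns hold codes of the same type.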


Unfortunately, the reported (declared) SSOCs are only slightly better covered: 464 SSOCs still do not appear at all.

In [123]:
#original_ssocs = original_ssocs[~original_ssocs['Reported_SSOC_2020'].str.contains('X')]
original_ssocs['Reported_SSOC_2020'] = original_ssocs['Reported_SSOC_2020'].astype('int64')
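# Attach the job ad ID by position (assumes data and mcf_final share the same row order and index),
# then left-join the reported SSOCs on that ID
data_w_original = pd.concat([data, mcf_final[['MCF_Job_Ad_ID']]], axis = 1).merge(original_ssocs,
                                                                                  on = 'MCF_Job_Ad_ID',
                                                                                  how = 'left')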

In [124]:
SSOC_2020_counts = SSOC_2020_counts.merge(data_w_original.groupby('Reported_SSOC_2020').count()['Cleaned_Description'].reset_index(),
                                          left_on = 'SSOC 2020',
                                          right_on = 'Reported_SSOC_2020',
                                          how = 'left').fillna(0)
SSOC_2020_counts.drop('Reported_SSOC_2020', axis = 1, inplace = True)
SSOC_2020_counts.columns = ['SSOC 2020', 'Description', 'Predicted_Count', 'Reported_Count']

In [128]:
print(f"Number of SSOCs with 0 counts: {len(SSOC_2020_counts[SSOC_2020_counts['Reported_Count'] == 0])}")
print(f"Number of SSOCs with <10 counts: {len(SSOC_2020_counts[SSOC_2020_counts['Reported_Count'] < 10])}")

Number of SSOCs with 0 counts: 464
Number of SSOCs with <10 counts: 718


Comparing the top 10 reported SSOCs with the top 10 predicted SSOCs shows a clear similarity: 8 of the 10 codes appear in both lists, although the counts and the ordering differ.

In [143]:
data_w_original['Reported_SSOC_2020'].astype('str').str.slice(0, 5).value_counts().sort_values(ascending = False).head(10)

21499    2045
33499    1820
25121    1538
13499    1189
41109    1142
12211    1100
29090    1095
39990    1049
33299     936
24212     865
Name: Reported_SSOC_2020, dtype: int64

In [145]:
data['SSOC 2020'].astype('str').str.slice(0, 5).value_counts().sort_values(ascending = False).head(10)

21499    3720
33499    2621
13499    1899
25121    1654
41109    1512
12211    1490
24212    1273
33299    1001
25190     898
43112     841
Name: SSOC 2020, dtype: int64
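
The overlap between the two top-10 lists can be checked directly. This is a small sketch that uses only the codes printed in the two outputs above; the variable names are illustrative.

```python
# Top-10 codes copied from the two outputs above, in printed order
reported_top10 = ["21499", "33499", "25121", "13499", "41109",
                  "12211", "29090", "39990", "33299", "24212"]
predicted_top10 = ["21499", "33499", "13499", "25121", "41109",
                   "12211", "24212", "33299", "25190", "43112"]

# Codes in both lists, and codes that appear in only one of them
shared = [c for c in predicted_top10 if c in reported_top10]
only_reported = [c for c in reported_top10 if c not in predicted_top10]
only_predicted = [c for c in predicted_top10 if c not in reported_top10]

print(f"Shared: {len(shared)} of 10")
print(f"Only in reported top 10: {only_reported}")
print(f"Only in predicted top 10: {only_predicted}")
```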