## Visualising SSOC 2020 Embeddings in 2D space

**Author:** Shaun Khoo  
**Date:** 3 Oct 2021  
**Context:** It is useful to see how "close" the SSOC 2020 descriptions are to one another, as a gauge of how hard they would be to separate in an embedding space. The closer they are, the lower our expectations for classification accuracy should be.  
**Objective:** Generate 2D embeddings for all SSOC 2020 codes to visualise in Tableau

#### A) Importing libraries and data

We prepare the SSOC definitions data by combining the "tasks" specified at the 4-digit (4D) SSOC level with the "detailed definitions" specified at the 5-digit (5D) SSOC level, giving one description per 5D code.

In [1]:
import spacy
import numpy as np
import pandas as pd

In [2]:
SSOC_Definitions = pd.read_excel('../Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)

  warn("""Cannot parse header or footer so it will be ignored""")


In [3]:
# 4-digit rows carry the task lists
SSOC_4D = SSOC_Definitions[SSOC_Definitions['SSOC 2020'].apply(len) == 4][['SSOC 2020', 'Tasks']]
SSOC_4D.columns = ['4D SSOC', 'Tasks']
# 5-digit rows carry the detailed definitions; codes containing 'X' are excluded
SSOC_5D = SSOC_Definitions[(SSOC_Definitions['SSOC 2020'].apply(len) == 5) & ~SSOC_Definitions['SSOC 2020'].str.contains('X')].reset_index(drop = True)
SSOC_5D['4D SSOC'] = SSOC_5D['SSOC 2020'].str.slice(0, 4)
SSOC_5D.drop('Tasks', axis = 1, inplace = True)
# Attach the parent 4D tasks to each 5D code, then join definition and tasks into one text
SSOC_Final = SSOC_5D.merge(SSOC_4D, how = 'left', on = '4D SSOC')
SSOC_Final['Description'] = SSOC_Final['Detailed Definitions'] + " " + SSOC_Final['Tasks']
data = SSOC_Final[['SSOC 2020', 'Description']]
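
Because the merge is a left join, a 5D code whose 4D parent has no `Tasks` row would get `NaN` tasks, and adding a string to `NaN` makes the whole `Description` `NaN`. The following is a minimal sketch of how this could be checked and handled; the rows are hypothetical toy data, not real SSOC entries, and falling back to the definition alone is an assumption about the desired behaviour.

```python
import pandas as pd

# Toy stand-ins for SSOC_5D and SSOC_4D (hypothetical rows, not real SSOC data).
ssoc_5d = pd.DataFrame({
    "SSOC 2020": ["11110", "11121", "99999"],
    "Detailed Definitions": ["Def A.", "Def B.", "Def C."],
})
ssoc_4d = pd.DataFrame({
    "4D SSOC": ["1111", "1112"],
    "Tasks": ["- task a", "- task b"],
})

ssoc_5d["4D SSOC"] = ssoc_5d["SSOC 2020"].str.slice(0, 4)
merged = ssoc_5d.merge(ssoc_4d, how="left", on="4D SSOC")

# Codes whose 4D parent had no Tasks row end up with NaN after the left join.
unmatched = merged.loc[merged["Tasks"].isna(), "SSOC 2020"].tolist()

# Fall back to the definition alone rather than letting NaN propagate.
merged["Description"] = (
    merged["Detailed Definitions"] + " " + merged["Tasks"].fillna("")
).str.strip()
```

If `unmatched` is empty on the real data, the original concatenation is already safe.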

In [None]:
SSOC_Final

Unnamed: 0,SSOC 2020,SSOC 2020 Title,Groups Classified Under this Code,Detailed Definitions,Notes,Examples of Job Classified Under this Code,Examples of Job Classified Elsewhere,4D SSOC,Tasks,Description
0,11110,Legislator,<Blank>,"Legislator determines, formulates and directs ...",<Blank>,• President (government)\n• Attorney general...,<Blank>,1111,- presiding over or participating in the proce...,"Legislator determines, formulates and directs ..."
1,11121,Senior government official,<Blank>,"Senior government official plans, organises an...",<Blank>,• Director-general\n• High commissioner (gov...,"• Commissioned police officer, see 33551",1112,- advising government and legislators on polic...,"Senior government official plans, organises an..."
2,11122,Senior statutory board official,<Blank>,"Senior statutory board official plans, organis...",Senior statutory board official may be designa...,• Chairman (statutory board)\n• Chief execut...,"• Chief executive (company), see 11201\n• Ex...",1112,- advising government and legislators on polic...,"Senior statutory board official plans, organis..."
3,11140,Senior official of political party organisation,<Blank>,Senior official of political party organisatio...,<Blank>,• Administrator of political party organisation,<Blank>,1114,"- determining and formulating the policies, ru...",Senior official of political party organisatio...
4,11150,"Senior official of employers', workers' and ot...",<Blank>,"Senior official of employers', workers' and ot...",<Blank>,• Administrator of business association\n• A...,<Blank>,1115,"- determining and formulating the policies, ru...","Senior official of employers', workers' and ot..."
...,...,...,...,...,...,...,...,...,...,...
992,96272,Concierge (hotel),<Blank>,Concierge (hotel) serves as the point of conta...,<Blank>,<Blank>,"• Hotel front office agent, see 42242\n• Apa...",9627,- co-ordinating and carrying out customers' re...,Concierge (hotel) serves as the point of conta...
993,96291,Leaflet and newspaper distributor/deliverer,<Blank>,Leaflet and newspaper distributor/deliverer ha...,<Blank>,• Newspaper delivery man,<Blank>,9629,- handing out leaflets and free newspapers at ...,Leaflet and newspaper distributor/deliverer ha...
994,96292,Meter reader/Vending-machine collector,<Blank>,Meter reader/Vending-machine collector reads e...,<Blank>,• Parking meter reader\n• Coin machine colle...,<Blank>,9629,- handing out leaflets and free newspapers at ...,Meter reader/Vending-machine collector reads e...
995,96293,Odd job person,<Blank>,Odd job person performs tasks of a simple and ...,<Blank>,• Labourer\n• Handyman,<Blank>,9629,- handing out leaflets and free newspapers at ...,Odd job person performs tasks of a simple and ...


#### B) Generating word embeddings

We use the GloVe word vectors shipped with spaCy's `en_core_web_lg` model and represent each description by the mean of its token vectors.

In [6]:
import os
os.chdir('..')

In [1]:
from spacy.language import Language

In [2]:
nlp = spacy.load('en_core_web_lg', disable = ['tagger', 'parser', 'ner', 'lemmatizer'])

In [3]:
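stopwords = nlp.Defaults.stop_words

@Language.component("additional_preprocessing")
def additional_preprocessing(doc):
    # Keep alphabetic tokens that are not stop words.
    # These are Token objects; no lemmatisation is applied.
    # Note: this returns a plain list rather than a Doc, so it must remain the
    # last pipeline component; spaCy versions that validate component output
    # may reject it.
    kept_tokens = [tok for tok in doc
                   if tok.is_alpha and tok.text.lower() not in stopwords]
    return kept_tokens

nlp.add_pipe('additional_preprocessing', last = True)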

<function __main__.additional_preprocessing(doc)>
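
To make the filtering rule concrete, here is a small pure-Python sketch of the same test applied to plain strings. The stop-word set is a tiny hypothetical one for illustration; the notebook uses spaCy's much larger `nlp.Defaults.stop_words`, and `str.isalpha()` stands in for `tok.is_alpha`.

```python
# Hypothetical, tiny stop-word set for illustration only.
STOPWORDS = {"and", "the", "of", "or", "in"}

def keep_token(text: str) -> bool:
    """Keep alphabetic strings that are not stop words (case-insensitive)."""
    return text.isalpha() and text.lower() not in STOPWORDS

tokens = ["Plans", ",", "organises", "and", "directs", "the", "24", "policies", "."]

# Punctuation, numbers and stop words are dropped; original casing is kept.
kept = [t for t in tokens if keep_token(t)]
```

Only the surviving tokens contribute to the averaged vector for each description.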

In [7]:
import pandas as pd
data = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')

In [9]:
job_desc = list(nlp.pipe(data['Description']))

In [11]:
import numpy as np

In [12]:
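job_vecs = []
for i, desc in enumerate(job_desc):
    if i % 100 == 0:
        print(f'Job description {i}/{len(job_desc)}...\r', end = '')
    if len(desc) == 0:
        # No tokens survived filtering: fall back to a 300-dimensional zero vector
        job_vecs.append(np.zeros(300))
    else:
        # Represent the description by the mean of its token vectors
        job_vecs.append(np.mean([token.vector for token in desc], axis = 0))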

Job description 900/997...
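
The averaging step can also be written as a small helper, which makes the empty-description fallback explicit. This is a sketch with a hypothetical function name (`mean_pool`) and toy 2-dimensional vectors in place of the 300-dimensional spaCy vectors.

```python
import numpy as np

def mean_pool(vectors, dim=300):
    """Average a sequence of 1-D vectors; return a zero vector if it is empty."""
    if len(vectors) == 0:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(np.stack(vectors), axis=0)

# Toy 2-dimensional vectors standing in for token vectors.
toy = [np.array([1.0, 3.0]), np.array([3.0, 5.0])]
pooled = mean_pool(toy, dim=2)   # element-wise mean of the two vectors
empty = mean_pool([], dim=2)     # fallback for a description with no tokens left
```

Note that any descriptions that fall back to the zero vector are identical in the embedding space, so they would be indistinguishable from one another in the 2D projection as well.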

#### C) Reducing the embeddings to 2D space

We use UMAP (the `umap` package) to reduce the 300-dimensional vectors to 2D, as it is typically faster than alternatives such as t-SNE.

In [9]:
import umap

In [12]:
reducer = umap.UMAP()

In [13]:
job_vecs_umap = reducer.fit_transform(job_vecs)

In [14]:
job_vecs_umap.shape

(997, 2)
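
The 2D projection is for visual inspection only. As a complement, closeness can also be measured directly in the original 300-dimensional space. Below is a minimal sketch using pairwise cosine similarity; the function name is hypothetical and the toy matrix stands in for `job_vecs`.

```python
import numpy as np

def cosine_similarity_matrix(vecs):
    """Pairwise cosine similarity between the rows of `vecs`."""
    X = np.asarray(vecs, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms
    return unit @ unit.T

# Toy rows: the first two point in the same direction, the third is orthogonal.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(toy)
```

Applied to `job_vecs`, this would give a 997 x 997 matrix in which values near 1 flag pairs of SSOC descriptions that are likely to be hard to tell apart.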

In [17]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
data = data.copy()
data['SSOC 2020 Title'] = SSOC_Final['SSOC 2020 Title']
data['x'] = job_vecs_umap[:, 0]
data['y'] = job_vecs_umap[:, 1]


In [19]:
data.to_csv('../Data/Processed/SSOC_2020_UMAP.csv', index = False)

In [20]:
data

Unnamed: 0,SSOC 2020,Description,SSOC 2020 Title,x,y
0,11110,"Legislator determines, formulates and directs ...",Legislator,2.668436,2.965331
1,11121,"Senior government official plans, organises an...",Senior government official,2.754100,2.935222
2,11122,"Senior statutory board official plans, organis...",Senior statutory board official,2.775443,2.977211
3,11140,Senior official of political party organisatio...,Senior official of political party organisation,2.796750,2.969191
4,11150,"Senior official of employers', workers' and ot...","Senior official of employers', workers' and ot...",2.850779,3.035420
...,...,...,...,...,...
992,96272,Concierge (hotel) serves as the point of conta...,Concierge (hotel),-0.668444,8.182647
993,96291,Leaflet and newspaper distributor/deliverer ha...,Leaflet and newspaper distributor/deliverer,-2.004738,9.974261
994,96292,Meter reader/Vending-machine collector reads e...,Meter reader/Vending-machine collector,-2.003415,10.021562
995,96293,Odd job person performs tasks of a simple and ...,Odd job person,-2.053076,9.994143
