# Hierarchical Classification

This notebook develops code to classify SSOCs using the SSOC taxonomy. The approach is inspired by [this article by Shopify](https://shopify.engineering/categorizing-products-at-scale).

## Attempt 1: Using Kesler's construction and logistic regression

### Importing code and data

Make sure `spacy` and the `en_core_web_lg` model are installed in your environment first.

In [1]:
import pandas as pd
import spacy
import numpy as np

In [15]:
nlp = spacy.load("en_core_web_lg")

In [32]:
mcf_labelled = pd.read_csv('../Data/Processed/Artifacts/MCF_Subset_WithLabels.csv')

In [33]:
mcf_labelled.head()

Unnamed: 0,Job_ID,Title,Description,SSOC_2015,Cleaned_Description,Predicted SSOC,Reported SSOC Desc,Predicted SSOC Desc
0,MCF-2020-0035227,pega solution architect (1 year contract),<p>Technical specialists will be responsible f...,21499,pega solution architect year contract technica...,29090,Other engineering professionals n.e.c.,"Other professionals n.e.c. (eg patent agent, t..."
1,MCF-2020-0002456,architectural coordinator,<ul>\n <li>Qualified and Experienced Architec...,21649,architectural coordinator qualified and experi...,13499,Other related planners (eg traffic planner),"Other professional, financial, community and s..."
2,MCF-2020-0183160,conveyancing secretary,<p>We are currently looking for Conveyancing S...,41201,conveyancing secretary we are currently lookin...,44170,Secretary,Legal clerk
3,MCF-2020-0228411,partner & alliance sales manager,<p>Based in <strong>Singapore</strong> and rep...,12211,partner alliance sales manager based in singap...,12211,Sales and marketing manager,Sales and marketing manager
4,MCF-2020-0117401,assistant chef / chef,<p>Position Purpose</p>\n<p>• Lead the kitchen...,34340,assistant chef chef position purpose lead the ...,94101,,Kitchen assistant


### Preparing the data

Both the SSOC 2020 detailed definitions and tasks and the SSOC 2015v18 to SSOC 2020 mapping are obtained from the DOS website. The list of SSOCs for SSOC 2015v18 is obtained from Lucas.

In [114]:
SSOC_Definitions = pd.read_excel('../Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)

  warn("""Cannot parse header or footer so it will be ignored""")


In [115]:
ssoc_v18_2020_mapping = pd.read_excel('../Data/Raw/Correspondence Tables between SSOC2020 and 2015v18.xlsx', skiprows = 4, sheet_name = 'SSOC2015(v2018)-SSOC2020')

In [121]:
ssoc_v18 = pd.read_csv('../Data/Raw/ssoc_v2018.csv', encoding='iso-8859-1')
ssoc_v18.dropna(inplace = True)
ssoc_v18['SSOC 2015 (Version 2018)'] = ssoc_v18['ssoc_f'].astype('float').astype('int').astype('str')
ssoc_v2020 = ssoc_v18.merge(ssoc_v18_2020_mapping, how = 'left', on = 'SSOC 2015 (Version 2018)')[['SSOC 2015 (Version 2018)', 'SSOC 2015 (Version 2018) Title', 'SSOC 2020', 'SSOC 2020 Title']]

In [124]:
ssoc_v2020.head(5)

Unnamed: 0,SSOC 2015 (Version 2018),SSOC 2015 (Version 2018) Title,SSOC 2020,SSOC 2020 Title
0,11110,Legislator,11110,Legislator
1,11121,Senior government official,11121,Senior government official
2,11122,Senior statutory board official,11122,Senior statutory board official
3,11140,Senior official of political party organisation,11140,Senior official of political party organisation
4,11150,"Senior official of employers', workers' and ot...",11150,"Senior official of employers', workers' and ot..."


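Create the one-hot encoding table, with the full SSOC taxonomy (from 1D to 5D levels) as the columns and each 5D SSOC as a row. A cell is 1 when the column's code is a prefix of the row's SSOC, so each row has a 1 at every level of its own hierarchy.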

In [66]:
def compare(col, ssoc):
    # 1 if `col` is a prefix of (or equal to) `ssoc`, else 0
    return 1 if ssoc.startswith(col) else 0

# SSOC 2020 codes that do not contain 'X', computed once and reused below
ssoc_codes = ssoc_v18[~ssoc_v18['SSOC 2020'].str.contains('X')]['SSOC 2020'].tolist()

ssoc_pivoted = pd.DataFrame([], columns = ssoc_codes)
for idx, ssoc in enumerate(ssoc_codes):
    ssoc_pivoted.loc[idx,:] = [compare(col, ssoc) for col in ssoc_pivoted.columns]
ssoc_pivoted['SSOC'] = ssoc_codes

In [67]:
ssoc_final = ssoc_pivoted.set_index('SSOC')

In [126]:
ssoc_final.head()

SSOC,1,11,111,1111,11110,1112,11121,11122,1114,11140,...,96262,96269,9627,96271,96272,9629,96291,96292,96293,96299
11110,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11121,1,1,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11122,1,1,1,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
11140,1,1,1,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
11150,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
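
The prefix rule behind this table can be illustrated on a handful of codes. This is only a minimal sketch: the short `taxonomy` list is a hand-picked subset of the columns shown above, and `prefix_indicators` is a hypothetical helper that is not part of the notebook.

```python
# Hand-picked subset of the taxonomy columns, for illustration only
taxonomy = ["1", "11", "111", "1111", "11110", "1112", "11121", "11122"]

# Rows are the 5D codes only
leaf_codes = [code for code in taxonomy if len(code) == 5]

def prefix_indicators(ssoc, columns):
    """Return 1 for every column code that is a prefix of (or equal to) `ssoc`, else 0."""
    return [int(ssoc.startswith(col)) for col in columns]

table = {ssoc: prefix_indicators(ssoc, taxonomy) for ssoc in leaf_codes}

for ssoc, row in table.items():
    print(ssoc, row)
```

Each row carries a 1 for its 1D, 2D, 3D, 4D and 5D codes and a 0 everywhere else, which is the pattern visible in the first columns of the output above.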


For now we predict only the 1D SSOC (the first digit) to keep the computation manageable.

In [14]:
ssoc_final_1d = ssoc_final[[str(i) for i in range(1, 10)]]

In [128]:
ssoc_final_1d.columns.tolist()

['1', '2', '3', '4', '5', '6', '7', '8', '9']

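Generate the text data and the SSOC codes. The description for each 5D SSOC is its detailed definition followed by the tasks of its 4D parent, since the tasks are taken from the 4D level.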

In [4]:
SSOC_4D = SSOC_Definitions[SSOC_Definitions['SSOC 2020'].apply(len) == 4][['SSOC 2020', 'Tasks']]
SSOC_4D.columns = ['4D SSOC', 'Tasks']

In [5]:
SSOC_5D = SSOC_Definitions[(SSOC_Definitions['SSOC 2020'].apply(len) == 5) & ~SSOC_Definitions['SSOC 2020'].str.contains('X')].reset_index(drop = True)
SSOC_5D['4D SSOC'] = SSOC_5D['SSOC 2020'].str.slice(0, 4)
SSOC_5D.drop('Tasks', axis = 1, inplace = True)

In [6]:
SSOC_Final = SSOC_5D.merge(SSOC_4D, how = 'left', on = '4D SSOC')

In [7]:
SSOC_Final['Description'] = SSOC_Final['Detailed Definitions'] + " " + SSOC_Final['Tasks']

In [10]:
data = SSOC_Final[['SSOC 2020', 'Description']]
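
One thing worth checking in this step is whether every 5D code actually finds tasks on its 4D parent, because concatenating a string with a missing value in pandas yields a missing `Description`. The sketch below shows the check on a toy frame. The rows and the placeholder text are hypothetical, and the toy deliberately leaves out the 4D parent of `11140` to trigger the missing case.

```python
import pandas as pd

# Hypothetical stand-in for the definitions file: tasks are filled only on 4D rows
defs = pd.DataFrame({
    "SSOC 2020": ["1111", "11110", "1112", "11121", "11140"],
    "Detailed Definitions": ["def 1111", "def 11110", "def 1112", "def 11121", "def 11140"],
    "Tasks": ["tasks 1111", None, "tasks 1112", None, None],
})

# Split into 4D rows (which carry the tasks) and 5D rows (which carry the definitions)
is_4d = defs["SSOC 2020"].str.len() == 4
four_d = defs.loc[is_4d, ["SSOC 2020", "Tasks"]].rename(columns={"SSOC 2020": "4D SSOC"})
five_d = defs.loc[~is_4d].drop(columns="Tasks").copy()
five_d["4D SSOC"] = five_d["SSOC 2020"].str.slice(0, 4)

# Attach the parent's tasks to each 5D code
merged = five_d.merge(four_d, how="left", on="4D SSOC")

# Flag 5D codes whose parent has no tasks before concatenating
no_parent_tasks = merged.loc[merged["Tasks"].isna(), "SSOC 2020"].tolist()

# Fall back to the definition alone when the tasks are missing
merged["Description"] = (
    merged["Detailed Definitions"] + " " + merged["Tasks"].fillna("")
).str.strip()

print(no_parent_tasks)
print(merged[["SSOC 2020", "Description"]])
```

Whether falling back to the definition alone is appropriate depends on the data; the point is only to surface such rows rather than pass a missing value to `nlp`.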

### Testing the implementation of Kesler's construction and logistic regression

Each feature vector consists of the 300-dimensional word embedding from `spacy` and the 9-dimensional one-hot encoding for the 1D SSOC taxonomy. Kesler's construction also multiplies the number of rows by 9 (one row per candidate 1D class), so the final matrix should have shape `(9n, 309)`, where `n` is the original number of rows.

In [132]:
%%time

# Initialise the output lists
output = []
labels = []

# For each SSOC and its accompanying description
for desc, ssoc in zip(data['Description'], data['SSOC 2020']):
    
    # Print the SSOC so we know how many more to go
    print(ssoc + '\r', end = "")
    
    # Generate the embedding vector
    feature_vector = nlp(desc).vector
    
    # Iterate through each 1D SSOC
    for target_class in ssoc_final_1d.columns.tolist():
        
        # Generate the label - if it is the first digit then the label should be 1, else 0
        if target_class == ssoc[0]:
            labels.append(1)
        else:
            labels.append(0)
            
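        # Concatenate the word embedding and the one-hot encoding into a single feature vector and append it.
        # Note: the one-hot row is looked up by `ssoc` (this description's own SSOC), not by `target_class`,
        # so the nine rows generated for a description have identical features and differ only in their label.
        output.append(np.concatenate([feature_vector, ssoc_final_1d.loc[str(ssoc),:].tolist()], axis = None, dtype = 'float32'))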

Wall time: 22.2 s


In [133]:
X = np.array(output, dtype = 'float32')
y = np.array(labels, dtype = 'int32')
print(X.shape)
print(y.shape)

(8973, 309)
(8973,)
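
As a shape check, the same row expansion can be sketched in vectorised form on random toy data. Everything here is an assumption made for illustration: the embeddings are random stand-ins for `nlp(desc).vector`, the `true_class` values are made up, and the one-hot block encodes the candidate class of each row (as the test rows later in the notebook do), which may or may not be the intended construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, n_classes = 4, 300, 9  # toy sizes; 300 mirrors the spaCy vector length

# Random stand-ins for the document embeddings
embeddings = rng.normal(size=(n, dim)).astype("float32")
# Hypothetical 0-based 1D class for each description
true_class = np.array([0, 3, 3, 8])

# Repeat each embedding once per candidate class: rows 0..8 belong to description 0, and so on
X_emb = np.repeat(embeddings, n_classes, axis=0)
# Candidate class of every expanded row: 0..8, 0..8, ...
candidate = np.tile(np.arange(n_classes), n)
# One-hot block for the candidate class of each row
X_onehot = np.eye(n_classes, dtype="float32")[candidate]

X_toy = np.hstack([X_emb, X_onehot])
# Label is 1 only where the candidate equals the description's true class
y_toy = (candidate == np.repeat(true_class, n_classes)).astype("int32")

print(X_toy.shape, y_toy.shape, y_toy.sum())
```

With `n = 4` the expanded matrix has `9 * 4 = 36` rows and 309 columns, and exactly one positive label per description.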


Now fit a (vanilla) logistic regression model.

In [134]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter = 10000)
model.fit(X, y)

LogisticRegression(max_iter=10000)

Check the last 9 coefficients (for the one-hot encodings). These don't look promising: they are all very small and of similar size, so the one-hot block barely moves the prediction.

In [135]:
model.coef_[0][-9:]

array([-0.00233757, -0.00238159, -0.00218955, -0.00157428, -0.002279  ,
       -0.00293471, -0.00239002, -0.00238908, -0.00251154])

Generate testing data for the first MCF description. Note that we have to expand it into 9 rows as well, one per candidate 1D SSOC.

In [136]:
idx = 0
ssoc = mcf_labelled['SSOC_2015'][idx]
feature_vec = nlp(mcf_labelled['Description'][idx]).vector
testing_example = []
for i in range(1, 10):
    row = np.concatenate([feature_vec, [1 if j == i else 0 for j in range(1, 10)]], axis = None)
    testing_example.append(row)
np.array(testing_example).shape

(9, 309)

Generate predictions and predicted probabilities

In [139]:
model.predict(np.array(testing_example))

array([0, 0, 0, 0, 0, 0, 0, 0, 0])

In [140]:
model.predict_proba(np.array(testing_example))

array([[0.88884097, 0.11115903],
       [0.88884532, 0.11115468],
       [0.88882635, 0.11117365],
       [0.88876553, 0.11123447],
       [0.88883518, 0.11116482],
       [0.88889996, 0.11110004],
       [0.88884615, 0.11115385],
       [0.88884606, 0.11115394],
       [0.88885816, 0.11114184]])
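
Because `predict` thresholds each of the nine rows independently at 0.5, it returns all zeros here. One way to read off a single class from an expansion like this is to take the candidate whose row has the highest positive-class probability. The sketch below does that with the probabilities printed above; the variable names are hypothetical.

```python
# Positive-class probabilities copied from the second column of the output above
p_positive = [0.11115903, 0.11115468, 0.11117365, 0.11123447, 0.11116482,
              0.11110004, 0.11115385, 0.11115394, 0.11114184]

# Candidate 1D SSOC codes, in the same order as the test rows
candidates = [str(i) for i in range(1, 10)]

# Pick the candidate with the highest positive-class probability
best = max(range(len(candidates)), key=lambda k: p_positive[k])
predicted_1d = candidates[best]

print(predicted_1d)
```

For this posting the arg-max lands on `'4'`, whereas the reported `SSOC_2015` code (21499) begins with `'2'`.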

The predicted probabilities vary across the nine candidates, but only from the fourth decimal place onwards, and they sit close to the class proportions (8/9 ≈ 0.889 and 1/9 ≈ 0.111). We test this further by adding class weights to the logistic regression.

In [141]:
model2 = LogisticRegression(max_iter = 10000, class_weight = {0: 1, 1: 20})
model2.fit(X, y)

LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=10000)

In [143]:
model2.predict(np.array(testing_example))

array([1, 1, 1, 1, 1, 1, 1, 1, 1])

In [142]:
# Note: this scores the training rows (`output`), not `testing_example`
model2.predict_proba(np.array(output))

array([[0.28548713, 0.71451287],
       [0.28548713, 0.71451287],
       [0.28548713, 0.71451287],
       ...,
       [0.28587308, 0.71412692],
       [0.28587308, 0.71412692],
       [0.28587308, 0.71412692]])
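
As a rough check on where these numbers come from, the sketch below computes the positive-class share that a model ignoring its features would be expected to output, with and without the class weights. The helper `weighted_positive_rate` is hypothetical, and the 1:8 ratio assumes exactly one positive row in every nine expanded rows.

```python
def weighted_positive_rate(n_pos, n_neg, w_pos=1.0, w_neg=1.0):
    """Share of the total (weighted) mass that belongs to the positive class."""
    return n_pos * w_pos / (n_pos * w_pos + n_neg * w_neg)

# One positive and eight negatives per description, no class weights
unweighted = weighted_positive_rate(1, 8)
# Same ratio with class_weight = {0: 1, 1: 20}
weighted = weighted_positive_rate(1, 8, w_pos=20)

print(round(unweighted, 4), round(weighted, 4))
```

The two values, 1/9 ≈ 0.111 and 20/28 ≈ 0.714, are close to the positive-class probabilities printed for `model` and `model2` respectively.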

This does not seem to be working: the model is essentially predicting from the class proportions and weights rather than from the features. The next step is to try a neural network layer instead.