## Developing a Hierarchical/Straight-Through Classification Approach

**Author:** Shaun Khoo  
**Date:** 1 Oct 2021  
**Context:** Adapting Shopify's approach to classifying products using a hierarchical classifier (see reference below)  
**Objective:** Develop code for training a hierarchical classifier neural network

Some references:

* [Shopify's article on categorizing products with image and text data](https://shopify.engineering/introducing-linnet-using-rich-image-text-data-categorize-products)
* [How to do transfer learning on PyTorch / Transformers](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)

#### A) Importing libraries and data

Changing the working directory to the top-level project folder

In [1]:
import os
os.chdir('..')

Importing the required libraries

In [2]:
import pandas as pd
import numpy as np
import copy

import torch
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable
from transformers import DistilBertModel, DistilBertTokenizer, DistilBertForSequenceClassification
import time
from datetime import datetime

# Enable debugging while on GPU
# This doesn't seem to work for me though
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

Importing our training functions from our own codebase

In [3]:
from ssoc_autocoder import model_training

Filling in the required parameters for the model training

In [30]:
colnames = {
    'SSOC': 'Predicted_SSOC_2020',
    'job_description': 'description',
    'job_title': 'title'
}

parameters = {
    'architecture': 'hierarchical',
    'version': 'V2pt2',
    'sequence_max_length': 512,
    'max_level': 5,
    'training_batch_size': 32,
    'validation_batch_size': 32,
    'epochs': 4,
    'learning_rate': 0.0005,
    'pretrained_tokenizer': 'C:\\Users\\shaun\\PycharmProjects\\ssoc-autocoder\\Models\\distilbert-tokenizer-pretrained-7epoch',
    'pretrained_model': 'C:\\Users\\shaun\\PycharmProjects\\ssoc-autocoder\\Models\\mcf-pretrained-7epoch', #'distilbert-base-uncased',
    'local_files_only': True,
    'num_workers': 4,
    'loss_weights': {
        'SSOC_1D': 20,
        'SSOC_2D': 5,
        'SSOC_3D': 3,
        'SSOC_4D': 2,
        'SSOC_5D': 1
    },
    'device': 'cuda'
}
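
The `loss_weights` entry assigns a weight to each SSOC level. The `model_training` code is not shown in this notebook, so the following is only a minimal sketch of one plausible use: computing a cross-entropy loss per level and summing them with these weights. The helper names `cross_entropy` and `weighted_hierarchical_loss` are hypothetical, and the sketch uses plain Python for a single example; in PyTorch each level would more likely use `torch.nn.CrossEntropyLoss`.

```python
import math

def cross_entropy(logits, target_idx):
    """Cross-entropy for one example, computed from raw logits (log-sum-exp form)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_idx]

def weighted_hierarchical_loss(logits_by_level, targets_by_level, loss_weights):
    """Hypothetical helper: weighted sum of the per-level cross-entropy losses."""
    return sum(
        weight * cross_entropy(logits_by_level[level], targets_by_level[level])
        for level, weight in loss_weights.items()
    )

# Toy example with two levels only (the notebook uses five)
toy_loss_weights = {'SSOC_1D': 20, 'SSOC_2D': 5}
toy_logits = {'SSOC_1D': [2.0, 0.0], 'SSOC_2D': [0.0, 0.0, 0.0]}
toy_targets = {'SSOC_1D': 0, 'SSOC_2D': 1}

total_loss = weighted_hierarchical_loss(toy_logits, toy_targets, toy_loss_weights)
print(f'Weighted loss: {total_loss:.4f}')
```

Under this reading, a heavier weight on `SSOC_1D` penalises mistakes at the broadest level most, which pushes the network to get the coarse category right before refining the finer digits.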

In [5]:
train = pd.read_csv('Data/Train/Train.csv')
test = pd.read_csv('Data/Train/Test.csv')
SSOC_2020 = pd.read_csv('Data/Reference/SSOC_2020.csv')

#### B) Preparing the model and data for training

Encoding the SSOCs into indices for the model

In [6]:
encoding = model_training.generate_encoding(SSOC_2020)
encoded_train = model_training.encode_dataset(train, encoding, colnames)
encoded_test = model_training.encode_dataset(test, encoding, colnames)
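
To make the encoding step concrete, here is a rough sketch of how such an encoding could be built: every SSOC prefix at each level (1 to 5 digits) gets a contiguous index. The functions `sketch_generate_encoding` and `sketch_encode` and the `ssoc_idx` key are assumptions for illustration; only the `idx_ssoc` lookup is taken from how the notebook uses `encoding` later on. The actual `model_training.generate_encoding` may differ.

```python
def sketch_generate_encoding(ssoc_codes, max_level=5):
    """Hypothetical: map each SSOC prefix at every level to a contiguous index."""
    encoding = {}
    for level in range(1, max_level + 1):
        prefixes = sorted({code[:level] for code in ssoc_codes})
        encoding[f'SSOC_{level}D'] = {
            'ssoc_idx': {ssoc: idx for idx, ssoc in enumerate(prefixes)},
            'idx_ssoc': dict(enumerate(prefixes)),
        }
    return encoding

def sketch_encode(ssoc, encoding):
    """Hypothetical: return the index of a single SSOC at every level."""
    ssoc = str(ssoc)
    encoded = {}
    for level_name, maps in encoding.items():
        n_digits = int(level_name[len('SSOC_')])  # 'SSOC_3D' -> 3
        encoded[level_name] = maps['ssoc_idx'][ssoc[:n_digits]]
    return encoded

# Toy reference list with four 5-digit codes
toy_codes = ['11110', '21111', '94101', '94102']
toy_encoding = sketch_generate_encoding(toy_codes)
toy_encoded = sketch_encode(94102, toy_encoding)
print(toy_encoded)
```

Because the indices are contiguous per level, each one can be used directly as a class label for that level's output layer.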

In [7]:
encoded_train[encoded_train['SSOC'] == 94102]

| Unnamed: 0 | Title | Text | SSOC | SSOC_1D | SSOC_2D | SSOC_3D | SSOC_4D | SSOC_5D |
|---|---|---|---|---|---|---|---|---|
| 969 | Food/Drink stall assistant | Food/Drink stall assistant assists in serving ... | 94102 | 8 | 40 | 140 | 405 | 969 |
| 1131 | Kitchen Assistant (Coffee Shop) | Kitchen Assistant (Coffee Shop) Full Time / Pa... | 94102 | 8 | 40 | 140 | 405 | 969 |
| 4325 | Hawker Assistant | Jiak Song Mee Hoon Kway is looking for an hour... | 94102 | 8 | 40 | 140 | 405 | 969 |
| 11668 | Hawker Assistant | Assistant to head chef:Cutting of Vegetables.W... | 94102 | 8 | 40 | 140 | 405 | 969 |
| 12339 | NUS New Canteen Fruit Juice Stall Assistant | New canteen, spacious and friendly working env... | 94102 | 8 | 40 | 140 | 405 | 969 |


In [8]:
encoded_test[encoded_test['SSOC'] == 94102]

| Unnamed: 0 | Title | Text | SSOC | SSOC_1D | SSOC_2D | SSOC_3D | SSOC_4D | SSOC_5D |
|---|---|---|---|---|---|---|---|---|
| 1691 | Coffee Shop Assistant | Table cleaning and clearing plates to dishwash... | 94102 | 8 | 40 | 140 | 405 | 969 |


Loading the DistilBERT tokenizer

In [9]:
tokenizer = DistilBertTokenizer.from_pretrained(parameters['pretrained_tokenizer'])

Creating the `DataLoader` objects for the train and test sets, and initialising the model

In [10]:
train_loader, test_loader = model_training.prepare_data(encoded_train, encoded_test, tokenizer, colnames, parameters)
model, loss_function, optimizer = model_training.prepare_model(encoding, parameters)

Some weights of the model checkpoint at C:\Users\shaun\PycharmProjects\ssoc-autocoder\Models\mcf-pretrained-7epoch were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### C) Running the training

In [31]:
model_training.train_model(model, loss_function, optimizer, train_loader, test_loader, parameters)

Training started on: 11 Jan 2022 - 08:53:57
> Epoch 1 started on: 11 Jan 2022 - 08:53:57
--------------------------------------------------------------------
>> Training Loss per 50 steps: 15.6882 
>> Training Accuracy per 50 steps: 36.94%
>> Batch of 50 took 3.37 mins
>> Training Loss per 50 steps: 16.1021 
>> Training Accuracy per 50 steps: 36.84%
>> Batch of 50 took 3.27 mins
>> Training Loss per 50 steps: 15.9761 
>> Training Accuracy per 50 steps: 37.17%
>> Batch of 50 took 3.23 mins
>> Training Loss per 50 steps: 15.9697 
>> Training Accuracy per 50 steps: 36.81%
>> Batch of 50 took 3.23 mins
>> Training Loss per 50 steps: 15.9663 
>> Training Accuracy per 50 steps: 36.65%
>> Batch of 50 took 3.23 mins
>> Training Loss per 50 steps: 16.0164 
>> Training Accuracy per 50 steps: 36.58%
>> Batch of 50 took 3.23 mins
>> Training Loss per 50 steps: 16.1063 
>> Training Accuracy per 50 steps: 36.41%
>> Batch of 50 took 3.22 mins
----------------------------------------------------------

In [29]:
torch.save(model.state_dict(), 'Models/autocoder-v2pt2-9jan-pretrained7epoch-34epoch.pt')

To do:

* Report the accuracy at each SSOC digit level for the evaluation set
* Begin training with only the 1D loss function
* Carry out error analysis

#### D) Generating predictions on the test set

In [None]:
import torch

ssoc_prediction_parameters = {
    'SSOC_1D': {'top_n': 2, 'min_prob': 0.5},
    'SSOC_2D': {'top_n': 5, 'min_prob': 0.4},
    'SSOC_3D': {'top_n': 5, 'min_prob': 0.3},
    'SSOC_4D': {'top_n': 5, 'min_prob': 0.05},
    'SSOC_5D': {'top_n': 10, 'min_prob': 0.05}
}

def generate_single_prediction(model, 
                               tokenizer, 
                               title,
                               text, 
                               target, 
                               encoding,
                               training_parameters,
                               ssoc_prediction_parameters, 
                               failsafe = True):
        
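    """
    Generates a single prediction from the trained neural network.

    Returns a dictionary keyed by SSOC level. Each entry holds the top-n
    predicted SSOCs ('predicted_ssoc'), their softmax probabilities
    ('predicted_proba'), and whether the target SSOC is among them
    ('accurate_prediction'). The 'failsafe' argument is currently unused.
    """

    # Check data type
    if not isinstance(title, str):
        raise TypeError("Please enter a string for the 'title' argument.")
    if not isinstance(text, str):
        raise TypeError("Please enter a string for the 'text' argument.")
    if not isinstance(target, str):
        raise TypeError("Please enter a string for the 'target' argument.")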

    # Tokenize the text using the DistilBERT tokenizer
    tokenized_title = tokenizer(
        text = title,
        text_pair = None,
        add_special_tokens = True,
        max_length = training_parameters['sequence_max_length'],
        padding = 'max_length',
        return_token_type_ids = True,
        truncation = True
    )
    tokenized_text = tokenizer(
        text = text,
        text_pair = None,
        add_special_tokens = True,
        max_length = training_parameters['sequence_max_length'],
        padding = 'max_length',
        return_token_type_ids = True,
        truncation = True
    )
    
    # Extract the tensors from the tokenizer
    test_title_ids = torch.tensor([tokenized_title['input_ids']], dtype = torch.long)
    test_title_mask = torch.tensor([tokenized_title['attention_mask']], dtype = torch.long)
    test_text_ids = torch.tensor([tokenized_text['input_ids']], dtype = torch.long)
    test_text_mask = torch.tensor([tokenized_text['attention_mask']], dtype = torch.long)
    
    # Set the model to evaluation mode and generate the predictions
    model.eval()
    with torch.no_grad():
        preds = model(test_title_ids, test_title_mask, test_text_ids, test_text_mask)
        m = torch.nn.Softmax(dim=1)
    
    # Iteratively generate predictions for each SSOC level that is specified
    predictions_with_proba = {}
    for ssoc_level, ssoc_level_params in sorted(ssoc_prediction_parameters.items()):
        
        # Extract the indices of the top n predicted SSOCs for the given SSOC level
        predicted_idx = preds[ssoc_level].detach().numpy().argsort()[0][::-1][:ssoc_level_params["top_n"]]
        
        # Extract the actual predicted probabilities from the softmax layer using the indices
        predicted_proba_all = m(preds[ssoc_level]).detach().numpy()[0]
        predicted_proba = [predicted_proba_all[idx] for idx in predicted_idx]
        
        # Convert the indices to the actual SSOC using the encoding dictionary
        predicted_ssoc = [encoding[ssoc_level]['idx_ssoc'][idx] for idx in predicted_idx]
        
        # Check if the model made an accurate prediction
        # Meaning whether the correct SSOC appeared in the list of predictions
        accurate_prediction = False
        for ssoc in predicted_ssoc:
            if ssoc == target[0:len(ssoc)]:
                accurate_prediction = True
        
        # Append predictions with the predicted probability to the output
        predictions_with_proba[ssoc_level] = {
            'predicted_ssoc': predicted_ssoc,
            'predicted_proba': predicted_proba,
            'accurate_prediction': accurate_prediction
        }
        
    return predictions_with_proba

def generate_predictions(model, 
                         tokenizer, 
                         test_set,
                         encoding,
                         training_parameters,
                         ssoc_prediction_parameters,
                         ssoc_level = 'SSOC_4D'):
    
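    """
    Generates predictions for every row of the test set and prints the
    accuracy at the chosen SSOC level. A row counts as accurate if the
    true SSOC appears among the top-n predictions for that level.

    Expects the columns 'title', 'description' and 'Predicted_SSOC_2020'.
    Returns a list with one prediction dictionary per row.
    """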
        
    output = []
    accurate_predictions = []
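    # Count rows by position so the progress display is correct even if the index is not 0..n-1
    for position, (_, row) in enumerate(test_set.iterrows(), start = 1):
        print(f'Generating prediction for {position}/{len(test_set)}...', end = '\r')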
        predictions_with_proba = generate_single_prediction(model, 
                                                            tokenizer, 
                                                            row['title'],
                                                            row['description'], 
                                                            str(row['Predicted_SSOC_2020']),
                                                            encoding,
                                                            training_parameters,
                                                            ssoc_prediction_parameters)
        output.append(predictions_with_proba)
        accurate_predictions.append(predictions_with_proba[ssoc_level]['accurate_prediction'])
    
    print('')
    accuracy = sum(accurate_predictions)/len(accurate_predictions)
    print(f'Overall {ssoc_level} accuracy: {accuracy:.2%}')
    
    return output
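
Note that `ssoc_prediction_parameters` carries a `min_prob` for each level, but `generate_single_prediction` as written only reads `top_n`. The sketch below shows one possible way to apply the threshold, assuming the intent is to drop top-n candidates whose softmax probability falls below `min_prob`. The helper names are hypothetical and the example uses plain Python lists instead of tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n_above_threshold(logits, top_n, min_prob):
    """Hypothetical helper: keep the top-n classes whose probability is at least min_prob."""
    probs = softmax(logits)
    # Rank class indices from most to least probable, then keep the first top_n
    ranked = sorted(range(len(probs)), key=lambda idx: probs[idx], reverse=True)[:top_n]
    return [(idx, probs[idx]) for idx in ranked if probs[idx] >= min_prob]

# Toy logits for a level with four classes
toy_logits = [2.0, 1.0, 0.1, -1.0]
candidates = top_n_above_threshold(toy_logits, top_n=3, min_prob=0.2)
for idx, prob in candidates:
    print(f'class {idx}: {prob:.2%}')
```

With a threshold in place, a level can return fewer than `top_n` candidates, or none at all, so any downstream code would need to handle an empty list.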


In [None]:
test = pd.read_csv('Data/Train/Test.csv')

In [None]:
model.to('cpu')

In [None]:
all_predictions = generate_predictions(model, tokenizer, mrsd_val, encoding, parameters, ssoc_prediction_parameters)

Generating prediction for 492/492...
Overall SSOC_4D accuracy: 74.19%


In [26]:
import pickle

with open('all_predictions.pickle', 'wb') as handle:
    pickle.dump(all_predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [27]:
with open('all_predictions.pickle', 'rb') as handle:
    all_ = pickle.load(handle)

In [24]:
selected = []
for prediction in all_predictions:
    selected.append(prediction['SSOC_5D']['accurate_prediction'])
accuracy = sum(selected)/len(selected)
print(f'Overall accuracy: {accuracy:.2%}')   

Overall accuracy: 61.99%
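
The same calculation can be repeated for every level at once. The helper below is a small sketch that relies only on the `accurate_prediction` flag stored by `generate_single_prediction`; the name `accuracy_by_level` and the toy data are illustrative. Because the flag is set whenever the true SSOC appears anywhere in the top-n list, the figures are top-n accuracies, not top-1.

```python
def accuracy_by_level(all_predictions):
    """Share of rows whose true SSOC matched one of the top-n predictions, per level."""
    if not all_predictions:
        return {}
    levels = list(all_predictions[0].keys())
    return {
        level: sum(p[level]['accurate_prediction'] for p in all_predictions) / len(all_predictions)
        for level in levels
    }

# Toy predictions with two rows and two levels
toy_predictions = [
    {'SSOC_1D': {'accurate_prediction': True}, 'SSOC_2D': {'accurate_prediction': True}},
    {'SSOC_1D': {'accurate_prediction': True}, 'SSOC_2D': {'accurate_prediction': False}},
]
toy_accuracy = accuracy_by_level(toy_predictions)
for level, acc in toy_accuracy.items():
    print(f'{level} accuracy: {acc:.2%}')
```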


In [None]:
import json
with open("encoding.json", 'w') as outfile:
    json.dump(encoding, outfile)
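
One caveat when saving the encoding as JSON: JSON object keys are always strings, so any integer keys come back as strings after a round trip. Assuming `idx_ssoc` is keyed by integer indices (as the lookup in `generate_single_prediction` suggests), the keys need converting back after loading. The sketch below uses a toy encoding, and `restore_int_keys` is a hypothetical helper.

```python
import json

# Toy encoding with the same shape as assumed above
encoding_sketch = {
    'SSOC_1D': {
        'ssoc_idx': {'1': 0, '2': 1},
        'idx_ssoc': {0: '1', 1: '2'},
    }
}

# Round trip through JSON: the integer keys of 'idx_ssoc' become strings
restored_raw = json.loads(json.dumps(encoding_sketch))

def restore_int_keys(raw_encoding):
    """Hypothetical helper: convert the 'idx_ssoc' keys back to integers."""
    return {
        level: {
            **maps,
            'idx_ssoc': {int(idx): ssoc for idx, ssoc in maps['idx_ssoc'].items()},
        }
        for level, maps in raw_encoding.items()
    }

restored = restore_int_keys(restored_raw)
print(restored == encoding_sketch)
```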

In [None]:
data = data[data[colnames['SSOC']].notnull()]

In [None]:
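# `train` is a DataFrame in this notebook, so call the functions from `model_training`
encoding = model_training.generate_encoding(SSOC_2020)
encoded_data = model_training.encode_dataset(data, encoding, colnames)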

In [None]:
encoded_data['SSOC_1D'].value_counts()

In [None]:
encoded_data['SSOC_2D'].value_counts()

Importing our datasets

Use a custom function to encode the category correctly as PyTorch requires (as a dictionary)