In [1]:
import spacy
import pandas as pd
import numpy as np
import re
import random

## Load Data

In [2]:
df = pd.read_csv(r'C:\Users\Chang\Desktop\Work Things\Internships\Affinity Solutions\Summer Internship - Homework Exercise.csv')

df

Unnamed: 0,transaction_descriptor,store_number,dataset
0,DOLRTREE 2257 00022574 ROSWELL,2257,train
1,AUTOZONE #3547,3547,train
2,TGI FRIDAYS 1485 0000,1485,train
3,BUFFALO WILD WINGS 003,3,train
4,J. CREW #568 0,568,train
...,...,...,...
295,MCDONALD'S F2151,F2151,test
296,NST BEST BUY #1403 332411,1403,test
297,CVS/PHARMACY #06689,6689,test
298,BANANA REPUBLIC #8109,8109,test


## Clean Data

In [3]:
# clean awkward spaces
def clean_spaces(s):
    return re.sub(' +', ' ', s)

# replace punctuation (and underscores) with spaces, because the vectorizer will not understand a number that is attached to punctuation
def split_punc(s):
    return re.sub(r'([^\w\s]|_)',' ',s)

# strip leading zeros from numbers (a token made up only of zeros is removed entirely)
def split_leading_zeros(s):
    return re.sub(r'\b0+', '', s)
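
As a quick sanity check, the three regex helpers above can be chained on a few raw descriptors taken from the table in the Load Data section. This is only a minimal sketch: `clean_descriptor` is a hypothetical wrapper name, it leaves out the number/word splitting step defined in the next cell, and the final `.strip()` is an addition that the notebook's own pipeline does not perform.

```python
import re

def clean_spaces(s):
    return re.sub(' +', ' ', s)

def split_punc(s):
    return re.sub(r'([^\w\s]|_)', ' ', s)

def split_leading_zeros(s):
    return re.sub(r'\b0+', '', s)

def clean_descriptor(s):
    # hypothetical wrapper: punctuation -> leading zeros -> repeated spaces,
    # followed by a strip() that is NOT part of the original pipeline
    return clean_spaces(split_leading_zeros(split_punc(s))).strip()

samples = [
    'AUTOZONE #3547',
    'CVS/PHARMACY #06689',
    'BUFFALO WILD WINGS 003',
    'TGI FRIDAYS 1485 0000',
]
cleaned = [clean_descriptor(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f'{raw!r} -> {out!r}')
```

Note that the trailing `0000` in the last sample disappears completely, since every character of that token is a leading zero.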

In [4]:
# split numbers from words

def split_num_from_word(s):
    li_split = s.split(' ')
    start_, end_ = 0, 0
    for item in li_split:
        # only look at tokens that contain digits (mixed letters and digits, or digits only)
        if not item.isalpha() and item.isalnum():
            split_num_word = re.split(r'(\d+)', item)
            # split only when the text before or after the number is longer than 2 characters
            if len(split_num_word[-1]) > 2 or len(split_num_word[0]) > 2:
                start_ = s.index(split_num_word[1])
                end_ = start_ + len(split_num_word[1])

    # pad the number with spaces; if nothing was split, this prepends two spaces to s
    return s[:start_] + ' ' + s[start_:end_] + ' ' + s[end_:]
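
To make the behavior of this function concrete, the sketch below reruns the same logic on three inputs. The first contains the fused form `9442088LIBERTYVILLE` that is discussed later in the notebook; the other two show that tokens with a short alphabetic part (2 characters or fewer) are left untouched. One side effect is worth knowing: when nothing is split, `start_` and `end_` stay at 0, so the return statement prepends two spaces, which appears to be why many cleaned descriptors later in this notebook start with a space.

```python
import re

def split_num_from_word(s):
    start_, end_ = 0, 0
    for item in s.split(' '):
        # tokens that contain digits (mixed or digits only)
        if not item.isalpha() and item.isalnum():
            parts = re.split(r'(\d+)', item)
            # split only if the text around the number is longer than 2 characters
            if len(parts[-1]) > 2 or len(parts[0]) > 2:
                start_ = s.index(parts[1])
                end_ = start_ + len(parts[1])
    return s[:start_] + ' ' + s[start_:end_] + ' ' + s[end_:]

fused = split_num_from_word('BP 9442088LIBERTYVILLE B')
short_prefix = split_num_from_word('MCDONALD S F2151')
short_suffix = split_num_from_word('BP 8644346ES 30 B96')

print(repr(fused))         # number separated from the city name
print(repr(short_prefix))  # unchanged apart from two leading spaces
print(repr(short_suffix))  # 'ES' is only 2 characters long, so nothing is split
```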

In [5]:
transaction_descriptor = df['transaction_descriptor']

df['transaction_descriptor'] = list(map(clean_spaces, map(split_leading_zeros, map(split_num_from_word, map(split_punc, transaction_descriptor)))))

df

Unnamed: 0,transaction_descriptor,store_number,dataset
0,DOLRTREE 2257 22574 ROSWELL,2257,train
1,AUTOZONE 3547,3547,train
2,TGI FRIDAYS 1485,1485,train
3,BUFFALO WILD WINGS 3,3,train
4,J CREW 568,568,train
...,...,...,...
295,MCDONALD S F2151,F2151,test
296,NST BEST BUY 1403 332411,1403,test
297,CVS PHARMACY 6689,6689,test
298,BANANA REPUBLIC 8109,8109,test


## Load a Spacy Model

We will load a pretrained model from spaCy and look at what it gives us out of the box:

In [6]:
nlp = spacy.load('en_core_web_sm')

Let us play with the model by running it on every descriptor:

In [7]:
li_init_predictions = []
for descriptor in df['transaction_descriptor']:
    doc = nlp(descriptor)
    for ent in doc.ents:
        li_init_predictions.append((ent.text, ent.label_))
        
li_init_predictions

[('3', 'CARDINAL'),
 ('CREW', 'ORG'),
 ('40', 'CARDINAL'),
 ('GREENVILLE SC', 'ORG'),
 ('FIVE', 'CARDINAL'),
 ('1847', 'DATE'),
 ('612 339 9733', 'MONEY'),
 ('MN', 'ORG'),
 ('2650', 'DATE'),
 ('HOUSE', 'ORG'),
 ('535', 'PRODUCT'),
 ('1305', 'DATE'),
 ('26824', 'DATE'),
 ('688', 'CARDINAL'),
 ('207812', 'CARDINAL'),
 ('EXPRESS', 'ORG'),
 ('920', 'CARDINAL'),
 ('F16829', 'DATE'),
 ('208998', 'DATE'),
 ('Q61', 'ORG'),
 ('2610', 'DATE'),
 ('354', 'CARDINAL'),
 ('483280353', 'DATE'),
 ('ROSS STORES', 'ORG'),
 ('10262660', 'DATE'),
 ('447', 'CARDINAL'),
 ('196', 'CARDINAL'),
 ('6870', 'CARDINAL'),
 ('22357', 'DATE'),
 ('748361300', 'DATE'),
 ('MCDONALD S F3172', 'ORG'),
 ('THE HOME DEPOT 1407', 'WORK_OF_ART'),
 ('1548', 'DATE'),
 ('NNT LANE BRYANT', 'WORK_OF_ART'),
 ('142', 'CARDINAL'),
 ('ROSS STORES', 'ORG'),
 ('11572299', 'DATE'),
 ('1327', 'DATE'),
 ('27 2732', 'DATE'),
 ('303', 'CARDINAL'),
 ('ROYAL', 'ORG'),
 ('24', 'CARDINAL'),
 ('180073', 'DATE'),
 ('1783', 'DATE'),
 ('532412', 'DATE

In [8]:
# get components of the pipeline

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
# pick just the ner

ner = nlp.get_pipe('ner')

In [10]:
# example on an entry that we could possibly have

ex1 = 'TOYS R US 5009'

doc = nlp(ex1)

In [11]:
for entity in doc.ents:
    print(entity, entity.label_)

US GPE
5009 DATE


So we see that the pretrained spaCy pipeline needs training on our specific examples. It lacks the domain-specific knowledge that our task requires: in particular, it cannot discern between stores and store numbers, and it tends to label the numbers as dates or cardinals instead.

## Train Our Model

We will feed our training set into the model.

Training data must be presented to spacy nlp models in the form:

[(text to train on, {'entities': [(start index, stop index, 'desired label')]})]
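
As a small illustration of this format, the sketch below builds one annotated entry by hand for a cleaned descriptor from the training set. The variable names are only illustrative. The offsets are character positions, and the stop index is exclusive, so slicing the text with them returns the labeled span.

```python
text = 'AUTOZONE 3547'
store_num = '3547'

# character offsets of the store number inside the descriptor
start = text.index(store_num)
end = start + len(store_num)  # exclusive stop index

example = (text, {'entities': [(start, end, 'store_num')]})
train_data_example = [example]  # the full training set is a list of such tuples

print(example)
print(text[start:end])
```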

### Get our training data in the proper format

In [12]:
from spacy.tokens import DocBin
from tqdm import tqdm

In [13]:
def load_spacy_data(data_df, dataset):
    # keep only the rows of the requested split (uses the dataframe that was passed in)
    split_df = data_df.loc[data_df['dataset'] == dataset]

    # get into form we need to train spacy model
    li_data = []
    for store_num, descriptor in zip(split_df['store_number'], split_df['transaction_descriptor']):
        # character offsets of the first occurrence of the store number in the descriptor
        start_pos = descriptor.index(store_num)
        end_pos = start_pos + len(store_num)
        li_data.append((descriptor, {'entities':[(start_pos, end_pos, 'store_num')]}))

    return li_data
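
One thing to keep in mind with `descriptor.index(store_num)` is that it returns the first occurrence of the substring, even when that occurrence sits inside a longer number. The sketch below shows a boundary-aware alternative. `find_store_span` is a hypothetical helper that is not part of the notebook, and `'SHOP 123 23'` is a made-up descriptor used only to show the difference.

```python
import re

def find_store_span(descriptor, store_num):
    # hypothetical helper: match store_num only as a whole token,
    # i.e. not directly preceded or followed by a letter or digit
    pattern = r'(?<![A-Za-z0-9])' + re.escape(store_num) + r'(?![A-Za-z0-9])'
    match = re.search(pattern, descriptor)
    return (match.start(), match.end()) if match else None

# made-up descriptor: plain index() lands inside '123'
naive_start = 'SHOP 123 23'.index('23')
whole_token = find_store_span('SHOP 123 23', '23')

# descriptor and label from the test set: the label is only a prefix of the token
prefix_only = find_store_span('NNT POLO RL WRENTHA 130571', '13057')

print(naive_start, whole_token, prefix_only)
```

Returning `None` for prefix-only labels makes such entries easy to flag for review, whereas `index` would silently annotate part of a longer number.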

In [14]:
train_data = load_spacy_data(df.loc[df['dataset'] == 'train'], 'train')

### Get our validation data in proper format too

In [15]:
val_data = load_spacy_data(df.loc[df['dataset'] == 'validation'], 'validation')

In [16]:
print(f'We have {len(train_data)} training entries.')

print(f'We have {len(val_data)} validation entries.')

We have 100 training entries.
We have 100 validation entries.


### Load a Blank Model

We will first train a blank model! (the one we loaded earlier was a pretrained model)

In [17]:
nlp = spacy.blank('en')

In [18]:
def load_dot_spacy(data, dataset, nlp):
    db = DocBin() # create a DocBin object
    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text
        ents = []
        for start, end, label in annot['entities']: # add character indexes
            span = doc.char_span(start, end, label=label, alignment_mode='contract')
            if span is None:
                print('Skipping entity')
            else:
                ents.append(span)
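        try:
            doc.ents = ents # label the text with the ents
            db.add(doc)
        except ValueError:
            # spaCy rejects invalid entity sets (for example overlapping spans); print the entry and skip it
            print(text, annot)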
    db.to_disk('./' + dataset + '.spacy') # save the docbin object

In [19]:
# will store files named "train.spacy" and "dev.spacy" in our directory with this notebook, we will feed this into
# our config file

load_dot_spacy(train_data, 'train', nlp)

load_dot_spacy(val_data, 'dev', nlp)

### Get the Config File

For spaCy 3.x, the training setup is defined in a config file. We will run the spaCy training loop from the command prompt instead of directly from Python.

See documentation: https://spacy.io/usage/training

### Instructions to get the config file + autofill the config file

- Activate virtual environment of the project (must have spaCy installed)

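- Type "python -m spacy init fill-config base_config.cfg config.cfg" into the command prompt (this assumes a "base_config.cfg" is already in the working directory, for example one exported from the quickstart widget in the spaCy training documentation)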

- Run it

We will obtain a config file, named "config.cfg".

### Instructions to run the training loop

Go into the command prompt, and type in:

"python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy"

Then run it. It will train, and as it trains, it will display the metrics.
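
If it is more convenient to stay inside the notebook, the same command can be launched through Python's `subprocess` module. This is a minimal sketch rather than the method used here: `build_train_command` and `run_training` are hypothetical helper names, and the paths are simply the ones from the command above.

```python
import subprocess
import sys

def build_train_command(config='config.cfg', output='./output',
                        train='./train.spacy', dev='./dev.spacy'):
    # same arguments as the command-prompt call, expressed as an argument list
    return [sys.executable, '-m', 'spacy', 'train', config,
            '--output', output,
            '--paths.train', train,
            '--paths.dev', dev]

def run_training():
    # check=True raises CalledProcessError if the training process exits with an error
    subprocess.run(build_train_command(), check=True)

command = build_train_command()
print(' '.join(command[1:]))
```

Calling `run_training()` starts the training; using `sys.executable` keeps the call inside the interpreter (and virtual environment) that runs the notebook.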

## Evaluate the Performance of Our Model

We will now evaluate the performance of our model on our testing data.

First load the model that we trained:

In [20]:
nlp_best = spacy.load('./output/model-best')

Now load the testing data in the format as above.

In [21]:
df.loc[df['dataset'] == 'test']

Unnamed: 0,transaction_descriptor,store_number,dataset
200,IN N OUT BURGER 242,242,test
201,BP 9442088 LIBERTYVILLE B,9442088,test
202,JCPENNEY 1419,1419,test
203,ROSS STORES 1019,1019,test
204,WM SUPERCENTER 38,38,test
...,...,...,...
295,MCDONALD S F2151,F2151,test
296,NST BEST BUY 1403 332411,1403,test
297,CVS PHARMACY 6689,6689,test
298,BANANA REPUBLIC 8109,8109,test


In [22]:
test_data = load_spacy_data(df.loc[df['dataset']=='test'], dataset='test')

test_data

[(' IN N OUT BURGER 242', {'entities': [(17, 20, 'store_num')]}),
 ('BP 9442088 LIBERTYVILLE B', {'entities': [(3, 10, 'store_num')]}),
 (' JCPENNEY 1419', {'entities': [(10, 14, 'store_num')]}),
 (' ROSS STORES 1019', {'entities': [(13, 17, 'store_num')]}),
 (' WM SUPERCENTER 38', {'entities': [(16, 18, 'store_num')]}),
 (' TUESDAY MORNING 673 6', {'entities': [(17, 20, 'store_num')]}),
 (' IHOP 629 WHITE HOUSE TN', {'entities': [(6, 9, 'store_num')]}),
 (' LBOUTLETS 4249 1475 N BUR', {'entities': [(11, 15, 'store_num')]}),
 (' WINN DIXIE 2505 VALRICO FL 3454 ', {'entities': [(12, 16, 'store_num')]}),
 (' BURLINGTON STORES 825', {'entities': [(19, 22, 'store_num')]}),
 (' WM SUPERCENTER 2923', {'entities': [(16, 20, 'store_num')]}),
 (' BUFFALO WILD WINGS 58 CARSON CITY NV',
  {'entities': [(20, 22, 'store_num')]}),
 (' BOB EVANS REST 2039', {'entities': [(16, 20, 'store_num')]}),
 (' JIMMY JOHNS 382 E', {'entities': [(13, 16, 'store_num')]}),
 (' PENSKE TRK LSG 12260', {'entities': [

Let's do an informal test on one of our entries!

In [23]:
test_1 = test_data[0]

test_1

(' IN N OUT BURGER 242', {'entities': [(17, 20, 'store_num')]})

In [24]:
# apply the model to the text of our first test data

test1_doc = nlp_best(test_1[0])

for ent in test1_doc.ents:
    print(ent.text, ent.label_)

242 store_num


It works on our first test entry. Let us look at a few more interesting ones...

In [25]:
test_2 = test_data[1]

test_2

('BP 9442088 LIBERTYVILLE B', {'entities': [(3, 10, 'store_num')]})

In [26]:
test2_doc = nlp_best(test_2[0])

if len(test2_doc.ents) == 0:
    print('No entities predicted')
for ent in test2_doc.ents:
    print(ent.text, ent.label_)

9442088 store_num


The data cleaning we did earlier was extremely helpful here: originally, this entry contained "9442088LIBERTYVILLE", which would have left the store number fused to the city name and hard to isolate.

In [27]:
test_3 = test_data[7]

test_3

(' LBOUTLETS 4249 1475 N BUR', {'entities': [(11, 15, 'store_num')]})

In [28]:
test3_doc = nlp_best(test_3[0])
# collect every predicted entity; record 'False' if the model predicted nothing
li1 = [(ent.text, ent.label_) for ent in test3_doc.ents]
if len(li1) == 0:
    li1.append('False')
li1

[('4249', 'store_num')]

In cases where there are multiple numbers, this model seems to be able to tell them apart.

For example, in the descriptor we just tested:

"LBOUTLETS 4249 1475 N BUR"

Our model labeled only **4249**, the number that directly follows the store name, and left out **1475**, which is most likely the number that comes before a street address.

### Testing Pipeline

Let us run inference with our model on the entire testing set.

In [29]:
# function for quick testing
# test_data - of the form generated by load_spacy_data function

def test_ner(test_data, model):
    li_inference_results = []
    # iterate through texts and apply model
    for ind, (text, annotations) in enumerate(test_data):
        doc_test = model(text)
        
        # if the model doesn't predict anything, throw a blank output into the list
        # otherwise append the prediction + label
        if len(doc_test.ents) == 0:
            
            # we can hardcode 'store_num' as label because we're only looking for store_num labels;
            # the tuple keeps the same 5 fields as the branch below so that later unpacking works
            li_inference_results.append((ind, text, '', 'store_num', len(doc_test.ents)))
        else:
            for ent in doc_test.ents:
                li_inference_results.append((ind, text, ent.text, ent.label_, len(doc_test.ents)))
    return li_inference_results

The output of our model is of the form:

(the input index that the label corresponds to, the input (descriptor), the prediction (store number), the label ('store_num'), the number of predictions that the model made for the descriptor)
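
To keep the five positions straight, the tuple layout can also be written down as a named tuple. This is only an illustrative sketch: `InferenceRow` is a hypothetical name that the notebook does not use, and the three rows are copied from the output of `test_ner` in this notebook.

```python
from typing import NamedTuple

class InferenceRow(NamedTuple):
    index: int            # position of the descriptor in the test data
    descriptor: str       # cleaned input text
    prediction: str       # predicted store number ('' if nothing was predicted)
    label: str            # always 'store_num' in this notebook
    num_predictions: int  # number of entities predicted for this descriptor

rows = [
    InferenceRow(7, ' LBOUTLETS 4249 1475 N BUR', '4249', 'store_num', 1),
    InferenceRow(8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '2505', 'store_num', 2),
    InferenceRow(8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 'store_num', 2),
]

# indices of descriptors with zero or several predictions
ambiguous = sorted({row.index for row in rows if row.num_predictions != 1})
print(ambiguous)
```

Because a named tuple still unpacks like a plain tuple, loops of the form `for index, descriptor, prediction, _, len_ in ...` keep working unchanged.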

In [30]:
inference_output = test_ner(test_data, nlp_best)

inference_output

[(0, ' IN N OUT BURGER 242', '242', 'store_num', 1),
 (1, 'BP 9442088 LIBERTYVILLE B', '9442088', 'store_num', 1),
 (2, ' JCPENNEY 1419', '1419', 'store_num', 1),
 (3, ' ROSS STORES 1019', '1019', 'store_num', 1),
 (4, ' WM SUPERCENTER 38', '38', 'store_num', 1),
 (5, ' TUESDAY MORNING 673 6', '673', 'store_num', 1),
 (6, ' IHOP 629 WHITE HOUSE TN', '629', 'store_num', 1),
 (7, ' LBOUTLETS 4249 1475 N BUR', '4249', 'store_num', 1),
 (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '2505', 'store_num', 2),
 (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 'store_num', 2),
 (9, ' BURLINGTON STORES 825', '825', 'store_num', 1),
 (10, ' WM SUPERCENTER 2923', '2923', 'store_num', 1),
 (11, ' BUFFALO WILD WINGS 58 CARSON CITY NV', '58', 'store_num', 1),
 (12, ' BOB EVANS REST 2039', '2039', 'store_num', 1),
 (13, ' JIMMY JOHNS 382 E', '382', 'store_num', 1),
 (14, ' PENSKE TRK LSG 12260', '12260', 'store_num', 1),
 (15, ' AEROPOSTALE 864', '864', 'store_num', 1),
 (16, ' GIANT 338', '338', 'store_nu

In [31]:
len(inference_output)

107

Let us see which ones were either blank predictions, or had multiple predictions.

In [32]:
for index, descriptor, prediction, _, len_ in inference_output:
    if len_ != 1:
        print(f'The descriptor at entry {index} is {descriptor}')
        print(f'The predicted store number for entry {index} is {prediction}\n')

The descriptor at entry 8 is  WINN DIXIE 2505 VALRICO FL 3454 
The predicted store number for entry 8 is 2505

The descriptor at entry 8 is  WINN DIXIE 2505 VALRICO FL 3454 
The predicted store number for entry 8 is 3454

The descriptor at entry 22 is  BP 8644346ES 30 B96
The predicted store number for entry 22 is 8644346ES

The descriptor at entry 22 is  BP 8644346ES 30 B96
The predicted store number for entry 22 is 30

The descriptor at entry 28 is  WINN DIXIE 2454 SEFFNER FL 1033 
The predicted store number for entry 28 is 2454

The descriptor at entry 28 is  WINN DIXIE 2454 SEFFNER FL 1033 
The predicted store number for entry 28 is 1033

The descriptor at entry 33 is  NAVY EXCHANGE 50161 3
The predicted store number for entry 33 is 50161

The descriptor at entry 33 is  NAVY EXCHANGE 50161 3
The predicted store number for entry 33 is 3

The descriptor at entry 36 is  CASEYS GEN STORE 2597 SLOAN IA51055
The predicted store number for entry 36 is 2597

The descriptor at entry 36 is  

As we see here, our model still gets hung up on entries that contain more than one number, especially numbers that belong to an address. The model relies on context, and a location (like "TAMPA FL") looks much like a store name, so **the model cannot make a definitive decision** on whether a number (2340) following a location (TAMPA FL) **is a store number or an address number**. One way to go further would be to annotate additional labels that explicitly distinguish address numbers from store numbers.
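
As a rough sketch of what such an annotation could look like, the entry below labels both numbers of one test descriptor. The label name `other_num` is hypothetical, `span_of` is an illustrative helper, and treating 3454 as a non-store number is an assumption based on the ground truth for this entry (2505).

```python
descriptor = ' WINN DIXIE 2505 VALRICO FL 3454 '

def span_of(text, token, label):
    # illustrative helper: character span of the first occurrence of token
    start = text.index(token)
    return (start, start + len(token), label)

annotation = (descriptor, {'entities': [
    span_of(descriptor, '2505', 'store_num'),
    span_of(descriptor, '3454', 'other_num'),  # hypothetical second label
]})

for start, end, label in annotation[1]['entities']:
    print(descriptor[start:end], label)
```

With a second label, the model would be trained to account for the trailing number explicitly instead of having to choose between `store_num` and no label at all.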


### Score our Model on Our Testing Set

We will score the model on two criteria:

- Accuracy: Out of all of the ground truths (so the 100 store number labels corresponding to descriptors), how many of them did the model predict correctly? (if the **model predicted more than 1 label for an input**, we will regard it as **incorrect**)

- Precision: Out of all of the model's predictions (so the 107 predictions that were outputted above), how many of those were correct?

**Note: Precision and recall are typically used in classification, where they are evaluated for each output class. Here we apply the same idea to the extracted store numbers.**

In [41]:
def score_model(inference_output, ground_truth_df, return_not_matched=False):
    metrics_dict = dict()
    not_matched = []
    # initialize number correct, number that model matched
    num_correct, num_matched = 0, 0
    for _, descriptor, prediction, _, num_predictions in inference_output:
        eval_row = ground_truth_df.loc[ground_truth_df['transaction_descriptor'] == descriptor]['store_number']
        
        # the prediction is matched if it matches the store_num value of the ground truth for the descriptor
        if eval_row.values[0] == prediction:
            num_matched += 1
            
            # the prediction can be CORRECT only if the store_num matches, and the model only took 1 attempt
            if num_predictions == 1:
                num_correct += 1
        else:
            not_matched.append((descriptor, prediction, num_predictions))
    
    metrics_dict['accuracy'] = num_correct / ground_truth_df.shape[0]
    metrics_dict['precision'] = num_matched / len(inference_output)
    if not return_not_matched:
        return metrics_dict
    else:
        return metrics_dict, not_matched

In [42]:
test_df = df.loc[df['dataset'] == 'test']
metrics = score_model(inference_output, test_df)

accuracy_model, precision_model = metrics['accuracy'], metrics['precision']
print(f'Our model accuracy is {accuracy_model*100:.3f}%')
print(f'Our model precision is {precision_model*100:.3f}%')

Our model accuracy is 90.000%
Our model precision is 89.720%



## Failure Analysis

**Let's examine the entries that our model got wrong**:

In [43]:
_, not_matched = score_model(inference_output, test_df, return_not_matched=True)

not_matched

[(' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 2),
 (' BP 8644346ES 30 B96', '8644346ES', 2),
 (' BP 8644346ES 30 B96', '30', 2),
 ('NNT POLO RL WRENTHA 130571 ', '130571', 1),
 (' WINN DIXIE 2454 SEFFNER FL 1033 ', '1033', 2),
 (' NNT SEARS HOMETOWN 862751', '862751', 1),
 (' NAVY EXCHANGE 50161 3', '3', 2),
 (' CASEYS GEN STORE 2597 SLOAN IA51055', 'IA51055', 2),
 (' NST BEST BUY 48 72393', '72393', 2),
 (' FOOTACTION 57331 TAMPA FL 2340 ', '2340', 2),
 (' SUBWAY 32128', '32128', 1)]
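
The reported metrics can be cross-checked against this list with simple arithmetic. The sketch below copies the 11 unmatched predictions from the output above and uses the counts stated earlier (107 predictions, 100 test entries). It assumes that every descriptor the model handled incorrectly appears in `not_matched` at least once, which is consistent with the reported accuracy.

```python
not_matched = [
    (' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 2),
    (' BP 8644346ES 30 B96', '8644346ES', 2),
    (' BP 8644346ES 30 B96', '30', 2),
    ('NNT POLO RL WRENTHA 130571 ', '130571', 1),
    (' WINN DIXIE 2454 SEFFNER FL 1033 ', '1033', 2),
    (' NNT SEARS HOMETOWN 862751', '862751', 1),
    (' NAVY EXCHANGE 50161 3', '3', 2),
    (' CASEYS GEN STORE 2597 SLOAN IA51055', 'IA51055', 2),
    (' NST BEST BUY 48 72393', '72393', 2),
    (' FOOTACTION 57331 TAMPA FL 2340 ', '2340', 2),
    (' SUBWAY 32128', '32128', 1),
]

num_predictions = 107   # len(inference_output)
num_test_entries = 100  # rows in the test set

# precision: predictions that matched the label, over all predictions
num_matched = num_predictions - len(not_matched)
precision = num_matched / num_predictions

# accuracy: descriptors with exactly one prediction that is also correct
failed_descriptors = {descriptor for descriptor, _, _ in not_matched}
accuracy = (num_test_entries - len(failed_descriptors)) / num_test_entries

print(f'matched predictions: {num_matched}')
print(f'precision: {precision*100:.3f}%')
print(f'accuracy: {accuracy*100:.3f}%')
```

The 11 unmatched predictions belong to 10 distinct descriptors, which is how 96 of 107 predictions and 90 of 100 descriptors lead to the two percentages reported above.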

### Take a Deep Look at Some Cases

1) Case 1

In [50]:
test_df.loc[test_df['transaction_descriptor'] == 'NNT POLO RL WRENTHA 130571 ']

Unnamed: 0,transaction_descriptor,store_number,dataset
223,NNT POLO RL WRENTHA 130571,13057,test


This one is an unfortunate error: without knowing the label beforehand, there is no way of knowing that we were not supposed to keep the 1 at the end.

2) Case 2

In [51]:
test_df.loc[test_df['transaction_descriptor'] == ' BP 8644346ES 30 B96']

Unnamed: 0,transaction_descriptor,store_number,dataset
222,BP 8644346ES 30 B96,8644346,test


This is also a confusing case because there were 3 separate alphanumeric fields that could have been interpreted as a "store number". The ambiguity is made worse by the fact that it is unclear what "BP" is. Many of the other descriptors had full or longer store names, so the model was likely able to deduce the context in those cases. Note also that our cleaning step only splits letters from digits when the alphabetic part is longer than 2 characters, so "8644346ES" stayed fused.

3) Case 3

In [53]:
test_df.loc[test_df['transaction_descriptor'] == ' NNT SEARS HOMETOWN 862751']

Unnamed: 0,transaction_descriptor,store_number,dataset
231,NNT SEARS HOMETOWN 862751,8627,test


Just like Case 1, this is an unfortunate case of the descriptor itself being unclear. It is debatable whether even a human, analyzing these one by one, could extract this correctly.

4) Case 4

In [55]:
test_df.loc[test_df['transaction_descriptor'] == ' SUBWAY 32128']

Unnamed: 0,transaction_descriptor,store_number,dataset
292,SUBWAY 32128,3212,test


Case 4 is similar to Cases 1 and 3: the labeled store number is only a prefix of the number that appears in the descriptor.
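
Cases 1, 3, and 4 share one pattern that is easy to check programmatically: the ground-truth label is a strict prefix of the predicted number. The sketch below uses the prediction and label pairs from those cases; `is_label_prefix` is a hypothetical helper name.

```python
# (prediction, ground-truth store_number) pairs from Cases 1, 3, and 4
failures = [
    ('130571', '13057'),
    ('862751', '8627'),
    ('32128', '3212'),
]

def is_label_prefix(prediction, truth):
    # True when the prediction is the label followed by extra characters
    return prediction != truth and prediction.startswith(truth)

extra_chars = []
for prediction, truth in failures:
    if is_label_prefix(prediction, truth):
        extra_chars.append(prediction[len(truth):])
        print(f'{prediction}: label {truth} plus extra characters {extra_chars[-1]!r}')
```

A check like this can separate failures caused by ambiguous labels from failures where the model picked a different token altogether.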

## Conclusion

In conclusion, we see that spaCy's *Embed, Encode, Attend, Predict* model is significantly better than a traditional RNN (even with the many-to-many LSTM architecture). Starting from a blank (randomly initialized, untrained) instance, we obtained an **accuracy of 90.00%** and a **precision of 89.72%**! Although these results are not optimal (we would like to see closer to 95%-98% if possible), we also have to note that our training, validation, and testing samples were limited in both size (*100 samples each*) and coverage of edge cases, which restricts how much the model could learn.

Furthermore, our analysis of the failed cases shows that several of them would have been extremely difficult to predict **without a priori knowledge** of the true store numbers (see *Cases 1, 3, and 4 in our failure analysis*). Another advantage of this model is that, when it is unsure of the label, it often predicts two candidates. This is useful because many machine learning workflows still require post-processing of model outputs: when the model offers several guesses, we can filter out the incorrect ones with additional models or other criteria based on context.
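
As one example of such a post-processing criterion, the sketch below keeps only the first prediction for each descriptor. This is a heuristic assumed here for illustration, not something evaluated in this notebook: it would pick the labeled store number for entries such as the Winn Dixie and Navy Exchange ones, but it would not help with "BP 8644346ES 30 B96", where neither prediction matches the label. The rows are copied from the inference output, and `keep_first_prediction` is a hypothetical helper name.

```python
rows = [
    (0, ' IN N OUT BURGER 242', '242', 'store_num', 1),
    (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '2505', 'store_num', 2),
    (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 'store_num', 2),
    (33, ' NAVY EXCHANGE 50161 3', '50161', 'store_num', 2),
    (33, ' NAVY EXCHANGE 50161 3', '3', 'store_num', 2),
]

def keep_first_prediction(inference_rows):
    # assumed heuristic: the first predicted span of a descriptor is the store number
    first = {}
    for index, _, prediction, _, _ in inference_rows:
        first.setdefault(index, prediction)  # later predictions for the same index are ignored
    return first

filtered = keep_first_prediction(rows)
print(filtered)
```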