**Name: Andrew Chang**

**email: andrewjych@gmail.com**

**For: Affinity Solutions**

In [1]:
# dependencies for part 1
import re

# dependencies for part 2
import pandas as pd
import numpy as np
import tensorflow as tf

# dependencies for part 3
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# Store Number Entity Extraction

## General Layout of This Notebook

I will lay out all of my proposed methods and the experiments I performed with them. **Note: this notebook records my thought process and workflow; if you are only interested in the final results, please go to Section 4.x (pages 23-36).**

As a simpler example of entity extraction, we will attempt to **extract the store numbers out of a corpus containing a business/organization and a specific store number**. We will attempt 3 methods of extracting the store number in this notebook:

1) Use regular expressions and don't use machine learning

2) Build an RNN-based classifier, in TensorFlow, to predict the store number

3) Use a multi-stage entity extraction model, from spaCy, to predict the correct entity, and extract it

We will discuss the background of each method, and the pros and cons of each.

# 1) Regex Parsing

Although the end goal of this notebook is to show that inferencing with a machine learning model can do this entity extraction task, it is good to explore non-ML options before we pursue any statistical methods.

**Regular Expressions** are a good first choice whenever we have to do some formatting of alphanumerical strings. We can identify a few cases that we see in our data, and we can appropriately filter out what we see as "store numbers".

In [2]:
# LOAD DATA

data = pd.read_csv(r'C:\Users\Chang\Desktop\Work Things\Internships\Affinity Solutions\Summer Internship - Homework Exercise.csv')


In [3]:
def clean_spaces(s):
    return re.sub(' +', ' ', s)

# split on punctuations because the vectorizer will not understand any number with it
def split_punc(s):
    return re.sub(r'([^\w\s]|_)',' ',s)

def split_leading_zeros(s):
    return re.sub(r'\b0+', '', s)

# split numbers from words

def split_num_from_word(s):
    # default to (0, 0): a descriptor with nothing to split only gains leading
    # spaces, which clean_spaces collapses afterwards
    start_, end_ = 0, 0
    for item in s.split(' '):
        # only consider tokens that mix letters and digits
        if not item.isalpha() and item.isalnum():
            split_num_word = re.split(r'(\d+)', item)
            # split only when the letters before or after the digits are 3+ characters long
            if len(split_num_word[-1]) > 2 or len(split_num_word[0]) > 2:
                start_ = s.index(split_num_word[1])
                end_ = start_ + len(split_num_word[1])

    return s[:start_] + ' ' + s[start_:end_] + ' ' + s[end_:]

In [4]:
transaction_descriptor = data['transaction_descriptor']

data['transaction_descriptor'] = list(map(clean_spaces, map(split_leading_zeros, map(split_num_from_word, map(split_punc, transaction_descriptor)))))

In [5]:
# display all descriptors and store numbers

list(zip(data['transaction_descriptor'], data['store_number']))

[(' DOLRTREE 2257 22574 ROSWELL', '2257'),
 (' AUTOZONE 3547', '3547'),
 (' TGI FRIDAYS 1485 ', '1485'),
 (' BUFFALO WILD WINGS 3', '3'),
 (' J CREW 568 ', '568'),
 (' KRISPY KREME 40 GREENVILLE SC', '40'),
 (' FIVE GUYS MN 1847 ECOM 612 339 9733 MN', '1847'),
 (' CASEYS GEN STORE 2650', '2650'),
 (' HUDDLE HOUSE 535', '535'),
 (' JAMBA JUICE 1305', '1305'),
 (' ANN TAYLOR FACTORY 2202', '2202'),
 (' Subway 26824', '26824'),
 (' MARSHALLS 688', '688'),
 (' OREILLY AUTO 4681', '4681'),
 (' TA 227 BARSTOW REST', '227'),
 (' SONIC 3207', '3207'),
 (' HY VEE 1040', '1040'),
 (' MCDONALD S F1013', 'F1013'),
 (' CHEVRON 207812', '207812'),
 (' EXPRESS 920', '920'),
 (' MCDONALD S F16829', 'F16829'),
 (' CHEVRON 208998 Q61', '208998'),
 (' STARBUCKS STORE 49134', '49134'),
 (' CHEVRON 302900', '302900'),
 (' CIRCLE K 1251', '1251'),
 (' NST BEST BUY 479 2610', '479'),
 (' HOLIDAY STNSTORE 354', '354'),
 (' DOMINO S 7430', '7430'),
 (' NST ROSS STORE 483280353', '483280353'),
 (' MCDONALD S F5

We see that regular expressions have allowed us to clean the descriptors up a lot. However, we must discuss why this is neither a scalable nor a flexible solution for our task:

## Pros of Regular Expressions/Non-ML Methods in This Task

- Extremely simple implementation: each regular expression here is a one-liner, yet together they clean up the descriptors substantially
- We do not have to train and tune a model; this is a big deal because training models (especially production-worthy models) is time-intensive, and even resource-heavy depending on the scale of the task

## Cons of Regular Expressions/Non-ML Methods in This Task

- Too rigid: regular expressions are fixed queries, and they handle only a specific subset of all the formatting cases that could arise in this type of task.
- Not scalable: this is related to the previous point. If our dataset had 300k descriptors (or more) instead of 300, it would contain far more formatting variants, and maintaining hand-written string rules for all of them would quickly become unmanageable.
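To make the rigidity point concrete, here is a minimal sketch of a purely rule-based extractor. The rule itself (take the first token made of digits, optionally prefixed by one capital letter) and the name first_number are assumptions made for illustration, not the approach used in this notebook, and the last descriptor is a hypothetical example that does not come from the dataset.

```python
import re

# Hypothetical rule: a store number is the first token made of digits,
# optionally prefixed by a single capital letter (as in "F1013").
STORE_NUM = re.compile(r'\b[A-Z]?\d+\b')

def first_number(descriptor):
    """Return the first number-like token, or None if there is none."""
    match = STORE_NUM.search(descriptor)
    return match.group(0) if match else None

# Descriptors taken from the cleaned output above
print(first_number(' DOLRTREE 2257 22574 ROSWELL'))
print(first_number(' MCDONALD S F1013'))

# Hypothetical descriptor (not from the dataset): the brand name itself
# starts with a number, so the rule returns '7' instead of '32145'.
print(first_number(' 7 ELEVEN 32145'))
```

The rule works on the formats it was written for, but a single unseen format is enough to break it, and every such case would need its own additional rule.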

**Note: Hardcoding rules based on "testing" data is also not good practice in a data science/machine learning workflow, so this would not be something that we pursue in that context anyways.**

# 2) Recurrent Neural Network-Based Model

**Classify Tag of Data** $\rightarrow$ **Attach Classification to Each Token** $\rightarrow$ **Extract Only Tokens with tags that have "Store Number" Class**

This will require us to train an entity recognition model in the first part, and then do significant post-processing to obtain the result that we want.
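As a rough illustration of the last stage of this pipeline, the sketch below keeps only the tokens whose tag is "Numeric". The name extract_tagged_tokens is hypothetical; the tokens and tag IDs are those of the first descriptor in this notebook, whereas a real run would use the model's predicted tags.

```python
# Tag IDs as used in this notebook: 0 = Text, 1 = Numeric
NUMERIC_TAG_ID = 1

def extract_tagged_tokens(tokens, tag_ids, wanted=NUMERIC_TAG_ID):
    """Return the tokens whose tag ID equals `wanted`, in their original order."""
    return [tok for tok, tag in zip(tokens, tag_ids) if tag == wanted]

tokens = ['DOLRTREE', '2257', '22574', 'ROSWELL']
tag_ids = [0, 1, 1, 0]

candidates = extract_tagged_tokens(tokens, tag_ids)
print(candidates)  # ['2257', '22574']
```

Two candidates remain for this descriptor, which is why the post-processing step still has to decide which "number" is the actual store number.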

## Preprocess Data

**Clean the descriptors using regular expressions**. There are a lot of numbers that are not recoverable because of weird formatting.

In [6]:
df1 = pd.read_csv(r'C:\Users\Chang\Desktop\Work Things\Internships\Affinity Solutions\Summer Internship - Homework Exercise.csv')

In [7]:
def clean_spaces(s):
    return re.sub(' +', ' ', s)

# split on punctuations because the vectorizer will not understand any number with it
def split_punc(s):
    return re.sub(r'([^\w\s]|_)',' ',s)

def split_leading_zeros(s):
    return re.sub(r'\b0+', '', s)

In [8]:
transaction_descriptor = df1['transaction_descriptor']

df1['transaction_descriptor'] = list(map(clean_spaces, map(split_leading_zeros, map(split_punc, transaction_descriptor))))

## Make a New DataFrame with Inputs Split by Token

We will make sure to preserve which entry it was, and to preserve the tokens themselves.

In [9]:
df1['transaction_descriptor'].str.split(' ').explode()

0      DOLRTREE
0          2257
0         22574
0       ROSWELL
1      AUTOZONE
         ...   
298    REPUBLIC
298        8109
299      BOSTON
299      MARKET
299         443
Name: transaction_descriptor, Length: 1057, dtype: object

In [10]:
df_nn = pd.DataFrame(data={})

df_nn['Tokens'] = df1['transaction_descriptor'].str.split(' ').explode()

df_nn['Descriptor #'] = df_nn.index + 1

df1 = df_nn

### Method

The classes we will make here are **word vs number**. Our plan is to tag anything that seems like a number as a number. This will help accomplish the task (in theory) because we will be splitting every descriptor token-by-token and, in a perfect circumstance, the model would flag only the store number as a "number" and the rest of the tokens as "not a number".

Once we have deduced which descriptors have a "number" or a "word", we can further filter out which "numbers" are actually store numbers, and which ones are not. Because we have preserved the index, we can easily recover whether each token instance corresponds to a train/test/val and which store number it corresponds to.

We will now tag this data using regular expressions.

In [11]:
# check if string has either letters, numbers, or has both
# if string has only letters, then it is just going to be labeled "Text"
# if a string has only numbers or both letters and numbers, it is going to be labeled "Numeric"

def check_string(arr_strings):
    li_tag = []
    for s in arr_strings:
        # alphanumeric but not purely alphabetic: digits only, or letters mixed with digits
        if s.isalnum() and not s.isalpha():
            li_tag.append('Numeric')
        else:
            li_tag.append('Text')
            
    return li_tag

In [12]:
df1['Tag'] = check_string(df1['Tokens'])

df1

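| | Tokens | Descriptor # | Tag |
|---|---|---|---|
| 0 | DOLRTREE | 1 | Text |
| 0 | 2257 | 1 | Numeric |
| 0 | 22574 | 1 | Numeric |
| 0 | ROSWELL | 1 | Text |
| 1 | AUTOZONE | 2 | Text |
| ... | ... | ... | ... |
| 298 | REPUBLIC | 299 | Text |
| 298 | 8109 | 299 | Numeric |
| 299 | BOSTON | 300 | Text |
| 299 | MARKET | 300 | Text |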


## Start Vectorizing Text

We will start the process of vectorizing here. We will do the vectorization manually, as this is roughly equivalent to what TensorFlow's TextVectorization layer does.

In [13]:
# get a mapping of numerical id's to words and tags

def dict_map(data, token):
    token_id = dict()
    id_token = dict()
    
    if token == 'Tokens':
        vocab = list(data['Tokens'].unique())
    else:
        vocab = list(data['Tag'].unique())
    
    id_token = {ide: tok for ide, tok in enumerate(vocab)}
    token_id = {tok: ide for ide, tok in enumerate(vocab)}
    
    return token_id, id_token

word_to_id, id_to_word = dict_map(df1, 'Tokens')

tag_to_id, id_to_tag = dict_map(df1, 'Tag')

In [15]:
tag_to_id

{'Text': 0, 'Numeric': 1}

In [16]:
df1['Token ID'] = df1['Tokens'].map(word_to_id)

df1['Tag ID'] = df1['Tag'].map(tag_to_id)

df1

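| | Tokens | Descriptor # | Tag | Token ID | Tag ID |
|---|---|---|---|---|---|
| 0 | DOLRTREE | 1 | Text | 0 | 0 |
| 0 | 2257 | 1 | Numeric | 1 | 1 |
| 0 | 22574 | 1 | Numeric | 2 | 1 |
| 0 | ROSWELL | 1 | Text | 3 | 0 |
| 1 | AUTOZONE | 2 | Text | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 298 | REPUBLIC | 299 | Text | 599 | 0 |
| 298 | 8109 | 299 | Numeric | 600 | 1 |
| 299 | BOSTON | 300 | Text | 199 | 0 |
| 299 | MARKET | 300 | Text | 200 | 0 |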


So we see here that **the classification problem becomes whether the token is numeric (positive), or the token is text-like (negative)**. We will now group this by the descriptor # (which we have judiciously preserved throughout our preprocessing steps). 

We will vectorize automatically by aggregating the entries into a list.

In [17]:
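# select the columns with a list (double brackets); selecting with a bare tuple of names is deprecated in pandas
df1_final = df1.groupby(['Descriptor #'], as_index = False)[['Tokens', 'Tag', 'Token ID', 'Tag ID']].agg(list)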

df1_final


| | Descriptor # | Tokens | Tag | Token ID | Tag ID |
|---|---|---|---|---|---|
| 0 | 1 | [DOLRTREE, 2257, 22574, ROSWELL] | [Text, Numeric, Numeric, Text] | [0, 1, 2, 3] | [0, 1, 1, 0] |
| 1 | 2 | [AUTOZONE, 3547] | [Text, Numeric] | [4, 5] | [0, 1] |
| 2 | 3 | [TGI, FRIDAYS, 1485, ] | [Text, Text, Numeric, Text] | [6, 7, 8, 9] | [0, 0, 1, 0] |
| 3 | 4 | [BUFFALO, WILD, WINGS, 3] | [Text, Text, Text, Numeric] | [10, 11, 12, 13] | [0, 0, 0, 1] |
| 4 | 5 | [J, CREW, 568, ] | [Text, Text, Numeric, Text] | [14, 15, 16, 9] | [0, 0, 1, 0] |
| ... | ... | ... | ... | ... | ... |
| 295 | 296 | [MCDONALD, S, F2151] | [Text, Text, Numeric] | [60, 61, 594] | [0, 0, 1] |
| 296 | 297 | [NST, BEST, BUY, 1403, 332411] | [Text, Text, Text, Numeric, Numeric] | [76, 77, 78, 595, 596] | [0, 0, 0, 1, 1] |
| 297 | 298 | [CVS, PHARMACY, 6689] | [Text, Text, Numeric] | [180, 181, 597] | [0, 0, 1] |
| 298 | 299 | [BANANA, REPUBLIC, 8109] | [Text, Text, Numeric] | [598, 599, 600] | [0, 0, 1] |


Machine learning models require standardized inputs, particularly with regard to input shape. We will need to pad the token IDs and tag IDs of each descriptor to a common length before we do any computations.

In [18]:
li_token_len = []
for token in df1_final['Token ID']:
    li_token_len.append(len(token))

max_token_len = max(li_token_len)
print(f'The maximum token length is {max(li_token_len)}.')

The maximum token length is 9.


In [19]:
li_tag_len = []
for tag in df1_final['Tag ID']:
    li_tag_len.append(len(tag))
    
print(f'The maximum tag length is {max(li_tag_len)}.')

The maximum tag length is 9.


In [20]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [22]:
pad_tokens = pad_sequences(df1_final['Token ID'], maxlen = max_token_len, padding='post', value=0)
pad_tokens

array([[  0,   1,   2, ...,   0,   0,   0],
       [  4,   5,   0, ...,   0,   0,   0],
       [  6,   7,   8, ...,   0,   0,   0],
       ...,
       [180, 181, 597, ...,   0,   0,   0],
       [598, 599, 600, ...,   0,   0,   0],
       [199, 200, 601, ...,   0,   0,   0]])

In [23]:
pad_tag = pad_sequences(df1_final['Tag ID'], maxlen = max_token_len, padding='post', value=0)
pad_tag

array([[0, 1, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

Check that the output of pad_tokens and pad_tag are not weird.

In [24]:
pad_tokens.shape, pad_tag.shape

((300, 9), (300, 9))

They are both uniform in shape. Let us check a few entries.

In [25]:
pad_tokens[0]

array([0, 1, 2, 3, 0, 0, 0, 0, 0])

In [26]:
pad_tag[0]

array([0, 1, 1, 0, 0, 0, 0, 0, 0])

**Note: We padded with 0, which in our word dictionary (the encoding we built earlier) is also the ID of "DOLRTREE". Semantically this collision makes no sense, but in terms of our computation it makes sense to put a 0 there instead of a large non-zero value, which may skew the computations.** What matters is that the padded positions carry the label "Text" (tag ID 0), which is exactly what we did when padding the tags.
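One common way to avoid this collision is to reserve ID 0 for padding and start the real token IDs at 1. The sketch below shows the idea in plain Python; build_vocab, pad_post and PAD_ID are hypothetical names, and this is an alternative encoding, not the one used in the rest of this notebook.

```python
PAD_ID = 0  # reserved for padding: no real token may use this ID

def build_vocab(tokens):
    """Map each distinct token to an ID starting at 1, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab

def pad_post(ids, max_len, pad_id=PAD_ID):
    """Right-pad (or truncate) a list of IDs to exactly max_len entries."""
    return (ids + [pad_id] * max_len)[:max_len]

tokens = ['DOLRTREE', '2257', '22574', 'ROSWELL']
vocab = build_vocab(tokens)
encoded = pad_post([vocab[t] for t in tokens], max_len=9)
print(encoded)  # [1, 2, 3, 4, 0, 0, 0, 0, 0]
```

With 0 reserved in this way, "DOLRTREE" and the padding no longer share an ID, and the Keras Embedding layer's mask_zero=True option becomes usable so that downstream layers can skip the padded positions.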


## Preprocess Our Vectors Even More

We will now perform a little more preprocessing before we begin our training routine. We first one-hot encode the tags, so that every token will correspond to a one-hot encoded label.

In [27]:
from tensorflow.keras.utils import to_categorical

In [28]:
# this will be our y_true

y_true = to_categorical(pad_tag)

In [29]:
y_true.shape

(300, 9, 2)

### Train-Test Split

Typically, we would use train-test split from sklearn, or we could even code it ourselves. For this task, however, explicit training, validation, and testing labels were given, so we will just follow them. We did not shuffle our dataset, so we can simply slice out each set.

In [30]:
# our output label
y_train, y_val, y_test = y_true[:100,:,:], y_true[100:200,:,:], y_true[200:300,:,:]

x_train, x_val, x_test = pad_tokens[:100, :], pad_tokens[100:200, :], pad_tokens[200:300, :]

In [31]:
x_train.shape

(100, 9)

In [32]:
y_train.shape

(100, 9, 2)

## Model Building (Part 1)

We will now build the classifier that will do this classification. We do not have that many data points to train this on, but let us build one from scratch anyways.

The architecture we will be pursuing is a many-to-many LSTM RNN (takes in multiple inputs that have respective outputs).

**Note:** This is a very common architecture for named-entity recognition or part-of-speech tagging.

In [33]:
input_dim = len(list(df1['Tokens'].unique())) + 1
output_dim = 8
input_length = max_token_len
output_units = len(id_to_tag)

In [34]:
print(f'Input Dimensions {input_dim}')
print(f'Output Dimension {output_dim}')
print(f'Input Length {max_token_len}')
print(f'Output Units {output_units}')

Input Dimensions 603
Output Dimension 8
Input Length 9
Output Units 2


In [35]:
# import all necessary things

from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from sklearn.metrics import classification_report

In [36]:
# get all layers to put into model
li_layers = [
    Input(shape=(max_token_len,)),
    Embedding(input_dim=input_dim, output_dim=output_dim),
    Bidirectional(LSTM(units=output_dim, return_sequences=True, name='LSTM1')),
    LSTM(units=output_dim, return_sequences=True, name='LSTM2'),
    TimeDistributed(Dense(units=output_units, activation='softmax'))
]

In [37]:
model1 = Sequential(li_layers)

model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 9, 8)              4824      
                                                                 
 bidirectional (Bidirectiona  (None, 9, 16)            1088      
 l)                                                              
                                                                 
 LSTM2 (LSTM)                (None, 9, 8)              800       
                                                                 
 time_distributed (TimeDistr  (None, 9, 2)             18        
 ibuted)                                                         
                                                                 
Total params: 6,730
Trainable params: 6,730
Non-trainable params: 0
_________________________________________________________________


In [38]:
# the final layer applies a softmax, so the model already outputs probabilities;
# the loss therefore keeps its default from_logits=False

model1.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=[tf.keras.metrics.CategoricalAccuracy(), 
                       tf.keras.metrics.Precision(name='precision'),
                       tf.keras.metrics.Recall(name='recall'),
                       tf.keras.metrics.FalsePositives(name='FP'),
                       tf.keras.metrics.FalseNegatives(name='FN')])

## Fit our Model

In [39]:
# epoch number

epochs = 25

In [40]:
# fit the model

history_1 = model1.fit(x_train, y_train,batch_size=5,epochs=epochs,
                      validation_data = (x_val, y_val))

Epoch 1/25
...
Epoch 25/25


In [41]:
y_pred_train = np.argmax(model1.predict(x_train), axis=2)

y_train_classes = np.argmax(y_train,axis=2)

In [42]:
y_pred_train.shape

(100, 9)

In [43]:
y_train_classes.shape

(100, 9)

In [46]:
print(classification_report(y_train_classes.flatten(), y_pred_train.flatten(), target_names=id_to_tag.values()))

              precision    recall  f1-score   support

        Text       0.87      1.00      0.93       782
     Numeric       0.00      0.00      0.00       118

    accuracy                           0.87       900
   macro avg       0.43      0.50      0.46       900
weighted avg       0.75      0.87      0.81       900



## Evaluate Our Model

At first sight, our accuracy (0.87 on the training set) looks good. But is it legitimate? Let us take a look:

In [47]:
count_df = df1.groupby('Tag').count()[['Tag ID']]

count_df

| Tag | Tag ID |
|---|---|
| Numeric | 372 |
| Text | 685 |


So we see that this is almost a 65-35 split of Text-Numeric tags. Thus, this dataset is very imbalanced! We can only judge the performance of this classification by looking at several metrics (precision, recall, f1); accuracy alone is not meaningful here.

We can get the actual ground truth per class by taking the argmax along the possible classes. Because we one-hot encoded the labels, only the class corresponding to the label has a 1 (everything else is 0), so the argmax recovers the class ID.

In [90]:
# get ground truth

y_true = np.argmax(y_test, axis=2)

In [91]:
# get predictions

# predict on the test inputs so that the predictions line up with y_test
y_hat = model1.predict(x_test)

y_pred = np.argmax(y_hat, axis=2)

In [50]:
# flatten vectors to compare outputs

y_true_flat = y_true.flatten()

y_pred_flat = y_pred.flatten()

In [51]:
# get classification report

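# output_dict=True returns the report as a dictionary instead of a formatted string
dict_class_report = classification_report(y_true_flat, y_pred_flat, target_names = id_to_tag.values(), output_dict=True)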

print(classification_report(y_true_flat, y_pred_flat, target_names = id_to_tag.values()))

              precision    recall  f1-score   support

        Text       0.86      1.00      0.92       772
     Numeric       0.00      0.00      0.00       128

    accuracy                           0.86       900
   macro avg       0.43      0.50      0.46       900
weighted avg       0.74      0.86      0.79       900



## Discussion

We must discuss what occurred here. The model predicted "Text" for every position, so it never identified a single "Numeric" token. Our biggest issue is how imbalanced our dataset is: once the padded positions are included (each of them labeled "Text"), the support in the classification report shows close to an 85-15 split of "Text" vs "Numeric" for our classes. Furthermore, even if our dataset *were* balanced, this model would likely still not perform up to what would be needed for a task like this. In such a task, we would **ideally** want precision and recall of at least 80%-90%, so as to minimize the amount of work needed to correct the mistagged labels (this is a task where we must absolutely get the correct labels).
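The support counts in the test classification report make this concrete. The sketch below computes, in plain Python, the metrics of a classifier that always predicts the majority class; the function name majority_baseline is hypothetical, and the counts 772 and 128 are the "Text" and "Numeric" supports from the report.

```python
def majority_baseline(n_negative, n_positive):
    """Metrics (for the positive class) of a classifier that always predicts negative."""
    tp, fp = 0, 0                      # nothing is ever predicted positive
    fn, tn = n_positive, n_negative    # every positive is missed, every negative is correct
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# "Text" is the negative class and "Numeric" the positive class
accuracy, precision, recall = majority_baseline(n_negative=772, n_positive=128)
print(f'accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}')
```

An accuracy of about 0.86 is therefore reachable without ever finding a store number, which is why precision and recall on the "Numeric" class are the figures to watch.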

Given the limited data we have, using a freshly-instantiated neural network is not an optimal way to approach this task because even for a neural network *as shallow as this one*, we have over 6000 parameters to train from scratch. With the limited data that we have (exactly 100 training samples), it cannot be expected that our neural network will find the right weights to inference properly.
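The parameter count can be checked by hand against the model summary. The sketch below uses the standard parameter formulas for embedding, LSTM and dense layers with the dimensions printed earlier (vocabulary of 603, embedding size 8, 8 LSTM units, 2 output classes); the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # one vector of size `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

counts = {
    'embedding': embedding_params(603, 8),
    'bidirectional': 2 * lstm_params(8, 8),   # forward and backward LSTM
    'LSTM2': lstm_params(16, 8),              # input is the 16-dim bidirectional output
    'time_distributed': dense_params(8, 2),
}
total = sum(counts.values())
print(counts, total)
```

These figures match the summary (4,824 + 1,088 + 800 + 18 = 6,730), to be compared with only 100 training descriptors of 9 positions each, i.e. 900 labeled positions.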

## Pros of LSTM RNN

- Easy to preprocess data and train it; the training process and preprocessing the text to an input suitable for a neural network was not difficult at all
- Simple to translate the business problem to a technical implementation

## Cons of LSTM RNN

- Lots of moving parts; i.e. we have too many parameters for even the simplest models
- For something with a lot of moving parts, we do not have nearly enough data to train all of these parameters properly
- Given the above notes, this will absolutely fail the edge cases unless we give it more data, and plenty of representative samples of many cases; in this problem, this is simply not a luxury that we have

# 3) spaCy's *Embed, Encode, Attend, Predict* Model

Here, we will explore a Natural Language Processing library called **spaCy**. This library is well known for its *embed, encode, attend, predict* architecture and its convenient workflow. The slides at the following link illustrate the architecture of the model and discuss this workflow: https://github.com/explosion/talks/blob/master/2018-04-12_Embed-Encode-Attend-Predict.pdf


## How We Will Use This

We will use a blank model from spaCy, and in particular its **named entity recognition** pipeline component (spaCy models also offer other components for other applications). This is similar in spirit to *transfer learning*: we take a pre-built model architecture and adapt it specifically to our task, although a blank model starts with no trained components.

Our workflow that we will demonstrate is the typical workflow: 

- Preprocess Data (cleaning, reshaping for input into model)
- Set up the model configuration files
- Train the model, and score it on the validation set during training
- Inference it on test data
- Evaluate the metrics and evaluate where the model fails

## Load and Preprocess Data

In [52]:
df = pd.read_csv(r'C:\Users\Chang\Desktop\Work Things\Internships\Affinity Solutions\Summer Internship - Homework Exercise.csv')

In [53]:
# clean awkward spaces
def clean_spaces(s):
    return re.sub(' +', ' ', s)

# split on punctuations because the vectorizer will not understand any number with it
def split_punc(s):
    return re.sub(r'([^\w\s]|_)',' ',s)

# strip the leading zeros from any number
def split_leading_zeros(s):
    return re.sub(r'\b0+', '', s)

# split numbers from words

def split_num_from_word(s):
    # default to (0, 0): a descriptor with nothing to split only gains leading
    # spaces, which clean_spaces collapses afterwards
    start_, end_ = 0, 0
    for item in s.split(' '):
        # only consider tokens that mix letters and digits
        if not item.isalpha() and item.isalnum():
            split_num_word = re.split(r'(\d+)', item)
            # split only when the letters before or after the digits are 3+ characters long
            if len(split_num_word[-1]) > 2 or len(split_num_word[0]) > 2:
                start_ = s.index(split_num_word[1])
                end_ = start_ + len(split_num_word[1])

    return s[:start_] + ' ' + s[start_:end_] + ' ' + s[end_:]

In [54]:
# apply data cleaning we did above

transaction_descriptor = df['transaction_descriptor']

df['transaction_descriptor'] = list(map(clean_spaces, map(split_leading_zeros, map(split_num_from_word, map(split_punc, transaction_descriptor)))))


## Load a spaCy model

Let us see how this model will output before we go into the nitty gritty of training and inferencing it.

In [55]:
nlp_init = spacy.load('en_core_web_sm')

In [86]:
# get predictions on initial run
li_init_predictions = []
for descriptor in df['transaction_descriptor']:
    doc = nlp_init(descriptor)
    for ent in doc.ents:
        li_init_predictions.append((ent.text, ent.label_))
        
li_init_predictions[:5]

[('3', 'CARDINAL'),
 ('CREW', 'ORG'),
 ('40', 'CARDINAL'),
 ('GREENVILLE SC', 'ORG'),
 ('FIVE', 'CARDINAL')]

To illustrate this pre-trained model's many pipelines, we can display it here:

In [59]:
# get components of the pipeline

nlp_init.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [60]:
# run a toy example

# example on an entry that we could possible have

ex1 = 'TOYS R US 5009'

doc = nlp_init(ex1)

for entity in doc.ents:
    print(entity, entity.label_)

US GPE
5009 DATE


So we see that our nlp pipeline from spacy requires a lot of training for our specific examples. It does not have the exact domain-specific knowledge that we need it to have for our task (in particular, it does not have the ability to discern between stores and numbers).

## Train Our Model

We will feed our training set into the model.

Training data must be presented to spacy nlp models in the form:

[(Text to train on, {'entities': [(start index, stop index, 'desired label')]})]
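The start and stop indices are character offsets into the text, with the stop index exclusive. As a minimal sketch, the helper below computes them by matching the store number as a whole token, so that a short store number is not found inside a longer number. The name entity_offsets is hypothetical, and the whole-token matching is an assumption of this sketch rather than the method used in this notebook.

```python
import re

def entity_offsets(descriptor, store_num, label='store_num'):
    """Return (start, end, label) for store_num matched as a whole token, or None."""
    # (?<!\w) and (?!\w) require that no word character touches the match
    pattern = r'(?<!\w)' + re.escape(store_num) + r'(?!\w)'
    match = re.search(pattern, descriptor)
    if match is None:
        return None
    return (match.start(), match.end(), label)

text = ' IN N OUT BURGER 242'
example = (text, {'entities': [entity_offsets(text, '242')]})
print(example)

# Hypothetical descriptor: '12' also occurs inside '1212', but only the
# standalone token is matched.
print(entity_offsets('STORE 1212 12', '12'))
```

Slicing the text with the returned offsets, as in text[17:20], gives back the store number, which is a quick way to sanity-check an annotation.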

**Get our training data into this format**

In [61]:
# function to get our data into above format

def load_spacy_data(data_df, dataset):
    # keep only the rows of the requested split ('train', 'validation' or 'test')
    split_df = data_df.loc[data_df['dataset'] == dataset]

    # get into form we need to train spacy model
    li_data = []
    for store_num, descriptor in zip(split_df['store_number'], split_df['transaction_descriptor']):
        # note: str.index returns the first occurrence of store_num in the descriptor
        start_pos = descriptor.index(store_num)
        end_pos = start_pos + len(store_num)
        li_data.append((descriptor, {'entities':[(start_pos, end_pos, 'store_num')]}))

    return li_data

In [62]:
train_data = load_spacy_data(df.loc[df['dataset'] == 'train'], 'train')

### Load Validation Data

In [63]:
val_data = load_spacy_data(df.loc[df['dataset'] == 'validation'], 'validation')

In [64]:
print(f'We have {len(train_data)} training entries.')

print(f'We have {len(val_data)} validation entries.')

We have 100 training entries.
We have 100 validation entries.


### Load a Blank Model

We will train a blank model (the one we loaded earlier was a pretrained model) to see what baseline results we could obtain.

In [None]:
nlp = spacy.blank('en')

### Load Data into Proper File Format

Unlike many other machine learning libraries, spaCy models (to my knowledge, as of version 3.2) are not trained through a hand-written loop in a Python script; training is driven by a config file. We must therefore save our training and validation data above in spaCy's serialized file format, the **.spacy** file.

In [65]:
def load_dot_spacy(data, dataset, nlp):
    db = DocBin() # create a DocBin object
    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text
        ents = []
        for start, end, label in annot['entities']: # add character indexes
            span = doc.char_span(start, end, label=label, alignment_mode='contract')
            if span is None:
                print('Skipping entity')
            else:
                ents.append(span)
        try:
            doc.ents = ents # label the text with the ents
            db.add(doc)
        except ValueError:
            # setting doc.ents fails on overlapping spans; show the offending entry
            print(text, annot)
    db.to_disk('./' + dataset + '.spacy') # save the docbin object

In [66]:
# will store files named "train.spacy" and "dev.spacy" in our directory with this notebook, we will feed this into
# our config file

load_dot_spacy(train_data, 'train', nlp)

load_dot_spacy(val_data, 'dev', nlp)

### Get the Config File

For spacy 3.x, we must call our training protocol in a config file. We will run the spacy training loop from the command prompt instead of directly from python.

See documentation: https://spacy.io/usage/training

### Instructions to get the config file + autofill the config file

- Activate virtual environment of the project (must have spaCy installed)

- Type in "python -m spacy init fill-config base_config.cfg config.cfg" to the command prompt

- Run it

We will obtain a config file, named "config.cfg".

### Instructions to run the training loop

Go into the command prompt, and type in:

"python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy"

Then run it. It will train, and as it trains, it will display the metrics.

## Evaluate the Performance of Our Model

We will now evaluate the performance of our model on our testing data.

First load the model that we trained:

In [67]:
nlp_best = spacy.load('./output/model-best')

Load the testing data

In [68]:
test_data = load_spacy_data(df.loc[df['dataset']=='test'], dataset='test')

### Some Informal Testing

Let us test entry-by-entry first, to make sure that we are getting reasonable outputs.

In [69]:
test_1 = test_data[0]

test_1

(' IN N OUT BURGER 242', {'entities': [(17, 20, 'store_num')]})

In [70]:
# apply the model to the text of our first test data

test1_doc = nlp_best(test_1[0])

for ent in test1_doc.ents:
    print(ent.text, ent.label_)

242 store_num


In [71]:
test_2 = test_data[1]

test_2

('BP 9442088 LIBERTYVILLE B', {'entities': [(3, 10, 'store_num')]})

In [72]:
test2_doc = nlp_best(test_2[0])

# report explicitly when the model finds no entity
if len(test2_doc.ents) == 0:
    print('No entity found')
for ent in test2_doc.ents:
    print(ent.text, ent.label_)

9442088 store_num


**Note:** The data cleaning we did earlier was extremely helpful because originally, this entry had "9442088LIBERTYVILLE", which would've made the store number indistinguishable.
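For reference, the kind of split described in this note can also be written as a single regular expression. The sketch below inserts a space wherever a digit touches a run of three or more letters; the name split_digits_from_words and the three-letter threshold are assumptions chosen to mirror the spirit of split_num_from_word, not a drop-in replacement for it.

```python
import re

# A boundary between a digit and a run of 3+ letters, in either order.
# Shorter letter runs (as in 'F1013' or '8644346ES') are left untouched.
BOUNDARY = re.compile(r'(?<=\d)(?=[A-Za-z]{3})|(?<=[A-Za-z]{3})(?=\d)')

def split_digits_from_words(s):
    """Insert a space between digit runs and adjacent words of 3+ letters."""
    return BOUNDARY.sub(' ', s)

print(split_digits_from_words('BP 9442088LIBERTYVILLE B'))
print(split_digits_from_words('MCDONALD S F1013'))
print(split_digits_from_words('BP 8644346ES 30 B96'))
```

Unlike the function used above, this version handles every mixed token in a descriptor in one pass, but it is only a sketch and has not been checked against the full dataset.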

In [73]:
test_3 = test_data[7]

test_3

(' LBOUTLETS 4249 1475 N BUR', {'entities': [(11, 15, 'store_num')]})

In [74]:
test3_doc = nlp_best(test_3[0])
# collect every predicted entity; the list stays empty if the model finds none
li1 = [(ent.text, ent.label_) for ent in test3_doc.ents]
li1

[('4249', 'store_num')]

**This is an interesting case that shows us the strength of this model**. In cases where there are multiple numbers, this **model seems to be able to tell the difference between the numbers**. 

For example, in our example just now, we had the descriptor:

"LBOUTLETS 4249 1475 N BUR"

Our model was able to recognize that **4249** was preceded by a store or organization, and that **1475** was simply the number that came before an address.

### Testing

Let us inference our model on our entire testing set.

In [75]:
# function for quick testing
# test_data - of the form generated by load_spacy_data function

def test_ner(test_data, model):
    li_inference_results = []
    # iterate through texts and apply model
    for ind, (text, annotations) in enumerate(test_data):
        doc_test = model(text)
        
        # if the model doesn't predict anything, throw a blank output into the list
        # otherwise append the prediction + label
        if len(doc_test.ents) == 0:
            
            # we can hardcode 'store_num' as label because we're only looking for store_num labels;
            # keep the same five fields as below so that every tuple unpacks the same way
            li_inference_results.append((ind, text, '', 'store_num', len(doc_test.ents)))
        else:
            for ent in doc_test.ents:
                li_inference_results.append((ind, text, ent.text, ent.label_, len(doc_test.ents)))
    return li_inference_results

The output of our model is of the form:

(the input index that the label corresponds to, the input (descriptor), the prediction (store number), label ('store_num'), numbers of predictions that the model made for the descriptor)

In [89]:
inference_output = test_ner(test_data, nlp_best)

inference_output[:30]

[(0, ' IN N OUT BURGER 242', '242', 'store_num', 1),
 (1, 'BP 9442088 LIBERTYVILLE B', '9442088', 'store_num', 1),
 (2, ' JCPENNEY 1419', '1419', 'store_num', 1),
 (3, ' ROSS STORES 1019', '1019', 'store_num', 1),
 (4, ' WM SUPERCENTER 38', '38', 'store_num', 1),
 (5, ' TUESDAY MORNING 673 6', '673', 'store_num', 1),
 (6, ' IHOP 629 WHITE HOUSE TN', '629', 'store_num', 1),
 (7, ' LBOUTLETS 4249 1475 N BUR', '4249', 'store_num', 1),
 (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '2505', 'store_num', 2),
 (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 'store_num', 2),
 (9, ' BURLINGTON STORES 825', '825', 'store_num', 1),
 (10, ' WM SUPERCENTER 2923', '2923', 'store_num', 1),
 (11, ' BUFFALO WILD WINGS 58 CARSON CITY NV', '58', 'store_num', 1),
 (12, ' BOB EVANS REST 2039', '2039', 'store_num', 1),
 (13, ' JIMMY JOHNS 382 E', '382', 'store_num', 1),
 (14, ' PENSKE TRK LSG 12260', '12260', 'store_num', 1),
 (15, ' AEROPOSTALE 864', '864', 'store_num', 1),
 (16, ' GIANT 338', '338', 'store_nu

In [77]:
print(f'The model has made {len(inference_output)} predictions.')

The model has made 107 predictions.


**Note:** The model made more predictions than inputs because, for descriptors where it identified more than one store number, it outputs all of them! This is an advantage (and it is similar to what we were trying to achieve earlier in *part 2*, albeit with more success here).

Let us check which predictions the model was "indecisive" on.

In [78]:
for index, descriptor, prediction, _, len_ in inference_output:
    if len_ != 1:
        print(f'The descriptor at entry {index} is {descriptor}')
        print(f'The predicted store number for entry {index} is {prediction}\n')

The descriptor at entry 8 is  WINN DIXIE 2505 VALRICO FL 3454 
The predicted store number for entry 8 is 2505

The descriptor at entry 8 is  WINN DIXIE 2505 VALRICO FL 3454 
The predicted store number for entry 8 is 3454

The descriptor at entry 22 is  BP 8644346ES 30 B96
The predicted store number for entry 22 is 8644346ES

The descriptor at entry 22 is  BP 8644346ES 30 B96
The predicted store number for entry 22 is 30

The descriptor at entry 28 is  WINN DIXIE 2454 SEFFNER FL 1033 
The predicted store number for entry 28 is 2454

The descriptor at entry 28 is  WINN DIXIE 2454 SEFFNER FL 1033 
The predicted store number for entry 28 is 1033

The descriptor at entry 33 is  NAVY EXCHANGE 50161 3
The predicted store number for entry 33 is 50161

The descriptor at entry 33 is  NAVY EXCHANGE 50161 3
The predicted store number for entry 33 is 3

The descriptor at entry 36 is  CASEYS GEN STORE 2597 SLOAN IA51055
The predicted store number for entry 36 is 2597

The descriptor at entry 36 is  

As we see here, our model still gets hung up on entries where a second number is present in the descriptor (especially numbers in addresses). Numbers next to locations are a particular weakness: the model relies on context, and because locations (like "TAMPA FL") are nouns, just like store names, **the model likely cannot make a definitive decision** on whether a number (2340) following a location (TAMPA FL) **is a store number or an address number**. Something **we could do** to go further would be to **feed additional labels** that distinguish **address numbers from store numbers**.
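As a rough sketch of what such extra labels could look like, the helper below builds a training example in the same `(text, {'entities': [...]})` form used for the test data, but with two labels. The `annotate` helper and the `address_num` label are hypothetical, and treating 2340 as an address number is an assumption made purely for illustration.

```python
def annotate(text, spans):
    """Build a spaCy-style example: (text, {'entities': [(start, end, label), ...]}).

    `spans` is a list of (substring, label) pairs. Each substring is searched
    for from left to right, so repeated numbers are matched in order.
    """
    entities = []
    cursor = 0
    for substring, label in spans:
        start = text.index(substring, cursor)  # raises ValueError if not found
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return (text, {'entities': entities})


example = annotate(
    ' FOOTACTION 57331 TAMPA FL 2340 ',
    # 'address_num' is a hypothetical second label, not one used in this notebook
    [('57331', 'store_num'), ('2340', 'address_num')],
)
print(example)
```

The character offsets follow the same convention as the existing annotations (start inclusive, end exclusive). A plain substring search is only a convenience here; real annotation would need to guard against a number appearing inside a longer token.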

### Score our Model on Our Testing Set

We will score the model on two criteria:

- Accuracy: Out of all of the ground truths (so the 100 store number labels corresponding to descriptors), how many of them did the model predict correctly? (if the **model predicted more than 1 label for an input**, we will regard it as **incorrect**)

- Precision: Out of all of the model's predictions (so the 107 predictions that were outputted above), how many of those were correct?

**Note: Precision and recall are typically used in classification, where we evaluate them for each output class, but we are not predicting classes here.**

In [79]:
def score_model(inference_output, ground_truth_df, return_not_matched=False):
    metrics_dict = dict()
    not_matched = []
    # initialize number correct, number that model matched
    num_correct, num_matched = 0, 0
    for _, descriptor, prediction, _, num_predictions in inference_output:
        eval_row = ground_truth_df.loc[ground_truth_df['transaction_descriptor'] == descriptor]['store_number']
        
        # the prediction is matched if it matches the store_num value of the ground truth for the descriptor
        if eval_row.values[0] == prediction:
            num_matched += 1
            
            # the prediction can be CORRECT only if the store_num matches, and the model only took 1 attempt
            if num_predictions == 1:
                num_correct += 1
        else:
            not_matched.append((descriptor, prediction, num_predictions))
    
    metrics_dict['accuracy'] = num_correct / ground_truth_df.shape[0]
    metrics_dict['precision'] = num_matched / len(inference_output)
    if not return_not_matched:
        return metrics_dict
    else:
        return metrics_dict, not_matched

In [80]:
test_df = df.loc[df['dataset'] == 'test']
metrics = score_model(inference_output, test_df)

accuracy_model, precision_model = metrics['accuracy'], metrics['precision']
print(f'Our model accuracy is {accuracy_model*100:.3f}%')
print(f'Our model precision is {precision_model*100:.3f}%')

Our model accuracy is 90.000%
Our model precision is 89.720%


## Failure Analysis

**Let's examine the entries that our model got wrong**:

In [81]:
_, not_matched = score_model(inference_output, test_df, return_not_matched=True)

not_matched

[(' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 2),
 (' BP 8644346ES 30 B96', '8644346ES', 2),
 (' BP 8644346ES 30 B96', '30', 2),
 ('NNT POLO RL WRENTHA 130571 ', '130571', 1),
 (' WINN DIXIE 2454 SEFFNER FL 1033 ', '1033', 2),
 (' NNT SEARS HOMETOWN 862751', '862751', 1),
 (' NAVY EXCHANGE 50161 3', '3', 2),
 (' CASEYS GEN STORE 2597 SLOAN IA51055', 'IA51055', 2),
 (' NST BEST BUY 48 72393', '72393', 2),
 (' FOOTACTION 57331 TAMPA FL 2340 ', '2340', 2),
 (' SUBWAY 32128', '32128', 1)]
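As a sanity check, both scores can be reproduced by hand from the list above. The sketch below copies the `not_matched` output and the two totals reported in this notebook (107 predictions, 100 ground-truth labels). It assumes each descriptor appears only once in the test set.

```python
# unmatched predictions, copied from the not_matched output above
not_matched = [
    (' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 2),
    (' BP 8644346ES 30 B96', '8644346ES', 2),
    (' BP 8644346ES 30 B96', '30', 2),
    ('NNT POLO RL WRENTHA 130571 ', '130571', 1),
    (' WINN DIXIE 2454 SEFFNER FL 1033 ', '1033', 2),
    (' NNT SEARS HOMETOWN 862751', '862751', 1),
    (' NAVY EXCHANGE 50161 3', '3', 2),
    (' CASEYS GEN STORE 2597 SLOAN IA51055', 'IA51055', 2),
    (' NST BEST BUY 48 72393', '72393', 2),
    (' FOOTACTION 57331 TAMPA FL 2340 ', '2340', 2),
    (' SUBWAY 32128', '32128', 1),
]
n_predictions = 107  # rows in inference_output
n_descriptors = 100  # ground-truth labels in the test set

# precision: matched predictions over all predictions
num_matched = n_predictions - len(not_matched)
precision = num_matched / n_predictions

# a descriptor counts as wrong for accuracy if any of its predictions is unmatched:
# either a single wrong guess, or a multi-prediction descriptor, where at most
# one of the guesses can equal the label
wrong_descriptors = {descriptor for descriptor, _, _ in not_matched}
accuracy = (n_descriptors - len(wrong_descriptors)) / n_descriptors

print(f'matched predictions: {num_matched}')
print(f'precision: {precision*100:.3f}%')
print(f'accuracy:  {accuracy*100:.3f}%')
```

The 11 unmatched predictions cover 10 distinct descriptors (the BP entry appears twice), which is why precision is taken over 107 predictions while accuracy is taken over 100 descriptors.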

### Take a Deep Look at Some Cases

1) Case 1

In [82]:
test_df.loc[test_df['transaction_descriptor'] == 'NNT POLO RL WRENTHA 130571 ']

Unnamed: 0,transaction_descriptor,store_number,dataset
223,NNT POLO RL WRENTHA 130571,13057,test


We see that this one is an unfortunate error: without knowing the label beforehand, there is no way of knowing that we were not supposed to keep the 1 at the end.

2) Case 2

In [83]:
test_df.loc[test_df['transaction_descriptor'] == ' BP 8644346ES 30 B96']

Unnamed: 0,transaction_descriptor,store_number,dataset
222,BP 8644346ES 30 B96,8644346,test


This is also a confusing case because there were 3 separate alphanumeric fields that could have been interpreted as a "store number". This is especially true since it is ambiguous what "BP" is. Many of the other descriptors had full names, or longer entries for the store names, so the model was likely able to deduce the context in those cases.

3) Case 3

In [84]:
test_df.loc[test_df['transaction_descriptor'] == ' NNT SEARS HOMETOWN 862751']

Unnamed: 0,transaction_descriptor,store_number,dataset
231,NNT SEARS HOMETOWN 862751,8627,test


We see that, just like case 1, this is an unfortunate case of the descriptor itself being unclear. It is debatable whether a human, analyzing these one by one, could extract this correctly.

4) Case 4

In [87]:
test_df.loc[test_df['transaction_descriptor'] == ' SUBWAY 32128']

Unnamed: 0,transaction_descriptor,store_number,dataset
292,SUBWAY 32128,3212,test


Case 4 is similar to Cases 1 and 3.
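What these three cases have in common can be checked directly: in each one, the ground-truth label is a prefix of the token the model predicted, with one or two extra digits appended. The short check below uses the values from the tables above, treated as strings (an assumption, since the tables do not show the column type).

```python
# (predicted token, ground-truth label) for Cases 1, 3 and 4, copied from the tables above
cases = {
    'Case 1': ('130571', '13057'),
    'Case 3': ('862751', '8627'),
    'Case 4': ('32128', '3212'),
}

for name, (predicted, truth) in cases.items():
    is_prefix = predicted.startswith(truth)
    extra = predicted[len(truth):]  # digits fused onto the end of the store number
    print(f'{name}: label is a prefix of the prediction: {is_prefix}, extra digits: {extra!r}')
```

Nothing in the descriptor marks where the store number ends and the extra digits begin, which is why these cases are hard to recover without the label.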

5) Case 5

In [88]:
test_df.loc[test_df['transaction_descriptor'] == ' CASEYS GEN STORE 2597 SLOAN IA51055']

Unnamed: 0,transaction_descriptor,store_number,dataset
236,CASEYS GEN STORE 2597 SLOAN IA51055,2597,test


To comment on this case, the model was exposed to a few store numbers that were a combination of letters and numbers. The model likely got this wrong because "IA51055" was probably supposed to be "IA 51055". Fused into a single token, it could have looked like an alphanumeric store number rather than part of an address, given that many of the addresses in the training and validation sets were written with the space.

## Pros of spaCy NLP Model

- Scalable
- Simple to Train
- Very Good Performance Despite Limited Dataset
- Despite small amount of data, handles *some* hard edge cases

## Cons of spaCy NLP Model

- Fails some edge cases (this is rather nitpicky, however; even humans may not have extracted those properly)

## What to do Next?

We could **possibly explore hyperparameter tuning or transfer learning with an existing model** that is trained extensively (spaCy has a few of those available for download). However, I assess that *this is not necessary* because:

1) We achieved about 90% in both accuracy and precision with our model, starting from a blank instance of spaCy's model architecture. Some of the **edge cases are unresolvable** unless we know the labels of the testing data beforehand. And while we have access to them here, in practice we should not be looking at testing data, because we will not have the luxury of tagged/labeled testing data! We could possibly improve our loss or accuracy a little, but in many of these edge cases (shown in the preceding section) it is hard for **even humans** to extract the store number without knowing the labels a priori.

2) Likewise, hyperparameter tuning will not be useful, for the same reasons. The last few test entries present a challenge even for humans, so no amount of tuning is likely to get all of them right.

**The best way to improve this model** is to **give it more data**. The more this model has to learn from, the better it can adapt to difficult edge cases (like the *numbers in addresses*, or the *store numbers that are concatenated with other numbers*). It is important that we give it not only *more* data, but also *more variety* of data. The **more different cases that our model is exposed to, the better it will generalize**.

## Conclusion

### Perspective to Carry in a Project

In any machine learning/data science project, there are always two aspects of the problem:

1) **The business/practical problem**

This is the statement of the problem that deals with the implications of achieving a certain task, and why we feel that machine learning/data science techniques will help us in accomplishing this task, particularly in a manner that exceeds the ability of humans in some way (either efficiency, accuracy, scalability, etc.)

2) **The technical problem**

This is the statement of the problem that deals with the *why* and the *how*. *How* do we leverage technical resources to accomplish this task? *Why* do our methodology and techniques work? Furthermore, we must be careful to acknowledge any practical considerations (like resource limitations, time limitations, etc.)

### Our Task

From this perspective, the **business/practical problem** of this task is to extract information (store numbers) out of a list of businesses/organizations. The implication is that a model can do this more efficiently than humans can, and a single model built here is scalable to thousands, and even millions, of entries.

The **technical problem** of this task (what we are really focused on here) is how we extract this information. More specifically, how do we deal with text data and what must we do to ensure that our models are able to give clean predictions? 

**To answer the latter question**, **I explored 3 ways** we could possibly deal with the text data, and how to extract our desired information from it:

1) Filtering with Regular Expressions and Typical String Parsing Methods

2) Building an LSTM RNN

3) Training a Blank Instance of a Dedicated, Multi-purpose Natural Language Processing Model

After a lot of experimentation and observations of the pros and cons of each method, we have a clear winner with regards to which method is the best to handle this specific problem. The criteria for a good model/method here was the following:

- Scalable and Generalizable
- Easy to Implement
- Can overcome the limitations surrounding the problem (namely our lack of data)

**Option 1** was easy to implement, and clearly was not made any less useful by our lack of data (in fact, the less data there is, the easier it is). However, it is not scalable or generalizable because of the nature of regular expressions. We could easily enumerate every possible case for the limited sample we had (*300 entries*), but if we scale our data up and throw in slight variations of existing cases, the number of regular expressions we have to write gets out of control.

**Option 2** is scalable and generalizable (in a technical sense, just by the nature of neural networks), and it is also easy to implement (this is subjective, but TensorFlow is far simpler than writing hundreds of regular expressions), but it was clearly hampered by the limitations in our data. Neural networks, as a general rule of thumb, require a lot of data to be effective because of the sheer number of parameters that must be trained and updated. Furthermore, for a simple implementation, we had to restate our problem as a strict 2-class classification problem. This, in itself, brought more issues (namely imbalanced data), which further reduced the effectiveness of this approach given the limitations.

**Option 3** is scalable and generalizable (with more data), it is easy to implement, and because the model architecture was already laid out, we could just adapt it to our specific problem. Training this model was surprisingly effective in helping it learn our specific task, despite the small amount of data that we had. I expect that, with more varied data, this model would become considerably more accurate.

In conclusion, we see that spaCy's *Embed, Encode, Attend, Predict* model is significantly more reliable than a traditional RNN (even with the many-to-many LSTM architecture), and it is significantly more scalable than a non-machine learning approach. From a blank model instance, we obtained an **accuracy of 90.00%**, and a **precision of 89.72%**! Although these results are not optimal (we would like to see higher accuracies if possible), we also have to note that our training, validation, and testing samples were limited in both size (*100 samples each*) and scope of edge cases for our model to learn sufficiently (*refer to part 2*). This model exceeded my expectations.

Furthermore, in our analysis of the failed cases, we see that they were instances that would have been extremely difficult to predict **without having a priori knowledge** of the true store numbers (look at *Cases 1, 3, and 4 in our failure analysis*). Another advantage of using this model is that, when it was unsure of the label, it predicted two labels. This is advantageous to us because in many machine learning workflows/cycles, it is still necessary to do further post-processing of model outputs, and if the model offers multiple guesses, we can choose the correct ones, or simply filter out the incorrect ones with additional models, or with other criteria based on context.
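As one illustration of such post-processing, consider a purely positional rule. In the multi-prediction rows printed earlier, the unmatched guess is often the later one (for example 3454 in the WINN DIXIE descriptor, or 3 in NAVY EXCHANGE), so a hypothetical filter could keep only the earliest prediction per descriptor. The `keep_first_prediction` helper below is an assumption, not something used in this notebook. It relies on rows arriving in document order, which holds for `test_ner` because it iterates over `doc.ents`.

```python
def keep_first_prediction(rows):
    """Keep only the earliest prediction for each descriptor index."""
    first_by_index = {}
    for index, descriptor, prediction, label, _ in rows:
        # setdefault ignores later predictions for an index we have already seen
        first_by_index.setdefault(index, (index, descriptor, prediction, label))
    return list(first_by_index.values())


# rows copied from the inference output shown earlier
rows = [
    (7, ' LBOUTLETS 4249 1475 N BUR', '4249', 'store_num', 1),
    (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '2505', 'store_num', 2),
    (8, ' WINN DIXIE 2505 VALRICO FL 3454 ', '3454', 'store_num', 2),
    (33, ' NAVY EXCHANGE 50161 3', '50161', 'store_num', 2),
    (33, ' NAVY EXCHANGE 50161 3', '3', 'store_num', 2),
]

for row in keep_first_prediction(rows):
    print(row)
```

A rule like this cannot help when neither guess matches the label, as with the BP descriptor, so it is a filter to try rather than a fix.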

**For the future...** we can explore more hyperparameter tuning and transfer learning, given that we have access to more data...