# NLP Model

This first model is taken from tensorflow hub and is used to embed words into numbers which can then be fed into a machine learning algorithm down the line. Although it is from tensorflow hub we actually use it in keras just for ease of use. The model is called nnlm and is a deep neural network trained on the google english news. More information may be found [here](https://tfhub.dev/google/collections/bert/1). 

In our particular use case this model is takes a transaction description, converts that into numbers, which is then fed into a classifier to predict what category a transaction is.

In [38]:
# import dependencies
import tensorflow_text as text  
import tensorflow as tf
import tensorflow_hub as hub

# construct our neural network
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1")
encoder_inputs = preprocessor(text_input) # dict with keys: 'input_mask', 'input_type_ids', 'input_word_ids'
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2",
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].

Now we can use the model to get our vector for any sentence we would like. In our production version we have more preprocessing to try get rid of the stuff which doesn't mean anything. We use BERT to try and catch semantic similarities between sentences due to the model learning from wikipedia. This means the resultant vector for coffee shop and coffee room will be closer than coffee room and casino.

In [58]:
embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["tesco superstore sainsburys"])
print(embedding_model(sentences))

tf.Tensor(
[[-0.9999931   0.01890039 -0.99417526 -0.5798835  -0.9798178  -0.475273
  -0.7508789  -0.89774495  0.06465519  0.04209847  0.3270732  -0.07334452
   0.02483435  0.9999743   0.42316917 -0.90515625  0.70257396  0.15186098
  -0.4735886  -0.4672566   0.88990635 -0.06746099 -0.19950417 -0.81234556
  -0.99843246 -0.06911208 -0.9996253   0.5582032   0.94886774  0.04736497
  -0.04890127 -0.15587012 -0.925856   -0.8101108   0.614291    0.9983587
  -0.9991843  -0.00977816  0.97735554 -0.9855164   0.98690736  0.9609173
  -0.9610637   0.94345903 -0.9910254  -0.08077802 -0.9433065   0.9964877
   0.874688    0.9992928   0.7694149  -0.8159503  -0.06038954  0.35094908
   0.7543407   0.88328785 -0.2450968  -0.9747528   0.820621   -0.332429
   0.04751019  0.9828596  -0.95191634  0.962528   -0.95750004 -0.99998635
  -0.7990878   0.9383162   0.38521895  0.9933143   0.99812734  0.26086992
  -0.9505805  -0.08694382  0.9597408  -0.9977386  -0.705001    0.09304172
  -0.80885977  0.04232275 -0.12215

In [43]:
import pandas as pd
import numpy as np

x = (embedding_model(sentences))

vectors_df = pd.DataFrame(np.stack(x))

# LightGBM

LightGBM stand for light gradient boosting machines and is a tree based model. It is fast and returns highly accurate results in comparison to other models. The documentation for it can be found [here](https://lightgbm.readthedocs.io/en/latest/) and the wiki page is [here](https://en.wikipedia.org/wiki/LightGBM). We use this as a classification model for our categories when we can not extract the merchants. 

The model takes an input of 133 (128 from the output from BERT) and also a further 5 features (balance (normalised), day of week ... )


In [73]:
# here I am just import the correct modules and importing a model that was made during testing with the right parameters
import os
import pickle
import lightgbm

prefix = '/Users/callumsmyth/PycharmProjects/aws-categorisation-engine/local_test/test_dir/'
model_path = os.path.join(prefix, 'model')
lg_class_path = os.path.join(model_path, 'lg_class.pkl')

with open(os.path.join(lg_class_path), 'rb') as inp:
    model = pickle.load(inp)



In [74]:
# add in some more columns so we can get predictions this is just dummy data
x_df = pd.DataFrame({
    "xcol1": [10],
    "xcol2": [0.5],
    "xcol3": [0.5],
    "xcol4": [0.5],
    "xcol5": [0.1]
})
predict_df = pd.concat([x_df, vectors_df])

In [76]:
prediction = model.predict(predict_df)
print(prediction)

['HOME' 'OTHER']


So we dont actually go for the predictions and use them directly. We take the probability of each class of income and use and if there is a probability above a certain threshold (0.3) we will use that class. If there is not then we will assign it 'OTHER' and this is what the customer will see as we are not confident in the results.

This looks like:

In [77]:
model.predict_proba(predict_df)

array([[1.40637363e-04, 1.01387134e-03, 1.45709923e-04, 4.84323182e-04,
        8.93834482e-04, 5.14661064e-03, 2.82819969e-04, 9.14937395e-01,
        1.32025780e-02, 1.70162808e-03, 8.84984257e-04, 1.91748153e-03,
        1.05822484e-02, 2.21050095e-04, 1.85267757e-04, 4.82595600e-02],
       [8.68086675e-03, 2.25433461e-05, 1.26963014e-01, 4.94214977e-02,
        7.12566297e-03, 1.18021395e-04, 1.54085304e-02, 3.61664445e-03,
        4.07569864e-05, 5.97870242e-04, 4.55172942e-05, 6.10035133e-01,
        1.23375729e-04, 1.67146749e-01, 5.99411223e-03, 4.65970364e-03]])

# Balance Prediction

for balance prediction we use a time series forecasting algorithm created by Amazon called DeepAr. This (similarly to BERT) is a neural network and information can be found [here](https://aws.amazon.com/blogs/machine-learning/now-available-in-amazon-sagemaker-deepar-algorithm-for-more-accurate-time-series-forecasting/). Because this is hosted on a docker container by AWS the architecture to use this is slightly different.

For categorisation we can 'BYOC' (bring your own container) and use custom code inside a container with a 'train' and 'serve' channel. This keeps all the code nice and compact in one container, and run the 'train' channel to create a new model. And then with the same container run it on 'serve' model and it will serve the model via an API (Flask) to our internal customer.

For DeepAr, it is AWS's container and we can't include custom code into it. So we have built a custom container (which we have full control over the input and output of the container) which calls the trained DeepAr model inside of it and then exposes the output back through an API (FastAPI) to our internal clients.

Time series models are easier to visualise so we can do that here.

In [90]:
!pip install scikit-learn

You should consider upgrading via the '/Users/callumsmyth/PycharmProjects/yapily-ml-models/yapily-ml-models/bin/python -m pip install --upgrade pip' command.[0m


In [110]:
%%writefile utils.py
import json
import logging
from datetime import timedelta
from random import shuffle

from tqdm import tqdm
import numpy as np
import pandas as pda
from sklearn.base import BaseEstimator, TransformerMixin

class JsonIfy(BaseEstimator, TransformerMixin):

    def __init__(self, inference=False, pred_length=None, dynamic_feat=True):
        self.inference = inference
        self.pred_length = pred_length
        self.dynamic_feat = dynamic_feat

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        time_series_jslines = []
        for ts in X:
            time_series_jslines.append(json.loads(series_to_jsonline(ts, dynamic_feat=self.dynamic_feat,
                                                                     pred_length=self.pred_length,
                                                                     inference=self.inference)))
        return time_series_jslines
    
def series_to_jsonline(ts, dynamic_feat=None, cat=None, inference=False, pred_length=None):
    if inference == False:
        return json.dumps(series_to_obj_train(ts, dynamic_feat, cat))
    else:
        return json.dumps(series_to_obj_inference(ts, dynamic_feat, cat, pred_length))
    

def series_to_obj_inference(ts, dynamic_feat=None, cat=None, pred_length=None):
    target = list(ts['target'])
    obj = {"start": str(ts.index[0]), "target": target[:-pred_length]}
    if cat is not None:
        obj["cat"] = cat

    if dynamic_feat is not None:
        dyn_feat_list = (list(ts['dynamic_features']))
        obj["dynamic_feat"] = [dyn_feat_list]

    return obj

Overwriting utils.py


In [104]:
!export AWS_DEFAULT_REGION=eu-west-2

In [109]:
%%writefile deepar-preds.py
import json
import pickle

import boto3
import matplotlib.pyplot as plt
import pandas as pd
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sklearn.pipeline import Pipeline

from utils import *
endpoint_name = 'balance-reconstruction-staging'


predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer()
)

# setting some constants
freq = 'D'
prediction_length = 30
context_length = 30
prefix = 'Scikit-DeepAr-Pipeline/transformed_data/for_testing/'

# this is just reading in data from s3 to look at the predictions 
s3 = boto3.client('s3')
actual_data_obj = s3.get_object(Bucket='balance-reconstruction', Key=prefix + 'actual_data')
actual_data_obj = actual_data_obj['Body'].read()
actual_data = pickle.loads(actual_data_obj)
for i in actual_data:
    i.columns =[ 'target', 'AccountId', 'dynamic_features']

# creating a prediction object
predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer())

bucket_name = 'balance-reconstruction'
prefix = 'Scikit-DeepAr-Pipeline/transformed_data/'
client = boto3.client('s3')

# nifty function which takes the data in a dataframe (easy to work with) and converts to the format DeepAr expects
convert_train_to_json = Pipeline([
    ('jsonify', JsonIfy(pred_length=prediction_length, inference=True))
])

# selecting the previous 30 days
prediction_data = []
for i in actual_data:
    prediction_data.append(i[:-60])

# this actual converts our raw data to give to DeepAr and adds some configuration so we can output or 80% confidence interval
time_series_training = convert_train_to_json.fit_transform(prediction_data)
instances = {"instances": time_series_training,  'configuration': {"num_samples": 30,
                                                                  "output_types": ["quantiles"],
                                                                  "quantiles": ["0.1", "0.9", "0.5"]
                                                                  }}
# do the actual predictions
response = predictor.predict(instances)
response_data = json.loads(response.decode())


# from here onwards its just reformating the data so we can get it into a nice overlayed graph
predicted_dataframe = []
prediction_times = [x.index[-31] + pd.Timedelta(1, unit=freq) for x in actual_data]

prediction_times = []
for i in actual_data:
    prediction_times.append(i.index[-31] + pd.Timedelta(1, unit=freq))

print(f'\nprediction times are:\n{prediction_times}\n')
list_of_df = []
for k in range(len(prediction_times)):
    prediction_index = pd.date_range(
        start=prediction_times[k], freq=freq, periods=prediction_length
    )
    predicted_dataframe.append(
        pd.DataFrame(data=response_data["predictions"][k]["quantiles"], index=prediction_index)
    )

for i in actual_data:
    i.drop(['dynamic_features', 'AccountId'], inplace=True, axis=1)

for k in range(len(prediction_times)):
    plt.figure(figsize=(12, 6))
    actual_data[k].tail(60).plot(label="target")
    predicted_dataframe[k]["0.5"].plot(label="prediction median")
    plt.legend()
    plt.show()


Overwriting deepar-preds.py


In [112]:
!pip freeze > requirements.txt