# Introduction

In this notebook we will learn how to **characterize bus repairs** from free-text descriptions entered by users. We will accomplish this using **Natural Language Processing (NLP)**.

You will discover, step by step, how to write the code that processes the repair text. In the last part of the workshop, this code will be **packaged to create a service** that you can query from an application.

We will train the model on simulated data. (Feel free to explore notebook [00-Generate-Sample-Claims-Data.ipynb](00-Generate-Sample-Claims-Data.ipynb) to see how this data was simulated.) Once our model is trained, we can test it by entering a bus repair issue (e.g. "the brakes feel soft when I press on them") and checking whether the model has categorized the repair correctly. Repairs will be categorized as Brakes, Starter or Other.

Ready? Let's go!

# Environment initialization

## Libraries

When we launched JupyterLab, we selected the `Tensorflow` notebook image. This image already has some key libraries installed for us, but we have also listed all the libraries we rely on (and their versions) in the `nb-requirements.txt` file. This makes it easier to run this code repeatably and to share it with others.

We install these now:

In [1]:
!pip install -r nb-requirements.txt



## Imports

Now that our libraries are installed, we need to import them:

In [2]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow import keras

## Create Training and Testing data sets

Now that we have loaded the libraries we need, the first step in our journey is to be able to take our raw data and divide it into testing and training sets.


In [3]:
#============================================================================
# Set the training/testing proportions of the data set and the vocabulary size.
#============================================================================
training_portion = .80         # Use 80% of data for training, 20% for testing
max_words        = 1000        # Max number of distinct words the tokenizer keeps

data             = pd.read_csv('dataset/testdata1.csv')

#============================================================================
# TRY:  uncomment the below print statement and print out first 5 rows of 
# generated claims so you can see what data looks like.
#
# print(data.head())
#============================================================================

train_size       = int(len(data) * training_portion)

#============================================================================
# FUNCTION:  train_test_split
# This function splits the data into training and test sets.  
# Inputs:   raw data and determined train_size.
#============================================================================
def train_test_split(data, train_size):
    train        = data[:train_size]
    test         = data[train_size:]
    return train, test

train_cat, test_cat   = train_test_split(data.iloc[:,1], train_size)  # label data is second column
train_text, test_text = train_test_split(data.iloc[:,0], train_size)  # text data is first column
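
The split above is positional: the first 80% of the rows become the training set. If the CSV happened to be ordered by category, the test set could end up missing a class entirely. The following is a minimal sketch of a shuffled split, written in plain Python so it runs without the dataset. The helper `shuffled_split` and the sample rows are hypothetical and not part of the workshop code.

```python
import random

def shuffled_split(rows, train_fraction, seed=42):
    """Shuffle a copy of rows reproducibly, then cut it into train and test parts."""
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (text, label) rows, deliberately sorted by label
sample_labels = ["brakes"] * 4 + ["other"] * 3 + ["starter"] * 3
sample_rows = [(f"repair text {i}", label) for i, label in enumerate(sample_labels)]

train_rows, test_rows = shuffled_split(sample_rows, train_fraction=0.8)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

With the pandas DataFrame used in this notebook, calling `data.sample(frac=1, random_state=42)` before slicing should have a similar effect.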


## Tokenize the Data sets

After we have training and testing sets, we need to **tokenize the data**. This means that we convert each text document into a fixed-length numeric vector: the tokenizer builds a word dictionary from the training text, and each position in the vector marks whether the word at that dictionary index occurs in the document.



In [4]:
tokenize              = Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_text) # fit tokenizer to our training text data

#============================================================================
#x_train and x_test are the vectorization of the text data (which is a claim)
#============================================================================
x_train               = tokenize.texts_to_matrix(train_text)
x_test                = tokenize.texts_to_matrix(test_text)

#============================================================================
# TRY:  uncomment the below print statement and observe the rows in the 
# newly created matrix.
#
# print(x_train)
# ===========================================================================
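
To make the vectorization step more concrete, here is a simplified plain-Python sketch of the idea behind `texts_to_matrix` in its default binary mode. It is an illustration under assumptions, not the Keras implementation: the helpers `build_word_index` and `texts_to_binary_matrix` are hypothetical, punctuation filtering is skipped, and ties between equally frequent words are broken alphabetically here.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: 1 = most frequent (index 0 is left unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text; column j is 1 if the word with index j occurs in it."""
    rows = []
    for text in texts:
        row = [0] * num_words
        for word in text.lower().split():
            idx = word_index.get(word)          # None for words never seen in fitting
            if idx is not None and idx < num_words:
                row[idx] = 1
        rows.append(row)
    return rows

sample_texts = ["brakes feel soft", "engine will not start", "brakes squeal"]
word_index = build_word_index(sample_texts)
matrix = texts_to_binary_matrix(sample_texts, word_index, num_words=10)

for text, row in zip(sample_texts, matrix):
    print(row, "<-", text)
```

Words that were not seen while fitting, or whose index falls outside `num_words`, are silently dropped, which is why the tokenizer is fitted on the training text only and then reused unchanged for test data and future predictions.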


Before we can make a prediction, we will need to pass any future data through the tokenizer to transform it into feature vectors. We now save the tokenizer so we can use it later: 

In [5]:
import pickle

In [6]:
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenize, handle, protocol=pickle.HIGHEST_PROTOCOL)
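
Later, the saved tokenizer has to be read back before it can transform new text. The sketch below shows the save-and-load round trip with `pickle`. It uses a plain dictionary as a hypothetical stand-in for the fitted tokenizer and a temporary directory instead of the notebook's working directory, so it runs on its own.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer; any picklable object behaves the same way
stand_in_tokenizer = {"word_index": {"brakes": 1, "engine": 2}, "num_words": 1000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pickle")

    # Save, as in the cell above
    with open(path, "wb") as handle:
        pickle.dump(stand_in_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load it back, as a service would do at start-up
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == stand_in_tokenizer)
```

Only unpickle files you trust, because `pickle.load` can execute arbitrary code contained in the file.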

## Using Scikit-learn
We will use the `LabelEncoder` utility from Scikit-learn to convert the label strings into numeric indices.

In [7]:
#============================================================================
# Convert label strings to numbered index
#============================================================================
encoder              = LabelEncoder()  
encoder.fit(train_cat)
y_train              = encoder.transform(train_cat)
y_test               = encoder.transform(test_cat)

#============================================================================
# Note: for each row in the data, each entry represents the value of the label
# Example:  [2 1 1 2 1 1 0 ...  which corresponds to starter, other,
# other, starter, other, other, brakes ...
#
#============================================================================
# TRY:  uncomment the below print statement.  What would you expect y_train
# to look like?  
#
# print(y_train)
#============================================================================
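
As a rough illustration of what the encoder does, the plain-Python sketch below maps each label to its position in the sorted list of distinct labels, which is also how `LabelEncoder` orders its `classes_`. The sample labels are made up to match the example in the comment above.

```python
labels = ["starter", "other", "other", "starter", "other", "other", "brakes"]

# Distinct labels in sorted order define the index of each class
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}

encoded = [class_to_index[name] for name in labels]   # label strings -> integers
decoded = [classes[i] for i in encoded]               # integers -> label strings

print(classes)
print(encoded)
```

Because the mapping depends only on the sorted class names, the same list of classes can later be used to turn a predicted index back into a readable label.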


## One Hot Encoding

Our labels (categories such as Brakes or Starter) have now been converted to numeric indices for both the training and the test data. The next step is to one-hot encode them.

**One hot encoding** allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. **The categories must be converted into numbers**. This is required for both input and output variables that are categorical.

After we have converted the labels using one-hot encoding, we will be ready to build our main NLP model and train it.

In [8]:
#============================================================================
# Convert the labels to a one-hot representation
#
# One Hot Encoding replaces the column of labels (whose values are 0, 1 or 2)
# with 3 columns each representing 1 label value.  For example, the label 
# 'other' is replaced by the vector 0 1 0, the label 'starter' is replaced by
# the vector 0 0 1, the label 'brakes' is replaced by the vector 1 0 0
#============================================================================
num_classes          = len(set(y_train))  # set() creates a unique set of objects
y_train              = to_categorical(y_train, num_classes)  
y_test               = to_categorical(y_test, num_classes)

#============================================================================
# TRY:  uncomment the below print statements in order to inspect the 
# dimensions of our training and test data.
# y_train may appear as y_train shape: (159, 3) which represents 159 rows, 3 cols
#
#print('x_train shape:', x_train.shape)
#print('x_test shape:', x_test.shape)
#print('y_train shape:', y_train.shape)
#print('y_test shape:', y_test.shape)
#============================================================================
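
The conversion itself is simple enough to sketch in plain Python. The helper `one_hot` below is a hypothetical illustration of the idea behind `to_categorical`, not its implementation: each integer label becomes a row of zeros with a single 1.0 at the position of that label.

```python
def one_hot(indices, num_classes):
    """Turn integer class indices into rows with a single 1.0 per row."""
    return [[1.0 if position == index else 0.0 for position in range(num_classes)]
            for index in indices]

# 0 = brakes, 1 = other, 2 = starter (sorted label order)
encoded_labels = [2, 1, 0]
for index, row in zip(encoded_labels, one_hot(encoded_labels, num_classes=3)):
    print(index, "->", row)
```

Each row has exactly as many columns as there are classes, which is why the output layer of the model needs `num_classes` nodes.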



## Building the model

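The next cell builds a small neural network (one hidden layer of 512 nodes followed by a softmax output layer), trains it on the training set and evaluates it on the test set. Once the model is trained, we can test it by entering a repair issue (e.g. "the brakes feel soft when I press on them") and checking whether the model has characterized the repair issue correctly.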


In [26]:
#============================================================================
# Build model
#============================================================================
layers               = keras.layers
models               = keras.models
model                = models.Sequential()
#model.add(layers.Dense(512, input_shape=(max_words,), activation='relu'))  # Hidden layer with 512 nodes
model.add(layers.Dense(512, input_shape=(None,1000), activation='relu'))  # Hidden layer with 512 nodes
model.add(layers.Dense(num_classes, activation='softmax'))

#============================================================================
# relu, softmax and categorical_crossentropy tell the model how to do its
# internal calculations: relu is the hidden layer's activation function,
# softmax calculates a probability for each category for each document, and
# categorical_crossentropy is the loss minimized during training.  If you only
# had two categories (yes or no), you would use sigmoid instead of softmax.
#============================================================================
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#============================================================================
# VARIABLES
# history    - returned by fit; normally used to plot learning curves.
# fit        - trains the model, i.e. calculates its weights.
# batch_size - how many rows the internal calculations process at one time
# epochs     - number of times the calculations pass through the entire data
#============================================================================
batch_size          = 32
epochs              = 2
history = model.fit(x_train, y_train,      
                    batch_size=batch_size,  
                    epochs=epochs,         
                    verbose=1,
                    validation_split=0.1)

#============================================================================
# evaluate func compares the model predictions with the actual known test values
#============================================================================
score = model.evaluate(x_test, y_test,       
                       batch_size=batch_size, verbose=1)

#============================================================================
# TRY:  uncomment the below print statements to see the test loss and accuracy
# of our model
#
#print('Test loss:', score[0])
#print('Test accuracy:', score[1])
#============================================================================

Epoch 1/2
Epoch 2/2
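
To give a feel for the two calculations named in the comments above, here is a minimal plain-Python sketch of softmax and categorical cross-entropy for a single document. The raw scores are invented for illustration, and the helper names are hypothetical. Keras performs the equivalent computation on whole batches with additional numerical safeguards.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Loss for one document: minus the log of the probability given to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)

raw_scores = [2.0, 0.5, -1.0]          # invented outputs for brakes, other, starter
probabilities = softmax(raw_scores)

print("probabilities:", probabilities)
print("loss if the true class is brakes: ", categorical_crossentropy([1, 0, 0], probabilities))
print("loss if the true class is starter:", categorical_crossentropy([0, 0, 1], probabilities))
```

The loss is small when the model gives a high probability to the correct class and large when it does not, which is what training tries to minimize.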


We can now save our trained model and our encoder classes. We will need both later, when we create a model service.

In [27]:
model.save('models/repairmodel.h5')

text_labels = encoder.classes_   # ndarray of output values (labels or classes) in sorted order, e.g. brakes, other, starter

with open('text_labels.npy', 'wb') as f:
    np.save(f, text_labels)



## Let's test our model!

Now that we have a model, we would like to generate a prediction (e.g. categorize the repair issue as Brakes, Starter or Other). We must create a prediction function which takes in a string and returns the predicted category of the repair:




In [28]:
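def predict(single_test_text):
    """Return the predicted repair category for a single free-text description."""
    text_as_series = pd.Series(single_test_text)  # wrap the string so the tokenizer sees one document
    single_x_test = tokenize.texts_to_matrix(text_as_series)  # matrix with one row for that document
    single_prediction = model.predict(np.array([single_x_test]))  # add a leading axis before calling the model
    single_predicted_label = text_labels[np.argmax(single_prediction)]  # label with the highest probability
    
    return {'prediction': single_predicted_label}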

#========================================
# Run the first time in order to save the model
#=========================================
single_test_text = 'turn the key and nothing happens' 
print(single_test_text)    #print out the repair being categorized

prediction = predict(single_test_text) 
print(prediction)   

turn the key and nothing happens
{'prediction': 'starter'}
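
The function above returns only the winning label. When testing, it can also be useful to see how confident the model was. The sketch below shows one way to turn a vector of class probabilities into a richer result. The helper `describe_prediction` and the probability values are hypothetical and not part of the workshop code.

```python
def describe_prediction(probabilities, labels):
    """Pair each label with its probability and pick the most likely one."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "prediction": labels[best],
        "confidence": probabilities[best],
        "scores": dict(zip(labels, probabilities)),
    }

# Invented probabilities in the sorted label order used by the encoder
example = describe_prediction([0.05, 0.15, 0.80], ["brakes", "other", "starter"])
print(example)
```

In the notebook, the probabilities would come from the array returned by `model.predict`, for example flattened with `single_prediction.ravel().tolist()`.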


In [None]:
#==========================================================================
# TRY: run this cell to test the predict function, then swap in one of the
# commented-out examples below.  The 3 repair cases let us see whether the
# model can properly categorize a Brakes, a Starter and an Other repair text.
#==========================================================================
single_test_text = 'press brake pedal and car wont stop'
#single_test_text = 'turn key over and hear a clicking sound' 
#single_test_text = 'there is fluid leaking from the engine' 

print(single_test_text)    # print out the repair being categorized
model = keras.models.load_model('models/repairmodel.h5')  # load the saved model before calling the predict function
prediction = predict(single_test_text) 
print(prediction)                           # print the prediction
