---
**INTENT CLASSIFICATION WITH LSTM**

This notebook classifies the intent of a text input into one of four categories: `[BookRestaurant, GetWeather, PlayMusic, RateBook]`. For a chatbot to give an appropriate response, it must first correctly classify the user's intent. For this reason, an intent classification model is usually the first stage in a chatbot pipeline.

---


In [52]:
%reload_ext tensorboard

In [2]:
pip install "tensorflow-text==2.8.*"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-text==2.8.*
  Downloading tensorflow_text-2.8.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.9 MB)
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.8.2


In [44]:
import os
import re
from collections import defaultdict, namedtuple

import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_text as tf_text
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import LSTM, Dense, Bidirectional, Dropout, Activation, Flatten, Embedding

In [4]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

## Load Data

In [5]:
!wget -N https://cainvas-static.s3.amazonaws.com/media/user_data/vomchaithany/train.csv

--2022-06-12 07:18:25--  https://cainvas-static.s3.amazonaws.com/media/user_data/vomchaithany/train.csv
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.62.44
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.62.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 441947 (432K) [application/vnd.ms-excel]
Saving to: ‘train.csv’


2022-06-12 07:18:26 (515 KB/s) - ‘train.csv’ saved [441947/441947]



In [6]:
df = pd.read_csv('train.csv')
df

Unnamed: 0,sentence,BookRestaurant,GetWeather,PlayMusic,RateBook
0,book The Middle East restaurant in IN for noon,1,0,0,0
1,Book a table at T-Rex distant from Halsey St.,1,0,0,0
2,I'd like to eat at a taverna that serves chili...,1,0,0,0
3,I have a party of four in Japan and need a res...,1,0,0,0
4,Please make a restaurant reservation for somew...,1,0,0,0
...,...,...,...,...,...
7924,rate this textbook 0 stars,0,0,0,1
7925,give 5 out of 6 stars to Coming Home,0,0,0,1
7926,Give Drift: The Unmooring of American Military...,0,0,0,1
7927,give 1 out of 6 points to Revolution World,0,0,0,1


## Data preprocessing

As the table above shows, each sentence belongs to one of four categories `[BookRestaurant, GetWeather, PlayMusic, RateBook]`. The labels are already one-hot encoded, so only the input sentences need preprocessing. The preprocessing steps are:

- Stopwords removal
- Lemmatization (e.g. teaches -> teach)
- Stemming (e.g. sailed -> sail)
- Vectorization (maps text features to integer sequences)
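
As a rough illustration of the last step, the sketch below maps words to integer ids ranked by frequency. It is a simplified stand-in for what a tokenizer does, not the Keras implementation; `build_word_index` and `texts_to_ids` are hypothetical helper names, and the three-sentence corpus is made up.

```python
from collections import Counter

def build_word_index(texts):
    """Assign ids 1..N to words, most frequent first (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its id; words outside the index are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical toy corpus
corpus = ["book a table", "play a song", "book a flight"]
word_index = build_word_index(corpus)
print(word_index)
print(texts_to_ids(["book a song", "rate a book"], word_index))
```

Note that "rate" never appeared in the toy corpus, so it silently disappears from the second sequence. The same effect applies whenever text is encoded with a vocabulary that was fitted on different text.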

In [7]:
def remove_stopword(text):
    """
    input: string
    Remove stopwords.
    return: string without stopwords    
    """
    filtered = []
    # Uses the module-level `stopword` set created in the Text Normalization cell.
    # for every word in text, append it to filtered list if the word is not in stopword set
    for word in text.split(' '):
        if word not in stopword:
            filtered.append(word)
    return ' '.join(filtered)

def get_lemma(text):
    """
    input: string
    lemmatizes words. (eg teaches -> teach)
    output: string
    """
    t = []
    # Uses the module-level `wnl` lemmatizer created in the Text Normalization cell.
    for word in text.split(' '):
        t.append(wnl.lemmatize(word))
    return " ".join(t)

def get_stem(text):
    """
    input: string
    return word stem using SnowballStemmer, ignore stopwords. (eg sailed -> sail)
    return: string
    """
    t = []
    # Uses the module-level `stemmer` created in the Text Normalization cell.
    for word in text.split(' '):
        t.append(stemmer.stem(word))
    return " ".join(t)

def normalize(text, is_lemma=False, is_stem=False, no_stopword=False):
  """
  input: string (Python str or tf.string tensor)
  Clean text (Unicode-normalize, lowercase, remove digits, add spaces around punctuation, strip whitespace), then apply at most one of stemming, lemmatization, or stopword removal (checked in that order).
  return: string
  """
  # Split accented characters.
  text = tf_text.normalize_utf8(text, 'NFKD')
  text = tf.strings.lower(text)
  # Keep space, a to z, and select punctuation.
  text = tf.strings.regex_replace(text, '[^ a-z.?!,]', '')
  # Add spaces around punctuation.
  text = tf.strings.regex_replace(text, '[.?!,]', r' \0 ')
  # Strip whitespace.
  text = tf.strings.strip(text)

  string_text = text.numpy().decode("utf-8")  # converts tf.constant() type to byte, then decode into string
  if is_stem:
    return get_stem(string_text)
  if is_lemma:
    return get_lemma(string_text)
  if no_stopword:
    return remove_stopword(string_text)
  return string_text

def tf_add_start_end(text):
  """
  input: tensorflow.python.framework.ops.EagerTensor
  Adds [START] and [END] tag to text. For encoder-decoder model.
  return: tensorflow.python.framework.ops.EagerTensor
  """
  return tf.strings.join(['[START]', text, '[END]'], separator=' ')

### Text Normalization

In [8]:
# initializing
stemmer = SnowballStemmer('english', ignore_stopwords = True)
wnl = WordNetLemmatizer()
stopword = set(stopwords.words('english'))

In [9]:
# Normalization output with original text for comparison
# "%-xxs" gives size requirement '-' for left justification (pretty printing purposes) ref: https://stackoverflow.com/questions/12684368/how-to-left-align-a-fixed-width-string

example_text = tf.constant("I have a party of four in Japan and I'd like to make a reservation at Rimsky-Korsakoffee House on Aug. the 3rd.")
print("%-32s %-100s" % ("Original Text:", example_text.numpy().decode()))
print("%-32s %-100s" % ("Normalized :", normalize(example_text)))
print("%-32s %-100s" % ("Normalized + Lemmatized :", normalize(example_text, is_lemma=True)))
print("%-32s %-100s" % ("Normalized + Stemmed :", normalize(example_text, is_stem=True)))
print("%-32s %-100s" % ("Normalized + Stopwords Removal :", normalize(example_text, no_stopword=True)))

Original Text:                   I have a party of four in Japan and I'd like to make a reservation at Rimsky-Korsakoffee House on Aug. the 3rd.
Normalized :                     i have a party of four in japan and id like to make a reservation at rimskykorsakoffee house on aug .  the rd .
Normalized + Lemmatized :        i have a party of four in japan and id like to make a reservation at rimskykorsakoffee house on aug .  the rd .
Normalized + Stemmed :           i have a parti of four in japan and id like to make a reserv at rimskykorsakoffe hous on aug .  the rd .
Normalized + Stopwords Removal : party four japan id like make reservation rimskykorsakoffee house aug .  rd .                       
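
The cleaning stage of `normalize` can be checked without TensorFlow. The sketch below is a pure-Python approximation using only `re` and `unicodedata`; `clean_text` is a hypothetical name, and it covers only the cleaning steps, not lemmatization, stemming, or stopword removal.

```python
import re
import unicodedata

def clean_text(text):
    """Pure-Python approximation of the cleaning stage of normalize()."""
    text = unicodedata.normalize("NFKD", text)   # split accented characters
    text = text.lower()
    text = re.sub(r"[^ a-z.?!,]", "", text)       # keep space, a-z, select punctuation
    text = re.sub(r"[.?!,]", r" \g<0> ", text)    # add spaces around punctuation
    return text.strip()

example = ("I have a party of four in Japan and I'd like to make a reservation "
           "at Rimsky-Korsakoffee House on Aug. the 3rd.")
print(clean_text(example))
```

Because the character filter keeps only lowercase letters, spaces, and four punctuation marks, digits, apostrophes, and hyphens are dropped: "3rd" becomes "rd" and "I'd" becomes "id", as in the normalized output above.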


In [10]:
# Normalization

df['norm'] = df['sentence'].apply(lambda x: normalize(x))
df['norm_lemma'] = df['sentence'].apply(lambda x: normalize(x, is_lemma=True))
df['norm_stem'] = df['sentence'].apply(lambda x: normalize(x, is_stem=True))
df['norm_stop'] = df['sentence'].apply(lambda x: normalize(x, no_stopword=True))

In [11]:
df.head()

Unnamed: 0,sentence,BookRestaurant,GetWeather,PlayMusic,RateBook,norm,norm_lemma,norm_stem,norm_stop
0,book The Middle East restaurant in IN for noon,1,0,0,0,book the middle east restaurant in in for noon,book the middle east restaurant in in for noon,book the middl east restaur in in for noon,book middle east restaurant noon
1,Book a table at T-Rex distant from Halsey St.,1,0,0,0,book a table at trex distant from halsey st .,book a table at trex distant from halsey st .,book a tabl at trex distant from halsey st .,book table trex distant halsey st .
2,I'd like to eat at a taverna that serves chili...,1,0,0,0,id like to eat at a taverna that serves chili ...,id like to eat at a taverna that serf chili co...,id like to eat at a taverna that serv chili co...,id like eat taverna serves chili con carne party
3,I have a party of four in Japan and need a res...,1,0,0,0,i have a party of four in japan and need a res...,i have a party of four in japan and need a res...,i have a parti of four in japan and need a res...,party four japan need reservation rimskykorsak...
4,Please make a restaurant reservation for somew...,1,0,0,0,please make a restaurant reservation for somew...,please make a restaurant reservation for somew...,pleas make a restaur reserv for somewher in mo...,please make restaurant reservation somewhere m...


### Split into Train/ Validation sets

We want to find out which preprocessing technique produces the most accurate model. We therefore build one training set per variant: the original input plus four preprocessed versions.

1. `ori` : Training set with original input
2. `norm` : Training set with normalized input
3. `norm_lemma` : Training set with normalized + lemmatized input
4. `norm_stem` : Training set with normalized + stemmed input
5. `norm_stopword` : Training set with normalized + stopwords removal input
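
For the comparison to be fair, every variant should be split into the same train and test rows. The split cell in this notebook passes the same `random_state` for each variant, which has that effect. The sketch below illustrates the idea with a simplified, non-stratified stand-in for `train_test_split`; `split_indices` and the toy sentences are hypothetical.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row positions with a fixed seed; the first test_size fraction is the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]   # train positions, test positions

# Two hypothetical variants of the same rows
raw = ["Book a table", "Play some jazz", "Rate this book", "Is it raining", "Play rock"]
norm = [s.lower() for s in raw]

train_idx, test_idx = split_indices(len(raw))
raw_train = [raw[i] for i in train_idx]
norm_train = [norm[i] for i in train_idx]

# Same positions in both variants, so row k of one train set matches row k of the other
print(all(r.lower() == n for r, n in zip(raw_train, norm_train)))
```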

In [12]:
# 1: Original input
X = df['sentence']
y = df[['BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook']]

# 2: Normalized input
X_n = df['norm']
# 3: Normalized + Lemmatized input
X_nl = df['norm_lemma']
# 4: Normalized + Stemmed input
X_ns = df['norm_stem']
# 5: Normalized + Stopwords Removal input
X_nr = df['norm_stop']

In [13]:
for t in [X, X_n, X_nl, X_ns, X_nr]:
  print(t.shape)
print(y.shape)

(7929,)
(7929,)
(7929,)
(7929,)
(7929,)
(7929, 4)


In [14]:
# Initialize dictionary to save all train/test split sets
d_inputs = defaultdict(tuple)

# Initialize namedtuple to save train/test sets for easy access
Input = namedtuple('Input', 'xtrain xtest ytrain ytest xtrain_padded xtest_padded')

In [15]:
# Initialize tokenizer
tokenizer = Tokenizer()
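# Fit tokenizer on the original (unprocessed) sentences.
# This single vocabulary is reused below to encode every preprocessed variant.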
tokenizer.fit_on_texts(df['sentence'])
tokenizer_vocab_size = len(tokenizer.word_index) + 1
print(tokenizer_vocab_size)

# Longest sentence measured in characters (not tokens), plus a margin of 1000
maxlength = df['sentence'].map(len).max() + 1000

7522
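
`df['sentence'].map(len)` returns the number of characters in each string, so `maxlength` is a character-based bound (the model summary later in the notebook reports a sequence length of 1186). The padded sequences hold one id per token, so a token-based bound would be considerably shorter. A minimal sketch of the difference, using whitespace splitting as an approximation of the tokenizer and three sentences taken from the data shown above:

```python
sentences = [
    "book a table for two",
    "play the top five songs by Gad Elbaz",
    "rate this textbook 0 stars",
]

max_chars = max(len(s) for s in sentences)            # what .map(len).max() measures on strings
max_tokens = max(len(s.split()) for s in sentences)   # whitespace-token count

print(max_chars, max_tokens)
```

Shorter sequences mean fewer LSTM time steps per example, which typically reduces training time.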


### Tokenization, Encoding, Padding

In [16]:
for x in [('ori', X), ('norm', X_n), ('norm_lemma', X_nl), ('norm_stem', X_ns), ('norm_stopword', X_nr)]:
  # Split preprocessed input text into train/test sets
  xtrain, xtest, ytrain, ytest = train_test_split(x[1], y, test_size = 0.2, stratify=y, random_state=0)
  
  # Tokenize into sequence and encode into numerical
  xtrain_encoded = tokenizer.texts_to_sequences(xtrain)
  xtest_encoded = tokenizer.texts_to_sequences(xtest)

  # Save all in dictionary
  d_inputs[x[0]] = Input(xtrain=xtrain, xtest=xtest, ytrain=ytrain, ytest=ytest,
                         xtrain_padded=sequence.pad_sequences(xtrain_encoded, maxlen = maxlength),
                         xtest_padded=sequence.pad_sequences(xtest_encoded, maxlen = maxlength))

  print('Done train test split and inserted into d_inputs for:', x[0])

Done train test split and inserted into d_inputs for: ori
Done train test split and inserted into d_inputs for: norm
Done train test split and inserted into d_inputs for: norm_lemma
Done train test split and inserted into d_inputs for: norm_stem
Done train test split and inserted into d_inputs for: norm_stopword
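
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence and uses 0 as the pad value. The sketch below reproduces that behaviour in plain Python for small inputs; `pad_pre` is a hypothetical helper, it assumes `maxlen` is positive, and it returns nested lists rather than a NumPy array.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` ids if too long."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                        # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[5, 2], [7, 1, 4, 9, 3]], maxlen=4))
```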


In [17]:
print(d_inputs.keys())

dict_keys(['ori', 'norm', 'norm_lemma', 'norm_stem', 'norm_stopword'])


We have now split the data into train and test sets for each preprocessing technique and stored each split in a `defaultdict` as a named tuple for easy access.

To access a split, index `d_inputs` with the preprocessing type (one of `['ori', 'norm', 'norm_lemma', 'norm_stem', 'norm_stopword']`) and use `.` notation to select the split. The split names are `['xtrain', 'xtest', 'ytrain', 'ytest', 'xtrain_padded', 'xtest_padded']`.

For example, to access the training set of the normalized input, use:
`d_inputs['norm'].xtrain`
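
The access pattern can be tried in isolation. This is a minimal sketch with a hypothetical two-field `Split` tuple standing in for `Input` and made-up contents:

```python
from collections import defaultdict, namedtuple

Split = namedtuple("Split", "xtrain xtest")   # hypothetical two-field version of Input
splits = defaultdict(tuple)
splits["norm"] = Split(xtrain=["book a table"], xtest=["play jazz"])

print(splits["norm"].xtrain)      # field access by name
print(splits["norm"][0])          # the same field by position
print(splits["missing"])          # defaultdict(tuple) returns an empty tuple for unknown keys
```

One consequence of `defaultdict(tuple)` is that a misspelled key does not raise `KeyError`; it yields an empty tuple, and the error only surfaces later as an `AttributeError` on `.xtrain`.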

In [18]:
d_inputs['ori']

Input(xtrain=1159        I want a table for 2 at a Portugal restaurant
327     I need seating for ten people at a bar that se...
7895                                I give this book a 5.
7361                   rate this essay one out of 6 stars
3190    Whats the temperature not far from Valley of Fire
                              ...                        
5585                        I'd like to hear Helen Baylor
789     book The Kegs Drive-In in 37 weeks  in Saudi A...
5121                 play the top five songs by Gad Elbaz
5006              Can you play music from 2003 on Netflix
7309    this album is hot trash, it's totally zero stars.
Name: sentence, Length: 6343, dtype: object, xtest=658     Book a tyrolean restaurant in Crocker Indiana ...
5003                   I want to hear that tune from 2010
1735                              book a restaurant for 8
4836                       play Iheart tunes by Neil Finn
5458               Play Me Against The World from Glukoza
        

## Build Model

We will build a simple LSTM model for this intent classification task, one instance per training set.

In [35]:
model_ori = Sequential([
                     Embedding(tokenizer_vocab_size, 32, input_length = maxlength),
                     LSTM(100),
                     Dropout(0.5),
                     Dense(4, activation='softmax') ])

model_norm = Sequential([
                     Embedding(tokenizer_vocab_size, 32, input_length = maxlength),
                     LSTM(100),
                     Dropout(0.5),
                     Dense(4, activation='softmax') ])

model_norm_lemma = Sequential([
                     Embedding(tokenizer_vocab_size, 32, input_length = maxlength),
                     LSTM(100),
                     Dropout(0.5),
                     Dense(4, activation='softmax') ])

model_norm_stem = Sequential([
                     Embedding(tokenizer_vocab_size, 32, input_length = maxlength),
                     LSTM(100),
                     Dropout(0.5),
                     Dense(4, activation='softmax') ])

model_norm_stopword = Sequential([
                     Embedding(tokenizer_vocab_size, 32, input_length = maxlength),
                     LSTM(100),
                     Dropout(0.5),
                     Dense(4, activation='softmax') ])

In [22]:
model_ori.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1186, 32)          240704    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 4)                 404       
                                                                 
Total params: 294,308
Trainable params: 294,308
Non-trainable params: 0
_________________________________________________________________
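
The parameter counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below uses the values from this notebook (vocabulary size 7522, embedding dimension 32, 100 LSTM units, 4 classes); the variable names are only for illustration.

```python
vocab_size, embed_dim, lstm_units, n_classes = 7522, 32, 100, 4

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# Four weight blocks (input, forget, output gates and the cell candidate),
# each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Fully connected layer: weights plus one bias per class
dense_params = lstm_units * n_classes + n_classes

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

Most of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.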


In [34]:
def train_model(preprocess_name, model, xtrain, ytrain, xtest, ytest, epochs=40):
  model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

  # Initialize tensorboard
  logdir = os.path.join("logs", preprocess_name)
  tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
  filepath = os.path.join("checkpoint", "weights-improvement-{epoch:02d}-{accuracy:.2f}.hdf5")
  checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='accuracy', verbose=1, save_best_only=True, mode='max')

  earlystop = tf.keras.callbacks.EarlyStopping(monitor='accuracy', verbose=1, mode='max')

  return model.fit(xtrain, ytrain, validation_data=(xtest, ytest), epochs=epochs, callbacks=[tensorboard_callback, earlystop, checkpoint])
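
Both callbacks in `train_model` monitor `accuracy`, which is the training accuracy, and `EarlyStopping` is created without a `patience` argument; the logs that follow show every run ending after epoch 2. A common alternative is to monitor a validation metric and tolerate a few non-improving epochs before stopping. The sketch below is a generic illustration of that patience logic, not Keras's internal code; `stopping_epoch` and the scores are hypothetical.

```python
def stopping_epoch(val_scores, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, waited = score, 0      # improvement: reset the counter
        else:
            waited += 1                  # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one per epoch
scores = [0.80, 0.95, 0.96, 0.955, 0.958, 0.959, 0.97]
print(stopping_epoch(scores, patience=3))
print(stopping_epoch(scores, patience=4))
```

With these made-up numbers, a slightly larger patience lets training reach the late improvement in the final epoch instead of stopping before it.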

In [36]:
hist1 = train_model('ori', model_ori, d_inputs['ori'].xtrain_padded, d_inputs['ori'].ytrain, d_inputs['ori'].xtest_padded, d_inputs['ori'].ytest)

Epoch 1/40
Epoch 1: accuracy improved from -inf to 0.80719, saving model to checkpoint/weights-improvement-01-0.81.hdf5
Epoch 2/40
Epoch 2: accuracy improved from 0.80719 to 0.99511, saving model to checkpoint/weights-improvement-02-1.00.hdf5
Epoch 2: early stopping


In [37]:
hist2 = train_model('norm', model_norm, d_inputs['norm'].xtrain_padded, d_inputs['norm'].ytrain, d_inputs['norm'].xtest_padded, d_inputs['norm'].ytest)

Epoch 1/40
Epoch 1: accuracy improved from -inf to 0.80309, saving model to checkpoint/weights-improvement-01-0.80.hdf5
Epoch 2/40
Epoch 2: accuracy improved from 0.80309 to 0.98991, saving model to checkpoint/weights-improvement-02-0.99.hdf5
Epoch 2: early stopping


In [38]:
hist3 = train_model('norm_lemma', model_norm_lemma, d_inputs['norm_lemma'].xtrain_padded, d_inputs['norm_lemma'].ytrain, d_inputs['norm_lemma'].xtest_padded, d_inputs['norm_lemma'].ytest)

Epoch 1/40
Epoch 1: accuracy improved from -inf to 0.68453, saving model to checkpoint/weights-improvement-01-0.68.hdf5
Epoch 2/40
Epoch 2: accuracy improved from 0.68453 to 0.98092, saving model to checkpoint/weights-improvement-02-0.98.hdf5
Epoch 2: early stopping


In [40]:
hist4 = train_model('norm_stem', model_norm_stem, d_inputs['norm_stem'].xtrain_padded, d_inputs['norm_stem'].ytrain, d_inputs['norm_stem'].xtest_padded, d_inputs['norm_stem'].ytest)

Epoch 1/40
Epoch 1: accuracy improved from -inf to 0.99653, saving model to checkpoint/weights-improvement-01-1.00.hdf5
Epoch 2/40
Epoch 2: accuracy improved from 0.99653 to 0.99779, saving model to checkpoint/weights-improvement-02-1.00.hdf5
Epoch 2: early stopping


In [41]:
hist5 = train_model('norm_stopword', model_norm_stopword, d_inputs['norm_stopword'].xtrain_padded, d_inputs['norm_stopword'].ytrain, d_inputs['norm_stopword'].xtest_padded, d_inputs['norm_stopword'].ytest)

Epoch 1/40
Epoch 1: accuracy improved from -inf to 0.71181, saving model to checkpoint/weights-improvement-01-0.71.hdf5
Epoch 2/40
Epoch 2: accuracy did not improve from 0.71181
Epoch 2: early stopping


## Deploy Model

In [42]:
classes = ['BookRestaurant','GetWeather','PlayMusic','RateBook']

In [50]:
sample_texts = ["Play snow patrol's run", "get me pumps up kicks and mgmt", "give alice in the wonderland a tens"] 

for t in sample_texts:
  print('Sample text:', t)
  tokens = tokenizer.texts_to_sequences([t])
  tokens = sequence.pad_sequences(tokens, maxlen = maxlength)
  for m in [model_ori, model_norm, model_norm_lemma, model_norm_stem, model_norm_stopword]:
    print(classes[m.predict(np.array(tokens)).argmax()])

Sample text: Play snow patrol's run
PlayMusic
PlayMusic
PlayMusic
PlayMusic
PlayMusic
Sample text: get me pumps up kicks and mgmt
PlayMusic
PlayMusic
PlayMusic
PlayMusic
PlayMusic
Sample text: give alice in the wonderland a tens
RateBook
RateBook
RateBook
RateBook
RateBook
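
The prediction loop prints only the arg-max label. When comparing models it can help to see the predicted probability as well. A minimal sketch, assuming the model output for one sentence is a row of four class probabilities; `decode` is a hypothetical helper and the probabilities are made up:

```python
import numpy as np

classes = ['BookRestaurant', 'GetWeather', 'PlayMusic', 'RateBook']

def decode(probs):
    """Map one row of class probabilities to (label, probability)."""
    probs = np.asarray(probs)
    idx = int(probs.argmax())
    return classes[idx], float(probs[idx])

# Hypothetical softmax output for a single sentence
label, p = decode([0.02, 0.01, 0.95, 0.02])
print(label, p)
```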
