# Intent Classification

In [44]:
#Required imports
import numpy as np
import json
import random
import pandas as pd
import re
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Bidirectional, Embedding, Dropout
from keras.callbacks import ModelCheckpoint
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
nltk.download('wordnet')
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jayanth\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jayanth\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jayanth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Class that encapsulates variables and methods for training our model

In this task we are given a JSON file that contains sentences along with their corresponding intent labels. We need to train a model that takes a sentence and classifies which intent it belongs to. For this task I use only 20 randomly chosen intents, out of a list of 150 intent classes, for training and testing.

First, I preprocess the data. We are given separate data for training, validation and testing. The first preprocessing step filters special characters out of each sentence, tokenizes the sentence into a list of words (the code uses NLTK's `word_tokenize`) and lowercases every word. I also lemmatize each word, i.e. convert it to its root form. This lets the model treat the text as a sequence of normalized words when interpreting its meaning. I did not remove stop words, so that some noise stays in the dataset (this may help generalization). I also record the maximum sentence length over the training sentences, which is needed later for padding.

The second step builds a vocabulary from these words so that they can be encoded. I create a vocabulary index based on word frequency: each word is mapped to an integer, and the more frequently a word has occurred, the lower its integer. Applying this tokenizer to a sentence turns it into a list of integers, one per word. Since sentences differ in length, I pad each one with 0 at the end up to the maximum length found earlier (the embedding layer I use expects all input sequences to have the same length).
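The two preprocessing steps can be sketched without any dependencies. This is only an illustration of the idea, not the notebook's implementation (which relies on the Keras `Tokenizer` and `pad_sequences`): the function names are hypothetical, lemmatization is left out, ties in word frequency are broken alphabetically as an arbitrary choice, and over-long sentences are simply cut at the end.

```python
import re
from collections import Counter

def clean_and_split(sentence):
    """Replace non-alphanumeric characters with spaces, lowercase, split on whitespace."""
    return re.sub(r"[^a-zA-Z0-9 ]", " ", sentence).lower().split()

def build_word_index(tokenized_sentences):
    """Map each word to an integer >= 1; more frequent words get smaller integers."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def encode_and_pad(tokenized_sentences, word_index, max_length):
    """Encode words as integers (unknown words are dropped) and pad with 0 at the end."""
    padded = []
    for sent in tokenized_sentences:
        ids = [word_index[w] for w in sent if w in word_index][:max_length]
        padded.append(ids + [0] * (max_length - len(ids)))
    return padded

# Made-up example sentences, not taken from the dataset.
sentences = ["Book a flight to Paris!", "flip a coin", "what is the date today?"]
tokens = [clean_and_split(s) for s in sentences]
max_length = max(len(t) for t in tokens)
word_index = build_word_index(tokens)
padded = encode_and_pad(tokens, word_index, max_length)

for row in padded:
    print(row)
```

Index 0 is never assigned to a word, which is why it can safely be used as the padding value and why the vocabulary size passed to the embedding layer is the number of words plus one.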

I apply the same tokenization to the labels, except that each full label is treated as a single word. The last preprocessing step one-hot encodes the labels: with 20 classes, each sentence gets a 20-dimensional target vector, which is the form the softmax output and the categorical cross-entropy loss expect.
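As a small illustration of the label encoding, here is a dependency-free sketch. It assumes a fixed, sorted class list shared by all splits so that a given column always means the same intent; `one_hot_labels` is a hypothetical helper, not a function from the notebook, which uses scikit-learn's `OneHotEncoder` instead.

```python
def one_hot_labels(labels, classes):
    """Return one row per label, with a 1.0 in the column of that label's class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows

# Three of the intents used in this notebook, in sorted order.
classes = sorted({"flip_coin", "date", "book_flight"})
train_y = one_hot_labels(["date", "flip_coin", "date"], classes)
test_y = one_hot_labels(["book_flight"], classes)

print(classes)
print(train_y)
print(test_y)
```

Sharing one class list across train, validation and test keeps the columns aligned even if a split happens to miss a class.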

The tokenized, encoded and padded data is the input to the model. The model starts with an embedding layer, which maps each word to a vector in a continuous space; these embeddings act as the extracted feature representation. On top of it I use a Bidirectional LSTM layer. A vanilla RNN suffers from vanishing/exploding gradients, and a Bidirectional LSTM encodes the sequence in both the forward and the backward direction, so I felt it would capture the context better for the intent classification task.
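As a rough check on this architecture, the parameter counts in the model summary printed further down can be reproduced by hand. The sketch below assumes the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias). The vocabulary size of 1340 is not stated in the text; it is inferred from the embedding parameter count in that summary (343040 / 256). The function names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs

vocab_size = 1340  # assumption: inferred from the summary, see above
embedding = embedding_params(vocab_size, 256)
bidirectional = 2 * lstm_params(256, 256)  # forward + backward LSTM
dense_hidden = dense_params(2 * 256, 32)   # input is the concatenation of both directions

print(embedding, bidirectional, dense_hidden)
```

The bidirectional wrapper doubles both the parameter count and the output width, which is why the following dense layer sees 512 inputs rather than 256.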

I use accuracy as the metric to measure the performance of the model. I also build a confusion matrix for this multiclass problem, since it shows whether any particular group of classes is being confused with one another.
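To make the confusion matrix concrete, here is a minimal pure-Python sketch of what it contains: rows are the actual classes, columns are the predicted classes, and accuracy is the diagonal sum divided by the number of samples. The notebook itself uses `sklearn.metrics.confusion_matrix`; the helper below and its toy labels are illustrative only.

```python
def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Toy labels, not results from the trained model.
actual = ["date", "date", "taxes", "taxes", "distance"]
predicted = ["date", "taxes", "taxes", "taxes", "distance"]
labels = sorted(set(actual) | set(predicted))

matrix = confusion_matrix(actual, predicted, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

Off-diagonal entries show which pairs of intents get mixed up, which a single accuracy number cannot reveal.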

Since I was not fully familiar with text-based modelling, it was a nice challenge to understand the steps typically performed on text, such as filtering, stop-word removal, lemmatization and vocabulary building. I have written several helper functions to keep the code short and simple. In future implementations I am thinking of using stronger models such as BERT.

In [48]:
class intentClassification():
    
    def __init__(self, train_data,val_data,test_data):
        #Creating a dictionary to store the data with intent as key and sentences as values
        self.train_dict=self.create_dictionary(train_data)
        self.val_dict=self.create_dictionary(val_data)
        self.test_dict=self.create_dictionary(test_data)
        self.training_testing()

    #function to train the model
    def training_testing(self):
        
        #To randomly sample any 20 intent classes for the purpose of training. They are stored in intent.txt so that the
        #same classes can be reloaded when using a pretrained model.
        
        #********************Important - To train the model on a different set of 20 intents, set the new_intents
        #variable below to 1. To use the pretrained model, set new_intents to 0 so that the intents are read from
        #intent.txt (the model loading code is further below)***************************
        
        new_intents=0
        if(new_intents):
            self.train_random_20_data = random.sample(list(self.train_dict), 20) 
            with open('./intent.txt', 'w') as f:
                for item in self.train_random_20_data:
                    f.write("%s\n" % item)
        else:
            with open('./intent.txt') as f:
                self.train_random_20_data = f.read().splitlines()
        
        #Printing the intents(labels) I am using
        print(self.train_random_20_data)

        #Storing the 20 intent classes and its corresponding sentences
        train_rows=self.create_rows(self.train_dict)
        val_rows=self.create_rows(self.val_dict)
        test_rows=self.create_rows(self.test_dict)
        
        #Using data frame to store the data from the above lists with columns sentences and its corresponding intent label
        df_train = pd.DataFrame(train_rows, columns=["Sentence", "Intent"])   
        df_val=pd.DataFrame(val_rows, columns=["Sentence", "Intent"])
        df_test=pd.DataFrame(test_rows, columns=["Sentence", "Intent"])
        
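        #List of intents I am using, sorted so that the label-to-index mapping is the same on every run
        #(set iteration order can differ between Python sessions, which matters when loading a saved model)
        unique_intents = sorted(set(df_train['Intent']))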
        
        #To remove the special characters and convert sentences into list of words
        s_train=list(df_train['Sentence'])
        s_val=list(df_val['Sentence'])
        s_test=list(df_test['Sentence'])
        clean_words_train = self.tokenize_to_words(s_train)
        clean_words_val = self.tokenize_to_words(s_val)
        clean_words_test = self.tokenize_to_words(s_test)
        #print(len(clean_words_train))
        #print(clean_words_train[:2])
        
        #To find out the maximum length from all the sentences(maximum number of words)
        max_length = len(max(clean_words_train, key = len))
        
        #Using tokenizer to create a vocabulary based on word frequency
        word_tokenizer = self.create_tokenizer(clean_words_train)
        
        #Size of the vocabulary
        total_vocabulary_size = len(word_tokenizer.word_index) + 1
        
        #To transform each sentence to a sequence of integers using the tokenizer. After doing this I pad all the 
        #sentences at the end by appending 0 to make the length of all sentences the same and equal to the length of the 
        #longest sentence. This is done for train, test and validation data
        encoded_doc_train = word_tokenizer.texts_to_sequences(clean_words_train)
        padded_doc_train = pad_sequences(encoded_doc_train, maxlen=max_length,padding = "post")
        
        encoded_doc_val = word_tokenizer.texts_to_sequences(clean_words_val)
        padded_doc_val = pad_sequences(encoded_doc_val, maxlen=max_length,padding = "post")
        
        encoded_doc_test = word_tokenizer.texts_to_sequences(clean_words_test)
        padded_doc_test = pad_sequences(encoded_doc_test, maxlen=max_length,padding = "post")
        
        #Use the tokenizer for the labels to convert them to integer for train,test and validation data
        output_tokenizer = self.create_tokenizer(unique_intents,'!"#$%&()*+,-/:;<=>?@[\\]^`{|}~\t\n')
        #print(output_tokenizer.word_index)
        encoded_output_train = output_tokenizer.texts_to_sequences(df_train["Intent"])
        encoded_output_train = np.array(encoded_output_train).reshape(len(encoded_output_train), 1)
        
        encoded_output_val = output_tokenizer.texts_to_sequences(df_val["Intent"])
        encoded_output_val = np.array(encoded_output_val).reshape(len(encoded_output_val), 1)
        
        encoded_output_test = output_tokenizer.texts_to_sequences(df_test["Intent"])
        encoded_output_test = np.array(encoded_output_test).reshape(len(encoded_output_test), 1)
        
        #Since we have 20 classes I make use of one hot encoder to represent the categorical data in a more usable form for 
        #the model to use them for training purpose
        output_one_hot_encoded_train = self.one_hot_encode(encoded_output_train)
        output_one_hot_encoded_val = self.one_hot_encode(encoded_output_val)
        output_one_hot_encoded_test = self.one_hot_encode(encoded_output_test)
        
        #Preparing data for feeding into the model
        train_X,train_Y=padded_doc_train,output_one_hot_encoded_train
        val_X,val_Y=padded_doc_val,output_one_hot_encoded_val
        test_X,test_Y=padded_doc_test,output_one_hot_encoded_test
        
        #Creating the model and using categorical crossentropy as our loss function with adam optimizer
        model = self.create_model(total_vocabulary_size, max_length)
        model.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])
        model.summary()
        
        #Training the model
        #*****Set is_train to 1 to train and save the model, or to 0 to load the previously saved model instead*****
        is_train=1
        if(is_train):
            hist = model.fit(train_X, train_Y, epochs = 100, batch_size = 32, validation_data = (val_X, val_Y))
            #Saving the trained model
            model.save('./model')
        else:
            #Use a pretrained model
            model = keras.models.load_model('./model')

        
        #print(model.metrics_names)
        scores = model.evaluate(test_X, test_Y, verbose=1)
        #print(scores)
        print("\nTesting Accuracy : "+str(scores[1]*100))
        
        #For building confusion matrix
        #predict returns the softmax class probabilities (predict_proba is not available on Sequential in recent
        #TensorFlow releases)
        pred = model.predict(test_X)
        Actual=[]
        Predicted=[]
        for i in range(len(pred)):
            part=np.argmax(pred[i])
            Actual.append(test_rows[i][1])
            Predicted.append(unique_intents[part])
        print("\nConfusion Matrix")       
        print(metrics.confusion_matrix(Actual, Predicted))
        print("\nClassification Report")
        print(metrics.classification_report(Actual, Predicted, digits=3))


    #This function denotes the model I have used for classification.
    def create_model(self,vocab_size, max_length):
        model = Sequential()
        model.add(Embedding(vocab_size, 256, input_length = max_length, trainable = False))   
        model.add(Bidirectional(LSTM(256)))
        model.add(Dense(32, activation = "relu"))
        model.add(Dropout(0.5))
        model.add(Dense(20, activation = "softmax"))
        return model  
    
    #This function helps in encoding the intents using One hot encoder(20 classes)
    def one_hot_encode(self,encode):
        temp = OneHotEncoder(sparse = False)
        return(temp.fit_transform(encode))
   
    #This function creates a vocabulary index based on the word frequency, taking into consideration all the filters that
    #are specified. Each word is mapped to an integer; the more frequently a word has occurred, the lower its integer.
    def create_tokenizer(self,words,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
        token = Tokenizer(filters = filters)   
        token.fit_on_texts(words)
        return token
    
    #This function is used to remove any special characters, lemmatize the words, lowercase the letters and convert 
    #sentences into a list of words
    def tokenize_to_words(self,sentences):
        words = []
        for s in sentences:
            removal = re.sub(r'[^a-zA-Z0-9 ]', " ", s)
            w = word_tokenize(removal)
            words.append([lemmatizer.lemmatize(i.lower()) for i in w])
        return words

    #To create dictionary that stores the intent and all the sentences of that intent for train,test and validation data
    def create_dictionary(self,data):
        temp_dict={}
        for i in range(len(data)):
            if (data[i][1] not in temp_dict):
                temp_dict[data[i][1]] = []
            temp_dict[data[i][1]].append(data[i][0])
        return temp_dict
    
    #This helper function builds the rows of the dataframe for train, test and validation dataset from the dictionary,
    #keeping only the 20 sampled intent classes. Each row is [sentence, intent]
    def create_rows(self,temp_dict):
        rows=[]
        for i in self.train_random_20_data:
            for j in temp_dict[i]:
                rows.append([j, i])
        return rows

## Main Function
We have an intent classification dataset in the form of a JSON file. In the main function I read the training, testing and validation data from the dataset and call the training class to train the model. 
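The helper `create_dictionary` indexes each record as `[sentence, intent]`, so each split in the JSON file is presumably a list of such pairs. A minimal sketch of that grouping step, using a made-up inline JSON string instead of the real file and a hypothetical function name:

```python
import json
from collections import defaultdict

# Made-up stand-in for the dataset file; the real file is much larger.
raw = (
    '{"train": [["flip a coin", "flip_coin"], ["heads or tails", "flip_coin"], '
    '["what day is it", "date"]], "val": [], "test": []}'
)
data = json.loads(raw)

def group_by_intent(pairs):
    """Group sentences by intent: {intent: [sentence, ...]}."""
    grouped = defaultdict(list)
    for sentence, intent in pairs:
        grouped[intent].append(sentence)
    return dict(grouped)

train_dict = group_by_intent(data["train"])
print(train_dict)
```

Grouping by intent first makes it cheap to pick 20 intents and collect all of their sentences afterwards.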

In [49]:
if __name__ == '__main__':
    with open('./data_full.json') as f:
        data = json.load(f)
    train_data=data['train']
    test_data=data['test']
    val_data=data['val']
    intentClassification(train_data,val_data,test_data)

['todo_list', 'redeem_rewards', 'order_checks', 'bill_balance', 'where_are_you_from', 'distance', 'pin_change', 'book_flight', 'improve_credit_score', 'next_holiday', 'change_accent', 'date', 'who_do_you_work_for', 'report_lost_card', 'flip_coin', 'application_status', 'shopping_list_update', 'restaurant_reviews', 'taxes', 'whisper_mode']
Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 26, 256)           343040    
_________________________________________________________________
bidirectional_12 (Bidirectio (None, 512)               1050624   
_________________________________________________________________
dense_23 (Dense)             (None, 32)                16416     
_________________________________________________________________
dropout_12 (Dropout)         (None, 32)                0         
__________________________________________

Epoch 48/100
...
Epoch 100/100

Testing Accuracy : 94.33333277702332

Confusion Matrix
[[30  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0 28  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0]
 [ 0  0 30  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 27  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0]
 [