# How to prepare data for an intent classification model

A chatbot model comes with a few basic requirements.

First, the model needs inputs (patterns): the messages a user would send to the chatbot.

The model also needs a corresponding target (the intent) for each of these inputs.

In the long run we would keep these in a JSON file, which is the norm in industry, but first there is a basic problem to understand:

1. The user messages have to be collected into a corpus.

2. This so-called corpus is just a list of sentences, where every index position is a new row/document.

3. We have to convert this corpus into a bag of words.

4. The targets are also a list, where each label/intent belongs to the document at the same index position in the corpus.

5. We will have to one-hot encode this list of targets.


This notebook explores an efficient and easy-to-understand approach to getting these tasks done.

In [38]:
import pandas as pd
import numpy as np

import re
import nltk
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
sentences = ['hello',
            'hi',
            'what do you sell?',
            'what services do you render',
            'I would love 3 spring rolls',
            'can i get a bottle of coke?',
            'thank you',
            'goodbye',
            'cheers!']
intents = ['greetings',
          'greetings',
          'services',
          'services',
          'order',
          'order',
          'farewell',
          'farewell',
          'farewell',]

In [3]:
assert len(sentences) == len(intents)

# transforming sentences to bag of words

- tokenize

- lemmatize

- create vocabulary

- create bow

- convert bow list to array
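To make these steps concrete before handing them to a library, here is a minimal pure-Python sketch of tokenizing, building a vocabulary and counting words. The helper names (`tokenize`, `build_vocab`, `to_bow`) are made up for illustration, the regular expression is an assumption chosen to mimic `CountVectorizer`'s default token pattern, and the lemmatize step is left out so the sketch runs without NLTK data.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (assumption: this mirrors CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocab(docs):
    # Sorted list of every distinct token seen in the corpus.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def to_bow(doc, vocab):
    # One count per vocabulary word, in vocabulary order.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocab]

toy_docs = ["thank you", "what do you sell ?"]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)  # ['do', 'sell', 'thank', 'what', 'you']
print(toy_bow)    # [[0, 0, 1, 0, 1], [1, 1, 0, 1, 1]]
```

Each row has one column per vocabulary word, which is the same shape of output the vectorizer produces further down.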

In [4]:
stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

In [5]:
vocab = []
ignore_words = ['?', '!']

In [15]:
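def clean_corpus(sents):
    """Tokenize and lemmatize each sentence and return the cleaned sentences."""
    corpus = []
    for doc in sents:
        tokens = nltk.word_tokenize(doc)
        # Lemmatize every token except those in ignore_words, which pass through
        # unchanged; CountVectorizer's default tokenizer drops punctuation later.
        filtered_tokens = [lemmatizer.lemmatize(token) if token not in ignore_words else token for token in tokens]
        doc = ' '.join(filtered_tokens)
        corpus.append(doc)
    return corpus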

In [16]:
corpus = clean_corpus(sentences)
corpus

['hello',
 'hi',
 'what do you sell ?',
 'what service do you render',
 'I would love 3 spring roll',
 'can i get a bottle of coke ?',
 'thank you',
 'goodbye',
 'cheer !']

In [18]:
cv = CountVectorizer(min_df=0., max_df=1.)  # no document-frequency cut-offs: keep terms however many documents they appear in

In [20]:
cv_matrix = cv.fit_transform(corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
       [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

In [22]:
words = []
for pattern in sentences:
    w = nltk.word_tokenize(pattern)
    # add to our words list
    words.extend(w)

In [31]:
vocab = cv.get_feature_names()  # removed in scikit-learn 1.2; use cv.get_feature_names_out() on newer versions

In [32]:
vocab

['bottle',
 'can',
 'cheer',
 'coke',
 'do',
 'get',
 'goodbye',
 'hello',
 'hi',
 'love',
 'of',
 'render',
 'roll',
 'sell',
 'service',
 'spring',
 'thank',
 'what',
 'would',
 'you']

In [37]:
cv.transform(["this is me being cool for the coke. hello?"]).toarray()

array([[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)
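To read a row like this one, it helps to map the non-zero columns back to their vocabulary words. The sketch below copies the vocabulary and the encoded row from the outputs above into plain lists (the names `vocab_words` and `encoded` are made up here) so it runs on its own.

```python
# Vocabulary and encoded row copied from the notebook outputs above.
vocab_words = ['bottle', 'can', 'cheer', 'coke', 'do', 'get', 'goodbye',
               'hello', 'hi', 'love', 'of', 'render', 'roll', 'sell',
               'service', 'spring', 'thank', 'what', 'would', 'you']
encoded = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Keep only the words whose count is non-zero.
recognised = [word for word, count in zip(vocab_words, encoded) if count > 0]
print(recognised)  # ['coke', 'hello']
```

Every other word in the test message is outside the fitted vocabulary, so it contributes nothing to the vector.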

I have successfully reduced the complexity around creating a bag of words. The next thing is one-hot encoding the labels.

# one hot encoding

In [47]:
le = LabelEncoder()
intent_labels = le.fit_transform(intents)

encoder = OneHotEncoder(sparse=False)  # sparse=False returns a dense array instead of a sparse matrix (the argument is named sparse_output in scikit-learn >= 1.2)

intent_labels = intent_labels.reshape((-1, 1))

intent_labels = encoder.fit_transform(intent_labels)

In [48]:
intent_labels

array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])

In [49]:
encoder.categories_

[array([0, 1, 2, 3], dtype=int64)]

In [53]:
le.classes_[3]

'services'
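For intuition, the two encoders together amount to sorting the distinct labels and putting a 1 in the column of each label's position. Here is a rough pure-Python equivalent; the `one_hot` helper is hypothetical and only meant to show the idea, not to replace the scikit-learn classes.

```python
def one_hot(labels):
    # Classes are the sorted distinct labels, which matches the order
    # LabelEncoder reports in classes_.
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0  # mark the column for this label
        rows.append(row)
    return rows, classes

rows, classes = one_hot(["greetings", "services", "order", "farewell"])
print(classes)  # ['farewell', 'greetings', 'order', 'services']
print(rows[0])  # [0.0, 1.0, 0.0, 0.0]
```

This is why `greetings` lands in column 1 and `services` in column 3 in the array above.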

I have been able to one-hot encode the labels without writing long and confusing code. I just need to pickle my vectorizer, label encoder and one-hot encoder so they can be reused for transformation later!
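One possible way to do the pickling is to bundle the fitted objects into a single file. The sketch below is an assumption about how that could look: `save_artifacts`, `load_artifacts` and the file name are made up, and plain Python objects stand in for `cv`, `le` and `encoder` so the example runs without scikit-learn.

```python
import os
import pickle
import tempfile

def save_artifacts(path, **artifacts):
    # Store all named objects together in one pickle file.
    with open(path, "wb") as fh:
        pickle.dump(artifacts, fh)

def load_artifacts(path):
    # Return the dictionary of objects saved by save_artifacts.
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Stand-ins for the fitted vectorizer and label encoder (hypothetical values).
demo_path = os.path.join(tempfile.mkdtemp(), "chatbot_artifacts.pkl")
save_artifacts(demo_path,
               vectorizer={"vocab": ["hello", "hi"]},
               label_encoder=["farewell", "greetings"])

restored = load_artifacts(demo_path)
print(sorted(restored))  # ['label_encoder', 'vectorizer']
```

In the notebook the call would presumably be something like `save_artifacts(path, vectorizer=cv, label_encoder=le, one_hot_encoder=encoder)`. Only unpickle files you trust, since loading a pickle can execute arbitrary code.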

# generating corpus from json

In [54]:
json_file = {"intents": [
    {"tag": "greetings",
    "patterns": ["hello", "hey", "hi", "How far?", "My guy!"],
    "responses": ["Hello!", "hey!", "hey there, what can i do for you?"]
    },

    {"tag": "goodbye",
     "patterns": ["cya", "see you later", "goodbye", "got to go", "see ya!"],
     "responses": ["nice chatting with you", "talk to you soon, cheers"]
    },

    {"tag": "age",
     "patterns": ["how old are you", "what's your age?", "Age"],
     "responses": ["just a few days old, still have a lot to learn", "less than a week old"]
    },
    
    {"tag": "name",
     "patterns": ["what is your name", "what should i call you", "what's your name", "can you tell me your name?"],
     "responses": ["you can call me lisa", "I'm lisa!", "Elizabeth. but please call me lisa *winks*"]
    },

    {"tag": "shop",
     "patterns": ["I' like to buy something", "what are your products", "what do you recommend?", "what are you selling"],
     "responses": ["we sell samosa, spring rolls, chocoloate and vanilla cakes, and also bake anything you want on demand!"]
    },

    {"tag": "hours",
     "patterns": ["when are you guys open", "what are your hours", "hours of opening?"],
     "responses": ["we are open at all times", "24/7"]
    }
   ]
}

In [55]:
tags = []
pattern_corpus = []

for intent in json_file['intents']:
    for pattern in intent['patterns']:
        pattern_corpus.append(pattern)
        tags.append(intent['tag'])

In [56]:
assert len(pattern_corpus) == len(tags)
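Here the intents live in a Python dictionary, but in practice they would probably sit in a file on disk. A small sketch of the same flattening loop reading from a file follows; the `load_patterns` helper and the `intents.json` file name are assumptions, and the demo writes a tiny file to a temporary folder first so that it is self-contained.

```python
import json
import os
import tempfile

def load_patterns(path):
    """Read an intents file and return parallel lists of patterns and tags."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    patterns, tags = [], []
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            patterns.append(pattern)
            tags.append(intent["tag"])  # one tag per pattern, same index
    return patterns, tags

# Demo: write a tiny intents file, then read it back.
demo = {"intents": [
    {"tag": "greetings", "patterns": ["hello", "hey"], "responses": ["Hello!"]},
    {"tag": "hours", "patterns": ["what are your hours"], "responses": ["24/7"]},
]}
demo_path = os.path.join(tempfile.mkdtemp(), "intents.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump(demo, fh)

demo_patterns, demo_tags = load_patterns(demo_path)
print(demo_patterns)  # ['hello', 'hey', 'what are your hours']
print(demo_tags)      # ['greetings', 'greetings', 'hours']
```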

In [57]:
unique_labels = set(tags)

In [58]:
pattern_corpus

['hello',
 'hey',
 'hi',
 'How far?',
 'My guy!',
 'cya',
 'see you later',
 'goodbye',
 'got to go',
 'see ya!',
 'how old are you',
 "what's your age?",
 'Age',
 'what is your name',
 'what should i call you',
 "what's your name",
 'can you tell me your name?',
 "I' like to buy something",
 'what are your products',
 'what do you recommend?',
 'what are you selling',
 'when are you guys open',
 'what are your hours',
 'hours of opening?']

In [64]:
def create_bow(sents):
    cv = CountVectorizer(min_df=0., max_df=1.)
    # fit on the sentences passed in, not on the global `corpus`
    bow = cv.fit_transform(sents)
    bow = bow.toarray()
    return bow, cv

In [72]:
def ohe_labels(target_list):
    le = LabelEncoder()
    intent_label = le.fit_transform(target_list)

    encoder = OneHotEncoder(sparse=False)  # sparse=False returns a dense array instead of a sparse matrix (the argument is named sparse_output in scikit-learn >= 1.2)

    intent_label = intent_label.reshape((-1, 1))

    intent_label = encoder.fit_transform(intent_label)
    return intent_label, le

In [73]:
corpus = clean_corpus(pattern_corpus)
matrix, vec = create_bow(corpus)

In [74]:
labels, le_vec = ohe_labels(tags)

# working in a production environment

Before I can pass anything to the model, I first need to pass it through all my functions for cleaning the data.

In [79]:
msg = clean_corpus(['hello there my really cool and amazingly running guy'])
msg = vec.transform(msg).toarray()

# msg can now be passed to the model

# after the model makes a prediction I just need to:

# - use argmax to find the index position, e.g. idx = np.argmax(model.predict(msg)[0])

# - use le_vec.classes_[idx] to get the intent class, so I can use the JSON to pick a response

Using argmax eliminates the long process of trying to sort the list of predictions. Instead, I can get the argmax index and its value: if the value is less than my error threshold, the bot simply says it does not understand the user's input/request; otherwise it uses the index to pick a random response and attends to the user immediately.
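The argmax-plus-threshold idea could be sketched roughly as follows. The `pick_response` helper, the 0.5 threshold and the fallback message are all assumptions made for illustration, and a plain list of probabilities stands in for the model's prediction so the example runs without a trained model.

```python
import random

def pick_response(probabilities, classes, responses_by_tag, threshold=0.5,
                  fallback="Sorry, I did not understand that."):
    # Index of the highest probability (what np.argmax would return).
    best_idx = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Below the threshold the bot admits it did not understand.
    if probabilities[best_idx] < threshold:
        return fallback
    # Otherwise pick a random response for the predicted intent.
    return random.choice(responses_by_tag[classes[best_idx]])

# Hypothetical classes and responses, in the order the label encoder would sort them.
demo_classes = ["age", "goodbye", "greetings"]
demo_responses = {
    "age": ["less than a week old"],
    "goodbye": ["talk to you soon, cheers"],
    "greetings": ["Hello!", "hey!"],
}

confident = pick_response([0.05, 0.05, 0.90], demo_classes, demo_responses)
unsure = pick_response([0.40, 0.30, 0.30], demo_classes, demo_responses)
print(confident)  # one of the greetings responses
print(unsure)     # the fallback message
```

The `responses_by_tag` lookup can be built once from the JSON by mapping each intent's `tag` to its `responses` list.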