The punkt package for the Natural Language Toolkit (NLTK) provides pre-trained models for the Punkt sentence tokenizer, including one for English. The tokenizer splits text into sentences, and nltk.word_tokenize, which is used later in this notebook, relies on it before splitting each sentence into words.

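The PorterStemmer is a class in the Natural Language Toolkit (NLTK) library used for stemming English words. Stemming is the process of reducing words to a common base form (the stem) by stripping suffixes. For example, the words "running" and "runs" are both stemmed to "run". An irregular form such as "ran" is left unchanged, because the stemmer applies suffix rules rather than looking words up in a dictionary. The stem also does not have to be a real word: "located" becomes "locat".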

The PorterStemmer uses a series of rules to stem words. These rules include:

Stripping plural suffixes such as "s" and "es".
Stripping the past-tense suffix "ed".
Stripping the present-participle suffix "ing".
Stripping further suffixes in later steps, such as "ness", "ful", and "er", when enough of the word remains. Comparatives and superlatives are not handled specifically, so a word like "fastest" is typically left unchanged.
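
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. The function `toy_stem` is a hypothetical helper written for this illustration. It is not the Porter algorithm and not part of NLTK: the real stemmer applies many more rules and checks how much of the word remains before removing a suffix.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy, not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (keep ll, ss, zz).
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

print([toy_stem(w) for w in ["running", "runs", "jumped", "ran"]])
# ['run', 'run', 'jump', 'ran']
```

Even this small version shows why "ran" cannot be mapped to "run" by suffix rules alone.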

In [1]:
# libraries needed for NLP
import nltk
# tokenizer data used by nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
from nltk.stem import PorterStemmer
# libraries needed for tensorflow processing
import tensorflow as tf
import numpy as np
import random
import json

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
stemmer = PorterStemmer()

In [3]:
#import our chatbot intents file
with open('intents.json') as json_data:
  intents = json.load(json_data)

In [4]:
intents

{'intents': [{'tag': 'greeting',
   'patterns': ['Hi', 'How are you', 'Is anyone there?', 'Hello', 'Good day'],
   'responses': ['Hello, thanks for visiting',
    'Good to see you again',
    'Hi there, how can I help?'],
   'context_set': ''},
  {'tag': 'goodbye',
   'patterns': ['Bye', 'See you later', 'Goodbye'],
   'responses': ['See you later, thanks for visiting',
    'Have a nice day',
    'Bye! Come back again soon.']},
  {'tag': 'thanks',
   'patterns': ['Thanks', 'Thank you', "That's helpful"],
   'responses': ['Happy to help!', 'Any time!', 'My pleasure']},
  {'tag': 'chatbot',
   'patterns': ['Who built this chatbot?',
    'Tell me about Chatbot',
    'What is this chatbot name?'],
   'responses': ['Hi, I am kirito_Chatbot designed by Mohamed.',
    'Thanks for asking. I am designed by Mohamed saif.',
    'I am a kirito.']},
  {'tag': 'location',
   'patterns': ['What is your location?',
    'Where are you located?',
    'What is your address?'],
   'responses': ['We are fr

#### Preprocessing Data

Tokenizing is the process of breaking down a string into smaller units called tokens. These tokens can be words, phrases, symbols, or other meaningful elements. Tokenization is a fundamental step in many natural language processing (NLP) tasks, such as:

Machine translation: Breaking down sentences into words or phrases allows translation models to better understand the meaning of the source text and generate accurate translations.
Sentiment analysis: Tokenizing text into words or phrases helps identify the sentiment expressed in the text.
Named entity recognition: Tokenizing text allows models to identify and classify named entities, such as people, organizations, and locations.

There are different ways to tokenize text, depending on the specific NLP task and the desired outcome. Some common tokenization techniques include:

Word tokenization: This is the simplest form of tokenization, where the text is split into individual words.
Sentence tokenization: This involves splitting the text into sentences.
Phrase tokenization: This breaks the text into meaningful phrases or chunks of words.
Subword tokenization: This splits words into smaller subword units, which can be useful for dealing with out-of-vocabulary words.
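
As a rough illustration of word and sentence tokenization, the following sketch uses only Python's `re` module. The helper names `word_tokens` and `sentence_tokens` are assumptions made for this example. They are much simpler than NLTK's tokenizers and will differ on some inputs; for instance, this regex splits "That's" into three tokens, while nltk.word_tokenize yields 'That' and "'s".

```python
import re

def word_tokens(text):
    # Runs of word characters, or any single non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokens("Is anyone there?"))
# ['Is', 'anyone', 'there', '?']
print(sentence_tokens("Hi there. How are you? Good day!"))
# ['Hi there.', 'How are you?', 'Good day!']
```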

In [5]:
words = []
classes = []
documents = []
ignore = ['?']
# loop through each sentence in the intent's patterns
for intent in intents['intents']:
  for pattern in intent['patterns']:
    # tokenize each word in the sentence
    w = nltk.word_tokenize(pattern)
    # add the words to the word list
    words.extend(w)
    # add the (tokens, tag) pair to our documents
    documents.append((w, intent['tag']))
    # add the tag to our classes list
    if intent['tag'] not in classes:
      classes.append(intent['tag'])

In [6]:
# perform stemming, lowercase each word, and remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore]
words = sorted(set(words))

# remove duplicate classes
classes = sorted(set(classes))

print(len(documents), "documents", documents)
print(len(classes), "classes", classes)
print(len(words), "unique_stemmed_words", words)

27 documents [(['Hi'], 'greeting'), (['How', 'are', 'you'], 'greeting'), (['Is', 'anyone', 'there', '?'], 'greeting'), (['Hello'], 'greeting'), (['Good', 'day'], 'greeting'), (['Bye'], 'goodbye'), (['See', 'you', 'later'], 'goodbye'), (['Goodbye'], 'goodbye'), (['Thanks'], 'thanks'), (['Thank', 'you'], 'thanks'), (['That', "'s", 'helpful'], 'thanks'), (['Who', 'built', 'this', 'chatbot', '?'], 'chatbot'), (['Tell', 'me', 'about', 'Chatbot'], 'chatbot'), (['What', 'is', 'this', 'chatbot', 'name', '?'], 'chatbot'), (['What', 'is', 'your', 'location', '?'], 'location'), (['Where', 'are', 'you', 'located', '?'], 'location'), (['What', 'is', 'your', 'address', '?'], 'location'), (['Give', 'me', 'your', 'social', 'media', 'accounts', 'link'], 'connect'), (['Where', 'can', 'we', 'connect'], 'connect'), (['How', 'can', 'i', 'reach', 'out', 'to', 'you', '?'], 'connect'), (['Is', 'there', 'any', 'way', 'we', 'can', 'connect'], 'connect'), (['Which', 'is', 'your', 'favourite', 'movie', '?'], 'mov

#### Training Models

In [7]:
# create training data
training = []
output = []
#create an empty array for output
output_empty = [0] * len(classes)

#create training set, bag of words for each sentence
for doc in documents:
  #init bag of words
  bag = []
  # list of tokenized words for the pattern
  pattern_words = doc[0]
  # stemming each word
  pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
  # create bag of words array: 1 if the vocabulary word occurs in the pattern, else 0
  for w in words:
    bag.append(1 if w in pattern_words else 0)

  # output is 1 for current tag 0 for the rest of other tags
  output_row = list(output_empty)
  output_row[classes.index(doc[1])] = 1

  training.append([bag, output_row])

# shuffle once, after all documents are processed, then turn features and labels into np.array
random.shuffle(training)
training = [[np.array(item[0]), np.array(item[1])] for item in training]
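
The encoding built in the cell above can be seen on a tiny example. The sketch below is a simplified, standalone version in plain Python. The vocabulary, the labels, and the helper names `bag_of_words` and `one_hot` are hypothetical and chosen only for illustration; they are not taken from intents.json.

```python
def bag_of_words(tokens, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, labels):
    """Return a 0/1 vector with a single 1 at the position of label."""
    return [1 if candidate == label else 0 for candidate in labels]

vocabulary = ["are", "bye", "hi", "how", "you"]  # hypothetical, already stemmed and sorted
labels = ["goodbye", "greeting"]                  # hypothetical class list

print(bag_of_words(["how", "are", "you"], vocabulary))  # [1, 0, 0, 1, 1]
print(one_hot("greeting", labels))                       # [0, 1]
```

Each training example is then one such pair: the bag-of-words vector as input and the one-hot vector as target.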


In [8]:
#creating training lists
train_x = [item[0] for item in training]
train_y = [item[1] for item in training]

In [9]:
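# feed-forward classifier: bag-of-words vector in, one probability per class out
model = tf.keras.Sequential()
# note: the two hidden Dense layers have no activation, so they are linear;
# passing activation='relu' is a common choice for hidden layers
model.add(tf.keras.layers.Dense(10, input_shape=(len(train_x[0]),)))
model.add(tf.keras.layers.Dense(10))
# softmax turns the final layer's outputs into class probabilities
model.add(tf.keras.layers.Dense(len(train_y[0]), activation='softmax'))
model.compile(tf.keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])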
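
The output layer and the loss used above can be sketched by hand. The following plain-Python example shows what softmax and categorical cross-entropy compute for a single example. It is an illustration under simplifying assumptions, not Keras' implementation (which, among other things, works on batches), and the input scores are made up.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    # -sum(t * log(p)); with a one-hot target only the true class contributes.
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t)

probs = softmax([2.0, 1.0, 0.1])               # made-up scores for three classes
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.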




In [12]:
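# train for 100 epochs in mini-batches of 8 examples
model.fit(np.array(train_x), np.array(train_y), epochs=100, batch_size=8, verbose=1)
# despite the '.pkl' name, model.save() writes Keras' own format, not a pickle;
# recent Keras versions only accept file names ending in '.keras' or '.h5'
model.save('model.pkl')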

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78