### What is NLP?

Natural Language Processing (NLP) is a field in Computer Science and Artificial Intelligence (AI) that enables machines to understand, interpret, and respond to human language in a meaningful way. It focuses on bridging the gap between human communication and computer understanding, allowing for tasks such as language translation, sentiment analysis, text summarization, and more.

### Popular NLP Libraries
- **spaCy**: A fast and efficient library for NLP tasks such as tokenization, part-of-speech tagging, and named entity recognition.
- **Gensim**: Primarily used for topic modeling, document similarity, and word embedding (Word2Vec, etc.).
- **NLTK (Natural Language Toolkit)**: A powerful library for text processing, including classification, tokenization, stemming, and more.
- **Scikit-Learn**: A general-purpose machine learning library that can be applied to NLP for tasks like classification and clustering.
- **TensorFlow**: A deep learning framework often used for building advanced NLP models, such as neural networks for text generation or classification.
- **Hugging Face Transformers**: A library that provides pre-trained NLP models, especially transformer-based architectures such as BERT and GPT.


### Implementation of Chatbot Using NLP

NLP enables chatbots to understand, process, and respond to human language through tokenization, lemmatization, and intent recognition. The process involves collecting and preprocessing data, training a model to classify user intents, and generating appropriate responses. Chatbots are built by defining intents, vectorizing data, and integrating frontend and backend for real-time interaction.



### Importing Libraries

In this block, we import the necessary libraries for the chatbot project:

- **nltk**: Used for natural language processing (NLP), particularly for tokenization.
- **numpy**: Used for numerical operations, particularly in array manipulation.
- **tensorflow**: The main machine learning library used for building and training the neural network.
- **json**: For working with the intents dataset in JSON format.
- **random**: For selecting random responses from the intents data.

In [1]:
import nltk
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD
import json
import random

# Download the Punkt tokenizer data used by nltk.word_tokenize.
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Hari
[nltk_data]     Priya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Loading the Intents Dataset

In this block, we open and load the **intents.json** file, which contains the dataset used for training the chatbot. The file is loaded using Python's built-in **json** module, and its content is stored in the `intents` variable as a Python dictionary.


In [4]:
with open('intents.json') as file:
    intents = json.load(file)
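
The code in this notebook reads `intents['intents']` and, for each entry, the keys `tag`, `patterns`, and `responses`. A minimal sketch of data with that shape is shown below. The two intents and their texts are made-up sample data, not the contents of the actual dataset, and `validate_intents` is a hypothetical helper.

```python
import json

# Hypothetical sample data illustrating the structure the notebook expects.
sample_intents = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi", "Hello there"],
            "responses": ["Hello!", "Hi, how can I help?"],
        },
        {
            "tag": "goodbye",
            "patterns": ["Bye", "See you"],
            "responses": ["See you later", "Goodbye!"],
        },
    ]
}

# Round-trip through JSON text, as json.load() would do when reading a file.
loaded = json.loads(json.dumps(sample_intents))


def validate_intents(data):
    """Check that every intent has the three keys used later in the notebook."""
    required = {"tag", "patterns", "responses"}
    for intent in data["intents"]:
        missing = required - set(intent)
        if missing:
            raise ValueError(f"intent {intent.get('tag')!r} is missing {sorted(missing)}")
    return True


validate_intents(loaded)
print([intent["tag"] for intent in loaded["intents"]])
```

Running a check like this right after loading can surface a malformed entry before it causes a `KeyError` in the preprocessing loop.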

### Preprocessing the Data and Saving Words and Classes

In this block, we tokenize the patterns from the intents dataset, extract unique words and classes, and save them as **pickle files** for later use. The steps involved are as follows:

1. **Tokenization and Preprocessing**: Each sentence in the dataset is tokenized into individual words, which are then stored in the `words` list. Additionally, each pattern is paired with its corresponding tag (class) and added to the `documents` list.
2. **Removing Punctuation**: Punctuation tokens such as `?`, `!`, `.`, and `,` are excluded from the `words` list. No stop-word removal is performed.
3. **Lowercasing and Deduplication**: All words are converted to lowercase for uniformity, and duplicate words are removed. No stemming is applied in this code.
4. **Saving Data**: The lists of unique words and classes (tags) are saved as **pickle** files so they can be reused later without reprocessing the dataset.


In [5]:
import pickle

# Initialize lists
words = []
classes = []
documents = []
ignore_words = ['?', '!', '.', ',', ':', '(', ')']

# Tokenize and preprocess the dataset
for intent in intents['intents']:
    for pattern in intent['patterns']:
        word_list = nltk.word_tokenize(pattern)
        words.extend(word_list)
        documents.append((pattern, intent['tag']))
    if intent['tag'] not in classes:
        classes.append(intent['tag'])

# Lowercase each word and drop punctuation tokens
words = [w.lower() for w in words if w not in ignore_words]
words = sorted(list(set(words)))  # Remove duplicates

# Sort classes
classes = sorted(list(set(classes)))

# Save the words and classes as pickle files
with open('words.pkl', 'wb') as f:
    pickle.dump(words, f)

with open('classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

# Print confirmation message
print("words.pkl and classes.pkl saved!")

words.pkl and classes.pkl saved!
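
Saving `words` and `classes` matters because any separate script that serves the chatbot needs exactly the same vocabulary order and class order that the model was trained with. The sketch below shows the save-and-reload round trip inside a temporary directory. The sample lists are illustrative stand-ins for the real `words` and `classes`.

```python
import os
import pickle
import tempfile

# Illustrative stand-ins for the lists built above.
sample_words = sorted({"bye", "hello", "hi", "see", "you"})
sample_classes = sorted({"goodbye", "greeting"})

with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = os.path.join(tmp_dir, "words.pkl")
    classes_path = os.path.join(tmp_dir, "classes.pkl")

    # Save both lists, as in the cell above.
    with open(words_path, "wb") as f:
        pickle.dump(sample_words, f)
    with open(classes_path, "wb") as f:
        pickle.dump(sample_classes, f)

    # Reload them, as a separate inference script would.
    with open(words_path, "rb") as f:
        restored_words = pickle.load(f)
    with open(classes_path, "rb") as f:
        restored_classes = pickle.load(f)

# The order must survive the round trip: each list position corresponds to
# one input feature (words) or one output unit (classes).
print(restored_words == sample_words, restored_classes == sample_classes)
```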


### Creating Training Data and Feature Extraction

In this block, we prepare the training data and convert it into a format suitable for training the neural network. The key steps are as follows:

1. **Preprocessing Sentences**: We tokenize each sentence from the `documents` list, lowercase the tokens, and remove punctuation tokens. The remaining tokens are joined back into a single string and stored in the `training_sentences` list.
2. **Labeling Sentences**: Each sentence is paired with its corresponding intent label, which is mapped to a unique index from the `classes` list and stored in `training_labels`.
3. **One-Hot Encoding of Labels**: The `training_labels` are converted into one-hot encoded vectors using `tf.keras.utils.to_categorical()`. This converts each label into a binary vector representing the class.
4. **Bag of Words Model**: The `CountVectorizer` from `sklearn` is used to convert the sentences into a bag of words model, representing each sentence as a feature vector. The vocabulary for the vectorizer is derived from the unique words stored in `words`.


In [6]:
# Initialize training data
training_sentences = []
training_labels = []

# Create the training set
for doc in documents:
    pattern_words = nltk.word_tokenize(doc[0])
    pattern_words = [word.lower() for word in pattern_words if word not in ignore_words]
    
    training_sentences.append(" ".join(pattern_words))
    
    # Set the label (the intent)
    training_labels.append(classes.index(doc[1]))

# Convert training labels to one-hot encoding
training_labels = tf.keras.utils.to_categorical(training_labels, num_classes=len(classes))

# Create the bag of words model (input feature vector)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=words)
X_train = vectorizer.transform(training_sentences).toarray()
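
To make the two encodings concrete, here is a small pure-Python sketch of one-hot labels and bag-of-words counts over a fixed vocabulary. It does not call `sklearn`. Instead it imitates `CountVectorizer`'s default settings: lowercasing and the default token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The vocabulary, classes, and sentence are made-up examples, and the helper names are hypothetical.

```python
import re

# Default token pattern of sklearn's CountVectorizer: tokens of 2+ word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = DEFAULT_TOKEN_PATTERN.findall(sentence.lower())
    return [tokens.count(word) for word in vocabulary]


sample_vocabulary = ["a", "bye", "hello", "i"]   # hypothetical vocabulary
sample_classes = ["goodbye", "greeting"]          # hypothetical classes

features = bag_of_words("Hello hello I say bye", sample_vocabulary)
label = one_hot(sample_classes.index("goodbye"), len(sample_classes))

print(features)  # [0, 1, 2, 0]
print(label)     # [1.0, 0.0]
```

Two details are worth noticing. The word `i` is in the vocabulary but its count stays 0, because single-character tokens are dropped by that pattern. The count for `hello` is 2, whereas the `predict_class` function later in this notebook builds a 0/1 vector. If either difference matters for your data, one option is to pass a custom `token_pattern` or `binary=True` to `CountVectorizer`.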


### Initializing and Compiling the Neural Network Model

In this block, we define and compile a simple feedforward neural network model for intent classification:

1. **Input and Hidden Layer**: We add a dense (hidden) layer with 128 units and the ReLU activation function. Its input shape is set to the length of the feature vector (`len(X_train[0])`), which equals the number of words in the vocabulary.
2. **Dropout Layer**: A dropout layer is added with a rate of 0.5 to reduce overfitting during training.
3. **Output Layer**: The output layer consists of a dense layer with a number of units equal to the number of classes (intents). The activation function is softmax, which is suitable for multi-class classification as it outputs probabilities.
4. **Compiling the Model**: The model is compiled using the categorical cross-entropy loss function, the SGD optimizer with a learning rate of 0.01 and momentum of 0.9, and the accuracy metric to monitor the model's performance.
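
As a worked illustration of the output layer and loss described above, the following sketch computes softmax probabilities and the categorical cross-entropy for a single example in plain Python. The logits are arbitrary numbers chosen for illustration; Keras performs the equivalent computation internally on batches.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_label, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probabilities) if t > 0)


logits = [2.0, 0.5, -1.0]  # arbitrary scores for three classes
probs = softmax(logits)
loss = categorical_cross_entropy([1.0, 0.0, 0.0], probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))

# A model that spreads probability evenly over K classes has loss ln(K).
K = 4
uniform_loss = categorical_cross_entropy([1.0, 0.0, 0.0, 0.0], [1.0 / K] * K)
print(math.isclose(uniform_loss, math.log(K)))
```

The last check gives a useful reference point when reading a training log: a loss near `ln(K)` means the model is doing no better than guessing uniformly among `K` classes.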


In [7]:
# Initialize the neural network model
model = Sequential()

# Add input layer
model.add(Dense(128, input_shape=(len(X_train[0]),), activation='relu'))

# Add dropout for regularization
model.add(Dropout(0.5))

# Add output layer with softmax activation to output probabilities
model.add(Dense(len(classes), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01, momentum=0.9), metrics=['accuracy'])
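
For intuition about the optimizer settings, the sketch below applies the momentum update rule documented for Keras `SGD` (without Nesterov momentum) to a single scalar weight: `velocity = momentum * velocity - learning_rate * gradient`, then `w = w + velocity`. The toy loss `(w - 3)^2` and the helper names are assumptions chosen only so that the gradient is easy to write down.

```python
def sgd_momentum_step(w, velocity, gradient, learning_rate=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - learning_rate * gradient
    return w + velocity, velocity


def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w, velocity = 0.0, 0.0
for _ in range(200):
    w, velocity = sgd_momentum_step(w, velocity, toy_gradient(w))

# After 200 updates the weight should be very close to the minimum at 3.
print(round(w, 3))
```

The velocity term accumulates past gradients, so consecutive steps in the same direction grow larger than plain SGD steps with the same learning rate.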





### Training the Model

In this block, we train the neural network model with the training data:

1. **Training the Model**: The model is trained using the `fit()` function, where:
   - `X_train` is the feature matrix containing the bag of words for each sentence.
   - `training_labels` is the one-hot encoded vector representing the corresponding intent.
   - The model is trained for 100 epochs with a batch size of 8.
   - The `verbose=1` argument ensures that progress updates are printed during training.
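
The number of progress steps Keras reports per epoch follows from the batch size: it is the number of samples divided by the batch size, rounded up. The sketch below works through the arithmetic. The sample counts are assumed values for illustration; the training log in this notebook shows 105 steps per epoch at a batch size of 8, which is consistent with any sample count from 833 to 840.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)


print(steps_per_epoch(840, 8))  # 105
print(steps_per_epoch(833, 8))  # 105 (the last batch holds only 1 sample)
print(steps_per_epoch(832, 8))  # 104

# Total weight updates over the whole run: steps per epoch times epochs.
print(steps_per_epoch(840, 8) * 100)  # 10500
```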


In [8]:
# Train the model
history = model.fit(X_train, training_labels, epochs=100, batch_size=8, verbose=1)


Epoch 1/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - accuracy: 0.0095 - loss: 5.6150
Epoch 2/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0113 - loss: 5.5944
Epoch 3/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0267 - loss: 5.5725
Epoch 4/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0244 - loss: 5.5456
Epoch 5/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0457 - loss: 5.5184
Epoch 6/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0360 - loss: 5.4875
Epoch 7/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0333 - loss: 5.4517
Epoch 8/100
[1m105/105[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.0452 - loss: 5.4144
Epoch 9/100
[1m105/105[0m [32

### Saving the Trained Model

After training the model, we save it to a file so that it can be loaded later for making predictions. The model is saved in the HDF5 format using the `save()` function.

In [9]:
# Save the trained model
model.save('chatbot_model.h5')



### Function to Predict the Class of a Sentence

This function predicts the class of a given sentence using the trained model. It tokenizes and processes the input sentence, converts it into a bag of words, and uses the model to predict the class.


In [18]:
# Function to predict the class of a sentence
def predict_class(sentence, model, words, classes):
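    # Tokenize the input sentence, lowercase each token, and keep only known words.
    # Lowercasing must happen before the vocabulary lookup, since `words` is all lowercase.
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [w.lower() for w in sentence_words if w.lower() in words]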
    
    # Convert the sentence to a bag of words
    bag_of_words = np.zeros(len(words), dtype=int)
    for word in sentence_words:
        bag_of_words[words.index(word)] = 1

    # Predict the class
    prediction = model.predict(np.array([bag_of_words]))[0]
    predicted_class_index = np.argmax(prediction)
    return classes[predicted_class_index]

# Example: Predict a sentence
sentence = "Bye"
predicted_class = predict_class(sentence, model, words, classes)
print(f"Predicted class: {predicted_class}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 64ms/step
Predicted class: goodbye
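
`np.argmax` returns only the single most likely class. When debugging misclassifications it can help to look at the full probability vector. The sketch below, written in plain Python with a hypothetical helper `rank_intents` and made-up probabilities, pairs each class name with its probability and sorts the pairs. The first entry is the class that `predict_class` would return.

```python
def rank_intents(probabilities, class_names):
    """Return (class_name, probability) pairs, most likely first."""
    pairs = zip(class_names, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up values standing in for `classes` and one row of model.predict() output.
sample_classes = ["goodbye", "greeting", "thanks"]
sample_probabilities = [0.82, 0.11, 0.07]

ranking = rank_intents(sample_probabilities, sample_classes)

print(ranking[0][0])  # the top-ranked intent
for name, prob in ranking:
    print(f"{name}: {prob:.2f}")
```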


### Function to Get a Response Based on the Predicted Class

This function retrieves a response from the `intents.json` file based on the predicted class. It selects a random response associated with the predicted intent.


In [20]:
import random

# Function to get the response based on the predicted class
def get_response(predicted_class, intents):
    for intent in intents['intents']:
        if intent['tag'] == predicted_class:
            response = random.choice(intent['responses'])
            return response

# Get the response
response = get_response(predicted_class, intents)
print(f"Response: {response}")


Response: See you later
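
As written, `get_response` returns `None` when no intent matches the predicted tag. A variant with a default reply might look like the sketch below. The function name `get_response_or_default`, the fallback text, and the sample intents are all illustrative assumptions.

```python
import random


def get_response_or_default(predicted_class, intents,
                            default="Sorry, I didn't understand that."):
    """Pick a random response for the tag, or fall back to a default reply."""
    for intent in intents["intents"]:
        if intent["tag"] == predicted_class and intent["responses"]:
            return random.choice(intent["responses"])
    return default


# Made-up data with the same structure as intents.json.
sample_intents = {
    "intents": [
        {"tag": "goodbye", "patterns": ["Bye"], "responses": ["See you later", "Goodbye!"]},
    ]
}

known_reply = get_response_or_default("goodbye", sample_intents)
unknown_reply = get_response_or_default("unknown_tag", sample_intents)

print(known_reply)
print(unknown_reply)
```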


### Test the Chatbot with a Sentence

This code tests the chatbot end to end: it predicts the class for a new sentence ("hey whats up") and fetches a response based on the predicted class.


In [24]:
# Test with a sentence
sentence = "hey whats up"
predicted_class = predict_class(sentence, model, words, classes)
response = get_response(predicted_class, intents)
print(f"Response: {response}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 89ms/step
Response: I'm fine, thank you
