## 1. Importing Libraries

In [1]:
!pip install numpy keras tensorflow nltk

In [2]:
import nltk
import json
import pickle
import random
import numpy as np
from nltk.stem import WordNetLemmatizer
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
import os

-   `nltk` – Used for tokenization and lemmatization.
-   `json` – Loads the intent dataset from a JSON file.
-   `pickle` – Saves and loads processed data (`words.pkl`, `classes.pkl`).
-   `random` – Shuffles the training examples before they are passed to the model.
-   `numpy` – Handles data processing and numerical operations.
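-   `WordNetLemmatizer` – Converts words to their base form (e.g., "fees" → "fee"). Called without a part-of-speech tag, as it is here, it treats every word as a noun, so a verb form such as "running" is returned unchanged.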
-   `keras.models.Sequential` – Defines the neural network architecture.
-   `keras.layers.Dense, Dropout` – Adds fully connected layers and dropout for regularization.
-   `keras.optimizers.SGD` – Uses Stochastic Gradient Descent for training.
-   `os` – Handles file operations.
---

## 2. Ensuring NLTK Resources are Available

In [None]:
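# Download any NLTK resource that is not already installed locally
def check_nltk_resources():
    resources = {
        'tokenizers/punkt': 'punkt',
        'tokenizers/punkt_tab': 'punkt_tab',  # looked up by word_tokenize in recent NLTK releases
        'corpora/wordnet': 'wordnet',
    }
    for path, name in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(name)

# Call the function to ensure NLTK resources are available
check_nltk_resources()

-   This function ensures that the `punkt` tokenizer data (plus `punkt_tab`, which recent NLTK releases look for when `word_tokenize` is called) and `wordnet` (for lemmatization) are present.
-   `nltk.data.find` raises `LookupError` when a resource is missing; the function downloads the resource in that case.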
---

## 3. Load and Process Data

In [None]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Load the intents dataset; the with-block closes the file once it has been read
with open('Data/admission_data.json', encoding='utf-8') as f:
    intents = json.load(f)

-   `WordNetLemmatizer()` – Initializes the lemmatizer for text preprocessing.
-   `open('Data/admission_data.json', encoding='utf-8')` – Opens the JSON dataset; the `with` block closes the file afterwards.
-   `json.load(f)` – Parses the file contents into a Python dictionary.
---
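
The later cells index the loaded dictionary as `intents['intents']`, `intent['patterns']`, and `intent['tag']`, so the file presumably has the shape sketched below. The tags and patterns here are made-up placeholders, not the contents of `Data/admission_data.json`.

```python
import json

# Hypothetical sample; the real file's tags and patterns are not shown in this notebook.
sample_json = """
{
  "intents": [
    {"tag": "greeting", "patterns": ["Hello", "Good morning"]},
    {"tag": "fees", "patterns": ["What are the tuition fees?"]}
  ]
}
"""

sample_intents = json.loads(sample_json)  # JSON text -> Python dict

# Same access pattern the training loop relies on
tags = [intent["tag"] for intent in sample_intents["intents"]]
pattern_count = sum(len(intent["patterns"]) for intent in sample_intents["intents"])
print(tags, pattern_count)
```

Any extra keys in the real file are simply ignored by the training code, which only reads the tag and the patterns.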

## 4. Preparing Data for Training

In [None]:
words = []
classes = []
documents = []
ignore_words = ['?', '!']

-   `words` – Stores unique words from all training sentences.
-   `classes` – Stores different intent tags.
-   `documents` – Stores word patterns mapped to intent tags.
-   `ignore_words` – Lists the tokens (`?` and `!`) that are dropped when the vocabulary is built.
---

In [None]:
print(intents)

In [None]:
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # Split the pattern sentence into word tokens
        w = nltk.word_tokenize(pattern)
        # Add the tokens to the running vocabulary list
        words.extend(w)
        # Store the (tokens, tag) pair as one training document
        documents.append((w, intent['tag']))
        # Add to classes if not already present
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

In [None]:
print(classes)

-   Loops through each intent in the dataset.
-   Tokenizes each pattern sentence into words.
-   Stores each (token list, tag) pair in `documents` for training.
-   Adds new intent tags to `classes`.
---
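
To make the three lists concrete, here is a minimal sketch of the same loop on toy data. It assumes a regex-based `simple_tokenize` as a stand-in for `nltk.word_tokenize` so that it runs without NLTK data, and the intents are placeholders.

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: words and punctuation become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

# Placeholder intents, not taken from the real dataset
toy_intents = {"intents": [
    {"tag": "greeting", "patterns": ["Hello there", "Hi!"]},
    {"tag": "fees", "patterns": ["What are the fees?"]},
]}

toy_words, toy_classes, toy_documents = [], [], []
for intent in toy_intents["intents"]:
    for pattern in intent["patterns"]:
        tokens = simple_tokenize(pattern)
        toy_words.extend(tokens)                        # flat list of every token
        toy_documents.append((tokens, intent["tag"]))   # one entry per pattern
        if intent["tag"] not in toy_classes:
            toy_classes.append(intent["tag"])

print(toy_documents)
print(toy_classes)
```

Each pattern becomes one document, while `toy_words` still contains duplicates and punctuation at this stage.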

In [None]:
# Lemmatize and lowercase each word, drop ignored tokens, and remove duplicates
words = sorted(set(lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words))

# Sort classes
classes = sorted(set(classes))

-   Converts all words to lowercase and lemmatizes them.
-   Removes duplicates and sorts them.
-   Sorts `classes` to maintain consistency.
---
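
The effect of this normalization step can be checked on a handful of tokens. The sketch below uses an identity function as a placeholder for `lemmatizer.lemmatize`, so it only demonstrates the lowercasing, filtering, de-duplication, and sorting.

```python
raw_tokens = ["Hello", "hello", "Fees", "fees", "?", "!", ","]
ignore_words = ["?", "!"]

def lemmatize(token):
    # Placeholder: the notebook calls WordNetLemmatizer().lemmatize here
    return token

vocab = sorted({lemmatize(t.lower()) for t in raw_tokens if t not in ignore_words})
print(vocab)
```

Note that only `?` and `!` are filtered out; other punctuation tokens, such as the comma here, remain in the vocabulary.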

In [None]:
print(words)

In [None]:
print(documents)

In [None]:
# Print data information
print(len(documents), "documents")
print(len(classes), "classes", classes)
print(len(words), "unique lemmatized words", words)

- Displays summary information about training data.
---

In [None]:
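# Save words and classes to disk
os.makedirs('Model', exist_ok=True)  # create the output folder if it is missing
with open('Model/words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('Model/classes.pkl', 'wb') as f:
    pickle.dump(classes, f)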

- Saves processed `words` and `classes` for later use.
---
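
Inference code has to reload exactly these lists, because the position of each word in `words` determines which input feature it maps to. A minimal round-trip sketch, using a temporary directory and placeholder lists instead of the real `Model/` folder:

```python
import os
import pickle
import tempfile

words_demo = ["admission", "fee", "hello"]   # placeholder vocabulary
classes_demo = ["fees", "greeting"]          # placeholder intent tags

with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = os.path.join(tmp_dir, "words.pkl")
    classes_path = os.path.join(tmp_dir, "classes.pkl")

    # Write both lists to disk
    with open(words_path, "wb") as f:
        pickle.dump(words_demo, f)
    with open(classes_path, "wb") as f:
        pickle.dump(classes_demo, f)

    # Read them back, as an inference script would
    with open(words_path, "rb") as f:
        loaded_words = pickle.load(f)
    with open(classes_path, "rb") as f:
        loaded_classes = pickle.load(f)

print(loaded_words == words_demo, loaded_classes == classes_demo)
```

The reloaded lists keep their original order, which is what keeps feature indices consistent between training and inference.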

## 5. Creating Training Data

In [None]:
# Create training data
training = []
output_empty = [0] * len(classes)

-   `training` – Stores input-output training data.
-   `output_empty` – Initializes a list of zeros to represent class labels.
---

In [None]:
for doc in documents:
    # Initialize bag of words
    bag = []
    # Lemmatize each word
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in doc[0]]
    # Create bag of words array
    for w in words:
        # 1 if the vocabulary word appears in this pattern, else 0
        bag.append(1 if w in pattern_words else 0)
        
    # Create output array
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    
    training.append([bag, output_row])

In [None]:
print(training)

- Creates an empty bag-of-words representation.
- Lemmatizes words in the current document.
- Fills `bag` with `1` if the word appears in the document, else `0`.
- Creates a one-hot encoded output array for the intent tag.
- Appends the `bag` (input) and `output_row` (label) to `training`.
---
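
A worked example may help here. The sketch below applies the same encoding to one pattern, assuming a five-word placeholder vocabulary and two placeholder tags; `encode` is a hypothetical helper, not a function from the notebook.

```python
vocab = ["are", "fee", "hello", "the", "what"]   # sorted vocabulary (placeholder)
tags = ["fees", "greeting"]                       # sorted intent tags (placeholder)

def encode(pattern_words, tag):
    # Bag of words: one slot per vocabulary word, 1 if present in the pattern
    bag = [1 if w in pattern_words else 0 for w in vocab]
    # One-hot label: a single 1 at the index of the pattern's tag
    one_hot = [0] * len(tags)
    one_hot[tags.index(tag)] = 1
    return bag, one_hot

bag, label = encode(["what", "are", "the", "fee"], "fees")
print(bag, label)   # [1, 1, 0, 1, 1] [1, 0]
```

The encoding records only which vocabulary words are present, so word order and repetition within a pattern are lost.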

In [None]:
# Shuffle the data and convert to numpy arrays
random.shuffle(training)
training = np.array(training, dtype=object)

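-   Randomly shuffles the examples so that training batches mix intents instead of following the order of the JSON file.
-   Converts the `training` list into a NumPy array; `dtype=object` is used because each row pairs two lists that generally differ in length.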

In [None]:
# Separate features (train_x) and labels (train_y); no test split is made here
train_x = np.array(list(training[:, 0]))
train_y = np.array(list(training[:, 1]))

print("Training data created")

- Splits `training` data into `train_x` (features) and `train_y` (labels).
- Confirms successful data preprocessing.
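
The reason each bag is stored together with its label before shuffling is that the pair must move as a unit. A pure-Python sketch of the shuffle and split on placeholder data (the seed is only there to make the demo repeatable):

```python
import random

# Placeholder [bag, label] pairs
pairs = [
    [[1, 0, 0], [1, 0]],
    [[0, 1, 0], [0, 1]],
    [[0, 0, 1], [0, 1]],
]
original = [(tuple(b), tuple(l)) for b, l in pairs]

rng = random.Random(0)     # seeded only for repeatability
rng.shuffle(pairs)         # rows change order, but each bag stays with its label

xs = [bag for bag, _ in pairs]       # features, the role of train_x
ys = [label for _, label in pairs]   # labels, the role of train_y

print(len(xs), len(xs[0]), len(ys), len(ys[0]))
```

Shuffling features and labels as two separate lists would break this alignment, which is why the split happens only after the shuffle.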

## 6. Creating and Training the Model

In [None]:
# Create model
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

-   `Sequential()` – Defines a feed-forward neural network.
-   `Dense(128, input_shape=(len(train_x[0]),), activation='relu')` – Adds a fully connected layer with 128 neurons and ReLU activation.
-   `Dropout(0.5)` – Reduces overfitting by randomly zeroing half of the previous layer's outputs at each training step; it is inactive at inference time.
-   `Dense(64, activation='relu')` – Adds another hidden layer with 64 neurons.
-   `Dense(len(train_y[0]), activation='softmax')` – Outputs probability distribution over intent classes.
---
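
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add none. The vocabulary size and class count below are placeholders, since the real values depend on the dataset.

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit
    return n_inputs * n_units + n_units

vocab_size = 100   # placeholder for len(train_x[0])
n_classes = 10     # placeholder for len(train_y[0])

# Input -> Dense(128) -> Dense(64) -> Dense(n_classes)
layer_sizes = [vocab_size, 128, 64, n_classes]
total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)   # 12928 + 8256 + 650 = 21834
```

With a small vocabulary, most of the parameters sit in the first layer, so the model size scales roughly with the number of unique words.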

In [None]:
# Compile model
# Recent Keras releases no longer support the legacy `decay` argument, so it is not passed here
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

In [None]:
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

-   Uses **Stochastic Gradient Descent (SGD)** as the optimizer, with Nesterov momentum enabled (`nesterov=True`).
-   `categorical_crossentropy` is the loss function (used for multi-class classification with one-hot labels).
-   `momentum=0.9` – Adds a fraction (0.9) of the previous update to the current one, which damps oscillation and usually speeds up convergence.
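
The effect of momentum can be seen on a one-dimensional toy problem. The sketch below minimizes the assumed loss f(w) = w² with plain gradient descent and with the Nesterov update rule given in the Keras `SGD` documentation, using the same learning rate and momentum as above; the loss function and step count are illustrative choices.

```python
def grad(w):
    # Gradient of the toy loss f(w) = w**2
    return 2.0 * w

learning_rate, momentum, steps = 0.01, 0.9, 10

# Plain gradient descent
w_plain = 1.0
for _ in range(steps):
    w_plain -= learning_rate * grad(w_plain)

# Nesterov momentum: velocity accumulates past gradients
w_nesterov, velocity = 1.0, 0.0
for _ in range(steps):
    g = grad(w_nesterov)
    velocity = momentum * velocity - learning_rate * g
    w_nesterov = w_nesterov + momentum * velocity - learning_rate * g

print(round(w_plain, 3), round(w_nesterov, 3))
```

After the same number of steps the momentum variant is much closer to the minimum at 0, although on longer runs it can overshoot before settling.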


In [None]:
# Fit the model
hist = model.fit(train_x, train_y, epochs=200, batch_size=5, verbose=1)

-   Trains the model for **200 epochs** using mini-batches of size **5**.
-   `verbose=1` prints training progress.
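
With a batch size of 5, each epoch consists of several weight updates rather than one. A quick calculation, assuming a placeholder dataset of 47 patterns:

```python
import math

n_samples = 47      # placeholder for len(train_x)
batch_size = 5
epochs = 200

# The final batch of an epoch is smaller when n_samples is not a multiple of batch_size
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)   # 10 2000
```

Small batches give many updates per epoch, which suits a small intent dataset but makes each epoch noisier.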

In [None]:
# Save the trained model (architecture and weights) in HDF5 format
model.save('Model/chatbot_model.h5')
print("Model created")

- Saves the trained model as `chatbot_model.h5`.
- Confirms the successful creation of the model.
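
At inference time the saved model returns one probability per class, which then has to be mapped back to a tag through the saved `classes` list. A minimal sketch of that last step, where `pick_intent` and its `threshold` are hypothetical and not part of `train_model.py`:

```python
def pick_intent(probabilities, class_names, threshold=0.25):
    # threshold is a hypothetical cut-off, not a value taken from the notebook
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None  # caller can fall back to a default reply
    return class_names[best]

class_names = ["fees", "greeting", "hostel"]   # placeholder tags

confident = pick_intent([0.10, 0.85, 0.05], class_names)
uncertain = pick_intent([0.34, 0.33, 0.33], class_names, threshold=0.5)
print(confident, uncertain)   # greeting None
```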
---

### **Summary of Key Steps**

1.  **Load and preprocess dataset** – Tokenization, lemmatization, and intent tagging.
2.  **Create bag-of-words representations** – Convert text data into numerical format.
3.  **Train a Neural Network** – Using dense layers and dropout to classify intent.
4.  **Save model and preprocessing files** – To be used in chatbot inference.

This is the complete breakdown of `train_model.py`.