## 1. Importing Libraries

In [1]:
!pip install numpy keras tensorflow nltk

In [2]:
import nltk
import json
import pickle
import random
import numpy as np
from nltk.stem import WordNetLemmatizer
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
# import os

-   `nltk` – Used for tokenization and lemmatization.
-   `json` – Loads the intent dataset from a JSON file.
-   `pickle` – Saves and loads processed data (`words.pkl`, `classes.pkl`).
-   `random` – Shuffles training data to improve generalization.
-   `numpy` – Handles data processing and numerical operations.
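-   `WordNetLemmatizer` – Reduces words to their dictionary form. Called without a part-of-speech tag it treats every word as a noun, so "activities" → "activity" while "helping" stays unchanged.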
-   `keras.models.Sequential` – Defines the neural network architecture.
-   `keras.layers.Dense, Dropout` – Adds fully connected layers and dropout for regularization.
-   `keras.optimizers.SGD` – Uses Stochastic Gradient Descent for training.
-   `os` – Would handle file operations; its import is commented out because nothing in this walkthrough uses it.
---

## 2. Ensuring NLTK Resources are Available

In [4]:
# Function to check and download NLTK resources if not already available
def check_nltk_resources():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
    
    try:
        nltk.data.find('corpora/wordnet')
    except LookupError:
        nltk.download('wordnet')

# Call the function to ensure NLTK resources are available
check_nltk_resources()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


-   This function ensures that `punkt` (for tokenization) and `wordnet` (for lemmatization) are downloaded.
-   If they are missing, it downloads them.
---

## 3. Load and Process Data

In [23]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Load the intents dataset
with open('Data/admission_data.json') as f:
    intents = json.load(f)

-   `WordNetLemmatizer()` – Initializes the lemmatizer for text preprocessing.
-   `open('Data/admission_data.json')` – Opens the JSON dataset; the `with` block closes the file once it has been read.
-   `json.load(f)` – Parses the file contents into a Python dictionary.
---
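
Judging by the printed output in the next section, the dataset is a top-level `intents` list whose entries carry `tag`, `patterns`, `responses`, and `context`. A minimal sketch of parsing that shape, assuming an inline string in place of `Data/admission_data.json` and reproducing only part of the first intent:

```python
import json

# Inline stand-in for Data/admission_data.json (one intent only).
raw = """
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hello", "Hey", "Good day"],
      "responses": ["Hello! How can I assist you today?"],
      "context": ""
    }
  ]
}
"""

intents = json.loads(raw)

# Each intent maps one tag to the example sentences that should trigger it.
for intent in intents["intents"]:
    print(intent["tag"], "->", intent["patterns"])
```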

## 4. Preparing Data for Training

In [24]:
words = []
classes = []
documents = []
ignore_words = ['?', '!']

-   `words` – Stores unique words from all training sentences.
-   `classes` – Stores different intent tags.
-   `documents` – Stores word patterns mapped to intent tags.
-   `ignore_words` – Excludes punctuation marks.
---

In [25]:
print(intents)

{'intents': [{'tag': 'greeting', 'patterns': ['Hi', 'Hello', 'Hey', 'Good day'], 'responses': ['Hello! How can I assist you today?', 'Hi there! How can I help you?', 'Hey! What can I do for you?'], 'context': ''}, {'tag': 'goodbye', 'patterns': ['Bye', 'See you later', 'Goodbye', 'Till next time'], 'responses': ['Goodbye! Have a great day.', 'See you later! Take care.', 'Until next time!'], 'context': ''}, {'tag': 'thanks', 'patterns': ['Thanks', 'Thank you', "That's helpful", 'Thanks for helping me'], 'responses': ["You're welcome!", 'Glad I could assist!', 'Anytime!'], 'context': ''}, {'tag': 'admission_deadline', 'patterns': ['What is the application deadline?', 'When do I need to apply?', 'Deadline for application?'], 'responses': ["The application deadline is typically in September. Make sure to check the college's official website for the exact date."], 'context': ''}, {'tag': 'admission_requirements', 'patterns': ['What are the admission requirements?', 'What do I need to apply?

In [26]:
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # Split the pattern sentence into tokens
        w = nltk.word_tokenize(pattern)
        # Add the tokens to the running word list
        words.extend(w)
        # Store the (tokens, tag) pair
        documents.append((w, intent['tag']))
        # Add to classes if not already present
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

In [27]:
print(words)

['Hi', 'Hello', 'Hey', 'Good', 'day', 'Bye', 'See', 'you', 'later', 'Goodbye', 'Till', 'next', 'time', 'Thanks', 'Thank', 'you', 'That', "'s", 'helpful', 'Thanks', 'for', 'helping', 'me', 'What', 'is', 'the', 'application', 'deadline', '?', 'When', 'do', 'I', 'need', 'to', 'apply', '?', 'Deadline', 'for', 'application', '?', 'What', 'are', 'the', 'admission', 'requirements', '?', 'What', 'do', 'I', 'need', 'to', 'apply', '?', 'What', 'documents', 'are', 'required', 'for', 'admission', '?', 'What', 'programs', 'does', 'the', 'college', 'offer', '?', 'What', 'majors', 'are', 'available', '?', 'List', 'of', 'courses', 'offered', '?', 'Are', 'scholarships', 'available', '?', 'Can', 'I', 'get', 'financial', 'aid', '?', 'How', 'can', 'I', 'afford', 'college', '?', 'What', 'is', 'campus', 'life', 'like', '?', 'Tell', 'me', 'about', 'student', 'activities', 'Are', 'there', 'clubs', 'on', 'campus', '?', 'Is', 'housing', 'available', 'for', 'freshmen', '?', 'What', 'are', 'the', 'housing', 'opti

In [29]:
# Print data information
print(len(documents), "documents")
print(len(classes), "classes", classes)
print(len(words), "words", words)

91 documents
29 classes ['greeting', 'goodbye', 'thanks', 'admission_deadline', 'admission_requirements', 'programs_offered', 'financial_aid', 'campus_life', 'housing', 'support_services', 'internships', 'faculty', 'study_abroad', 'diversity', 'admission_interview', 'application_tips', 'admission_tests', 'financial_planning', 'admission_status', 'application_fee', 'academic_support', 'career_services', 'student_activities', 'health_services', 'scholarship_deadline', 'admission_website', 'application_process', 'financial_aid_application', 'online_resources']
531 words ['Hi', 'Hello', 'Hey', 'Good', 'day', 'Bye', 'See', 'you', 'later', 'Goodbye', 'Till', 'next', 'time', 'Thanks', 'Thank', 'you', 'That', "'s", 'helpful', 'Thanks', 'for', 'helping', 'me', 'What', 'is', 'the', 'application', 'deadline', '?', 'When', 'do', 'I', 'need', 'to', 'apply', '?', 'Deadline', 'for', 'application', '?', 'What', 'are', 'the', 'admission', 'requirements', '?', 'What', 'do', 'I', 'need', 'to', 'apply', '

-   Loops through each intent in the dataset.
-   Tokenizes each pattern sentence into words.
-   Stores each (token list, tag) pair in `documents` for training.
-   Adds new intent tags to `classes`.
---
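
The loop can be traced on a tiny dataset. This is only a sketch: the two intents are cut down from the real file, and `str.split` is an assumed stand-in for `nltk.word_tokenize` so the example runs without NLTK data.

```python
# Tiny stand-in dataset with the same keys as the real intents file.
intents = {
    "intents": [
        {"tag": "greeting", "patterns": ["Hi", "Good day"]},
        {"tag": "goodbye", "patterns": ["See you later"]},
    ]
}

words, classes, documents = [], [], []

for intent in intents["intents"]:
    for pattern in intent["patterns"]:
        # Assumption: whitespace split instead of nltk.word_tokenize.
        tokens = pattern.split()
        words.extend(tokens)
        documents.append((tokens, intent["tag"]))
        if intent["tag"] not in classes:
            classes.append(intent["tag"])

print(words)      # ['Hi', 'Good', 'day', 'See', 'you', 'later']
print(classes)    # ['greeting', 'goodbye']
print(documents)  # three (tokens, tag) pairs, one per pattern
```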

In [30]:
# Lemmatize and lower each word and remove duplicates
words = sorted(list(set([lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words])))

# Sort classes
classes = sorted(list(set(classes)))

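-   Drops the tokens listed in `ignore_words`, then lowercases and lemmatizes the rest.
-   Removes duplicates and sorts the result. The noun-only lemmatization has side effects: in the output below, "does" becomes "doe".
-   Sorts `classes` so that each tag keeps the same index on every run.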
---
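
A short sketch of the same clean-up on a handful of tokens. The `normalize` helper is hypothetical and only lowercases; the real cell also calls `lemmatizer.lemmatize`.

```python
ignore_words = ["?", "!"]
tokens = ["What", "is", "the", "deadline", "?", "Deadline", "for", "application", "?"]


def normalize(token):
    # Hypothetical stand-in for lemmatizer.lemmatize(token.lower()).
    return token.lower()


# A set removes duplicates; sorted() fixes the order.
vocabulary = sorted({normalize(t) for t in tokens if t not in ignore_words})
print(vocabulary)
# ['application', 'deadline', 'for', 'is', 'the', 'what']

# A word's position in the sorted list becomes its feature index later on.
index_of = {word: i for i, word in enumerate(vocabulary)}
print(index_of["deadline"])  # 1
```

Sorting matters because the bag-of-words vectors built later rely on every word keeping the same position.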

In [31]:
print(words)

["'s", 'about', 'abroad', 'academic', 'access', 'act', 'activity', 'admission', 'advising', 'afford', 'aid', 'an', 'and', 'any', 'applicant', 'application', 'apply', 'are', 'attend', 'available', 'back', 'body', 'bye', 'campus', 'can', 'care', 'career', 'check', 'club', 'college', 'cost', 'counseling', 'course', 'day', 'deadline', 'decision', 'diversity', 'do', 'document', 'doe', 'dormitory', 'during', 'exam', 'exchange', 'extracurricular', 'faculty', 'fee', 'financial', 'find', 'for', 'form', 'freshman', 'from', 'get', 'good', 'goodbye', 'have', 'health', 'hear', 'hello', 'help', 'helpful', 'helping', 'hey', 'hi', 'hour', 'housing', 'how', 'i', 'improve', 'in', 'information', 'international', 'internship', 'interview', 'is', 'it', 'job', 'later', 'life', 'like', 'list', 'major', 'me', 'medical', 'more', 'much', 'my', 'need', 'next', 'of', 'offer', 'offered', 'office', 'official', 'on', 'online', 'opportunity', 'option', 'or', 'organization', 'participate', 'placement', 'planning', 'pr

In [32]:
print(classes)

['academic_support', 'admission_deadline', 'admission_interview', 'admission_requirements', 'admission_status', 'admission_tests', 'admission_website', 'application_fee', 'application_process', 'application_tips', 'campus_life', 'career_services', 'diversity', 'faculty', 'financial_aid', 'financial_aid_application', 'financial_planning', 'goodbye', 'greeting', 'health_services', 'housing', 'internships', 'online_resources', 'programs_offered', 'scholarship_deadline', 'student_activities', 'study_abroad', 'support_services', 'thanks']


In [33]:
print(documents)

[(['Hi'], 'greeting'), (['Hello'], 'greeting'), (['Hey'], 'greeting'), (['Good', 'day'], 'greeting'), (['Bye'], 'goodbye'), (['See', 'you', 'later'], 'goodbye'), (['Goodbye'], 'goodbye'), (['Till', 'next', 'time'], 'goodbye'), (['Thanks'], 'thanks'), (['Thank', 'you'], 'thanks'), (['That', "'s", 'helpful'], 'thanks'), (['Thanks', 'for', 'helping', 'me'], 'thanks'), (['What', 'is', 'the', 'application', 'deadline', '?'], 'admission_deadline'), (['When', 'do', 'I', 'need', 'to', 'apply', '?'], 'admission_deadline'), (['Deadline', 'for', 'application', '?'], 'admission_deadline'), (['What', 'are', 'the', 'admission', 'requirements', '?'], 'admission_requirements'), (['What', 'do', 'I', 'need', 'to', 'apply', '?'], 'admission_requirements'), (['What', 'documents', 'are', 'required', 'for', 'admission', '?'], 'admission_requirements'), (['What', 'programs', 'does', 'the', 'college', 'offer', '?'], 'programs_offered'), (['What', 'majors', 'are', 'available', '?'], 'programs_offered'), (['Lis

In [36]:
# Print data information
print(len(documents), "documents")
print(len(classes), "classes", classes)
print(len(words), "unique lemmatized words", words)

91 documents
29 classes ['academic_support', 'admission_deadline', 'admission_interview', 'admission_requirements', 'admission_status', 'admission_tests', 'admission_website', 'application_fee', 'application_process', 'application_tips', 'campus_life', 'career_services', 'diversity', 'faculty', 'financial_aid', 'financial_aid_application', 'financial_planning', 'goodbye', 'greeting', 'health_services', 'housing', 'internships', 'online_resources', 'programs_offered', 'scholarship_deadline', 'student_activities', 'study_abroad', 'support_services', 'thanks']
145 unique lemmatized words ["'s", 'about', 'abroad', 'academic', 'access', 'act', 'activity', 'admission', 'advising', 'afford', 'aid', 'an', 'and', 'any', 'applicant', 'application', 'apply', 'are', 'attend', 'available', 'back', 'body', 'bye', 'campus', 'can', 'care', 'career', 'check', 'club', 'college', 'cost', 'counseling', 'course', 'day', 'deadline', 'decision', 'diversity', 'do', 'document', 'doe', 'dormitory', 'during', 'e

- Displays summary information about training data.
---

In [38]:
# Save words and classes to disk
with open('Model/words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('Model/classes.pkl', 'wb') as f:
    pickle.dump(classes, f)

- Saves processed `words` and `classes` for later use.
---
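
The files are saved so that inference code can rebuild the same vocabulary and class order. A minimal round-trip sketch, assuming a temporary directory in place of `Model/` and short placeholder lists:

```python
import os
import pickle
import tempfile

words = ["admission", "apply", "deadline"]
classes = ["admission_deadline", "greeting"]

# A temporary directory stands in for the Model/ folder.
with tempfile.TemporaryDirectory() as model_dir:
    words_path = os.path.join(model_dir, "words.pkl")
    classes_path = os.path.join(model_dir, "classes.pkl")

    with open(words_path, "wb") as f:
        pickle.dump(words, f)
    with open(classes_path, "wb") as f:
        pickle.dump(classes, f)

    # Later, the same lists are read back in the same order.
    with open(words_path, "rb") as f:
        loaded_words = pickle.load(f)
    with open(classes_path, "rb") as f:
        loaded_classes = pickle.load(f)

print(loaded_words == words, loaded_classes == classes)  # True True
```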

## 5. Creating Training Data

In [39]:
# Create training data
training = []
output_empty = [0] * len(classes)

In [43]:
print(len(output_empty))

29


-   `training` – Stores input-output training data.
-   `output_empty` – Initializes a list of zeros to represent class labels.
---
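
`output_empty` acts as a template that is copied for every document. A small sketch with three classes; the `one_hot` helper is hypothetical and only wraps the two lines used in the training loop:

```python
classes = ["goodbye", "greeting", "thanks"]
output_empty = [0] * len(classes)


def one_hot(tag):
    row = list(output_empty)  # copy, so the shared template is never mutated
    row[classes.index(tag)] = 1
    return row


print(one_hot("greeting"))  # [0, 1, 0]
print(one_hot("thanks"))    # [0, 0, 1]
print(output_empty)         # [0, 0, 0]
```

Without the `list(...)` copy, every row would point at the same list and all labels would end up identical.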

In [44]:
for doc in documents:
    # Initialize bag of words
    bag = []
    # Lemmatize each word
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in doc[0]]
    # Create bag of words array: 1 if the vocabulary word occurs in the pattern, else 0
    for w in words:
        bag.append(1 if w in pattern_words else 0)
        
    # Create output array
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    
    training.append([bag, output_row])

In [51]:
print(len(words))

145


In [52]:
print(len(training[1][0]))

145


- Creates an empty bag-of-words representation.
- Lemmatizes words in the current document.
- Fills `bag` with `1` if the word appears in the document, else `0`.
- Creates a one-hot encoded output array for the intent tag.
- Appends the `bag` (input) and `output_row` (label) to `training`.
---
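
The encoding step can be written as a standalone function. A minimal sketch, assuming a six-word vocabulary and lowercasing only (no lemmatizer); `bag_of_words` is a hypothetical name:

```python
words = ["application", "deadline", "for", "is", "the", "what"]


def bag_of_words(tokens, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in tokens."""
    present = {t.lower() for t in tokens}  # assumption: lowercase only
    return [1 if w in present else 0 for w in vocabulary]


bag = bag_of_words(["Deadline", "for", "application", "?"], words)
print(bag)  # [1, 1, 1, 0, 0, 0]
```

The vector has one slot per vocabulary word, so word order and repetition in the sentence are discarded.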

In [54]:
# Shuffle the data and convert to numpy arrays
random.shuffle(training)
training = np.array(training, dtype=object)

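-   Shuffles the examples so they are no longer grouped by intent when training starts.
-   Converts `training` into a NumPy array; `dtype=object` is needed because each row pairs a 145-element bag with a 29-element label.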

In [56]:
# Split into feature and label arrays
train_x = np.array(list(training[:, 0]))
train_y = np.array(list(training[:, 1]))

print("Training data created")

Training data created


- Splits `training` into `train_x` (bag-of-words features) and `train_y` (one-hot labels); no separate test set is held out.
- Prints a message once both arrays are built.
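
The shuffle and split can be checked on toy data. This sketch assumes three hand-written rows and adds a fixed seed, which the original cell does not set:

```python
import random

import numpy as np

# Three toy rows: a 4-element bag paired with a 2-element one-hot label.
training = [
    [[1, 0, 0, 1], [1, 0]],
    [[0, 1, 0, 0], [0, 1]],
    [[0, 0, 1, 1], [1, 0]],
]

random.seed(0)            # assumption: seeded here only for repeatability
random.shuffle(training)  # rows move as units, so bag and label stay paired

# Bags and labels differ in length, so the array holds Python lists as objects.
training = np.array(training, dtype=object)

train_x = np.array(list(training[:, 0]))
train_y = np.array(list(training[:, 1]))
print(train_x.shape, train_y.shape)  # (3, 4) (3, 2)
```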

## 6. Creating and Training the Model

In [57]:
# Create model
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))


-   `Sequential()` – Defines a feed-forward neural network.
-   `Dense(128, input_shape=(len(train_x[0]),), activation='relu')` – Adds a fully connected layer with 128 neurons and ReLU activation.
-   `Dropout(0.5)` – Randomly zeroes half of the preceding layer's outputs during training to reduce overfitting.
-   `Dense(64, activation='relu')` – Adds another hidden layer with 64 neurons.
-   `Dense(len(train_y[0]), activation='softmax')` – Outputs probability distribution over intent classes.
---
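
The size of this network can be worked out by hand from the shapes printed earlier (145 input features, 29 classes). A short sketch of the arithmetic; `dense_params` is a hypothetical helper:

```python
# Layer widths: 145 inputs, hidden layers of 128 and 64 units, 29 output classes.
layer_sizes = [145, 128, 64, 29]


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [18688, 8256, 1885]
print(sum(per_layer))  # 28829
```

Dropout layers add no trainable parameters, so the total should match what `model.summary()` reports for the three `Dense` layers.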

In [58]:
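# Define the SGD optimizer
# (`decay` is left out: recent Keras optimizers no longer support it; a learning-rate schedule is the replacement)
sgd = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)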



In [60]:
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

-   Uses **Stochastic Gradient Descent (SGD)** as the optimizer.
-   `categorical_crossentropy` is the loss function, the usual choice for multi-class classification with one-hot labels.
-   `momentum=0.9` – Adds a fraction of the previous update to the current one, which damps oscillation and speeds up convergence.
-   `nesterov=True` – Applies Nesterov momentum, a look-ahead variant of the momentum update.
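
The effect of momentum can be seen on a one-dimensional toy problem. The sketch below follows the update rule given in the Keras `SGD` documentation (worth checking against the installed version); `sgd_step`, `minimize`, and the objective f(w) = w² are illustrative assumptions, not part of the training script.

```python
def sgd_step(w, velocity, grad, lr=0.01, momentum=0.9, nesterov=True):
    """One parameter update with (optionally Nesterov) momentum."""
    velocity = momentum * velocity - lr * grad
    if nesterov:
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity


def minimize(momentum, nesterov, steps=200):
    # Toy objective f(w) = w**2 with gradient 2*w, starting at w = 1.0.
    w, v = 1.0, 0.0
    for _ in range(steps):
        w, v = sgd_step(w, v, grad=2 * w, momentum=momentum, nesterov=nesterov)
    return w


plain = minimize(momentum=0.0, nesterov=False)
with_momentum = minimize(momentum=0.9, nesterov=True)
print(abs(with_momentum) < abs(plain))  # True
```

With the same learning rate and step count, the momentum run ends closer to the minimum at 0 than plain SGD does.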


In [61]:
# Fit the model
hist = model.fit(train_x, train_y, epochs=200, batch_size=5, verbose=1)

Epoch 1/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.0524 - loss: 3.3831
Epoch 2/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.1300 - loss: 3.2113
Epoch 3/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.1127 - loss: 3.1196
Epoch 4/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.1464 - loss: 2.8798
Epoch 5/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2456 - loss: 2.7038
Epoch 6/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2959 - loss: 2.3857
Epoch 7/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2757 - loss: 2.5430
Epoch 8/200
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3890 - loss: 2.2190
Epoch 9/200
19/19 ━━━━

-   Trains the model for **200 epochs** using mini-batches of size **5**.
-   `verbose=1` prints training progress.
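
Two numbers in the log can be checked with simple arithmetic: the `19/19` step counter and the starting loss of about 3.38. A small sketch using the sample and class counts printed earlier:

```python
import math

num_samples, batch_size, num_classes = 91, 5, 29

# Batches per epoch: 18 full batches plus one batch with the leftover sample.
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)  # 19

# Cross-entropy of a uniform guess over 29 classes, i.e. -ln(1/29).
uniform_loss = -math.log(1 / num_classes)
print(round(uniform_loss, 3))  # 3.367
```

A first-epoch loss near 3.37 is therefore what an untrained classifier guessing uniformly across 29 intents would produce.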

In [62]:
# save() only needs the file path; the History object returned by fit() is not an argument
model.save('Model/chatbot_model.h5')
print("Model created")



Model created


- Saves the trained model as `chatbot_model.h5`.
- Confirms the successful creation of the model.
---

### **Summary of Key Steps**

1.  **Load and preprocess dataset** – Tokenization, lemmatization, and intent tagging.
2.  **Create bag-of-words representations** – Convert text data into numerical format.
3.  **Train a Neural Network** – Using dense layers and dropout to classify intent.
4.  **Save model and preprocessing files** – To be used in chatbot inference.

This completes the breakdown of `train_model.py`.