<h1>University Chatbot Guide System</h1>

A chatbot system represents a cutting-edge form of conversational Artificial Intelligence designed to efficiently respond to user queries in a conversational manner. Within the realm of chatbot systems, our project will focus on the development of an intent-based chatbot system.

<b>Problem statement: </b> <br>
Waiting for hours to seek guidance on university-related queries, especially during critical processes like registration and specification conversions, has been a frustrating experience for many students. This challenge highlights the necessity of an accessible and responsive solution.

<b>In response</b>, we are embarking on a project to develop an <b>intent-based</b> chatbot system that operates 24/7. This system will cater to the diverse concerns and issues faced by students, offering real-time assistance on a wide array of university-related matters. By creating this chatbot guide, we aim to streamline the student experience, providing instant, clear, and reliable information to ensure a smoother journey through their academic pursuits.

# <b>First of all we need to download the needed libraries </b>:

In [4]:
import random
import numpy as np
import pickle
import json
import nltk
from nltk.stem import WordNetLemmatizer

### <center> EDA </center>



---




In this project, We are using a dataset provided by the university admission and regsitration center. Data is available as a .json file that includes list of intents. Each intent includes a tag, patterns list, and responses list.

- Data collected in 2023.

- [DataSet.json file](https://drive.google.com/drive/folders/14GV8F19Uk_Kq9iMNnclWPBxjv1FOHcLa?usp=drive_link)

We are using <b> NLTK </b> library for natural language processing.

The EDA process includes the follow:

1.   Tokenization
2.   Grouping related words for each tag
3.   Stopwords and punctuations(ignored_letters) removing
4.   Lemmitaization
5.   Bag of Words (BOW)



<h3> Data loading </h3>

In [1]:
# Drive_mounting
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
intents=json.loads(open('/content/drive/MyDrive/NLP/DataSet.json').read())

Here, we define a word list for storing a list of tokenized words. A list of classes is defined for storing unique classes(tags) that act as the topics a user may ask about.
<br>
<br>

In [22]:
words=[]
classes=[]
documents=[]

1. <h2> Tokenization </h2>

<b> <i>Tokenization</i></b> is the process of breaking text into smaller units, typically words or subwords, to facilitate analysis and natural language processing tasks. It involves separating text into meaningful components for further processing and analysis. Here, we are mapping each intent tag with a list of related tokenized words.


In [29]:
nltk.download('punkt')
# Iterate over intents and their patterns
for intent in intents['intents']:
# Tokenize the pattern into a list of words
    for pattern in intent["patterns"]:
        word_list=nltk.word_tokenize(pattern)
        words.extend(word_list)
        documents.append((word_list,intent['tag']))
# If the intent tag is not already in the classes list, append it
        if intent['tag'] not in classes:
            classes.append(intent['tag'])
print("__________________________________________________________________________")
print("documents \n " ,documents)
print("__________________________________________________________________________")

__________________________________________________________________________
documents 
  [(['How', 'to', 'get', 'to', 'the', 'university', '?'], 'location'), (['How', 'do', 'I', 'get', 'to', 'university', '?'], 'location'), (['Hi', '!'], 'Greeting'), (['Hello', '!'], 'Greeting'), (['Hola', '!'], 'Greeting'), (['How', 'are', 'you'], 'Greeting'), (['the', 'official', 'university', 'holiday'], 'holiday'), (['What', 'are', 'the', 'official', 'holidays', '?'], 'holiday'), (['What', 'are', 'the', 'university', "'s", 'official', 'holidays', '?'], 'holiday'), (['Official', 'working', 'hours'], 'working hours'), (['What', 'are', 'the', 'official', 'working', 'hours', '?'], 'working hours'), (['When', 'does', 'the', 'semester', 'start', '?'], 'dates'), (['Is', 'the', 'university', 'recognized', '?'], 'university recognized'), (['Are', 'the', 'majors', 'recognized', '?'], 'majors recognized'), (['How', 'to', 'communicate', 'with', 'departments', '?'], 'departments'), (['How', 'to', 'communicate', 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




<br>
<br>

2. <h2> Stop words and special characters removing</h2>

<b>The next step</b> in NLP, is filtering out the stopping words and special characters( like @ , $ , _ , -, ? , !, ...etc). Here, we are using NLTK predefined stopwords list.

In [30]:
nltk.download('wordnet')
# importing stopwords
from nltk.corpus import stopwords
# downloading stopwords
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [31]:
# list of letters that we are going to filter out
ignore_letters=['?','!','.',',',]
# list of stopwords
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

<b> Lemmatization </b> is a linguistic process that reduces words to their base or dictionary form (lemmas), aiding in analysis and understanding by grouping variants of a word. It aims to normalize words for improved text processing and analysis.

- Here, we are applying the lemmatization on all words after filtering out  the stoping words and ignored letters.

In [26]:
# Instantiating an object of WordNetLemmatizer
lemmatizer =WordNetLemmatizer()

In [33]:
# Filter and lemmatize words, excluding those in ignore_letters
words = [lemmatizer.lemmatize(word.lower()) for word in words if word not in ignore_letters or word not in stop_words]
words = sorted(set(words))
words

["'s",
 'a',
 'accept',
 'admission',
 'and',
 'application',
 'are',
 'at',
 'bachelor',
 'branch',
 'by',
 'communicate',
 'date',
 'degree',
 'department',
 'do',
 'document',
 'doe',
 'electronic',
 'end',
 'enrollment',
 'fill',
 'for',
 'get',
 'hello',
 'hi',
 'hola',
 'holiday',
 'hour',
 'how',
 'i',
 'is',
 'it',
 'major',
 'new',
 'of',
 'official',
 'out',
 'phone',
 'possible',
 'ramallah',
 'recognized',
 'register',
 'required',
 's',
 'semester',
 'start',
 'student',
 'submit',
 'the',
 'to',
 'university',
 'what',
 'when',
 'with',
 'working',
 'you',
 '’']

In [34]:
# sorting unique classes alphabitcally
classes =sorted (set(classes))
print(classes)
# saving the 'words' list to a binary file
pickle.dump(words, open('words.pkl','wb'))
# saving the 'classes' list to a binary file
pickle.dump(classes, open('classes.pkl','wb'))

['Greeting', 'admission dates', 'dates', 'departments', 'documents', 'electronic application', 'holiday', 'location', 'majors recognized', 'register', 'university recognized', 'working hours']


3. <h2> Creating A bag of words for each document </h3>

A bag-of-words (<b>BOW</b>) representation is created for each document, where word presence is encoded as 1 and absence as 0. The output rows are prepared based on classes, forming a training dataset for machine learning models like classifiers. Since in the next steps, we are going to use mutli-classifers for intents clasification.

In [36]:
training=[]
# Creating an output array with zeros for each class
output_empty=[0]*len(classes)

# iterating over documents
for document in documents:
    bag=[]
    word_patterns= document[0]
    # applying lemmatization on words_patterns
    word_patterns=[lemmatizer.lemmatize(word.lower() )for word in word_patterns]
# iterating over the words in the word_patterns
    for word in words:
# Check if each word in the word list is in the current document, if yes encode it as 1, else as 0
        bag.append(1) if word in word_patterns else bag.append(0)
        output_row=list(output_empty)
        output_row[classes.index(document[1])]=1
# Append the bag of words and output row to the training data
        training.append([bag,output_row])
training

[[[0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0],
  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]],
 [[0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   1,
   1,
   0,
   0,
   0,
   0,
   0,
   0],
  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]],
 [[0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0

In the realm of data preprocessing, we have accomplished a significant milestone by meticulously cleansing and structuring our dataset. This process involved the removal of stopwords and special characters, as well as lemmatization, resulting in refined and coherent text data. We have successfully organized our dataset into a collection of documents, each comprising a pertinent bag of words list(includes encoded data).

Given that we provided our own data, we encountered no instances of missing values or duplicated entries.

<b>Now, the data is considered as clean preprocessed data and it's ready for training process.</b>