## <u>Part One - PROJECT BASED </u>

• <b>DOMAIN:</b>  Digital content management 

• <b>CONTEXT:</b> Classification is one of the most common tasks you will deal with in practice. Text in the form of blogs, posts, articles,
etc. is written every second. It is challenging to infer information about a writer from the text alone. We are going to
create a classifier that predicts multiple features of the author of a given text. The task is framed as a multi-label classification problem.

• <b>DATA DESCRIPTION:</b> Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected
posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a
blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many,
industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:<br>
    • 8240 "10s" blogs (ages 13-17),<br>
    • 8086 "20s" blogs (ages 23-27) and<br>
    • 2994 "30s" blogs (ages 33-47)<br>
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions.
Individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label
urlLink. Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus<br>

• <b>PROJECT OBJECTIVE:</b> Build an NLP classifier that uses the input text to determine the label(s) of the blog.

<b>Steps and tasks:  </b>

1. Import and analyse the data set.
2. Perform data pre-processing on the data:<br>
    • Data cleansing by removing unwanted characters, spaces, stop words, etc. Convert text to lowercase.<br>
    • Target/label merger and transformation<br>
    • Train and test split<br>
    • Vectorisation, etc<br>
3. Design, train, tune and test the best text classifier.<br>
   
4. Display and explain the classification report in detail.<br>
5. Print the true vs predicted labels for any 5 entries from the dataset.

### <u>Solution</u>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#import libraries
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__

'2.8.0'

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import re
import nltk
import os
import warnings

warnings.filterwarnings('ignore')
from sklearn.preprocessing import MultiLabelBinarizer
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Importing the dataset

In [4]:
blog_data = pd.read_csv("/content/drive/MyDrive/AIML/Labs/CV/blogtext.csv")

In [5]:
blog_data.head(10) #Sample of the data

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [6]:
blog_data.shape

(681284, 7)

In [7]:
blog_data.size

4768988

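The full dataset has 681,284 rows and 7 columns. Processing all of it is expensive, so we work with only the first 10,000 rows. Note that `head(10000)` takes the first rows in file order rather than a random sample, so the label distribution of this subset may differ from that of the full corpus.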

In [8]:
data1 = blog_data.head(10000).copy()

In [9]:
data1.shape

(10000, 7)

In [10]:
data1.isna().any()

id        False
gender    False
age       False
topic     False
sign      False
date      False
text      False
dtype: bool

No null values at present

In [11]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  int64 
 1   gender  10000 non-null  object
 2   age     10000 non-null  int64 
 3   topic   10000 non-null  object
 4   sign    10000 non-null  object
 5   date    10000 non-null  object
 6   text    10000 non-null  object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


In [12]:
#We can drop the "id" and "date" columns since they carry no information for our problem
data1.drop(['id','date'],axis=1,inplace=True)
data1.head(10)

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,..."
1,male,15,Student,Leo,These are the team members: Drewe...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...
5,male,33,InvestmentBanking,Aquarius,I had an interesting conversation...
6,male,33,InvestmentBanking,Aquarius,Somehow Coca-Cola has a way of su...
7,male,33,InvestmentBanking,Aquarius,"If anything, Korea is a country o..."
8,male,33,InvestmentBanking,Aquarius,Take a read of this news article ...
9,male,33,InvestmentBanking,Aquarius,I surf the English news sites a l...


In [13]:
data1['age']=data1['age'].astype('object') #Converting int to object, taking age as a object category instead of int

In [14]:
data1.info() #Converted all columns to object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   gender  10000 non-null  object
 1   age     10000 non-null  object
 2   topic   10000 non-null  object
 3   sign    10000 non-null  object
 4   text    10000 non-null  object
dtypes: object(5)
memory usage: 390.8+ KB


In [15]:
data1['text'][22]  #Sample text

"             As readers will know, my favorite airline is Singapore Air (SAI).  They have those personal monitors for everyone with on-demand movies, TV shows and games..and some lovely-looking stewardesses, of course.  Another thing going for them is their empathy with their passengers.  I've twice been bumped up to business class (with it's better meals and basically flat sleeper beds...what an experience that is).  Once because the travel agent made a mistake and SAI decided they'd make my life a bit easier (hint: if you take the same flight 5-10 times they get to know you, too) and another time my flight was overbooked so I too business to San Francisco and got another one to Vancouver (both covered by SAI) as well as 500 Sing$ (300 USD, 350,000 won) which made for a 40% discount from my ticket price.  Anyways, this time I was in line for about 30 minutes (maybe it was longer) and a few times I squatted to relieve my legs a bit.  The gal at the counter apologized and put 2 'Solita

### Preprocessing on the data

In [16]:
#Remove unwanted characters from the "text" column with a regular expression

data1['new_text']=data1['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x))  #Replace every run of non-letter characters with a space and store the result in a new column

In [17]:
data1['new_text']=data1['new_text'].apply(lambda x: x.lower()) #Converting to lowercase

In [18]:
data1['new_text']=data1['new_text'].apply(lambda x: x.strip()) #removing spaces

In [19]:
#Now let's compare the formatted text with the original text
print("Actual text:::: {}".format(data1['text'][1]))

Actual text::::            These are the team members:   Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering (me)          urlLink mail          


In [20]:
print("Formated text:::: {}".format(data1['new_text'][1]))

Formated text:::: these are the team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail


In [21]:
#Now let's remove stopwords from the text
#Use a separate name so the imported nltk `stopwords` module is not overwritten

stop_words=set(stopwords.words('english'))

In [22]:
data1['new_text']=data1['new_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [23]:
data1['new_text'][1] 

'team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering urllink mail'

All the unwanted characters, stopwords and spaces are removed successfully
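
The cleaning steps above can be folded into a single helper. The following is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOP_WORDS` are illustrative names, and the tiny hand-written stop-word set is an assumption that stands in for the NLTK English list so the example runs without any downloads.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses the NLTK English stop-word list instead.
DEMO_STOP_WORDS = {"these", "are", "the", "me", "a", "of"}

# Compile once so the pattern is not re-parsed for every post.
NON_LETTERS = re.compile(r"[^A-Za-z]+")

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Replace non-letters with spaces, lowercase, trim, and drop stop words."""
    letters_only = NON_LETTERS.sub(" ", text)
    tokens = letters_only.lower().strip().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

sample = "   These are the team members:   Drewes van der Laag   urlLink mail "
cleaned = clean_text(sample)
print(cleaned)
```

With a helper like this, the whole column could be processed in one pass, for example with `data1['text'].apply(clean_text)`, instead of four separate `apply` calls.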

Target/label merger and transformation

In [24]:
#We need to merge all the other columns to single column(target/label column)
data1['labels']=data1.apply(lambda col: [col['gender'],str(col['age']),col['topic'],col['sign']], axis=1)

In [25]:
data1.head()

Unnamed: 0,gender,age,topic,sign,text,new_text,labels
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [26]:
data1= data1[['new_text','labels']]
data1.head()

Unnamed: 0,new_text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


### Splitting data into X and Y

In [27]:
X = data1['new_text']
Y = data1['labels']

In [28]:

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=2,test_size=0.2)

In [29]:
print(X_train.shape)
print(Y_train.shape)

(8000,)
(8000,)


In [30]:
print(X_test.shape)
print(Y_test.shape)

(2000,)
(2000,)


### Vectorize the features

In [31]:
#Let's vectorize X: each post becomes a sparse vector over the vocabulary learned from the training set


In [32]:
vectorizer=CountVectorizer(binary=True, ngram_range=(1,2))  #unigrams and bigrams; binary=True records presence (0/1) rather than counts

In [33]:
vectorizer.fit(X_train)
len(vectorizer.vocabulary_) #Vocabulary size

533783

In [34]:
vectorizer.get_feature_names()[:5]

['aa', 'aa amazing', 'aa anger', 'aa keeps', 'aa nice']
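
To make the feature names above concrete, here is a rough pure-Python sketch of what a binary unigram-plus-bigram vectorizer computes. It is an illustration under simplifying assumptions, not scikit-learn's implementation: `ngrams` and `binary_bag` are hypothetical helpers, tokens are obtained by splitting on whitespace, and the two documents are made up.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (as space-joined strings) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def binary_bag(doc, vocabulary):
    """1 if the vocabulary term occurs in the document at least once, else 0."""
    present = set(ngrams(doc.split()))
    return [1 if term in present else 0 for term in vocabulary]

docs = ["team members mail", "mail team mail"]
# The vocabulary is learned from the training documents and kept in sorted order.
vocabulary = sorted({g for d in docs for g in ngrams(d.split())})
print(vocabulary)
print([binary_bag(d, vocabulary) for d in docs])
```

Because bigrams are added on top of single words, the vocabulary grows much faster than the number of distinct words, which is consistent with the large vocabulary size reported above.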

In [35]:
X_train_ct = vectorizer.transform(X_train)


In [36]:
type(X_train_ct)
X_train_ct[0]



<1x533783 sparse matrix of type '<class 'numpy.int64'>'
	with 316 stored elements in Compressed Sparse Row format>

In [37]:
X_test_ct = vectorizer.transform(X_test)

In [38]:
vectorizer.get_feature_names()[:5]


['aa', 'aa amazing', 'aa anger', 'aa keeps', 'aa nice']

In [39]:
#Creating a dictionary to get the count of every label
label_counts=dict()

for labels in data1.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label]+=1
        else:
            label_counts[label]=1

In [40]:
label_counts

{'13': 42,
 '14': 212,
 '15': 602,
 '16': 440,
 '17': 1185,
 '23': 253,
 '24': 655,
 '25': 386,
 '26': 234,
 '27': 1054,
 '33': 136,
 '34': 553,
 '35': 2315,
 '36': 1708,
 '37': 33,
 '38': 46,
 '39': 79,
 '40': 1,
 '41': 20,
 '42': 14,
 '43': 6,
 '44': 3,
 '45': 16,
 '46': 7,
 'Accounting': 4,
 'Aquarius': 571,
 'Aries': 4198,
 'Arts': 45,
 'Automotive': 14,
 'Banking': 16,
 'BusinessServices': 91,
 'Cancer': 504,
 'Capricorn': 215,
 'Communications-Media': 99,
 'Consulting': 21,
 'Education': 270,
 'Engineering': 127,
 'Fashion': 1622,
 'Gemini': 150,
 'HumanResources': 2,
 'Internet': 118,
 'InvestmentBanking': 70,
 'Law': 11,
 'LawEnforcement-Security': 10,
 'Leo': 301,
 'Libra': 491,
 'Marketing': 156,
 'Museums-Libraries': 17,
 'Non-Profit': 71,
 'Pisces': 454,
 'Publishing': 4,
 'Religion': 9,
 'Sagittarius': 1097,
 'Science': 63,
 'Scorpio': 971,
 'Sports-Recreation': 80,
 'Student': 1137,
 'Taurus': 812,
 'Technology': 2654,
 'Telecommunications': 2,
 'Virgo': 236,
 'female': 4

Transforming the labels

In [41]:
#Lets preprocess the labels

binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

In [42]:
Y_train = binarizer.fit_transform(Y_train)

In [43]:
Y_test = binarizer.transform(Y_test)

In [44]:
Y_test

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [45]:
Y_train

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])
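
The indicator matrices above can be hard to read, so the sketch below reproduces the idea of multi-label binarization in plain Python. The helper names `fit_classes`, `binarize`, and `inverse` are hypothetical and only mimic the behaviour for illustration; the two label lists are taken from the first rows shown earlier.

```python
def fit_classes(label_lists):
    """Collect every distinct label and fix a sorted column order."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """One row per sample, one 0/1 column per class."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

def inverse(rows, classes):
    """Map each 0/1 row back to the tuple of labels it encodes."""
    return [tuple(c for c, bit in zip(classes, row) if bit) for row in rows]

y = [["male", "15", "Student", "Leo"],
     ["male", "33", "InvestmentBanking", "Aquarius"]]
classes = fit_classes(y)
encoded = binarize(y, classes)
print(classes)
print(encoded)
print(inverse(encoded, classes))
```

Decoding returns labels in column (sorted) order rather than in the original gender, age, topic, sign order, which is why decoded tuples start with the age.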

### We need to choose a classifier


In [46]:
#We wrap LogisticRegression in OneVsRestClassifier, which fits one binary classifier per label


In [47]:
model=LogisticRegression(solver='lbfgs', max_iter=100)
model=OneVsRestClassifier(model)
model.fit(X_train_ct,Y_train)

OneVsRestClassifier(estimator=LogisticRegression())
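
As a small illustration of the one-vs-rest setup, the sketch below fits the same kind of model on toy data. The arrays `X` and `Y` are made up for this example (6 samples, 2 binary features, 3 labels) and have nothing to do with the blog corpus; the point is only to show the shapes involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (made up for illustration): 6 samples, 2 binary features, 3 labels.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]])

clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=100))
clf.fit(X, Y)

# One fitted binary LogisticRegression per label column.
print(len(clf.estimators_))
# predict() returns a 0/1 indicator matrix with the same column layout as Y.
print(clf.predict(X).shape)
# predict_proba() gives one independent probability per label.
print(clf.predict_proba(X).shape)
```

Since every label is decided independently, nothing forces a prediction to contain exactly one gender, one age, one topic, and one sign; a row may come back with fewer or more labels than the true set.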

In [48]:
Y_pred=model.predict(X_test_ct)


In [49]:
Y_pred

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]])

In [50]:
Y_test


array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1]])

### Display and explain the classification report in detail

In [51]:
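#We report micro, macro and weighted averages.
#Micro averaging pools the counts over all labels; macro averaging computes the metric per label and takes the unweighted mean;
#weighted averaging weights each label's metric by its number of true instances.
#Note: average_precision_score summarises the precision-recall curve and is normally given scores (e.g. predict_proba)
#rather than 0/1 predictions, so the values printed below are not the same thing as plain precision.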


def display_metrics_micro(Y_test, Y_pred):
    print('Accuracy score: ', accuracy_score(Y_test, Y_pred))
    print('F1 score: Micro', f1_score(Y_test, Y_pred, average='micro'))
    print('Average precision score: Micro', average_precision_score(Y_test, Y_pred, average='micro'))
    print('Average recall score: Micro', recall_score(Y_test, Y_pred, average='micro'))
    
    
def display_metrics_macro(Y_test, Y_pred):
    print('Accuracy score: ', accuracy_score(Y_test, Y_pred))
    print('F1 score: Macro', f1_score(Y_test, Y_pred, average='macro'))
    print('Average recall score: Macro', recall_score(Y_test, Y_pred, average='macro'))
    
def display_metrics_weighted(Y_test, Y_pred):
    print('Accuracy score: ', accuracy_score(Y_test, Y_pred))
    print('F1 score: weighted', f1_score(Y_test, Y_pred, average='weighted'))
    print('Average precision score: weighted', average_precision_score(Y_test, Y_pred, average='weighted'))
    print('Average recall score: weighted', recall_score(Y_test, Y_pred, average='weighted'))

In [52]:
display_metrics_micro(Y_test,Y_pred)


Accuracy score:  0.327
F1 score: Micro 0.6424971793907484
Average precision score: Micro 0.45976173129131254
Average recall score: Micro 0.533875


In [53]:
display_metrics_macro(Y_test,Y_pred)


Accuracy score:  0.327
F1 score: Macro 0.2313766894433141
Average recall score: Macro 0.17428093502655756


In [54]:
display_metrics_weighted(Y_test,Y_pred)


Accuracy score:  0.327
F1 score: weighted 0.5954162143427103
Average precision score: weighted 0.5135626619782332
Average recall score: weighted 0.533875


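The report shows accuracy, F1, average precision, and recall for the model. The accuracy of 0.327 is subset accuracy: a prediction counts as correct only when every label of a post is predicted exactly. The micro-averaged F1 (0.64) pools decisions across all labels and is dominated by the frequent labels, while the much lower macro-averaged F1 (0.23) weights every label equally and reflects weak performance on the rare labels.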
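
To see why micro and macro averages can differ so much, here is a worked example in plain Python. It is a sketch under stated assumptions: the helper functions are hypothetical, and `y_true` / `y_pred` are made-up indicator rows for three imaginary labels (one frequent, one moderately common, one rare), not results from the notebook.

```python
def counts(y_true, y_pred, col):
    """True positives, false positives, false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
    return tp, fp, fn

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Pool the counts of all labels, then compute a single F1."""
    per_label = [counts(y_true, y_pred, c) for c in range(len(y_true[0]))]
    tp, fp, fn = (sum(x) for x in zip(*per_label))
    return f1(tp, fp, fn)

def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    per_label = [counts(y_true, y_pred, c) for c in range(len(y_true[0]))]
    return sum(f1(*c) for c in per_label) / len(per_label)

# Made-up data: column 0 is frequent and always right, column 2 is rare and missed.
y_true = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 1, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]

print("subset accuracy:", subset_accuracy(y_true, y_pred))
print("micro F1:", micro_f1(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case the rare label that is never predicted contributes an F1 of 0 to the macro average but only one missed count to the micro average, which mirrors the gap between the two scores in the report.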

### Print the true vs predicted labels for any 5 entries from the dataset.

In [55]:
preds = Y_pred[:15]
actuals = Y_test[:15]

In [56]:
five_actual = binarizer.inverse_transform(actuals)
five_actual

[('36', 'Aries', 'Fashion', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('42', 'Consulting', 'Leo', 'female'),
 ('17', 'Scorpio', 'female', 'indUnk'),
 ('36', 'Aries', 'Fashion', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('36', 'Aries', 'Fashion', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('39', 'Communications-Media', 'Libra', 'male'),
 ('36', 'Aries', 'Fashion', 'male')]

In [57]:
five_pred = binarizer.inverse_transform(preds)
five_pred

[('male',),
 ('female',),
 ('Technology', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('female', 'indUnk'),
 ('17', 'Scorpio', 'female', 'indUnk'),
 ('36', 'Aries', 'Fashion', 'male'),
 ('Aries', 'male'),
 ('Aries', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('35', 'Aries', 'Technology', 'male'),
 ('Aries', 'male'),
 ('male',),
 ('indUnk', 'male'),
 ('Aries', 'male')]

In [58]:
print(binarizer.inverse_transform(Y_pred)[500])
print(binarizer.inverse_transform(Y_test)[500])

('male',)
('35', 'Aries', 'Technology', 'male')


In [59]:
print(binarizer.inverse_transform(Y_pred)[400])
print(binarizer.inverse_transform(Y_test)[400])

('36', 'Aries', 'Fashion', 'male')
('36', 'Aries', 'Fashion', 'male')


In [60]:
print(binarizer.inverse_transform(Y_pred)[450])
print(binarizer.inverse_transform(Y_test)[450])

('27', '36', 'Fashion', 'female', 'indUnk')
('36', 'Aries', 'Fashion', 'male')


In [61]:
print(binarizer.inverse_transform(Y_pred)[333])
print(binarizer.inverse_transform(Y_test)[333])

('male',)
('16', 'Libra', 'Student', 'female')


In [62]:
print(binarizer.inverse_transform(Y_pred)[666])
print(binarizer.inverse_transform(Y_test)[666])

('Sagittarius', 'female', 'indUnk')
('34', 'Sagittarius', 'female', 'indUnk')
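
Reading the tuples side by side is error-prone, so a small comparison helper can summarise each pair. This is an illustrative sketch: `compare_labels` is a hypothetical function, and the two pairs are copied from the printed results for test rows 500 and 450.

```python
def compare_labels(actual, predicted):
    """Split the labels into correctly predicted, missed, and wrongly added."""
    actual_set, predicted_set = set(actual), set(predicted)
    return {
        "correct": sorted(actual_set & predicted_set),
        "missed": sorted(actual_set - predicted_set),
        "extra": sorted(predicted_set - actual_set),
    }

pairs = [
    (("35", "Aries", "Technology", "male"), ("male",)),
    (("36", "Aries", "Fashion", "male"),
     ("27", "36", "Fashion", "female", "indUnk")),
]
for actual, predicted in pairs:
    print(compare_labels(actual, predicted))
```

The second pair shows a side effect of independent per-label decisions: the prediction contains two ages and no sign at all.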


## <u>Part Two - PROJECT BASED </u>

• <b>DOMAIN:</b>  Customer support

• <b>CONTEXT:</b> Great Learning has an academic support department which receives numerous support requests every day throughout the
year. Teams are spread across geographies and try to provide support round the year. Sometimes, due to
heavy workload, certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic, where a
proper resolution procedure delivered to the user can solve the problem. The company wants to design an automation which can
interact with the user, understand the problem and display the resolution procedure [ if found as a generic request ] or redirect the request
to an actual human support executive if the request is complex or not in its database.

• <b>DATA DESCRIPTION:</b>  A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics
skills.<br>

• <b>PROJECT OBJECTIVE:</b> Design a Python-based, interactive, semi-rule-based chatbot which can do the following:

1. Start chat session with greetings and ask what the user is looking for.
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.<br>
3. End the chat session only if the user requests to end it; otherwise ask what the user is looking for. The loop continues until the user asks to end it.<br>
   
• <b>EVALUATION</b>:  GL evaluator will use linguistics to twist and turn sentences to ask questions on the topics described in <b>DATA DESCRIPTION</b>
and check if the bot is giving relevant replies.

## Solution

In [63]:
import random
import json
import string
from nltk.stem import WordNetLemmatizer
# install specific downloads
nltk.download('punkt', quiet = True)
nltk.download('wordnet', quiet = True)



True

In [64]:
#importing the corpus; the with-block closes the file once it has been read
with open("/content/drive/MyDrive/AIML/Labs/CV/GL Bot.json") as data_file:
    intents = json.load(data_file)




In [65]:
intents #Displaying the corpus; it is a dictionary with keys and values

{'intents': [{'context_set': '',
   'patterns': ['hi',
    'how are you',
    'is anyone there',
    'hello',
    'whats up',
    'hey',
    'yo',
    'listen',
    'please help me',
    'i am learner from',
    'i belong to',
    'aiml batch',
    'aifl batch',
    'i am from',
    'my pm is',
    'blended',
    'online',
    'i am from',
    'hey ya',
    'talking to you for first time'],
   'responses': ['Hello! how can i help you ?'],
   'tag': 'Intro'},
  {'context_set': '',
   'patterns': ['thank you',
    'thanks',
    'cya',
    'see you',
    'later',
    'see you later',
    'goodbye',
    'i am leaving',
    'have a Good day',
    'you helped me',
    'thanks a lot',
    'thanks a ton',
    'you are the best',
    'great help',
    'too good',
    'you are a good learning buddy'],
   'responses': ['I hope I was able to assist you, Good Bye'],
   'tag': 'Exit'},
  {'context_set': '',
   'patterns': ['olympus',
    'explain me how olympus works',
    'I am not able to understa

In [66]:
intents['intents'][0].keys() #Each intent shares the same inner keys

dict_keys(['tag', 'patterns', 'responses', 'context_set'])

In [67]:
intents['intents'][0].values() #Since it is a nested dictionary, these are the values stored under each inner key

dict_values(['Intro', ['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time'], ['Hello! how can i help you ?'], ''])

In [68]:
print(intents['intents'][0]['tag'])
print(intents['intents'][0]['patterns'])
print(intents['intents'][0]['responses'])

Intro
['hi', 'how are you', 'is anyone there', 'hello', 'whats up', 'hey', 'yo', 'listen', 'please help me', 'i am learner from', 'i belong to', 'aiml batch', 'aifl batch', 'i am from', 'my pm is', 'blended', 'online', 'i am from', 'hey ya', 'talking to you for first time']
['Hello! how can i help you ?']


Every intent in the corpus follows this key-value structure.
Next, we extend the corpus with additional patterns of our own.

In [69]:
intro_tag = ['greetings','howdy','welcome','good morning','hi ya','how goes it','howdy do','whats happening']  #list of values adding to the intro tag
intents['intents'][0]['patterns'] += intro_tag #Appending the list
intents['intents'][0]['patterns']

['hi',
 'how are you',
 'is anyone there',
 'hello',
 'whats up',
 'hey',
 'yo',
 'listen',
 'please help me',
 'i am learner from',
 'i belong to',
 'aiml batch',
 'aifl batch',
 'i am from',
 'my pm is',
 'blended',
 'online',
 'i am from',
 'hey ya',
 'talking to you for first time',
 'greetings',
 'howdy',
 'welcome',
 'good morning',
 'hi ya',
 'how goes it',
 'howdy do',
 'whats happening']

In [70]:
exit_tag = ['bye','quit','exit','end','stop','take care','good day','so long','godspeed','farewell','ta ta','pause','mute','finish','cease','halt','terminate','wind up']
intents['intents'][1]['patterns'] += exit_tag #Appending the list
intents['intents'][1]['patterns']

['thank you',
 'thanks',
 'cya',
 'see you',
 'later',
 'see you later',
 'goodbye',
 'i am leaving',
 'have a Good day',
 'you helped me',
 'thanks a lot',
 'thanks a ton',
 'you are the best',
 'great help',
 'too good',
 'you are a good learning buddy',
 'bye',
 'quit',
 'exit',
 'end',
 'stop',
 'take care',
 'good day',
 'so long',
 'godspeed',
 'farewell',
 'ta ta',
 'pause',
 'mute',
 'finish',
 'cease',
 'halt',
 'terminate',
 'wind up']

In [71]:
ignore_punctuation = ["?", "!", ".", ","] #Ignoring unwanted char
lemmatizer = nltk.stem.WordNetLemmatizer()

In [72]:
def process_words(pattern):
    # return variable
    words = []
    # get the tokens using nltk
    tokens = nltk.word_tokenize(pattern)
    for word in tokens:
        # check if the word should be ignored
        if word not in ignore_punctuation and word.isalnum():
            # clean the word and add it to the list
            cleaned_word = lemmatizer.lemmatize(word.lower())
            words.append(cleaned_word)
    # return the list
    return words

In [73]:
def parse_intents(intents):
    # declare our needed variables
    tags = []
    all_words = []
    tag_tokens = []
    response_dict = dict()
    
    # iterate through each intent
    for intent in intents["intents"]:
        
        # add the noanswer tag to the dictionary (edge case)
        if (intent["tag"] == "noanswer"):
            response_dict["noanswer"] = intent["responses"]
        
        # if the intent has no patterns, we can skip
        if (len(intent["patterns"]) == 0):
            continue
        
        # add the tag to the list of tag
        tag = intent["tag"]
        tags.append(tag)
        
        # update the dictionary
        response_dict[tag] = intent["responses"]
        
        # iterate through each pattern
        for pattern in intent["patterns"]:
            # create our tokenized words
            tokenized_words = process_words(pattern)
            # add all the tokenized words to our words
            all_words.extend(tokenized_words)
            # adds a tuple -> (list of tokens, tag) -> to the list
            tag_tokens.append((tokenized_words, tag))      
    
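    # return our values in a tuple; tag_tokens stays a plain list because its
    # (tokens, tag) entries are ragged and do not form a regular NumPy array
    return (np.array(tags), np.array(all_words), tag_tokens, response_dict)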

In [74]:
# call our function
tags, all_words, tag_tokens, tag_responses = parse_intents(intents)
# sort and remove duplicates
tags = np.array(sorted(list(set(tags))))
all_words = np.array(sorted(list(set(all_words))))

In [75]:
print("Tags: {0}".format(tags))
print("------")
print("All Words: {0}".format(all_words))
print("------")
print("Tag-Token Mappings: {0}".format(tag_tokens))

Tags: ['Bot' 'Exit' 'Intro' 'NN' 'Olympus' 'Profane' 'SL' 'Ticket']
------
All Words: ['a' 'able' 'access' 'activation' 'ada' 'adam' 'aifl' 'aiml' 'am' 'an'
 'ann' 'anyone' 'are' 'artificial' 'backward' 'bad' 'bagging' 'batch'
 'bayes' 'belong' 'best' 'blended' 'bloody' 'boosting' 'bot' 'buddy' 'bye'
 'care' 'cease' 'classification' 'contact' 'create' 'cross' 'cya' 'day'
 'deep' 'did' 'diffult' 'do' 'end' 'ensemble' 'epoch' 'exit' 'explain'
 'farewell' 'finish' 'first' 'for' 'forest' 'forward' 'from' 'function'
 'go' 'godspeed' 'good' 'goodbye' 'gradient' 'great' 'greeting' 'halt'
 'happening' 'hate' 'have' 'hell' 'hello' 'help' 'helped' 'hey' 'hi'
 'hidden' 'hour' 'how' 'howdy' 'hyper' 'i' 'imputer' 'in' 'intelligence'
 'is' 'it' 'jerk' 'joke' 'knn' 'later' 'layer' 'learner' 'learning'
 'leaving' 'link' 'listen' 'logistic' 'long' 'lot' 'machine' 'me' 'ml'
 'morning' 'mute' 'my' 'naive' 'name' 'nb' 'net' 'network' 'neural' 'no'
 'not' 'of' 'olympus' 'olypus' 'on' 'online' 'operation' '

In [76]:
def build_bag(all_words, tokens): #Bag of words
    # reset our current bag
    bag = []
    for word in all_words:
        # add 0/1 if the word is in our token
        in_token = (word in tokens)
        bag.append(1 * in_token)
    return bag
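
For a concrete picture of the bag-of-words encoding, the sketch below applies an equivalent function to a tiny vocabulary. The vocabulary and tokens are hypothetical, and the function is rewritten with a set lookup purely for illustration; it returns the same kind of 0/1 list as the cell above.

```python
def build_bag(all_words, tokens):
    """Return a 0/1 vector with one position per vocabulary word."""
    token_set = set(tokens)  # set membership avoids rescanning the token list
    return [1 if word in token_set else 0 for word in all_words]

# Hypothetical mini-vocabulary, kept sorted like all_words in the notebook.
vocab = ["hello", "help", "me", "olympus", "thanks"]
bag = build_bag(vocab, ["please", "help", "me"])
print(bag)
```

Words that are not in the vocabulary, such as "please" here, simply leave no trace in the vector, so an input made up entirely of unknown words becomes an all-zero vector.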

In [77]:
def build_training_set(tags, all_words, tag_tokens):
    # define our variables to return
    train_x = []
    train_y = []
        
    # iterate through each tag-token mapping
    for tag_token in tag_tokens:
        
        # grab our needed values
        tokens = tag_token[0]
        tag = tag_token[1]
        
        # reset our current bag
        current_bag = build_bag(all_words, tokens)
            
        # update our training inputs
        train_x.append(current_bag)
        
        # set our outputs equal to 1 in the location
        train_y.append(1 * (tags == tag))
    
    # return our values
    return (np.array(train_x), np.array(train_y))

In [78]:
train_x, train_y = build_training_set(tags, all_words, tag_tokens)

In [79]:
print(train_x.shape)
print(train_y.shape)
print("Training Inputs: {0}".format(train_x))
print("-----")
print("Training Outputs: {0}".format(train_y))

(154, 183)
(154, 8)
Training Inputs: [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]
-----
Training Outputs: [[0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]]


In [80]:
# shuffle the indexes so that samples of the same tag are not grouped together during training
shuffled_indexes = np.random.permutation(train_x.shape[0])
# set new values for train_x and train_y
train_x = train_x[shuffled_indexes]
train_y = train_y[shuffled_indexes]

In [81]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import SGD

In [82]:
# declare our model
model = Sequential()
# add our layers
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

In [83]:
sgd = SGD(learning_rate = 0.01, decay = 1e-6, momentum = 0.9, nesterov = True)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss = 'categorical_crossentropy', optimizer = sgd, metrics = ['accuracy'])
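
The final softmax layer and the categorical cross-entropy loss can be worked through by hand. The following is a minimal pure-Python sketch of both formulas, assuming made-up logits for three classes; it is meant to illustrate the arithmetic, not to reproduce the Keras internals.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)

logits = [2.0, 1.0, 0.1]   # made-up scores for three classes
target = [1, 0, 0]         # one-hot encoding of the true class
probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs)
print(loss)
```

The loss shrinks towards 0 as the probability of the true class approaches 1, which is what training pushes the network towards.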

In [84]:
hist = model.fit(train_x, train_y, epochs = 500, batch_size = 5, verbose = 1)


Epoch 1/500
Epoch 2/500
...

In [85]:
def predict_tag(user_input, model):
    # tokenize/clean inputs
    process_input = process_words(user_input)
    
    # build the bag
    bag_input = build_bag(all_words, process_input)
    bag_input = np.array([bag_input]) # note: convert to a numpy array
    
    # get our predicted values
    pred_tag_values = model.predict(bag_input)
    pred_tag_values = pred_tag_values[0] # note: flatten the 2-d array
    
    # get the index and value of the largest probability value
    max_value_tag = np.argmax(pred_tag_values)
    probability = np.max(pred_tag_values)
    
    # predict the tag and return
    pred_tag = tags[max_value_tag]
    return (pred_tag, probability)

In [86]:
#Checking the functionality
# look at the probability for the bot's confidence level
custom_input = "How are you today?"
predict_tag(custom_input, model)

('Intro', 0.99999547)

In [87]:
def get_response(user_input, model, error_margin):
    # get the predicted tag and probability
    pred_tag, probability = predict_tag(user_input, model)
    # fall back to the "noanswer" responses when the model is not confident enough
    confident = probability > error_margin
    responses = tag_responses[pred_tag] if confident else tag_responses["noanswer"]
    # exit only when the model is confident that the user wants to leave
    should_exit_bot = confident and (pred_tag == "Exit")
    
    # get the response
    response = random.choice(responses)
    
    # return the variables
    return (response, should_exit_bot)

In [88]:
# Checking the functionality,
custom_input = "Thank You"
get_response(custom_input, model, 0.25)

('I hope I was able to assist you, Good Bye', True)
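
The effect of the `error_margin` argument is easiest to see in isolation. The sketch below is a simplified stand-in for the decision step, assuming hypothetical probability vectors instead of real model output; `pick_tag` is an illustrative name and not part of the notebook.

```python
def pick_tag(probabilities, tags, threshold, fallback="noanswer"):
    """Return the most probable tag, or the fallback if confidence is too low."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return tags[best]
    return fallback

tags = ["Exit", "Intro", "Olympus"]
confident = pick_tag([0.05, 0.90, 0.05], tags, threshold=0.75)
unsure_strict = pick_tag([0.40, 0.35, 0.25], tags, threshold=0.75)
unsure_lenient = pick_tag([0.40, 0.35, 0.25], tags, threshold=0.25)
print(confident, unsure_strict, unsure_lenient)
```

A higher threshold makes the bot hand over to the fallback more often, while a lower one lets it answer even when it is unsure; with the same ambiguous input, a threshold of 0.25 would end the chat while 0.75 would not.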

In [89]:
def chat():
    # initialize variables
    continue_chat = True
    robot_prefix = "Bot: "
    human_prefix = "You: "
    
    # give an introduction
    print(robot_prefix + "Hi! I am your Virtual Assistant for Great Learning. What are you looking for?")
    print("")
    
    # continue while the user doesn't say goodbye
    while (continue_chat):
        # get the user input from the console
        user_input = input(human_prefix)
        
        # get the response and exit condition from the helper function
        response, should_exit = get_response(user_input, model, 0.75)
        
        # print the bot's response
        print(robot_prefix + response)
        print("")
        
        # set the exit condition
        continue_chat = not should_exit

In [90]:
chat()


Bot: Hi! I am your Virtual Assistant for Great Learning. What are you looking for?

You: how are you
Bot: Hello! how can i help you ?

You: olympus
Bot: Link: Olympus wiki

You: online
Bot: Hello! how can i help you ?

You: ML
Bot: Link: Machine Learning wiki 

You: softmax
Bot: Link: Neural Nets wiki

You: what is your name
Bot: I am your virtual learning assistant

You: you are a joke
Bot: Please use respectful words

You: no help
Bot: Tarnsferring the request to your PM

You: quit
Bot: I hope I was able to assist you, Good Bye
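
Because `chat()` reads from `input()`, it is awkward to test automatically. One possible approach, sketched here under assumptions, is to inject the input and output functions so a scripted conversation can drive the loop. `run_chat` and `demo_respond` are hypothetical names; `demo_respond` is a rule-based stand-in for a call such as `get_response(text, model, 0.75)`.

```python
def run_chat(respond, read_input, write_output):
    """Loop until respond() signals exit; I/O is injected so it can be scripted."""
    write_output("Bot: Hi! What are you looking for?")
    while True:
        user_input = read_input()
        reply, should_exit = respond(user_input)
        write_output("Bot: " + reply)
        if should_exit:
            break

def demo_respond(text):
    """Hypothetical stand-in for the model-backed response function."""
    if text.strip().lower() in {"quit", "bye"}:
        return ("Good Bye", True)
    return ("What are you looking for?", False)

# Drive the loop with a fixed script instead of keyboard input.
scripted = iter(["hello", "quit"])
transcript = []
run_chat(demo_respond, lambda: next(scripted), transcript.append)
print("\n".join(transcript))
```

For interactive use, the same loop could be started with the built-in `input` and `print` functions as the last two arguments.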

