# • DOMAIN: Digital content management 

# • CONTEXT: Classification is probably the most common task that you would deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

# • DATA DESCRIPTION: Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

# • 8240 "10s" blogs (ages 13-17),
# • 8086 "20s" blogs (ages 23-27) and
# • 2994 "30s" blogs (ages 33-47)

# • For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label url link.

# • PROJECT OBJECTIVE: To build an NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.

# Steps and tasks:

# 1. Read and Analyse Dataset.

# A. Clearly write outcome of data analysis (Minimum 2 points)

In [1]:
# Import necessary libraries to be used in the project

import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem.snowball import SnowballStemmer
import warnings
warnings.filterwarnings('ignore')
import pandas_profiling as pp
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from langdetect import detect, DetectorFactory
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

DetectorFactory.seed = 0

In [2]:
from zipfile import ZipFile

file_name = "blogs.zip"

with ZipFile(file_name, 'r') as zip_file:  # avoid shadowing the built-in zip()
    zip_file.printdir()
    print('Extracting all the files now...')
    zip_file.extractall()
    print('Done!')

File Name                                             Modified             Size
blogtext.csv                                   2019-09-20 22:33:20    800419647
Extracting all the files now...
Done!


In [3]:
BLOGS_DF = pd.read_csv("blogtext.csv")

In [4]:
BLOGS_DF.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [5]:
BLOGS_DF.tail()

Unnamed: 0,id,gender,age,topic,sign,date,text
681279,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, I could write some really ..."
681280,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, 'I have the second yeast i..."
681281,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan, Your 'boyfriend' is fuckin..."
681282,1713845,male,23,Student,Taurus,"01,July,2004","Dear Susan: Just to clarify, I am as..."
681283,1713845,male,23,Student,Taurus,"01,July,2004","Hey everybody...and Susan, You might a..."


In [6]:
BLOGS_DF.count()

id        681284
gender    681284
age       681284
topic     681284
sign      681284
date      681284
text      681284
dtype: int64

In [7]:
BLOGS_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


In [8]:
print(BLOGS_DF.describe())

print('\n\n',BLOGS_DF.shape)

                 id            age
count  6.812840e+05  681284.000000
mean   2.397802e+06      23.932326
std    1.247723e+06       7.786009
min    5.114000e+03      13.000000
25%    1.239610e+06      17.000000
50%    2.607577e+06      24.000000
75%    3.525660e+06      26.000000
max    4.337650e+06      48.000000


 (681284, 7)


In [9]:
print(BLOGS_DF['topic'].unique())

print('\n\n',BLOGS_DF['age'].unique())

print('\n\n',BLOGS_DF['gender'].value_counts())

['Student' 'InvestmentBanking' 'indUnk' 'Non-Profit' 'Banking' 'Education'
 'Engineering' 'Science' 'Communications-Media' 'BusinessServices'
 'Sports-Recreation' 'Arts' 'Internet' 'Museums-Libraries' 'Accounting'
 'Technology' 'Law' 'Consulting' 'Automotive' 'Religion' 'Fashion'
 'Publishing' 'Marketing' 'LawEnforcement-Security' 'HumanResources'
 'Telecommunications' 'Military' 'Government' 'Transportation'
 'Architecture' 'Advertising' 'Agriculture' 'Biotech' 'RealEstate'
 'Manufacturing' 'Construction' 'Chemicals' 'Maritime' 'Tourism'
 'Environment']


 [15 33 14 25 17 23 37 26 24 27 45 34 41 44 16 39 35 36 46 42 13 38 43 40
 47 48]


 male      345193
female    336091
Name: gender, dtype: int64


    * BLOGS_DF CONTAINS 681284 ROWS AND 7 COLUMNS.
    * ID AND AGE ARE THE TWO COLUMNS OF INTEGER DATA TYPE.
    * ALL OTHER COLUMNS ARE OF OBJECT DATA TYPE.
    * THERE IS NO MISSING DATA IN THE DATA FRAME.
    * MINIMUM AGE IS 13 AND MAXIMUM AGE IS 48.
    * BLOGS CONTAIN 40 UNIQUE TOPICS.
    * THERE ARE 345193 POSTS BY MALE BLOGGERS AND 336091 POSTS BY FEMALE BLOGGERS (EACH ROW IS ONE POST).
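
The gender counts above are counted per row, and each row is one post. A minimal sketch of how distinct bloggers per gender could be counted instead, by keeping one row per `id` first. The small frame `toy_df` and its values are made up for illustration; they are not taken from the corpus.

```python
import pandas as pd

# Hypothetical miniature of BLOGS_DF: one row per post, several posts per blogger.
toy_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 3, 3],
    "gender": ["male", "male", "male", "female", "female", "female"],
})

# Rows (posts) per gender: what value_counts() on the full frame reports.
posts_per_gender = toy_df["gender"].value_counts()

# Distinct bloggers per gender: keep a single row for each blogger id first.
bloggers_per_gender = toy_df.drop_duplicates(subset="id")["gender"].value_counts()

print(posts_per_gender.to_dict())
print(bloggers_per_gender.to_dict())
```

The same two lines could be applied to `BLOGS_DF` to compare post counts with blogger counts.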

# B. Clean the Structured Data


# B.i. Missing value analysis and imputation.

In [10]:
BLOGS_DF.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [11]:
BLOGS_DF.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

    * THERE ARE NO NULL / MISSING VALUES IN THE BLOGS_DF

# B.ii. Eliminate Non-English textual data.

In [12]:
LANG = []

for TEXT in BLOGS_DF['text']:
    try:
        LANG.append(detect(TEXT))
    except Exception:
        # detect() raises when a post has no usable characters; skip those posts
        continue
    

In [15]:
np.unique(LANG)

array(['af', 'ar', 'bg', 'bn', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en',
       'es', 'et', 'fa', 'fi', 'fr', 'he', 'hi', 'hr', 'hu', 'id', 'it',
       'ja', 'ko', 'lt', 'lv', 'mk', 'nl', 'no', 'pl', 'pt', 'ro', 'ru',
       'sk', 'sl', 'so', 'sq', 'sv', 'sw', 'ta', 'th', 'tl', 'tr', 'uk',
       'ur', 'vi', 'zh-cn', 'zh-tw'], dtype='<U5')

In [43]:
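def DETECT_ENGLISH(TEXT):
    '''Return True when langdetect identifies TEXT as English; False on any detection error'''
    try:
        return detect(TEXT) == 'en'
    except Exception:
        return False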

In [44]:
# .copy() makes NEW_DF independent of BLOGS_DF, so later column assignments do not warn
NEW_DF = BLOGS_DF[BLOGS_DF['text'].apply(DETECT_ENGLISH)].copy()

In [46]:
DF_COPY = NEW_DF.copy()

NEW_DF.shape

(651500, 7)

    * WE HAVE ELIMINATED 29784 ROWS WHOSE TEXT WAS NOT DETECTED AS ENGLISH.
    * SHAPE OF THE NEW DATA FRAME IS 651500 ROWS AND 7 COLUMNS.

# 2. Preprocess unstructured data to make it consumable for model training.

# A. Eliminate All special Characters and Numbers
# B. Lowercase all textual data
# C. Remove all Stopwords
# D. Remove all extra white spaces

In [104]:
# Eliminate All special Characters and Numbers
import re
NEW_DF.text = NEW_DF['text'].replace(r'[^A-Za-z ]+', '', regex=True)

NEW_DF.text[0]

'           Info has been found   pages and  MB of pdf files Now i have to wait untill our team leader has processed it and learns html         '

In [105]:
# Lowercase all textual data
NEW_DF['text'] = NEW_DF['text'].str.lower()

NEW_DF.text[0]

'           info has been found   pages and  mb of pdf files now i have to wait untill our team leader has processed it and learns html         '

In [106]:
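# Remove all Stopwords (the split/join below also removes the extra white spaces)
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Use a separate name so the imported stopwords module is not overwritten
STOP_WORDS = set(stopwords.words('english'))
NEW_DF['text'] = NEW_DF['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in STOP_WORDS))

NEW_DF.text[0]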

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'info found pages mb pdf files wait untill team leader processed learns html'

    * WE HAVE REMOVED SPECIAL CHARACTERS AND NUMBERS FROM THE DATA SET.
    * WE HAVE CONVERTED THE TEXT DATA INTO LOWER CASE.
    * WE HAVE REMOVED ENGLISH STOPWORDS FROM THE DATA SET.
    * WE HAVE REMOVED EXTRA WHITE SPACES FROM THE DATA SET.
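
The four preprocessing steps above can also be sketched as one function. This is only an illustration: `clean_text` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the NLTK English list used in the notebook.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"has", "been", "and", "of", "i", "have", "to", "it", "our", "now"}

def clean_text(text, stop_words=TOY_STOPWORDS):
    """Apply steps A-D: strip non-letters, lowercase, drop stopwords, collapse spaces."""
    letters_only = re.sub(r"[^A-Za-z ]+", "", text)                # A. special characters and numbers
    lowered = letters_only.lower()                                  # B. lowercase
    tokens = [w for w in lowered.split() if w not in stop_words]    # C. stopwords
    return " ".join(tokens)                                         # D. split/join drops extra spaces

sample = "   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)   "
print(clean_text(sample))
```

A function like this could be passed to `NEW_DF['text'].apply(...)` so that the whole cleaning runs in a single pass over the column.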

# 3. Build a base Classification model
# A. Create dependent and independent variables

In [107]:
NEW_DF.dtypes

id         int64
gender    object
age        int64
topic     object
sign      object
date      object
text      object
dtype: object

In [108]:
NEW_DF = NEW_DF.drop(columns = ['id','date'])

In [109]:
NEW_DF['age'] = NEW_DF['age'].astype(str)

In [110]:
NEW_DF.dtypes

gender    object
age       object
topic     object
sign      object
text      object
dtype: object

In [111]:
NEW_DF['labels'] = NEW_DF['gender']+','+ NEW_DF['age'] +','+ NEW_DF['topic'] +','+NEW_DF['sign']

In [112]:
NEW_DF.drop(columns = ['gender','age','sign','topic'], axis = 1, inplace = True)

In [113]:
NEW_DF.reset_index(drop=True, inplace=True)

#NEW_DF.to_csv('BACKUP.csv', index = False)

# I AM TAKING ONLY THE FIRST 5000 RECORDS FOR TRAINING THE MODEL DUE TO CPU CRASHES.

DATA = NEW_DF[0:5000].copy()  # .copy() avoids SettingWithCopyWarning in later assignments

print(DATA.shape,'\n\n',DATA.head())

(5000, 2) 

                                                 text  \
0  info found pages mb pdf files wait untill team...   
1  het kader van kernfusie op aarde maak je eigen...   
2                                    testing testing   
3  thanks yahoos toolbar capture urls popupswhich...   
4  interesting conversation dad morning talking k...   

                               labels  
0                 male,15,Student,Leo  
1                 male,15,Student,Leo  
2                 male,15,Student,Leo  
3  male,33,InvestmentBanking,Aquarius  
4  male,33,InvestmentBanking,Aquarius  


In [114]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

LEMMATIZER = WordNetLemmatizer()

def LEMMATIZER_FUN(text):
    '''a function which lemmatizes each word in the given text'''
    text = [LEMMATIZER.lemmatize(word) for word in text.split()]
    return " ".join(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [115]:
DATA['text'] = DATA['text'].apply(LEMMATIZER_FUN)

DATA.head()

Unnamed: 0,text,labels
0,info found page mb pdf file wait untill team l...,"male,15,Student,Leo"
1,het kader van kernfusie op aarde maak je eigen...,"male,15,Student,Leo"
2,testing testing,"male,15,Student,Leo"
3,thanks yahoo toolbar capture url popupswhich m...,"male,33,InvestmentBanking,Aquarius"
4,interesting conversation dad morning talking k...,"male,33,InvestmentBanking,Aquarius"


In [116]:
from nltk.stem.snowball import SnowballStemmer

STEMMER = SnowballStemmer("english")

def STEMMER_FUN(text):    
    '''a function which stems each word in the given text'''
    text = [STEMMER.stem(word) for word in text.split()]
    return " ".join(text)

In [3]:
DATA['text'] = DATA['text'].apply(STEMMER_FUN)

DATA.head()

#DATA.to_csv('BACKUP2.csv', index = False)

#DATA = pd.read_csv('BACKUP2.csv')

Unnamed: 0,text,labels
0,info found page mb pdf file wait until team le...,"male,15,Student,Leo"
1,het kader van kernfusi op aard maak je eigen w...,"male,15,Student,Leo"
2,test test,"male,15,Student,Leo"
3,thank yahoo toolbar captur url popupswhich mea...,"male,33,InvestmentBanking,Aquarius"
4,interest convers dad morn talk korean put mone...,"male,33,InvestmentBanking,Aquarius"


In [4]:
X = DATA['text'] # INDEPENDENT VARIABLE
Y = DATA['labels'] # DEPENDENT VARIABLE

In [5]:
X.head()

0    info found page mb pdf file wait until team le...
1    het kader van kernfusi op aard maak je eigen w...
2                                            test test
3    thank yahoo toolbar captur url popupswhich mea...
4    interest convers dad morn talk korean put mone...
Name: text, dtype: object

In [6]:
Y[:5]

0                   male,15,Student,Leo
1                   male,15,Student,Leo
2                   male,15,Student,Leo
3    male,33,InvestmentBanking,Aquarius
4    male,33,InvestmentBanking,Aquarius
Name: labels, dtype: object

    * WE HAVE DROPPED THE ID AND DATE COLUMNS FROM THE DATA FRAME AS THEY HAVE NO SIGNIFICANT EFFECT.
    * WE HAVE MERGED GENDER, AGE, TOPIC AND SIGN TO FORM A NEW COLUMN, LABELS.
    * AFTER CREATING THE NEW COLUMN, WE HAVE DROPPED THE GENDER, AGE, TOPIC AND SIGN COLUMNS.
    * WE HAVE CREATED THE INDEPENDENT VARIABLE X AND ASSIGNED THE TEXT COLUMN TO IT.
    * WE HAVE CREATED THE TARGET VARIABLE Y AND ASSIGNED THE LABELS COLUMN TO IT AS THE DEPENDENT VARIABLE.
    
    
    *** I AM CONSIDERING ONLY 5000 RECORDS FOR TRAINING MY MODEL, BECAUSE USING THE HIGH-VOLUME DATA SET
        CONSTANTLY CRASHED THE CPU.
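
In the previews above, rows for the same blogger sit next to each other, so the first 5000 rows may cover relatively few authors. A minimal sketch of drawing a seeded random subset instead. The frame `toy_df` is a made-up stand-in for `NEW_DF`, and `random_state=25` is only an example seed.

```python
import pandas as pd

# Hypothetical stand-in for NEW_DF: rows are grouped by blogger, as in the previews.
toy_df = pd.DataFrame({
    "text": [f"post {i}" for i in range(12)],
    "labels": ["male,15,Student,Leo"] * 6
              + ["female,24,indUnk,Scorpio"] * 3
              + ["male,35,Technology,Aries"] * 3,
})

# Taking the first rows only sees the bloggers stored at the top of the frame.
head_subset = toy_df[0:6]

# A seeded random sample draws rows from the whole frame and stays reproducible.
random_subset = toy_df.sample(n=6, random_state=25).reset_index(drop=True)

print(head_subset["labels"].nunique())
print(random_subset["labels"].nunique())
```

The subset size stays the same, so memory use should be comparable, while the label mix is drawn from the full frame.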

# B. Split data into train and test

In [7]:
from sklearn.model_selection import train_test_split

X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(X, Y, test_size = 0.25, random_state = 25)

    * WE HAVE SPLIT THE DATA INTO TRAIN AND TEST IN RATIO 75:25
    
    * X_TRAIN AND Y_TRAIN ARE THE TRAINING DATA AND X_TEST AND Y_TEST ARE THE TEST DATA.

# C. Vectorize data using any one vectorizer.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

VECTORIZER = CountVectorizer(binary=True, ngram_range=(1,2))

In [9]:
X_TRAIN_V = VECTORIZER.fit_transform(X_TRAIN)

In [10]:
X_TRAIN_A = X_TRAIN_V.toarray()

In [11]:
X_TRAIN_A.shape

(3750, 244807)

In [12]:
X_TEST_V = VECTORIZER.transform(X_TEST)

X_TEST_A = X_TEST_V.toarray()

X_TEST_A.shape

(1250, 244807)
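
The shapes above give a rough idea of why the dense arrays are heavy. A back-of-the-envelope sketch, assuming 8 bytes per element (a 64-bit dtype); `dense_size_gib` is a hypothetical helper, not part of any library.

```python
# Shapes reported above; 8 bytes per element assumes a 64-bit dtype.
BYTES_PER_ELEMENT = 8

def dense_size_gib(n_rows, n_cols, bytes_per_element=BYTES_PER_ELEMENT):
    """Size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_element / 2**30

train_gib = dense_size_gib(3750, 244807)
test_gib = dense_size_gib(1250, 244807)
print(f"train: {train_gib:.2f} GiB, test: {test_gib:.2f} GiB")
```

Under that assumption the dense copies need several GiB, which may explain the crashes. The sparse matrices `X_TRAIN_V` and `X_TEST_V` could probably be passed to `fit` and `predict` directly, as is done later in this notebook with the TF-IDF matrices.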

In [13]:
nltk.download('punkt')

MAX = 100

LABELS = DATA['labels'].str.cat(sep=',')

LABELS = LABELS.replace(',',' ')

WORDS = nltk.tokenize.word_tokenize(LABELS)

word_dist = nltk.FreqDist(WORDS)

print (word_dist)

RESULTS = pd.DataFrame(word_dist.most_common(MAX),columns=['WORDS', 'FREQUENCY'])

RESULTS.head(5)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<FreqDist with 54 samples and 20000 outcomes>


Unnamed: 0,WORDS,FREQUENCY
0,male,3066
1,Aries,2286
2,Technology,2136
3,35,2111
4,female,1934


In [14]:
RESULTS_TO_DICT = pd.Series(RESULTS.FREQUENCY.values,index= RESULTS.WORDS).to_dict()

RESULTS_TO_DICT

{'male': 3066,
 'Aries': 2286,
 'Technology': 2136,
 '35': 2111,
 'female': 1934,
 'indUnk': 1618,
 'Sagittarius': 682,
 'Scorpio': 676,
 '17': 599,
 'Student': 546,
 '34': 528,
 'Libra': 409,
 '24': 337,
 '15': 325,
 'Aquarius': 317,
 '25': 262,
 'Leo': 183,
 '14': 168,
 '23': 128,
 'Engineering': 119,
 'Education': 116,
 '33': 101,
 'Taurus': 94,
 '26': 88,
 '27': 86,
 'Capricorn': 84,
 'Gemini': 82,
 'Cancer': 81,
 '39': 77,
 'BusinessServices': 75,
 'Sports-Recreation': 75,
 'InvestmentBanking': 70,
 'Pisces': 67,
 '16': 65,
 'Communications-Media': 61,
 '36': 60,
 'Non-Profit': 46,
 'Virgo': 39,
 'Science': 31,
 'Arts': 31,
 'Internet': 20,
 '37': 19,
 'Banking': 16,
 'Consulting': 15,
 '45': 14,
 '41': 14,
 'Automotive': 14,
 '42': 9,
 '46': 6,
 'Religion': 4,
 '44': 3,
 'Law': 3,
 'Museums-Libraries': 2,
 'Accounting': 2}

In [15]:
# Binarize the labels using multilabel binarizer

from sklearn.preprocessing import MultiLabelBinarizer

BINARIZER = MultiLabelBinarizer(classes=sorted(RESULTS_TO_DICT.keys()))

In [16]:
D_LABELS = pd.DataFrame(data = Y_TRAIN)

#D_LABELS.head()

Y_TRAIN_B = pd.DataFrame(BINARIZER.fit_transform(D_LABELS.labels.str.split(',')),
                         columns=BINARIZER.classes_,index=D_LABELS.labels).reset_index()

Y_TRAIN_B.head()

Unnamed: 0,labels,14,15,16,17,23,24,25,26,27,...,Science,Scorpio,Sports-Recreation,Student,Taurus,Technology,Virgo,female,indUnk,male
0,"female,17,Student,Aries",0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,"male,25,Non-Profit,Cancer",0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,"female,34,indUnk,Virgo",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,0
3,"male,35,Technology,Aries",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
4,"female,24,indUnk,Scorpio",0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,1,0


In [17]:
Y_TRAIN_B.drop(['labels'],inplace = True,axis = 1)

Y_TRAIN_B = Y_TRAIN_B.to_numpy()

Y_TRAIN_B

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0]])

In [18]:
Y_TRAIN_B.shape

(3750, 54)

In [19]:
D_TEST_LABELS = pd.DataFrame(data = Y_TEST)

#D_TEST_LABELS.head()

# Reuse the binarizer fitted on the training labels: transform only, do not refit on test data
Y_TEST_B = pd.DataFrame(BINARIZER.transform(D_TEST_LABELS.labels.str.split(',')),
                         columns=BINARIZER.classes_,index=D_TEST_LABELS.labels).reset_index()

Y_TEST_B.head()

Unnamed: 0,labels,14,15,16,17,23,24,25,26,27,...,Science,Scorpio,Sports-Recreation,Student,Taurus,Technology,Virgo,female,indUnk,male
0,"male,24,Engineering,Libra",0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,"female,17,indUnk,Scorpio",0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0
2,"male,35,Technology,Aries",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
3,"male,35,Technology,Aries",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
4,"male,35,Technology,Aries",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


In [20]:
Y_TEST_B.drop(['labels'],inplace = True,axis = 1)

Y_TEST_B = Y_TEST_B.to_numpy()

Y_TEST_B

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 1, 1, 0]])

In [21]:
Y_TEST_B.shape

(1250, 54)
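
A small sketch of the binarization step on made-up label strings, showing the usual pattern of fitting on the training labels and only transforming the test labels so both matrices share the same columns. The label values are illustrative; only the `gender,age,topic,sign` format follows the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label strings in the notebook's "gender,age,topic,sign" format.
train_labels = ["male,15,Student,Leo", "female,24,indUnk,Scorpio"]
test_labels = ["male,24,Student,Scorpio"]

binarizer = MultiLabelBinarizer()

# Learn the label vocabulary from the training split only...
y_train = binarizer.fit_transform([s.split(",") for s in train_labels])
# ...and reuse it for the test split so both matrices share the same columns.
y_test = binarizer.transform([s.split(",") for s in test_labels])

print(list(binarizer.classes_))
print(y_test)
```

Each column is one label value, and a row has a 1 in every column whose label appears in that row's string.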

# D. Build a base model for Supervised Learning - Classification.

In [17]:
BASE_MODEL = LogisticRegression(solver='lbfgs')
BASE_MODEL = OneVsRestClassifier(BASE_MODEL)

In [43]:
BASE_MODEL.fit(X_TRAIN_A, Y_TRAIN_B)

In [44]:
PREDICTIONS = BASE_MODEL.predict(X_TEST_A)

In [22]:
Y_TEST_B.shape

(1250, 54)

In [46]:
PREDICTIONS.shape

(1250, 54)

In [47]:
PREDICTIONS

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 0]])

# E. Clearly print Performance Metrics.

In [25]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

In [77]:
print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS,average='weighted'))

ACCURACY:  0.4696

PRECISION:  0.7928180951436132

RECALL:  0.6198

F1_SCORE:  0.6421249229967048
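
For multi-label targets, `accuracy_score` counts a row as correct only when every label in that row is predicted correctly (subset accuracy), which is one reason it sits well below the weighted precision, recall and F1. A pure-Python sketch of the difference between exact-match accuracy and per-label accuracy, on made-up indicator rows; both function names are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label vector matches exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label cells predicted correctly (1 - Hamming loss)."""
    cells = [a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(cells) / len(cells)

# Hypothetical 3 rows x 4 labels indicator matrices.
y_true = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

Here two of three rows each miss a single label, so exact-match accuracy drops to one third while most individual labels are still right.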


In [65]:
print('CLASSIFICATION REPORT:\n\n',classification_report(Y_TEST_B,PREDICTIONS))

CLASSIFICATION REPORT:

               precision    recall  f1-score   support

           0       1.00      0.13      0.23        39
           1       0.71      0.17      0.28        87
           2       0.00      0.00      0.00        10
           3       0.75      0.34      0.46       161
           4       0.00      0.00      0.00        32
           5       0.85      0.20      0.32        86
           6       1.00      0.03      0.06        65
           7       0.00      0.00      0.00        20
           8       1.00      0.05      0.09        22
           9       1.00      0.18      0.30        34
          10       0.99      0.78      0.88       130
          11       0.79      0.89      0.84       499
          12       1.00      0.05      0.10        20
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00        26
          15       0.00      0.00      0.00         3
          16       0.00      0.00      0.00         2
  

# 4. Improve Performance of model. 
# A. Experiment with other vectorisers.

In [22]:
# USING TFIDF VECTORIZER AND TRAIN THE BASE MODEL.

from sklearn.feature_extraction.text import TfidfVectorizer

TFID_VEC = TfidfVectorizer(stop_words= 'english')

TFID_VEC.fit(X_TRAIN)

X_TRAIN_T = TFID_VEC.transform(X_TRAIN)

X_TEST_T = TFID_VEC.transform(X_TEST)

In [23]:
print('TRAIN DATA SHAPE: ', X_TRAIN_T.shape)

print('\n\nTEST DATA SHAPE: ', X_TEST_T.shape)

TRAIN DATA SHAPE:  (3750, 24613)


TEST DATA SHAPE:  (1250, 24613)


In [24]:
X_TRAIN_T

<3750x24613 sparse matrix of type '<class 'numpy.float64'>'
	with 195037 stored elements in Compressed Sparse Row format>
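
To make the TF-IDF weights more concrete, here is a pure-Python sketch of the weighting, assuming scikit-learn's default settings (smoothed idf and L2 row normalisation). The helper names and the toy document frequencies are hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    """TF-IDF weights for one document, L2-normalised."""
    raw = {t: c * smoothed_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Hypothetical corpus of 4 documents: "like" occurs in all of them, "kernfusi" in one.
doc_freqs = {"like": 4, "kernfusi": 1}
weights = tfidf_row({"like": 1, "kernfusi": 1}, doc_freqs, n_docs=4)
print(weights)
```

A term that occurs in every document keeps the minimum idf of 1, while a rarer term gets a larger weight in the same row.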

In [None]:
BASE_MODEL.fit(X_TRAIN_T, Y_TRAIN_B)

In [73]:
PREDICTIONS_TFID = BASE_MODEL.predict(X_TEST_T)

In [74]:
print('ACCURACY_TFID: ',accuracy_score(Y_TEST_B,PREDICTIONS_TFID))

ACCURACY_TFID:  0.3616


In [75]:
# Report for the TF-IDF predictions (PREDICTIONS_TFID, not the CountVectorizer PREDICTIONS)
print('CLASSIFICATION REPORT:\n\n',classification_report(Y_TEST_B,PREDICTIONS_TFID))

In [78]:
print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TFID))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_TFID,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_TFID,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_TFID,average='weighted'))

ACCURACY:  0.3616

PRECISION:  0.7766332943479283

RECALL:  0.522

F1_SCORE:  0.5524569754617406


In [79]:
# INSPECT THE TOKENS WITH THE HIGHEST MEAN TF-IDF WEIGHT (THIS REUSES THE TF-IDF VECTORIZER, NOT WORD2VEC)

TFID_TRAIN = TFID_VEC.fit_transform(X_TRAIN)

TFID_TRAIN.shape

(3750, 24613)

In [80]:
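# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
TFID_TRAIN_CORPUS = pd.DataFrame(TFID_TRAIN.toarray(), columns = TFID_VEC.get_feature_names_out(), index = X_TRAIN.index)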

In [81]:
TFID_TRAIN_CORPUS.shape

(3750, 24613)

In [82]:
TFID_TRAIN_CORPUS.mean().sort_values(ascending = False).head(10)

urllink    0.029574
im         0.027815
like       0.024794
know       0.019543
dont       0.019042
think      0.018049
time       0.016546
realli     0.016169
good       0.015711
day        0.015660
dtype: float64

In [83]:
TOP50_TOKENS_TRAIN = (TFID_TRAIN_CORPUS.mean().sort_values(ascending = False).head(50).index)

TOP50_TOKENS_TRAIN

Index(['urllink', 'im', 'like', 'know', 'dont', 'think', 'time', 'realli',
       'good', 'day', 'thing', 'want', 'peopl', 'love', 'make', 'work', 'say',
       'feel', 'got', 'look', 'today', 'ive', 'new', 'littl', 'need', 'come',
       'right', 'life', 'someth', 'night', 'way', 'post', 'blog', 'year',
       'diva', 'read', 'friend', 'ill', 'thought', 'tri', 'start', 'week',
       'mean', 'went', 'thank', 'oh', 'talk', 'said', 'didnt', 'hope'],
      dtype='object')

# B. Build classifier Models using other algorithms than base model

In [26]:
RFC = RandomForestClassifier()

RFC = OneVsRestClassifier(RFC)

RFC.fit(X_TRAIN_A, Y_TRAIN_B)

In [28]:
PREDICTIONS_RFC = RFC.predict(X_TEST_A)

In [29]:
print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_RFC))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_RFC,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_RFC,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_RFC,average='weighted'))

ACCURACY:  0.3576

PRECISION:  0.5796927544458358

RECALL:  0.4854

F1_SCORE:  0.4540750413295921


In [27]:
#RFC = OneVsRestClassifier(RFC)

RFC.fit(X_TRAIN_T, Y_TRAIN_B)

In [28]:
PREDICTIONS_RFCT = RFC.predict(X_TEST_T)

In [29]:
print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_RFCT))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_RFCT,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_RFCT,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_RFCT,average='weighted'))

ACCURACY:  0.3512

PRECISION:  0.7775546298296782

RECALL:  0.5168

F1_SCORE:  0.549017216136401


In [57]:
from sklearn.ensemble import AdaBoostClassifier

ABC = AdaBoostClassifier()

ABC = OneVsRestClassifier(ABC)

In [64]:
ABC.fit(X_TRAIN_T, Y_TRAIN_B)

In [65]:
PREDICTIONS_ABC = ABC.predict(X_TEST_T)  # AdaBoost predictions on the TF-IDF features

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_ABC))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))

ACCURACY:  0.3888

PRECISION:  0.7087703201541039

RECALL:  0.6242

F1_SCORE:  0.6473237444826353


# C. Tune Parameters/Hyperparameters of the model/s.

In [33]:
RFC_TUNED = RandomForestClassifier(n_estimators= 300, max_features= 'sqrt', 
                                   max_depth= 6, criterion= 'entropy', bootstrap= True)

In [34]:
# Note: C has no effect when penalty='none'; it only applies together with a penalty such as 'l2'
BASE_MODEL_TUNED = LogisticRegression(solver= 'newton-cg', penalty= 'none', C= 0.5)

BASE_MODEL_TUNED = OneVsRestClassifier(BASE_MODEL_TUNED)

In [None]:
BASE_MODEL_TUNED.fit(X_TRAIN_A, Y_TRAIN_B)

In [35]:
RFC_TUNED = OneVsRestClassifier(RFC_TUNED)

RFC_TUNED.fit(X_TRAIN_A, Y_TRAIN_B)

In [36]:
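# Note: RFC_TUNED is already wrapped in OneVsRestClassifier above, so it is wrapped a second time here
RFC_OVS = OneVsRestClassifier(RFC_TUNED)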

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [37]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2432


In [38]:
# Accuracy is still very low. Fit RFC_TUNED itself on the TF-IDF vectorizer data

RFC_TUNED.fit(X_TRAIN_T, Y_TRAIN_B)

In [39]:
PREDICTIONS_TUNED = RFC_TUNED.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2536


In [40]:
# Try with the gini criterion. I am using only the TF-IDF vectorizer data to train the model

RFC_TUNED = RandomForestClassifier(n_estimators= 300, max_features= 'sqrt', 
                                   max_depth= 6, criterion= 'gini', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [41]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2392


In [42]:
# No improvement in accuracy. Try with the log_loss criterion

RFC_TUNED = RandomForestClassifier(n_estimators= 300, max_features= 'sqrt', 
                                   max_depth= 6, criterion= 'log_loss', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [43]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.244


In [44]:
# try with reducing the n_estimators to 100 and criterion to entropy and max depth to 8

RFC_TUNED = RandomForestClassifier(n_estimators= 100, max_features= 'sqrt', 
                                   max_depth= 8, criterion= 'entropy', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [45]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2696


In [46]:
# try with max_depth = 10

RFC_TUNED = RandomForestClassifier(n_estimators= 100, max_features= 'sqrt', 
                                   max_depth= 10, criterion= 'entropy', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [47]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2792


In [48]:
# try with n_estimators = 50

RFC_TUNED = RandomForestClassifier(n_estimators= 50, max_features= 'sqrt', 
                                   max_depth= 10, criterion= 'entropy', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [49]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2768


In [50]:
# try with n_estimators = 50 and max_depth = 25

RFC_TUNED = RandomForestClassifier(n_estimators= 50, max_features= 'sqrt', 
                                   max_depth= 25, criterion= 'entropy', bootstrap= True)

RFC_OVS = OneVsRestClassifier(RFC_TUNED)

RFC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [51]:
PREDICTIONS_TUNED = RFC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_TUNED))

ACCURACY:  0.2904
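
The manual trials above change one setting at a time and score each on the test set. A minimal sketch of how the same kind of search could be run with cross-validation through `GridSearchCV`. The synthetic data from `make_multilabel_classification` only stands in for `X_TRAIN_T` / `Y_TRAIN_B`, and the grid values are illustrative, not recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data stands in for X_TRAIN_T / Y_TRAIN_B (an assumption).
X_demo, y_demo = make_multilabel_classification(
    n_samples=150, n_features=20, n_classes=4, random_state=0
)

# Parameters of the wrapped forest are addressed with the "estimator__" prefix.
param_grid = {
    "estimator__n_estimators": [10, 30],
    "estimator__max_depth": [4, 8],
}

search = GridSearchCV(
    OneVsRestClassifier(RandomForestClassifier(random_state=0)),
    param_grid=param_grid,
    scoring="f1_weighted",  # weighted F1, the same averaging used in this notebook
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on cross-validation folds of the training data keeps the test set untouched until the final evaluation.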


In [None]:
from sklearn.tree import DecisionTreeClassifier

DTC_ALG = DecisionTreeClassifier()

ABC_TUNING = AdaBoostClassifier(base_estimator=DTC_ALG,algorithm='SAMME')

ABC_OVS = OneVsRestClassifier(ABC_TUNING)

ABC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

In [53]:
PREDICTIONS_ABCT = ABC_OVS.predict(X_TEST_T)

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_ABCT))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_ABCT,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_ABCT,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_ABCT,average='weighted'))

ACCURACY:  0.2312

PRECISION:  0.6303979270685621

RECALL:  0.6002

F1_SCORE:  0.610896885725206


In [None]:
#LOG_REG = LogisticRegression(solver='lbfgs', max_iter=2000,random_state=1)

from sklearn.svm import SVC

SVM = SVC()

ABC_TUNING = AdaBoostClassifier(base_estimator=SVM,algorithm='SAMME')

ABC_OVS = OneVsRestClassifier(ABC_TUNING)

ABC_OVS.fit(X_TRAIN_T, Y_TRAIN_B)

# D. Clearly print Performance Metrics.

In [66]:
PREDICTIONS_ABC = ABC.predict(X_TEST_T)  # AdaBoost predictions on the TF-IDF features

print('ACCURACY: ',accuracy_score(Y_TEST_B,PREDICTIONS_ABC))
print('\nPRECISION: ',precision_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))
print('\nRECALL: ',recall_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))
print('\nF1_SCORE: ',f1_score(Y_TEST_B,PREDICTIONS_ABC,average='weighted'))

ACCURACY:  0.3888

PRECISION:  0.7087703201541039

RECALL:  0.6242

F1_SCORE:  0.6473237444826353


In [67]:
print('CLASSIFICATION REPORT:\n\n',classification_report(Y_TEST_B,PREDICTIONS_ABC))

CLASSIFICATION REPORT:

               precision    recall  f1-score   support

           0       0.42      0.26      0.32        39
           1       0.72      0.30      0.42        87
           2       0.20      0.20      0.20        10
           3       0.60      0.31      0.41       161
           4       0.36      0.12      0.19        32
           5       0.66      0.36      0.47        86
           6       0.57      0.18      0.28        65
           7       0.38      0.15      0.21        20
           8       0.31      0.18      0.23        22
           9       0.91      0.59      0.71        34
          10       0.96      0.82      0.89       130
          11       0.74      0.85      0.79       499
          12       0.62      0.25      0.36        20
          13       0.00      0.00      0.00         5
          14       0.44      0.15      0.23        26
          15       0.00      0.00      0.00         3
          16       0.00      0.00      0.00         2
  

# 5. Share insights on relative performance comparison

# A. Which vectorizer performed better? Probable reason?

    
    * WITH THE BAG OF WORDS VECTORIZER THE MODEL REACHES 47% ACCURACY; WITH TF-IDF IT DROPS TO 36%.
    
    * BAG OF WORDS IS A FLEXIBLE AND WIDELY USED VECTORIZER FOR CLASSIFICATION PROBLEMS.
    
    * EXTRACTING FEATURES FROM THE DOCUMENTS WITH IT IS SIMPLE.
    
    * HOWEVER, IT CREATES A VERY SPARSE MATRIX (MOSTLY ZERO VALUES), WHICH MAKES PROCESSING THE MODEL SLOWER.
    
    * SINCE I USED ONLY A PORTION OF THE DATA FOR TRAINING THE MODEL, THE ACCURACY IS EXPECTED TO BE LOW.
    
    * EVEN WITH THIS SMALL AMOUNT OF DATA, THE BAG OF WORDS VECTORIZER TOOK MORE TIME AND MEMORY.
    
    * THE TF-IDF VECTORIZER, IN TURN, USED FAR LESS TIME AND MEMORY. THIS MAKES IT POSSIBLE TO ADD MORE DATA
      TO TRAIN THE MODEL AND INCREASE THE ACCURACY. TUNING THE CLASSIFIER PARAMETERS AND THEN TRAINING THE
      MODEL IS ALSO MUCH EASIER WITH TF-IDF VECTORIZED DATA.
      
    * TF-IDF DOWN-WEIGHTS WORDS THAT APPEAR IN MANY DOCUMENTS, WHEREAS BAG OF WORDS COUNTS EVERY WORD EQUALLY,
      SO SENTENCES THAT SHARE COMMON WORDS CAN LOOK SIMILAR EVEN WHEN THEIR TOPICS DIFFER.
      
    * CONSIDERING THESE POINTS, TF-IDF IS THE BETTER CHOICE OVERALL, EVEN THOUGH BAG OF WORDS SCORED HIGHER ON ACCURACY.
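
To make the difference between the two weightings concrete, here is a small pure-Python sketch. It is not the notebook's code: the toy documents and all names are invented for illustration. The IDF formula is the smoothed variant ln((1 + n) / (1 + df)) + 1 that scikit-learn's TfidfVectorizer uses by default; the L2 normalisation it applies afterwards is left out here.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration only).
DOCS = [
    "the match was a great match",
    "the exam was hard",
    "the movie was great",
]

TOKENISED = [doc.split() for doc in DOCS]
VOCAB = sorted({word for doc in TOKENISED for word in doc})


def bag_of_words(tokens):
    """Raw term counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in VOCAB]


def smoothed_idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(TOKENISED)
    doc_freq = sum(1 for doc in TOKENISED if word in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


IDF = {word: smoothed_idf(word) for word in VOCAB}


def tf_idf(tokens):
    """Term counts scaled by IDF (no length normalisation)."""
    return [count * IDF[word] for count, word in zip(bag_of_words(tokens), VOCAB)]


BOW_0 = bag_of_words(TOKENISED[0])
TFIDF_0 = tf_idf(TOKENISED[0])

# Print only the words that occur in the first document.
for word, count, weight in zip(VOCAB, BOW_0, TFIDF_0):
    if count:
        print(f"{word:>6}  count={count}  tf-idf={weight:.3f}")
```

Words that occur in every document ("the", "was") keep a weight equal to their count, while the topic word "match" is boosted because it appears in only one document.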

# B. Which model outperformed? Probable reason?


    * WE USED ONLY THE ONE VS REST CLASSIFIER, SINCE WE ARE DEALING WITH A MULTI-LABEL CLASSIFICATION PROBLEM.
    
    * FOR THE BASE MODEL WE USED LOGISTIC REGRESSION, WHICH GAVE 47% ACCURACY. THIS IS LOW FOR A MODEL,
      BUT WE USED ONLY 5000 RECORDS TO TRAIN AND TEST IT, WHICH IS VERY FEW COMPARED TO 8.8 LAKH RECORDS.
      
    * IN THE CURRENT CONTEXT, THE BASE MODEL WITH LOGISTIC REGRESSION OUTPERFORMED THE OTHER MODELS ON ACCURACY.
    
    * THE PROBLEM WITH THE LOGISTIC REGRESSION BASE MODEL IS ITS HIGHER TIME AND MEMORY CONSUMPTION, EVEN WITH
      TF-IDF VECTORIZED DATA.
      
    * SINCE WE ARE USING BINARY DATA FOR CLASSIFICATION, WE BUILT ANOTHER MODEL WITH ADA BOOST AS THE ESTIMATOR,
      WHICH GAVE 38.8% ACCURACY WITH TF-IDF VECTORIZED DATA AND CONSUMED FAR LESS TIME AND MEMORY.
      
    * THE MODELS WITH TUNED HYPERPARAMETERS (RANDOM FOREST CLASSIFIER AND ADA BOOST) GAVE LOWER
      ACCURACY THAN THE BASE MODEL WITH THE ADA BOOST CLASSIFIER.
      
    * CONSIDERING ALL THESE POINTS, THE ONE VS REST CLASSIFIER WITH THE ADA BOOST ESTIMATOR OUTPERFORMED THE OTHER
      MODELS ON TIME AND MEMORY, USING FAR LESS OF BOTH THAN ANY OTHER MODEL.

# C. Which parameter/hyperparameter significantly helped to improve performance?Probable reason?

    * TUNING THE PARAMETERS GAVE LOWER ACCURACY THAN USING THE MODELS AS THEY ARE.
    
    * FOR THE RFC ESTIMATOR, ACCURACY STARTED LOW AND BEGAN TO IMPROVE WHEN WE REDUCED N_ESTIMATORS
      AND INCREASED THE DEPTH. A FURTHER INCREASE IN DEPTH COULD FIT MORE DATA
      IN CASE MORE RECORDS ARE USED FOR TRAINING THE MODEL.
      
    * SINCE WE USED LITTLE DATA TO TRAIN AND TEST THE MODEL, ITS MEASURED PERFORMANCE IS VERY LOW.

# D. According to you, which performance metric should be given most importance, why?

    * PRECISION AND RECALL ARE THE TWO PERFORMANCE METRICS I GIVE THE MOST IMPORTANCE TO.
    
    * WHILE BUILDING A CLASSIFICATION MODEL, THE MOST IMPORTANT ASPECT IS TO REDUCE FALSE POSITIVES AND
      FALSE NEGATIVES, SO THAT THE MODEL PREDICTS MORE TRUE POSITIVES.
      
    * ACCURACY GIVES ONLY AN OVERALL PICTURE OF HOW THE MODEL CLASSIFIES THE TEST RECORDS, WHICH MAY NOT BE IDEAL.
    
    * THE CLOSER PRECISION AND RECALL ARE TO 1, THE MORE OF THE MODEL'S POSITIVE PREDICTIONS ARE CORRECT
      AND THE MORE OF THE ACTUAL POSITIVES IT FINDS.
      
    * EVEN THOUGH THE MODEL ACCURACY IS VERY LOW, PRECISION IS ABOVE 70% IN MOST
      OF THE CASES AND RECALL IS ABOVE 60%. BOTH CAN BE INCREASED FURTHER IF WE FIT THE MODEL ON MORE DATA,
      SO THAT MORE TEST RECORDS ARE CLASSIFIED CORRECTLY.
      
    * HENCE PRECISION AND RECALL ARE THE TWO PERFORMANCE METRICS THAT I LOOK OUT FOR WHILE BUILDING THE MODEL.
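
The gap between accuracy and precision/recall in the results has a simple cause: for multi-label targets, scikit-learn's accuracy_score reports subset accuracy, so a sample counts as correct only when every one of its labels matches. The sketch below reproduces that effect in plain Python on invented toy labels. It uses micro-averaging for brevity, whereas the notebook uses average='weighted'; the function names are illustrative only.

```python
# Toy multi-label indicator rows (invented): one row per sample, one column per label.
Y_TRUE = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
Y_PRED = [
    [1, 0, 0],  # one label missed
    [0, 1, 0],  # exact match
    [1, 1, 1],  # one extra label
    [0, 0, 1],  # exact match
]


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label row is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def micro_precision_recall(y_true, y_pred):
    """Precision and recall computed over all individual label decisions."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)


ACC = subset_accuracy(Y_TRUE, Y_PRED)
PREC, REC = micro_precision_recall(Y_TRUE, Y_PRED)

print("SUBSET ACCURACY:", ACC)
print("MICRO PRECISION:", round(PREC, 3))
print("MICRO RECALL:", round(REC, 3))
```

Two of the four rows contain a single wrong label, which halves the subset accuracy even though five of the six predicted labels are correct.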

# DOMAIN: Customer support

# CONTEXT: Great Learning has an academic support department which receives numerous support requests every day throughout the year. Teams are spread across geographies and try to provide support round the year. Sometimes, due to heavy workload, certain request resolutions are delayed, impacting the company’s business. Some of the requests are very generic, where a proper resolution procedure delivered to the user can solve the problem. The company is looking to design an automation which can interact with the user, understand the problem and display the resolution procedure [ if found as a generic request ] or redirect the request to an actual human support executive if the request is complex or not in its database.

# DATA DESCRIPTION: A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.

# PROJECT OBJECTIVE: Design a python based interactive semi - rule based chatbot which can do the following:

# 1. Start chat session with greetings and ask what the user is looking for.

# 2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus.

# 3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it.

In [68]:
import io
import random
import string # to process standard python strings
import warnings

import nltk
import numpy as np
import pandas as pd  # needed for pd.read_json below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')

In [92]:
DATA = pd.read_json('GL+Bot.json')

In [93]:
DATA.head()

Unnamed: 0,intents
0,"{'tag': 'Intro', 'patterns': ['hi', 'how are y..."
1,"{'tag': 'Exit', 'patterns': ['thank you', 'tha..."
2,"{'tag': 'Olympus', 'patterns': ['olympus', 'ex..."
3,"{'tag': 'SL', 'patterns': ['i am not able to u..."
4,"{'tag': 'NN', 'patterns': ['what is deep learn..."


In [94]:
DATA.tail()

Unnamed: 0,intents
3,"{'tag': 'SL', 'patterns': ['i am not able to u..."
4,"{'tag': 'NN', 'patterns': ['what is deep learn..."
5,"{'tag': 'Bot', 'patterns': ['what is your name..."
6,"{'tag': 'Profane', 'patterns': ['what the hell..."
7,"{'tag': 'Ticket', 'patterns': ['my problem is ..."


In [95]:
DATA.shape

(8, 1)

In [97]:
DATA.dtypes

intents    object
dtype: object

In [74]:
nltk.download('popular', quiet=True)
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\thril\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# CONVERT THE DATA INTO LIST OF WORDS

# ADD PATTERNS AND ASSOCIATED TAGS

In [98]:
WORDS = []
LABELS = []
PATTERNS = []
TAGS = []

In [99]:
for intent in DATA['intents']:
    for PATTERN in intent['patterns']:
        WORD = nltk.word_tokenize(PATTERN)
        WORDS.extend(WORD)
        PATTERNS.append(WORD)
        TAGS.append(intent["tag"])
        
    if intent['tag'] not in LABELS:
        LABELS.append(intent['tag'])

In [100]:
np.unique(TAGS)

array(['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket'],
      dtype='<U7')

In [101]:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()

WORDS = [stemmer.stem(w.lower()) for w in WORDS if w != "?"]
WORDS = sorted(list(set(WORDS)))

LABELS = sorted(LABELS)

In [102]:
# ML ALGORITHMS NEED NUMERIC INPUT, SO ENCODE EACH PATTERN AS A BAG OF WORDS AND EACH TAG AS A ONE-HOT VECTOR

TRAINING_DATA = []
RESPONSES = []

out_empty = [0 for _ in range(len(LABELS))]

for x, doc in enumerate(PATTERNS):
    BAG = []

    WORD = [stemmer.stem(w.lower()) for w in doc]

    for w in WORDS:
        if w in WORD:
            BAG.append(1)
        else:
            BAG.append(0)

    RESPONSE = out_empty[:]
    RESPONSE[LABELS.index(TAGS[x])] = 1

    TRAINING_DATA.append(BAG)
    RESPONSES.append(RESPONSE)

In [103]:
TRAINING_DATA = np.array(TRAINING_DATA)

RESPONSES = np.array(RESPONSES)
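
As a compact reference for what the encoding loop above produces, here is a minimal pure-Python sketch. The toy vocabulary, the toy labels and the helper names encode_pattern and encode_label are hypothetical, and stemming is skipped here (the notebook applies LancasterStemmer first).

```python
# Hypothetical, already-normalised vocabulary and labels (sorted, as in the notebook).
TOY_WORDS = sorted(["hello", "help", "olympus", "thank", "ticket"])
TOY_LABELS = sorted(["Exit", "Intro", "Ticket"])


def encode_pattern(tokens, vocabulary):
    """Binary bag of words: 1 if the vocabulary word occurs in the pattern."""
    present = {token.lower() for token in tokens}
    return [1 if word in present else 0 for word in vocabulary]


def encode_label(tag, labels):
    """One-hot vector with a single 1 at the position of the tag."""
    return [1 if label == tag else 0 for label in labels]


TOY_BAG = encode_pattern(["Hello", "need", "help"], TOY_WORDS)
TOY_RESPONSE = encode_label("Intro", TOY_LABELS)

print(TOY_BAG)       # one entry per vocabulary word
print(TOY_RESPONSE)  # one entry per label
```

Words outside the vocabulary ("need" in this example) are simply ignored, which is also how unseen user input is handled at prediction time.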

# DESIGN, TRAIN AND SAVE THE MODEL

    * NOW THAT WE HAVE THE INPUTS AND RESPONSES READY, WE WILL DESIGN THE MODEL AND TRAIN IT.
    * SAVE THE MODEL FOR FUTURE USAGE

In [104]:
import tflearn
import tensorflow
import random

Instructions for updating:
non-resource variables are not supported in the long term


In [105]:
tensorflow.compat.v1.reset_default_graph()

TRAIN_MODEL = tflearn.input_data(shape=[None, len(TRAINING_DATA[0])])
TRAIN_MODEL = tflearn.fully_connected(TRAIN_MODEL, 8)
TRAIN_MODEL = tflearn.fully_connected(TRAIN_MODEL, 8)
TRAIN_MODEL = tflearn.fully_connected(TRAIN_MODEL, len(RESPONSES[0]), activation="softmax")
TRAIN_MODEL = tflearn.regression(TRAIN_MODEL)

CHATBOT_MODEL = tflearn.DNN(TRAIN_MODEL)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


In [106]:
CHATBOT_MODEL.fit(TRAINING_DATA, RESPONSES, n_epoch = 1000, batch_size = 8, show_metric = True)

Training Step: 15999  | total loss: 0.00046 | time: 0.091s
| Adam | epoch: 1000 | loss: 0.00046 - acc: 1.0000 -- iter: 120/128
Training Step: 16000  | total loss: 0.00048 | time: 0.107s
| Adam | epoch: 1000 | loss: 0.00048 - acc: 1.0000 -- iter: 128/128
--


In [107]:
CHATBOT_MODEL.save("CHATBOT_MODEL.tflearn")

INFO:tensorflow:C:\Users\thril\CHATBOT_MODEL.tflearn is not in all_model_checkpoint_paths. Manually adding it.
INFO:tensorflow:C:\Users\thril\CHATBOT_MODEL.tflearn.data-00000-of-00001
INFO:tensorflow:0
INFO:tensorflow:C:\Users\thril\CHATBOT_MODEL.tflearn.index
INFO:tensorflow:0
INFO:tensorflow:C:\Users\thril\CHATBOT_MODEL.tflearn.meta
INFO:tensorflow:100


In [108]:
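def INPUT_WORDS(INPUT, WORDS):
    # Encode the user input as a binary bag-of-words vector over the training vocabulary
    BAG = [0 for _ in range(len(WORDS))]

    # Tokenise and stem the input the same way the training patterns were processed
    # (TOKENS avoids shadowing the function name INPUT_WORDS)
    TOKENS = nltk.word_tokenize(INPUT)
    TOKENS = [stemmer.stem(word.lower()) for word in TOKENS]

    for se in TOKENS:
        for i, w in enumerate(WORDS):
            if w == se:
                BAG[i] = 1
            
    return np.array(BAG)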

In [109]:
def CHATBOT_FUNC():
    print("Welcome, Start talking with the bot (type quit to stop)!")
    while True:
        INPUT = input("You: ")
        if INPUT.lower() == "quit":
            break

        # Predict a probability per tag and keep the most likely one
        RESULTS = CHATBOT_MODEL.predict([INPUT_WORDS(INPUT, WORDS)])
        RESULTS_INDEX = np.argmax(RESULTS)
        TG = LABELS[RESULTS_INDEX]

        # Look up the canned replies for the predicted tag
        # (REPLIES avoids confusion with the global RESPONSES training array)
        for TGS in DATA["intents"]:
            if TGS['tag'] == TG:
                REPLIES = TGS['responses']

        print(random.choice(REPLIES))
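
The loop above always answers with the highest-scoring tag, even when the model is unsure. One possible refinement, sketched here in plain Python and not part of the original notebook, is to fall back to a hand-over tag when the top probability is low. The helper name pick_tag, the 0.6 threshold and the choice of 'Ticket' as the fallback tag are all assumptions for illustration; the probability rows are invented.

```python
def pick_tag(probabilities, labels, threshold=0.6, fallback_tag="Ticket"):
    """Return the most likely tag, or fallback_tag if the model is not confident.

    threshold and fallback_tag are illustrative assumptions, not tuned values.
    """
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best_index] < threshold:
        return fallback_tag
    return labels[best_index]


# Tag names as listed by np.unique(TAGS) in the notebook.
DEMO_LABELS = ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']

# Invented probability rows: one clear winner, one spread-out prediction.
CONFIDENT = [0.01, 0.02, 0.90, 0.01, 0.02, 0.01, 0.02, 0.01]
UNSURE = [0.10, 0.20, 0.25, 0.10, 0.10, 0.10, 0.10, 0.05]

CONFIDENT_TAG = pick_tag(CONFIDENT, DEMO_LABELS)
UNSURE_TAG = pick_tag(UNSURE, DEMO_LABELS)

print(CONFIDENT_TAG)
print(UNSURE_TAG)
```

Inside CHATBOT_FUNC this could be called as pick_tag(list(RESULTS[0]), LABELS) in place of the np.argmax lookup, assuming RESULTS holds one probability row per input. A suitable threshold would need to be checked against real conversations.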

In [110]:
CHATBOT_FUNC()

Welcome, Start talking with the bot (type quit to stop)!
You: hi
Hello! how can i help you ?
You: GREAT LEARNING
Hello! how can i help you ?
You: need to learn about COMPuter ViSiOn
Link: Neural Nets wiki
You: MAChinE LEarNiNg
Link: Machine Learning wiki 
You: need more help
I hope I was able to assist you, Good Bye
You: not satisfied with the help
Tarnsferring the request to your PM
You: thanks. take me to olypus
I hope I was able to assist you, Good Bye
You: thanks take me to olympus
Tarnsferring the request to your PM
You: what the hell
Please use respectful words
You: you are not smart
I am your virtual learning assistant
You: i know it
Hello! how can i help you ?
You: you are not helping me
I am your virtual learning assistant
You: i hate you
Please use respectful words
You: i never disrespected you
I hope I was able to assist you, Good Bye
You: no you are not able to assist
Tarnsferring the request to your PM
You: I think I have to add more data to your file to make you more inte

                                        *** END OF NLP PROJECT 1 ***
                                                *** THANKS ***