<a> **DOMAIN**: Digital content management

<a> **CONTEXT**: Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc.are written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a Multi label classification problem.

<a> **DATA DESCRIPTION**: Over 600,000 posts from more than 19 thousand bloggers The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
    
• 8240 "10s" blogs (ages 13-17),
    
• 8086 "20s" blogs(ages 23-27) and    
• 2994 "30s" blogs (ages 33-47)
    
• For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label url link.

<a> **PROJECT OBJECTIVE**: To build a NLP classifier which can use input text parameters to determine the label/s of the blog. Specific to this case study, you can consider the text of the blog: ‘text’ feature as independent variable and ‘topic’ as dependent variable.

#### 1. Read and Analyse Dataset.

In [2]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [3]:
extract_dir = os.getcwd()

In [4]:
extract_dir

'C:\\Users\\sanja\\OneDrive\\Desktop\\PGP AIML\\AI - Topics\\4. Natural Language Processing\\Project\\NLP- Project 1'

In [25]:
filename = 'blogs.zip'

In [11]:
import shutil
shutil.unpack_archive(filename, extract_dir)

In [5]:
# read the blogs.csv file

data = pd.read_csv('blogtext.csv')

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


In [7]:
data.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [8]:
data.shape

(681284, 7)

##### There are 681284 records and it's huge to perform data analysis and computational task on such huge volumes of data. Hence, we are going take subset of the data and re-run with entire dataset once all errors are fixed and optimization is done. We will use 50k records only

#### Clean the Structure data

In [9]:
blog_df = pd.read_csv('blogtext.csv', nrows = 50000)

In [10]:
blog_df.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

No Missing values in the blog dataset

#### Eliminate Non-English textual data

In [18]:
! pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py): started
  Building wheel for langdetect (setup.py): finished with status 'done'
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=b7df0929b53db2c27a9a8d62226ad42e86198ca6eedfa49388b3b8a241c9f36e
  Stored in directory: c:\users\sanja\appdata\local\pip\cache\wheels\13\c7\b0\79f66658626032e78fc1a83103690ef6797d551cb22e56e734
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [11]:
from langdetect import detect

def detect_english(text):
    try:
        return detect(text) == 'en'
    except:
        return False

In [12]:
blog_df = blog_df[blog_df['text'].apply(detect_english)]

In [13]:
blog_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47726 entries, 0 to 49998
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      47726 non-null  int64 
 1   gender  47726 non-null  object
 2   age     47726 non-null  int64 
 3   topic   47726 non-null  object
 4   sign    47726 non-null  object
 5   date    47726 non-null  object
 6   text    47726 non-null  object
dtypes: int64(2), object(5)
memory usage: 2.9+ MB


#### 2. Preprocess unstructured data to make it consumable for model training.

2A. Eliminate All special Characters and Numbers

In [14]:
blog_df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...


In [15]:
# Defining a function to eliminate all special characters and numbers

import re

blog_df.text = blog_df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

2B. Lowercase all textual data 

In [16]:
blog_df.text = blog_df.text.apply(lambda x: x.lower())

2D. Remove all extra white spaces

In [17]:
blog_df.text = blog_df.text.apply(lambda x: x.strip())

2C. Remove all Stopwords

In [18]:
from nltk.corpus import stopwords

In [19]:
stopwords=set(stopwords.words('english'))
blog_df.text = blog_df.text.apply(lambda t: ' '.join([words for words in t.split() if words not in stopwords]) )

In [20]:
blog_df.sample(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
30389,3458177,female,24,Banking,Aries,"08,August,2004",upon time tkgs actually group girls became fri...
44329,3682212,female,27,indUnk,Leo,"01,July,2004",tax refund came today hooray yes yes know thou...
45283,2383253,female,16,Student,Virgo,"11,April,2004",yes survived spring break remember
2616,589736,male,35,Technology,Aries,"05,August,2004",agree points disagree others thanks dialogue
13949,480727,male,23,indUnk,Pisces,"02,June,2004",gotta lax oneself juz erm gonna call issit p g...


#### Build a base Classification model

In [21]:
# we will drop Id and Date columns as they are not useful for model building

blog_df.drop(labels =['id', 'date'], axis =1, inplace =True)

As we want to make this into a multi-label classification problem, we need to merge all the label columns together, so that we have all the labels together for a particular sentence

In [22]:
# Create multi-label column

blog_df['labels'] = blog_df.apply(lambda col: [col['gender'], col['age'], col['topic'],col['sign']], axis =1)

In [23]:
blog_df.head()

Unnamed: 0,gender,age,topic,sign,text,labels
0,male,15,Student,Leo,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
5,male,33,InvestmentBanking,Aquarius,interesting conversation dad morning talking k...,"[male, 33, InvestmentBanking, Aquarius]"


In [24]:
#drop gender,age,topic & sign as they are already merged to labels column
blog_df.drop(columns=['gender','age','topic','sign'], axis=1, inplace=True)

In [25]:
blog_df.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
5,interesting conversation dad morning talking k...,"[male, 33, InvestmentBanking, Aquarius]"


#### Separate features and labels, and split the data into training and testing

In [26]:
X= blog_df.text
y = blog_df.labels

In [27]:
# split X & y into Train and Test datasets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size = 0.2)

In [28]:
print(X_train.shape)
print(y_train.shape)

(38180,)
(38180,)


In [29]:
print(X_test.shape)
print(y_test.shape)

(9546,)
(9546,)


In [30]:
X_test

45112    aah summer finally time straw hair one day gre...
27710    keep degrading warzone going agree mention peo...
5340     love fireflies lightning bugs glimpses light d...
31110    second day thingy totally forgot username log ...
42089    discovered joys urllink cky song episode jacka...
                               ...                        
987      got first round interview mars inc got email b...
8044     urllink urllink pants originally uploaded urll...
34374    well king paul back throne sort got minority g...
36137    watch late night shows blind date elimidate no...
23848    lady save twenty year olds sprout works days w...
Name: text, Length: 9546, dtype: object

#### Vectorize the features

a. Create a Bag of Words using count vectorizer

i. Use ngram_range=(1, 2)

ii. Vectorize training and testing features

b. Print the term-document matrix

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(ngram_range=(1,2))

In [33]:
# Vectorize Train data

cvect.fit(X_train)

#Check the vocablury size
len(cvect.vocabulary_)

2395722

In [34]:
X_train_ct = cvect.transform(X_train)

In [35]:
X_train_ct

<38180x2395722 sparse matrix of type '<class 'numpy.int64'>'
	with 6841850 stored elements in Compressed Sparse Row format>

In [36]:
X_train_ct[0]

<1x2395722 sparse matrix of type '<class 'numpy.int64'>'
	with 216 stored elements in Compressed Sparse Row format>

In [37]:
X_test_ct = cvect.transform(X_test)

In [38]:
X_test_ct

<9546x2395722 sparse matrix of type '<class 'numpy.int64'>'
	with 1202555 stored elements in Compressed Sparse Row format>

In [39]:
cvect.get_feature_names()[:10]

['aa',
 'aa aa',
 'aa advert',
 'aa amazing',
 'aa anger',
 'aa batteries',
 'aa class',
 'aa damn',
 'aa ended',
 'aa eriol']

In [40]:
print(X_train_ct)

  (0, 52442)	1
  (0, 53466)	1
  (0, 56144)	2
  (0, 56263)	1
  (0, 58940)	1
  (0, 79043)	3
  (0, 79882)	1
  (0, 79893)	1
  (0, 80334)	1
  (0, 110337)	1
  (0, 110459)	1
  (0, 138212)	1
  (0, 139332)	1
  (0, 145086)	2
  (0, 148550)	1
  (0, 149122)	1
  (0, 164233)	1
  (0, 164289)	1
  (0, 241569)	3
  (0, 241683)	1
  (0, 241751)	1
  (0, 242078)	1
  (0, 293363)	1
  (0, 293502)	1
  (0, 382657)	1
  :	:
  (38179, 2119246)	1
  (38179, 2119760)	1
  (38179, 2127746)	1
  (38179, 2128565)	1
  (38179, 2144224)	1
  (38179, 2145413)	1
  (38179, 2158687)	4
  (38179, 2158706)	1
  (38179, 2158714)	1
  (38179, 2159018)	2
  (38179, 2161018)	1
  (38179, 2161022)	1
  (38179, 2172157)	1
  (38179, 2173131)	1
  (38179, 2187523)	1
  (38179, 2188171)	1
  (38179, 2229888)	1
  (38179, 2230912)	1
  (38179, 2264290)	2
  (38179, 2264650)	1
  (38179, 2264762)	1
  (38179, 2306195)	1
  (38179, 2308557)	1
  (38179, 2334453)	1
  (38179, 2334559)	1


In [41]:
print(X_test_ct)

  (0, 382)	1
  (0, 4131)	1
  (0, 4155)	1
  (0, 50369)	1
  (0, 56144)	2
  (0, 57202)	1
  (0, 59110)	1
  (0, 59885)	1
  (0, 60082)	1
  (0, 68406)	1
  (0, 70307)	1
  (0, 70326)	1
  (0, 75759)	1
  (0, 124836)	1
  (0, 125272)	1
  (0, 140611)	1
  (0, 141187)	1
  (0, 141776)	1
  (0, 170078)	2
  (0, 170156)	1
  (0, 170210)	3
  (0, 170294)	1
  (0, 170726)	1
  (0, 205015)	1
  (0, 205419)	1
  :	:
  (9545, 2294843)	1
  (9545, 2295914)	1
  (9545, 2297459)	1
  (9545, 2297631)	1
  (9545, 2301325)	1
  (9545, 2301758)	1
  (9545, 2311321)	1
  (9545, 2311768)	1
  (9545, 2317581)	1
  (9545, 2318201)	1
  (9545, 2348578)	2
  (9545, 2349516)	1
  (9545, 2350211)	1
  (9545, 2350370)	1
  (9545, 2354158)	1
  (9545, 2354317)	1
  (9545, 2357505)	1
  (9545, 2361025)	1
  (9545, 2368358)	1
  (9545, 2375918)	1
  (9545, 2376912)	1
  (9545, 2377409)	1
  (9545, 2378691)	1
  (9545, 2382054)	1
  (9545, 2382215)	1


Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label.

In [43]:
label_counts =dict()

for labels in blog_df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[str(label)]+=1
        else:
            label_counts[str(label)] =1
    

In [44]:
label_counts


{'male': 24519,
 '15': 1,
 'Student': 10204,
 'Leo': 3666,
 '33': 1,
 'InvestmentBanking': 83,
 'Aquarius': 4627,
 'female': 23207,
 '14': 1,
 'indUnk': 16891,
 'Aries': 7280,
 '25': 1,
 'Capricorn': 3663,
 '17': 1,
 'Gemini': 2400,
 '23': 1,
 'Non-Profit': 470,
 'Cancer': 4400,
 'Banking': 279,
 '37': 1,
 'Sagittarius': 4417,
 '26': 1,
 '24': 1,
 'Scorpio': 3100,
 '27': 1,
 'Education': 2523,
 '45': 1,
 'Engineering': 1341,
 'Libra': 4186,
 'Science': 648,
 '34': 1,
 '41': 1,
 'Communications-Media': 1434,
 'BusinessServices': 399,
 'Sports-Recreation': 118,
 'Virgo': 2758,
 'Taurus': 3235,
 'Arts': 1786,
 'Pisces': 3994,
 '44': 1,
 '16': 1,
 'Internet': 1341,
 'Museums-Libraries': 277,
 'Accounting': 349,
 '39': 1,
 '35': 1,
 'Technology': 4101,
 '36': 1,
 'Law': 290,
 '46': 1,
 'Consulting': 193,
 'Automotive': 116,
 '42': 1,
 'Religion': 250,
 '13': 1,
 'Fashion': 1749,
 '38': 1,
 '43': 1,
 'Publishing': 203,
 '40': 1,
 'Marketing': 395,
 'LawEnforcement-Security': 122,
 'HumanReso

#### Transform the labels

As we have noticed before, in this task each example can have multiple tags. To deal with such kind of prediction, we need to transform labels in a binary form and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn a. Convert your train and test labels using MultiLabelBinarizer

In [45]:
from sklearn.preprocessing import MultiLabelBinarizer
binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

In [46]:
y_train = binarizer.fit_transform(y_train)

In [47]:
y_test = binarizer.transform(y_test)

In [48]:
y_train

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 1]])

In [49]:
y_test

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])

#### Choose a classifier

In this task, we will use OneVsRestClassifier class. In this approach k classifiers (= number of tags) are trained. As a basic classifier, use LogisticRegression . It is one of the simplest methods, but often it performs good enough in text classification tasks. It might take some time because the number of classifiers to train is large

#### Fit the classifier, make predictions and get the accuracy

 Print the following
 
 i. Accuracy score
 
 ii. F1 score
 
 iii. Average precision score
 
 iv. Average recall score 

In [50]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [51]:
model = LogisticRegression(solver ='lbfgs', max_iter =50)
model = OneVsRestClassifier(model)
model.fit(X_train_ct,y_train)

OneVsRestClassifier(estimator=LogisticRegression(max_iter=50))

In [53]:
Ypred=model.predict(X_test_ct)

In [54]:
Ypred

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])

In [55]:
y_test

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])

#### Model Evaluations

#### Micro-average Method:

This method is used to sum up the individual true positives, false positives, and false negatives of the system for different sets and the apply them to get the statistics.

#### Macro-average Method

The method is straight forward. Just take the average of the precision and recall of the system on different sets

In [56]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

def display_metrics_micro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Micro', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: Micro', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: Micro', recall_score(Ytest, Ypred, average='micro'))
    
    
def display_metrics_macro(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: Macro', f1_score(Ytest, Ypred, average='macro'))
    print('Average recall score: MAcro', recall_score(Ytest, Ypred, average='macro'))
    
def display_metrics_weighted(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: weighted', f1_score(Ytest, Ypred, average='weighted'))
    print('Average precision score: weighted', average_precision_score(Ytest, Ypred, average='weighted'))
    print('Average recall score: weighted', recall_score(Ytest, Ypred, average='weighted'))

In [57]:
display_metrics_micro(y_test,Ypred)

Accuracy score:  0.15273412947831552
F1 score: Micro 0.5266322825357911
Average precision score: Micro 0.3326194649621051
Average recall score: Micro 0.3975487115021999


In [58]:
display_metrics_macro(y_test,Ypred)

Accuracy score:  0.15273412947831552
F1 score: Macro 0.16217090553148034
Average recall score: MAcro 0.11144507658216643


In [59]:
display_metrics_weighted(y_test,Ypred)

Accuracy score:  0.15273412947831552
F1 score: weighted 0.4871625393828181
Average precision score: weighted 0.4267776369972152
Average recall score: weighted 0.3975487115021999


Print true label and predicted label for any five examples

In [60]:
preds = Ypred[:15]
actuals = y_test[:15]

In [61]:
five_actual = binarizer.inverse_transform(actuals)
five_actual

[('Gemini', 'Student', 'female'),
 ('Sagittarius', 'indUnk', 'male'),
 ('Scorpio', 'female', 'indUnk'),
 ('Cancer', 'Fashion', 'female'),
 ('Libra', 'female', 'indUnk'),
 ('Libra', 'female', 'indUnk'),
 ('Pisces', 'Technology', 'male'),
 ('Capricorn', 'Student', 'female'),
 ('Education', 'Pisces', 'male'),
 ('Aries', 'Technology', 'male'),
 ('Capricorn', 'Student', 'female'),
 ('Law', 'Taurus', 'male'),
 ('Capricorn', 'Tourism', 'male'),
 ('Pisces', 'Student', 'female'),
 ('Communications-Media', 'Leo', 'male')]

In [62]:
five_pred = binarizer.inverse_transform(preds)
five_pred

[('Student', 'female'),
 ('Sagittarius', 'indUnk', 'male'),
 ('female',),
 ('male',),
 ('female', 'indUnk'),
 ('female', 'indUnk'),
 ('Pisces', 'Technology', 'male'),
 (),
 ('Education', 'Pisces', 'male'),
 (),
 ('Student', 'female'),
 ('Taurus', 'male'),
 ('indUnk', 'male'),
 ('female',),
 ('Aquarius', 'female')]

In [63]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def build_model_train(X_train, y_train, X_valid=None, y_valid=None, C=1.0, model='lr'):
    if model=='lr':
        model = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='svm':
        model = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
    
    elif model=='nbayes':
        model = MultinomialNB(alpha=1.0)
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)
        
    elif model=='lda':
        model = LinearDiscriminantAnalysis(solver='svd')
        model = OneVsRestClassifier(model)
        model.fit(X_train, y_train)

    return model

In [64]:
models = ['lr','svm','nbayes']
for model in models:
    model = build_model_train(X_train_ct,y_train,model=model)
    model.fit(X_train_ct,y_train)
    Ypred=model.predict(X_test_ct)
    print("\n")
    print(f"**displaying  metrics for the mode {model}\n")
    display_metrics_micro(y_test,Ypred)
    print("\n")
    print("\n")
    display_metrics_macro(y_test,Ypred)
    print("\n")
    print("\n")
    display_metrics_weighted(y_test,Ypred)
    print("\n")
    print("\n")



**displaying  metrics for the mode OneVsRestClassifier(estimator=LogisticRegression(penalty='l1',
                                                 solver='liblinear'))

Accuracy score:  0.18856065367693275
F1 score: Micro 0.5460306284104912
Average precision score: Micro 0.3410631906272306
Average recall score: Micro 0.4332704797821077




Accuracy score:  0.18856065367693275
F1 score: Macro 0.22652987777219838
Average recall score: MAcro 0.1684097058474611




Accuracy score:  0.18856065367693275
F1 score: weighted 0.526099334034129
Average precision score: weighted 0.4377674446152397
Average recall score: weighted 0.4332704797821077






**displaying  metrics for the mode OneVsRestClassifier(estimator=LinearSVC(dual=False, penalty='l1'))

Accuracy score:  0.1689712968782736
F1 score: Micro 0.5146608315098468
Average precision score: Micro 0.31781030628526574
Average recall score: Micro 0.39011104127383195




Accuracy score:  0.1689712968782736
F1 score: Macro 0.2365769122873358
A

#### 4. Improve Performance of model.

4A.Experiment with other vectorisers

Now let's use TF-IDF vectorizer to see if that helps improve the model performance

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [66]:
def tfidf_vector(data):
    tfidf_vectorizer = TfidfVectorizer()
    vect = tfidf_vectorizer.fit_transform(data)
    return vect, tfidf_vectorizer

In [67]:
X_train_tfidf, tfidf_vectorizer = tfidf_vector(X_train)

In [69]:
X_train_tfidf

<38180x111502 sparse matrix of type '<class 'numpy.float64'>'
	with 3011265 stored elements in Compressed Sparse Row format>

In [70]:
X_train_tfidf[0]

<1x111502 sparse matrix of type '<class 'numpy.float64'>'
	with 91 stored elements in Compressed Sparse Row format>

In [71]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)

4B. Build classifier Models using other algorithms than base model.

In [76]:
# from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.naive_bayes import MultinomialNB

In [77]:
model = RandomForestClassifier(n_estimators =100, criterion ='gini', max_depth =None)
model = OneVsRestClassifier(model)
model.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=RandomForestClassifier())

In [78]:
Ypred=model.predict(X_test_tfidf)
print("\n")
print(f"**displaying  metrics for the mode {model}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")



**displaying  metrics for the mode OneVsRestClassifier(estimator=RandomForestClassifier())

Accuracy score:  0.11030798240100566
F1 score: Micro 0.47564290697394696
Average precision score: Micro 0.28847121000428444
Average recall score: Micro 0.34520567078706615




Accuracy score:  0.11030798240100566
F1 score: Macro 0.12267122139174329
Average recall score: MAcro 0.08229713217562458




Accuracy score:  0.11030798240100566
F1 score: weighted 0.4222863709442582
Average precision score: weighted 0.3965503399453171
Average recall score: weighted 0.34520567078706615






4C.Tune Parameters/Hyperparameters of the model/s.

In [79]:
# Let's tune the parameters of the model to see if this helps to improve the model performance

model = RandomForestClassifier(n_estimators =10, criterion ='entropy', max_depth =None)
model = OneVsRestClassifier(model)
model.fit(X_train_tfidf, y_train)


OneVsRestClassifier(estimator=RandomForestClassifier(criterion='entropy',
                                                     n_estimators=10))

4D. Clearly print Performance Metrics

In [80]:
Ypred=model.predict(X_test_tfidf)
print("\n")
print(f"**displaying  metrics for the mode {model}\n")
display_metrics_micro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_macro(y_test,Ypred)
print("\n")
print("\n")
display_metrics_weighted(y_test,Ypred)
print("\n")
print("\n")



**displaying  metrics for the mode OneVsRestClassifier(estimator=RandomForestClassifier(criterion='entropy',
                                                     n_estimators=10))

Accuracy score:  0.04923528179342133
F1 score: Micro 0.38891143349903623
Average precision score: Micro 0.21769386649079842
Average recall score: Micro 0.2677212095816747




Accuracy score:  0.04923528179342133
F1 score: Macro 0.07403803364656894
Average recall score: MAcro 0.04837467547351254




Accuracy score:  0.04923528179342133
F1 score: weighted 0.33620644171651504
Average precision score: weighted 0.3410020734821616
Average recall score: weighted 0.2677212095816747






#### 5.Share insights on relative performance comparison

5A. Which vectorizer performed better? Probable reason?.


I think TF-IDF is better than Count Vectorizers because it not only focuses on the frequency of words present in the corpus but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.The only difference is that the TfidfVectorizer() returns floats while the CountVectorizer() returns ints.

5B. Which model outperformed? Probable reason?

Logistics Regression model performed best amongst all models. As evidenced above, LR model had the best accoracy and recall scores

5C. Which parameter/hyperparameter significantly helped
to improve performance?Probable reason?.

In the above example for RandomForest Classifier, the parameter "n_estimators" was key to help improve the performance of the model. This is the number of trees we want to build before taking the maximum voting or averages of predictions. Higher number of trees gives us better performance but makes our code slower. We should choose as high value as our processor can handle because this makes our predictions stronger and more stable.

5D. According to you, which performance metric should be
given most importance, why?.

Although there are many ways for measuring classification performance,but the key classification metrics are Accuracy, Recall, Precision and F1-score. And out of these, I think F1 score is the most important metric as it combines precision and recall into one metric. This is the harmonic mean of precision and recall, and is probably the most used metric for evaluating binary classification models. If our F1 score increases, it means that our model has increased performance for accuracy, recall or both

#### <a> PART B

<a> **DOMAIN**: Customer support

<a> **CONTEXT**: Great Learning has a an academic support department which receives numerous support requests every day throughout the year.Teams are spread across geographies and try to provide support round the year. Sometimes there are circumstances where due to heavy workload certain request resolutions are delayed, impacting company’s business. Some of the requests are very generic where a proper resolution procedure delivered to the user can solve the problem. Company is looking forward to design an automation which can interact with the user, understand the problem and display the resolution procedure [ if found as a generic request ] or redirect the request to an actual human support executive if the request is complex or not in it’s database.

<a> **DATA DESCRIPTION**: A sample corpus is attached for your reference. Please enhance/add more data to the corpus using your linguistics skills.

<a> **PROJECT OBJECTIVE**: Design a python based interactive semi - rule based chatbot which can do the following:
1. Start chat session with greetings and ask what the user is looking for. [5 Marks]
2. Accept dynamic text based questions from the user. Reply back with relevant answer from the designed corpus. [10 Marks]
3. End the chat session only if the user requests to end else ask what the user is looking for. Loop continues till the user asks to end it. [5 Marks]

#### Importing necessary libraries

In [12]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
import json
import pickle
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
import random

import warnings
warnings.filterwarnings("ignore")

#### Loading data

As our data is in JSON format, we’ll need to parse our “GL Bot.json” into Python language. This can be done using the JSON package(we have already imported it).

In [13]:
data_file = open('GL Bot.json').read()
intents =json.loads(data_file)

In [14]:
intents

{'intents': [{'tag': 'Intro',
   'patterns': ['hi',
    'how are you',
    'is anyone there',
    'hello',
    'whats up',
    'hey',
    'yo',
    'listen',
    'please help me',
    'i am learner from',
    'i belong to',
    'aiml batch',
    'aifl batch',
    'i am from',
    'my pm is',
    'blended',
    'online',
    'i am from',
    'hey ya',
    'talking to you for first time'],
   'responses': ['Hello! how can i help you ?'],
   'context_set': ''},
  {'tag': 'Exit',
   'patterns': ['thank you',
    'thanks',
    'cya',
    'see you',
    'later',
    'see you later',
    'goodbye',
    'i am leaving',
    'have a Good day',
    'you helped me',
    'thanks a lot',
    'thanks a ton',
    'you are the best',
    'great help',
    'too good',
    'you are a good learning buddy'],
   'responses': ['I hope I was able to assist you, Good Bye'],
   'context_set': ''},
  {'tag': 'Olympus',
   'patterns': ['olympus',
    'explain me how olympus works',
    'I am not able to understan

#### Data Preprocessing

In [15]:
#creating lists 
words=[]
classes = []
documents = []
#ignore these words
ignore_words = ['?', '!']

for intent in intents['intents']:
    for pattern in intent['patterns']:

        #tokenization technique
        w = nltk.word_tokenize(pattern)
        words.extend(w)
        #adding documents
        documents.append((w, intent['tag']))

        # append the tags into "classes" list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# lemmatization technique
words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
# this way we can remove duplicates
words = sorted(list(set(words)))
# now sort the "classes" list
classes = sorted(list(set(classes)))

print (len(documents), "documents")
# classes are categories of intents
print (len(classes), "classes", classes)
# print all words after apply the two techniques
print (len(words), "unique lemmatized words", words)

128 documents
8 classes ['Bot', 'Exit', 'Intro', 'NN', 'Olympus', 'Profane', 'SL', 'Ticket']
158 unique lemmatized words ['a', 'able', 'access', 'activation', 'ada', 'adam', 'aifl', 'aiml', 'am', 'an', 'ann', 'anyone', 'are', 'artificial', 'backward', 'bad', 'bagging', 'batch', 'bayes', 'belong', 'best', 'blended', 'bloody', 'boosting', 'bot', 'buddy', 'classification', 'contact', 'create', 'cross', 'cya', 'day', 'deep', 'did', 'diffult', 'do', 'ensemble', 'epoch', 'explain', 'first', 'for', 'forest', 'forward', 'from', 'function', 'good', 'goodbye', 'gradient', 'great', 'hate', 'have', 'hell', 'hello', 'help', 'helped', 'hey', 'hi', 'hidden', 'hour', 'how', 'hyper', 'i', 'imputer', 'in', 'intelligence', 'is', 'jerk', 'joke', 'knn', 'later', 'layer', 'learner', 'learning', 'leaving', 'link', 'listen', 'logistic', 'lot', 'machine', 'me', 'ml', 'my', 'naive', 'name', 'nb', 'net', 'network', 'neural', 'no', 'not', 'of', 'olympus', 'olypus', 'on', 'online', 'operation', 'opertions', 'otimi

#### Creating training and testing data

In [16]:
# creating training data
training = []
# empty array for output
output_empty = [0] * len(classes)
# training set
for doc in documents:
    # initialize the list "bag"(which is going to be bag of words)
    bag = []
    # creating list for tokens of pattern(words)
    pattern_words = doc[0]
    # lemmatization
    pattern_words = [lemmatizer.lemmatize(word.lower()) for word in pattern_words]
    # if the word is found in current pattern then append 1 in the bag of words array otherwise append 0
    for w in words:
        bag.append(1) if w in pattern_words else bag.append(0)
    
    # only for current tag, output will be 1. Otherwise 0
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1
    
    training.append([bag, output_row])
    
# shuffling the features
random.shuffle(training)
training = np.array(training)

#spliting the data into x and y . X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Training data created


####  Model building

We use Keras sequential API to build a deep neural network that has 3 layers. 

Compile this Keras model with an SGD optimizer.

Fit the model(I trained my model for 200 epochs)


In [17]:
# We use Keras sequential API to build a deep neural network that has 3 layers. 

model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

#Compile this Keras model with SGD optimizer.
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

#fit the model with 200 epochs 
hist = model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)

#print the statment when the model training is finished
print("model created")


Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

Epoch 84/200
Epoch 85/200
Epoch 86/200
Epoch 87/200
Epoch 88/200
Epoch 89/200
Epoch 90/200
Epoch 91/200
Epoch 92/200
Epoch 93/200
Epoch 94/200
Epoch 95/200
Epoch 96/200
Epoch 97/200
Epoch 98/200
Epoch 99/200
Epoch 100/200
Epoch 101/200
Epoch 102/200
Epoch 103/200
Epoch 104/200
Epoch 105/200
Epoch 106/200
Epoch 107/200
Epoch 108/200
Epoch 109/200
Epoch 110/200
Epoch 111/200
Epoch 112/200
Epoch 113/200
Epoch 114/200
Epoch 115/200
Epoch 116/200
Epoch 117/200
Epoch 118/200
Epoch 119/200
Epoch 120/200
Epoch 121/200
Epoch 122/200
Epoch 123/200
Epoch 124/200
Epoch 125/200
Epoch 126/200
Epoch 127/200
Epoch 128/200
Epoch 129/200
Epoch 130/200
Epoch 131/200
Epoch 132/200
Epoch 133/200
Epoch 134/200
Epoch 135/200
Epoch 136/200
Epoch 137/200
Epoch 138/200
Epoch 139/200
Epoch 140/200
Epoch 141/200
Epoch 142/200
Epoch 143/200
Epoch 144/200
Epoch 145/200
Epoch 146/200
Epoch 147/200
Epoch 148/200
Epoch 149/200
Epoch 150/200
Epoch 151/200
Epoch 152/200
Epoch 153/200
Epoch 154/200
Epoch 155/200
Epoch 15

#### Preprocessing the input

Input given by the user in the chatbot should be in the same manner as our model is trained on. Therefore we do similar text-preprocessing here also by tokenization and lemmatization. We are creating a function for this here.

In [18]:
def clean_up_sentence(sentence):
    # tokenization
    sentence_words = nltk.word_tokenize(sentence)
    # lematization
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words


#### Prediction

We will create a function that can translate the user’s message(sentences) into the bag of words(array which contains 0 and 1 values). When this function finds a word from the sentence in chatbot vocabulary, it sets 1 into the corresponding position within the array. This array is going to be sent to be classified by the model to spot to what intent it belongs.

In [19]:
def bow(sentence, words, show_details=True):
    # tokeniziation(using the function we created earlier)
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0]*len(words)  
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s: 
                
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)
    #return array of nag of words
    return(np.array(bag))

#function for prediction
def predict_class(sentence, model):
    # filtering the prediction based on threshold value
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    #setting threshold value
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort on the basis of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

#### Getting the response from the intents

Creating functions that can get a random response from the responses list from the identified intent

In [20]:
#choosing the response randomly from the predefined reponses for the given identified intent
def getResponse(ints, intents_json):
    tag = ints[0]['intent']
    list_of_intents = intents_json['intents']
    for i in list_of_intents:
        if(i['tag']== tag):
            result = random.choice(i['responses'])
            break
    return result

#function to return the response as output in the window
def chatbot_response(msg):
    ints = predict_class(msg, model)
    res = getResponse(ints, intents)
    return res

In [21]:
flag=True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type 'bye' ")
while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")

        else:
            print("ROBO: ",end="")
            print(chatbot_response(user_response))
            
    else:
        flag=False
        print("ROBO: Bye! take care..")

ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type 'bye' 
hey
Hello! how can i help you ?
how are you 
Hello! how can i help you ?
I am not able to understand olympus
Link: Olympus wiki
who to contact to get for olympus please
Link: Olympus wiki
i am not able to understand SVM
Link: Machine Learning wiki 
what is Deep LEARNING
Link: Neural Nets wiki
your name please
I am your virtual learning assistant
my problem is not resolved
Tarnsferring the request to your PM
This is not a good solution
Tarnsferring the request to your PM
what the hell man
Please use respectful words
you are useless
Please use respectful words
bye
ROBO: Bye! take care..
