<a href="https://colab.research.google.com/github/OmkarModi/Text_classification/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi Class Text Classification

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn import preprocessing

In [2]:
#paths to various files used in projects

data_path = "/content/root2ai - Data.csv"
word_embedding_path = 'glove.6B.300d.txt'

## Data Preprocessing

In [3]:
data = pd.read_csv(data_path)
print(data.head())
print(data.info())

                                                Text      Target
0  reserve bank forming expert committee based in...  Blockchain
1          director could play role financial system  Blockchain
2  preliminary discuss secure transaction study r...  Blockchain
3  security indeed prove essential transforming f...  Blockchain
4  bank settlement normally take three days based...  Blockchain
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22704 entries, 0 to 22703
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    22701 non-null  object
 1   Target  22704 non-null  object
dtypes: object(2)
memory usage: 354.9+ KB
None


The `info()` output shows that the Text column has 22701 non-null values out of 22704 rows, so 3 rows have missing text and need to be handled.

In [4]:
#rows with NaN or empty cells are dropped
#so that they don't raise errors during classification
data.dropna(inplace=True)
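
As a minimal sketch of what the cell above does, the toy frame below (made-up rows, same two column names as the dataset) counts the missing cells per column and then drops the incomplete rows. The variable names are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same two columns as the dataset; the rows are made up.
toy = pd.DataFrame({
    "Text": ["bank settlement", np.nan, "secure transaction"],
    "Target": ["Blockchain", "Blockchain", "FinTech"],
})

# Count NaN cells per column, then drop every row that contains a NaN.
missing_per_column = toy.isna().sum()
cleaned = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Unlike this sketch, the notebook cell does not reset the index, so the original row labels are kept after the drop.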

### Train-test splitting and label encoding

The labels are categorical strings, so they must be encoded as numbers before a model can be trained on them.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X = data['Text']
y = data['Target']
# map the class names to integer ids
encoder = preprocessing.LabelEncoder()
y = np.array(encoder.fit_transform(y))

X_train, X_test, y_train , y_test = train_test_split(X,y, test_size = 0.2 , random_state = 0)

# one-hot targets, used later by the neural networks
# (scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`)
enc = OneHotEncoder(sparse=False)
onehot_train_y = y_train.reshape(len(y_train),1)  #reshaping to a 2d array, as OneHotEncoder requires a 2d array as parameter
onehot_train_y = enc.fit_transform(onehot_train_y)
onehot_test_y = y_test.reshape(len(y_test),1)
onehot_test_y = enc.transform(onehot_test_y)  #reuse the encoder fitted on the training labels
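
The two encoding steps can be seen on a handful of labels. This is only a sketch with made-up label arrays; it fits both encoders on the training labels and reuses them for the test labels, and it calls `.toarray()` on the default sparse output so that it does not depend on the `sparse` / `sparse_output` argument name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up labels, standing in for the Target column.
train_labels = np.array(["FinTech", "Blockchain", "Bigdata", "FinTech"])
test_labels = np.array(["Bigdata", "FinTech"])

# Step 1: strings -> integer ids (classes are sorted alphabetically).
label_enc = LabelEncoder()
train_ids = label_enc.fit_transform(train_labels)
test_ids = label_enc.transform(test_labels)

# Step 2: integer ids -> one-hot rows; fit on train, only transform test.
onehot_enc = OneHotEncoder()
train_onehot = onehot_enc.fit_transform(train_ids.reshape(-1, 1)).toarray()
test_onehot = onehot_enc.transform(test_ids.reshape(-1, 1)).toarray()

print(list(label_enc.classes_))
print(train_ids)
print(test_onehot)
```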

The class labels can be read back from the encoder. The targets fall into 11 classes.

In [6]:
class_names = list(encoder.classes_)
print(class_names)

['Bigdata', 'Blockchain', 'Cyber Security', 'Data Security', 'FinTech', 'Microservices', 'Neobanks', 'Reg Tech', 'Robo Advising', 'Stock Trading', 'credit reporting']


NOTE: Preprocessing normally includes cleaning the text: removing non-word characters, HTML tags, stopwords and other noise. The data provided is already cleaned and lowercased, so this step is skipped.

## Feature Selection

Raw text is transformed into numeric feature vectors that the models can consume.

### Count Vectors

A count vector is a matrix representation of the dataset in which every row represents a text from the data, every column represents a term from the vocabulary, and every cell holds the frequency of that term in that document.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
count_vect.fit(X_train)
count_vect_xtrain = count_vect.transform(X_train)
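
To make the matrix layout concrete, here is a small sketch on two made-up documents. It only uses `CountVectorizer` with its defaults; the document strings are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents.
docs = ["bank secure bank", "secure transaction"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)   # sparse matrix: rows = documents, columns = terms

vocab = vect.vocabulary_            # maps each term to its column index
dense = counts.toarray()            # dense copy, only sensible for tiny examples

print(vocab)
print(dense)
```

Row 0 holds a 2 in the `bank` column because the word occurs twice in the first document.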

### Word Level TF-IDF

The representation is similar to count vectors, but each cell holds a TF-IDF weight instead of a raw count. The weight reflects how important a term is to a document relative to the rest of the corpus.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
word_tfid_vect = TfidfVectorizer()
word_tfid_vect_xtrain = word_tfid_vect.fit_transform(X_train)
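
The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalisation) and compares the manual result with the library output; the documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["bank secure bank", "secure transaction"]

# Raw term counts (rows = documents, columns = terms in alphabetical order).
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

`secure` appears in both documents, so it gets the lowest IDF and contributes less to each row than the rarer terms.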

### N-gram Level TF-IDF

Groups of n adjacent words are used as terms, because a phrase can carry information that its individual words do not. Here n ranges from 2 to 3.

In [9]:
ngram_tfid_vect = TfidfVectorizer(ngram_range = (2,3))
ngram_tfid_vect_xtrain = ngram_tfid_vect.fit_transform(X_train)
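
A quick way to see which terms `ngram_range = (2,3)` produces is to run the vectorizer's analyzer on one sentence. This is a sketch on a made-up sentence; `build_analyzer()` returns the tokenising function the vectorizer uses internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same n-gram range as the cell above; the sentence is made up.
analyzer = TfidfVectorizer(ngram_range=(2, 3)).build_analyzer()
ngrams = analyzer("bank settlement take three days")

print(ngrams)
```

Single words are not emitted at all with this range, which may partly explain why these features score lower than the word-level ones in the results further down.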

### Character Level TF-IDF

Here the TF-IDF scores are computed over character n-grams (2 to 3 characters) instead of words.

In [10]:
char_tfid_vect = TfidfVectorizer(analyzer = 'char',ngram_range=(2,3))
char_tfid_vect_xtrain = char_tfid_vect.fit_transform(X_train)
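
The character-level terms can be inspected the same way. A minimal sketch on a single made-up word, using the same `analyzer` and `ngram_range` settings as the cell above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams of length 2 and 3 for one short word.
char_analyzer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).build_analyzer()
char_ngrams = char_analyzer("bank")

print(char_ngrams)
```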

### Word2Vec

A word embedding is a form of representing words and documents using a dense vector representation. The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

In [11]:
import gensim.models 
from nltk.tokenize import word_tokenize,sent_tokenize
nltk.download('punkt')
sentence = data['Text'].tolist()
sent_token = [word_tokenize(sent) for sent in sentence]
model = gensim.models.Word2Vec(sentences=sent_token)

# Precompute L2-normalised word vectors. This cell targets the gensim 3.x API:
# wv.vocab, wv.syn0norm and init_sims were removed or deprecated in gensim 4.
model.wv.init_sims()

# Averaging the word vectors of a text gives a simple fixed-length document feature

def word_averaging(wv, words):
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    # no known word in the text: fall back to a zero vector
    if not mean:
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, text) for text in text_list ])

def w2v_tokenize_text(text):
    tokens = []
    for sent in sent_tokenize(text, language='english'):
        for word in word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

train_tokenized = X_train.apply(w2v_tokenize_text)
test_tokenized = X_test.apply(w2v_tokenize_text)

X_train_word_average = word_averaging_list(model.wv,train_tokenized)
X_test_word_average = word_averaging_list(model.wv,test_tokenized)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
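
The averaging step does not depend on gensim itself. The sketch below repeats the idea with plain NumPy and a hypothetical dictionary of 3-dimensional vectors standing in for the trained model; `average_vector` is an illustrative name, not a function from the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, standing in for a trained model.
toy_vectors = {
    "bank": np.array([2.0, 0.0, 0.0]),
    "secure": np.array([0.0, 1.0, 0.0]),
}

def average_vector(tokens, vectors, size=3):
    """Mean of the unit-length vectors of the known tokens, rescaled to unit length."""
    known = [vectors[t] / np.linalg.norm(vectors[t]) for t in tokens if t in vectors]
    if not known:
        # No token is in the vocabulary: return a zero vector.
        return np.zeros(size)
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)

doc_vec = average_vector(["bank", "secure", "unknownword"], toy_vectors)
empty_vec = average_vector(["unknownword"], toy_vectors)

print(doc_vec)
print(empty_vec)
```

Out-of-vocabulary tokens are skipped, so a text made only of unknown words maps to the zero vector, as in the notebook's `word_averaging`.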




In [12]:
from keras.preprocessing import text, sequence

#this maps the word to index 
token = text.Tokenizer()
token.fit_on_texts(data['Text'])
word_index = token.word_index

#padding sequences to further feed as input to models
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(X_train), maxlen=50)
test_seq_x = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=50)

#creating embedding matrix that stores vector representation of words
#(100 columns, matching the default Word2Vec vector size used above)
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
  if word in model.wv.vocab:  # direct membership test, no list copy per word
    embedding_matrix[i] = model.wv[word]
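
What `pad_sequences(..., maxlen=50)` does to each index sequence can be sketched in NumPy. The helper below is hypothetical and assumes the Keras defaults, which pad with zeros on the left and, for sequences longer than `maxlen`, keep the last `maxlen` tokens.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        if not seq:
            continue                        # empty sequence stays all padding
        trimmed = seq[-maxlen:]             # keep the last maxlen tokens
        out[row, -len(trimmed):] = trimmed  # right-align, padding on the left
    return out

padded = pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding matrix has `len(word_index) + 1` rows.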



### Doc2Vec

This is similar to Word2Vec, but instead of individual words the whole text is represented as a single vector.

In [13]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from sklearn import utils
import gensim
from gensim.models.doc2vec import TaggedDocument
import re

def label_sentences(corpus, label_type):
  labeled = []
  for i, v in enumerate(corpus):
      label = label_type + '_' + str(i)
      labeled.append(TaggedDocument(v.split(), [label]))
  return labeled

X_train_d2v = label_sentences(X_train, 'Train')
X_test_d2v = label_sentences(X_test, 'Test')
all_data = X_train_d2v + X_test_d2v

model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), total_examples=len(all_data), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Tag prefix of the documents, 'Train' or 'Test'
    :return: array of shape (corpus_size, vectors_size)
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors
    
train_vectors_dbow = get_vectors(model_dbow, len(X_train_d2v), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test_d2v), 300, 'Test')

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_vectors_dbow = scaler.fit_transform(train_vectors_dbow)
test_vectors_dbow = scaler.transform(test_vectors_dbow)
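
The training loop in the cell above lowers the learning rate by hand, one step per pass over the data. A small sketch of the resulting schedule, using the same start value, step and number of passes as the loop (the variable names are illustrative only):

```python
# Same numbers as the Doc2Vec loop: start at 0.065, subtract 0.002 per pass.
alpha = 0.065
step = 0.002
epochs = 30

schedule = []
for _ in range(epochs):
    schedule.append(alpha)  # learning rate used for this pass
    alpha -= step

print(round(schedule[0], 3), round(schedule[-1], 3), round(alpha, 3))
```

The rate therefore falls linearly from 0.065 in the first pass to about 0.007 in the last one.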


## Model Building

A dictionary that will store the evaluation metrics for each model and each feature type.

In [14]:
model_dict = {}

A helper function that fits a model on the training features and returns its evaluation metrics on the test set.

In [15]:
from sklearn import metrics
def model_fit(model,X_train,y_train,X_test,y_test):
  classifier = model
  classifier.fit(X_train,y_train)
  y_pred = classifier.predict(X_test)
  # weighted averages; zero_division=0 scores never-predicted classes as 0
  metric = {
      'accuracy' : metrics.accuracy_score(y_test,y_pred),
      'recall' : metrics.recall_score(y_test,y_pred,average = 'weighted',zero_division=0),
      'precision' : metrics.precision_score(y_test,y_pred, average = 'weighted',zero_division=0),
      'f1_score' : metrics.f1_score(y_test, y_pred, average='weighted',zero_division=0),
  }
  return metric
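
The averaging mode matters when reading the result tables. The sketch below uses made-up labels for a two-class problem to show that weighted recall always equals accuracy (which is why the two columns coincide in every result below) and that macro and weighted F1 can differ on imbalanced data.

```python
from sklearn import metrics

# Made-up, imbalanced labels: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

accuracy = metrics.accuracy_score(y_true, y_pred)
weighted_recall = metrics.recall_score(y_true, y_pred, average="weighted")
weighted_f1 = metrics.f1_score(y_true, y_pred, average="weighted")
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro")

print(accuracy, weighted_recall)
print(weighted_f1, macro_f1)
```

Macro averaging gives every class the same weight, so a poorly predicted minority class pulls it down more than the weighted variant.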

### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression
LR={}
LR['count_vector'] = model_fit(LogisticRegression(max_iter=250),count_vect_xtrain,y_train,count_vect.transform(X_test),y_test)
LR['word_tfid'] = model_fit(LogisticRegression(max_iter=250),word_tfid_vect_xtrain,y_train,word_tfid_vect.transform(X_test),y_test)
LR['ngram_tfid'] = model_fit(LogisticRegression(max_iter=250),ngram_tfid_vect_xtrain,y_train,ngram_tfid_vect.transform(X_test),y_test)
LR['char_tfid'] = model_fit(LogisticRegression(max_iter=250),char_tfid_vect_xtrain,y_train,char_tfid_vect.transform(X_test),y_test)
LR['word2v'] = model_fit(LogisticRegression(max_iter=250),X_train_word_average,y_train,X_test_word_average,y_test)
LR['doc2v'] = model_fit(LogisticRegression(max_iter=250),train_vectors_dbow,y_train,test_vectors_dbow,y_test)

model_dict['LogisticRegression'] = LR

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [17]:
print(model_dict['LogisticRegression'])

{'count_vector': {'accuracy': 0.6650517507157014, 'recall': 0.6650517507157014, 'precision': 0.6773009422038596, 'f1_score': 0.6541626476882758}, 'word_tfid': {'accuracy': 0.6538207443294428, 'recall': 0.6538207443294428, 'precision': 0.7002706540977516, 'f1_score': 0.6306351615684646}, 'ngram_tfid': {'accuracy': 0.47148205241136315, 'recall': 0.47148205241136315, 'precision': 0.6986463925421136, 'f1_score': 0.3798440257847721}, 'char_tfid': {'accuracy': 0.6295970050649636, 'recall': 0.6295970050649636, 'precision': 0.6564821187599058, 'f1_score': 0.6034472687215399}, 'word2v': {'accuracy': 0.39374587095353447, 'recall': 0.39374587095353447, 'precision': 0.3196880828752352, 'f1_score': 0.24014897658836998}, 'doc2v': {'accuracy': 0.6362034794098216, 'recall': 0.6362034794098216, 'precision': 0.6297431103283946, 'f1_score': 0.6210494790348875}}


### Naive Bayes

In [18]:
from sklearn.naive_bayes import MultinomialNB
NB={}
NB['count_vector'] = model_fit(MultinomialNB(),count_vect_xtrain,y_train,count_vect.transform(X_test),y_test)
NB['word_tfid'] = model_fit(MultinomialNB(),word_tfid_vect_xtrain,y_train,word_tfid_vect.transform(X_test),y_test)
NB['ngram_tfid'] = model_fit(MultinomialNB(),ngram_tfid_vect_xtrain,y_train,ngram_tfid_vect.transform(X_test),y_test)
NB['char_tfid'] = model_fit(MultinomialNB(),char_tfid_vect_xtrain,y_train,char_tfid_vect.transform(X_test),y_test)

model_dict['NaiveBayes'] = NB

In [19]:
print(model_dict['NaiveBayes'])

{'count_vector': {'accuracy': 0.6628495926007487, 'recall': 0.6628495926007487, 'precision': 0.705685421810634, 'f1_score': 0.6347429129700688}, 'word_tfid': {'accuracy': 0.5221316890552742, 'recall': 0.5221316890552742, 'precision': 0.6849210138954444, 'f1_score': 0.4365085440444979}, 'ngram_tfid': {'accuracy': 0.43999119136754017, 'recall': 0.43999119136754017, 'precision': 0.6231222767727924, 'f1_score': 0.32447176194272814}, 'char_tfid': {'accuracy': 0.5091389561770535, 'recall': 0.5091389561770535, 'precision': 0.6514710048615274, 'f1_score': 0.42288955013348173}}


### SVM

In [20]:
from sklearn.linear_model import SGDClassifier
SVM = {}
SVM['count_vector'] = model_fit(SGDClassifier(),count_vect_xtrain,y_train,count_vect.transform(X_test),y_test)
SVM['word_tfid'] = model_fit(SGDClassifier(),word_tfid_vect_xtrain,y_train,word_tfid_vect.transform(X_test),y_test)
SVM['ngram_tfid'] = model_fit(SGDClassifier(),ngram_tfid_vect_xtrain,y_train,ngram_tfid_vect.transform(X_test),y_test)
SVM['char_tfid'] = model_fit(SGDClassifier(),char_tfid_vect_xtrain,y_train,char_tfid_vect.transform(X_test),y_test)
SVM['word2v'] = model_fit(SGDClassifier(),X_train_word_average,y_train,X_test_word_average,y_test)
SVM['doc2v'] = model_fit(SGDClassifier(),train_vectors_dbow,y_train,test_vectors_dbow,y_test)

model_dict['SVM'] = SVM

In [21]:
print(model_dict['SVM'])

{'count_vector': {'accuracy': 0.6643911032812156, 'recall': 0.6643911032812156, 'precision': 0.6607242526621231, 'f1_score': 0.6574610268986137}, 'word_tfid': {'accuracy': 0.6683549878881304, 'recall': 0.6683549878881304, 'precision': 0.6732033958745799, 'f1_score': 0.6497244811533924}, 'ngram_tfid': {'accuracy': 0.5961242017176833, 'recall': 0.5961242017176833, 'precision': 0.6600765616182456, 'f1_score': 0.5699298895413699}, 'char_tfid': {'accuracy': 0.6309182999339352, 'recall': 0.6309182999339352, 'precision': 0.6273450682087197, 'f1_score': 0.6057447617099508}, 'word2v': {'accuracy': 0.383395727813257, 'recall': 0.383395727813257, 'precision': 0.44659473032332553, 'f1_score': 0.28740659727033485}, 'doc2v': {'accuracy': 0.5818101739704911, 'recall': 0.5818101739704911, 'precision': 0.574610093880791, 'f1_score': 0.5709939115322595}}


### Random Forest Classifier

In [22]:
from sklearn.ensemble import RandomForestClassifier
RF={}
RF['count_vector'] = model_fit(RandomForestClassifier(),count_vect_xtrain,y_train,count_vect.transform(X_test),y_test)
RF['word_tfid'] = model_fit(RandomForestClassifier(),word_tfid_vect_xtrain,y_train,word_tfid_vect.transform(X_test),y_test)
RF['ngram_tfid'] = model_fit(RandomForestClassifier(),ngram_tfid_vect_xtrain,y_train,ngram_tfid_vect.transform(X_test),y_test)
RF['char_tfid'] = model_fit(RandomForestClassifier(),char_tfid_vect_xtrain,y_train,char_tfid_vect.transform(X_test),y_test)
RF['word2v'] = model_fit(RandomForestClassifier(),X_train_word_average,y_train,X_test_word_average,y_test)
RF['doc2v'] = model_fit(RandomForestClassifier(),train_vectors_dbow,y_train,test_vectors_dbow,y_test)

model_dict['RandomForest'] = RF

In [23]:
print(model_dict['RandomForest'])

{'count_vector': {'accuracy': 0.6260735520810394, 'recall': 0.6260735520810394, 'precision': 0.6376836701184349, 'f1_score': 0.6080098236899938}, 'word_tfid': {'accuracy': 0.6432503853776701, 'recall': 0.6432503853776701, 'precision': 0.6767240523003284, 'f1_score': 0.6190746086145794}, 'ngram_tfid': {'accuracy': 0.42105263157894735, 'recall': 0.42105263157894735, 'precision': 0.6236884370924334, 'f1_score': 0.4550963497504706}, 'char_tfid': {'accuracy': 0.5930411803567496, 'recall': 0.5930411803567496, 'precision': 0.6714304898102078, 'f1_score': 0.5435707224797897}, 'word2v': {'accuracy': 0.561770535124422, 'recall': 0.561770535124422, 'precision': 0.6119084781197427, 'f1_score': 0.5220001653072411}, 'doc2v': {'accuracy': 0.5985465756441313, 'recall': 0.5985465756441313, 'precision': 0.640228287539343, 'f1_score': 0.5405410726038635}}


### Extreme Gradient Boosting (XGB)

In [24]:
import xgboost
XGB={}
XGB['count_vector'] = model_fit(xgboost.XGBClassifier(),count_vect_xtrain,y_train,count_vect.transform(X_test),y_test)
XGB['word_tfid'] = model_fit(xgboost.XGBClassifier(),word_tfid_vect_xtrain,y_train,word_tfid_vect.transform(X_test),y_test)
XGB['ngram_tfid'] = model_fit(xgboost.XGBClassifier(),ngram_tfid_vect_xtrain,y_train,ngram_tfid_vect.transform(X_test),y_test)
XGB['char_tfid'] = model_fit(xgboost.XGBClassifier(),char_tfid_vect_xtrain,y_train,char_tfid_vect.transform(X_test),y_test)
XGB['word2v'] = model_fit(xgboost.XGBClassifier(),X_train_word_average,y_train,X_test_word_average,y_test)
XGB['doc2v'] = model_fit(xgboost.XGBClassifier(),train_vectors_dbow,y_train,test_vectors_dbow,y_test)

model_dict['XGB'] = XGB

In [25]:
print(model_dict['XGB'])

{'count_vector': {'accuracy': 0.5492182338691918, 'recall': 0.5492182338691918, 'precision': 0.6434636040879212, 'f1_score': 0.5093318825272293}, 'word_tfid': {'accuracy': 0.5516406077956397, 'recall': 0.5516406077956397, 'precision': 0.6570849133628531, 'f1_score': 0.5098584817250998}, 'ngram_tfid': {'accuracy': 0.4487998238273508, 'recall': 0.4487998238273508, 'precision': 0.6957606828452155, 'f1_score': 0.34754921252001186}, 'char_tfid': {'accuracy': 0.5769654261175953, 'recall': 0.5769654261175953, 'precision': 0.6679345422989506, 'f1_score': 0.5304686847030918}, 'word2v': {'accuracy': 0.5265360052851795, 'recall': 0.5265360052851795, 'precision': 0.5430254891197235, 'f1_score': 0.47767916325527116}, 'doc2v': {'accuracy': 0.6161638405637525, 'recall': 0.6161638405637525, 'precision': 0.6244509753435562, 'f1_score': 0.582085846360972}}


### Neural Network

In [26]:
import tensorflow as tf
from keras import layers, models, optimizers
def create_model_architecture(input_size):
    # create input layer 
    input_layer = layers.Input((input_size,), sparse=True)
    
    # create hidden layer
    hidden_layer = layers.Dense(100, activation="relu")(input_layer)
    
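    # create output layer (one unit per class)
    # NOTE: softmax is the usual pairing with categorical_crossentropy; sigmoid is
    # kept because the recorded results below were produced with it
    output_layer = layers.Dense(11, activation="sigmoid")(hidden_layer)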

    classifier = models.Model(inputs = input_layer, outputs = output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics=['accuracy'])
    return classifier 
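
The output layer above uses a sigmoid activation together with `categorical_crossentropy`. As a hedged aside, the NumPy sketch below contrasts sigmoid and softmax on made-up scores: softmax yields a probability distribution over the classes, while independent sigmoids do not sum to 1. The predicted class (argmax) is the same in both cases.

```python
import numpy as np

# Made-up raw scores for three classes.
logits = np.array([2.0, 1.0, 0.1])

# Sigmoid treats every class independently.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax normalises across classes (shifted by the max for numerical stability).
shifted = np.exp(logits - logits.max())
softmax = shifted / shifted.sum()

print(sigmoid, sigmoid.sum())
print(softmax, softmax.sum())
```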


In [27]:
def NN_model(X_train,y_train,X_test,y_test):
  classifier = create_model_architecture(X_train.shape[1])
  classifier.fit(X_train,y_train,epochs=3)
  # y_test holds integer class ids, so Keras accuracy uses the global one-hot test labels
  loss, acc = classifier.evaluate(X_test, onehot_test_y, verbose=0)
  print(acc)
  y_pred = classifier.predict(X_test)
  y_pred = y_pred.argmax(axis =-1)  # most probable class per row
  # NOTE: f1_score is macro-averaged here, whereas model_fit reports weighted f1,
  # so the two f1 columns are not directly comparable
  metric = {
      'accuracy' : metrics.accuracy_score(y_test,y_pred),
      'recall' : metrics.recall_score(y_test,y_pred,average = 'weighted'),
      'precision' : metrics.precision_score(y_test,y_pred, average = 'weighted'),
      'f1_score' : metrics.f1_score(y_test, y_pred, average='macro'),
  }
  return metric

In [28]:
NN = {}
NN['count_vector'] = NN_model(count_vect_xtrain, onehot_train_y,count_vect.transform(X_test),y_test)
NN['word_tfid'] = NN_model(word_tfid_vect_xtrain.toarray(),onehot_train_y,word_tfid_vect.transform(X_test).toarray(),y_test)
# NN['ngram_tfid'] = NN_model(ngram_tfid_vect_xtrain.toarray(),onehot_train_y,ngram_tfid_vect.transform(X_test).toarray(),y_test)
# NN['char_tfid'] = NN_model(char_tfid_vect_xtrain.toarray(),onehot_train_y,char_tfid_vect.transform(X_test).toarray(),y_test)
NN['word2v'] = NN_model(X_train_word_average,onehot_train_y,X_test_word_average,y_test)
NN['doc2v'] = NN_model(train_vectors_dbow,onehot_train_y,test_vectors_dbow,y_test)

model_dict['NN'] = NN

Epoch 1/3
Epoch 2/3
Epoch 3/3
0.6881744265556335
Epoch 1/3
Epoch 2/3
Epoch 3/3
0.6994054317474365
Epoch 1/3
Epoch 2/3
Epoch 3/3
0.39396607875823975
Epoch 1/3
Epoch 2/3
Epoch 3/3
0.640828013420105


In [29]:
print(model_dict['NN'])

{'count_vector': {'accuracy': 0.6881744109227043, 'recall': 0.6881744109227043, 'precision': 0.6926543854883103, 'f1_score': 0.6100675512438144}, 'word_tfid': {'accuracy': 0.6994054173089628, 'recall': 0.6994054173089628, 'precision': 0.6978221072207442, 'f1_score': 0.6117355480652417}, 'word2v': {'accuracy': 0.39396608676502975, 'recall': 0.39396608676502975, 'precision': 0.3314684152334756, 'f1_score': 0.07766698730842095}, 'doc2v': {'accuracy': 0.6408280114512221, 'recall': 0.6408280114512221, 'precision': 0.6355252108767004, 'f1_score': 0.5576211221024399}}


### Convolutional Neural Network

In [34]:
from sklearn import metrics
def DNN_model(classify,X_train,y_train,X_test,y_test):
  classifier = classify
  classifier.fit(X_train,y_train,epochs=60,verbose=2)
  # evaluate on the X_test argument rather than the global test_seq_x
  loss, acc = classifier.evaluate(X_test, onehot_test_y, verbose=0)
  y_pred = classifier.predict(X_test)
  y_pred = np.argmax(y_pred,axis =-1)
  # NOTE: f1_score is macro-averaged here; model_fit reports weighted f1
  metric = {
      'accuracy' : metrics.accuracy_score(y_test,y_pred),
      'recall' : metrics.recall_score(y_test,y_pred,average = 'weighted'),
      'precision' : metrics.precision_score(y_test,y_pred, average = 'weighted'),
      'f1_score' : metrics.f1_score(y_test, y_pred, average='macro'),
  }
  return metric

In [32]:
from keras import layers
def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((50, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(11, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

classifier = create_cnn()
cnn = {}
cnn = DNN_model(classifier,train_seq_x,onehot_train_y,test_seq_x,y_test)
model_dict['CNN'] = cnn

Epoch 1/60
568/568 - 7s - loss: 1.9905 - accuracy: 0.4015
Epoch 2/60
568/568 - 5s - loss: 1.9088 - accuracy: 0.4100
Epoch 3/60
568/568 - 5s - loss: 1.8849 - accuracy: 0.4139
Epoch 4/60
568/568 - 5s - loss: 1.8569 - accuracy: 0.4236
Epoch 5/60
568/568 - 5s - loss: 1.8232 - accuracy: 0.4326
Epoch 6/60
568/568 - 5s - loss: 1.8002 - accuracy: 0.4427
Epoch 7/60
568/568 - 5s - loss: 1.7820 - accuracy: 0.4475
Epoch 8/60
568/568 - 5s - loss: 1.7725 - accuracy: 0.4492
Epoch 9/60
568/568 - 5s - loss: 1.7638 - accuracy: 0.4525
Epoch 10/60
568/568 - 5s - loss: 1.7570 - accuracy: 0.4522
Epoch 11/60
568/568 - 5s - loss: 1.7549 - accuracy: 0.4499
Epoch 12/60
568/568 - 5s - loss: 1.7403 - accuracy: 0.4551
Epoch 13/60
568/568 - 5s - loss: 1.7464 - accuracy: 0.4516
Epoch 14/60
568/568 - 5s - loss: 1.7369 - accuracy: 0.4555
Epoch 15/60
568/568 - 5s - loss: 1.7380 - accuracy: 0.4531
Epoch 16/60
568/568 - 5s - loss: 1.7387 - accuracy: 0.4537
Epoch 17/60
568/568 - 5s - loss: 1.7306 - accuracy: 0.4578
Epoch 



In [33]:
print(cnn)

{'accuracy': 0.4818321955516406, 'recall': 0.4818321955516406, 'precision': 0.4665161626154696, 'f1_score': 0.2340110028730936}


In [38]:
from keras.models import Sequential
# all of these layers are importable from keras.layers directly
from keras.layers import Dense, Flatten, Embedding, Conv1D, MaxPooling1D

vocab_size=len(word_index)+1
model = Sequential()
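# NOTE: no pretrained weights are passed, so with trainable=False this embedding
# stays at its random initialisation (create_cnn above passes weights=[embedding_matrix])
model.add(Embedding(vocab_size, 100, input_length=50,trainable=False))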
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(25, activation='relu'))
model.add(Dense(11, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(train_seq_x, onehot_train_y, epochs=30, verbose=2)
# evaluate
loss, acc = model.evaluate(test_seq_x, onehot_test_y, verbose=0)
print('Test Accuracy: %f' % (acc*100))
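# predict_classes is not available in newer Keras releases; argmax over the
# predicted scores gives the same class ids
y_pred = np.argmax(model.predict(test_seq_x), axis=-1)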


cnn2 = {}
cnn2['accuracy'] = acc
cnn2['recall'] = metrics.recall_score(y_test,y_pred,average = 'weighted')
cnn2['precision'] = metrics.precision_score(y_test,y_pred, average = 'weighted')
cnn2['f1_score'] = metrics.f1_score(y_test, y_pred, average='macro')


Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 50, 100)           1141000   
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 43, 32)            25632     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 21, 32)            0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 672)               0         
_________________________________________________________________
dense_16 (Dense)             (None, 25)                16825     
_________________________________________________________________
dense_17 (Dense)             (None, 11)                286       
Total params: 1,183,743
Trainable params: 42,743
Non-trainable params: 1,141,000
_______________________________________



### LSTM

In [39]:
def create_rnn_lstm():
    # Add an Input Layer
    input_layer = layers.Input((50, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(11, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics=['accuracy'])
    
    return model

classifier = create_rnn_lstm()
lstm = {}
lstm = DNN_model(classifier,train_seq_x,onehot_train_y,test_seq_x,y_test)


Epoch 1/60
568/568 - 25s - loss: 1.9602 - accuracy: 0.4008
Epoch 2/60
568/568 - 19s - loss: 1.8975 - accuracy: 0.4153
Epoch 3/60
568/568 - 19s - loss: 1.8767 - accuracy: 0.4173
Epoch 4/60
568/568 - 19s - loss: 1.8610 - accuracy: 0.4199
Epoch 5/60
568/568 - 19s - loss: 1.8546 - accuracy: 0.4199
Epoch 6/60
568/568 - 19s - loss: 1.8359 - accuracy: 0.4267
Epoch 7/60
568/568 - 19s - loss: 1.8220 - accuracy: 0.4287
Epoch 8/60
568/568 - 19s - loss: 1.8131 - accuracy: 0.4310
Epoch 9/60
568/568 - 19s - loss: 1.7920 - accuracy: 0.4393
Epoch 10/60
568/568 - 19s - loss: 1.7828 - accuracy: 0.4378
Epoch 11/60
568/568 - 19s - loss: 1.7715 - accuracy: 0.4423
Epoch 12/60
568/568 - 19s - loss: 1.7536 - accuracy: 0.4492
Epoch 13/60
568/568 - 19s - loss: 1.7435 - accuracy: 0.4514
Epoch 14/60
568/568 - 19s - loss: 1.7269 - accuracy: 0.4589
Epoch 15/60
568/568 - 19s - loss: 1.7185 - accuracy: 0.4590
Epoch 16/60
568/568 - 19s - loss: 1.7158 - accuracy: 0.4631
Epoch 17/60
568/568 - 19s - loss: 1.7065 - accura

In [40]:
print(lstm['accuracy'])

0.48667694340453643


### Gated Recurrent Unit

In [70]:
def create_rnn_gru():
    # Add an Input Layer
    input_layer = layers.Input((50, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the GRU Layer
    gru_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(11, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy',metrics= ['accuracy'])
    
    return model

classifier = create_rnn_gru()
rnn_gru = {}
rnn_gru = DNN_model(classifier,train_seq_x,onehot_train_y,test_seq_x,y_test)
wor2v = {}
wor2v['word2vec'] = rnn_gru
model_dict['GRU'] = wor2v

Epoch 1/60
568/568 - 21s - loss: 1.9723 - accuracy: 0.3979
Epoch 2/60
568/568 - 18s - loss: 1.8921 - accuracy: 0.4158
Epoch 3/60
568/568 - 18s - loss: 1.8756 - accuracy: 0.4186
Epoch 4/60
568/568 - 18s - loss: 1.8604 - accuracy: 0.4212
Epoch 5/60
568/568 - 18s - loss: 1.8397 - accuracy: 0.4225
Epoch 6/60
568/568 - 18s - loss: 1.8115 - accuracy: 0.4341
Epoch 7/60
568/568 - 18s - loss: 1.7832 - accuracy: 0.4447
Epoch 8/60
568/568 - 18s - loss: 1.7621 - accuracy: 0.4520
Epoch 9/60
568/568 - 18s - loss: 1.7521 - accuracy: 0.4541
Epoch 10/60
568/568 - 18s - loss: 1.7447 - accuracy: 0.4536
Epoch 11/60
568/568 - 18s - loss: 1.7334 - accuracy: 0.4572
Epoch 12/60
568/568 - 18s - loss: 1.7277 - accuracy: 0.4573
Epoch 13/60
568/568 - 18s - loss: 1.7157 - accuracy: 0.4602
Epoch 14/60
568/568 - 18s - loss: 1.7099 - accuracy: 0.4623
Epoch 15/60
568/568 - 18s - loss: 1.7040 - accuracy: 0.4645
Epoch 16/60
568/568 - 18s - loss: 1.6978 - accuracy: 0.4655
Epoch 17/60
568/568 - 18s - loss: 1.6875 - accura

## Evaluating Models

In [71]:
temp_model_df = pd.DataFrame(columns=['Model','Feature'])
temp_eval_df = pd.DataFrame(columns=['accuracy','recall','precision','f1_score'])

In [58]:
print(model_dict)

{'LogisticRegression': {'count_vector': {'accuracy': 0.6650517507157014, 'recall': 0.6650517507157014, 'precision': 0.6773009422038596, 'f1_score': 0.6541626476882758}, 'word_tfid': {'accuracy': 0.6538207443294428, 'recall': 0.6538207443294428, 'precision': 0.7002706540977516, 'f1_score': 0.6306351615684646}, 'ngram_tfid': {'accuracy': 0.47148205241136315, 'recall': 0.47148205241136315, 'precision': 0.6986463925421136, 'f1_score': 0.3798440257847721}, 'char_tfid': {'accuracy': 0.6295970050649636, 'recall': 0.6295970050649636, 'precision': 0.6564821187599058, 'f1_score': 0.6034472687215399}, 'word2v': {'accuracy': 0.39374587095353447, 'recall': 0.39374587095353447, 'precision': 0.3196880828752352, 'f1_score': 0.24014897658836998}, 'doc2v': {'accuracy': 0.6362034794098216, 'recall': 0.6362034794098216, 'precision': 0.6297431103283946, 'f1_score': 0.6210494790348875}}, 'NaiveBayes': {'count_vector': {'accuracy': 0.6628495926007487, 'recall': 0.6628495926007487, 'precision': 0.705685421810

In [72]:
model_list = []
feature_list = []
eval_rows = []

for  mod,feature_dict in model_dict.items():
  for feature,evaluation_dict in feature_dict.items():
    model_list.append(mod)
    feature_list.append(feature)
    eval_rows.append(evaluation_dict)

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
temp_eval_df = pd.DataFrame(eval_rows, columns=['accuracy','recall','precision','f1_score'])
temp_model_df['Model'] = model_list
temp_model_df['Feature'] = feature_list
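
The same flattening can be written in one pass. This is a sketch on a hypothetical nested dictionary with made-up scores and placeholder model names; it builds one record per (model, feature) pair and sorts the table by accuracy.

```python
import pandas as pd

# Hypothetical results with the same nesting as model_dict: model -> feature -> metrics.
toy_results = {
    "ModelA": {
        "count_vector": {"accuracy": 0.66, "f1_score": 0.65},
        "word_tfid": {"accuracy": 0.65, "f1_score": 0.63},
    },
    "ModelB": {
        "count_vector": {"accuracy": 0.62, "f1_score": 0.60},
    },
}

# One flat record per (model, feature) pair.
rows = [
    {"Model": model_name, "Feature": feature, **scores}
    for model_name, per_feature in toy_results.items()
    for feature, scores in per_feature.items()
]

toy_df = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(toy_df)
```

This only works when every entry has the model → feature → metrics nesting; an entry that stores its metrics directly under the model name would need to be wrapped in a feature key first.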

In [73]:
eval_df=pd.concat([temp_model_df,temp_eval_df],axis=1)
print(eval_df)

                 Model       Feature  accuracy    recall  precision  f1_score
0   LogisticRegression  count_vector  0.665052  0.665052   0.677301  0.654163
1   LogisticRegression     word_tfid  0.653821  0.653821   0.700271  0.630635
2   LogisticRegression    ngram_tfid  0.471482  0.471482   0.698646  0.379844
3   LogisticRegression     char_tfid  0.629597  0.629597   0.656482  0.603447
4   LogisticRegression        word2v  0.393746  0.393746   0.319688  0.240149
5   LogisticRegression         doc2v  0.636203  0.636203   0.629743  0.621049
6           NaiveBayes  count_vector  0.662850  0.662850   0.705685  0.634743
7           NaiveBayes     word_tfid  0.522132  0.522132   0.684921  0.436509
8           NaiveBayes    ngram_tfid  0.439991  0.439991   0.623122  0.324472
9           NaiveBayes     char_tfid  0.509139  0.509139   0.651471  0.422890
10                 SVM  count_vector  0.664391  0.664391   0.660724  0.657461
11                 SVM     word_tfid  0.668355  0.668355   0.673