# Model building using Word2Vec Vectorization

In this file we will try different machine learning models that will be trained on the text data vectorized with Word2Vec.

For the multi-label classification we need the **scikit-multilearn** library. There are different approaches to solve this problem.

1. Problem transformation - This method transforms the multi-label problem into one or more single-label problems. Three techniques do this:
    1. Binary relevance - This technique treats each label as a separate single-label (binary) classification problem.
    2. Classifier chains - Like binary relevance, this trains one classifier per label, but each classifier also receives the earlier labels in the chain as inputs, which preserves the relationships between the target labels.
    3. Label powerset - This converts the multi-label problem into a multi-class problem by assigning each unique combination of labels to its own class. This preserves the correlation between the labels, at the cost of one class per observed label combination.

2. Ensembles - Custom stacking of base multi-label classifiers. We can try different combinations of individual multi-label classifiers and stack them together to get results. 

3. Adaptation techniques - These are single-label classification techniques that are adapted to perform multi-label classification, such as the Multi-Label KNN classifier.

4. Neural Networks - Neural networks can also be used to solve this multi-label problem, including LSTM models, which are a modified version of the regular RNN model.

5. BERT - Bidirectional Encoder Representations from Transformers. This uses the transformer architecture to learn contextual representations of the text data. It often performs much better than the models mentioned above on text classification tasks.
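To make the label powerset idea from the list above concrete, here is a minimal sketch using only NumPy. The label matrix `Y` is made-up toy data, not the project's targets, and this is not how `skmultilearn` implements it internally; it only shows the transformation itself.

```python
import numpy as np

# Toy label matrix: 5 samples, 3 labels (hypothetical data).
Y = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
])

# Each distinct row (label combination) becomes one class id.
combos, class_ids = np.unique(Y, axis=0, return_inverse=True)
class_ids = class_ids.ravel()  # some NumPy versions return a 2-D inverse

# A multi-class model would be trained on `class_ids`.
# Its predicted class id is mapped back by looking it up in `combos`.
Y_restored = combos[class_ids]

print(class_ids)   # one class id per sample
print(combos)      # the label combination behind each class id
```

Only combinations that occur in the training data get a class, so a combination that never appears in training can never be predicted.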

For this multi-label classification problem, two metrics should be considered: **the accuracy and the Hamming loss**. With multi-label targets, the accuracy is the exact match ratio, so both metrics are reported for every model below.
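The difference between the two metrics is easiest to see on a small example. The sketch below computes both by hand with NumPy on made-up arrays (`y_true` and `y_pred` are hypothetical); for multi-label input, `accuracy_score` and `hamming_loss` from scikit-learn compute the same quantities.

```python
import numpy as np

# Hypothetical ground truth and predictions: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# Exact match ratio: fraction of rows where every label is correct.
exact_match = np.mean(np.all(y_true == y_pred, axis=1))

# Hamming loss: fraction of individual label slots that are wrong.
hamming = np.mean(y_true != y_pred)

print(exact_match, hamming)
```

Two of the four rows are fully correct, while only 2 of the 12 label slots are wrong. This is why the exact match ratio is the stricter metric and why a low Hamming loss can coexist with a noticeably lower accuracy.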


In [1]:
!pip install scikit-multilearn

Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0




## Import the libraries

In [1]:
import os
import re
import numpy as np
import pandas as pd
import scipy
import string

#nltk-preprocessing
import nltk
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
from nltk.stem.wordnet import WordNetLemmatizer

#plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#misc
import joblib
import warnings
warnings.filterwarnings("ignore")
from tqdm.notebook import tqdm
from itertools import combinations

#multi-processing
import multiprocessing
from multiprocessing import Pool,freeze_support
from multiprocessing import Process

#multi-label 
from skmultilearn.model_selection import iterative_train_test_split
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset

#metrics
from sklearn.metrics import hamming_loss
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_curve, auc,roc_auc_score

#modelling
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

#Tensor flow for MLP
import tensorflow as tf
from tensorflow.keras.layers import Dense,Input,Activation,Dropout,BatchNormalization
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.callbacks import ReduceLROnPlateau
from keras.callbacks import Callback

#model loading
from tensorflow.keras.models import load_model

## Loading the w2v encoded data

In [2]:
x = pd.read_csv('encoded_text_w2v.csv', index_col = 0)
x

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.051367,-0.060495,0.001149,-0.049499,0.029956,-0.145452,0.086651,-0.000229,0.064938,-0.536234,...,-0.015343,-0.158155,-0.082105,0.091479,0.108467,0.030422,-0.005188,-0.003624,0.038748,0.181494
1,0.077654,0.083294,0.137685,0.078101,0.008173,-0.049521,0.008864,-0.071333,0.035709,-0.788647,...,-0.028346,0.098004,-0.125618,0.067190,0.184657,0.127138,0.205400,0.043030,0.080688,-0.006079
2,-0.284304,0.001876,0.023268,0.028583,0.042766,0.063144,-0.146772,0.056002,0.172073,-1.142111,...,0.098520,-0.078259,-0.090264,-0.095416,-0.019609,-0.153735,-0.002546,0.070028,0.077496,0.125922
3,-0.221880,0.030425,0.071590,-0.156064,-0.031576,0.028836,-0.090924,0.163952,-0.048329,-1.188217,...,0.054491,-0.033830,-0.026168,0.061329,0.075770,0.200697,0.002465,-0.116672,0.059649,-0.067360
4,0.041364,-0.055539,0.264545,-0.351085,0.183440,0.052547,-0.114125,-0.145492,-0.153485,-0.621002,...,0.048379,0.120352,0.014540,-0.138365,-0.111523,-0.004556,0.024031,-0.004104,-0.278342,0.194327
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158738,0.024454,0.046005,-0.113831,-0.132215,0.008551,0.063240,-0.024901,0.105503,-0.006373,-1.530968,...,-0.067110,0.006946,0.118147,0.158259,0.110933,0.028245,-0.085713,-0.095958,0.007209,-0.056857
158739,0.334859,0.012415,0.153593,-0.124930,-0.073279,0.034587,-0.089429,-0.016419,0.249144,-0.831725,...,0.267108,-0.331805,-0.096106,0.000979,0.122380,-0.629693,0.056388,0.370078,-0.086796,0.377776
158740,0.132612,0.093696,0.039911,-0.208920,-0.156506,0.039680,0.092735,-0.097160,-0.355542,-0.799413,...,-0.159605,-0.004408,-0.093319,0.113729,0.163493,0.052583,0.125530,0.188881,-0.200602,-0.208021
158741,-0.226283,-0.112413,-0.169718,-0.211149,0.049545,0.164669,-0.129212,-0.046732,0.092141,-1.198321,...,0.068701,-0.116216,-0.040594,0.025550,-0.071042,-0.289346,0.053791,-0.165067,0.111406,0.173672


In [3]:
y = pd.read_csv('targets.csv', index_col = 0)
y

Unnamed: 0,Toxic,Severe Toxic,Obscene,Threat,Insult,Hatred
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
165999,0,0,0,0,0,0
166000,0,0,0,0,0,0
166001,0,0,0,0,0,0
166002,0,0,0,0,0,0


In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [5]:
x_train.shape

(126994, 300)

In [6]:
x_test.shape

(31749, 300)

## Model building

### 1. Problem transformation
### 1.1 Binary relevance

Here we prepare a list of base classifiers, wrap each one in `BinaryRelevance`, and train and evaluate all of them.
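Before running the library version, it may help to see what binary relevance does conceptually. The following is a minimal sketch on synthetic data (the arrays `X` and `Y` are made up for illustration), assuming one independent `LogisticRegression` per label; it is not the `skmultilearn` implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Three hypothetical labels, each driven by a different feature.
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0.5, X[:, 2] < 0]).astype(int)

# One independent binary classifier per label column.
models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# The per-label predictions are stacked back into a label matrix.
Y_pred = np.column_stack([m.predict(X) for m in models])

print(Y_pred.shape)
```

Because every label gets its own model, training cost grows linearly with the number of labels, and no model can use information about the other labels.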

In [10]:
clf_list=[GaussianNB(),LogisticRegression(),
          DecisionTreeClassifier(),RandomForestClassifier(),XGBClassifier()]

acc=[]
ham_loss=[]
logloss=[]
avg_auc=[]

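for base_clf in tqdm(clf_list):
    # Binary relevance: one independent copy of base_clf per label
    clf = BinaryRelevance(base_clf)
    clf.fit(x_train, y_train)
    pred = clf.predict(x_test)              # sparse 0/1 label matrix
    pred_proba = clf.predict_proba(x_test)  # sparse per-label probabilities
    acc.append(accuracy_score(y_test, pred))
    # AUC is computed per label from the probabilities, then averaged
    avg_auc.append(np.mean(roc_auc_score(y_test, pred_proba.toarray(), average=None)))
    ham_loss.append(hamming_loss(y_test, pred))
    # Note: log_loss receives the hard 0/1 predictions here, not probabilities
    logloss.append(log_loss(y_test, pred.toarray()))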

bin_rel_res=pd.DataFrame(columns=['Classifier','Exact Match Ratio (Accuracy)','Average AUC',
                                  'Hamming-Loss','Log-Loss'])
bin_rel_res['Classifier']=['Gaussian NB','Logistic Regression','Decision Tree','Random Forest','XGBoost']
bin_rel_res['Exact Match Ratio (Accuracy)']=acc
bin_rel_res['Hamming-Loss']=ham_loss
bin_rel_res['Log-Loss']=logloss
bin_rel_res['Average AUC']=avg_auc
bin_rel_res


Unnamed: 0,Classifier,Exact Match Ratio (Accuracy),Average AUC,Hamming-Loss,Log-Loss
0,Gaussian NB,0.764811,0.886629,0.11541,0.693718
1,Logistic Regression,0.901887,0.950726,0.025366,1.564307
2,Decision Tree,0.797883,0.648733,0.04829,2.403681
3,Random Forest,0.904406,0.897222,0.027245,1.503654
4,XGBoost,0.905698,0.953565,0.02345,1.594186


----------

From the above results we can see that Logistic Regression, Random Forest, and XGBoost clearly outperform Gaussian NB and Decision Tree on both the exact match ratio and the Hamming loss. Because training these wrappers is computationally expensive, only these 3 models are used for the remaining problem transformation techniques.

### 1.2 Classifier chains
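
A classifier chain differs from binary relevance in that each classifier also sees the labels that come earlier in the chain. The sketch below shows the idea on synthetic data with two made-up labels, where the second label depends on the first. It is an illustration under those assumptions, not the `skmultilearn` implementation, and `chain_predict` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y0 = (X[:, 0] > 0).astype(int)
# The second label depends on the first (hypothetical correlation).
y1 = ((X[:, 1] > 0) & (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

chain, X_aug = [], X
for j in range(Y.shape[1]):
    chain.append(LogisticRegression().fit(X_aug, Y[:, j]))
    # Training appends the true earlier labels as extra features.
    X_aug = np.column_stack([X_aug, Y[:, j]])

def chain_predict(X_new):
    X_cur, cols = X_new, []
    for model in chain:
        col = model.predict(X_cur)
        cols.append(col)
        # Prediction feeds each link's output to the next link.
        X_cur = np.column_stack([X_cur, col])
    return np.column_stack(cols)

Y_pred = chain_predict(X)
print(Y_pred.shape, chain[1].coef_.shape)
```

The second model has one more input feature than the first, which is where the label dependency enters. The order of the labels in the chain is a design choice and can affect the results.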

In [11]:
clf_list=[LogisticRegression(),RandomForestClassifier(),XGBClassifier()]

acc=[]
ham_loss=[]
logloss=[]
avg_auc=[]

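for base_clf in tqdm(clf_list):
    # Classifier chain: each label's model also sees the earlier labels
    clf = ClassifierChain(base_clf)
    clf.fit(x_train, y_train)
    pred = clf.predict(x_test)              # sparse 0/1 label matrix
    pred_proba = clf.predict_proba(x_test)  # sparse per-label probabilities
    acc.append(accuracy_score(y_test, pred))
    # AUC is computed per label from the probabilities, then averaged
    avg_auc.append(np.mean(roc_auc_score(y_test, pred_proba.toarray(), average=None)))
    ham_loss.append(hamming_loss(y_test, pred))
    # Note: log_loss receives the hard 0/1 predictions here, not probabilities
    logloss.append(log_loss(y_test, pred.toarray()))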

clf_chain_res=pd.DataFrame(columns=['Classifier','Exact Match Ratio (Accuracy)',
                                  'Average AUC','Hamming-Loss','Log-Loss'])
clf_chain_res['Classifier']=['Logistic Regression','Random Forest','XGBoost']
clf_chain_res['Exact Match Ratio (Accuracy)']=acc
clf_chain_res['Hamming-Loss']=ham_loss
clf_chain_res['Log-Loss']=logloss
clf_chain_res['Average AUC']=avg_auc
clf_chain_res


Unnamed: 0,Classifier,Exact Match Ratio (Accuracy),Average AUC,Hamming-Loss,Log-Loss
0,Logistic Regression,0.90491,0.93256,0.025334,1.169596
1,Random Forest,0.90721,0.880164,0.026227,1.196832
2,XGBoost,0.90869,0.942104,0.023486,1.349585


---------

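Compared with binary relevance, the exact match ratio improves only marginally (for XGBoost, from 0.9057 to 0.9087) and the Hamming loss is essentially unchanged. Among the 3 models, XGBoost gives the best results.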

### 1.3 Label Powerset

In [12]:
clf_list=[LogisticRegression(),RandomForestClassifier(),XGBClassifier()]

acc=[]
ham_loss=[]
logloss=[]
avg_auc=[]

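for base_clf in tqdm(clf_list):
    # Label powerset: every unique label combination becomes one class
    clf = LabelPowerset(base_clf)
    clf.fit(x_train, y_train)
    pred = clf.predict(x_test)              # sparse 0/1 label matrix
    pred_proba = clf.predict_proba(x_test)  # sparse per-label probabilities
    acc.append(accuracy_score(y_test, pred))
    # AUC is computed per label from the probabilities, then averaged
    avg_auc.append(np.mean(roc_auc_score(y_test, pred_proba.toarray(), average=None)))
    ham_loss.append(hamming_loss(y_test, pred))
    # Note: log_loss receives the hard 0/1 predictions here, not probabilities
    logloss.append(log_loss(y_test, pred.toarray()))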

lbl_pwr_set_res=pd.DataFrame(columns=['Classifier','Exact Match Ratio (Accuracy)',
                                  'Average AUC','Hamming-Loss','Log-Loss'])
lbl_pwr_set_res['Classifier']=['Logistic Regression','Random Forest','XGBoost']
lbl_pwr_set_res['Exact Match Ratio (Accuracy)']=acc
lbl_pwr_set_res['Hamming-Loss']=ham_loss
lbl_pwr_set_res['Log-Loss']=logloss
lbl_pwr_set_res['Average AUC']=avg_auc
lbl_pwr_set_res


Unnamed: 0,Classifier,Exact Match Ratio (Accuracy),Average AUC,Hamming-Loss,Log-Loss
0,Logistic Regression,0.905855,0.952273,0.025838,1.123778
1,Random Forest,0.902895,0.906882,0.03019,0.741122
2,XGBoost,0.907367,0.953201,0.024961,1.33662


### 2. Ensemble methods
These methods aggregate individual models. Either the same model is grouped together multiple times, or different models are grouped together. Grouping different models together is called stacking.

The main disadvantage of these ensemble methods is that an improvement in performance is not guaranteed, while the training time increases considerably. It is therefore better to avoid them unless they are really needed. This project does not use them, because better models such as LSTM and BERT are available.
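
Although stacking is not used in this project, a small sketch shows what it would look like and why it is costly. The example below is an assumption-laden illustration on synthetic data: it stacks a logistic regression and a shallow decision tree with scikit-learn's `StackingClassifier` and applies it per label through `MultiOutputClassifier`. The data, the choice of base models, and `cv=3` are all made up for the example.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Two hypothetical labels.
Y = np.column_stack([X[:, 0] + X[:, 1] > 0, X[:, 2] > 0]).astype(int)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    # The final estimator learns how to combine the base predictions.
    final_estimator=LogisticRegression(),
    cv=3,
)

# MultiOutputClassifier fits one stacked model per label column.
multi_label_stack = MultiOutputClassifier(stack).fit(X, Y)
Y_pred = multi_label_stack.predict(X)
print(Y_pred.shape)
```

Every base model is refitted for each cross-validation fold and once more on the full data, and this is repeated for every label, which is where the extra training time comes from.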

### 3. Adapted algorithms

### 3.1 MLKNN
MLkNN is the regular KNN model adapted to work for multi-label classification.

In [7]:
from skmultilearn.adapt import MLkNN
mlknn = MLkNN(k=3)
mlknn.fit(np.array(x_train), np.array(y_train))

In [9]:
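def evaluate_score(Y_test, predict):
    """Print the Hamming loss, the exact-match accuracy (in %) and the log loss."""
    loss = hamming_loss(Y_test, predict)
    print("Hamming_loss : {}".format(loss))
    accuracy = accuracy_score(Y_test, predict)
    print("Accuracy : {}".format(accuracy*100))
    # skmultilearn returns sparse predictions; convert them to a dense array
    # for log_loss instead of relying on a bare try/except
    if hasattr(predict, "toarray"):
        predict = predict.toarray()
    loss = log_loss(Y_test, predict)
    print("Log_loss : {}".format(loss))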

In [9]:
knn_pred = mlknn.predict(x_test)

In [19]:
evaluate_score(y_test, knn_pred)

Hamming_loss : 0.028300103940281585
Accuracy : 89.34139657941984
Log_loss : 1.3628663857268795


In [12]:
knn_acc = {}
knn_hamm = {}
for i in range(3, 10):
    knn = MLkNN(k=i)
    knn.fit(np.array(x_train), np.array(y_train))
    pred = knn.predict(x_test)
    knn_acc[i] = accuracy_score(y_test, pred)
    knn_hamm[i] = hamming_loss(y_test, pred)

In [13]:
knn_acc

{3: 0.8934139657941983,
 4: 0.8816340672147154,
 5: 0.9000283473495229,
 6: 0.8946423509401871,
 7: 0.9017606853759174,
 8: 0.8987054710384579,
 9: 0.896059718416328}

In [14]:
knn_hamm

{3: 0.028300103940281585,
 4: 0.029670225833884532,
 5: 0.02622654781777904,
 6: 0.027155710941027013,
 7: 0.025811836593278528,
 8: 0.026058563524310477,
 9: 0.02597982088674709}

-------------

From the above results we can see that k = 7 gives both the highest accuracy and the lowest Hamming loss, so 7 is the best choice for the k value.
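
The same choice can be read off programmatically. The sketch below copies the `knn_acc` and `knn_hamm` values printed above into two dictionaries and picks the best k for each metric; the variable names `best_k_by_acc` and `best_k_by_hamm` are introduced here only for illustration.

```python
# Scores copied from the knn_acc and knn_hamm outputs above.
knn_acc = {3: 0.8934139657941983, 4: 0.8816340672147154,
           5: 0.9000283473495229, 6: 0.8946423509401871,
           7: 0.9017606853759174, 8: 0.8987054710384579,
           9: 0.896059718416328}
knn_hamm = {3: 0.028300103940281585, 4: 0.029670225833884532,
            5: 0.02622654781777904, 6: 0.027155710941027013,
            7: 0.025811836593278528, 8: 0.026058563524310477,
            9: 0.02597982088674709}

# Higher accuracy is better, lower Hamming loss is better.
best_k_by_acc = max(knn_acc, key=knn_acc.get)
best_k_by_hamm = min(knn_hamm, key=knn_hamm.get)

print(best_k_by_acc, best_k_by_hamm)
```

Note that k is selected on the test set here. Selecting it on a separate validation split would give a less optimistic estimate of the final score.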

In [15]:
mlknn_7 = MLkNN(k=7)
mlknn_7.fit(np.array(x_train), np.array(y_train))

In [16]:
knn_7_pred = mlknn_7.predict(x_test)

In [20]:
evaluate_score(y_test, knn_7_pred)

Hamming_loss : 0.025811836593278528
Accuracy : 90.17606853759173
Log_loss : 1.327427909080764


### 3.2 BRKNN
BRkNN combines the regular KNN model with the Binary Relevance technique described above.
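
As a rough illustration of the idea, the sketch below implements the "a" variant by hand with NumPy: a label is assigned when at least half of the k nearest neighbours carry it. The function `brknn_a_predict` and the toy arrays are hypothetical and only meant to show the decision rule; the library class used below is the real implementation.

```python
import numpy as np

def brknn_a_predict(X_train, Y_train, X_new, k=3):
    """Assign each label carried by at least half of the k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training rows for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Per-label share of neighbours that carry the label.
    confidence = Y_train[nearest].mean(axis=1)
    return (confidence >= 0.5).astype(int)

# Two well-separated toy clusters with two labels (made-up data).
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Y_train = np.array([[1, 0], [1, 0], [1, 1],
                    [0, 1], [0, 1], [0, 1]])
X_new = np.array([[0.05, 0.05], [5.05, 5.05]])

pred = brknn_a_predict(X_train, Y_train, X_new, k=3)
print(pred)
```

The neighbours are found once per sample and then reused for every label, so the cost is close to that of a single KNN search.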

In [7]:
from skmultilearn.adapt import BRkNNaClassifier
brknn = BRkNNaClassifier(k=3)
brknn.fit(np.array(x_train), np.array(y_train))

In [8]:
brknn_pred = brknn.predict(x_test)

In [10]:
evaluate_score(y_test, brknn_pred)

Hamming_loss : 0.027827648114901255
Accuracy : 89.47998362153139
Log_loss : 1.3210079633443539


In [12]:
brknn_acc = {}
brknn_hamm = {}
for i in range(3, 10):
    brknn = BRkNNaClassifier(k=i)
    brknn.fit(np.array(x_train), np.array(y_train))
    pred = brknn.predict(x_test)
    brknn_acc[i] = accuracy_score(y_test, pred)
    brknn_hamm[i] = hamming_loss(y_test, pred)

In [13]:
brknn_acc

{3: 0.8947998362153139,
 4: 0.9068632082900249,
 5: 0.9023906264764244,
 6: 0.9067057230148982,
 7: 0.9048158997133768,
 8: 0.9081230904910391,
 9: 0.9067372200699234}

In [14]:
brknn_hamm

{3: 0.027827648114901255,
 4: 0.02554411162556301,
 5: 0.02587483070332924,
 6: 0.025176645983600532,
 7: 0.02529213518536017,
 8: 0.024819679359979842,
 9: 0.02487742396085966}

------

From the above results we can see that k = 8 gives both the highest accuracy and the lowest Hamming loss, so 8 is the best choice for the k value.

In [15]:
brknn_8 = BRkNNaClassifier(k=8)
brknn_8.fit(np.array(x_train), np.array(y_train))

In [16]:
brknn_8_pred = brknn_8.predict(x_test)
evaluate_score(y_test, brknn_8_pred)

Hamming_loss : 0.024819679359979842
Accuracy : 90.81230904910392
Log_loss : 1.3319683407697038


In [7]:
x_train.shape

(126994, 300)

### 4. Neural Networks

In [90]:
from sklearn.model_selection import GridSearchCV, KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
# KerasClassifier is only needed for the commented-out wrapper further below;
# keras.wrappers.scikit_learn is not available in recent Keras releases.
# from keras.wrappers.scikit_learn import KerasClassifier

# Defining the model

def create_model():
    # MLP on the 300-dimensional Word2Vec vectors: 300 -> 150 -> 75 -> 6
    model = Sequential()
    model.add(Dense(300, input_dim=300, kernel_initializer='he_uniform', activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(150, kernel_initializer='he_uniform', activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(75, kernel_initializer='he_uniform', activation='relu'))
    model.add(Dropout(0.1))
    # Sigmoid output: one independent probability per label
    model.add(Dense(6, activation='sigmoid'))

    # `learning_rate` replaces the deprecated `lr` argument
    adam = Adam(learning_rate=0.01)
    # Binary cross-entropy treats each of the 6 labels as its own binary task
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['binary_accuracy'])
    return model

In [42]:
x_train.shape[1]/4

75.0

In [43]:
#ann = KerasClassifier(build_fn = create_model,verbose = 0,batch_size = 2048,epochs = 20)

In [91]:
ann = create_model()

In [92]:
ann.fit(x_train, y_train, batch_size=64, epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1dd9c80b400>

In [93]:
y_pred_train = ann.predict(x_train)
y_pred_test = ann.predict(x_test)



In [94]:
ann_train_acc = accuracy_score(y_train, y_pred_train.round())
ann_test_acc = accuracy_score(y_test, y_pred_test.round())
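
The `.round()` call above turns the sigmoid probabilities into 0/1 labels. As a small aside, the sketch below compares it with an explicit 0.5 threshold on made-up probabilities. The two agree except when a probability is exactly 0.5, because NumPy rounds halves to the nearest even number; the `threshold` value is an assumption that could also be tuned per label.

```python
import numpy as np

# Hypothetical sigmoid outputs for 2 samples and 3 labels.
probs = np.array([[0.91, 0.50, 0.07],
                  [0.49, 0.51, 0.50]])

# What `.round()` does: 0.5 is rounded down to 0 (round half to even).
rounded = probs.round().astype(int)

# Explicit threshold: makes the cut-off visible and adjustable.
threshold = 0.5
thresholded = (probs >= threshold).astype(int)

print(rounded)
print(thresholded)
```

With real network outputs an exact 0.5 is rare, so the scores reported here are unlikely to change, but the explicit form makes it easy to try other cut-offs for rare labels.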

In [95]:
ann_test_acc

0.9075246464455573

In [96]:
ann_train_acc

0.9197442398853489

In [97]:
ann_train_loss = hamming_loss(y_train, y_pred_train.round())
ann_test_loss = hamming_loss(y_test, y_pred_test.round())

In [98]:
print(ann_train_loss, ann_test_loss)

0.018223958087258716 0.0224626497422491
